CLIP

Contrastive Language-Image Pre-training (CLIP) is a model trained with a contrastive objective on image–text pairs. It consists of two neural networks: a Transformer for text encoding and an image encoder (a ViT or a ResNet). Both encoders project their inputs into the same shared embedding space, so that matching images and texts end up close together.

The strength of the correlation between a text and an image is measured by the cosine similarity of their embeddings in this shared space.
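A minimal sketch of this scoring step, assuming the encoders have already produced embeddings (random arrays stand in for real encoder outputs here, and the batch size, dimension, and temperature value are illustrative, not CLIP's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs: a batch of 4 images and 4 captions, dim 8.
image_embeds = rng.normal(size=(4, 8))
text_embeds = rng.normal(size=(4, 8))

def l2_normalize(x):
    """Project vectors onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img = l2_normalize(image_embeds)
txt = l2_normalize(text_embeds)

# Pairwise cosine similarities; CLIP scales these by a learned temperature.
temperature = 0.07
logits = img @ txt.T / temperature

# Softmax over captions: for each image, a distribution over the texts,
# with the highest probability on the best-matching caption.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

During training, CLIP pushes the probability mass in each row and column of this matrix toward the true image–text pairing on the diagonal.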

Resources