CLIP

Contrastive Language-Image Pre-training (CLIP) is a model trained with a contrastive objective on image–text pairs. It consists of two neural networks: a Transformer for text encoding and an image encoder (a ViT or a ResNet). Both encoders project their inputs into the same shared embedding space, so that matching images and texts end up close together.

The strength of the correlation between a text and an image is measured by the cosine similarity of their embeddings in this shared space.
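A minimal sketch of this scoring step, assuming the encoders have already produced embeddings (random arrays stand in for real encoder outputs here, and the batch size, dimension, and temperature value are illustrative, not CLIP's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs: a batch of 4 images and 4 captions, dim 8.
image_embeds = rng.normal(size=(4, 8))
text_embeds = rng.normal(size=(4, 8))

def l2_normalize(x):
    """Project vectors onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img = l2_normalize(image_embeds)
txt = l2_normalize(text_embeds)

# Pairwise cosine similarities; CLIP scales these by a learned temperature.
temperature = 0.07
logits = img @ txt.T / temperature

# Softmax over captions: for each image, a distribution over the texts,
# with the highest probability on the best-matching caption.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

During training, CLIP pushes the probability mass in each row and column of this matrix toward the true image–text pairing on the diagonal.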

Resources