Vision Transformer

A Vision Transformer (ViT) is a transformer architecture designed for computer vision, used as an alternative to traditional convolutional neural networks (CNNs). Instead of splitting text into tokens, it splits the image into patches and flattens each patch into a vector.
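The patch-splitting step can be illustrated with a minimal NumPy sketch. The function name `patchify`, the 32x32 image, the patch size of 8, and the embedding width of 64 are all assumptions chosen for illustration, and the random `projection` matrix only stands in for weights that a real model would learn.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical 32x32 RGB image filled with dummy values.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

# 16 patches, each flattened to 8 * 8 * 3 = 192 values.
patches = patchify(image, patch_size=8)

# Random stand-in for the learned linear projection (192 -> 64).
rng = np.random.default_rng(0)
projection = rng.normal(size=(192, 64)).astype(np.float32)

# One 64-dimensional vector per patch: the image counterpart of text tokens.
tokens = patches @ projection
print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role that a token embedding plays in a text transformer. A full ViT would typically also add position information to these vectors before the attention layers, which this sketch leaves out.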

  • With its global self-attention mechanism, a ViT is particularly useful for capturing relationships between distant image regions and understanding broad context