Transformer

Transformers are [deep learning](deep learning.md) architectures used mainly for NLP, LLMs, and other [generative AI](generative AI.md) tasks (e.g. image and sound generation).

How does it work

A transformer relies on a self-attention mechanism, which lets it look at all positions of a sequence at once instead of processing the data sequentially. It is composed of two main parts:

  • Encoder: takes the input and generates a hidden vector representation
  • Decoder: uses the hidden representation to generate the output sequence

Each layer has:

  • Multi-head attention: learns to assign weights to tokens according to their relevance to one another
  • Feed-forward: applies the same fully connected neural network to each position in the sequence independently. Attention mixes information across positions (global context), and the feed-forward network then transforms each position's representation on its own (local processing).
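
The two sub-layers above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it shows a single attention head using scaled dot-product attention, it skips the learned query/key/value projections (the token vectors are used directly, which is an assumption made for brevity), and the names `softmax`, `self_attention` and `feed_forward` are hypothetical helpers, not part of any library.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query position.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every position to this query.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


def feed_forward(x, w1, b1, w2, b2):
    """Two-layer network with ReLU, applied to ONE position vector.

    w1 and w2 are lists of rows (one row of weights per output unit).
    """
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [
        sum(wi * hi for wi, hi in zip(row, hidden)) + b
        for row, b in zip(w2, b2)
    ]


# Toy input: 3 tokens, each a 2-dimensional vector (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = self_attention(tokens, tokens, tokens)

# Identity weights, purely for illustration; real layers learn these.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
layer_output = [feed_forward(vec, identity, zeros, identity, zeros) for vec in attended]
```

Each row of `weights` sums to 1 and says how much one token draws from every other token. `feed_forward` is then called once per position with the same parameters, which is what "independently" means in the list above. A multi-head layer would run several such attention computations in parallel on different learned projections and combine the results.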

Autoregression

An [autoregressive model](autoregressive model.md) generates its output one step at a time: each new prediction is a regression on (i.e. a function of) the input and the model's own previously generated outputs.
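
The loop below is a minimal sketch of that idea, assuming tokens are plain integers. `generate` and `toy_next_token` are hypothetical names: the toy function stands in for a trained model and simply computes the next token from everything produced so far.

```python
def generate(next_token, prompt, max_new_tokens, eos=None):
    """Autoregressive generation loop.

    next_token: function mapping the tokens so far to the next token.
    prompt: initial tokens supplied by the user.
    eos: optional end-of-sequence token that stops generation early.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # prediction depends on all previous tokens
        tokens.append(tok)        # the output is fed back as input
        if tok == eos:
            break
    return tokens


def toy_next_token(tokens):
    """Stand-in for a real model (assumption): a fixed arithmetic rule."""
    return (sum(tokens) + 1) % 7


sequence = generate(toy_next_token, [1, 2], max_new_tokens=3)
print(sequence)  # [1, 2, 4, 1, 2]
```

The key point is the feedback: every generated token is appended to `tokens` and becomes part of the input for the next step. A real language model would replace `toy_next_token` with a forward pass that returns a probability distribution over the vocabulary, from which the next token is picked or sampled.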

Models

GPT

[Generative pre-trained transformer](Generative pre-trained transformer.md) (GPT) is a family of decoder-only, autoregressive transformer models, and the dominant architecture choice for many LLMs.