Transformer Architecture

Tags: Computer Science, Machine Learning, Tech
Published: December 17, 2024
Author: Junkai Ji

Introduction

The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need", is one of the most popular models in machine learning, especially for tasks involving natural language processing (NLP) like translation, summarization, and text generation. It’s powerful because of its ability to handle long-range dependencies and parallelize computations. Let’s break it down simply.
 

Key Concepts

  1. Self-Attention: Helps the model focus on relevant parts of the input sequence.
  2. Multi-Head Attention: Allows the model to focus on different parts of the input simultaneously.
  3. Positional Encoding: Since transformers don't inherently understand the order of words, positional encoding is added to give each word a sense of position in the sequence.
  4. Feedforward Neural Networks: After attention, the model applies a simple neural network to process the data.
  5. Layer Normalization: A technique to improve training stability and convergence.

Transformer Components

A transformer model consists of two main parts:
  • Encoder: Reads the input sequence and generates a context representation.
  • Decoder: Uses the context representation from the encoder to produce the output sequence.

1. Self-Attention Mechanism

The self-attention mechanism helps the model decide which parts of the input are important for a given word. In simple terms, imagine you are reading a sentence, and you want to understand the meaning of a word based on the other words around it.

How Self-Attention Works

For each word in the input sequence, self-attention calculates three vectors:
  • Query (Q)
  • Key (K)
  • Value (V)
The basic idea is to calculate the similarity between each word’s query vector and all the key vectors to decide how much attention should be given to each word.

Formula

The attention score between a query Q and key K is calculated using:

$$\text{score}(Q, K) = \frac{QK^T}{\sqrt{d_k}}$$

Where:
  • $Q$ is the query vector (learned representation of the current word).
  • $K$ is the key vector (learned representation of all words).
  • $d_k$ is the dimension of the key vectors (used to normalize the values).
The scores are passed through a softmax and used to compute a weighted sum of the values $V$, which produces the output for that word:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Simple Example

Consider a simple sentence: "I love ice cream."
Let’s say we are focusing on the word "love". The query, key, and value vectors for each word are computed. The word "love" will have a high attention score with words like "I" and "ice" because they provide useful context. As a result, the model will give more weight to these words when representing the word "love".
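To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The token vectors are random stand-ins for illustration; in a trained transformer, Q, K, and V come from learned projections of the word embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: weight each value by query-key similarity."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V, weights                                # weighted sum of the values

# Toy self-attention over 4 tokens ("I", "love", "ice", "cream") with 3-d embeddings.
np.random.seed(0)
X = np.random.randn(4, 3)
output, attn = scaled_dot_product_attention(X, X, X)           # self-attention: Q = K = V = X
print(attn.round(2))   # each row sums to 1: how much that token attends to every token
```

Each row of the printed matrix is one word's attention distribution over the whole sentence, which is exactly the "weighting" described above.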

2. Multi-Head Attention

Instead of having just one attention mechanism, the transformer uses multiple attention heads. Each head learns different attention patterns, allowing the model to focus on various aspects of the sequence simultaneously.

Formula for Multi-Head Attention

If we have $h$ attention heads, the output of each head is computed independently, and the results are concatenated together:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

Where:
  • $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ for each attention head.
  • $W^O$ is a learned weight matrix that projects the concatenated output.
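A rough NumPy sketch of the same idea, using assumed toy sizes (4 tokens, d_model = 8, 2 heads) and random matrices standing in for the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """Project into per-head subspaces, attend in each head, concatenate, project back."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for i in range(num_heads):
        # Slice out this head's share of the projection weights (illustrative layout).
        Q = X @ W_q[:, i * d_head:(i + 1) * d_head]
        K = X @ W_k[:, i * d_head:(i + 1) * d_head]
        V = X @ W_v[:, i * d_head:(i + 1) * d_head]
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1, ..., head_h) W^O

# Toy setup: 4 tokens, d_model = 8, 2 heads; all weights are random stand-ins.
np.random.seed(0)
X = np.random.randn(4, 8)
W_q, W_k, W_v, W_o = (np.random.randn(8, 8) * 0.1 for _ in range(4))
print(multi_head_attention(X, num_heads=2, W_q=W_q, W_k=W_k, W_v=W_v, W_o=W_o).shape)  # (4, 8)
```

Because each head attends in its own lower-dimensional subspace, different heads are free to pick up different relationships between the same words.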

3. Positional Encoding

Since transformers do not process sequences in order like RNNs (Recurrent Neural Networks), they need positional encoding to understand the order of words in a sentence.
The positional encoding vector for each word is added to its word embedding to give the model a sense of position.

Formula

The positional encoding for a position $pos$ and dimension $i$ is given by:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Where:
  • $pos$ is the position of the word in the sequence (e.g., 1st, 2nd, etc.).
  • $i$ is the dimension index of the positional encoding.
  • $d_{\text{model}}$ is the total dimension of the embedding.
This formula produces sinusoids of different frequencies, so each position in the sequence gets a distinct pattern.
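A small NumPy sketch of the sinusoidal encoding, with assumed toy sizes (4 positions, embedding dimension 8):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

# Toy example: a 4-word sentence with embedding size 8 (assumed values).
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)   # (4, 8) -- added element-wise to the word embeddings
```

The resulting matrix is simply added to the word embeddings, so two identical words at different positions end up with different input vectors.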

4. Feedforward Neural Networks

After the attention layers, the transformer uses simple feedforward neural networks to further process the data.

Formula

The output from the attention layer goes through a feedforward neural network with two layers:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Where:
  • $x$ is the input from the attention layer.
  • $W_1$, $W_2$ are learned weight matrices.
  • $b_1$, $b_2$ are bias terms.
  • $\max(0, \cdot)$ is the ReLU activation function.
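A minimal NumPy sketch of this position-wise feedforward network; the dimensions (d_model = 8, inner size 32) and random weights are illustrative placeholders rather than trained parameters:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

# Toy dimensions (assumed): 4 tokens with d_model = 8, inner layer of size 32.
np.random.seed(0)
x = np.random.randn(4, 8)                 # output of the attention layer
W1, b1 = np.random.randn(8, 32) * 0.1, np.zeros(32)
W2, b2 = np.random.randn(32, 8) * 0.1, np.zeros(8)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (4, 8): same shape in, same shape out
```

The same two-layer network is applied to every token separately, which is why it is often called "position-wise".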

5. Layer Normalization

Layer normalization is used to stabilize and accelerate training. It normalizes the input to each layer, so the model learns more efficiently.

Formula

The output of a layer after normalization is:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$$

Where:
  • $x$ is the input to the layer.
  • $\mu$ is the mean of $x$.
  • $\sigma$ is the standard deviation of $x$.
  • $\gamma$ and $\beta$ are learnable parameters for scaling and shifting the output.
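A short NumPy sketch of layer normalization over each token's feature vector; the small epsilon added to the denominator is a common numerical-stability detail, not part of the formula above:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

# Toy check (dimensions assumed): 4 tokens with 8 features each.
np.random.seed(0)
x = np.random.randn(4, 8)
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1).round(4), out.std(axis=-1).round(4))  # ~0 mean, ~1 std per token
```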

Example of a Simple Sentence Transformation

Let’s walk through a very simple example using the sentence:
"The dog runs fast."
  1. Embedding: Each word is converted into a vector using a word embedding. For example:
      • "The" -> [0.1, 0.2, 0.3]
      • "dog" -> [0.4, 0.5, 0.6]
      • "runs" -> [0.7, 0.8, 0.9]
      • "fast" -> [1.0, 1.1, 1.2]
  2. Positional Encoding: Add positional encoding to the word embeddings, so each word has a sense of its position.
  3. Self-Attention: For each word, calculate attention scores based on its relation to other words. For example, when processing the word "dog", the attention mechanism might give high attention to "runs" because they are closely related.
  4. Multi-Head Attention: Multiple attention heads process different aspects of the relationships between words.
  5. Feedforward Neural Networks: After the attention mechanism, the model applies feedforward networks to refine the output.
  6. Final Output: The decoder generates the output sequence, which could be something like a translation or summary of the input. A minimal code sketch of the encoder-side steps follows this list.
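To tie the steps together, here is a minimal single-head, single-layer NumPy sketch of the encoder-side pass over this sentence. The projection and feedforward weights are random stand-ins, and the sketch also includes the residual connections around each sub-layer, which the walk-through above does not spell out but which are part of the standard Transformer block.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

# Step 1: toy word embeddings for "The dog runs fast." (values from the example above).
X = np.array([[0.1, 0.2, 0.3],   # "The"
              [0.4, 0.5, 0.6],   # "dog"
              [0.7, 0.8, 0.9],   # "runs"
              [1.0, 1.1, 1.2]])  # "fast"
d_model = X.shape[-1]

# Step 2: add sinusoidal positional encodings so each word carries its position.
pos = np.arange(X.shape[0])[:, None]
dims = np.arange(d_model)[None, :]
angles = pos / np.power(10000, (2 * (dims // 2)) / d_model)
X = X + np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

# Step 3: single-head self-attention (random matrices stand in for learned projections).
np.random.seed(0)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.5 for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
attn = softmax(Q @ K.T / np.sqrt(d_model))
attended = attn @ V

# Step 4: residual connection + layer normalization around the attention sub-layer.
h = layer_norm(X + attended)

# Step 5: position-wise feedforward network, again wrapped in residual + layer norm.
W1, b1 = np.random.randn(d_model, 8) * 0.5, np.zeros(8)
W2, b2 = np.random.randn(8, d_model) * 0.5, np.zeros(d_model)
out = layer_norm(h + (np.maximum(0, h @ W1 + b1) @ W2 + b2))

print(attn.round(2))   # how much each word attends to the others
print(out.shape)       # (4, 3): one refined vector per word, ready for the next layer
```

A real model stacks many such layers (and uses multiple heads per layer); a decoder would then attend to these encoder outputs to produce the translation or summary.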

Conclusion

The Transformer architecture has revolutionized the way we handle sequential data, such as text, because of its efficient attention mechanisms, parallelization, and ability to capture long-range dependencies. By processing all the tokens in a sequence in parallel rather than one at a time, transformers have made NLP tasks much faster and more accurate.