
Part 2: Assembling the Full Architecture - From Attention to a Working Model

May 17, 2024


Alright, let's cut to the chase. In Part 1, we dissected Multi-Head Attention – the true game-changer, the mechanism that finally allowed us to look at sequences without the crippling limitations of RNNs. But attention, by itself, is just a sophisticated lookup. To build anything truly intelligent, to even begin mimicking the complex information processing of a biological brain (my ultimate goal with MoE architectures), we need to weave these sophisticated lookup mechanisms into a larger, coherent system. That system, for sequence modeling, is the Transformer.

I'm not here to just skim the surface or paste some LangChain abstraction that hides the actual computation. We're going to build this thing component by component, focusing on the raw mechanics, the data flow, and why each piece is absolutely critical. Performance, precision, and understanding the tensor gymnastics are paramount.

Why We Need to Assemble a Beast

My fascination with AI isn't just about building cool tech; it's about understanding and ultimately replicating intelligence. The human brain, with its vast network of specialized regions, its ability to parallel process, and its incredible capacity for context, is the ultimate inspiration. Recurrent Neural Networks (RNNs) tried to sequentialize this, forcing information through a bottleneck, struggling with long-term dependencies – a fundamental flaw if you're trying to model true cognition.

The Transformer, particularly its attention mechanism, offered a paradigm shift: parallel processing of context. It's a foundational step towards building the kind of modular, expert-driven (MoE) architectures I envision for something truly brain-like. We're not just predicting the next word; we're creating a robust, scalable system for parallel information synthesis. And to get there, we need to understand every single bolt and wire.

The Encoder Block: Processing Input Sequences

The Encoder's job is to take an input sequence (e.g., a sentence, a series of observations) and transform it into a rich, contextual representation. Each Encoder block is a stack of several key components.

1. Input Embeddings & Positional Encodings

First, our raw input tokens are mapped to dense vectors: Input Tokens (batch_size, seq_len) -> Input Embeddings (batch_size, seq_len, d_model). This is standard practice.

The crucial addition is Positional Encoding (PE). Since attention processes all tokens in parallel, it loses the inherent order of the sequence. PE injects this positional information. We use fixed sinusoidal functions of the token position, which let the model attend by relative offsets and generalize to sequences longer than those seen in training.

// Assuming 'embeddings' is your tensor of shape (batch_size, seq_len, d_model)
// Let's use pseudo-TypeScript for clarity, implying raw tensor operations.
 
class PositionalEncoding {
    private positionTable: Float32Array; // Precomputed (max_seq_len * d_model) sin/cos table
    private max_seq_len: number;
 
    constructor(d_model: number, max_seq_len: number = 2048) {
        this.max_seq_len = max_seq_len;
        this.positionTable = new Float32Array(max_seq_len * d_model);
 
        // PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        // PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        for (let pos = 0; pos < max_seq_len; pos++) {
            for (let i = 0; i < d_model; i += 2) {
                const angle = pos / Math.pow(10000, i / d_model);
                this.positionTable[pos * d_model + i] = Math.sin(angle);
                if (i + 1 < d_model) {
                    this.positionTable[pos * d_model + i + 1] = Math.cos(angle);
                }
            }
        }
    }
 
    apply(embeddings: Tensor): Tensor { // Tensor: (batch_size, seq_len, d_model)
        const [batchSize, seqLen, d_model] = embeddings.shape;
        if (seqLen > this.max_seq_len) {
            console.warn("Sequence length exceeds max_seq_len for PE. Dynamic resizing or error handling needed.");
        }
        
        // Slice the first seq_len rows of the precomputed PE table: (seq_len, d_model)
        const peSlice = new Tensor(
            this.positionTable.slice(0, seqLen * d_model),
            [seqLen, d_model]
        );
 
        // Add PE to embeddings. Broadcasting handles batch_size.
        return embeddings.add(peSlice); 
    }
}

Tensor Shape Check:

  • Input Embeddings: (batch_size, seq_len, d_model)
  • Positional Encoding slice: (seq_len, d_model) (broadcasts across batch_size)
  • Output: (batch_size, seq_len, d_model)

2. Multi-Head Attention (MHA)

This is the core. As discussed in Part 1, MHA allows the model to attend to different parts of the input sequence, capturing various aspects of context simultaneously.

// Assuming a MultiHeadAttention class from Part 1.
// Inputs: query, key, value - all of shape (batch_size, seq_len, d_model)
// (In the Encoder, Q, K, V all come from the same source)
// Outputs: (batch_size, seq_len, d_model)
 
class MultiHeadAttention {
    // ... constructor and selfAttention method from Part 1 ...
    // selfAttention(q: Tensor, k: Tensor, v: Tensor, mask: Tensor | null = null): Tensor { ... }
}

Tensor Shape Check:

  • Input (Q, K, V): (batch_size, seq_len, d_model)
  • Output: (batch_size, seq_len, d_model)
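
If you don't have Part 1 handy, here's a minimal sketch of the scaled dot-product core that every head computes: softmax(Q·Kᵀ / √d_k)·V. The per-head split/merge and the learned Q/K/V/output projections live in the full MultiHeadAttention class from Part 1; matmul, transpose, div, and softmax here are assumed Tensor operations.

// Scaled dot-product attention for one (or all) heads.
// q, k, v: (batch_size, n_heads, seq_len, d_k); mask (optional): additive, broadcastable to the score shape.
function scaledDotProductAttention(q: Tensor, k: Tensor, v: Tensor, mask: Tensor | null = null): Tensor {
    const d_k = q.shape[q.shape.length - 1];
 
    // Scores: (batch_size, n_heads, seq_len_q, seq_len_k)
    let scores = q.matmul(k.transpose(-2, -1)).div(Math.sqrt(d_k));
 
    // Mask out disallowed positions (padding or future tokens) with -inf before the softmax
    if (mask !== null) {
        scores = scores.add(mask);
    }
 
    const weights = scores.softmax(-1);  // attention weights over the keys
    return weights.matmul(v);            // (batch_size, n_heads, seq_len_q, d_k)
}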

3. Feed-Forward Network (FFN)

After attention, each position in the sequence is independently processed by a simple two-layer fully connected network. This introduces non-linearity and allows the model to process the "attended" information.

class FeedForward {
    private layer1: LinearLayer; // d_model -> d_ff (usually 4 * d_model)
    private layer2: LinearLayer; // d_ff -> d_model
    private activation: (x: Tensor) => Tensor;
 
    constructor(d_model: number, d_ff: number, activation: (x: Tensor) => Tensor = relu) {
        this.layer1 = new LinearLayer(d_model, d_ff);
        this.layer2 = new LinearLayer(d_ff, d_model);
        this.activation = activation;
    }
 
    forward(input: Tensor): Tensor { // input: (batch_size, seq_len, d_model)
        // layer1: (batch_size, seq_len, d_model) -> (batch_size, seq_len, d_ff)
        let output = this.activation(this.layer1.forward(input)); 
        // layer2: (batch_size, seq_len, d_ff) -> (batch_size, seq_len, d_model)
        output = this.layer2.forward(output);
        return output;
    }
}

Tensor Shape Check:

  • Input: (batch_size, seq_len, d_model)
  • Layer 1 Output (pre-activation): (batch_size, seq_len, d_ff)
  • Layer 2 Output: (batch_size, seq_len, d_model)

4. Residual Connections and Layer Normalization

These are the unsung heroes. Without them, training deep networks like the Transformer would be a nightmare of vanishing/exploding gradients.

  • Residual Connections (the "Add" in Add&Norm): They give gradients a direct path through the network, mitigating the vanishing gradient problem and enabling much deeper models. The output of each sub-layer (MHA, FFN) is added to its input: sublayer_output + input.
  • Layer Normalization: Normalizes activations across the feature dimension (d_model) for each example independently. This stabilizes training, especially with residual connections, preventing activations from becoming too large or too small. Crucially, LayerNorm works per example, per time step, unlike BatchNorm, which normalizes across the batch. This is vital for variable sequence lengths and batching.
// Let's assume a Tensor library with `add` and `mean`, `variance`, `div`, `mul`, `sub` operations.
// LayerNorm applies across the last dimension (d_model).
 
class LayerNormalization {
    private gamma: Tensor; // trainable scale parameter (d_model)
    private beta: Tensor;  // trainable shift parameter (d_model)
    private epsilon: number = 1e-6; // small value for numerical stability
 
    constructor(d_model: number) {
        this.gamma = Tensor.ones([d_model]);   // Initialize with ones
        this.beta = Tensor.zeros([d_model]); // Initialize with zeros
    }
 
    forward(input: Tensor): Tensor { // input: (batch_size, seq_len, d_model)
        const [batchSize, seqLen, d_model] = input.shape;
 
        // Calculate mean and variance across the last dimension (d_model)
        // Keep dimensions for broadcasting
        const mean = input.mean(2, true); // (batch_size, seq_len, 1)
        const variance = input.variance(2, true); // (batch_size, seq_len, 1)
 
        // Normalize: (input - mean) / sqrt(variance + epsilon)
        const normalized = input.sub(mean).div(variance.add(this.epsilon).sqrt());
 
        // Apply learnable gamma and beta: gamma * normalized + beta
        // gamma and beta are (d_model), will broadcast to (batch_size, seq_len, d_model)
        return normalized.mul(this.gamma).add(this.beta);
    }
}
 
// How Add&Norm looks in practice (pre-norm variant: LayerNorm is applied to each
// sub-layer's input, then the residual add; the original paper normalizes after the add):
// x = input;
// attention_output = this.attention.forward(this.norm1.forward(x));
// x = x.add(attention_output); // Residual connection
// ffn_output = this.ffn.forward(this.norm2.forward(x));
// x = x.add(ffn_output); // Residual connection

Tensor Shape Check:

  • Input: (batch_size, seq_len, d_model)
  • Mean/Variance: (batch_size, seq_len, 1) (after keep_dims=true)
  • Normalized Output: (batch_size, seq_len, d_model)

Assembling the Encoder Block

A single Encoder block encapsulates these components:

class EncoderBlock {
    private selfAttention: MultiHeadAttention;
    private feedForward: FeedForward;
    private norm1: LayerNormalization;
    private norm2: LayerNormalization;
 
    constructor(d_model: number, n_heads: number, d_ff: number) {
        this.selfAttention = new MultiHeadAttention(d_model, n_heads);
        this.feedForward = new FeedForward(d_model, d_ff);
        this.norm1 = new LayerNormalization(d_model);
        this.norm2 = new LayerNormalization(d_model);
    }
 
    forward(input: Tensor, mask: Tensor | null = null): Tensor { // input: (batch_size, seq_len, d_model)
        // Sub-layer 1: Multi-Head Self-Attention + Add&Norm
        let attentionInput = this.norm1.forward(input); // (batch_size, seq_len, d_model)
        let attentionOutput = this.selfAttention.selfAttention(
            attentionInput, attentionInput, attentionInput, mask
        ); // (batch_size, seq_len, d_model)
        let output = input.add(attentionOutput); // Residual: (batch_size, seq_len, d_model)
 
        // Sub-layer 2: Feed-Forward + Add&Norm
        let ffnInput = this.norm2.forward(output); // (batch_size, seq_len, d_model)
        let ffnOutput = this.feedForward.forward(ffnInput); // (batch_size, seq_len, d_model)
        output = output.add(ffnOutput); // Residual: (batch_size, seq_len, d_model)
 
        return output;
    }
}

Data Flow through Encoder Block: Input (batch_size, seq_len, d_model) -> LayerNorm -> Multi-Head Self-Attention -> Add (Residual) -> LayerNorm -> Feed-Forward Network -> Add (Residual) -> Output (batch_size, seq_len, d_model)
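
As a quick sanity check, a single block can be exercised on a dummy batch; the shapes in and out must match. (Tensor.random is an assumed helper here; 512/8/2048 are just the usual "base" hyperparameters.)

// Hypothetical usage sketch: run one Encoder block and verify the shape is preserved.
const block = new EncoderBlock(512, 8, 2048);           // d_model, n_heads, d_ff
const x = Tensor.random([32, 128, 512]);                // (batch_size=32, seq_len=128, d_model=512)
const out = block.forward(x, null);                     // no padding mask for this toy batch
console.assert(out.shape.join() === "32,128,512");      // (batch_size, seq_len, d_model), unchanged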

The Decoder Block: Generating Output Sequences

The Decoder's role is to generate an output sequence, one token at a time, based on the encoded input and the previously generated tokens. It has two attention layers.

1. Masked Multi-Head Self-Attention

This is identical to the Encoder's self-attention, but with one critical difference: masking. To prevent the decoder from "cheating" by attending to future tokens during training, we apply a causal (look-ahead) mask. This ensures that the prediction for position i only depends on positions 0 to i-1.

// When calling selfAttention:
// this.selfAttention.selfAttention(attentionInput, attentionInput, attentionInput, causalMask);
// causalMask is an additive mask: 0 on and below the diagonal, -inf above it (the future positions).
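
Building that mask is straightforward. A minimal sketch, assuming the same Tensor constructor used above (the buildCausalMask name is just for illustration):

// Build a (seq_len, seq_len) additive causal mask: 0 where attention is allowed, -inf where it is not.
// Added to the attention scores before softmax, the -inf entries zero out future positions.
function buildCausalMask(seqLen: number): Tensor {
    const data = new Float32Array(seqLen * seqLen);
    for (let i = 0; i < seqLen; i++) {
        for (let j = 0; j < seqLen; j++) {
            data[i * seqLen + j] = j > i ? -Infinity : 0; // j > i is a future position
        }
    }
    // Broadcasts over (batch_size, n_heads, seq_len, seq_len) when added to the score tensor.
    return new Tensor(data, [seqLen, seqLen]);
}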

Tensor Shape Check:

  • Input (Q, K, V): (batch_size, target_seq_len, d_model)
  • Output: (batch_size, target_seq_len, d_model)

2. Encoder-Decoder Multi-Head Attention (Cross-Attention)

This is where the Decoder interacts with the Encoder's output.

  • Query (Q) comes from the Decoder's masked self-attention output.
  • Key (K) and Value (V) come from the Encoder's output.

This allows the Decoder to focus on relevant parts of the input sequence when generating each token of the output sequence.

// Assuming `encoderOutput` is from the final Encoder layer.
// And `decoderSelfAttentionOutput` is from the masked MHA layer in the decoder.
// this.crossAttention.selfAttention(
//     decoderSelfAttentionOutput, // Query
//     encoderOutput,              // Key
//     encoderOutput               // Value
// );

Tensor Shape Check:

  • Q: (batch_size, target_seq_len, d_model)
  • K, V: (batch_size, source_seq_len, d_model)
  • Output: (batch_size, target_seq_len, d_model)
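
The only internal shape that changes relative to self-attention is the score matrix, which is now rectangular:

// Inside cross-attention, each head computes Q·Kᵀ of shape
//   (batch_size, n_heads, target_seq_len, source_seq_len)
// i.e. one row of attention weights over the source sequence for every target position,
// which is then multiplied with V to give (batch_size, n_heads, target_seq_len, d_k).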

Assembling the Decoder Block

class DecoderBlock {
    private maskedSelfAttention: MultiHeadAttention;
    private crossAttention: MultiHeadAttention;
    private feedForward: FeedForward;
    private norm1: LayerNormalization;
    private norm2: LayerNormalization;
    private norm3: LayerNormalization;
 
    constructor(d_model: number, n_heads: number, d_ff: number) {
        this.maskedSelfAttention = new MultiHeadAttention(d_model, n_heads);
        this.crossAttention = new MultiHeadAttention(d_model, n_heads);
        this.feedForward = new FeedForward(d_model, d_ff);
        this.norm1 = new LayerNormalization(d_model);
        this.norm2 = new LayerNormalization(d_model);
        this.norm3 = new LayerNormalization(d_model);
    }
 
    forward(
        targetInput: Tensor, // (batch_size, target_seq_len, d_model)
        encoderOutput: Tensor, // (batch_size, source_seq_len, d_model)
        targetMask: Tensor | null = null,
        sourceMask: Tensor | null = null // For padding in encoder output
    ): Tensor {
        // Sub-layer 1: Masked Multi-Head Self-Attention + Add&Norm
        let selfAttentionInput = this.norm1.forward(targetInput);
        let selfAttentionOutput = this.maskedSelfAttention.selfAttention(
            selfAttentionInput, selfAttentionInput, selfAttentionInput, targetMask
        );
        let output = targetInput.add(selfAttentionOutput);
 
        // Sub-layer 2: Multi-Head Cross-Attention (Encoder-Decoder Attention) + Add&Norm
        let crossAttentionInput = this.norm2.forward(output);
        let crossAttentionOutput = this.crossAttention.selfAttention(
            crossAttentionInput, // Query from decoder
            encoderOutput,       // Key from encoder
            encoderOutput,       // Value from encoder
            sourceMask           // Mask for encoder padding
        );
        output = output.add(crossAttentionOutput);
 
        // Sub-layer 3: Feed-Forward + Add&Norm
        let ffnInput = this.norm3.forward(output);
        let ffnOutput = this.feedForward.forward(ffnInput);
        output = output.add(ffnOutput);
 
        return output;
    }
}

Data Flow through Decoder Block: Target Input (batch_size, target_seq_len, d_model) -> LayerNorm -> Masked Multi-Head Self-Attention -> Add (Residual) -> LayerNorm -> Multi-Head Cross-Attention (Q=decoder, K=V=encoder) -> Add (Residual) -> LayerNorm -> Feed-Forward Network -> Add (Residual) -> Output (batch_size, target_seq_len, d_model)

The Full Transformer: Encoder-Decoder Model

Finally, we stack N Encoder blocks and N Decoder blocks (the constructor below lets the two counts differ).

class Transformer {
    private encoder: Encoder; // Stack of EncoderBlocks
    private decoder: Decoder; // Stack of DecoderBlocks
    private targetOutputLayer: LinearLayer; // To project to vocabulary size
 
    constructor(
        vocab_size: number,
        d_model: number,
        n_heads: number,
        d_ff: number,
        n_encoder_layers: number,
        n_decoder_layers: number,
        max_seq_len: number
    ) {
        this.encoder = new Encoder(d_model, n_heads, d_ff, n_encoder_layers, max_seq_len);
        this.decoder = new Decoder(d_model, n_heads, d_ff, n_decoder_layers, max_seq_len);
        this.targetOutputLayer = new LinearLayer(d_model, vocab_size);
    }
 
    forward(
        sourceInput: Tensor, // (batch_size, source_seq_len) - raw tokens
        targetInput: Tensor, // (batch_size, target_seq_len) - raw tokens
        sourceMask: Tensor | null = null, // for padding in source
        targetMask: Tensor | null = null  // causal mask for target, and padding
    ): Tensor {
        // Embeddings and Positional Encodings are handled within Encoder/Decoder for brevity
        // (Typically, a shared embedding layer is used, with separate PEs)
 
        // Encoder pass
        const encoderOutput = this.encoder.forward(sourceInput, sourceMask); // (batch_size, source_seq_len, d_model)
 
        // Decoder pass
        let decoderOutput = this.decoder.forward(
            targetInput,
            encoderOutput,
            targetMask,
            sourceMask
        ); // (batch_size, target_seq_len, d_model)
 
        // Project decoder output to vocabulary size for token prediction
        // (batch_size, target_seq_len, d_model) -> (batch_size, target_seq_len, vocab_size)
        const finalLogits = this.targetOutputLayer.forward(decoderOutput);
 
        return finalLogits; // Raw logits before softmax
    }
}

Final Output Shape: (batch_size, target_seq_len, vocab_size)
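
The Encoder and Decoder wrappers referenced above are simply stacks of the blocks we've built, plus the embedding and positional-encoding step. A minimal sketch of the Encoder side, with the caveats that EmbeddingLayer is an assumed helper, the vocabulary size would need to be threaded through from the Transformer constructor, and the final LayerNorm is the usual companion to pre-norm blocks (the Decoder mirrors this with DecoderBlocks and the extra encoderOutput argument):

class Encoder {
    private embedding: EmbeddingLayer;      // token ids -> d_model vectors (assumed helper)
    private positionalEncoding: PositionalEncoding;
    private blocks: EncoderBlock[];
    private finalNorm: LayerNormalization;  // one last norm after the pre-norm block stack
 
    constructor(vocab_size: number, d_model: number, n_heads: number, d_ff: number,
                n_layers: number, max_seq_len: number) {
        this.embedding = new EmbeddingLayer(vocab_size, d_model);
        this.positionalEncoding = new PositionalEncoding(d_model, max_seq_len);
        this.blocks = Array.from({ length: n_layers }, () => new EncoderBlock(d_model, n_heads, d_ff));
        this.finalNorm = new LayerNormalization(d_model);
    }
 
    forward(sourceTokens: Tensor, sourceMask: Tensor | null = null): Tensor {
        // (batch_size, source_seq_len) -> (batch_size, source_seq_len, d_model)
        let x = this.positionalEncoding.apply(this.embedding.forward(sourceTokens));
        for (const block of this.blocks) {
            x = block.forward(x, sourceMask); // shape is preserved through every block
        }
        return this.finalNorm.forward(x);
    }
}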

What I Learned (And What I'm Chasing)

  1. The Power of Modularity: The Transformer is a testament to how simple, powerful components (Attention, FFN) can be stacked and connected with robust mechanisms (Residuals, LayerNorm) to create highly complex and effective models. This modularity is directly relevant to my pursuit of MoE architectures; if we can build an "expert" with a Transformer block, we can then orchestrate many such experts.
  2. Tensor Shape Discipline is Everything: Seriously. One mismatch and you're debugging for hours. Tracing (batch_size, seq_len, d_model) through every operation is not just good practice, it's essential for understanding the data flow and how information is transformed.
  3. The Unsung Heroes: Residual connections and Layer Normalization aren't glamorous, but they are absolutely non-negotiable for stable training of deep networks. They're the robust scaffolding that allows the heavy lifting of attention to actually work. This is a critical lesson for any complex system design: the 'glue' can be as important as the 'components'.
  4. Performance Mindset: This raw API approach, avoiding framework overhead, means we have direct control over tensor operations. In a real-world scenario, this is where you'd reach for highly optimized CUDA kernels or WebGPU compute shaders, ensuring your matrix multiplications and element-wise ops are blazing fast. No room for unnecessary abstractions.
  5. Towards the Brain: This architecture, with its parallel processing and ability to integrate information from multiple contexts, feels like a significant step beyond simple sequential models. It's a foundational understanding. My next challenge, and what truly excites me, is how to take these robust building blocks and assemble them into a system that allocates computational resources dynamically, mimicking how a brain might activate specific regions (experts) based on the task at hand. That's where MoE truly shines, and the Transformer, understood from the ground up, is the perfect starting point for building those 'experts.' The journey to replicate even a fraction of human intelligence is long, but every line of this "raw" code gets us closer.