Transformer Architecture Principles

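The Transformer is a deep learning architecture built on the self-attention mechanism, and it is the foundation of modern large language models.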

Overall Architecture

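The Transformer consists of two parts, an encoder and a decoder. The encoder's output feeds the encoder-decoder attention in every decoder layer: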

Input
  ↓
Embedding + Positional Encoding
  ↓
┌─────────────────────────────────────┐
│         Encoder (N layers)          │
│  ┌───────────────────────────────┐  │
│  │ Multi-Head Attention          │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │ Feed Forward Network          │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
  ↓
┌─────────────────────────────────────┐
│         Decoder (N layers)          │
│  ┌───────────────────────────────┐  │
│  │ Masked Multi-Head Attention   │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │ Encoder-Decoder Attention     │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │ Feed Forward Network          │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
  ↓
Linear + Softmax
  ↓
Output
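To see this data flow end to end, one option is PyTorch's built-in `torch.nn.Transformer`, which implements the encoder and decoder stacks shown in the diagram. It does not include the embedding, the positional encoding, or the final Linear + Softmax stage. The following is a minimal sketch with deliberately small, illustrative sizes; the random tensors stand in for inputs that have already been embedded.

```python
import torch

torch.manual_seed(0)

# Small illustrative sizes (assumptions); typical values are much larger.
model = torch.nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.1,
    batch_first=True,  # tensors are (batch, seq_len, d_model)
)
model.eval()  # disable dropout for a deterministic forward pass

src = torch.randn(2, 7, 32)  # stand-in for embedded source tokens
tgt = torch.randn(2, 5, 32)  # stand-in for embedded target tokens

# Causal mask for the decoder. In torch.nn.Transformer, True in a boolean
# mask means "this position may NOT be attended to".
tgt_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    out = model(src, tgt, tgt_mask=tgt_mask)

print(out.shape)  # one d_model-sized vector per target position
```

The output has one vector per target position; a Linear + Softmax head would map each of them to a distribution over the vocabulary.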

Encoder

Encoder Layer Structure

python

import torch

# MultiHeadAttention is assumed to be defined elsewhere and to be
# callable as (query, key, value, mask).
class EncoderLayer(torch.nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network: d_model -> d_ff -> d_model
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model)
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention, then residual connection and layer norm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward network, then residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x
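The encoder layer relies on a `MultiHeadAttention` module that is not defined in this text. The following is a minimal sketch of what such a module could look like, not the original author's implementation. It assumes batch-first inputs of shape (batch, seq_len, d_model) and a mask in which 0 or False marks a blocked position; both conventions are assumptions.

```python
import math
import torch


class MultiHeadAttention(torch.nn.Module):
    """Sketch of multi-head scaled dot-product attention (batch-first inputs assumed)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.w_q = torch.nn.Linear(d_model, d_model)
        self.w_k = torch.nn.Linear(d_model, d_model)
        self.w_v = torch.nn.Linear(d_model, d_model)
        self.w_o = torch.nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self._split_heads(self.w_q(query))
        k = self._split_heads(self.w_k(key))
        v = self._split_heads(self.w_v(value))

        # Scaled dot-product scores: (batch, heads, q_len, k_len)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            # Assumed convention: mask broadcasts to scores, 0/False = blocked
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)

        # Merge the heads back: (batch, q_len, d_model)
        out = (weights @ v).transpose(1, 2).contiguous()
        out = out.view(out.size(0), out.size(1), -1)
        return self.w_o(out)


torch.manual_seed(0)
attn = MultiHeadAttention(d_model=16, num_heads=4)
x = torch.randn(2, 5, 16)
y = attn(x, x, x)  # self-attention: query, key, and value are the same tensor
print(y.shape)
```

Passing the same tensor three times gives self-attention; passing the encoder output as key and value gives the encoder-decoder attention used in the decoder.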

Decoder

Decoder Layer Structure

python

import torch

# MultiHeadAttention is assumed to be defined elsewhere and to be
# callable as (query, key, value, mask).
class DecoderLayer(torch.nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network: d_model -> d_ff -> d_model
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model)
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.norm3 = torch.nn.LayerNorm(d_model)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention: tgt_mask hides future target positions
        attn_output = self.masked_self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Encoder-decoder attention: queries come from the decoder,
        # keys and values from the encoder output
        cross_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(cross_output))

        # Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x
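The `tgt_mask` argument is what makes the decoder's self-attention "masked": each target position may attend only to itself and to earlier positions. A sketch of how such a causal mask could be built is shown below. The helper name `causal_mask` and the convention that True means "may attend" are assumptions, since the text does not say which convention `MultiHeadAttention` expects.

```python
import torch


def causal_mask(size):
    """Lower-triangular boolean mask: True = may attend, False = blocked."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))


mask = causal_mask(4)
# Row i marks the positions that target position i may attend to: 0..i
print(mask.int())
```

A (size, size) mask like this broadcasts over the batch and head dimensions of the attention scores, so one mask serves the whole batch.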

Positional Encoding

Sinusoidal Positional Encoding

python

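import math

import torch


class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        # position: (max_len, 1); div_term: one frequency per sin/cos pair
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

        # Shape (max_len, 1, d_model) so the table broadcasts over the batch
        # dimension; this assumes an even d_model.
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd indices

        # A buffer is saved with the module but is not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is expected as (seq_len, batch, d_model)
        x = x + self.pe[:x.size(0)]
        return x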
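The class above fills even indices with sin(pos / 10000^(i / d_model)) and the following odd indices with the cosine of the same angle, where i is the even index. As a sanity check, the same values can be computed for a single position in plain Python. The function name `sinusoidal_encoding` is hypothetical and used only for illustration.

```python
import math


def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    row = []
    for i in range(0, d_model, 2):
        angle = pos / (10000.0 ** (i / d_model))
        row.append(math.sin(angle))  # even index i
        row.append(math.cos(angle))  # odd index i + 1
    return row[:d_model]


first = sinusoidal_encoding(0, 8)   # position 0: alternating 0.0 and 1.0
second = sinusoidal_encoding(1, 8)  # position 1: sin/cos at decreasing frequencies
print(first)
print(second)
```

Each sin/cos pair oscillates at a different frequency, so every position receives a distinct pattern across the d_model dimensions.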

Hyperparameter Configuration

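Parameter	Description	Typical values
d_model	Model (embedding) dimension	512, 1024
num_heads	Number of attention heads	8, 16
d_ff	Feed-forward network inner dimension	2048, 4096
num_layers	Number of layers	6, 12, 24
dropout	Dropout rate	0.1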
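To get a feel for how these values drive model size, the parameters of a single encoder layer can be counted directly. The sketch below assumes the layer layout used in this document (two LayerNorms, a two-layer feed-forward network) and assumes that the attention module uses four d_model × d_model projections with biases; that last point is an assumption, since the attention module is not defined here.

```python
def encoder_layer_params(d_model, d_ff):
    """Count trainable parameters in one encoder layer (weights plus biases)."""
    # Assumed: Q, K, V, and output projections, each d_model x d_model with a bias
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff and d_ff -> d_model
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


base = encoder_layer_params(d_model=512, d_ff=2048)
large = encoder_layer_params(d_model=1024, d_ff=4096)
print(base)   # 3152384
print(large)  # 12596224
```

Doubling both d_model and d_ff roughly quadruples the per-layer parameter count, because the dominant terms are products of two dimensions. Note that num_heads does not change the count under this layout.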

Architectural Advantages

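Parallelization: unlike an RNN, it processes all positions of the input sequence in parallel rather than step by step
Long-range dependencies: self-attention connects any two positions directly, so it can capture long-distance dependencies
Flexibility: multiple attention heads can learn different types of relationships
Scalability: the architecture scales readily to larger models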
