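Transformer Architecture Principles
The Transformer is a deep learning architecture built on the self-attention mechanism, and it is the foundation of modern large language models.
Overall Architecture
The Transformer consists of two parts, an encoder and a decoder: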
Input
↓
Embedding + Positional Encoding
↓
┌─────────────────────────────────────┐
│         Encoder (N layers)          │
│  ┌───────────────────────────────┐  │
│  │     Multi-Head Attention      │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │     Feed Forward Network      │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│         Decoder (N layers)          │
│  ┌───────────────────────────────┐  │
│  │  Masked Multi-Head Attention  │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │   Encoder-Decoder Attention   │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │     Feed Forward Network      │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
↓
Linear + Softmax
↓
Output
Encoder
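The encoder and decoder layer classes in this document call `MultiHeadAttention(d_model, num_heads)` with `(query, key, value, mask)` arguments, but the class itself is not defined here. The following is a minimal sketch of what such a module might look like, not the author's implementation. It assumes batch-first tensors of shape `(batch, seq_len, d_model)` and a mask convention in which `0`/`False` marks positions that must not be attended to; the attribute names (`w_q`, `w_k`, `w_v`, `w_o`) are hypothetical.

```python
import math

import torch


class MultiHeadAttention(torch.nn.Module):
    """Hypothetical multi-head attention; batch-first tensors are assumed."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = torch.nn.Linear(d_model, d_model)
        self.w_k = torch.nn.Linear(d_model, d_model)
        self.w_v = torch.nn.Linear(d_model, d_model)
        self.w_o = torch.nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self._split_heads(self.w_q(query))
        k = self._split_heads(self.w_k(key))
        v = self._split_heads(self.w_v(value))
        # Scaled dot-product scores: (batch, num_heads, q_len, k_len)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            # Assumed convention: mask broadcasts to the shape of scores,
            # and 0/False marks positions that must not be attended to.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v  # (batch, num_heads, q_len, d_head)
        batch, _, q_len, _ = context.shape
        # Merge the heads back into a single d_model-sized vector per position
        context = context.transpose(1, 2).reshape(batch, q_len, -1)
        return self.w_o(context)
```

With this sketch, `MultiHeadAttention(16, 4)(x, x, x)` maps a `(2, 5, 16)` tensor to a tensor of the same shape. A query row whose keys are all masked would produce NaN after the softmax, so real implementations usually guard against that case.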
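Encoder Layer Structure
```python
import torch


class EncoderLayer(torch.nn.Module):
    # MultiHeadAttention is assumed to be defined elsewhere; it is called as
    # MultiHeadAttention(d_model, num_heads)(query, key, value, mask).
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network: d_model -> d_ff -> d_model
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention, followed by a residual connection and LayerNorm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward network, followed by a residual connection and LayerNorm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
```
Decoder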
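The decoder's masked self-attention takes a `tgt_mask` argument that hides future positions, so that each position can only attend to itself and earlier positions. The document does not show how that mask is built; the sketch below is one common way to do it. The helper names `causal_mask` and `padding_mask`, the padding id `0`, and the convention that `True` means "may attend" are all assumptions made for illustration.

```python
import torch


def causal_mask(size):
    """Lower-triangular boolean mask: position i may attend to positions <= i."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))


def padding_mask(token_ids, pad_id=0):
    """True where a token is real, False where it is padding.

    The result has shape (batch, 1, 1, seq_len) so that it broadcasts over
    attention heads and query positions. pad_id=0 is an assumption.
    """
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2)


# One target sequence of length 4 whose last token is padding
tgt = torch.tensor([[5, 7, 9, 0]])

# Combine both constraints: no looking ahead, and no attending to padding
tgt_mask = causal_mask(tgt.size(1)) & padding_mask(tgt)  # (1, 1, 4, 4)
print(tgt_mask[0, 0].int())
```

The combined mask has shape `(batch, 1, seq_len, seq_len)`, which broadcasts against attention scores of shape `(batch, num_heads, seq_len, seq_len)`. A source-side `src_mask` would typically use only the padding part.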
Decoder Layer Structure
```python
import torch


class DecoderLayer(torch.nn.Module):
    # MultiHeadAttention is assumed to be defined elsewhere; it is called as
    # MultiHeadAttention(d_model, num_heads)(query, key, value, mask).
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network: d_model -> d_ff -> d_model
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.norm3 = torch.nn.LayerNorm(d_model)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention: tgt_mask hides future target positions
        attn_output = self.masked_self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Encoder-decoder attention: queries come from the decoder,
        # keys and values come from the encoder output
        cross_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(cross_output))
        # Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x
```
Positional Encoding
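Self-attention by itself does not see token order, so position information is added to the embeddings. The sinusoidal scheme used in this section can be written as the pair of formulas below, where `pos` is the position in the sequence and `i` indexes pairs of embedding dimensions. Both symbols are notation chosen here for illustration.

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
```

In code, the factor $1 / 10000^{2i/d_{\text{model}}}$ is usually computed as $\exp\!\left(2i \cdot \left(-\ln 10000 / d_{\text{model}}\right)\right)$, which is the same value expressed through `exp` and `log`.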
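Sinusoidal Positional Encoding
```python
import math

import torch


class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Assumes d_model is even, so that sine and cosine columns pair up
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for each even dimension index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer is saved and moved with the module but is not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model), i.e. a sequence-first layout
        x = x + self.pe[:x.size(0)]
        return x
```
Hyperparameter Configuration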
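| Parameter | Description | Typical values |
|---|---|---|
| d_model | Model (embedding) dimension | 512, 1024 |
| num_heads | Number of attention heads | 8, 16 |
| d_ff | Feed-forward network hidden dimension | 2048, 4096 |
| num_layers | Number of layers | 6, 12, 24 |
| dropout | Dropout rate | 0.1 |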
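The hyperparameters in the table constrain one another: `d_model` has to be divisible by `num_heads`, and `d_model` together with `d_ff` largely determines the size of each layer. The sketch below groups them in a hypothetical `TransformerConfig` dataclass (not part of any library) and estimates the parameter count of one encoder layer. The estimate assumes four `d_model × d_model` attention projections with biases, a two-layer feed-forward network, and two LayerNorms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical container for the hyperparameters listed in the table."""

    d_model: int = 512
    num_heads: int = 8
    d_ff: int = 2048
    num_layers: int = 6
    dropout: float = 0.1

    def __post_init__(self):
        # Each head works on an equal slice of the model dimension
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def d_head(self):
        """Dimension handled by each attention head."""
        return self.d_model // self.num_heads

    def encoder_layer_params(self):
        """Weights plus biases of one encoder layer, under the stated assumptions."""
        # Q, K, V and output projections, each d_model x d_model with a bias
        attention = 4 * (self.d_model * self.d_model + self.d_model)
        # Two linear layers: d_model -> d_ff and d_ff -> d_model
        ffn = (self.d_model * self.d_ff + self.d_ff) + (self.d_ff * self.d_model + self.d_model)
        # Two LayerNorms, each with a scale and a shift vector
        layer_norms = 2 * 2 * self.d_model
        return attention + ffn + layer_norms


base = TransformerConfig()
print(base.d_head)                                    # 64
print(base.encoder_layer_params())                    # 3152384
print(base.encoder_layer_params() * base.num_layers)  # 18914304
```

For the 512 / 8 / 2048 configuration, roughly two thirds of each encoder layer's parameters sit in the feed-forward network, which is why `d_ff` matters so much for model size.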
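Architectural Advantages
- Parallelization: unlike an RNN, which steps through tokens one at a time, the Transformer can process all positions of the input sequence in parallel
- Long-range dependencies: self-attention relates any two positions directly, so it can capture long-distance dependencies
- Flexibility: multi-head attention can learn different types of relationships
- Scalability: the architecture is easy to scale up to larger models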
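The first two points can be illustrated with a toy comparison. This is a sketch only: it uses `torch.nn.RNNCell` for the recurrent side and plain dot-product attention without learned projections for the attention side, and the sizes are arbitrary.

```python
import math

import torch

torch.manual_seed(0)
seq_len, d_model = 6, 8
x = torch.randn(seq_len, d_model)  # one toy sequence, no batch dimension

# Recurrent processing: step t needs the hidden state from step t-1,
# so this loop cannot be parallelized across positions.
rnn_cell = torch.nn.RNNCell(d_model, d_model)
h = torch.zeros(1, d_model)
for t in range(seq_len):
    h = rnn_cell(x[t : t + 1], h)

# Self-attention (without learned projections, for brevity): a single
# matrix product scores every pair of positions at once.
scores = x @ x.T / math.sqrt(d_model)    # (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ x                     # (seq_len, d_model)

# The first position is connected to the last one directly, in one step.
print(weights.shape, weights[0, -1].item() > 0)
```

In the recurrent loop, information from the first token reaches the last step only after passing through every intermediate hidden state. In the attention version, the weight `weights[0, -1]` links the two positions directly.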