Decoder Layer
The data flows through a single Decoder Layer in the following order:
Input (x)
↓
LayerNorm 1
↓
Masked Self-Attention
↓
x = x + MaskedSelfAttention(LayerNorm1(x))
↓
LayerNorm 2
↓
Cross-Attention (with enc_out)
↓
h = x + CrossAttention(LayerNorm2(x), enc_out)
↓
LayerNorm 3
↓
MLP (Feed-Forward Network)
↓
out = h + MLP(LayerNorm3(h))
This implementation still differs from the standard Transformer: in the code below, layer normalization is applied before each sub-layer and the residual connection is added afterwards (Pre-LayerNorm), whereas the original Transformer applies the normalization after the residual connection (Post-LayerNorm).
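To make the difference concrete, here is a minimal sketch (not from the tutorial) of the two orderings for one generic sub-layer, where sublayer stands for any attention or feed-forward module and norm for its LayerNorm:

def post_norm_block(x, sublayer, norm):
    # Original Transformer (Post-Norm): residual add first, then normalize
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # This implementation (Pre-Norm): normalize first, then residual add
    return x + sublayer(norm(x))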
Code

import torch.nn as nn

# LayerNorm, MultiHeadAttention and MLP are the modules defined in the
# earlier sections of the tutorial.
class DecoderLayer(nn.Module):
    '''Decoder layer: masked self-attention, cross-attention and feed-forward, each with Pre-Norm.'''
    def __init__(self, args):
        super().__init__()
        self.attention_norm_1 = LayerNorm(args.n_embd)
        self.mask_attention = MultiHeadAttention(args, is_causal=True)
        self.attention_norm_2 = LayerNorm(args.n_embd)
        self.attention = MultiHeadAttention(args, is_causal=False)
        self.ffn_norm = LayerNorm(args.n_embd)
        self.feed_forward = MLP(args)

    def forward(self, x, enc_out):
        # Layer Norm before the masked self-attention (Pre-Norm)
        norm_x = self.attention_norm_1(x)
        # Masked (causal) self-attention with residual connection
        x = x + self.mask_attention.forward(norm_x, norm_x, norm_x)
        # Layer Norm before the cross-attention over the encoder output
        norm_x = self.attention_norm_2(x)
        h = x + self.attention.forward(norm_x, enc_out, enc_out)
        # Feed-forward network with residual connection
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out
The layer initializes three sub-layers (masked multi-head self-attention, cross-attention, and the feed-forward network), each paired with its own Layer Norm. The three layer norms are functionally identical — all LayerNorm(args.n_embd) — and differ only in name. The forward logic is therefore the same pattern repeated three times: normalize, apply the sub-layer, add the residual, as sketched below.
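As a cross-check of the pattern and the tensor shapes, here is a small self-contained sketch that builds the same three-step layer out of PyTorch built-ins (nn.LayerNorm, nn.MultiheadAttention); it is not the tutorial's implementation, and the sizes (d_model=64, n_head=4) are arbitrary:

import torch
import torch.nn as nn

class TinyDecoderLayer(nn.Module):
    def __init__(self, d_model=64, n_head=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, enc_out):
        n = self.norm1(x)
        # boolean causal mask: True marks positions that may NOT be attended to
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        x = x + self.self_attn(n, n, n, attn_mask=causal)[0]   # masked self-attention
        n = self.norm2(x)
        x = x + self.cross_attn(n, enc_out, enc_out)[0]        # cross-attention
        return x + self.ffn(self.norm3(x))                     # feed-forward

x = torch.randn(2, 10, 64)        # (batch, tgt_len, d_model)
enc_out = torch.randn(2, 12, 64)  # (batch, src_len, d_model)
print(TinyDecoderLayer()(x, enc_out).shape)   # torch.Size([2, 10, 64])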
Building the Decoder
class Decoder(nn.Module):
    '''Decoder: a stack of N Decoder Layers followed by a final LayerNorm.'''
    def __init__(self, args):
        super(Decoder, self).__init__()
        # A Decoder is made up of N Decoder Layers
        self.layers = nn.ModuleList([DecoderLayer(args) for _ in range(args.n_layer)])
        self.norm = LayerNorm(args.n_embd)

    def forward(self, x, enc_out):
        "Pass the input (and the encoder output) through each layer in turn."
        for layer in self.layers:
            x = layer(x, enc_out)
        return self.norm(x)
[DecoderLayer(args) for _ in range(args.n_layer)] builds args.n_layer DecoderLayer instances with a list comprehension; wrapping that list in nn.ModuleList registers every layer as a sub-module of the Decoder, so their parameters are tracked, moved to the right device, and saved with the model. This is how the N stacked decoder layers of the Transformer architecture are created dynamically.
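The registration is what makes the stack trainable: modules kept in a plain Python list are invisible to parameters() and therefore to the optimizer. A small self-contained check (not from the tutorial):

import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]                 # not registered

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])  # registered

print(len(list(PlainList().parameters())))       # 0  -> the optimizer would see nothing
print(len(list(WithModuleList().parameters())))  # 6  -> weight + bias of each of the 3 layers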
The self.norm(x) at the end applies one final layer normalization to the output of the last layer; with Pre-Norm residual blocks, the stream leaving the stack has not been normalized yet, so a closing LayerNorm is applied before the output is used.
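For comparison, PyTorch's standard library exposes the same structure (N stacked decoder layers plus a final norm) directly; this is only a cross-check of the architecture, not the tutorial's code:

import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, norm_first=True, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6, norm=nn.LayerNorm(64))

tgt = torch.randn(2, 10, 64)       # decoder input  (batch, tgt_len, d_model)
memory = torch.randn(2, 12, 64)    # encoder output (batch, src_len, d_model)
print(decoder(tgt, memory).shape)  # torch.Size([2, 10, 64])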
Reference
Happy-LLM:从零开始的大语言模型原理与实践教程.pdf, p. 24