A Close Reading of the Paper "Attention Is All You Need" (3) - EW帮帮网

Original Text 6

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

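Translation

2 Background

The aim of cutting down sequential computation is also the core idea behind models such as the Extended Neural GPU, ByteNet, and ConvS2S. All of them use convolutional neural networks as their basic building block and can compute the hidden representations of all input and output positions in parallel. In these models, however, the number of operations needed to relate the signals at two arbitrary input or output positions grows with the distance between those positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it harder for the model to learn dependencies between distant positions. The Transformer brings this down to a constant number of operations, although averaging over attention-weighted positions lowers the effective resolution (an effect the authors offset with the Multi-Head Attention described in section 3.2).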

Analysis of Key Sentences

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.

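[Analysis]

This sentence consists of a main clause, a non-restrictive relative clause (all of which…), and a present participle phrase (computing hidden representations…). The core of the main clause is "The goal forms the foundation", where forms is the predicate, with the subject before it and the object after it. The prepositional phrase of reducing sequential computation is a postmodifier of the goal; the adverb also modifies the verb forms as an adverbial; the prepositional phrase of the Extended Neural GPU, ByteNet and ConvS2S is a postmodifier of foundation. After the comma, all of which use…as… is a non-restrictive relative clause in which which refers to the Extended Neural GPU, ByteNet and ConvS2S. The present participle phrase computing hidden representations… is an adverbial of attendant circumstance: its action happens together with that of use, the predicate verb of the relative clause.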

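[Reference translation]

The aim of cutting down sequential computation is also the core idea behind models such as the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks as their basic building block and can compute the hidden representations of all input and output positions in parallel.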

Technical Notes

1. Extended Neural GPU

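The Extended Neural GPU is an extension of the original Neural GPU; the paper cites it from Kaiser and Bengio, "Can Active Memory Replace Attention?" (2016). The Neural GPU itself is a deep learning architecture that combines convolutions with a GRU-style gating mechanism (a convolutional gated recurrent unit) and is designed for sequence data such as digits and text.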

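Neural GPU: proposed by the Google Brain team (2016) to address the limits of earlier neural networks on algorithmic tasks over long sequences, such as binary addition and multiplication.

It uses convolutional layers to capture local patterns (as a CNN does).
It adds a gating mechanism (GRU-style) to carry the hidden state forward, which improves the modeling of long-range dependencies.
It processes the input through iterated computation (like an RNN, but each step updates all positions in parallel).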
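
The three points above (convolution, gating, iterated updates) can be sketched as a single update step of a convolutional GRU. This is a minimal one-dimensional, single-channel illustration and not the Neural GPU implementation: the helper names `conv1d_same` and `cgru_step`, the parameter dictionary, and the toy kernel values are all assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_same(state, kernel, bias):
    """Width-preserving 1-D convolution with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(state)):
        acc = bias
        for j, w in enumerate(kernel):
            pos = i + j - half
            if 0 <= pos < len(state):  # positions outside the state count as zero
                acc += w * state[pos]
        out.append(acc)
    return out


def cgru_step(state, p):
    """One gated update: s' = u * s + (1 - u) * tanh(conv(r * s))."""
    u = [sigmoid(v) for v in conv1d_same(state, p["ku"], p["bu"])]  # update gate
    r = [sigmoid(v) for v in conv1d_same(state, p["kr"], p["br"])]  # reset gate
    gated = [ri * si for ri, si in zip(r, state)]
    cand = [math.tanh(v) for v in conv1d_same(gated, p["kc"], p["bc"])]
    return [ui * si + (1.0 - ui) * ci for ui, si, ci in zip(u, state, cand)]


# Hypothetical toy parameters; a trained model would learn these values.
params = {
    "ku": [0.5, 1.0, 0.5], "bu": 0.0,
    "kr": [0.5, 1.0, 0.5], "br": 0.0,
    "kc": [1.0, 0.0, -1.0], "bc": 0.0,
}

state = [0.0, 1.0, 0.0, 0.0]
for _ in range(len(state)):  # one step per input position
    state = cgru_step(state, params)
```

Every position is updated at once inside each step, which is where the parallelism comes from; only the loop over steps is sequential.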

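Directions in which a Neural GPU can be extended (general possibilities, not a list of what the cited Extended Neural GPU implements):

Deeper architectures: more layers or better residual connections to raise model capacity.
Stronger gating: for example, adding an attention mechanism (as in the Transformer).
Multi-task learning: joint training on several kinds of computation, such as arithmetic and logic operations.
Dynamic computation: adjusting the number of computation steps to the input length (comparable in spirit to a Neural Turing Machine).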

2. ByteNet

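ByteNet is a neural network architecture proposed by DeepMind, used mainly for sequence-to-sequence (Seq2Seq) tasks.

Core features of ByteNet

Fully convolutional architecture: most Seq2Seq models rely on recurrence (RNNs such as LSTM) or on self-attention (the Transformer), whereas ByteNet processes sequences with convolutional neural networks (CNNs).
Dilated convolutions: like WaveNet (DeepMind's speech generation model), ByteNet uses dilated convolutions to enlarge the receptive field and so capture long-range dependencies.
Dynamic unfolding: the decoder is stacked on top of the encoder's output and unfolds step by step until it emits an end-of-sequence symbol, so source and target sequences of different lengths can be handled without squeezing the source into a fixed-size vector.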
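
How dilation enlarges the receptive field can be shown with a small sketch. It assumes a causal (one-sided) convolution, a kernel of width 2, and dilations that double at each layer; the function names are made up for this illustration and do not come from ByteNet's code.

```python
def dilated_causal_conv(seq, kernel, dilation):
    """out[t] = sum_j kernel[j] * seq[t - j * dilation]; positions before the start count as zero."""
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            pos = t - j * dilation
            if pos >= 0:
                acc += w * seq[pos]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Feed a single impulse at position 0 through four layers.
signal = [1.0] + [0.0] * 15
dilations = [1, 2, 4, 8]
out = signal
for d in dilations:
    out = dilated_causal_conv(out, [1.0, 1.0], d)

# After four layers, all 16 outputs depend on the impulse at position 0.
reached = [t for t, v in enumerate(out) if v != 0.0]
```

Four layers are enough to cover 16 positions here, whereas four undilated layers with the same kernel would cover only 5. That gap is the reason the number of layers needed grows logarithmically rather than linearly with the distance.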

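Advantages of ByteNet

Feature	ByteNet	RNN (LSTM)	Transformer
Parallel computation (training)	✅ Yes	❌ No	✅ Yes
Long sequences	✅ Good (dilated convolutions, logarithmic path length)	❌ Weaker (sequential path, vanishing gradients)	✅ Excellent (self-attention, constant path length)
Decoding	Dynamic unfolding over the encoder output	Step-by-step recurrence	Step-by-step autoregressive decoding
Training speed	⚡ Fast (parallel CNN)	🐢 Slow (sequential dependency)	⚡ Fast (but memory-hungry)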

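ByteNet compared with the Transformer

ByteNet is CNN-based and runs in time linear in the sequence length, which suits long sequences.
The Transformer is based on self-attention and is better at modeling global dependencies, but its per-layer cost is quadratic in the sequence length (O(n²)).
Later developments: the Transformer became far more popular, while convolutional sequence models continued with work such as ConvS2S, which stacks ordinary (non-dilated) convolutions.

ByteNet is an efficient CNN-based sequence modeling architecture suited to long-sequence tasks, although it was later displaced by the Transformer. Its dilated convolutions and the dynamic unfolding of its decoder remain useful ideas in sequence modeling.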

3. ConvS2S

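ConvS2S (Convolutional Sequence to Sequence, also written ConvSeq2Seq) is a sequence-to-sequence (Seq2Seq) model based on convolutional neural networks (CNNs), proposed by Facebook AI Research (FAIR). Like ByteNet it replaces recurrence with convolutions, but it is a separate architecture rather than a revision of ByteNet.

Core features of ConvS2S

Purely convolutional architecture (no recurrence)
Stacked convolutions + skip connections
- Multiple stacked convolutional layers capture features at different granularities.
- Residual connections (as in ResNet) counter vanishing gradients.
Position embeddings: a CNN has no built-in notion of sequence order, so ConvS2S adds position embeddings to inject position information (the Transformer uses positional encodings for the same purpose).
Gated linear units and attention: each convolutional block uses a GLU (Gated Linear Unit) as its nonlinearity to control information flow, and each decoder layer has its own attention over the encoder output (multi-step attention).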
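
The gated linear unit named above is simple enough to write out. A minimal sketch, assuming the common formulation in which the input vector is split into two halves A and B and the output is A multiplied elementwise by sigmoid(B); the function name `glu` is chosen for this example.

```python
import math


def glu(values):
    """Split the input into halves A and B and return A * sigmoid(B), elementwise."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of inputs")
    half = len(values) // 2
    a, b = values[:half], values[half:]
    # 1 / (1 + exp(-g)) is the sigmoid gate, a value between 0 and 1
    return [x / (1.0 + math.exp(-g)) for x, g in zip(a, b)]


# A gate input of 0 lets exactly half of the signal through.
half_open = glu([2.0, -4.0, 0.0, 0.0])

# A strongly negative gate input nearly blocks the signal; a strongly positive one passes it.
mostly_closed = glu([2.0, -20.0])
mostly_open = glu([2.0, 20.0])
```

Because the gate is computed from the data, the layer can decide per position and per feature how much of the convolution output to pass upward.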

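ConvS2S compared with ByteNet and the Transformer

Model	Core mechanism	Per-layer cost in sequence length n	Main strength	Main weakness
ByteNet	Dilated convolutions (dilated CNN)	O(n)	Linear-time computation; path between positions grows only logarithmically	Complex to implement, little community support
ConvS2S	Stacked convolutions + GLU	O(n)	Fast to train; cost linear in sequence length	Path between distant positions grows linearly; weaker global dependencies than the Transformer
Transformer	Self-attention	O(n²)	Strong global modeling; constant path length	High memory use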

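Although ConvS2S has advantages in training speed and in per-layer cost on long sequences, the Transformer has come to dominate, for reasons that include:

Self-attention models global dependencies better, especially on semantically complex tasks such as translation and generation.
The Transformer scales better and gave rise to groundbreaking models such as BERT and GPT.
Hardware optimization is more mature: the Transformer's computation is dominated by large matrix multiplications, which GPUs and TPUs accelerate well.
Community and ecosystem support: the Transformer has more open-source implementations and pretrained models.

The Transformer's generality and scalability made it the mainstream choice, though convolutional models such as ConvS2S may still be useful where latency matters, for example in real-time translation.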

In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.

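[Analysis]

The sentence opens with the prepositional phrase In these models, an adverbial that sets the context. The core that follows is "the number grows". The prepositional phrase of operations is a postmodifier of the number. The past participle required, together with the infinitive phrase to relate signals from two arbitrary input or output positions that completes it, forms a postmodifier of operations. Within the infinitive phrase, from two arbitrary…positions is a postmodifier of signals. The prepositional phrase in the distance between positions is an adverbial modifying grows; it names the quantity with respect to which the number grows, and is not an adverbial of place. linearly for ConvS2S and logarithmically for ByteNet consists of two parallel adverbials joined by and; in each, the adverb states the manner of growth and the prepositional phrase for… states which model it applies to.

[Reference translation]

In these models, however, the number of operations needed to relate the signals at two arbitrary input or output positions grows with the distance between those positions: linearly for ConvS2S and logarithmically for ByteNet.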
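
The three growth rates in this sentence can be made concrete by counting how many layers a signal must pass through to get from one position to another. The sketch below is a simplified model under stated assumptions: information travels in one direction only, every convolution has kernel width k, and the dilation in the ByteNet-style stack multiplies by k at each layer. The function names are invented for this illustration.

```python
import math


def hops_stacked_conv(distance, kernel):
    """Layers of ordinary convolution needed to span `distance` positions."""
    # each layer moves information at most (kernel - 1) positions
    return math.ceil(distance / (kernel - 1))


def hops_dilated_conv(distance, kernel):
    """Layers needed when dilations are 1, k, k^2, ... (assumed schedule)."""
    # after L such layers the receptive field is k^L positions wide,
    # so the largest distance covered is k^L - 1
    layers = 0
    width = 1
    while width <= distance:
        width *= kernel
        layers += 1
    return layers


def hops_self_attention(distance):
    """Self-attention connects any two positions in a single layer."""
    return 1


# Rows of (distance, stacked conv, dilated conv, self-attention) for kernel width 3.
table = [
    (d, hops_stacked_conv(d, 3), hops_dilated_conv(d, 3), hops_self_attention(d))
    for d in (8, 100, 1000)
]
```

For a distance of 1000 and kernel width 3 this gives 500 layers for the stacked convolutions, 7 for the dilated ones, and 1 for self-attention, which is the linear, logarithmic, and constant behaviour the paper describes.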

This makes it more difficult to learn dependencies between distant positions.

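[Analysis]

The structure of this sentence is: subject (This) + predicate (makes) + object (it) + object complement (more difficult). The object it is a formal (dummy) object; the real object is the infinitive to learn dependencies between distant positions. Within the infinitive, the prepositional phrase between distant positions is a postmodifier of dependencies. more difficult is the comparative form of the adjective; it describes the object and serves as the object complement.

[Reference translation]

This makes it harder for the model to learn dependencies between distant positions.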

In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

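[Analysis]

The overall structure is: prepositional phrase + main clause + concessive adverbial + parenthetical/appositive. The opening prepositional phrase In the Transformer is an adverbial that sets the context. The main clause is this is reduced to a constant number of operations, where constant number means a fixed number that does not depend on the distance, and the prepositional phrase of operations is a postmodifier of constant number. albeit at the cost of reduced effective resolution is a concessive adverbial meaning "although at the price of lower effective resolution"; due to averaging attention-weighted positions is an adverbial of cause meaning "because attention-weighted positions are averaged". an effect… is a parenthetical that can also be read as an appositive explaining the preceding reduced effective resolution. The following we counteract with Multi-Head Attention is a relative clause with the relative pronoun that/which omitted; it modifies an effect. The omitted that/which refers to an effect and is at the same time the object of the verb counteract. as described in section 3.2 is a "conjunction (as) + past participle phrase (described in section 3.2)" structure, equivalent to the passive clause as is described in section 3.2, meaning "as section 3.2 describes".

[Reference translation]

The Transformer brings this down to a constant number of operations, although averaging over attention-weighted positions lowers the effective resolution (an effect the authors offset with the Multi-Head Attention described in section 3.2).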
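
The phrase "averaging attention-weighted positions" is easier to see in code: the output of attention for one query is a weighted average of the value vectors, so distinct positions get blended into one vector. The sketch below is an illustration under simplifying assumptions. In particular, the real Multi-Head Attention applies learned projection matrices to queries, keys, and values for every head, while this version just slices the feature dimension; the function names are chosen for this example.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query: a weighted average of the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def multi_head_attend(query, keys, values, num_heads):
    """Slice the features into heads, attend in each slice, concatenate the results.

    Simplification: learned per-head projections are replaced by plain slicing.
    """
    size = len(query) // num_heads
    out = []
    for h in range(num_heads):
        lo, hi = h * size, (h + 1) * size
        out += attend(query[lo:hi],
                      [k[lo:hi] for k in keys],
                      [v[lo:hi] for v in values])
    return out


# With identical keys the weights are uniform, so the output is the plain mean of the values.
blended = attend([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]])

# Two heads over a 4-dimensional input; each head computes its own weights.
two_heads = multi_head_attend(
    [1.0, 0.0, 0.0, 1.0],
    [[0.0] * 4, [0.0] * 4],
    [[2.0, 0.0, 4.0, 0.0], [4.0, 2.0, 0.0, 2.0]],
    num_heads=2,
)
```

A single head yields one set of weights and therefore one blend per query. With several heads, each head has its own weights and can focus on a different position, which is how the blurring is offset.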
