Transformer (II)--Paper Understanding: Transformer Architecture Explained

Links to this series:
(I)--Paper Translation: Attention Is All You Need (Chinese translation)
(II)--Paper Understanding: Transformer Architecture Explained
(III)--Paper Implementation: Code Implementation
Links to the BERT series:
BERT (I)--Paper Translation: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT (II)--Paper Understanding: BERT Model Architecture Explained
1. The Basic Structure of the Transformer
2. Module Details
2.1 Module 1: Positional Encoding (PE)
The main role of the PE module is to add positional information to the input vectors, so that the model knows the position of each token. The PE for a given position is fixed and does not vary with the input sentence, and each position's PE has size $1 \times n$ (where $n$ is the word embedding dimension). The paper uses sine and cosine waves to compute PE, as follows:
$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$
```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        # Register as a buffer: saved with the model but not a trainable parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the fixed encodings for the first x.size(1) positions.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
```
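As a quick sanity check, the table that this module precomputes can also be built straight from the closed-form formula. The following standalone sketch (names such as `pe_table` are ours, not from the original post) does exactly that with plain Python:

```python
import math

d_model, max_len = 8, 50

# Build the PE table straight from the paper's formula, as plain Python lists.
pe_table = [[0.0] * d_model for _ in range(max_len)]
for pos in range(max_len):
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe_table[pos][i] = math.sin(angle)      # even dims: sine
        pe_table[pos][i + 1] = math.cos(angle)  # odd dims: cosine

# Position 0 encodes to sin(0)=0 on even dims and cos(0)=1 on odd dims.
print(pe_table[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Note that the table depends only on `pos` and `i`, never on the input tokens, which is why the class above can precompute it once and register it as a buffer.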
As for why this form was chosen, the paper gives the following explanation:

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

Understanding:
From the angle-addition formulas
$$\sin(\alpha+\beta)=\sin\alpha\cos\beta+\sin\beta\cos\alpha \\ \cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\beta\sin\alpha$$
we obtain:
$$\begin{aligned} PE_{(pos+k,2i)} &= \sin((pos+k)/10000^{2i/d_{model}}) \\ &= \sin(pos/10000^{2i/d_{model}})\cos(k/10000^{2i/d_{model}}) \\ &\quad + \sin(k/10000^{2i/d_{model}})\cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
Substituting the definitions below into the expression above,
$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$
we derive:
$$PE_{(pos+k,2i)} = PE_{(pos,2i)}PE_{(k,2i+1)} + PE_{(k,2i)}PE_{(pos,2i+1)}$$
Similarly:
$$\begin{aligned} PE_{(pos+k,2i+1)} &= \cos((pos+k)/10000^{2i/d_{model}}) \\ &= \cos(pos/10000^{2i/d_{model}})\cos(k/10000^{2i/d_{model}}) \\ &\quad - \sin(pos/10000^{2i/d_{model}})\sin(k/10000^{2i/d_{model}}) \\ &= PE_{(pos,2i+1)}PE_{(k,2i+1)} - PE_{(pos,2i)}PE_{(k,2i)} \end{aligned}$$
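Both identities can be checked numerically. The sketch below (the helper names `pe_sin`/`pe_cos` are ours; the `pos`, `k`, `i` values are arbitrary examples) evaluates each side of the two equations directly:

```python
import math

d_model = 512

def pe_sin(pos, i):
    # PE(pos, 2i): the even dimension of the encoding
    return math.sin(pos / 10000 ** (2 * i / d_model))

def pe_cos(pos, i):
    # PE(pos, 2i+1): the odd dimension of the encoding
    return math.cos(pos / 10000 ** (2 * i / d_model))

pos, k, i = 7, 3, 5

# PE(pos+k, 2i) = PE(pos,2i)*PE(k,2i+1) + PE(k,2i)*PE(pos,2i+1)
lhs_even = pe_sin(pos + k, i)
rhs_even = pe_sin(pos, i) * pe_cos(k, i) + pe_sin(k, i) * pe_cos(pos, i)
print(abs(lhs_even - rhs_even) < 1e-12)  # True

# PE(pos+k, 2i+1) = PE(pos,2i+1)*PE(k,2i+1) - PE(pos,2i)*PE(k,2i)
lhs_odd = pe_cos(pos + k, i)
rhs_odd = pe_cos(pos, i) * pe_cos(k, i) - pe_sin(pos, i) * pe_sin(k, i)
print(abs(lhs_odd - rhs_odd) < 1e-12)  # True
```

In other words, $PE_{pos+k}$ is obtained from $PE_{pos}$ by a rotation whose coefficients depend only on $k$, which is exactly the "linear function of $PE_{pos}$" the paper refers to.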