Transformer (II)--Paper Understanding: Transformer Architecture Explained

Links to this series:
(I)--Paper Translation: Attention Is All You Need (Chinese translation)
(II)--Paper Understanding: Transformer Architecture Explained
(III)--Paper Implementation: Code Implementation
Links to the BERT series:
BERT (I)--Paper Translation: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT (II)--Paper Understanding: BERT Model Architecture Explained
1. The Basic Structure of the Transformer
2. Module Details
2.1 Module 1: Positional Encoding (PE)
The main role of the PE module is to add positional information to the input vectors, so that the model knows the position of each token. The PE for a given position is fixed and does not vary with the input sentence, and each position's PE has size $1 \times n$ (where $n$ is the word embedding dimension). The paper uses sine and cosine waves to compute PE, as follows:
$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$
```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        # Register as a buffer: saved with the model but not a trainable parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the fixed encodings for the first x.size(1) positions.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
```
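As a quick sanity check, the table that this module precomputes can also be built straight from the closed-form formula. The following standalone sketch (names such as `pe_table` are ours, not from the original post) does exactly that with plain Python:

```python
import math

d_model, max_len = 8, 50

# Build the PE table straight from the paper's formula, as plain Python lists.
pe_table = [[0.0] * d_model for _ in range(max_len)]
for pos in range(max_len):
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe_table[pos][i] = math.sin(angle)      # even dims: sine
        pe_table[pos][i + 1] = math.cos(angle)  # odd dims: cosine

# Position 0 encodes to sin(0)=0 on even dims and cos(0)=1 on odd dims.
print(pe_table[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Note that the table depends only on `pos` and `i`, never on the input tokens, which is why the class above can precompute it once and register it as a buffer.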
As for why this form was chosen, the paper gives the following explanation:

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

Understanding:
From the angle-addition formulas
$$\sin(\alpha+\beta)=\sin\alpha\cos\beta+\sin\beta\cos\alpha \\ \cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\beta\sin\alpha$$
we obtain:
$$\begin{aligned} PE_{(pos+k,2i)} &= \sin((pos+k)/10000^{2i/d_{model}}) \\ &= \sin(pos/10000^{2i/d_{model}})\cos(k/10000^{2i/d_{model}}) \\ &\quad + \sin(k/10000^{2i/d_{model}})\cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
Substituting the definitions below into the expression above,
$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$
we derive:
$$PE_{(pos+k,2i)} = PE_{(pos,2i)}PE_{(k,2i+1)} + PE_{(k,2i)}PE_{(pos,2i+1)}$$
Similarly:
$$\begin{aligned} PE_{(pos+k,2i+1)} &= \cos((pos+k)/10000^{2i/d_{model}}) \\ &= \cos(pos/10000^{2i/d_{model}})\cos(k/10000^{2i/d_{model}}) \\ &\quad - \sin(pos/10000^{2i/d_{model}})\sin(k/10000^{2i/d_{model}}) \\ &= PE_{(pos,2i+1)}PE_{(k,2i+1)} - PE_{(pos,2i)}PE_{(k,2i)} \end{aligned}$$
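Both identities can be checked numerically. The sketch below (the helper names `pe_sin`/`pe_cos` are ours; the `pos`, `k`, `i` values are arbitrary examples) evaluates each side of the two equations directly:

```python
import math

d_model = 512

def pe_sin(pos, i):
    # PE(pos, 2i): the even dimension of the encoding
    return math.sin(pos / 10000 ** (2 * i / d_model))

def pe_cos(pos, i):
    # PE(pos, 2i+1): the odd dimension of the encoding
    return math.cos(pos / 10000 ** (2 * i / d_model))

pos, k, i = 7, 3, 5

# PE(pos+k, 2i) = PE(pos,2i)*PE(k,2i+1) + PE(k,2i)*PE(pos,2i+1)
lhs_even = pe_sin(pos + k, i)
rhs_even = pe_sin(pos, i) * pe_cos(k, i) + pe_sin(k, i) * pe_cos(pos, i)
print(abs(lhs_even - rhs_even) < 1e-12)  # True

# PE(pos+k, 2i+1) = PE(pos,2i+1)*PE(k,2i+1) - PE(pos,2i)*PE(k,2i)
lhs_odd = pe_cos(pos + k, i)
rhs_odd = pe_cos(pos, i) * pe_cos(k, i) - pe_sin(pos, i) * pe_sin(k, i)
print(abs(lhs_odd - rhs_odd) < 1e-12)  # True
```

In other words, $PE_{pos+k}$ is obtained from $PE_{pos}$ by a rotation whose coefficients depend only on $k$, which is exactly the "linear function of $PE_{pos}$" the paper refers to.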