Transformer Model - Self Attention - Implementation with In-Depth Details

Implementation

Following is a basic implementation of Self-Attention using PyTorch. The main goal of this implementation is to make it easier to understand how the attention scores are computed and then applied to the values to generate the final output. An optimized implementation of multi-headed attention with einsum() will follow in the next post.
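Below is a minimal sketch of such a single-head module, assuming a class named SelfAttention with emb_dim (input embedding size) and model_dim (projection size) as constructor arguments; the names are illustrative, and the usual 1/sqrt(d) scaling of scaled dot-product attention is left out so that the five steps described below are easy to trace.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention (illustrative sketch)."""

    def __init__(self, emb_dim, model_dim):
        super().__init__()
        # Linear projections that produce Query, Key, and Value from the input.
        self.query = nn.Linear(emb_dim, model_dim, bias=False)
        self.key = nn.Linear(emb_dim, model_dim, bias=False)
        self.value = nn.Linear(emb_dim, model_dim, bias=False)

    def forward(self, x):
        # x: (seq_len, emb_dim)
        Q = self.query(x)                  # (seq_len, model_dim)
        K = self.key(x)                    # (seq_len, model_dim)
        V = self.value(x)                  # (seq_len, model_dim)

        # Attention scores: each query position against every key position.
        scores = Q @ K.transpose(-2, -1)   # (seq_len, seq_len)

        # Convert each row of scores into a probability distribution.
        weights = F.softmax(scores, dim=-1)

        # Weight the values and sum them for each query position.
        return weights @ V                 # (seq_len, model_dim)
```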

Now, we’ll try to understand the above code in detail and see which part implements which step.

Five steps of implementation

We can split the above implementation into the following five steps (illustrated with a toy walkthrough after the list):

  1. Create Query, Key, and Value vectors from the input vectors.
  2. Compute attention scores using the Query and the Key (transposed).
  3. Convert the attention scores into a probability distribution using softmax.
  4. Compute weighted values by multiplying the attention scores by the corresponding values.
  5. Add up the weighted values computed using the scores of a particular query.
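To make these steps concrete, here is a toy walkthrough with plain tensors (the shapes and random values are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(3, 4)                                  # 3 positions, emb_dim = 4
W_q, W_k, W_v = (torch.randn(4, 3) for _ in range(3))  # model_dim = 3

Q, K, V = x @ W_q, x @ W_k, x @ W_v   # step 1: Query, Key, Value
scores = Q @ K.T                      # step 2: Query x Key^T -> (3, 3)
weights = F.softmax(scores, dim=-1)   # step 3: each row sums to 1
output = weights @ V                  # steps 4-5: weighted sum of the Values
print(output.shape)                   # torch.Size([3, 3])
```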

Implementation Details

Imports and Class initialization.

In the above __init__() method, emb_dim is the embedding dimension of the input at each position, e.g. the word embeddings in a sentence, and model_dim is the dimensionality of the Query, Key, and Value projections (the output size of their weight matrices). To make the output predictable, the weights of the Linear layers are overwritten with predefined tensors.
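For instance, the weights of an nn.Linear layer can be overwritten in place with fixed tensors (the values below are hypothetical; the actual predefined tensors are not reproduced here):

```python
import torch
import torch.nn as nn

emb_dim, model_dim = 4, 3
query = nn.Linear(emb_dim, model_dim, bias=False)

# Hypothetical fixed weights; nn.Linear stores its weight with shape
# (out_features, in_features), i.e. (model_dim, emb_dim).
fixed_w_q = torch.arange(model_dim * emb_dim, dtype=torch.float32)
fixed_w_q = fixed_w_q.reshape(model_dim, emb_dim) / 10.0

with torch.no_grad():
    query.weight.copy_(fixed_w_q)
```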

Written on November 12, 2021