Select Connection: INPUT[inlineListSuggester(optionQuery(#permanent_note), optionQuery(#literature_note), optionQuery(#fleeting_note)):connections]
Transformer
A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper “Attention Is All You Need”. Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with the other (unmasked) tokens via a parallel multi-head attention mechanism, which allows the signal for important tokens to be amplified and that of less important tokens to be diminished.
LSTM and GRU improved RNN performance on long sequences, but at some cost:
- increased complexity
- sequential processing (creates bottlenecks)
The transformer architecture allows the computation to be parallelized.
The transformer essentially combines attention with CNN-style parallel processing.
Self-Attention
Self-attention computes attention-based representations for each of the words in the input sequence.
$A^{\langle t \rangle}$ = attention-based vector representation of a word, which is calculated for each word. The representation looks at the surrounding words to choose the appropriate embedding:

$$A^{\langle t \rangle} = A(q^{\langle t \rangle}, K, V) = \sum_i \frac{\exp\!\big(q^{\langle t \rangle} \cdot k^{\langle i \rangle}\big)}{\sum_j \exp\!\big(q^{\langle t \rangle} \cdot k^{\langle j \rangle}\big)}\, v^{\langle i \rangle}$$

where:
- $q^{\langle t \rangle} = W^Q x^{\langle t \rangle}$ is called Query,
- $k^{\langle t \rangle} = W^K x^{\langle t \rangle}$ is called Key,
- $v^{\langle t \rangle} = W^V x^{\langle t \rangle}$ is called Value.
The names come from a database analogy. For example, if we have the phrase:
Jane visite l’Afrique en septembre (“Jane visits Africa in September”)
For word 3 (l’Afrique), the query might ask “What’s happening there?”, and the key of word 2 (visite) might answer “action”; this is the best match, so the value corresponding to that key contributes most to the representation.
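A minimal numpy sketch of this computation (including the usual $1/\sqrt{d_k}$ scaling from the paper); the toy embeddings and the weight matrices are random placeholders, not values from any trained model:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model) word embeddings; returns one attention-based vector per word."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example: 5 words ("Jane visite l'Afrique en septembre"), embedding size 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # stand-in embeddings
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
A = self_attention(X, W_Q, W_K, W_V)
print(A.shape)                                      # (5, 8): one representation A^<t> per word
```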
Multi-Head Attention
Repeat the self-attention step for all words and we get one head:

$$\text{head} = \text{Attention}(W^Q Q, W^K K, W^V V)$$

Do the previous step $h$ times with different matrices $W_i^Q, W_i^K, W_i^V$ for $i = 1, \ldots, h$; you can see it as asking different questions or extracting different features. Then:

$$\text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$
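A sketch of the multi-head step, reusing the `self_attention` function and the toy `X` from the sketch above; the head count $h = 4$ and the head size are arbitrary choices for illustration:

```python
def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head (one per "question").
    Each head runs its own self-attention; the outputs are concatenated and mixed by W_O."""
    head_outputs = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O

h, d_model, d_head = 4, 8, 8
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))
print(multi_head_attention(X, heads, W_O).shape)    # (5, 8), same shape as the input
```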
Transformer Architecture
In the encoder, the NN identifies the most important sections of the input sentence. In the decoder, the NN tries to predict the next word of the translation.
The decoder takes the partial translation produced so far as input and predicts the next word, as sketched below.
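A toy greedy-decoding sketch of that loop; `predict_next` is a hypothetical stand-in for a trained decoder (here just a dummy that walks through a fixed vocabulary), reusing numpy from the sketches above:

```python
def greedy_decode(source_tokens, predict_next, vocab, max_len=10):
    """Repeatedly feed the partial translation back in and pick the most likely next word."""
    partial = ["<SOS>"]
    while len(partial) < max_len:
        probs = predict_next(source_tokens, partial)   # distribution over the target vocabulary
        next_word = vocab[int(np.argmax(probs))]
        partial.append(next_word)
        if next_word == "<EOS>":
            break
    return partial[1:]

# Dummy predictor standing in for the trained decoder: emits a fixed sentence.
vocab = ["Jane", "visits", "Africa", "in", "September", "<EOS>"]
def dummy_predict(source, partial):
    return np.eye(len(vocab))[min(len(partial) - 1, len(vocab) - 1)]

print(greedy_decode(["Jane", "visite", "l'Afrique", "en", "septembre"], dummy_predict, vocab))
```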
Further improvements
Positional Encoding: the position pos of each word is encoded as a vector of fixed length $d$:

$$PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right) \qquad PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

$PE(\text{pos}, 2i)$ gives a sinusoid while $PE(\text{pos}, 2i+1)$ gives the corresponding cosinusoid. Each word will have a different vector of length $d$, and each position corresponds to a point on the sinusoid or cosinusoid.
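A small numpy sketch of these formulas (assuming an even embedding size $d$):

```python
def positional_encoding(max_len, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
    i = np.arange(d // 2)[None, :]                   # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)
    PE = np.zeros((max_len, d))
    PE[:, 0::2] = np.sin(angles)                     # even dimensions: sinusoids
    PE[:, 1::2] = np.cos(angles)                     # odd dimensions: cosinusoids
    return PE

X_with_pos = X + positional_encoding(5, 8)           # positional vectors are added to the embeddings
```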
Add & Norm: at each layer there is an add-and-normalize block, i.e. a residual connection (“add”) followed by layer normalization (“norm”).
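A minimal sketch of such a block (the learned scale and shift parameters are omitted), reusing `X` and `A` from the self-attention sketch:

```python
def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection ("add") followed by layer normalization over the features ("norm")."""
    y = x + sublayer_out
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)

print(add_and_norm(X, A).shape)                      # same shape as the input, (5, 8)
```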
Decoder output: a linear layer followed by a softmax produces the probability distribution of the next word.
Masked Multi-Head Attention: during training it assumes that the first words of the translation are perfect (the ground truth) and masks the rest. In this way we can check whether the next word the network predicts corresponds to the correct one.
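A sketch of the masking mechanism, again with the toy `X` and weight matrices from above: score entries above the diagonal (the “future” words) are pushed to a large negative value, so the softmax gives them essentially zero weight and each position can only look at itself and earlier words.

```python
def masked_self_attention(X, W_Q, W_K, W_V):
    """Self-attention where each word may only attend to itself and to earlier words."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal = future words
    scores = np.where(mask, -1e9, scores)                    # future words get ~zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

print(masked_self_attention(X, W_Q, W_K, W_V).shape)         # (5, 8)
```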