Select Connection: INPUT[inlineListSuggester(optionQuery(#permanent_note), optionQuery(#literature_note), optionQuery(#fleeting_note)):connections]
Transformer
A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper “Attention Is All You Need”. Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with the other (unmasked) tokens via a parallel multi-head attention mechanism, which allows the signal for important tokens to be amplified and that of less important tokens to be diminished.
LSTM and GRU improved RNN performance on long sequences, but at some cost:
- increased complexity
- sequential processing (creates bottlenecks)
The transformer architecture allows the computation to be parallelized.
The transformer essentially combines attention with CNN-style parallel processing.
Self-Attention
Self-attention computes attention-based representations for each of the words in the input sequence.
$A^{\langle t \rangle}$ = attention-based vector representation of a word, which is calculated for each word. The representation looks at the surrounding words to choose the appropriate embedding:

$$A^{\langle t \rangle} = A(q^{\langle t \rangle}, K, V) = \sum_i \frac{\exp\!\big(q^{\langle t \rangle} \cdot k^{\langle i \rangle}\big)}{\sum_j \exp\!\big(q^{\langle t \rangle} \cdot k^{\langle j \rangle}\big)}\, v^{\langle i \rangle}$$

where:
- $q^{\langle t \rangle} = W^Q x^{\langle t \rangle}$ is called Query,
- $k^{\langle t \rangle} = W^K x^{\langle t \rangle}$ is called Key,
- $v^{\langle t \rangle} = W^V x^{\langle t \rangle}$ is called Value.
The names come from a database analogy. For example, if we have the phrase:
Jane visite l’Afrique en septembre (“Jane visits Africa in September”)
For word 3 (l’Afrique), the query might ask “What’s happening there?”, and the key of word 2 (visite) might answer “action”; this is the best match, so the value corresponding to that key contributes most to the representation.
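A minimal numpy sketch of this computation (including the usual $1/\sqrt{d_k}$ scaling from the paper); the toy embeddings and the weight matrices are random placeholders, not values from any trained model:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model) word embeddings; returns one attention-based vector per word."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example: 5 words ("Jane visite l'Afrique en septembre"), embedding size 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # stand-in embeddings
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
A = self_attention(X, W_Q, W_K, W_V)
print(A.shape)                                      # (5, 8): one representation A^<t> per word
```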
Multi-Head Attention
Repeat the self-attention step for all words and we get one head:

$$\text{head} = \text{Attention}(W^Q Q, W^K K, W^V V)$$

Do the previous step $h$ times with different matrices $W_i^Q, W_i^K, W_i^V$ for $i = 1, \ldots, h$; you can see it as asking different questions or extracting different features. Then:

$$\text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$
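A sketch of the multi-head step, reusing the `self_attention` function and the toy `X` from the sketch above; the head count $h = 4$ and the head size are arbitrary choices for illustration:

```python
def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head (one per "question").
    Each head runs its own self-attention; the outputs are concatenated and mixed by W_O."""
    head_outputs = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O

h, d_model, d_head = 4, 8, 8
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))
print(multi_head_attention(X, heads, W_O).shape)    # (5, 8), same shape as the input
```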
Transformer Architecture
In the encoder, the NN identifies the most important sections of the input sentence. In the decoder, the NN tries to predict the next word of the translation.
The decoder takes the partial translation produced so far as input and predicts the next word, as sketched below.
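A toy greedy-decoding sketch of that loop; `predict_next` is a hypothetical stand-in for a trained decoder (here just a dummy that walks through a fixed vocabulary), reusing numpy from the sketches above:

```python
def greedy_decode(source_tokens, predict_next, vocab, max_len=10):
    """Repeatedly feed the partial translation back in and pick the most likely next word."""
    partial = ["<SOS>"]
    while len(partial) < max_len:
        probs = predict_next(source_tokens, partial)   # distribution over the target vocabulary
        next_word = vocab[int(np.argmax(probs))]
        partial.append(next_word)
        if next_word == "<EOS>":
            break
    return partial[1:]

# Dummy predictor standing in for the trained decoder: emits a fixed sentence.
vocab = ["Jane", "visits", "Africa", "in", "September", "<EOS>"]
def dummy_predict(source, partial):
    return np.eye(len(vocab))[min(len(partial) - 1, len(vocab) - 1)]

print(greedy_decode(["Jane", "visite", "l'Afrique", "en", "septembre"], dummy_predict, vocab))
```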
Further improvements
Positional Encoding: the position pos of each word is encoded as a vector of fixed length $d$:

$$PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right) \qquad PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

$PE(\text{pos}, 2i)$ gives a sinusoid while $PE(\text{pos}, 2i+1)$ gives the corresponding cosinusoid. Each word will have a different vector of length $d$, and each position corresponds to a point on the sinusoid or cosinusoid.
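A small numpy sketch of these formulas (assuming an even embedding size $d$):

```python
def positional_encoding(max_len, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
    i = np.arange(d // 2)[None, :]                   # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)
    PE = np.zeros((max_len, d))
    PE[:, 0::2] = np.sin(angles)                     # even dimensions: sinusoids
    PE[:, 1::2] = np.cos(angles)                     # odd dimensions: cosinusoids
    return PE

X_with_pos = X + positional_encoding(5, 8)           # positional vectors are added to the embeddings
```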
Add & Norm: at each layer there is an add-and-normalize block, i.e. a residual connection (“add”) followed by layer normalization (“norm”).
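A minimal sketch of such a block (the learned scale and shift parameters are omitted), reusing `X` and `A` from the self-attention sketch:

```python
def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection ("add") followed by layer normalization over the features ("norm")."""
    y = x + sublayer_out
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)

print(add_and_norm(X, A).shape)                      # same shape as the input, (5, 8)
```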
Decoder output: a linear layer followed by a softmax produces the probability distribution of the next word.
Masked Multi-Head Attention: during training it assumes that the first words of the translation are perfect (the ground truth) and masks the rest. In this way we can check whether the next word the network predicts corresponds to the correct one.
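A sketch of the masking mechanism, again with the toy `X` and weight matrices from above: score entries above the diagonal (the “future” words) are pushed to a large negative value, so the softmax gives them essentially zero weight and each position can only look at itself and earlier words.

```python
def masked_self_attention(X, W_Q, W_K, W_V):
    """Self-attention where each word may only attend to itself and to earlier words."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal = future words
    scores = np.where(mask, -1e9, scores)                    # future words get ~zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

print(masked_self_attention(X, W_Q, W_K, W_V).shape)         # (5, 8)
```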