Select Connection: INPUT[inlineListSuggester(optionQuery(#permanent_note), optionQuery(#literature_note), optionQuery(#fleeting_note)):connections]
Sequence Models are motivated by the analysis of sequential data such as text sentences, time series, and other discrete sequence data. These models are especially designed to handle sequential information, while Convolutional Neural Networks are better suited to processing spatial information.
Basic Model
Suppose we want to translate a sentence from French to English. A first RNN takes care of encoding the input and outputs a vector, which is then fed to a second RNN acting as the decoder.
Components:
- Encoding network
- Decoding network
This basic model also works for image captioning:
- learn an encoding of the image with a convolutional network
- attach an RNN to that encoding to generate the output, which is a sequence of words (the caption)
→ in the next section we will explore how to get not a randomly sampled output (as in this basic model), but the most likely one
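A minimal sketch of this encoder-decoder idea, assuming a PyTorch setup; the vocabulary sizes, token ids, and the greedy decoding loop are illustrative placeholders, not taken from the course:

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 8000, 6000, 256, 512  # hypothetical sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                  # src: (batch, T_x)
        _, h = self.rnn(self.emb(src))       # h: (1, batch, HID), the encoding vector
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, prev_word, h):         # one decoding step
        _, h = self.rnn(self.emb(prev_word), h)
        return self.out(h[-1]), h            # logits over the next word, new state

def translate(src, encoder, decoder, sos=1, eos=2, max_len=30):
    """Greedy decoding sketch: feed the encoder state to the decoder and
    repeatedly pick the most likely next word until <eos> (token ids are placeholders)."""
    h = encoder(src)
    word = torch.full((src.size(0), 1), sos, dtype=torch.long)
    result = []
    for _ in range(max_len):
        logits, h = decoder(word, h)
        word = logits.argmax(dim=-1, keepdim=True)
        result.append(word)
        if (word == eos).all():
            break
    return torch.cat(result, dim=1)
```

The greedy loop here just illustrates how the two networks are chained; the next sections replace it with a search for the most likely sentence.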
Picking the Most Likely Sentence
The basic model described above is very similar to a Language Model, the only difference being that the input is a vector produced by another network (the encoder). For this reason we will call the "machine translation" model a Conditional Language Model, which computes:

$$P(y^{<1>}, \dots, y^{<T_y>} \mid x)$$

What we want now is the most likely translation:

$$\arg\max_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} \mid x)$$

To find it we will use an algorithm called Beam Search.
A greedy approach (picking the single most likely word at each step) will not work well: it tends to return suboptimal sentences.
Beam Search
It has a hyperparameter $B$ (Beam Width), which regulates how many candidate sentences we keep. At the first step, the algorithm takes the $B$ most likely first words; at the second step it takes the most likely next words for each of the $B$ candidates kept at the previous step, again keeping only the best $B$ partial sentences; it continues like this for every position.
For the decoder, at each position $t$ it outputs:

$$P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})$$
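A sketch of the core beam search loop, assuming a hypothetical `next_word_log_probs(x, prefix)` function that returns a dict of log-probabilities for the next word given the input and the partial translation:

```python
import heapq

def beam_search(x, next_word_log_probs, beam_width=10, eos=2, max_len=30):
    """Keep the beam_width most likely partial translations at every step."""
    beam = [(0.0, [])]                       # (log-probability, partial sentence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, prefix in beam:
            for word, word_log_p in next_word_log_probs(x, prefix).items():
                candidates.append((log_p + word_log_p, prefix + [word]))
        # keep only the beam_width best candidates
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        # move finished sentences out of the beam
        still_open = []
        for log_p, sent in beam:
            (completed if sent[-1] == eos else still_open).append((log_p, sent))
        beam = still_open
        if not beam:
            break
    return max(completed + beam, key=lambda c: c[0])
```

With `beam_width=1` this degenerates to the greedy approach; larger widths explore more candidate sentences at each step.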
Some refinements:
Length Normalization
instead of maximizing the product:

$$\arg\max_{y} \prod_{t=1}^{T_y} P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})$$

which could give very small numbers and lead to errors due to numerical rounding of floating-point numbers, in practice the log scale is used:

$$\arg\max_{y} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})$$

Also, to avoid longer sequences getting smaller probabilities, we multiply by the factor $\frac{1}{T_y^{\alpha}}$, usually with $\alpha \approx 0.7$.
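As a small illustration, the length-normalized objective can be scored like this (assuming `log_probs` is the list of per-word log-probabilities of a candidate sentence):

```python
def normalized_score(log_probs, alpha=0.7):
    """Length-normalized log-likelihood used to rank beam search candidates."""
    return sum(log_probs) / (len(log_probs) ** alpha)
```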
Discussion
Beam width $B$:
- large $B$: better result, slower
- small $B$: worse result, faster
- usually around 10; 100 is considered very large
Beam search runs faster than exact search algorithms, but it is not guaranteed to find the exact maximum.
Error Analysis
Given an input $x$, call the human translation $y^*$ and the algorithm's output $\hat{y}$.
Cases:
- $P(y^* \mid x) > P(\hat{y} \mid x)$: beam search failed to find the higher-probability sentence, so beam search is at fault
- $P(y^* \mid x) \le P(\hat{y} \mid x)$: $y^*$ is a better translation than $\hat{y}$, yet the model assigns it a lower probability, so the RNN model is at fault
For each dev set example, do this comparison and note which component is at fault. If Beam Search is at fault most often → increase the beam width; otherwise → do a deeper analysis of the RNN model to see whether regularization or other ML techniques should be applied.
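A sketch of this bookkeeping over the dev set, assuming a hypothetical `log_prob(x, y)` function that returns the model's $\log P(y \mid x)$:

```python
def attribute_errors(dev_set, log_prob):
    """Count how often beam search vs. the RNN model is at fault."""
    faults = {"beam search": 0, "RNN model": 0}
    for x, y_star, y_hat in dev_set:          # human translation y*, algorithm output ŷ
        if log_prob(x, y_star) > log_prob(x, y_hat):
            faults["beam search"] += 1        # search missed a higher-probability sentence
        else:
            faults["RNN model"] += 1          # model prefers the worse translation
    return faults
```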
BLEU Score
When you have multiple great answers, how do you measure accuracy?
BLEU (BiLingual Evaluation Understudy) can be a substitute for having humans evaluate each output.
As an example on bigrams (pairs of words), consider the following references:
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
and the algorithm output:
Output: The cat the cat on the mat
| bigram | count | count$_{clip}$ |
|---|---|---|
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |
so the modified precision is $\frac{\sum \text{count}_{clip}}{\sum \text{count}} = \frac{4}{6}$, where:
- $\sum \text{count}$ is the total number of bigrams in the output
- $\text{count}_{clip}$ clips each bigram's count to the maximum number of times it appears in any single reference
on unigrams:

$$p_1 = \frac{\sum_{\text{unigram} \in \hat{y}} \text{count}_{clip}(\text{unigram})}{\sum_{\text{unigram} \in \hat{y}} \text{count}(\text{unigram})}$$

more generally, on n-grams:

$$p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{count}_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{count}(\text{n-gram})}$$

We can also define the combined BLEU score over multiple n-grams (typically up to 4):

$$\text{BLEU} = BP \cdot \exp\left(\frac{1}{4} \sum_{n=1}^{4} \log p_n\right)$$

$BP$ is the brevity penalty, defined as:

$$BP = \begin{cases} 1 & \text{if output\_length} > \text{reference\_length} \\ \exp\!\left(1 - \frac{\text{reference\_length}}{\text{output\_length}}\right) & \text{otherwise} \end{cases}$$

which penalizes translations that are shorter than the references.
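A sketch of the clipped n-gram precision and the combined score using `collections.Counter`; this is a simplified illustration under the definitions above, not the full reference implementation:

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def clipped_precision(output, references, n):
    """p_n: clipped n-gram counts in the output divided by the total n-gram count."""
    out_counts = Counter(ngrams(output, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in out_counts.items())
    return clipped / max(sum(out_counts.values()), 1)

def bleu(output, references, max_n=4):
    """Combined BLEU score with brevity penalty (shortest reference used as length)."""
    log_p = [math.log(max(clipped_precision(output, references, n), 1e-12))
             for n in range(1, max_n + 1)]
    ref_len = min(len(r) for r in references)
    bp = 1.0 if len(output) > ref_len else math.exp(1 - ref_len / max(len(output), 1))
    return bp * math.exp(sum(log_p) / max_n)

out = "the cat the cat on the mat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(clipped_precision(out, refs, 2))   # 4/6 for the bigram example above
```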
Attention Model
Usually, the model performs poorly on short sentences and on sentences that are too long. The intuition behind attention is that a human translator translates a small part of the sentence, then reads the next part and translates it, and continues like this until done.
The model computes a set of attention weights $\alpha^{<t, t'>}$, which say how much attention the output word at position $t$ should pay to the input word at position $t'$. This is repeated for each word of the input.
The context fed to the decoder at step $t$ is the weighted sum:

$$c^{<t>} = \sum_{t'} \alpha^{<t, t'>} a^{<t'>}$$

with the constraint that $\sum_{t'} \alpha^{<t, t'>} = 1$.
Computing attention
The weights $\alpha^{<t, t'>}$ can be computed with a small neural network whose inputs are the previous decoder state $s^{<t-1>}$ and the encoder activation $a^{<t'>}$:

$$\alpha^{<t, t'>} = \frac{\exp(e^{<t, t'>})}{\sum_{t''=1}^{T_x} \exp(e^{<t, t''>})}, \qquad e^{<t, t'>} = \text{NN}\left(s^{<t-1>}, a^{<t'>}\right)$$
One downside is the quadratic cost of the algorithm: computing all the $\alpha^{<t, t'>}$ weights takes $O(T_x \cdot T_y)$ time.
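A numpy sketch of one decoding step of this attention computation, using a single-hidden-layer scoring network; the parameter shapes `W`, `v` are illustrative placeholders:

```python
import numpy as np

def attention_step(a, s_prev, W, v):
    """Compute alpha<t, t'> over all input positions and the context c<t>.

    a:      (T_x, n_a)            encoder activations, one per input word
    s_prev: (n_s,)                previous decoder hidden state s<t-1>
    W:      (n_e, n_s + n_a), v: (n_e,)   small scoring-network parameters
    """
    T_x = a.shape[0]
    # e<t, t'> = small NN applied to [s<t-1>, a<t'>] for every input position t'
    concat = np.concatenate([np.repeat(s_prev[None, :], T_x, axis=0), a], axis=1)
    e = np.tanh(concat @ W.T) @ v                # (T_x,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # softmax -> weights sum to 1
    context = alpha @ a                          # c<t> = sum_t' alpha<t,t'> a<t'>
    return alpha, context

# toy usage with random parameters
rng = np.random.default_rng(0)
a = rng.normal(size=(6, 8))                      # 6 input words, 8-dim activations
s_prev = rng.normal(size=(10,))
W = rng.normal(size=(16, 18))
v = rng.normal(size=(16,))
alpha, c = attention_step(a, s_prev, W, v)
print(alpha.sum())                               # 1.0
```

Running this step once per output position for every input position is exactly where the quadratic $O(T_x \cdot T_y)$ cost comes from.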