😋Deep learning Guide 5: Sequence Models, Language Models

Learning Language Models

A distribution over sequences satisfies the Markov property of first order if $P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)$. Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence, for instance over four tokens:
$$P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2)\,P(x_3)\,P(x_4)$$
$$P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)\,P(x_4 \mid x_3)$$
$$P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\,P(x_4 \mid x_2, x_3)$$
These correspond to unigram, bigram, and trigram models, respectively.

Word Frequency

The probabilities above can be estimated from word frequencies in a training corpus: writing $n(x)$ and $n(x, x')$ for the counts of single words and consecutive word pairs, the maximum-likelihood estimates are $\hat{P}(x) = n(x)/n$ and $\hat{P}(x' \mid x) = n(x, x')/n(x)$, where $n$ is the total number of words. Rare combinations, however, receive zero probability under these estimates.
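A minimal counting sketch of these frequency estimates, assuming a hypothetical toy corpus (the corpus and variable names below are illustrative only, not from the original post):

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be a tokenized training set.
tokens = "it is raining outside and it is cold outside".split()

# n(x) and n(x, x'): unigram and bigram counts.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
n = len(tokens)

# Maximum-likelihood estimates from relative frequencies.
p_unigram = {x: c / n for x, c in unigram_counts.items()}
p_bigram = {(x, y): c / unigram_counts[x] for (x, y), c in bigram_counts.items()}

print(p_unigram["it"])         # n("it") / n = 2 / 9
print(p_bigram[("it", "is")])  # n("it", "is") / n("it") = 2 / 2
```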

Laplace Smoothing

A common strategy is to perform some form of Laplace smoothing: add a small constant to all counts (a small code sketch follows the list below). Denoting by $n$ the total number of words in the training set and by $m$ the number of unique words, the smoothed estimates are
$$\hat{P}(x) = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \qquad \hat{P}(x' \mid x) = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \qquad \hat{P}(x'' \mid x, x') = \frac{n(x, x', x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}$$
  • Here $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are hyperparameters.
  • Take $\epsilon_1$ as an example:
    • when $\epsilon_1 = 0$, no smoothing is applied;
    • when $\epsilon_1$ approaches positive infinity, $\hat{P}(x)$ approaches the uniform probability $1/m$.
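A minimal sketch of unigram and bigram Laplace smoothing, reusing the same kind of toy counts as above (the corpus and $\epsilon$ values are assumptions chosen for illustration):

```python
from collections import Counter

# Hypothetical toy corpus and epsilon values; purely illustrative.
tokens = "it is raining outside and it is cold outside".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
n, m = len(tokens), len(unigram_counts)
eps1, eps2 = 1.0, 1.0

def p_hat(x):
    # Smoothed unigram estimate: (n(x) + eps1/m) / (n + eps1).
    # eps1 = 0 recovers the raw frequency; eps1 -> infinity gives the uniform 1/m.
    return (unigram_counts[x] + eps1 / m) / (n + eps1)

def p_hat_pair(x, y):
    # Smoothed bigram estimate: (n(x, y) + eps2 * p_hat(y)) / (n(x) + eps2).
    return (bigram_counts[(x, y)] + eps2 * p_hat(y)) / (unigram_counts[x] + eps2)

print(p_hat("outside"))
print(p_hat_pair("is", "raining"))
print(p_hat_pair("is", "banana"))  # unseen pair still receives a nonzero probability
```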

Perplexity

Next, let’s discuss how to measure the quality of the language model, which we will then use to evaluate our models in the subsequent sections. One way is to check how surprising the text is. A good language model is able to predict, with high accuracy, the tokens that come next. Consider the following continuations of the phrase “It is raining”, as proposed by different language models:
  1. “It is raining outside”
  2. “It is raining banana tree”
  3. “It is raining piouw;kcj pwepoiut”
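Perplexity makes this notion of surprise quantitative; its standard definition, written out here for reference, exponentiates the average negative log-likelihood of the observed tokens:
$$\operatorname{PPL} = \exp\left(-\frac{1}{n} \sum_{t=1}^{n} \log P(x_t \mid x_{t-1}, \ldots, x_1)\right)$$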
Perplexity can be best understood as the geometric mean of the number of real choices that we have when deciding which token to pick next (equivalently, the reciprocal of the geometric mean of the probabilities assigned to the target tokens). Let’s look at a number of cases; a small numerical sketch follows the list:
  • In the best case scenario, the model always perfectly estimates the probability of the target token as 1. In this case the perplexity of the model is 1.
  • In the worst case scenario, the model always predicts the probability of the target token as 0. In this situation, the perplexity is positive infinity.
  • At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do for encoding it. Hence, this provides a nontrivial upper bound that any useful model must beat.
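A tiny numerical check of these three cases, using made-up per-token probabilities and a hypothetical vocabulary size:

```python
import math

def perplexity(target_probs):
    # Exponentiated average negative log-probability assigned to each target token.
    return math.exp(-sum(math.log(p) for p in target_probs) / len(target_probs))

vocab_size = 10_000

print(perplexity([1.0, 1.0, 1.0]))       # best case: every target predicted with probability 1 -> perplexity 1
print(perplexity([1 / vocab_size] * 3))  # uniform baseline -> perplexity equals the vocabulary size
print(perplexity([1e-300] * 3))          # probabilities close to 0 -> perplexity explodes toward infinity
```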

Converting Raw Text into Sequence Data

 