Modeling a Natural Language
March 7th, 2017
A natural language is any language that has evolved naturally over time among humans. Typically, a natural language has a finite set of words *V*, also called the vocabulary. From this vocabulary, the language then gives rise to a set of possible sentences *V†* (in practice effectively unbounded, since sentences can be of any length).
The language modeling problem is about learning a probability function *p* from training sentence examples drawn from *V†*, such that
(image: eq_0.png)
This function is expected to assign high probabilities to sentences that occur frequently in the language and low or zero probabilities to rare ones.
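For concreteness, these requirements are commonly written as follows (a standard formulation, stated here for reference rather than taken from the figure):

```latex
p(x) \ge 0 \quad \text{for every sentence } x \in V^{\dagger},
\qquad \sum_{x \in V^{\dagger}} p(x) = 1
```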
## Applications
Before we look into possible approaches to modeling a language, it is important to mention what we can achieve with a good language model. The fact of the matter is that there are lots of powerful applications of language models in the following areas:
- speech recognition
- optical character recognition
- handwriting recognition
- natural language generation
- and many more ...
In the remaining part of the article, we will look into two possible ways to model a natural language.
## 1. Naive Language Model
The naive language model is the simplest way to find the statistical distribution of sequences of words (i.e. sentences) in a language. Though simpler is generally better, the following naive language model does not turn out to be an effective solution:
(image: eq_1.png)
The biggest problem with naive language models is that any sentence not seen in the training data, no matter how plausible, receives a probability of zero.
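As a rough illustration, here is a minimal sketch of such a naive model in Python (function and variable names are my own, and the toy corpus is purely illustrative):

```python
from collections import Counter

def train_naive_model(corpus):
    """Estimate p(sentence) = count(sentence) / number of training sentences."""
    counts = Counter(tuple(sentence) for sentence in corpus)
    total = len(corpus)
    return {sentence: c / total for sentence, c in counts.items()}

corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "barks"],
]
model = train_naive_model(corpus)

print(model[("the", "dog", "barks")])            # 2/3 -- seen twice, high probability
print(model.get(("the", "cat", "barks"), 0.0))   # 0.0 -- never seen, zero probability
```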
Trigram language models (described below) provide a way to overcome this deficiency of naive language models.
## 2. Trigram Language Models
In an ideal world, we would like to learn the following joint distribution:
(image: eq_2.png)
That said, finding such a distribution directly is computationally intractable, particularly as the vocabulary size and sentence length increase. That is why we apply trigram language models (i.e. modeling sentences as combinations of overlapping 3-word sequences) to estimate the probability of a sentence.
(image: eq_3.png)
Trigram language models are a direct application of a second-order Markov process. A Markov process is a sequence of random variables in which the distribution of each variable depends only on a fixed number of preceding variables; in the second-order case, each word depends only on the two words before it. Through the probabilities of individual trigrams (sequences of 3 words) we can compute the probability of a complete sentence. In trigram models, the value of each random variable depends on the values of the two previous random variables, and the model thereby defines a generative process with which sentences can be produced one word at a time.
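Written out, the second-order Markov assumption behind this factorization is (standard notation, stated here for reference):

```latex
P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})
  \approx P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
```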
A typical Trigram language model has two key aspects, feature representation and parameter estimation, as discussed below:
### 1. Feature (Sentence) Representation
As we know, sentences in a language are not of fixed length. Therefore, we need a scheme to represent and model variable-length sentences. Trigram models use a very simple solution: a STOP symbol marks the end of the sentence at position *Xn*, and a star (\*) symbol stands in for the non-existent words at positions *X0* and *X-1*. In fact, this scheme for dealing with variable-length sentences is an extension of the Markov process itself: at position 1, the values of the two previous variables are always \*, and as soon as the process generates the STOP symbol it terminates and outputs the sequence of values generated so far.
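A minimal sketch of this padding scheme in Python (the helper name is my own):

```python
def extract_trigrams(sentence):
    """Pad a sentence with two '*' start symbols and a trailing 'STOP',
    then return all overlapping trigrams (u, v, w)."""
    padded = ["*", "*"] + list(sentence) + ["STOP"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

print(extract_trigrams(["the", "dog", "barks"]))
# [('*', '*', 'the'), ('*', 'the', 'dog'), ('the', 'dog', 'barks'), ('dog', 'barks', 'STOP')]
```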
### 2. Parameter Estimation Problem
A trigram model consists of
1. a finite set of words, i.e. the vocabulary *V*
2. a parameter q(*w* | *u*, *v*) for each trigram *u, v, w*
For parameter estimation, trigram models rely on the maximum likelihood method. Maximum likelihood (ML) estimation of trigram parameters is very natural and intuitive, as follows:
(image: eq_4.png)
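A minimal sketch of this estimate in Python, reusing the `extract_trigrams` helper and toy `corpus` from the earlier sketches (the estimate divides each trigram count by the count of its two-word context):

```python
from collections import Counter

def train_trigram_mle(corpus):
    """Maximum likelihood estimates q(w | u, v) = count(u, v, w) / count(u, v)."""
    trigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        for u, v, w in extract_trigrams(sentence):
            trigram_counts[(u, v, w)] += 1
            bigram_counts[(u, v)] += 1
    return {
        (u, v, w): c / bigram_counts[(u, v)]
        for (u, v, w), c in trigram_counts.items()
    }

q = train_trigram_mle(corpus)
print(q[("the", "dog", "barks")])   # fraction of times 'barks' followed 'the dog'
```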
**Sparse Data Problem**
Data sparsity is one of the major challenges in language modeling. Maximum likelihood estimation cannot be used as-is for trigram parameter estimation: for rarely seen word combinations the denominator count in the maximum likelihood equation can be zero, leaving the estimate undefined, and any trigram that never occurs in training receives probability zero. The good news is that there are at least two methods with which we can overcome the sparse data problem.
**a) Linear Interpolation**
In a nutshell, linear interpolation is a trade-off between bias and variance. That is, it relies on the following three maximum likelihood estimators:
(image: ml_estimate.png)
Each of them has strengths and weaknesses: the trigram estimate takes the most context into account but needs a large amount of training data to converge; the unigram estimate converges quickly but fails to capture any context information; and the bigram estimate lies between the two, sharing some of the strengths and weaknesses of each. Through linear interpolation, we try to find an optimal trade-off between the strengths and weaknesses of the three estimators:
(image: lin.png)
In other words, the value of each trigram parameter is a weighted average of the three maximum likelihood estimates, with weights that are typically constrained to be non-negative and to sum to 1 (and are often tuned on held-out data).
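A minimal sketch of the interpolated estimate in Python (the weights below are illustrative, not tuned):

```python
def q_interpolated(w, u, v, q_trigram, q_bigram, q_unigram,
                   lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of the trigram, bigram, and unigram ML estimates.
    The three weights must be non-negative and sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * q_trigram.get((u, v, w), 0.0)
            + l2 * q_bigram.get((v, w), 0.0)
            + l3 * q_unigram.get(w, 0.0))
```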
**b) Discounting Method**
This method is based on 'discounted' counts. That is, it frees up left-over probability mass by subtracting a small fixed amount from the count of every observed (higher-frequency) term and then distributes that mass among the unlucky (zero-count) terms.
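A minimal sketch of discounted counts in Python, shown for bigrams to keep it short (the same idea applies to trigram counts; the 0.5 discount is illustrative, and the full back-off machinery for redistributing the missing mass is omitted):

```python
from collections import Counter

def discounted_probs(bigram_counts, unigram_counts, discount=0.5):
    """Discounted estimate q(w | v) = (count(v, w) - discount) / count(v).
    The mass subtracted in each context v is the 'left-over' probability
    that a back-off scheme would redistribute over unseen words."""
    probs = {}
    missing_mass = Counter()
    for (v, w), c in bigram_counts.items():
        probs[(v, w)] = (c - discount) / unigram_counts[v]
        missing_mass[v] += discount / unigram_counts[v]
    return probs, missing_mass
```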