How to Calculate N-Gram Probability

N-gram probability is a fundamental concept in natural language processing (NLP) that estimates how likely a sequence of words (an n-gram) is to appear, based on counts from a text corpus. This guide explains how to calculate n-gram probabilities for unigrams, bigrams, and higher-order n-grams, and outlines their practical applications.

What is N-Gram Probability?

An n-gram is a contiguous sequence of n items from a sample of text. In NLP, these items are typically words, characters, or syllables. The probability of an n-gram is estimated from how often that sequence appears in a text corpus. In language modeling it is usually expressed as a conditional probability: how likely the last item is, given the n-1 items before it.

N-grams are classified by their length (n):

Unigrams (n=1): Single words (e.g., "the")
Bigrams (n=2): Pairs of words (e.g., "the cat")
Trigrams (n=3): Triplets of words (e.g., "the cat sat")
Higher-order n-grams (n≥4): Longer sequences of words (e.g., "the cat sat on")
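
The classification above can be illustrated with a small helper that slides a window of length n over a list of tokens. This is a minimal sketch: the function name ngrams is chosen here for illustration, and the input is assumed to be already split into words.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams in a token list, each as a tuple."""
    # A list of length L contains L - n + 1 n-grams (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat".split()
print(ngrams(tokens, 1))  # [('the',), ('cat',), ('sat',)]
print(ngrams(tokens, 2))  # [('the', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))  # [('the', 'cat', 'sat')]
```

Tuples are used so that each n-gram can later serve as a dictionary key when counting.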

N-gram probabilities are essential for tasks like text generation, machine translation, speech recognition, and spam detection.

How to Calculate N-Gram Probability

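Calculating n-gram probability involves counting occurrences of the n-gram in a corpus and normalizing that count. For unigrams, the count is divided by the total number of words in the corpus. For higher-order n-grams, it is usually divided by the count of the (n-1)-gram prefix, which gives the probability of the last word given the words before it. Here's the step-by-step process: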

Define the corpus: Choose a text corpus (a collection of documents) to analyze.
Preprocess the text: Tokenize the text into words, convert to lowercase, and remove punctuation.
Count n-gram occurrences: Count how many times each n-gram appears in the corpus.
Calculate probabilities: Divide the count of each n-gram by the total number of words (for unigrams) or by the count of its (n-1)-gram prefix (for n>1).
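
The four steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the helper names preprocess and count_ngrams and the two-sentence corpus are made up for this example, unigram counts are normalized by the total word count, and bigram counts are normalized by the count of the preceding word.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(tokens, n):
    """Count every contiguous n-gram (as a tuple) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


corpus = "The dog barked. The dog ran!"   # Step 1: define the corpus
tokens = preprocess(corpus)               # Step 2: preprocess the text
unigrams = count_ngrams(tokens, 1)        # Step 3: count n-gram occurrences
bigrams = count_ngrams(tokens, 2)

# Step 4: calculate probabilities
p_the = unigrams[("the",)] / len(tokens)                          # 2 / 6
p_dog_given_the = bigrams[("the", "dog")] / unigrams[("the",)]    # 2 / 2

print(tokens)            # ['the', 'dog', 'barked', 'the', 'dog', 'ran']
print(p_dog_given_the)   # 1.0
```

Note that this sketch treats the corpus as one continuous token stream, so it also counts the bigram "barked the" that spans the sentence boundary. Splitting into sentences first avoids that.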

Note: For higher-order n-grams (n>1), you may need to apply smoothing techniques to handle unseen n-grams.

N-Gram Probability Formula

An n-gram model estimates the probability of the last word wₙ given the n-1 words before it:

P(wₙ | w₁ w₂ ... wₙ₋₁) = Count(w₁ w₂ ... wₙ) / Count(w₁ w₂ ... wₙ₋₁)

Where:

Count(w₁ w₂ ... wₙ) is the number of times the n-gram appears in the corpus.
Count(w₁ w₂ ... wₙ₋₁) is the number of times the (n-1)-gram prefix (the first n-1 words of the n-gram) appears in the corpus.

For unigrams (n=1), the probability is simply the count of the word divided by the total number of words in the corpus.
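
As a rough sketch, the formula can be written as a single function that works for any n. The function name conditional_probability and the sample sentence are assumptions made for this example. Returning 0.0 when the prefix was never seen is also a choice made here, since the ratio is undefined in that case.

```python
from collections import Counter


def conditional_probability(tokens, ngram):
    """Estimate P(last word | preceding words) as Count(n-gram) / Count(prefix)."""
    ngram = tuple(ngram)
    n = len(ngram)
    ngram_counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    if n == 1:
        # Unigram case: count of the word divided by the total number of words.
        return ngram_counts[ngram] / len(tokens) if tokens else 0.0
    prefix = ngram[:-1]
    prefix_counts = Counter(
        tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2)
    )
    if prefix_counts[prefix] == 0:
        return 0.0  # prefix never seen, so the ratio is undefined
    return ngram_counts[ngram] / prefix_counts[prefix]


tokens = "i like tea and i like coffee".split()
print(conditional_probability(tokens, ("i", "like")))         # 1.0
print(conditional_probability(tokens, ("i", "like", "tea")))  # 0.5
```

Here "like" always follows "i", so the bigram estimate is 1.0, while "tea" follows only one of the two occurrences of "i like".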

Worked Example

Let's calculate the probability that "sat" follows the bigram "the cat" in a small corpus:

Corpus: "the cat sat on the mat. the cat slept on the mat."

Tokenize and count bigrams (counted within each sentence, so no bigram spans the period):
- "the cat" appears 2 times
- "cat sat" appears 1 time
- "sat on" appears 1 time
- "on the" appears 2 times
- "the mat" appears 2 times
- "cat slept" appears 1 time
- "slept on" appears 1 time
Count the trigram "the cat sat": 1 occurrence
Count the bigram "the cat": 2 occurrences
Calculate probability:
P("sat" | "the cat") = Count("the cat sat") / Count("the cat") = 1 / 2 = 0.5

The probability of "sat" given "the cat" is 0.5, meaning that half of the occurrences of "the cat" in this corpus are followed by "sat".
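
The counts in this example can be checked with a short script. This sketch assumes sentences are split on periods so that n-grams do not cross sentence boundaries. The helper name count_ngrams is illustrative, and the ratio it computes is the conditional probability of "sat" given "the cat".

```python
from collections import Counter

corpus = "the cat sat on the mat. the cat slept on the mat."

# Split on periods and drop the empty piece after the final period.
sentences = [s.split() for s in corpus.split(".") if s.strip()]


def count_ngrams(sentences, n):
    """Count n-grams within each sentence, never across sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


bigrams = count_ngrams(sentences, 2)
trigrams = count_ngrams(sentences, 3)

p_sat = trigrams[("the", "cat", "sat")] / bigrams[("the", "cat")]
print(bigrams[("the", "cat")])  # 2
print(p_sat)                    # 0.5
```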

Applications of N-Gram Probability

N-gram probabilities are used in various NLP applications:

Text generation: Predicting the next word in a sequence based on previous words.
Machine translation: Scoring candidate translations by how likely their word sequences are in the target language.
Speech recognition: Choosing between similar-sounding transcriptions by preferring the more probable word sequence.
Spam detection: Identifying spammy text patterns based on n-gram frequencies.
Autocomplete: Suggesting the next word in search queries or messaging apps.
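
To make the text generation and autocomplete cases concrete, here is a minimal sketch of next-word suggestion with a bigram model. The names build_bigram_model and suggest and the toy token stream are assumptions for illustration. Practical systems train on far larger corpora and apply smoothing.

```python
from collections import Counter, defaultdict


def build_bigram_model(tokens):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model


def suggest(model, word, k=3):
    """Return up to k (next_word, probability) pairs, most likely first."""
    followers = model.get(word)
    if not followers:
        return []  # word never seen as the start of a bigram
    total = sum(followers.values())
    return [(w, c / total) for w, c in followers.most_common(k)]


tokens = "the cat sat on the mat the cat slept on the rug".split()
model = build_bigram_model(tokens)
print(suggest(model, "the", k=1))  # [('cat', 0.5)]
print(suggest(model, "on"))        # [('the', 1.0)]
```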

FAQ

What is the difference between unigrams and bigrams?

Unigrams are single words, while bigrams are pairs of consecutive words. A unigram probability measures how likely an individual word is on its own, while a bigram probability measures how likely a word is given the word before it.

How do you handle unseen n-grams?

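Unseen n-grams can be handled using smoothing techniques like Laplace (add-one) smoothing, which adds one to every n-gram count and adds the vocabulary size to the denominator so that no n-gram gets zero probability. Add-k smoothing is a variant that adds a smaller constant k instead of one.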
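
A small sketch of add-k smoothing for bigrams follows, with k=1 giving Laplace smoothing. The function name smoothed_bigram_probability is made up for this example. It assumes a non-empty token list and takes the vocabulary to be the set of words seen in the corpus.

```python
from collections import Counter


def smoothed_bigram_probability(tokens, prev, word, k=1.0):
    """Add-k estimate: (Count(prev word) + k) / (Count(prev) + k * V)."""
    vocab_size = len(set(tokens))  # V: number of distinct words seen
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count prev only where a next word exists, so probabilities sum to 1.
    context_counts = Counter(tokens[:-1])
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)


tokens = "the cat sat on the mat".split()
seen = smoothed_bigram_probability(tokens, "the", "cat")    # (1 + 1) / (2 + 5)
unseen = smoothed_bigram_probability(tokens, "the", "sat")  # (0 + 1) / (2 + 5)
print(seen > unseen > 0)  # True
```

The unseen bigram "the sat" now receives a small non-zero probability, at the cost of slightly lowering the estimate for the observed bigram "the cat" (from 1/2 unsmoothed to 2/7).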

What is the difference between n-gram probability and n-gram frequency?

N-gram frequency is the raw count of how often an n-gram appears, while n-gram probability is that count normalized to a value between 0 and 1, for example by dividing by the count of the n-gram's (n-1)-word prefix.

How do you choose the value of n for n-grams?

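The value of n depends on the application and on how much training text is available. A larger n captures more context, but most long n-grams occur rarely or never in a finite corpus, so higher-order models need more data and rely more heavily on smoothing. For text generation, higher-order n-grams (n=3 or 4) often work well. For simpler tasks, unigrams or bigrams may suffice.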