How to Calculate N Gram Probabilities in Python
N-grams are contiguous sequences of n items from a sample of text. They are fundamental in natural language processing (NLP) for understanding word patterns and predicting text. This guide explains how to calculate n-gram probabilities in Python, including implementation details and practical examples.
What Are N-Grams?
An n-gram is a contiguous sequence of n items from a sample of text. Common types include:
- Unigrams (n=1): Single words (e.g., "the")
- Bigrams (n=2): Pairs of words (e.g., "the cat")
- Trigrams (n=3): Triplets of words (e.g., "the cat sat")
N-grams help in:
- Text prediction and autocomplete
- Language modeling
- Machine translation
- Spelling correction
N-grams are case-sensitive by default. For case-insensitive analysis, convert text to lowercase first.
Calculating N-Gram Probabilities
The probability of an n-gram is calculated as the count of the n-gram divided by the count of its preceding (n-1)-gram. For example, the probability of the bigram "the cat" is:
For unigrams, the probability is simply the count of the word divided by the total number of words in the corpus.
Smoothing Techniques
To handle unseen n-grams, smoothing techniques like Laplace (add-one) smoothing are used:
This prevents zero probabilities for unseen words.
Python Implementation
Here's a Python implementation to calculate n-gram probabilities:
Key Steps
- Tokenize the text into words
- Count occurrences of each n-gram and its context
- Calculate probabilities with optional smoothing
- Return a dictionary of n-gram probabilities
Example Calculation
Let's calculate bigram probabilities for the sentence "the cat sat on the mat":
| Bigram | Count | Context Count | Probability |
|---|---|---|---|
| "the cat" | 1 | 1 | 1.0 |
| "cat sat" | 1 | 1 | 1.0 |
| "sat on" | 1 | 1 | 1.0 |
| "on the" | 1 | 1 | 1.0 |
| "the mat" | 1 | 1 | 1.0 |
All bigrams in this example have a probability of 1.0 because each word appears exactly once in its context.