How to Calculate N-Gram Probabilities in Python

N-grams are contiguous sequences of n items from a sample of text. They are fundamental in natural language processing (NLP) for understanding word patterns and predicting text. This guide explains how to calculate n-gram probabilities in Python, including implementation details and practical examples.

What Are N-Grams?

An n-gram is a contiguous sequence of n items from a sample of text. Common types include:

Unigrams (n=1): Single words (e.g., "the")
Bigrams (n=2): Pairs of words (e.g., "the cat")
Trigrams (n=3): Triplets of words (e.g., "the cat sat")
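As a minimal sketch of these three types, the snippet below extracts word n-grams with a sliding window. The helper name extract_ngrams is made up for this illustration, and splitting on whitespace is an assumed, deliberately simple tokenizer.

```python
def extract_ngrams(text, n):
    """Return the word n-grams of text as a list of tuples, in order of appearance."""
    words = text.split()  # assumption: naive whitespace tokenization
    # Slide a window of width n across the word list
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the cat sat"
print(extract_ngrams(sentence, 1))  # [('the',), ('cat',), ('sat',)]
print(extract_ngrams(sentence, 2))  # [('the', 'cat'), ('cat', 'sat')]
print(extract_ngrams(sentence, 3))  # [('the', 'cat', 'sat')]
```

A text shorter than n words simply yields an empty list.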

N-grams help in:

Text prediction and autocomplete
Language modeling
Machine translation
Spelling correction

N-gram counts are case-sensitive unless the text is normalized: "The" and "the" are counted as different words. For case-insensitive analysis, convert the text to lowercase first.
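The effect of case on counts can be seen in a short sketch. The sample sentence is invented for illustration, and collections.Counter is used here only as a convenient way to count words.

```python
from collections import Counter

text = "The cat saw the dog"

# Counting the raw tokens keeps "The" and "the" as separate keys
case_sensitive = Counter(text.split())

# Lowercasing first merges them into a single key
case_insensitive = Counter(text.lower().split())

print(case_sensitive["the"])    # 1
print(case_insensitive["the"])  # 2
```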

Calculating N-Gram Probabilities

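In a language model, the probability attached to an n-gram is the conditional probability of its last word given the preceding n-1 words (the context). It is estimated as the count of the n-gram divided by the count of its context, the preceding (n-1)-gram. For example, for the bigram "the cat":

P("cat" | "the") = Count("the cat") / Count("the")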

For unigrams, the probability is simply the count of the word divided by the total number of words in the corpus.
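A minimal sketch of both estimates, assuming whitespace tokenization and a one-sentence corpus chosen for illustration. The function names unigram_prob and bigram_prob are hypothetical.

```python
from collections import Counter

words = "the cat sat on the mat".split()

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))  # adjacent word pairs


def unigram_prob(word):
    # Count(word) / total number of words in the corpus
    return unigram_counts[word] / len(words)


def bigram_prob(prev, word):
    # Count(prev word) / Count(prev); assumes prev occurs in the corpus
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(unigram_prob("the"))        # 2 of the 6 words are "the"
print(bigram_prob("the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"
```

Note that Count(prev) here counts every occurrence of the word, including one at the very end of the corpus where no bigram starts. Some implementations count only occurrences that begin a bigram, which gives a slightly different denominator for the final word.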

Smoothing Techniques

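To handle unseen n-grams, smoothing techniques such as Laplace (add-one) smoothing are used. For unigrams:

P_smoothed(word) = (Count(word) + 1) / (Total words + Vocabulary size)

This prevents zero probabilities for unseen words. For higher-order n-grams the same idea applies to the conditional estimate: add 1 to the n-gram count and add the vocabulary size to the context count.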
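The unigram formula can be sketched as follows. The corpus and the name laplace_unigram_prob are illustrative assumptions, and the vocabulary size is taken from the corpus itself, as in the formula.

```python
from collections import Counter

words = "the cat sat on the mat".split()
counts = Counter(words)

total_words = len(words)  # 6 tokens
vocab_size = len(counts)  # 5 distinct words


def laplace_unigram_prob(word):
    # (Count(word) + 1) / (Total words + Vocabulary size)
    # Counter returns 0 for a word it has never seen
    return (counts[word] + 1) / (total_words + vocab_size)


print(laplace_unigram_prob("the"))  # (2 + 1) / (6 + 5)
print(laplace_unigram_prob("dog"))  # (0 + 1) / (6 + 5), non-zero although "dog" is unseen
```

Because the vocabulary here contains only observed words, the smoothed probabilities of the known words sum to exactly 1, leaving nothing for unseen ones. A stricter setup would reserve a vocabulary slot (for example an unknown-word token) for them.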

Python Implementation

Here's a Python implementation to calculate n-gram probabilities:

from collections import defaultdict

def calculate_ngram_probabilities(text, n=2, smoothing=False):
    """Return {n-gram tuple: P(last word | preceding n-1 words)} for the text."""
    words = text.split()  # simple whitespace tokenization
    ngrams = defaultdict(int)
    context_counts = defaultdict(int)

    # Count n-grams and their contexts (the first n-1 words of each n-gram)
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i+n])
        context = tuple(words[i:i+n-1])
        ngrams[ngram] += 1
        context_counts[context] += 1

    # Calculate probabilities
    probabilities = {}
    vocab_size = len(set(words))
    for ngram, count in ngrams.items():
        context = ngram[:-1]
        if smoothing:
            # Laplace (add-one) smoothing
            prob = (count + 1) / (context_counts[context] + vocab_size)
        else:
            prob = count / context_counts[context]
        probabilities[ngram] = prob

    # Only n-grams observed in the text are included in the result
    return probabilities

Key Steps

Tokenize the text into words
Count occurrences of each n-gram and its context
Calculate probabilities with optional smoothing
Return a dictionary of n-gram probabilities

Example Calculation

Let's calculate bigram probabilities for the sentence "the cat sat on the mat":

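Bigram	Count	Context Count	Probability
"the cat"	1	2	0.5
"cat sat"	1	1	1.0
"sat on"	1	1	1.0
"on the"	1	1	1.0
"the mat"	1	2	0.5

The context "the" occurs twice, followed once by "cat" and once by "mat", so "the cat" and "the mat" each have a probability of 0.5. Every other context occurs once and is always followed by the same word, so those bigrams have a probability of 1.0.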
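Figures like these are easy to check by recounting. The sketch below rebuilds the rows for the same sentence, counting each context by how often it starts a bigram. The variable names are illustrative.

```python
from collections import Counter

words = "the cat sat on the mat".split()
bigrams = list(zip(words, words[1:]))

bigram_counts = Counter(bigrams)
# A context is the first word of a bigram
context_counts = Counter(first for first, _ in bigrams)

# One row per distinct bigram: (bigram, count, context count, probability)
rows = [
    (" ".join(bigram), count, context_counts[bigram[0]], count / context_counts[bigram[0]])
    for bigram, count in bigram_counts.items()
]

for row in rows:
    print(*row, sep="\t")
```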

FAQ

What is the difference between n-grams and skip-grams?

N-grams are contiguous sequences of words, while skip-grams allow gaps between words. For example, a bigram is "the cat" while a skip-bigram might be "the sat".
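A small sketch of the difference, assuming a skip-bigram is any ordered pair of words with at most max_skip words between them. The function name and the max_skip parameter are chosen for this illustration.

```python
from itertools import combinations


def skip_bigrams(words, max_skip):
    """Return pairs (words[i], words[j]), i < j, with at most max_skip words in between."""
    return [
        (words[i], words[j])
        for i, j in combinations(range(len(words)), 2)
        if j - i - 1 <= max_skip
    ]


words = "the cat sat".split()
print(skip_bigrams(words, 0))  # [('the', 'cat'), ('cat', 'sat')]
print(skip_bigrams(words, 1))  # [('the', 'cat'), ('the', 'sat'), ('cat', 'sat')]
```

With max_skip set to 0 the result is the ordinary list of bigrams. Allowing one skipped word adds the pair ("the", "sat").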

How do I choose the right n-gram size?

Start with bigrams (n=2) for most tasks. For more context, try trigrams (n=3). Unigrams (n=1) are useful for simple word frequency analysis.
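One rough way to compare sizes on your own data is to check how many distinct n-grams occur only once, since larger n gives more context but sparser counts. The sketch below does this for a tiny made-up corpus. The helper name ngram_stats is hypothetical.

```python
from collections import Counter


def ngram_stats(words, n):
    """Return (number of distinct n-grams, number that occur exactly once)."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons


words = "the cat sat on the mat the cat ran".split()
for n in (1, 2, 3):
    distinct, singletons = ngram_stats(words, n)
    print(f"n={n}: {distinct} distinct, {singletons} seen only once")
```

On this corpus every trigram is seen only once, while several unigrams repeat. The same pattern tends to appear at larger scale.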

What are some common applications of n-grams?

N-grams are used in spell checkers, autocomplete systems, machine translation, sentiment analysis, and topic modeling.