How to Calculate N-Gram Probabilities in Python Without NLTK

N-grams are contiguous sequences of n items from a sample of text or speech. They are fundamental in natural language processing (NLP) for understanding word patterns and predicting text. While NLTK provides convenient tools for working with n-grams, you can implement n-gram probability calculations in Python using basic data structures and algorithms.

Introduction

N-grams are sequences of n words or characters that appear together in a text. For example, in the sentence "The quick brown fox," the 2-grams (bigrams) would be "The quick," "quick brown," and "brown fox." N-grams help in various NLP tasks such as language modeling, text generation, and machine translation.

Calculating n-gram probabilities involves estimating how likely a word is to follow a given sequence of words in a text. This is typically done by counting the occurrences of each n-gram and dividing by the count of its context, the n-1 words that come before its last word.

What Are N-Grams?

N-grams are contiguous sequences of n items from a sample of text or speech. They can be words, characters, or other linguistic units. The most common types of n-grams are:

Unigrams (1-grams): Single words or characters.
Bigrams (2-grams): Pairs of consecutive words or characters.
Trigrams (3-grams): Triplets of consecutive words or characters.
N-grams (n-grams): Sequences of n consecutive words or characters.
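
As a quick illustration of these types, the sketch below extracts unigrams, bigrams, and trigrams from a short word list and bigrams from a string of characters. The helper name ngrams is an assumption made for this example and is not part of any library.

```python
def ngrams(items, n):
    """Return all contiguous n-item tuples from a sequence (hypothetical helper)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the quick brown fox".split()

print(ngrams(words, 1))  # unigrams: one tuple per word
print(ngrams(words, 2))  # bigrams: pairs of consecutive words
print(ngrams(words, 3))  # trigrams: triplets of consecutive words

# A string is also a sequence, so the same helper yields character n-grams.
print(ngrams("fox", 2))
```

A sequence of length m contains m - n + 1 n-grams, so asking for n-grams longer than the sequence simply returns an empty list.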

N-grams are used in various NLP applications, including:

Language modeling
Text generation
Machine translation
Speech recognition
Spelling correction

Calculating N-Gram Probabilities

The probability of an n-gram here means the conditional probability of its last word given the n-1 words before it. It is calculated by dividing the count of the n-gram by the count of its (n-1)-word context. This is known as the maximum likelihood estimate (MLE) of the n-gram probability.

N-gram probability formula:

P(w_n | w_1, w_2, ..., w_{n-1}) = Count(w_1, w_2, ..., w_n) / Count(w_1, w_2, ..., w_{n-1})

For example, the probability of "brown" given the preceding word "quick" is calculated by dividing the count of the bigram "quick brown" by the count of the unigram "quick."
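
To make the arithmetic concrete, here is a minimal sketch of the bigram case on a one-sentence toy corpus. The corpus and the function name bigram_mle are assumptions chosen for illustration, as is the decision to return 0.0 for a context that never occurs.

```python
from collections import Counter

# Toy corpus (an assumption for this example), already lowercased and tokenized.
tokens = "the quick brown fox jumps over the lazy dog".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    """P(word | prev) = Count(prev, word) / Count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # assumption: an unseen context gets probability 0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("quick", "brown"))  # Count(quick brown) = 1, Count(quick) = 1
print(bigram_mle("the", "quick"))    # Count(the quick) = 1, Count(the) = 2
```

Here "quick" is always followed by "brown," while "the" is followed by "quick" only half of the time, so the second estimate is lower than the first.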

To avoid zero probabilities for unseen n-grams, smoothing techniques such as Laplace smoothing or Kneser-Ney smoothing can be applied.
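
As a rough sketch of the simpler of the two techniques, Laplace (add-one) smoothing adds a constant to every bigram count and adds that constant times the vocabulary size to the context count. The toy corpus, the function name laplace_bigram, and the parameter k are assumptions made for this example. Kneser-Ney smoothing is more involved and is not shown.

```python
from collections import Counter

# Toy corpus (an assumption for this example).
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab_size = len(set(tokens))  # V: number of distinct words in the corpus

context_counts = Counter(tokens[:-1])  # words that are followed by another word
bigram_counts = Counter(zip(tokens, tokens[1:]))

def laplace_bigram(prev, word, k=1.0):
    """Add-k estimate: (Count(prev, word) + k) / (Count(prev) + k * V)."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

print(laplace_bigram("the", "quick"))  # seen bigram: (1 + 1) / (2 + 8)
print(laplace_bigram("the", "dog"))    # unseen bigram: (0 + 1) / (2 + 8)
```

With k = 1, the unseen bigram "the dog" receives a small non-zero probability, and the estimates for a given context still sum to 1 over the vocabulary.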

Python Implementation

You can implement n-gram probability calculations in Python using the collections module to count n-grams and calculate their probabilities. Here's a step-by-step guide:

Tokenize the text into words or characters.
Generate n-grams from the tokenized text.
Count the occurrences of each n-gram and the preceding (n-1)-gram.
Calculate the probability of each n-gram using the MLE formula.

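from collections import Counter
import re

def generate_ngrams(text, n):
    """Return the word n-grams found in text as a list of tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Lowercase the text and treat each run of word characters as a token.
    words = re.findall(r'\w+', text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def calculate_ngram_probabilities(text, n):
    """Return a dict mapping each n-gram to its MLE conditional probability."""
    ngrams = generate_ngrams(text, n)
    ngram_counts = Counter(ngrams)

    # The context of an n-gram is its first n-1 words. Counting contexts
    # over the n-grams themselves makes the probabilities that share a
    # context sum to 1.
    context_counts = Counter(ngram[:-1] for ngram in ngrams)

    # P(last word | context) = Count(n-gram) / Count(context)
    return {
        ngram: count / context_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }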

This code defines two functions: generate_ngrams to generate n-grams from a text, and calculate_ngram_probabilities to calculate the probabilities of each n-gram.

Example

Let's calculate the bigram probabilities for the sentence "The quick brown fox jumps over the lazy dog."

text = "The quick brown fox jumps over the lazy dog."
n = 2
probabilities = calculate_ngram_probabilities(text, n)

for ngram, prob in probabilities.items():
    print(f"{ngram}: {prob:.4f}")

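The output lists the probability of each bigram in the sentence. Because the text is lowercased, "the" occurs twice as a context, followed once by "quick" and once by "lazy," so the bigrams "the quick" and "the lazy" each have a probability of 0.5. Every other bigram has a probability of 1.0 because its first word appears only once as a context.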
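
One way to sanity-check results like these is to group the bigram probabilities by their context word and confirm that each group sums to 1. The following is a self-contained sketch for the bigram case only. The variable name by_context is an assumption made for this example.

```python
from collections import Counter, defaultdict
import re

text = "The quick brown fox jumps over the lazy dog."
words = re.findall(r"\w+", text.lower())

bigram_counts = Counter(zip(words, words[1:]))
context_counts = Counter(words[:-1])  # every word that starts a bigram

# Group the MLE probabilities by their context word.
by_context = defaultdict(dict)
for (prev, word), count in bigram_counts.items():
    by_context[prev][word] = count / context_counts[prev]

for prev, dist in by_context.items():
    total = sum(dist.values())
    print(f"{prev!r}: {dist} (sum = {total:.1f})")
```

Each context word maps to a small distribution over the words that follow it. The final word of the text never starts a bigram, so it does not appear as a context.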

This example demonstrates how to calculate n-gram probabilities in Python without using NLTK, using basic data structures and algorithms.

FAQ

What are n-grams used for? N-grams are used in various NLP applications, including language modeling, text generation, machine translation, speech recognition, and spelling correction.
How do you calculate n-gram probabilities? Divide the count of the n-gram by the count of its (n-1)-word context, the words that come before its last word. This is known as the maximum likelihood estimate (MLE) of the n-gram probability.
What is smoothing in n-gram models? Smoothing is a technique used to avoid zero probabilities for unseen n-grams. Common smoothing techniques include Laplace smoothing and Kneser-Ney smoothing.
Can you calculate n-gram probabilities without NLTK? Yes. In Python, the Counter class from the collections module is enough to count n-grams and their contexts and to compute the probabilities from those counts.