C Program to Calculate First and Follow Sets

This guide explains how to write a C program to calculate FIRST and FOLLOW sets, which are fundamental concepts in compiler design. We'll cover the theory, provide a complete implementation, and walk through a practical example.

Introduction

In compiler design, FIRST and FOLLOW sets are used to determine the validity of productions in a context-free grammar. These sets help in parsing and syntax analysis.

Key Concepts:

FIRST set of a non-terminal is the set of terminals that begin the strings derived from that non-terminal.
FOLLOW set of a non-terminal is the set of terminals that can appear immediately to the right of that non-terminal in any sentential form.

Calculating these sets is crucial for constructing parsing tables in top-down parsers like LL(1) parsers. The algorithm involves iterative computation until no more elements can be added to the sets.

Algorithm Overview

The algorithm for calculating FIRST and FOLLOW sets consists of two main parts:

FIRST Set Calculation

Initialize FIRST sets for all non-terminals to empty sets.
For each production rule:
- If the right-hand side starts with a terminal, add it to the FIRST set of the left-hand side non-terminal.
- If the right-hand side starts with a non-terminal, add all elements of its FIRST set to the left-hand side's FIRST set, excluding ε (epsilon).
- If all symbols in the right-hand side can derive ε, add ε to the FIRST set.
Repeat until no more changes occur in any FIRST set.

FOLLOW Set Calculation

Initialize FOLLOW sets for all non-terminals to empty sets, except for the start symbol which gets $ (end marker).
For each production rule:
- For each non-terminal B in the right-hand side, look at the symbol A following B:
  - If A is a terminal, add it to FOLLOW(B).
  - If A is a non-terminal, add FIRST(A) to FOLLOW(B), excluding ε.
  - If A can derive ε, add FOLLOW of the left-hand side non-terminal to FOLLOW(B).
Repeat until no more changes occur in any FOLLOW set.

C Program Implementation

The following C program implements the algorithm to calculate FIRST and FOLLOW sets for a given context-free grammar.

#include <stdio.h> #include <stdlib.h> #include <string.h> #include <stdbool.h> #define MAX_PRODUCTIONS 20 #define MAX_SYMBOLS 20 #define MAX_RULES 20 typedef struct { char lhs; char rhs[MAX_SYMBOLS]; } Production; typedef struct { char symbol; char first[MAX_SYMBOLS]; char follow[MAX_SYMBOLS]; } SymbolInfo; Production productions[MAX_PRODUCTIONS]; SymbolInfo symbols[MAX_SYMBOLS]; int numProductions = 0; int numSymbols = 0; bool isNonTerminal(char c) { return c >= 'A' && c <= 'Z'; } bool isTerminal(char c) { return c >= 'a' && c <= 'z'; } bool contains(char *set, char c) { for (int i = 0; set[i] != '\0'; i++) { if (set[i] == c) return true; } return false; } void addToSet(char *set, char c) { if (!contains(set, c)) { int len = strlen(set); set[len] = c; set[len+1] = '\0'; } } void computeFirst() { bool changed; do { changed = false; for (int i = 0; i < numProductions; i++) { char lhs = productions[i].lhs; char *rhs = productions[i].rhs; for (int j = 0; j < numSymbols; j++) { if (symbols[j].symbol == lhs) { int oldLen = strlen(symbols[j].first); // Rule 1: If rhs starts with terminal, add to FIRST(lhs) if (isTerminal(rhs[0])) { addToSet(symbols[j].first, rhs[0]); } // Rule 2: If rhs starts with non-terminal, add FIRST of that non-terminal else if (isNonTerminal(rhs[0])) { for (int k = 0; k < numSymbols; k++) { if (symbols[k].symbol == rhs[0]) { for (int m = 0; symbols[k].first[m] != '\0'; m++) { if (symbols[k].first[m] != 'ε') { addToSet(symbols[j].first, symbols[k].first[m]); } } break; } } } if (strlen(symbols[j].first) > oldLen) { changed = true; } break; } } } } while (changed); } void computeFollow() { bool changed; do { changed = false; for (int i = 0; i < numProductions; i++) { char lhs = productions[i].lhs; char *rhs = productions[i].rhs; for (int j = 0; rhs[j] != '\0'; j++) { if (isNonTerminal(rhs[j])) { int symbolIndex = -1; for (int k = 0; k < numSymbols; k++) { if (symbols[k].symbol == rhs[j]) { symbolIndex = k; break; } } if (symbolIndex == -1) continue; int oldLen = strlen(symbols[symbolIndex].follow); // Rule 1: If there's a terminal after the non-terminal if (j+1 < strlen(rhs) && isTerminal(rhs[j+1])) { addToSet(symbols[symbolIndex].follow, rhs[j+1]); } // Rule 2: If there's a non-terminal after the non-terminal else if (j+1 < strlen(rhs) && isNonTerminal(rhs[j+1])) { for (int k = 0; k < numSymbols; k++) { if (symbols[k].symbol == rhs[j+1]) { for (int m = 0; symbols[k].first[m] != '\0'; m++) { if (symbols[k].first[m] != 'ε') { addToSet(symbols[symbolIndex].follow, symbols[k].first[m]); } } break; } } } // Rule 3: If we're at the end of the production else if (j+1 == strlen(rhs)) { for (int k = 0; k < numSymbols; k++) { if (symbols[k].symbol == lhs) { for (int m = 0; symbols[k].follow[m] != '\0'; m++) { addToSet(symbols[symbolIndex].follow, symbols[k].follow[m]); } break; } } } if (strlen(symbols[symbolIndex].follow) > oldLen) { changed = true; } } } } } while (changed); } void printResults() { printf("FIRST and FOLLOW Sets:\n"); for (int i = 0; i < numSymbols; i++) { printf("Symbol: %c\n", symbols[i].symbol); printf(" FIRST: { "); for (int j = 0; symbols[i].first[j] != '\0'; j++) { printf("%c ", symbols[i].first[j]); } printf("}\n"); printf(" FOLLOW: { "); for (int j = 0; symbols[i].follow[j] != '\0'; j++) { printf("%c ", symbols[i].follow[j]); } printf("}\n\n"); } } int main() { // Example grammar: S -> aB | bA; A -> a | ε; B -> b productions[0].lhs = 'S'; strcpy(productions[0].rhs, "aB"); productions[1].lhs = 'S'; strcpy(productions[1].rhs, "bA"); productions[2].lhs = 'A'; strcpy(productions[2].rhs, "a"); productions[3].lhs = 'A'; strcpy(productions[3].rhs, "ε"); productions[4].lhs = 'B'; strcpy(productions[4].rhs, "b"); numProductions = 5; // Initialize symbols symbols[0].symbol = 'S'; symbols[1].symbol = 'A'; symbols[2].symbol = 'B'; numSymbols = 3; // Add $ to FOLLOW of start symbol addToSet(symbols[0].follow, '$'); computeFirst(); computeFollow(); printResults(); return 0; }

The program includes functions to:

Check if a symbol is terminal or non-terminal
Add elements to sets while avoiding duplicates
Compute FIRST sets using the iterative algorithm
Compute FOLLOW sets using the iterative algorithm
Print the results in a readable format

Worked Example

Let's walk through an example grammar to see how the program works:

Grammar:

S → aB | bA
A → a | ε
B → b

FIRST Set Calculation

Initialize FIRST sets to empty sets.
Process each production:
- S → aB: FIRST(S) = {a}
- S → bA: FIRST(S) = {a, b}
- A → a: FIRST(A) = {a}
- A → ε: FIRST(A) = {a, ε}
- B → b: FIRST(B) = {b}

FOLLOW Set Calculation

Initialize FOLLOW(S) = {$}
Process each production:
- S → aB: FOLLOW(B) = {$, a}
- S → bA: FOLLOW(A) = {$, a}
- A → a: (no change)
- A → ε: (no change)
- B → b: (no change)

After running the program with this grammar, you should get the following results:

Symbol: S FIRST: { a b } FOLLOW: { $ } Symbol: A FIRST: { a ε } FOLLOW: { $ a } Symbol: B FIRST: { b } FOLLOW: { $ a }

FAQ

What is the difference between FIRST and FOLLOW sets?

The FIRST set of a non-terminal contains all terminals that can appear as the first symbol in any string derived from that non-terminal. The FOLLOW set contains all terminals that can appear immediately after the non-terminal in any sentential form.

When would I use FIRST and FOLLOW sets in compiler design?

These sets are primarily used in constructing parsing tables for top-down parsers like LL(1) parsers. They help determine which production to use during parsing based on the current input symbol.

What does the ε symbol represent in FIRST sets?

The ε symbol represents the empty string. It appears in a FIRST set when the non-terminal can derive an empty string, meaning it's optional in some productions.

How does the algorithm handle left recursion?

The algorithm can handle left recursion, but it may require more iterations to converge. In some cases, left-recursive grammars need to be transformed before FIRST/FOLLOW sets can be computed.