Context-Free Grammars (CFG)

CS305 -- Formal Language Theory

Use ← → arrows to navigate

The Big Picture: Chomsky Hierarchy

Where do context-free grammars fit? Hover over each level to explore.

Key Idea

Each level is strictly more powerful than the one inside it. Today we jump from Type 3 (regular) to Type 2 (context-free). The new superpower: recursive production rules.

Analogy

Think of nested boxes like Russian dolls -- each outer doll contains everything the inner ones can do, plus more.

Motivation: Why DFAs Can't Count

What DFAs/NFAs CAN'T do

The Pumping Lemma showed us these are NOT regular:

  • { anbn | n ≥ 0 } -- equal a's then b's
  • Balanced parentheses -- (()(())), not (()(
  • Palindromes -- reads same forwards and backwards

Why DFAs fail

Finite automata have finite memory (just the current state). They can't "count" unboundedly or "match" things seen earlier.

Try it -- enter a string of a's and b's:

Analogy

A DFA is like counting on your fingers -- you run out. A CFG is like having a notepad: write down what to remember and check it later.

What is a Grammar?

Intuition: Grammars as Recipes

A grammar is a set of rewriting rules that tell you how to build strings step by step.

Recipe for a SENTENCE:
SENTENCE → SUBJECT VERB OBJECT
SUBJECT → "the dog" | "a cat"
VERB → "chased" | "ate"
OBJECT → "the ball" | "a fish"

Analogy

Think of variables as categories in a recipe book. "DESSERT" isn't food -- it expands into "chocolate cake" or "apple pie." Terminals are the actual food!

Key Terminology

  • Variables (non-terminals): Placeholders that get replaced. UPPERCASE: S, A, B
  • Terminals: Actual characters in the final string. lowercase: a, b, 0, 1
  • Productions: Rewriting rules. "A → w" means "replace A with w"
  • Start symbol: Where every derivation begins (usually S)

Context-Free means:

Every production has a single variable on the left: A → w. Variable A can be replaced regardless of what's around it. The context doesn't matter!

Formal Definition of a CFG

Definition

A context-free grammar is a 4-tuple G = (V, T, P, S)

ComponentDescription
VFinite set of variables
TFinite set of terminals
PProductions: A → w, where A ∈ V
SStart symbol, S ∈ V

Why "Context-Free"?

Left side is always a single variable. In context-sensitive grammars, surrounding symbols also matter: αAβ → αγβ.

Interactive Explorer: { anbn | n ≥ 0 }

G = (V, T, P, S)
V = { S }
T = { a, b }
P = { S → aSb, S → ε }
Start = S

Enter n to generate anbn:

Derivations: Leftmost vs. Rightmost

A derivation transforms S into a terminal string. Pick a mode and step through "aabb".

Grammar: S → AB,   A → aA | a,   B → bB | b
|
Current: S   (select a mode)

Key Idea

For an unambiguous grammar, different derivation orders produce the same parse tree. The order is just how you traverse the tree -- the tree itself is what matters.

Classic: Arithmetic Grammar

Grammar for Arithmetic:
E → E + T | T
T → T * F | F
F → ( E ) | id

E = Expression, T = Term, F = Factor

This grammar encodes precedence:

  • * binds tighter than +
  • Both are left-associative
  • Parentheses override everything

Analogy

E, T, F are layers of binding strength. F is tightest (atoms and parens). T groups multiplications. E groups additions. Like layers of an onion -- inner layers bind first.

Derive: id + id * id

E
E + T          (E → E + T)
T + T          (E → T)
F + T          (T → F)
id + T         (F → id)
⇒ id + T * F    (T → T * F)
⇒ id + F * F    (T → F)
⇒ id + id * F   (F → id)
⇒ id + id * id (F → id)

Notice!

The grammar forces "id * id" under T, while "+" connects at the E level. This gives * higher precedence than +.

Parse Trees

A parse tree visualizes derivation structure. Watch it grow for aabb using S → aSb | ε.

Step 0 / 3

Rules for Parse Trees

  • Root = start symbol S
  • Internal nodes = variables
  • Leaves = terminals (left-to-right = the string)
  • Each node + children = one production

Family Tree Analogy

S is the ancestor. Each production is "parent has these children." Terminals at the bottom are the youngest generation -- the actual string!

Key Idea

The tree shows nesting. Outer S wraps a...b around inner S. This recursive structure is what DFAs cannot capture.

Parse Tree Builder: id + id * id

Grammar: E → E+T | T,   T → T*F | F,   F → (E) | id. Watch the tree grow step by step.

Step 0 / 8

Why This Tree is Correct

* sits deeper (under T) so evaluates first. + is at the top (under E), evaluates last: id + (id * id).

Ambiguity: Two Parse Trees for id + id * id

Ambiguous grammar E → E+E | E*E | (E) | id produces two different trees!

Tree 1: (id+id)*id = 20

Tree 2: id+(id*id) = 14

Ambiguity is dangerous!

Different trees = different meanings. A compiler that picks Tree 1 computes (2+3)*4=20 instead of 2+(3*4)=14. Ambiguity is a property of the grammar, not the language. Fix: rewrite the grammar.

Eliminating Ambiguity

Pick an expression to compare the ambiguous vs unambiguous grammar (id=2, id=3, id=4):

Ambiguous: E → E+E | E*E | id

Click an expression above

Unambiguous: E→E+T|T, T→T*F|F, F→id

Click an expression above

Recipe for Eliminating Ambiguity

1. One variable per precedence level. 2. Lowest precedence at top. 3. Each level references next tighter. 4. Left-recursive = left-associative.

Challenge: Predict the Parse Tree

Given the unambiguous grammar:

E → E + T | T    T → T * F | F    F → (E) | id

Which operation is at the root of the parse tree for id * id + id * id?

Chomsky Normal Form (CNF)

CNF Rules

Every production must be one of exactly two forms:

  • A → BC (two variables, no terminals)
  • A → a (exactly one terminal)

Plus optionally: S → ε (start symbol only, if ε ∈ L)

CNF: NOT CNF:
S → AB S → AaB
A → BC A → BCD
A → a A → B
B → b A → aB

Why CNF Matters

  • CYK Algorithm: O(n³) parsing requires CNF
  • Proofs: Restricted form makes proofs easier
  • Binary trees: Every parse tree is binary

Analogy

CNF is like putting equations in "standard form." It doesn't change what the grammar describes -- just reorganizes it into a form that's easier to work with systematically.

4-Step Conversion Recipe

1. Remove ε-productions. 2. Remove unit productions (A→B). 3. Break long productions. 4. Fix terminal mixing. Order matters!

CNF Conversion: Step-Through

Convert: S → ASB | ε,   A → aAS | a,   B → SbS | A | bb

Step 0 / 9
Press "Next Step" to begin.

Result

Every production becomes A → BC or A → a. The grammar is in CNF, ready for CYK!

Challenge: Fix the Grammar Bug

A student wrote this grammar for balanced parentheses:

S → (S) | SS | ε

They tried to convert to CNF but got stuck at Step 4. Which production below is the correct CNF form of S → (S)?

Properties of Context-Free Languages

Closure Properties

OperationClosed?Proof idea
UnionYESS → S1 | S2
ConcatenationYESS → S1 S2
Kleene StarYESS → S1 S | ε
IntersectionNOCounterexample below
ComplementNOWould imply ∩ closed
∩ with RegularYESProduct construction

NOT Closed Under Intersection

L1 = { an bn cm } -- CFL!
L2 = { am bn cn } -- CFL!
L1 ∩ L2 = { an bn cn }
This is NOT context-free!

Consequence

Since CFLs are closed under union but NOT complement: closure under complement would imply closure under intersection (De Morgan). So both must fail!

Inherently Ambiguous Languages

Some CFLs are so "tangled" that every possible grammar for them is ambiguous.

Definition

A CFL L is inherently ambiguous if every CFG generating L is ambiguous. The ambiguity is built into the language itself.

Classic Example

L = { ai bj ck |
i=j OR j=k }
Either a's match b's,
OR b's match c's (or both).

Why it's inherently ambiguous

Consider: an bn cn
This string is in L because:
i=j=n (a's match b's) YES
j=k=n (b's match c's) YES
Any grammar must handle BOTH
reasons separately, giving
two parse trees for anbncn.

Analogy

Like a Venn diagram overlap: two overlapping "reasons" a string belongs. When both apply, any grammar must pick one path or the other -- giving two trees.

Important Distinction

An ambiguous grammar might be fixable. An inherently ambiguous language cannot.

CFG vs. Regular: Head-to-Head

FeatureRegular LanguagesContext-Free Languages
MachineDFA / NFAPushdown Automaton (PDA)
MemoryFinite (states only)Infinite stack
Described byRegular expressionsContext-free grammars
Closure∪, ∩, *, complement∪, *, concat (NOT ∩, compl.)
ParsingO(n) linearO(n³) CYK
Pumpingxykzuvkxykz (two pumps)
Can doa*b*, (ab)*anbn, parens, palindromes
Can't doanbn, countinganbncn, cross-serial
Every regular language is context-free, but NOT vice versa

Key Idea

Regular languages are a proper subset of CFLs. A DFA is a PDA that never uses its stack.

Real-World Applications of CFGs

Compilers & Programming Languages

Source: if (x > 0) { y = x + 1; }
STATEMENT
/ | \
IF COND BLOCK
| / \ |
if x>0 ASSIGN
/ | \
y = EXPR

Every programming language has a CFG defining valid syntax. Compilers use parsers (LL, LR, LALR) to build parse trees.

XML / HTML

Nested tags are inherently context-free:

<div><p>Hello <b>world</b></p></div>

Natural Language Processing

SENTENCE
/ \
NP VP
| / \
Det V NP
| | / \
"the" | Det N
"ate" | |
"a" "fish"

Other Applications

  • JSON / YAML -- nested data formats
  • Math expressions -- calculators, CAS
  • DNA/RNA -- folding via stochastic CFGs
  • Protocols -- BNF grammars in RFCs

Analogy

CFGs are the blueprints of structured languages. Wherever you see nesting, hierarchy, or recursion, there's likely a CFG underneath.

Challenge: Match Language to Grammar

Match each language to its grammar:

Languages:

  1. { anbn | n ≥ 0 }
  2. { wwR | w ∈ {a,b}* }
  3. Balanced parentheses

Grammars:

A: S → SS | (S) | ε

B: S → aSa | bSb | ε

C: S → aSb | ε

Correct matching is:

Summary & Cheat Sheet

The Big Ideas

CFG = (V, T, P, S)
V = variables (non-terminals)
T = terminals (alphabet)
P = productions (rewrite rules)
S = start symbol
"Context-free" means:
Left side = SINGLE variable.
Power: Regular ⊂ CFL
Machine: DFA ⊂ PDA (+ stack)

What to Remember

  • CFGs express nesting and matching
  • Parse trees show derivation structure
  • Ambiguity = multiple parse trees
  • CNF: A → BC | a (for CYK)
  • Closed under ∪, concat, * but NOT ∩ or complement

Common Grammar Patterns

anbn: S → aSb | ε
palindromes: S → aSa | bSb | a | b | ε
bal. parens: S → SS | (S) | ε
arithmetic: E → E+T | T
T → T*F | F
F → (E) | id

CNF Conversion Steps

1. Remove ε-productions
2. Remove unit productions
3. Break long productions
4. Fix terminal mixing

Coming Next

Pushdown Automata (PDA) -- the machine model for CFGs. An NFA with a stack. PDAs and CFGs recognize exactly the same class of languages.

Quiz: Multiple Choice (3 Questions)

Quiz: Trace the Derivation

Given: S → aSb | ε

What is the leftmost derivation that produces aaabbb?

How many times is S → aSb applied before S → ε?

Derivation Pattern

For anbn, apply S → aSb exactly n times, then S → ε once. Each application adds one a on the left and one b on the right.

Quiz: Build a Grammar

Build a CFG for: { anbm | n ≥ m ≥ 0 }

Strings where the number of a's is at least the number of b's. Examples: aaa, aab, aabb, a, ε

Which grammar generates exactly this language?

Grammar Design Strategy

Think about what "extra" the language needs beyond a simpler base case. anbn uses S → aSb. For n ≥ m, we also need to generate extra a's that aren't matched to b's.