Context-Free Grammars (CFG)

CS305 -- Formal Language Theory

Use ← → arrows to navigate

The Big Picture: Chomsky Hierarchy

Where do context-free grammars fit? Hover over each level to explore.

Key Idea

Each level is strictly more powerful than the one inside it. Today we jump from Type 3 (regular) to Type 2 (context-free). The new superpower: recursive production rules.

Analogy

Think of nested boxes like Russian dolls -- each outer doll contains everything the inner ones can do, plus more.

Motivation: Why DFAs Can't Count

What DFAs/NFAs CAN'T do

The Pumping Lemma showed us these are NOT regular:

{ aⁿbⁿ | n ≥ 0 } -- equal a's then b's
Balanced parentheses -- (()(())), not (()(
Palindromes -- reads same forwards and backwards

Why DFAs fail

Finite automata have finite memory (just the current state). They can't "count" unboundedly or "match" things seen earlier.

Try it -- enter a string of a's and b's:

Analogy

A DFA is like counting on your fingers -- you run out. A CFG is like having a notepad: write down what to remember and check it later.

What is a Grammar?

Intuition: Grammars as Recipes

A grammar is a set of rewriting rules that tell you how to build strings step by step.

Recipe for a SENTENCE:
  SENTENCE  →  SUBJECT VERB OBJECT
  SUBJECT   →  "the dog" | "a cat"
  VERB      →  "chased"  | "ate"
  OBJECT    →  "the ball" | "a fish"

Analogy

Think of variables as categories in a recipe book. "DESSERT" isn't food -- it expands into "chocolate cake" or "apple pie." Terminals are the actual food!

Key Terminology

Variables (non-terminals): Placeholders that get replaced. UPPERCASE: S, A, B
Terminals: Actual characters in the final string. lowercase: a, b, 0, 1
Productions: Rewriting rules. "A → w" means "replace A with w"
Start symbol: Where every derivation begins (usually S)

Context-Free means:

Every production has a single variable on the left: A → w. Variable A can be replaced regardless of what's around it. The context doesn't matter!

Formal Definition of a CFG

Definition

A context-free grammar is a 4-tuple G = (V, T, P, S)

Component	Description
V	Finite set of variables
T	Finite set of terminals
P	Productions: A → w, where A ∈ V
S	Start symbol, S ∈ V

Why "Context-Free"?

Left side is always a single variable. In context-sensitive grammars, surrounding symbols also matter: αAβ → αγβ.

Interactive Explorer: { aⁿbⁿ | n ≥ 0 }

G = (V, T, P, S)
V = { S }
T = { a, b }
P = { S → aSb, S → ε }
Start = S

Enter n to generate aⁿbⁿ:

Derivations: Leftmost vs. Rightmost

A derivation transforms S into a terminal string. Pick a mode and step through "aabb".

Grammar: S → AB, A → aA | a, B → bB | b

|

Current: S   (select a mode)

Key Idea

For an unambiguous grammar, different derivation orders produce the same parse tree. The order is just how you traverse the tree -- the tree itself is what matters.

Classic: Arithmetic Grammar

Grammar for Arithmetic:
  E → E + T  |  T
  T → T * F  |  F
  F → ( E )  |  id

E = Expression, T = Term, F = Factor

This grammar encodes precedence:

* binds tighter than +
Both are left-associative
Parentheses override everything

Analogy

E, T, F are layers of binding strength. F is tightest (atoms and parens). T groups multiplications. E groups additions. Like layers of an onion -- inner layers bind first.

Derive: id + id * id

E
⇒ E + T          (E → E + T)
⇒ T + T          (E → T)
⇒ F + T          (T → F)
⇒ id + T         (F → id)
⇒ id + T * F    (T → T * F)
⇒ id + F * F    (T → F)
⇒ id + id * F   (F → id)
⇒ id + id * id (F → id)

Notice!

The grammar forces "id * id" under T, while "+" connects at the E level. This gives * higher precedence than +.

Parse Trees

A parse tree visualizes derivation structure. Watch it grow for aabb using S → aSb | ε.

Step 0 / 3

Rules for Parse Trees

Root = start symbol S
Internal nodes = variables
Leaves = terminals (left-to-right = the string)
Each node + children = one production

Family Tree Analogy

S is the ancestor. Each production is "parent has these children." Terminals at the bottom are the youngest generation -- the actual string!

Key Idea

The tree shows nesting. Outer S wraps a...b around inner S. This recursive structure is what DFAs cannot capture.

Parse Tree Builder: id + id * id

Grammar: E → E+T | T, T → T*F | F, F → (E) | id. Watch the tree grow step by step.

Step 0 / 8

Why This Tree is Correct

* sits deeper (under T) so evaluates first. + is at the top (under E), evaluates last: id + (id * id).

Ambiguity: Two Parse Trees for id + id * id

Ambiguous grammar E → E+E | E*E | (E) | id produces two different trees!

Tree 1: (id+id)*id = 20

Tree 2: id+(id*id) = 14

Ambiguity is dangerous!

Different trees = different meanings. A compiler that picks Tree 1 computes (2+3)*4=20 instead of 2+(3*4)=14. Ambiguity is a property of the grammar, not the language. Fix: rewrite the grammar.

Eliminating Ambiguity

Pick an expression to compare the ambiguous vs unambiguous grammar (id=2, id=3, id=4):

Ambiguous: E → E+E | E*E | id

Click an expression above

Unambiguous: E→E+T|T, T→T*F|F, F→id

Click an expression above

Recipe for Eliminating Ambiguity

1. One variable per precedence level. 2. Lowest precedence at top. 3. Each level references next tighter. 4. Left-recursive = left-associative.

Challenge: Predict the Parse Tree

Given the unambiguous grammar:

E → E + T | T T → T * F | F F → (E) | id

Which operation is at the root of the parse tree for id * id + id * id?

Chomsky Normal Form (CNF)

CNF Rules

Every production must be one of exactly two forms:

A → BC (two variables, no terminals)
A → a (exactly one terminal)

Plus optionally: S → ε (start symbol only, if ε ∈ L)

CNF:             NOT CNF:
S → AB          S → AaB
A → BC          A → BCD
A → a           A → B
B → b           A → aB

Why CNF Matters

CYK Algorithm: O(n³) parsing requires CNF
Proofs: Restricted form makes proofs easier
Binary trees: Every parse tree is binary

Analogy

CNF is like putting equations in "standard form." It doesn't change what the grammar describes -- just reorganizes it into a form that's easier to work with systematically.

4-Step Conversion Recipe

1. Remove ε-productions. 2. Remove unit productions (A→B). 3. Break long productions. 4. Fix terminal mixing. Order matters!

CNF Conversion: Step-Through

Convert: S → ASB | ε, A → aAS | a, B → SbS | A | bb

Step 0 / 9

Press "Next Step" to begin.

⇒

Result

Every production becomes A → BC or A → a. The grammar is in CNF, ready for CYK!

Challenge: Fix the Grammar Bug

A student wrote this grammar for balanced parentheses:

S → (S) | SS | ε

They tried to convert to CNF but got stuck at Step 4. Which production below is the correct CNF form of S → (S)?

Properties of Context-Free Languages

Closure Properties

Operation	Closed?	Proof idea
Union	YES	S → S₁ \| S₂
Concatenation	YES	S → S₁ S₂
Kleene Star	YES	S → S₁ S \| ε
Intersection	NO	Counterexample below
Complement	NO	Would imply ∩ closed
∩ with Regular	YES	Product construction

NOT Closed Under Intersection

L1 = { an bn cm } -- CFL!
L2 = { am bn cn } -- CFL!
L1 ∩ L2 = { an bn cn }
This is NOT context-free!

Consequence

Since CFLs are closed under union but NOT complement: closure under complement would imply closure under intersection (De Morgan). So both must fail!

Inherently Ambiguous Languages

Some CFLs are so "tangled" that every possible grammar for them is ambiguous.

Definition

A CFL L is inherently ambiguous if every CFG generating L is ambiguous. The ambiguity is built into the language itself.

Classic Example

L = { ai bj ck |
       i=j  OR  j=k }
Either a's match b's,
OR b's match c's (or both).

Why it's inherently ambiguous

Consider: an bn cn
This string is in L because:
  i=j=n (a's match b's) YES
  j=k=n (b's match c's) YES
Any grammar must handle BOTH
reasons separately, giving
two parse trees for anbncn.

Analogy

Like a Venn diagram overlap: two overlapping "reasons" a string belongs. When both apply, any grammar must pick one path or the other -- giving two trees.

Important Distinction

An ambiguous grammar might be fixable. An inherently ambiguous language cannot.

CFG vs. Regular: Head-to-Head

Feature	Regular Languages	Context-Free Languages
Machine	DFA / NFA	Pushdown Automaton (PDA)
Memory	Finite (states only)	Infinite stack
Described by	Regular expressions	Context-free grammars
Closure	∪, ∩, *, complement	∪, *, concat (NOT ∩, compl.)
Parsing	O(n) linear	O(n³) CYK
Pumping	xy^kz	uv^kxy^kz (two pumps)
Can do	ab, (ab)*	aⁿbⁿ, parens, palindromes
Can't do	aⁿbⁿ, counting	aⁿbⁿcⁿ, cross-serial
Every regular language is context-free, but NOT vice versa

Key Idea

Regular languages are a proper subset of CFLs. A DFA is a PDA that never uses its stack.

Real-World Applications of CFGs

Compilers & Programming Languages

Source: if (x > 0) { y = x + 1; }
      STATEMENT
      /   |   \
    IF   COND  BLOCK
     |   / \     |
    if  x>0   ASSIGN
             / | \
            y  =  EXPR

Every programming language has a CFG defining valid syntax. Compilers use parsers (LL, LR, LALR) to build parse trees.

XML / HTML

Nested tags are inherently context-free:

<div><p>Hello <b>world</b></p></div>

Natural Language Processing

       SENTENCE
        / \
      NP    VP
      |    / \
     Det  V   NP
      |   |   / \
    "the" | Det  N
        "ate" |    |
            "a" "fish"

Other Applications

JSON / YAML -- nested data formats
Math expressions -- calculators, CAS
DNA/RNA -- folding via stochastic CFGs
Protocols -- BNF grammars in RFCs

Analogy

CFGs are the blueprints of structured languages. Wherever you see nesting, hierarchy, or recursion, there's likely a CFG underneath.

Challenge: Match Language to Grammar

Match each language to its grammar:

Languages:

{ aⁿbⁿ | n ≥ 0 }
{ ww^R | w ∈ {a,b}* }
Balanced parentheses

Grammars:

A: S → SS | (S) | ε

B: S → aSa | bSb | ε

C: S → aSb | ε

Correct matching is:

Summary & Cheat Sheet

The Big Ideas

CFG = (V, T, P, S)
V = variables (non-terminals)
T = terminals (alphabet)
P = productions (rewrite rules)
S = start symbol
"Context-free" means:
 Left side = SINGLE variable.
Power: Regular ⊂ CFL
Machine: DFA ⊂ PDA (+ stack)

What to Remember

CFGs express nesting and matching
Parse trees show derivation structure
Ambiguity = multiple parse trees
CNF: A → BC | a (for CYK)
Closed under ∪, concat, * but NOT ∩ or complement

Common Grammar Patterns

anbn:       S → aSb | ε
palindromes: S → aSa | bSb | a | b | ε
bal. parens: S → SS | (S) | ε
arithmetic:  E → E+T | T
             T → T*F | F
             F → (E) | id

CNF Conversion Steps

Remove ε-productions
Remove unit productions
Break long productions
Fix terminal mixing

Coming Next

Pushdown Automata (PDA) -- the machine model for CFGs. An NFA with a stack. PDAs and CFGs recognize exactly the same class of languages.

Quiz: Multiple Choice (3 Questions)

Quiz: Trace the Derivation

Given: S → aSb | ε

What is the leftmost derivation that produces aaabbb?

How many times is S → aSb applied before S → ε?

Derivation Pattern

For aⁿbⁿ, apply S → aSb exactly n times, then S → ε once. Each application adds one a on the left and one b on the right.

Quiz: Build a Grammar

Build a CFG for: { aⁿb^m | n ≥ m ≥ 0 }

Strings where the number of a's is at least the number of b's. Examples: aaa, aab, aabb, a, ε

Which grammar generates exactly this language?

Grammar Design Strategy

Think about what "extra" the language needs beyond a simpler base case. aⁿbⁿ uses S → aSb. For n ≥ m, we also need to generate extra a's that aren't matched to b's.

Context-Free Grammars (CFG)

The Big Picture: Chomsky Hierarchy

Key Idea

Analogy

Motivation: Why DFAs Can't Count

What DFAs/NFAs CAN'T do

Why DFAs fail

Analogy

What is a Grammar?

Intuition: Grammars as Recipes

Analogy

Key Terminology

Context-Free means:

Formal Definition of a CFG

Definition

Why "Context-Free"?

Interactive Explorer: { anbn | n ≥ 0 }

Derivations: Leftmost vs. Rightmost

Key Idea

Classic: Arithmetic Grammar

Analogy

Derive: id + id * id

Notice!

Parse Trees

Rules for Parse Trees

Family Tree Analogy

Key Idea

Parse Tree Builder: id + id * id

Why This Tree is Correct

Ambiguity: Two Parse Trees for id + id * id

Ambiguity is dangerous!

Eliminating Ambiguity

Ambiguous: E → E+E | E*E | id

Unambiguous: E→E+T|T, T→T*F|F, F→id

Recipe for Eliminating Ambiguity

Challenge: Predict the Parse Tree

Given the unambiguous grammar:

Chomsky Normal Form (CNF)

CNF Rules

Why CNF Matters

Analogy

4-Step Conversion Recipe

CNF Conversion: Step-Through

Result

Challenge: Fix the Grammar Bug

A student wrote this grammar for balanced parentheses:

Properties of Context-Free Languages

Closure Properties

NOT Closed Under Intersection

Consequence

Inherently Ambiguous Languages

Definition

Classic Example

Why it's inherently ambiguous

Analogy

Important Distinction

CFG vs. Regular: Head-to-Head

Key Idea

Real-World Applications of CFGs

Compilers & Programming Languages

XML / HTML

Natural Language Processing

Other Applications

Analogy

Challenge: Match Language to Grammar

Match each language to its grammar:

Summary & Cheat Sheet

The Big Ideas

What to Remember

Common Grammar Patterns

CNF Conversion Steps

Coming Next

Quiz: Multiple Choice (3 Questions)

Quiz: Trace the Derivation

Given: S → aSb | ε

Derivation Pattern

Quiz: Build a Grammar

Build a CFG for: { anbm | n ≥ m ≥ 0 }

Grammar Design Strategy

Interactive Explorer: { aⁿbⁿ | n ≥ 0 }

Build a CFG for: { aⁿb^m | n ≥ m ≥ 0 }