CS305 -- Formal Language Theory
Use ← → arrows to navigate
Where do context-free grammars fit? Hover over each level to explore.
Each level is strictly more powerful than the one inside it. Today we jump from Type 3 (regular) to Type 2 (context-free). The new superpower: recursive production rules.
Think of nested boxes like Russian dolls -- each outer doll contains everything the inner ones can do, plus more.
The Pumping Lemma showed us these are NOT regular:
Finite automata have finite memory (just the current state). They can't "count" unboundedly or "match" things seen earlier.
Try it -- enter a string of a's and b's:
A DFA is like counting on your fingers -- you run out. A CFG is like having a notepad: write down what to remember and check it later.
A grammar is a set of rewriting rules that tell you how to build strings step by step.
Think of variables as categories in a recipe book. "DESSERT" isn't food -- it expands into "chocolate cake" or "apple pie." Terminals are the actual food!
Every production has a single variable on the left: A → w. Variable A can be replaced regardless of what's around it. The context doesn't matter!
A context-free grammar is a 4-tuple G = (V, T, P, S)
| Component | Description |
|---|---|
| V | Finite set of variables |
| T | Finite set of terminals |
| P | Productions: A → w, where A ∈ V |
| S | Start symbol, S ∈ V |
Left side is always a single variable. In context-sensitive grammars, surrounding symbols also matter: αAβ → αγβ.
Enter n to generate anbn:
A derivation transforms S into a terminal string. Pick a mode and step through "aabb".
For an unambiguous grammar, different derivation orders produce the same parse tree. The order is just how you traverse the tree -- the tree itself is what matters.
E = Expression, T = Term, F = Factor
This grammar encodes precedence:
E, T, F are layers of binding strength. F is tightest (atoms and parens). T groups multiplications. E groups additions. Like layers of an onion -- inner layers bind first.
The grammar forces "id * id" under T, while "+" connects at the E level. This gives * higher precedence than +.
A parse tree visualizes derivation structure. Watch it grow for aabb using S → aSb | ε.
S is the ancestor. Each production is "parent has these children." Terminals at the bottom are the youngest generation -- the actual string!
The tree shows nesting. Outer S wraps a...b around inner S. This recursive structure is what DFAs cannot capture.
Grammar: E → E+T | T, T → T*F | F, F → (E) | id. Watch the tree grow step by step.
* sits deeper (under T) so evaluates first. + is at the top (under E), evaluates last: id + (id * id).
Ambiguous grammar E → E+E | E*E | (E) | id produces two different trees!
Tree 1: (id+id)*id = 20
Tree 2: id+(id*id) = 14
Different trees = different meanings. A compiler that picks Tree 1 computes (2+3)*4=20 instead of 2+(3*4)=14. Ambiguity is a property of the grammar, not the language. Fix: rewrite the grammar.
Pick an expression to compare the ambiguous vs unambiguous grammar (id=2, id=3, id=4):
Click an expression above
Click an expression above
1. One variable per precedence level. 2. Lowest precedence at top. 3. Each level references next tighter. 4. Left-recursive = left-associative.
E → E + T | T T → T * F | F F → (E) | id
Which operation is at the root of the parse tree for id * id + id * id?
Every production must be one of exactly two forms:
Plus optionally: S → ε (start symbol only, if ε ∈ L)
CNF is like putting equations in "standard form." It doesn't change what the grammar describes -- just reorganizes it into a form that's easier to work with systematically.
1. Remove ε-productions. 2. Remove unit productions (A→B). 3. Break long productions. 4. Fix terminal mixing. Order matters!
Convert: S → ASB | ε, A → aAS | a, B → SbS | A | bb
Every production becomes A → BC or A → a. The grammar is in CNF, ready for CYK!
S → (S) | SS | ε
They tried to convert to CNF but got stuck at Step 4. Which production below is the correct CNF form of S → (S)?
| Operation | Closed? | Proof idea |
|---|---|---|
| Union | YES | S → S1 | S2 |
| Concatenation | YES | S → S1 S2 |
| Kleene Star | YES | S → S1 S | ε |
| Intersection | NO | Counterexample below |
| Complement | NO | Would imply ∩ closed |
| ∩ with Regular | YES | Product construction |
Since CFLs are closed under union but NOT complement: closure under complement would imply closure under intersection (De Morgan). So both must fail!
Some CFLs are so "tangled" that every possible grammar for them is ambiguous.
A CFL L is inherently ambiguous if every CFG generating L is ambiguous. The ambiguity is built into the language itself.
Like a Venn diagram overlap: two overlapping "reasons" a string belongs. When both apply, any grammar must pick one path or the other -- giving two trees.
An ambiguous grammar might be fixable. An inherently ambiguous language cannot.
| Feature | Regular Languages | Context-Free Languages |
|---|---|---|
| Machine | DFA / NFA | Pushdown Automaton (PDA) |
| Memory | Finite (states only) | Infinite stack |
| Described by | Regular expressions | Context-free grammars |
| Closure | ∪, ∩, *, complement | ∪, *, concat (NOT ∩, compl.) |
| Parsing | O(n) linear | O(n³) CYK |
| Pumping | xykz | uvkxykz (two pumps) |
| Can do | a*b*, (ab)* | anbn, parens, palindromes |
| Can't do | anbn, counting | anbncn, cross-serial |
| Every regular language is context-free, but NOT vice versa | ||
Regular languages are a proper subset of CFLs. A DFA is a PDA that never uses its stack.
Every programming language has a CFG defining valid syntax. Compilers use parsers (LL, LR, LALR) to build parse trees.
Nested tags are inherently context-free:
CFGs are the blueprints of structured languages. Wherever you see nesting, hierarchy, or recursion, there's likely a CFG underneath.
Languages:
Grammars:
A: S → SS | (S) | ε
B: S → aSa | bSb | ε
C: S → aSb | ε
Correct matching is:
Pushdown Automata (PDA) -- the machine model for CFGs. An NFA with a stack. PDAs and CFGs recognize exactly the same class of languages.
What is the leftmost derivation that produces aaabbb?
How many times is S → aSb applied before S → ε?
For anbn, apply S → aSb exactly n times, then S → ε once. Each application adds one a on the left and one b on the right.
Strings where the number of a's is at least the number of b's. Examples: aaa, aab, aabb, a, ε
Which grammar generates exactly this language?
Think about what "extra" the language needs beyond a simpler base case. anbn uses S → aSb. For n ≥ m, we also need to generate extra a's that aren't matched to b's.