There are many forms of parsers. Both top-down (TD) and bottom-up (BU) parsers scan the input stream left to right.
A Context Free Grammar G = (Σ, N, P, S) consists of terminals Σ, nonterminals N, productions P, and a start symbol S ∈ N.
Productions P are of the form A → α, where A ∈ N and α ∈ (Σ ∪ N)*.
Notation
A derivation is a sequence of sentential forms starting from the start symbol.
Derivation trees:
Grammar: B → 0B | 1B | 0 | 1
Derivation: B ⇒ 0B ⇒ 01B⇒ 010
From a derivation we obtain a parse tree.
But derivations may not be unique
S → SS | (S) | ()
S ⇒ SS ⇒ (S)S⇒ (())S⇒ (())()
S ⇒ SS ⇒ S() ⇒ (S)() ⇒ (())()
Different derivations, but the same parse tree. The order in which nonterminals are replaced (left-to-right, right-to-left, etc.) is what distinguishes the derivations; each of these derivations builds the same tree.
Assume the expression grammar E → E + T | T, T → T * F | F, F → ( E ) | x | y | z, and the string "z*(x+y)".
Is it in the language? If it is, we should be able to start with the start symbol (E in this case), repeatedly replace a nonterminal by the right-hand side of a corresponding production, and eventually produce the desired string.
One nonterminal is replaced at each step:
E⇒T⇒T * F ⇒ F * F ⇒ z * F
⇒z * ( E ) ⇒ z * ( E + T )
⇒ z * ( T + T )
⇒ z * ( F + T )
⇒ z * ( x + T )
⇒ z * ( x + F )
⇒ z * ( x + y )
A sentential form is any string appearing in a derivation from the start symbol to the target string, including S itself and the target string.
E ⇒* z + ( x * y )
"E derives the string z + ( x * y )"
α γ β ⇒ α δ β iff γ → δ ∈ P
"The sentential form α γ β is rewritten to α δ β if and only if there is a production γ → δ, that is, δ replaces γ." (γ is a nonterminal, and δ can be a string of terminals and nonterminals.)
Consider this example defining a sequence of statements in a begin-end block,
with begin = { and end = }:
block → begin opt-stmts end
opt-stmts → stmts | ε
stmts → stmts ; stmt | stmt
stmt → . . .
It would be simple to express a grammar to represent just the pattern of an arithmetic expression, as in
E → E + E | E * E | (E) | x | y | z
But when we parse the string "x+y*z" we can get two different parse trees, ultimately representing different precedences of the operators:
In this grammar the precedence of multiplication over addition is lost. The expression grammar used earlier (with separate E, T, and F nonterminals) retains the precedence of multiplication over addition.
The same can be true for the associativity of operators. In the next example the left associativity of subtraction is lost:
E → E - E | x | y | z
Again two trees are possible:
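The two trees correspond to two distinct leftmost derivations of the input "x - y - z", one grouping the expression as (x - y) - z and the other as x - (y - z):

```
E ⇒ E - E ⇒ E - E - E ⇒ x - E - E ⇒ x - y - E ⇒ x - y - z    left operand expanded first: (x - y) - z
E ⇒ E - E ⇒ x - E ⇒ x - E - E ⇒ x - y - E ⇒ x - y - z        right operand expanded first: x - (y - z)
```

Since subtraction is not associative, the two trees give different values, so the ambiguity matters.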
A derivation step is formally defined as α A β ⇒ α γ β iff A → γ ∈ P.
We say that α_{1} derives α_{n} (written α_{1} ⇒* α_{n}) if α_{1} ⇒ α_{2} ⇒ α_{3} ⇒ .... ⇒ α_{n}
A left-most derivation defines a left sentential form
A right-most derivation defines a right sentential form
A parse tree gives a graphical derivation
Ambiguity is generally resolved by adding more productions that more clearly indicate the legal cases.
When a derivation A ⇒^{+} A α exists, the grammar is left recursive.
Immediate left recursion A → A α | β is rewritten as
A → β A'
A' → α A'
| ε
in general
A → A α_{1} | A α_{2} | .... | A α_{m} | β_{1} | β_{2} | ... | β_{n}
is converted to
A →β_{1} A' | β_{2} A' | .... | β_{n} A'
A' →α_{1} A' | α_{2} A' |... | α_{m} A' | ε
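Applying this rule to the left-recursive expression production, with A = E, α_{1} = + T, and β_{1} = T:

```
E → E + T | T        becomes        E  → T E'
                                    E' → + T E' | ε
```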
It may also be useful to left-factor a grammar for hand-coded parsers.
Rules of the form A →α β_{1} | α β_{2}
can be rewritten
A → αA'
A' → β_{1} | β_{2}
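For example, applying left factoring to the productions A → a b | a (here α = a, β_{1} = b, and β_{2} = ε):

```
A → a b | a        becomes        A  → a A'
                                  A' → b | ε
```

After this rewrite a predictive parser can choose between the alternatives without backtracking.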
There are features in programming languages that simply cannot be represented in CFGs. Such information is retained in the symbol table or in other compiler variables to track and enforce certain requirements.
This language abstracts the declaration and use of variables:
L = { w c w | w ∈ (a | b)*}
This language abstracts parameters and argument lists L = { a^{n} b^{m} c^{n} d^{m} | n>=1, m>=1}
The easiest implementation of LL(1) type grammars is Recursive Descent parsing.
No left recursion can exist in the grammar, or the recursive-descent parser loops infinitely.
No backtracking can exist. A predictive parser is best: for each state the move to the next state can be determined by looking at the next input token (the 1 in LL(1)).
Consider the following back tracking example where the grammar has an alternative:
S → c A d
A → a b | a
Backtracking can occur on the input "cad": after matching c, the parser may first try A → a b, consume a, fail to match b against d, and must back up to try A → a.
Transition Diagrams can be a plan for predictive parsing
For each nonterminal A, build a transition diagram; edges are labeled with terminals, nonterminals, or ε.
Parsing
On an edge labeled a ∈ Σ: if the next input is a, move from state s to state t and advance the input.
On an edge labeled A ∈ N: call procedure A to "match" one of the productions of A. The match is complete only if the final state of A's diagram is reached.
On an ε edge: move to state t without advancing the input.
This works if no nondeterminism exists in the grammar.
Example grammar in which left recursion has been removed.
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → id | ( E )
Remove tail recursion by transforming it into loops.
Combine "used once" routines: put them inline and refine.
Finally, create a procedure for each transition diagram. Use states or program flow to represent the finite state machine.
void E( ) {
    T( );
    while (lookahead == PLUS) {
        match(PLUS);
        T( );    /* semantic actions go in too */
    }
}
The general idea is to "reduce" the input string back to the start symbol. At each reduction step, the right-hand side (rhs) of a production is replaced by its left-hand-side (lhs) nonterminal.
For example the input a+b*c and the grammar
E →E + T
E →T
T →T * F
T →F
F →( E )
F →id
has the derivation E ⇒E+T ⇒E+T*F ⇒....
a + b * c
F + b * c
T + b * c
E + b * c
E + F * c
E + T * c
E + T * F
E + T
E
This is a "right-most derivation in reverse"
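Assuming a standard shift-reduce driver with a stack, the same reduction sequence appears as the following trace (the stack/input/action layout is illustrative):

```
Stack        Input      Action
$            a+b*c$     shift
$a           +b*c$      reduce F→id
$F           +b*c$      reduce T→F
$T           +b*c$      reduce E→T
$E           +b*c$      shift
$E+          b*c$       shift
$E+b         *c$        reduce F→id
$E+F         *c$        reduce T→F
$E+T         *c$        shift
$E+T*        c$         shift
$E+T*c       $          reduce F→id
$E+T*F       $          reduce T→T*F
$E+T         $          reduce E→E+T
$E           $          accept
```

Reading the reduce actions bottom-up gives the rightmost derivation E ⇒ E+T ⇒ E+T*F ⇒ ... from above.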
Handle: a substring of a sentential form matching the rhs of a production whose replacement will lead back to the start symbol.
If S ⇒_{rm}^{*} α A w ⇒_{rm} α β w, then A → β (in the position following α) is a handle of α β w.
If the grammar is unambiguous then there exists exactly one handle at each step.
Example of ambiguity for E → E + E | E * E | ( E ) | id and the input string a+b*c:
E⇒E+E⇒E+E*E⇒E+E*c⇒E+b*c⇒a+b*c
E⇒E*E⇒E+E*E⇒...
Viable prefixes: the set of prefixes of right sentential forms that can appear on the stack.
(A handle never appears below the top of the stack.)
Initial state
Final state
Actions
1. shift-reduce conflict
stmt→IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
| other
The handle is IF expr THEN stmt and the input symbol is ELSE. Do we shift? Do we reduce?
2. reduce-reduce conflict
A →α
B →α
If the top of the stack is α, which reduction should be performed?
LR parsing is the most general bottom-up parsing method.
SLR parsing (Simple LR) is the easiest parser to generate but works for fewer grammars.
LALR parsing (LookAhead LR) is a compromise between LR and SLR.
LR Parsing Algorithm
Action[s_{m}, a_{i}] →shift/state#, reduce/production#, accept, or error
Goto[s_{n}, A] →state (serves as a transition function for nonterminals)
Production[num] →production details
ALGORITHM:
{
    push(STATE_0);
    input = yylex();                      /* get next token */
    while (TRUE) {
        state = topOfStack();
        switch (Action[state, input].type) {
        case SHIFT:
            push(input);
            push(Action[state, input].nextstate);
            input = yylex();              /* advance token */
            break;
        case REDUCE:
            prod = Action[state, input].prodnum;
            for (i = 1; i <= Production[prod].length; i++) {
                state = pop();            /* SEMANTIC ACTIONS */
                xinput = pop();           /* USE INFO AT THIS POINT */
            }
            state = topOfStack();
            push(Production[prod].lhs);
            push(Goto[state, Production[prod].lhs]);
            break;
        case ACCEPT:
            return;
        default:
            /* ERROR */
        }
    }
}
PRODUCTION
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
Working through an example: