Syntax Analysis of Context Free Grammars (CFG's)

 last updated 9/10/15

There are many forms of parsers

Both TD and BU scan the input stream left to right.

 

 


Error recovery

 

 

 

 

 

 


Context Free Grammar Definition

A Context Free Grammar G = (Σ,N,P,S)

productions P are of the form Aα where

the string αΣ ∪ N and A∈ N

Notation

 

 

 


Derivations

A derivation is a sequence of sentential forms starting from start symbol.

Derivation trees:

Grammar: B 0B | 1B | 0 | 1

Derivation: B 0B 01B 010

From derivation get parse tree  

But derivations may not be unique

S SS | (S) | ()

S SS (S)S (())S (())()

S SS S() (S)() (())()

Different derivations but get the same parse tree.  That is, how you build the tree (left-to-right, right-to-left, bottom up, depth-first) represents the derivations.  Do you get the same parse tree?


Parsing

Assume we have the string "z*(x+y)". 

Is it in the language? If it is we should be able to start with the start symbol (E in this case) and repeatedly replace nonterminals by the right-hand-side of a corresponding production and eventually  produce the desired string.

The underlined nonterminal is the next one replaced:

ETT * F  F * F   z * F 
z * ( E )  z * ( E + T )
 z * ( T + T ) 
 z * ( F + T ) 
 z * ( x + T ) 
 z * ( x + F ) 
 z * ( x + y )


Sentential form

A sentential form is any one of the representations of the derivation between the start symbol and the target string, including S and the string.

E * z + ( x * y )
"E derives the string z + ( x * y )'

α γ β α δ β   iff γδ ∈ P
"A sentential form of alpha gamma beta is rewritten to alpha delta beta if and only if there is a production gamma replaces delta."( Gamma is a nonterminal, and delta can be a string of terminals and nonterminals.
)

E * z + ( x * y )


Example

Consider this example to define  a sequence of statements in a begin-end block

begin={ and end=}

blockbegin opt-stmts end
opt-stmtsstmts | ε
stmtsstmts; stmt | stmt
stmt. . .

 

 

 

 


Ambiguity

It would be simple to express a grammar to represent just the pattern of an arithmetic expression, as in

EE + E | E * E | (E) | x | y | z

But when we parse the string "x+y*z" we can get two different parse trees, ultimately representing different precedence of the operators:

In this example the precedence of multiplication over addition is lost. The first example above of a context free grammar retains the precedence of multiplication over addition.

The same can be true for the associativity of operators. In the next example the left associativity of subtraction is lost:

E → E - E | x | y | z

Again two trees are possible:

 

Derivation

A derivation is formally defined as α A β α γ β   iff A → γ ∈ P.

It is said that α1 derives αn if α1α2α3 .... ⇒αn

A left-most derivation defines a left sentential form

A right-most derivation defines a right sentential form

A parse tree gives a graphical derivation

 

 

 


Eliminating Ambiguity

Generally done by adding more productions that more clearly indicated legal cases resolving ambiguity.

Eliminating Left Recursion

When a derivation exists such that A +A α

A A α | β is rewritten as

A β A'
A' α A' | ε

in general

A A α1 | A α2 | .... | A αm | β1 | β2 | ... | βn

is converted to

A β1 A' | β2 A' | .... | βn A'

A' α1 A' | α2 A' |... | αm A' | ε

 

 

 


Left factoring

It may be useful to simplify a grammar for hand coded parsers

Rules of the form A α β1 | α β2

can be rewritten

A αA'

A' β1 | β2


Non-context Free constructs

There are features in programming languages that simply cannot be represented in CFG's. Information is retained in the symbol table or in other compiler variables to track and enforce certain requirements.

This language abstracts the declaration and use of variables:
L = { w c w | w ∈ (a | b)*}

This language abstracts parameters and argument lists L = { an bm cn dm | n>=1, m>=1}

 

 

 


Top Down Parsing - LL(1)

The easiest implementation of LL(1) type grammars is Recursive Descent parsing.

No left recursion can exist in the grammar, else the recursive descent parsers loops infinitely.

No backtracking can exist. A predictive parser is best: for each state the move to the next state can be determined by looking at the next input token (the 1 in LL(1)).

Consider the following back tracking example where the grammar has an alternative:

S c A d
A a b | a

backtracking can occur on the input "cad"

Transition Diagrams can be a plan for predictive parsing

For each nonterminal A,

Parsing

If input = a ∈ Σ then move state from s to t

Call procedure A to "match" one of the productions of A∈N. The match is complete only if final state of A is reached.

Can go to state t without advancing input

This works if no nondeterminism exists in the grammar

Example grammar in which left recursion has been removed.

E T E'

E' + T E' | ε

T F T'

T' * F T' | ε

F id | ( E )

Remove tailing recursion by transforming to loops

Combine "used once" routines -- put in line and refine

Finally, for each transition diagram create a procedure. Use states or program flow to represent the finite state machine.

void E( )
{
    T( );
    while(lookahead == PLUS){
        match(PLUS);
        T( ); //semantic actions go in too
    }
}


Bottom Up Parsing

General idea is to "reduce" the input string back to the start symbol. At each reduction step, replace the right-hand-side (rhs) of production by left-hand-side (lhs) non-terminal.

Example

For example the input a+b*c and the grammar

E E + T
E T
T T * F
T F
F ( E )
F id

has the derivation E E+T E+T*F ....

a + b * c
F + b * c
T + b * c
E + b * c
E + f * c
E + T * c
E + T * F
E + T
E

This is a "right-most derivation in reverse" 


Handles

Handles - a substring of a sentential form matching rhs of a production whose replacement will lead to the start symbol

If S rm* αAw αβw then Aβ is a handle.

If the grammar is unambiguous then there exists exactly one handle at each step.

Example ambiguity for EE+E | E*E | (E) | id and the input string a+b*c

EE+EE+E*EE+E*cE+b*ca+b*c

EE*EE+E*E...

Viable prefixes- the set of prefixes of right sentential forms that can appear on the stack.
(A handle never appear below the top of the stack.)

 

 


Stack implementation of shift-reduce parsing

Initial state
Stack
Input
$
w$

Final state
Stack
Input
S$
$

Actions

  1. Shift - move next input symbol from input to stack
  2. Reduce - top most stack elements == handle, replace with lhs of production
  3. Accept - final state found
  4. Error - can't reduce and shift produces no handle

 

 


Parsing conflicts.

1. shift-reduce conflict

stmtIF expr THEN stmt
     | IF expr THEN stmt ELSE stmt
     | other

Handle is IF expr THEN stmt and input symbol is ELSE Do we Shift? Do we Reduce?

2. reduce-reduce conflict

A α

B α

If the top of stack is α which reduction should be performed?

 

 


Parsing Types

LR parsing is the most general bottom up parser

SLR parsing (simple LR) is the easiest parser to generate but works for fewer grammars.

LALR parsing (LookAhead LR) is a middle compromise between LR and SLR


LR Parsing Algorithm


Action[sm, ai] shift/state#, reduce/production#, accept, or error

Goto[sn, A] state (serves as a transition function for nonterminals)

Production[num] production details


ALGORITHM:
{
    push(STATE_0);
    input = yylex(); /*get next token*/ 

    while(TRUE){
        state = topOfStack();

        switch(Action[state,input].type){
        case SHIFT:
            push(input);
            push(Action[state,input].nextstate);
            input = yylex(); /* advance token*/
            break;

        case REDUCE:
            prod = Action[state,input].prodnum;
            for(i=1; i<=Production[prod].length; i++){
                state = pop(); /*SEMANTIC ACTIONS*/
                xinput = pop(); /*USE INFO AT THIS POINT*/
            }
            state = topOfStack();
            push(Production[prod].lhs);
            push(Goto[state,Production[prod].lhs);
            break;

        case ACCEPT:
            return;
        default: /*ERROR*/
        }
    }
}


Tables for the example grammar:

ACTION
GOTO

PRODUCTION

State
id
+
*
(
)
$
E
T
F
#
Production
0
s5


s4


1
2
3
1
EE+T
1

s6



acc



2
ET
2

r2
s7

r2
r2



3
TT*F
3

r4
r4

r4
r4



4
TF
4
s5


s4


8
2
3
5
F(E)
5

r6
r6

r6
r6



6
Fid
6
s5


s4



9
3
   
7
s5


s4




10
   
8

s6


s11




   
9

r1
s7

r1
r1



   
10

r3
r3

r3
r3



   
11

r5
r5

r5
r5



   

Working through an example:
Stack
Input
Action
0
a*b+c