To write a scanner or parser is not a very difficult thing. Until a certain complexity and/or size it’s maybe ok to write it manually. Beyond that size or complexity the utilization of a scanner or parser generator is recommended. There are many tools doing this, lex
/yacc
to name some of the best known (or their GNU coutnerparts flex
and bison
).
flex++
and bison++
are more improved ones. They produce C++ code and are more customizable. The bad thing is, there is not much documentation and tutorials concerning those utilities. This is what this document is for. It will show a basic example of a scanner, continues with a parser (scanner and parser working together) and finally shows an example how to write programs having multiple parsers.
Note: all source codes provided by this page are free to copy, modify, distribute, use or to treat in any manner you like. Build and run the programs on your own risk. The tutorial is copyright of Mario Konrad.
Note: this is not a tutorial about fundamentals of scanners, parsers or compiler building. It just covers the tools mentioned. For the other topics, please read CS literature.
GCC 3.2
(or newer)bison++ 1.21-8
(or newer)flex++ 2.3.8-7
(or newer)This is a rather basic example of a simple scanner. Don’t expect too much.
calc.l
:
%name Scanner
%define IOSTREAM
DIGIT [0-9]
DIGIT1 [1-9]
%%
"+" { cout << "operator <" << yytext[0] << ">" << endl; }
"-" { cout << "operator <" << yytext[0] << ">" << endl; }
"=" { cout << "operator <" << yytext[0] << ">" << endl; }
{DIGIT1}{DIGIT}* { cout << " number <" << yytext << ">" << endl; }
. { cout << " UNKNOWN <" << yytext[0] << ">" << endl; }
%%
int main(int argc, char ** argv)
{
Scanner scanner;
scanner.yylex();
return 0;
}
Build it with the following sequence:
The result is a program called calc
. To test it out simply type:
If you look at the source code, there isn’t really much to say about it. It’s a plain and simple scanner that knows the tokens +
, -
, =
and positive integer numbers.
The more interesting thing about the source is, it demonstrates the ability of flex++
to generate a scanner class (called Scanner in this example) which is used to scan the input (stdin
or cin
respectively):
This is a quite simple example of a parser using a scanner. There are tree files:
parser.y
: source file for the parserscanner.l
: source file for scannercalc.cc
: definition of the parser class and the function mainparser.y
:
%name CalcParser
%define LSP_NEEDED
%define ERROR_BODY = 0
%define LEX_BODY = 0
%header{
#include <iostream>
#include <fstream>
%}
%union {
int i_type;
char c_type;
}
%token UNKNOWN
%token <c_type> PLUS MINUS EQUALS
%token <i_type> NUMBER
%type <i_type> number subexpression addexpression
%type <i_type> expression calculation
%start calculation
%%
calculation
: expression EQUALS { $$ = $1; }
;
expression
: addexpression { $$ = $1; }
;
addexpression
: subexpression { $$ = $1; }
| addexpression PLUS subexpression { $$ = $1 + $3; }
;
subexpression
: number { $$ = $1; }
| subexpression MINUS number { $$ = $1 - $3; }
;
number
: NUMBER { $$ = $1; }
;
%%
scanner.l
:
%name CalcScanner
%define IOSTREAM
%define LEX_PARAM YY_CalcParser_STYPE *val, YY_CalcParser_LTYPE *loc
%define MEMBERS public: int line, column;
%define CONSTRUCTOR_INIT : line(1), column(1)
%header{
#include<sstream>
#include "parser.h"
%}
DIGIT [0-9]
DIGIT1 [1-9]
%%
" " {
++column;
}
"\t" {
column += 8;
}
"+" {
++column;
val->c_type = yytext[0];
return CalcParser::PLUS;
}
"-" {
++column;
val->c_type = yytext[0];
return CalcParser::MINUS;
}
"=" {
++column;
val->c_type = yytext[0];
return CalcParser::EQUALS;
}
{DIGIT1}{DIGIT}* {
column += strlen(yytext);
std::istringstream(yytext) >> val->i_type;
return CalcParser::NUMBER;
}
. {
++column;
val->c_type = yytext[0];
return CalcParser::UNKNOWN;
}
<<EOF>> {
yyterminate();
}
%%
calc.cc
:
#include "parser.h"
#include "scanner.h"
#include <iostream>
using namespace std;
class Parser : public CalcParser
{
private:
CalcScanner scanner;
public:
Parser();
virtual void yyerror(char * msg);
virtual int yylex();
};
Parser::Parser()
{
}
void Parser::yyerror(char * msg)
{
cerr << yylloc.first_line << ":" << yylloc.first_column << ": "
<< msg << " : <" << yylloc.text << ">" << endl;
}
int Parser::yylex()
{
yylloc.first_line = scanner.line;
yylloc.first_column = scanner.column;
int token = scanner.yylex(&yylval, &yylloc);
yylloc.last_line = scanner.line;
yylloc.last_column = scanner.column;
yylloc.text = (char *)scanner.yytext;
return token;
}
int main(int argc, char ** argv)
{
Parser parser;
parser.yyparse();
return 0;
}
It is an example of a very simple algebraic calculator, knowing only integer values, additions and subtractions.
The previous example were nothing special, business as usual. To build a multiple-parser-application is a bit more tricky. Luckily flex++
and bison++
are more customizable as the original tools lex
and yacc
.
Let’s build an application that can utilize a parser for algegraic calculations as well as UPN calculation (UPN in german = IPN in english: inverse polnish notation).
For the algebraic calculator:
calculation ::= expression "=" .
expression ::= addexpression .
addexpression ::= subexpressoin
| (addexpression "+" subexpression) .
subexpression ::= number
| (subexpression "-" number) .
number ::= [1-9][0-9]* .
For the UPN calculator:
calculation ::= expression .
expression ::= number
| (expression expression "+")
| (expression expression "-") .
number ::= [1-9][0-9]* .
number is not in pure EBNF, but it’s easier to describe it as a regular expression.
The parser generator shall generate the parsers for both calculators. The application however likes to use them dynamically. The following diagram is UML and shows the class design of the classes provided by the generator as well as the other classes.
It is recommended to use the Makefile
:
The build process goes as follows:
$ bison++ -d -hAlgCalcParser.h -oAlgCalcParser.cc AlgCalcParser.y
$ flex++ -hAlgCalcScanner.h -oAlgCalcScanner.cc AlgCalcScanner.l
$ bison++ -d -hUPNCalcParser.h -oUPNCalcParser.cc UNPCalcParser.y
$ flex++ -hUPNCalcScanner.h -oUPNCalcScanner.cc UPNCalcScanner.l
$ g++ -c calc.cc
$ g++ -c AlgCalcParser.cc
$ g++ -c AlgCalcScanner.cc
$ g++ -c AlgParser.cc
$ g++ -c UPNCalcParser.cc
$ g++ -c UPNCalcScanner.cc
$ g++ -c UPNParser.cc
$ g++ -o calc calc.o AlgCalcParser.o AlgCalcScanner.o AlgParser.o UPNCalcParser.o UPNCalcScanner.o UPNParser.o
Now we have a calulator that is able to handle algebraic as well as inverse polnish notation. Just tell the calculator which one you like:
flex++
Skeletons:
Download Flex++
and Bison++
packages, or directly from here: