[parsing] lexers vs parsers

Are lexers and parsers really that different in theory?

It seems fashionable to hate regular expressions: coding horror, another blog post.

However, popular lexing-based tools such as pygments, geshi, or prettify all use regular expressions. They seem to lex anything...

When is lexing enough, when do you need EBNF?

Has anyone used the tokens produced by these lexers with bison or antlr parser generators?

Tags: parsing, antlr, lexer, pygments

What parsers and lexers have in common:

  1. They read symbols of some alphabet from their input.

    • Hint: the alphabet doesn't necessarily have to consist of letters, but it does have to consist of symbols which are atomic for the language understood by the parser/lexer.
    • Symbols for the lexer: ASCII characters.
    • Symbols for the parser: the particular tokens, which are terminal symbols of their grammar.
  2. They analyse these symbols and try to match them against the grammar of the language they understand.

    • Here's where the real difference usually lies. See below for more.
    • Grammar understood by lexers: regular grammar (Chomsky's level 3).
    • Grammar understood by parsers: context-free grammar (Chomsky's level 2).
  3. They attach semantics (meaning) to the language pieces they find.

    • Lexers attach meaning by classifying lexemes (strings of symbols from the input) as particular tokens. E.g. the lexemes *, ==, <=, ^ will all be classified as "operator" tokens by a C/C++ lexer.
    • Parsers attach meaning by classifying strings of tokens from the input (sentences) as particular nonterminals and by building the parse tree. E.g. the token strings [number][operator][number], [id][operator][id], [id][operator][number][operator][number] will all be classified as the "expression" nonterminal by a C/C++ parser.
  4. They can attach some additional meaning (data) to the recognized elements.

    • When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store it with the "number" token.
    • Similarly, when a parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree. (A small sketch of points 3 and 4 follows this list.)
  5. They each produce on their output proper sentences of the language they recognize.

    • Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data or for the component that reads it.
    • Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there's no reason why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags embedded in plain text. So it would tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]....
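
As a concrete illustration of points 3 and 4, here is a minimal sketch (the names are made up for illustration, not taken from any real tool) of a token that carries its classification, the matched lexeme, and optional attached data:

from dataclasses import dataclass

@dataclass
class Token:
    kind: str             # the classification, e.g. "operator" or "number" (point 3)
    lexeme: str           # the string of input symbols that was matched
    value: object = None  # optional attached data, e.g. a converted value (point 4)

# The lexemes *, ==, <= all become "operator" tokens, while a numeric
# lexeme carries its converted value along with it.
tokens = [Token("operator", "*"), Token("operator", "=="),
          Token("operator", "<="), Token("number", "42", value=42)]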

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet), in the same way as sentences of one language can be alphabetic symbols of some other, higher-level language. For example, if . and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and dashes as letters encoded in Morse code. The sentences in the language "Morse code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. the "English words" language). And these "English words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English sentences" language. And all these languages differ only in the complexity of the grammar. Nothing more.
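
To make that layering concrete, here is a toy sketch (the code table and word list are made up for illustration): the first stage recognizes dot/dash groups as letters, and those letters then serve as atomic symbols for a higher-level "words" recognizer.

# Stage 1: "Morse code" recognizer; its sentences (letters) become
# the atomic symbols of the next stage.
MORSE = {".-": "A", "-.-.": "C", ".": "E", "-": "T", "....": "H"}

def morse_to_letters(groups):
    return [MORSE[g] for g in groups]

# Stage 2: a "words" recognizer, for which letters are atomic symbols.
WORDS = {"CAT", "THE", "HAT"}

def letters_to_word(letters, words=WORDS):
    word = "".join(letters)
    return word if word in words else None

letters = morse_to_letters(["-.-.", ".-", "-"])   # ['C', 'A', 'T']
print(letters_to_word(letters))                   # 'CAT'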

So what are these "Chomsky grammar levels" all about? Well, Noam Chomsky classified grammars into four levels depending on their complexity:

  • Level 3: Regular grammars

    They use regular expressions, that is, they can consist only of the symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b).
    They can be implemented as finite state automata (FSA), like an NFA (Nondeterministic Finite Automaton) or, better, a DFA (Deterministic Finite Automaton); a tiny table-driven DFA along these lines is sketched after this list.
    Regular grammars can't handle nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks, etc. That's because a finite state automaton would need infinitely many states to handle infinitely many levels of nesting.
  • Level 2: Context-free grammars

    They can have nested, recursive, self-similar branches in their syntax trees, so they can handle nested structures well.
    They can be implemented as a finite state automaton with a stack (a push-down automaton). This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses the machine's procedure call stack to track the nesting level, and uses recursively called procedures/functions for every non-terminal symbol of the grammar.
    But they can't handle context-sensitive syntax. E.g. when you have an expression x+3, in one context this x could be the name of a variable, and in another context it could be the name of a function, etc.
  • Level 1: Context-sensitive grammars

  • Level 0: Unrestricted grammars
    Also called recursively enumerable grammars.
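
As an illustration of Level 3, here is the table-driven DFA promised above (a made-up example): it accepts exactly the regular language of non-empty digit strings, and its fixed, finite set of states is precisely why such a machine cannot count nested parentheses.

# A minimal table-driven DFA: it accepts non-empty digit strings,
# i.e. the regular language described by [0-9]+.
TRANSITIONS = {
    ("start", "digit"): "in_number",
    ("in_number", "digit"): "in_number",
}
ACCEPTING = {"in_number"}

def classify(ch):
    return "digit" if ch.isdigit() else "other"

def accepts(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts("12345"))   # True
print(accepts("12a45"))   # False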


When is lexing enough, when do you need EBNF?

EBNF really doesn't add much to the power of grammars. It's just a convenience / shortcut notation / "syntactic sugar" over plain BNF-style grammar rules. For example, the EBNF alternative:

S --> A | B

you can achieve in plain BNF by just listing each alternative as a separate production:

S --> A      // `S` can be `A`,
S --> B      // or it can be `B`.

The optional element from EBNF:

S --> X?

you can achieve in plain BNF by using a nullable production, that is, one which can be replaced by an empty string (denoted here by an empty right-hand side; other notations use epsilon, lambda, or a crossed circle):

S --> B       // `S` can be `B`,
B --> X       // and `B` can be just `X`,
B -->         // or it can be empty.

A production of the form of the last B above is called an "erasure", because it can erase whatever it stands for in other productions (producing an empty string instead of something else).

Zero-or-more repetition from EBNF:

S --> A*

you can obtain by using a recursive production, that is, one which embeds itself somewhere within it. It can be done in two ways. The first is left recursion (which should usually be avoided, because top-down recursive-descent parsers cannot parse it):

S --> S A    // `S` is just itself ended with `A` (which can be done many times),
S -->        // or it can begin with empty-string, which stops the recursion.

Knowing that it ultimately generates just an empty string followed by zero or more As, the same language (but not the same grammar!) can be expressed using right recursion (a short sketch comparing the two appears after these rules):

S --> A S    // `S` can be `A` followed by itself (which can be done many times),
S -->        // or it can be just empty-string end, which stops the recursion.
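
A short sketch of why that matters in practice (hypothetical function names): translating the two rule sets naively into recursive functions, the left-recursive version calls itself before consuming any input, while the right-recursive version consumes an A on every call and terminates.

# Naive translation of  S --> S A | (empty)  -- left recursion:
def parse_left(tokens, pos):
    pos = parse_left(tokens, pos)   # recurses before consuming anything,
    ...                             # so it never terminates (it would hit the recursion limit)

# Naive translation of  S --> A S | (empty)  -- right recursion:
def parse_right(tokens, pos):
    if pos < len(tokens) and tokens[pos] == "A":
        return parse_right(tokens, pos + 1)   # consume one A, then recurse
    return pos                                # the empty alternative stops it

print(parse_right(["A", "A", "A"], 0))   # 3: all three A's were consumed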

And when it comes to + for one-or-more repetition from EBNF:

S --> A+

it can be done by factoring out one A and using * as before:

S --> A A*

which you can express in plain BNF like this (I use right recursion here; try to figure out the left-recursive one yourself as an exercise):

S --> A S   // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A     // or it could be just one single `A`.

Knowing that, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed as a single EBNF production consisting only of terminal symbols. More generally, you can recognize regular grammars when you see productions similar to these:

A -->        // Empty (nullable) production (AKA erasure).
B --> x      // Single terminal symbol.
C --> y D    // Simple state change from `C` to `D` when seeing input `y`.
E --> F z    // Simple state change from `E` to `F` when seeing input `z`.
G --> G u    // Left recursion.
H --> v H    // Right recursion.

That is: only empty strings, terminal symbols, and simple non-terminals for substitutions and state changes, with recursion used only to achieve repetition (iteration, which is just linear recursion, the kind that doesn't branch tree-like). If there's nothing more advanced than that, you can be sure the syntax is regular and you can go with just a lexer for it.
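
That is essentially how tools like pygments, geshi, or prettify get away with regular expressions: a lexer for a regular token syntax can be little more than a list of regexes tried in order. A simplified sketch (not the actual API of any of those tools):

import re

# (token kind, regular expression) pairs, combined into one master pattern.
TOKEN_SPEC = [
    ("number",   r"\d+"),
    ("name",     r"[A-Za-z_]\w*"),
    ("operator", r"==|<=|[*^+\-=<>]"),
    ("skip",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{rx})" for kind, rx in TOKEN_SPEC))

def lex(source):
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind != "skip":
            yield kind, match.group()

print(list(lex("total <= 42 * rate")))
# [('name', 'total'), ('operator', '<='), ('number', '42'),
#  ('operator', '*'), ('name', 'rate')]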

But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, like the following one:

S --> a S b    // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S -->          // or it could be (ultimately) empty, which ends recursion.

then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'd end up substituting for S indefinitely, which would always add another a and b on each side. Lexers (more precisely: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many as there were to match them evenly with as many bs. Grammars like this are called context-free grammars (at the very least), and they require a parser.
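
A hand-written recursive-descent recognizer for that grammar makes the counting explicit: each recursive call remembers one pending a on the call stack and demands a matching b on the way back, which is exactly the unbounded memory a finite automaton doesn't have. A minimal sketch:

# Recognizer for  S --> a S b | (empty),  i.e. the language a^n b^n.
def parse_s(text, pos=0):
    """Return the position after one S, or None if it does not match."""
    if pos < len(text) and text[pos] == "a":
        inner = parse_s(text, pos + 1)      # each call remembers one pending 'a'
        if inner is not None and inner < len(text) and text[inner] == "b":
            return inner + 1                # ...and requires a matching 'b'
        return None
    return pos                              # the empty alternative

def is_sentence(text):
    return parse_s(text) == len(text)

print(is_sentence("aaabbb"))   # True
print(is_sentence("aab"))      # False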

Parsing context-free grammars is a well-understood problem, so they are widely used for describing programming languages' syntax. But there's more. Sometimes a more general grammar is needed, when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square brackets interleaved, but they have to be paired up correctly with each other (brackets with brackets, round with round). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left-hand side (before the arrow). For example:

A R B --> A S B

You can think of these additional symbols on the left as a "context" for applying the rule. There could be some preconditions, postconditions, etc. For example, the above rule will replace R with S, but only when it stands between A and B, leaving those A and B themselves unchanged. This kind of syntax is really hard to parse in general: context-sensitive grammars correspond to linear bounded automata rather than simple stack machines (and unrestricted grammars to full Turing machines). It's a whole other story, so I'll end here.
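
For what it's worth, a single rewriting step for such a rule is easy to illustrate (a toy example with made-up symbols): R becomes S only where the surrounding context A ... B is present.

# One rewriting step for the context-sensitive rule  A R B --> A S B:
# 'R' becomes 'S' only when it stands between 'A' and 'B'.
def apply_rule(symbols):
    out = list(symbols)
    for i in range(1, len(out) - 1):
        if out[i - 1] == "A" and out[i] == "R" and out[i + 1] == "B":
            out[i] = "S"                     # the context A ... B itself is kept
    return "".join(out)

print(apply_rule("ARB"))    # 'ASB'  -- rule applies
print(apply_rule("XRB"))    # 'XRB'  -- wrong left context, no change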


To answer the question as asked (without repeating unduly what appears in other answers)

Lexers and parsers are not very different, as suggested by the accepted answer. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers. They are both associated with fairly simple computational models: the finite state automaton and the push-down automaton. Regular languages are a special case of context-free languages, so lexers could be produced with the somewhat more complex CF technology. But it is not a good idea, for at least two reasons.

A fundamental point in programming is that a system component should be built with the most appropriate technology, so that it is easy to produce, to understand and to maintain. The technology should not be overkill (using techniques much more complex and costly than needed), nor should it be at the limit of its power, thus requiring technical contortions to achieve the desired goal.

That is why "It seems fashionable to hate regular expressions". Though they can do a lot, they sometimes require very unreadable coding to achieve it, not to mention the fact that various extensions and restrictions in implementations somewhat reduce their theoretical simplicity. Lexers do not usually do that, and are usually a simple, efficient, and appropriate technology for parsing tokens. Using CF parsers for tokens would be overkill, though it is possible.

Another reason not to use the CF formalism for lexers is that it might then be tempting to use the full CF power. But that might raise structural problems with respect to the reading of programs.

Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parsed sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphism for the mathematically oriented) from the way syntax rules are composed to build the parse tree. Hence the tree structure is essential. The fact that tokens are identified with a regular-set-based lexer does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely about regular transducers, which transform a stream of characters into a stream of tokens).

However, CF composed with CF (via CF transducers ... sorry for the math) does not necessarily give CF, and might make things more general, but less tractable in practice. So CF is not the appropriate tool for lexers, even though it can be used.

One of the major differences between regular and CF is that regular languages (and transducers) compose very well with almost any formalism in various ways, while CF languages (and transducers) do not, not even with themselves (with a few exceptions).

(Note that regular transducers may have other uses, such as the formalization of some syntax error handling techniques.)

BNF is just a specific syntax for presenting CF grammars.

EBNF is syntactic sugar for BNF, using the facilities of regular notation to give a terser version of BNF grammars. It can always be transformed into an equivalent pure BNF.

However, the regular notation is often used in EBNF only to emphasize those parts of the syntax that correspond to the structure of lexical elements and should be recognized by the lexer, while the rest is presented in straight BNF. But it is not an absolute rule.

To summarize: the simpler structure of tokens is better analyzed with the simpler technology of regular languages, while the tree-oriented structure of the language (of program syntax) is better handled by CF grammars.

I would suggest also looking at AHR's answer.

But this leaves a question open: Why trees?

Trees are a good basis for specifying syntax because

  • they give a simple structure to the text

  • they are very convenient for associating semantics with the text on the basis of that structure, with a mathematically well understood technology (compositionality via homomorphisms), as indicated above. It is a fundamental algebraic tool for defining the semantics of mathematical formalisms.

Hence trees are a good intermediate representation, as shown by the success of Abstract Syntax Trees (ASTs). Note that ASTs are often different from parse trees because the parsing technology used by many professionals (such as LL or LR) applies only to a subset of CF grammars, thus forcing grammatical distortions which are later corrected in the AST. This can be avoided with more general parsing technology (based on dynamic programming) that accepts any CF grammar.
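
To illustrate that point, here is a small hypothetical sketch: the naturally left-recursive rule E --> E '-' num has to be rewritten as E --> num ('-' num)* for an LL parser, so the left-associative shape is rebuilt directly in the AST by a loop rather than appearing in the parse tree.

# The grammatical distortion (iteration instead of left recursion) is
# corrected while building the AST (hypothetical sketch, not any
# particular parser generator).
def parse_expr(tokens):
    it = iter(tokens)
    ast = next(it)                       # first number
    for tok in it:
        if tok == "-":
            right = next(it)
            ast = ("-", ast, right)      # fold to the left as we go
    return ast

print(parse_expr(["1", "-", "2", "-", "3"]))
# ('-', ('-', '1', '2'), '3')  -- i.e. (1 - 2) - 3, as intended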

Statements to the effect that programming languages are context-sensitive (CS) rather than CF are arbitrary and disputable.

The problem is that the separation of syntax and semantics is arbitrary. Checking declarations or type agreement may be seen as either part of syntax, or part of semantics. The same would be true of gender and number agreement in natural languages. But there are natural languages where plural agreement depends on the actual semantic meaning of words, so that it does not fit well with syntax.

Many definitions of programming languages in denotational semantics place declarations and type checking in the semantics. So stating, as Ira Baxter does, that CF parsers are being hacked to get the context sensitivity required by syntax is at best an arbitrary view of the situation. It may be organized as a hack in some compilers, but it does not have to be.

Also, it is not just that CS parsers (in the sense used in other answers here) are hard to build and less efficient. They are also inadequate for expressing perspicuously the kind of context-sensitivity that might be needed. And they do not naturally produce a syntactic structure (such as parse trees) that is convenient for deriving the semantics of the program, i.e. for generating the compiled code.


There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.

  1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
  2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
  3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Source: Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya), and Jeffrey D. Ullman (Stanford University).


Yes, they are very different in theory, and in implementation.

Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.

Parsers are used to recognize the "structure" of language phrases. Such structure generally lies far beyond what "regular expressions" can recognize, so one needs "context-sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
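
A classic instance of such a hack is the C typedef problem, where the parser consults a symbol table built up during parsing to decide how to read a phrase. A highly simplified, hypothetical sketch:

# Hypothetical sketch of the "symbol table" hack: the grammar alone cannot
# tell whether  x * y;  declares a pointer or multiplies two values in C,
# so the parser asks a table filled in from earlier declarations.
symbol_table = {"x": "typedef-name"}       # e.g. after: typedef int x;

def classify_statement(tokens):
    name, star, other = tokens[0], tokens[1], tokens[2]
    if star == "*" and symbol_table.get(name) == "typedef-name":
        return "declaration of a pointer"   # x * y;  ->  pointer to x named y
    return "multiplication expression"      # otherwise an ordinary product

print(classify_statement(["x", "*", "y"]))   # declaration of a pointer
print(classify_statement(["a", "*", "b"]))   # multiplication expression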

Neither lexing nor parsing technology is likely to go away soon.

They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.