[compilation] What is the difference between a token and a lexeme?

In Compiler Construction by Aho, Ullman, and Sethi, it is stated that the input string of characters of the source program is divided into sequences of characters that have a logical meaning, known as tokens, and that lexemes are the sequences of characters that make up a token. So what is the basic difference?

Tags: compilation, compiler-construction, token

Answers:


A lexeme is basically the unit of a token: the actual sequence of characters in the source that matches the pattern for a token. Matching lexemes is how the scanner breaks the source code into tokens.

For example, if the source is x=b, then the lexemes are x, =, b and the tokens are <id, 0>, <=>, <id, 1>.
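A minimal sketch of this mapping in Python (the <id, n> token shape follows the example above; the regex and function name are my own illustrative assumptions, not from the book):

```python
import re

# Toy tokenizer for the example "x=b": identifiers become ("id", n) pairs,
# where n is the index of the lexeme in a symbol table; "=" becomes a
# one-element token with no attribute value.
def tokenize(source):
    symbol_table = []          # identifier lexemes, in order of first use
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|=", source):
        if lexeme == "=":
            tokens.append(("=",))
        else:
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
    return tokens, symbol_table

tokens, table = tokenize("x=b")
print(tokens)   # [('id', 0), ('=',), ('id', 1)]
print(table)    # ['x', 'b']
```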


Let's see how a lexical analyser ( also called a scanner ) works.

Let's take an example expression:

INPUT : cout << 3+2+3;

GROUPING PERFORMED BY SCANNER : {cout}|space|{<<}|space|{3}{+}{2}{+}{3}{;}

( this grouping is only conceptual; it is not the scanner's actual output )

THE SCANNER SIMPLY LOOKS REPEATEDLY FOR THE NEXT LEXEME IN THE SOURCE-PROGRAM TEXT UNTIL THE INPUT IS EXHAUSTED

A lexeme is a substring of the input that forms a valid string of terminals of the grammar. Every lexeme follows a pattern, which is explained at the end ( in the part the reader may skip ).

( An important rule, known as maximal munch, is to look for the longest possible prefix of the remaining input that forms a valid string of terminals ... explained below )

LEXEMES :

  1. cout
  2. <<

( "<" is also a valid terminal string, but the longest-prefix rule selects the pattern for the lexeme "<<" in order to generate the token returned by the scanner )

  3. 3
  4. +
  5. 2
  6. +
  7. 3
  8. ;

TOKENS : Tokens are returned one at a time ( by the scanner, when requested by the parser ) each time the scanner finds a ( valid ) lexeme. When it finds a lexeme, the scanner creates a symbol-table entry ( if one is not already present ), holding attributes such as the token category, in order to generate the lexeme's token.

'#' denotes a symbol-table entry. The number is the index of the entry created for each distinct lexeme, in order of first appearance ( cout = #1, << = #2, 3 = #3, + = #4, 2 = #5, ; = #6 ).

The following tokens are returned by the scanner to the parser, in the given order, for the example above.

  1. < Identifier , #1 >

  2. < Operator , #2 >

  3. < Literal , #3 >

  4. < Operator , #4 >

  5. < Literal , #5 >

  6. < Operator , #4 >

  7. < Literal , #3 >

  8. < Punctuator , #6 >

As you can see the difference: a token is a pair, unlike a lexeme, which is a substring of the input.

The first element of the pair is the token class/category.

Token Classes are listed below:

  • KEYWORDS
  • IDENTIFIERS
  • LITERALS
  • PUNCTUATORS
  • OPERATORS
  • One more thing: the scanner detects whitespace, ignores it, and does not form any token for whitespace at all. Not all delimiters are whitespace; whitespace is one form of delimiter that scanners use. Tabs, newlines, spaces, and escaped characters in the input are collectively called whitespace delimiters. A few other delimiters, such as ';' ',' ':', are themselves recognised as lexemes that form tokens.

    The total number of tokens returned here is 8, but only 6 symbol-table entries are made, because only 6 of the lexemes are distinct. The lexemes are also 8 in total ( see the definition of a lexeme ).
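The scanning loop described above can be sketched in Python (category names follow the token list above; the regex-based dispatch is my own illustrative assumption, not the book's algorithm). Listing "<<" before "<" in the alternation makes the longest prefix win:

```python
import re

# Token categories and their patterns; "<<" precedes "<" so that the
# longest possible prefix is matched (maximal munch).
TOKEN_SPEC = [
    ("Identifier", r"[A-Za-z_]\w*"),
    ("Literal",    r"\d+"),
    ("Operator",   r"<<|<|\+"),
    ("Punctuator", r";"),
    ("Whitespace", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    symtab = {}                      # lexeme -> symbol-table index (1-based)
    tokens = []
    for m in MASTER.finditer(source):
        category, lexeme = m.lastgroup, m.group()
        if category == "Whitespace":
            continue                 # whitespace is a delimiter, not a token
        index = symtab.setdefault(lexeme, len(symtab) + 1)
        tokens.append((category, index))
    return tokens, symtab

tokens, symtab = scan("cout << 3+2+3;")
print(len(tokens))   # 8 tokens returned to the parser
print(len(symtab))   # 6 symbol-table entries ("+" and "3" repeat)
```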

    --- You can skip this part

    A ***pattern*** is a rule ( say, a regular expression ) that is used to check whether a string of terminals is valid or not.

    If a substring of the input, composed only of grammar terminals, follows the rule specified by one of the listed patterns, it is validated as a lexeme, and the selected pattern identifies the category of the lexeme; otherwise a lexical error is reported, due to either (i) the substring not following any of the rules, or (ii) the input containing a bad character that is not a terminal of the grammar at all.

    For example:

    1. No pattern exists : in C++ , "99Id_Var" is a string of grammar terminals, but it is not recognised by any pattern, hence a lexical error is reported.

    2. Bad input character : '$', '@', and Unicode characters may not be supported as valid characters in some programming languages.
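The two failure modes can be sketched like this (the pattern list is a toy assumption of my own, not a real language's lexical grammar):

```python
import re

# Toy pattern list: each entry is (category name, regular expression).
PATTERNS = [
    ("KEYWORD",    r"if|else|while"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),   # may not start with a digit
    ("NUMBER",     r"\d+"),
]

def classify(candidate):
    # A lexeme is valid only if some pattern matches the WHOLE string.
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, candidate):
            return name
    return "LEXICAL ERROR"             # no pattern matches, or bad character

print(classify("foo123"))    # IDENTIFIER
print(classify("99Id_Var"))  # LEXICAL ERROR -- digits-then-letters fits no pattern
print(classify("$"))         # LEXICAL ERROR -- '$' is not a supported character here
```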
    

    Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
    1) identifiers
    2) keywords
    3) operators
    4) special symbols
    5) constants

    Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
    Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.


    Token: The kind of a lexical unit ( keyword, identifier, punctuation character, multi-character operator ) is, simply, a token.

    Pattern: A rule for the formation of a token from input characters.

    Lexeme: A sequence of characters in the SOURCE PROGRAM matched by the pattern for a token. Basically, it is an instance of a token.


    Lexeme - A lexeme is a string of characters that is the lowest-level syntactic unit in the programming language.

    Token - A token is a syntactic category that forms a class of lexemes; it tells which class a lexeme belongs to: keyword, identifier, or something else. One of the major tasks of the lexical analyzer is to group the input characters into lexemes and pair each lexeme with its token.

    Let us take an example:

    if(y<= t)

    y=y-3;

    Lexeme    Token
    ------    -----------------
    if        KEYWORD
    (         LEFT PARENTHESIS
    y         IDENTIFIER
    <=        COMPARISON
    t         IDENTIFIER
    )         RIGHT PARENTHESIS
    y         IDENTIFIER
    =         ASSIGNMENT
    y         IDENTIFIER
    -         ARITHMETIC
    3         INTEGER
    ;         SEMICOLON
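A rough sketch of producing such lexeme/token pairs for the example if(y<= t) y=y-3; (the category names mirror the answer's table; the regex patterns are my own illustrative assumptions):

```python
import re

# Patterns for the categories used above; "<=" precedes "=" and "<"
# so the longest lexeme wins.
SPEC = [
    ("KEYWORD",           r"\bif\b"),
    ("IDENTIFIER",        r"[A-Za-z_]\w*"),
    ("COMPARISON",        r"<=|>=|==|<|>"),
    ("ASSIGNMENT",        r"="),
    ("ARITHMETIC",        r"[-+*/]"),
    ("INTEGER",           r"\d+"),
    ("LEFT_PARENTHESIS",  r"\("),
    ("RIGHT_PARENTHESIS", r"\)"),
    ("SEMICOLON",         r";"),
    ("SKIP",              r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def lex(source):
    # Return (lexeme, token) pairs, skipping whitespace.
    return [(m.group(), m.lastgroup)
            for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

for lexeme, token in lex("if(y<= t)\ny=y-3;"):
    print(f"{lexeme:4} {token}")
```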

    ( Figure: relation between lexeme and token )


    CS researchers, like those in math, are fond of creating "new" terms. The answers above are all nice, but apparently there is no great need to distinguish tokens and lexemes, IMHO. They are like two ways of representing the same thing. A lexeme is concrete -- here, a sequence of characters; a token, on the other hand, is abstract -- usually referring to the type of a lexeme together with its semantic value, if that makes sense. Just my two cents.


    Lexeme: Lexemes are said to be a sequence of characters (alphanumeric) in a token.

    Token: A token is a sequence of characters that can be identified as a single logical entity. Typical tokens are keywords, identifiers, constants, strings, punctuation symbols, operators, and numbers.

    Pattern: A set of strings described by a rule called a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions that are associated with the token.


    (a) Tokens are symbolic names for the entities that make up the text of the program; e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.

    (b) A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g the sequence i, f for the token if , and any sequence of alphanumerics starting with a letter for the token id.

    (c) A lexeme is a sequence of characters from the input that match a pattern (and hence constitute an instance of a token); for example if matches the pattern for if , and foo123bar matches the pattern for id.


    When a source program is fed into the lexical analyzer, it begins by breaking the characters up into lexemes. The lexemes are then used to construct tokens: each lexeme is mapped into a token. A variable called myVar would be mapped into the token <id, "num">, where "num" points to the variable's location in the symbol table.

    Shortly put:

    • Lexemes are the words derived from the character input stream.
    • Tokens are lexemes mapped into a token-name and an attribute-value.


    An example includes:
    x = a + b * 2
    Which yields the lexemes: {x, =, a, +, b, *, 2}
    With corresponding tokens: {<id, 0>, <=>, <id, 1>, <+>, <id, 2>, <*>, <id, 3>}


    LEXEME - a sequence of characters matched by a PATTERN, forming the TOKEN

    PATTERN - the set of rules that define a TOKEN

    TOKEN - a meaningful collection of characters over the character set of the programming language, e.g. ID, constant, keyword, operator, punctuation, literal string


    Lexeme - A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

    Token - A token is a pair consisting of a token name and an optional token value. The token name is a category of lexical unit. Common token names are:

    • identifiers: names the programmer chooses
    • keywords: names already in the programming language
    • separators (also known as punctuators): punctuation characters and paired-delimiters
    • operators: symbols that operate on arguments and produce results
    • literals: numeric, logical, textual, reference literals

    Consider this expression in the programming language C:

    sum = 3 + 2;

    Tokenized and represented by the following table:

     Lexeme        Token category
    ------------------------------
    sum      |    Identifier
     =       |    Assignment operator
     3       |    Integer literal
     +       |    Addition operator
     2       |    Integer literal
     ;       |    End of statement
    

    The lexical analyzer takes a sequence of characters, identifies the lexeme that matches a regular expression, and then categorizes it as a token. Thus, a lexeme is the matched string, and a token name is the category of that lexeme.

    For example, consider the regular expression below for an identifier, with the input "int foo, bar;":

    letter(letter|digit|_)*

    Here, foo and bar both match the regular expression, so both are lexemes, but they are categorized under one token, ID, i.e. identifier.

    Also note that the next phase, the syntax analyzer, needs to know about tokens, not lexemes.
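The identifier pattern letter(letter|digit|_)* from this answer, written as a Python regular expression (the keyword set below is a toy assumption; a real scanner would consult its own keyword table):

```python
import re

# letter(letter|digit|_)* as a Python regex.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

source = "int foo, bar;"
lexemes = ID_PATTERN.findall(source)
print(lexemes)           # ['int', 'foo', 'bar']

keywords = {"int"}       # reserved words are not identifiers
ids = [lx for lx in lexemes if lx not in keywords]
print(ids)               # ['foo', 'bar'] -- two lexemes, one token name: ID
```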

