[compilation] What is the difference between a token and a lexeme?

In Compiler Construction by Aho, Ullman, and Sethi, it is stated that the input string of characters of the source program is divided into sequences of characters that have a logical meaning, known as tokens, and that lexemes are the sequences of characters that make up a token. So what is the basic difference?

Tags: compilation, compiler-construction, token

Answers:


A lexeme is basically the unit of a token: the actual sequence of characters in the source that matches the pattern for a token. Matching lexemes is how the scanner breaks the source code into tokens.

For example, if the source is x=b, then the lexemes are x, =, b and the tokens are <id, 0>, <=>, <id, 1>.
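A minimal sketch of this mapping in Python (the <id, n> token shape follows the example above; the regex and function name are my own illustrative assumptions, not from the book):

```python
import re

# Toy tokenizer for the example "x=b": identifiers become ("id", n) pairs,
# where n is the index of the lexeme in a symbol table; "=" becomes a
# one-element token with no attribute value.
def tokenize(source):
    symbol_table = []          # identifier lexemes, in order of first use
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|=", source):
        if lexeme == "=":
            tokens.append(("=",))
        else:
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
    return tokens, symbol_table

tokens, table = tokenize("x=b")
print(tokens)   # [('id', 0), ('=',), ('id', 1)]
print(table)    # ['x', 'b']
```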


Let's see how a lexical analyser ( also called a scanner ) works.

Let's take an example expression:

INPUT : cout << 3+2+3;

GROUPING PERFORMED BY SCANNER : {cout}|space|{<<}|space|{3}{+}{2}{+}{3}{;}

( this grouping is only conceptual; it is not the scanner's actual output )

THE SCANNER SIMPLY LOOKS REPEATEDLY FOR THE NEXT LEXEME IN THE SOURCE-PROGRAM TEXT UNTIL THE INPUT IS EXHAUSTED

A lexeme is a substring of the input that forms a valid string of terminals of the grammar. Every lexeme follows a pattern, which is explained at the end ( in the part the reader may skip ).

( An important rule, known as maximal munch, is to look for the longest possible prefix of the remaining input that forms a valid string of terminals ... explained below )

LEXEMES :

  1. cout
  2. <<

( "<" is also a valid terminal string, but the longest-prefix rule selects the pattern for the lexeme "<<" in order to generate the token returned by the scanner )

  3. 3
  4. +
  5. 2
  6. +
  7. 3
  8. ;

TOKENS : Tokens are returned one at a time ( by the scanner, when requested by the parser ) each time the scanner finds a ( valid ) lexeme. When it finds a lexeme, the scanner creates a symbol-table entry ( if one is not already present ), holding attributes such as the token category, in order to generate the lexeme's token.

'#' denotes a symbol-table entry. The number is the index of the entry created for each distinct lexeme, in order of first appearance ( cout = #1, << = #2, 3 = #3, + = #4, 2 = #5, ; = #6 ).

The following tokens are returned by the scanner to the parser, in the given order, for the example above.

  1. < Identifier , #1 >

  2. < Operator , #2 >

  3. < Literal , #3 >

  4. < Operator , #4 >

  5. < Literal , #5 >

  6. < Operator , #4 >

  7. < Literal , #3 >

  8. < Punctuator , #6 >

As you can see the difference: a token is a pair, unlike a lexeme, which is a substring of the input.

The first element of the pair is the token class/category.

Token Classes are listed below:

  • KEYWORDS
  • IDENTIFIERS
  • LITERALS
  • PUNCTUATORS
  • OPERATORS
  • One more thing: the scanner detects whitespace, ignores it, and does not form any token for whitespace at all. Not all delimiters are whitespace; whitespace is one form of delimiter that scanners use. Tabs, newlines, spaces, and escaped characters in the input are collectively called whitespace delimiters. A few other delimiters, such as ';' ',' ':', are themselves recognised as lexemes that form tokens.

    The total number of tokens returned here is 8, but only 6 symbol-table entries are made, because only 6 of the lexemes are distinct. The lexemes are also 8 in total ( see the definition of a lexeme ).
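The scanning loop described above can be sketched in Python (category names follow the token list above; the regex-based dispatch is my own illustrative assumption, not the book's algorithm). Listing "<<" before "<" in the alternation makes the longest prefix win:

```python
import re

# Token categories and their patterns; "<<" precedes "<" so that the
# longest possible prefix is matched (maximal munch).
TOKEN_SPEC = [
    ("Identifier", r"[A-Za-z_]\w*"),
    ("Literal",    r"\d+"),
    ("Operator",   r"<<|<|\+"),
    ("Punctuator", r";"),
    ("Whitespace", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    symtab = {}                      # lexeme -> symbol-table index (1-based)
    tokens = []
    for m in MASTER.finditer(source):
        category, lexeme = m.lastgroup, m.group()
        if category == "Whitespace":
            continue                 # whitespace is a delimiter, not a token
        index = symtab.setdefault(lexeme, len(symtab) + 1)
        tokens.append((category, index))
    return tokens, symtab

tokens, symtab = scan("cout << 3+2+3;")
print(len(tokens))   # 8 tokens returned to the parser
print(len(symtab))   # 6 symbol-table entries ("+" and "3" repeat)
```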

    --- You can skip this part

    A ***pattern*** is a rule ( say, a regular expression ) that is used to check whether a string of terminals is valid or not.

    If a substring of the input, composed only of grammar terminals, follows the rule specified by one of the listed patterns, it is validated as a lexeme, and the selected pattern identifies the category of the lexeme; otherwise a lexical error is reported, due to either (i) the substring not following any of the rules, or (ii) the input containing a bad character that is not a terminal of the grammar at all.

    For example:

    1. No pattern exists : in C++ , "99Id_Var" is a string of grammar terminals, but it is not recognised by any pattern, hence a lexical error is reported.

    2. Bad input character : '$', '@', and Unicode characters may not be supported as valid characters in some programming languages.
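The two failure modes can be sketched like this (the pattern list is a toy assumption of my own, not a real language's lexical grammar):

```python
import re

# Toy pattern list: each entry is (category name, regular expression).
PATTERNS = [
    ("KEYWORD",    r"if|else|while"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),   # may not start with a digit
    ("NUMBER",     r"\d+"),
]

def classify(candidate):
    # A lexeme is valid only if some pattern matches the WHOLE string.
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, candidate):
            return name
    return "LEXICAL ERROR"             # no pattern matches, or bad character

print(classify("foo123"))    # IDENTIFIER
print(classify("99Id_Var"))  # LEXICAL ERROR -- digits-then-letters fits no pattern
print(classify("$"))         # LEXICAL ERROR -- '$' is not a supported character here
```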
    

    Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
    1) identifiers
    2) keywords
    3) operators
    4) special symbols
    5) constants

    Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
    Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.


    Token: The kind of a lexical unit ( keyword, identifier, punctuation character, multi-character operator ) is, simply, a token.

    Pattern: A rule for the formation of a token from input characters.

    Lexeme: A sequence of characters in the SOURCE PROGRAM matched by the pattern for a token. Basically, it is an instance of a token.


    Lexeme - A lexeme is a string of characters that is the lowest-level syntactic unit in the programming language.

    Token - A token is a syntactic category that forms a class of lexemes; it tells which class a lexeme belongs to: keyword, identifier, or something else. One of the major tasks of the lexical analyzer is to group the input characters into lexemes and pair each lexeme with its token.

    Let us take an example:

    if(y<= t)

    y=y-3;

    Lexeme    Token
    ------    -----------------
    if        KEYWORD
    (         LEFT PARENTHESIS
    y         IDENTIFIER
    <=        COMPARISON
    t         IDENTIFIER
    )         RIGHT PARENTHESIS
    y         IDENTIFIER
    =         ASSIGNMENT
    y         IDENTIFIER
    -         ARITHMETIC
    3         INTEGER
    ;         SEMICOLON
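A rough sketch of producing such lexeme/token pairs for the example if(y<= t) y=y-3; (the category names mirror the answer's table; the regex patterns are my own illustrative assumptions):

```python
import re

# Patterns for the categories used above; "<=" precedes "=" and "<"
# so the longest lexeme wins.
SPEC = [
    ("KEYWORD",           r"\bif\b"),
    ("IDENTIFIER",        r"[A-Za-z_]\w*"),
    ("COMPARISON",        r"<=|>=|==|<|>"),
    ("ASSIGNMENT",        r"="),
    ("ARITHMETIC",        r"[-+*/]"),
    ("INTEGER",           r"\d+"),
    ("LEFT_PARENTHESIS",  r"\("),
    ("RIGHT_PARENTHESIS", r"\)"),
    ("SEMICOLON",         r";"),
    ("SKIP",              r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def lex(source):
    # Return (lexeme, token) pairs, skipping whitespace.
    return [(m.group(), m.lastgroup)
            for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

for lexeme, token in lex("if(y<= t)\ny=y-3;"):
    print(f"{lexeme:4} {token}")
```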

    ( Figure: relation between lexeme and token )


    CS researchers, like those in math, are fond of creating "new" terms. The answers above are all nice, but apparently there is no great need to distinguish tokens and lexemes, IMHO. They are like two ways of representing the same thing. A lexeme is concrete -- here, a sequence of characters; a token, on the other hand, is abstract -- usually referring to the type of a lexeme together with its semantic value, if that makes sense. Just my two cents.


    Lexeme: Lexemes are said to be a sequence of characters (alphanumeric) in a token.

    Token: A token is a sequence of characters that can be identified as a single logical entity. Typical tokens are keywords, identifiers, constants, strings, punctuation symbols, operators, and numbers.

    Pattern: A set of strings described by a rule called a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions that are associated with the token.


    (a) Tokens are symbolic names for the entities that make up the text of the program; e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.

    (b) A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g the sequence i, f for the token if , and any sequence of alphanumerics starting with a letter for the token id.

    (c) A lexeme is a sequence of characters from the input that match a pattern (and hence constitute an instance of a token); for example if matches the pattern for if , and foo123bar matches the pattern for id.


    When a source program is fed into the lexical analyzer, it begins by breaking the characters up into lexemes. The lexemes are then used to construct tokens: each lexeme is mapped into a token. A variable called myVar would be mapped into the token <id, "num">, where "num" points to the variable's location in the symbol table.

    Shortly put:

    • Lexemes are the words derived from the character input stream.
    • Tokens are lexemes mapped into a token-name and an attribute-value.


    An example includes:
    x = a + b * 2
    Which yields the lexemes: {x, =, a, +, b, *, 2}
    With corresponding tokens: {<id, 0>, <=>, <id, 1>, <+>, <id, 2>, <*>, <id, 3>}


    LEXEME - a sequence of characters matched by a PATTERN, forming the TOKEN

    PATTERN - the set of rules that define a TOKEN

    TOKEN - a meaningful collection of characters over the character set of the programming language, e.g. ID, constant, keyword, operator, punctuation, literal string


    Lexeme - A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

    Token - A token is a pair consisting of a token name and an optional token value. The token name is a category of lexical unit. Common token names are:

    • identifiers: names the programmer chooses
    • keywords: names already in the programming language
    • separators (also known as punctuators): punctuation characters and paired-delimiters
    • operators: symbols that operate on arguments and produce results
    • literals: numeric, logical, textual, reference literals

    Consider this expression in the programming language C:

    sum = 3 + 2;

    Tokenized and represented by the following table:

     Lexeme        Token category
    ------------------------------
    sum      |    Identifier
     =       |    Assignment operator
     3       |    Integer literal
     +       |    Addition operator
     2       |    Integer literal
     ;       |    End of statement
    

    The lexical analyzer takes a sequence of characters, identifies the lexeme that matches a regular expression, and then categorizes it as a token. Thus, a lexeme is the matched string, and a token name is the category of that lexeme.

    For example, consider the regular expression below for an identifier, with the input "int foo, bar;":

    letter(letter|digit|_)*

    Here, foo and bar both match the regular expression, so both are lexemes, but they are categorized under one token, ID, i.e. identifier.

    Also note that the next phase, the syntax analyzer, needs to know about tokens, not lexemes.
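The identifier pattern letter(letter|digit|_)* from this answer, written as a Python regular expression (the keyword set below is a toy assumption; a real scanner would consult its own keyword table):

```python
import re

# letter(letter|digit|_)* as a Python regex.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

source = "int foo, bar;"
lexemes = ID_PATTERN.findall(source)
print(lexemes)           # ['int', 'foo', 'bar']

keywords = {"int"}       # reserved words are not identifiers
ids = [lx for lx in lexemes if lx not in keywords]
print(ids)               # ['foo', 'bar'] -- two lexemes, one token name: ID
```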

