[compilation] What is the difference between a token and a lexeme?

Using "Compilers: Principles, Techniques, & Tools, 2nd Ed." by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book:

Lexeme pg. 111

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Token pg. 111

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.

Pattern pg. 111

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

Figure 3.2: Examples of tokens pg. 112

[Token]       [Informal Description]                  [Sample Lexemes]
if            characters i, f                         if
else          characters e, l, s, e                   else
comparison    < or > or <= or >= or == or !=          <=, !=
id            letter followed by letters and digits   pi, score, D2
number        any numeric constant                    3.14159, 0, 6.02e23
literal       anything but ", surrounded by "'s       "core dumped"
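One way to make the pattern column concrete is to write each informal description as a regular expression and check which token names a given lexeme matches. This is only a sketch: the regexes and the helper name token_names_for are my own assumptions, not taken from the book, and "any numeric constant" in particular is narrowed here to simple decimal and exponent forms.

```python
import re

# Hypothetical regex renderings of the informal descriptions in Figure 3.2.
PATTERNS = {
    "if": r"if",
    "else": r"else",
    "comparison": r"<=|>=|==|!=|<|>",
    "id": r"[A-Za-z][A-Za-z0-9]*",
    # Assumption: digits, optional fraction, optional exponent.
    "number": r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?",
    "literal": r'"[^"]*"',
}


def token_names_for(lexeme):
    """Return every token name whose pattern matches the whole lexeme."""
    return [name for name, pattern in PATTERNS.items()
            if re.fullmatch(pattern, lexeme)]


for sample in ["<=", "pi", "6.02e23", '"core dumped"', "if"]:
    print(sample, "->", token_names_for(sample))
```

Note that the lexeme if matches both the if pattern and the id pattern. A real lexer has to resolve that overlap somehow, commonly by giving keywords priority over identifiers.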

To better understand how these terms relate to a lexer and a parser, we will start with the parser and work backwards to the input.

To make a parser easier to design, it does not work with the input directly but takes in a list of tokens generated by a lexer. The token column in Figure 3.2 shows if, else, comparison, id, number and literal; these are names of tokens. In a typical lexer/parser, a token is a structure that holds not only the name of the token but also the characters/symbols that make up the token and the start and end positions of that string of characters. The positions are used for error reporting, highlighting, etc.

The lexer takes the input characters/symbols and, using its rules, converts them into tokens. People who work with lexers and parsers have their own words for things they use often: the sequence of characters/symbols that makes up a token is what they call a lexeme. So when you see lexeme, just think of a sequence of characters/symbols representing a token. In Figure 3.2, for example, the lexemes of the comparison token include <= and !=, while else is the lexeme of the else token and 3.14159 is one lexeme of the number token.
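As a rough illustration of a lexer turning characters into tokens, here is a minimal hand-rolled sketch. The names Token, TOKEN_SPEC and tokenize are hypothetical, the regexes are my own reading of Figure 3.2, and trying the rules in order (keywords before id) is just one simple way to resolve overlaps, not the book's algorithm.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    name: str    # token name, e.g. "id" or "comparison"
    lexeme: str  # the characters from the input that matched


# Rules are tried in order, so keywords come before "id".
# The \b keeps "iffy" from being split into "if" + "fy".
TOKEN_SPEC = [
    ("if", re.compile(r"if\b")),
    ("else", re.compile(r"else\b")),
    ("comparison", re.compile(r"<=|>=|==|!=|<|>")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")),
    ("literal", re.compile(r'"[^"]*"')),
]


def tokenize(text):
    """Convert the input characters into a list of (name, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace separates lexemes, produces no token
            pos += 1
            continue
        for name, regex in TOKEN_SPEC:
            match = regex.match(text, pos)
            if match:
                tokens.append(Token(name, match.group()))
                pos = match.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens


for token in tokenize('if score >= 3.14 else "core dumped"'):
    print(token.name, repr(token.lexeme))
```

The parser would then work only with the name of each token (if, id, comparison, ...), while the lexeme is kept alongside for later phases and error messages.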

Another way to think of the relation between the two is that a token is a programming structure used by the parser, with a property called lexeme that holds the characters/symbols from the input. If you look at most definitions of a token in code, you may not see lexeme as one of its properties. This is because a token is more likely to hold the start and end positions of the characters/symbols it represents, and the lexeme can be derived from those positions as needed because the input is static.
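A small sketch of that position-based design, assuming a hypothetical SpanToken class with a half-open [start, end) range into the source text. The lexeme is not stored; it is sliced out of the unchanged input when needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanToken:
    name: str   # token name, e.g. "id"
    start: int  # index of the first character of the lexeme
    end: int    # index one past the last character of the lexeme

    def lexeme(self, source: str) -> str:
        """Recover the lexeme by slicing the (unchanged) source text."""
        return source[self.start:self.end]


source = "if score >= 3.14"
tokens = [
    SpanToken("if", 0, 2),
    SpanToken("id", 3, 8),
    SpanToken("comparison", 9, 11),
    SpanToken("number", 12, 16),
]

for token in tokens:
    print(token.name, repr(token.lexeme(source)))
```

Storing only positions keeps each token small and gives error reporting and highlighting what they need, at the cost of having to keep the source text around.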
