This process usually uses highly efficient regular expression-based techniques to identify Lexemes in input strings and to map each lexeme to its corresponding token.
A lexeme is a substring that occurs in the string representation of the input program. Lexical analysis maps this substring to a token.
A token is an atomic element in the input program that corresponds to a Terminal Symbol in a Grammar.
For example, a lexical analyser for Java might read the following Java fragment:
```java
int x = (42 + 23); // Lexing usually removes/ignores comments
```
and identify the following lexemes and corresponding tokens:
| Lexeme | Token |
| --- | --- |
| "int" | INT |
| "x" | IDENTIFIER |
| "=" | EQ |
| "(" | LPAREN |
| "42" | INTEGER_LITERAL |
| "+" | PLUS |
| "23" | INTEGER_LITERAL |
| ")" | RPAREN |
| ";" | SEMICOLON |
In a compiler frontend, the lexical analyser would then pass these tokens to the Parser.
Tokens often maintain a copy of (or a pointer to) their corresponding lexemes for later passes; for instance, the IDENTIFIER token above would carry a reference to the string "x", and the INTEGER_LITERAL tokens would similarly reference "42" and "23".