Lexical Analysis

In Compiler Frontends, Lexical Analysis, also known as Lexing or Scanning, refers to the process of mapping an input program, represented as a string, to a list of Tokens.

This process typically uses highly efficient techniques based on regular expressions to identify Lexemes in the input string and to map each lexeme to its corresponding token.
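As a minimal sketch of this idea, the following lexer tries one regular-expression alternative per token kind at the current input position, emits a token for whichever alternative matches, and skips whitespace and comments. The class name SimpleLexer, the token names, and the regular expressions are illustrative assumptions rather than the definition of any particular lexer; production compilers usually generate this code from a regular-expression specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal regex-based lexer sketch. The token names and regular expressions
// are illustrative, not taken from any particular compiler.
public class SimpleLexer {

    enum TokenType { INT, IDENTIFIER, INTEGER_LITERAL, EQ, LPAREN, RPAREN, PLUS, SEMICOLON }

    record Token(TokenType type, String lexeme) {}

    // One alternative per token kind; the named group that matches identifies the kind.
    // Whitespace and line comments are matched but produce no token.
    private static final Pattern TOKEN_PATTERN = Pattern.compile(
        "\\s+|//[^\\n]*"
        + "|(?<INT>\\bint\\b)"
        + "|(?<INTEGERLITERAL>\\d+)"
        + "|(?<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<EQ>=)|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<PLUS>\\+)|(?<SEMICOLON>;)");

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Each match must start exactly at the current position; anything
            // the pattern cannot account for is a lexical error.
            if (!matcher.find(pos) || matcher.start() != pos) {
                throw new IllegalStateException("Unexpected input at position " + pos);
            }
            String lexeme = matcher.group();
            for (TokenType kind : TokenType.values()) {
                // Group names cannot contain '_', so INTEGER_LITERAL uses group INTEGERLITERAL.
                if (matcher.group(kind.name().replace("_", "")) != null) {
                    tokens.add(new Token(kind, lexeme));
                    break;
                }
            }
            pos = matcher.end(); // whitespace and comments simply advance the position
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("int x = (42 + 23); // a comment")
            .forEach(t -> System.out.println(t.type() + "\t\"" + t.lexeme() + "\""));
    }
}
```

Applied to the Java fragment in the example below, this produces exactly the lexeme/token pairs listed there.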

Lexeme

A lexeme is a substring that occurs in the string representation of the input program. Lexical analysis maps this substring to a token.

Token

A token is an atomic element in the input program that corresponds to a Terminal Symbol in a Grammar.

For example, a lexical analyser for Java might read the following Java fragment:

```java
int x = (42 + 23); // Lexing usually removes/ignores comments
```

and identify the following lexemes and corresponding tokens:

| Lexeme | Token           |
| ------ | --------------- |
| "int"  | INT             |
| "x"    | IDENTIFIER      |
| "="    | EQ              |
| "("    | LPAREN          |
| "42"   | INTEGER_LITERAL |
| "+"    | PLUS            |
| "23"   | INTEGER_LITERAL |
| ")"    | RPAREN          |
| ";"    | SEMICOLON       |

In a compiler frontend, the lexical analyser would then pass these tokens to the Parser.
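One common arrangement, assumed here rather than prescribed above, is for the parser to pull tokens from the lexical analyser on demand instead of receiving the whole list up front. A hypothetical interface over the SimpleLexer sketch might look like this:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical pull-style interface between lexer and parser: the parser
// requests tokens one at a time instead of taking the complete list up front.
interface TokenSource {
    boolean hasNext();
    SimpleLexer.Token next();
}

// Simplest possible implementation: wrap the token list produced by SimpleLexer.lex.
class ListTokenSource implements TokenSource {
    private final Iterator<SimpleLexer.Token> iterator;

    ListTokenSource(List<SimpleLexer.Token> tokens) {
        this.iterator = tokens.iterator();
    }

    @Override public boolean hasNext() { return iterator.hasNext(); }
    @Override public SimpleLexer.Token next() { return iterator.next(); }
}
```

A pull interface like this lets lexing and parsing proceed in lock-step, so the full token list never needs to be materialised; generated lexers commonly expose an equivalent next-token entry point.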

Tokens often maintain a copy of (or a pointer to) their corresponding lexemes for use in later passes; for instance, the IDENTIFIER token in the example above would carry a reference to the string "x", and the INTEGER_LITERAL tokens would likewise carry "42" and "23".
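Continuing the SimpleLexer sketch (again an illustrative assumption rather than a fixed design), a later pass can read the stored lexeme back out of a token, for example to turn the lexeme "42" of an INTEGER_LITERAL into the numeric value 42:

```java
// Illustrative helper for a later pass: recover the numeric value of an
// INTEGER_LITERAL token from the lexeme stored inside it.
class Literals {
    static int integerValue(SimpleLexer.Token token) {
        if (token.type() != SimpleLexer.TokenType.INTEGER_LITERAL) {
            throw new IllegalArgumentException("expected INTEGER_LITERAL, got " + token.type());
        }
        return Integer.parseInt(token.lexeme()); // e.g. the lexeme "42" becomes the int 42
    }
}
```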