Lexical Analysis

In Compiler Frontends, Lexical Analysis, also known as Lexing or Scanning, refers to the process of mapping an input program, represented as a string, to a list of Tokens.

This process typically uses highly efficient techniques based on regular expressions to identify Lexemes in the input string and to map each lexeme to its corresponding token.
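As a minimal sketch of this idea, the following lexer tries one regular-expression alternative per token kind at the current input position, emits a token for whichever alternative matches, and skips whitespace and comments. The class name SimpleLexer, the token names, and the regular expressions are illustrative assumptions rather than the definition of any particular lexer; production compilers usually generate this code from a regular-expression specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal regex-based lexer sketch. The token names and regular expressions
// are illustrative, not taken from any particular compiler.
public class SimpleLexer {

    enum TokenType { INT, IDENTIFIER, INTEGER_LITERAL, EQ, LPAREN, RPAREN, PLUS, SEMICOLON }

    record Token(TokenType type, String lexeme) {}

    // One alternative per token kind; the named group that matches identifies the kind.
    // Whitespace and line comments are matched but produce no token.
    private static final Pattern TOKEN_PATTERN = Pattern.compile(
        "\\s+|//[^\\n]*"
        + "|(?<INT>\\bint\\b)"
        + "|(?<INTEGERLITERAL>\\d+)"
        + "|(?<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<EQ>=)|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<PLUS>\\+)|(?<SEMICOLON>;)");

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Each match must start exactly at the current position; anything
            // the pattern cannot account for is a lexical error.
            if (!matcher.find(pos) || matcher.start() != pos) {
                throw new IllegalStateException("Unexpected input at position " + pos);
            }
            String lexeme = matcher.group();
            for (TokenType kind : TokenType.values()) {
                // Group names cannot contain '_', so INTEGER_LITERAL uses group INTEGERLITERAL.
                if (matcher.group(kind.name().replace("_", "")) != null) {
                    tokens.add(new Token(kind, lexeme));
                    break;
                }
            }
            pos = matcher.end(); // whitespace and comments simply advance the position
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("int x = (42 + 23); // a comment")
            .forEach(t -> System.out.println(t.type() + "\t\"" + t.lexeme() + "\""));
    }
}
```

Applied to the Java fragment in the example below, this produces exactly the lexeme/token pairs listed there.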

Lexeme

A lexeme is a substring that occurs in the string representation of the input program. Lexical analysis maps this substring to a token.

Token

A token is an atomic element in the input program that corresponds to a Terminal Symbol in a Grammar.

For example, a lexical analyser for Java might read the following Java fragment:

```java
int x = (42 + 23); // Lexing usually removes/ignores comments
```

and identify the following lexemes and corresponding tokens:

| Lexeme | Token           |
| ------ | --------------- |
| "int"  | INT             |
| "x"    | IDENTIFIER      |
| "="    | EQ              |
| "("    | LPAREN          |
| "42"   | INTEGER_LITERAL |
| "+"    | PLUS            |
| "23"   | INTEGER_LITERAL |
| ")"    | RPAREN          |
| ";"    | SEMICOLON       |

In a compiler frontend, the lexical analyser would then pass these tokens to the Parser.
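One common arrangement, assumed here rather than prescribed above, is for the parser to pull tokens from the lexical analyser on demand instead of receiving the whole list up front. A hypothetical interface over the SimpleLexer sketch might look like this:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical pull-style interface between lexer and parser: the parser
// requests tokens one at a time instead of taking the complete list up front.
interface TokenSource {
    boolean hasNext();
    SimpleLexer.Token next();
}

// Simplest possible implementation: wrap the token list produced by SimpleLexer.lex.
class ListTokenSource implements TokenSource {
    private final Iterator<SimpleLexer.Token> iterator;

    ListTokenSource(List<SimpleLexer.Token> tokens) {
        this.iterator = tokens.iterator();
    }

    @Override public boolean hasNext() { return iterator.hasNext(); }
    @Override public SimpleLexer.Token next() { return iterator.next(); }
}
```

A pull interface like this lets lexing and parsing proceed in lock-step, so the full token list never needs to be materialised; generated lexers commonly expose an equivalent next-token entry point.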

Tokens often maintain a copy of (or a pointer to) their corresponding lexemes for use in later passes; for instance, the IDENTIFIER token in the example above would carry a reference to the string "x", and the INTEGER_LITERAL tokens would likewise carry "42" and "23".
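Continuing the SimpleLexer sketch (again an illustrative assumption rather than a fixed design), a later pass can read the stored lexeme back out of a token, for example to turn the lexeme "42" of an INTEGER_LITERAL into the numeric value 42:

```java
// Illustrative helper for a later pass: recover the numeric value of an
// INTEGER_LITERAL token from the lexeme stored inside it.
class Literals {
    static int integerValue(SimpleLexer.Token token) {
        if (token.type() != SimpleLexer.TokenType.INTEGER_LITERAL) {
            throw new IllegalArgumentException("expected INTEGER_LITERAL, got " + token.type());
        }
        return Integer.parseInt(token.lexeme()); // e.g. the lexeme "42" becomes the int 42
    }
}
```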