# Specification of Tokens in Compiler Design



## Compiler Design - Lexical Analysis


Lexical analysis is the first phase of a compiler. It takes the modified source code produced by language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, discarding any whitespace and comments in the source code.

If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer on demand.

## Tokens

A lexeme is the sequence of characters in the source code that forms a token. Every lexeme must follow predefined rules to be identified as a valid token. These rules are defined by the grammar of the language, by means of patterns. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.

In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.

For example, in the C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
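The breakdown above can be reproduced with a small regex-driven scanner. This is only a minimal sketch: the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, and the patterns cover just this fragment, not the full C language.

```python
import re

# Hypothetical token classes for a tiny C-like fragment; not a full C lexer.
# Order matters: keywords are listed before identifiers so "int" is a keyword.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:int|float|char|return)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("constant",   r"[0-9]+"),
    ("operator",   r"="),
    ("symbol",     r";"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield (token_class, lexeme) pairs; raise on a character no pattern accepts."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "skip":      # whitespace is discarded, not reported
            yield match.lastgroup, match.group()
        pos = match.end()

tokens = list(tokenize("int value = 100;"))
print(tokens)
# [('keyword', 'int'), ('identifier', 'value'), ('operator', '='),
#  ('constant', '100'), ('symbol', ';')]
```

Because `tokenize` is a generator, a parser can pull one token at a time, which mirrors the on-demand hand-off to the syntax analyzer described earlier.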

## Specifications of Tokens

Let us look at how language theory defines the following terms:

### Alphabets

An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.

### Strings

Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of occurrences of symbols in it, e.g., the length of the string tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string with no symbols, i.e. a string of zero length, is known as an empty string and is denoted by ε (epsilon).

### Special symbols

A typical high-level language contains the following symbols:

- Arithmetic symbols: Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
- Punctuation: Comma(,), Semicolon(;), Dot(.), Arrow(->)
- Assignment: =
- Special assignment: +=, /=, *=, -=
- Comparison: ==, !=, <, <=, >, >=
- Preprocessor: #
- Location specifier: &
- Logical: &, &&, |, ||, !
- Shift operator: >>, >>>, <<, <<<

### Language

A language is a set of strings over some finite alphabet; the set may be finite or infinite. Since languages are sets, set operations can be performed on them mathematically. Finite languages, and regular languages in general, can be described by means of regular expressions.

## Regular Expressions

The lexical analyzer needs to scan and identify only the valid strings, tokens, and lexemes that belong to the language at hand. It searches for the pattern defined by the language rules.

Regular expressions can express such languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as **regular grammar**. The language defined by regular grammar is known as **regular language**.

Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.

There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.

## Operations

The various operations on languages are:

Union of two languages L and M is written as

L ∪ M = {s | s is in L or s is in M}

Concatenation of two languages L and M is written as

LM = {st | s is in L and t is in M}

The Kleene Closure of a language L is written as

L* = the set of strings formed by concatenating zero or more strings from L.

## Notations

If r and s are regular expressions denoting the languages L(r) and L(s), then

**Union**: (r)|(s) is a regular expression denoting L(r) ∪ L(s)

**Concatenation**: (r)(s) is a regular expression denoting L(r)L(s)

**Kleene closure**: (r)* is a regular expression denoting (L(r))*

(r) is a regular expression denoting L(r)

## Precedence and Associativity

- The operators *, concatenation (.), and | (pipe sign) are left associative.
- The * operator has the highest precedence.
- Concatenation (.) has the second highest precedence.
- The | (pipe sign) operator has the lowest precedence of all.
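These precedence rules can be checked against Python's `re` module, which follows the same ordering for `*`, concatenation, and `|`. The sketch below is only an illustration; the helper `accepted` and the candidate strings are made up for this example.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "b", "bc", "bcc", "ac", "abc", "bcbc"]

# With the precedence rules, a|bc* groups as a|(b(c*)).
implicit = accepted(r"a|bc*", candidates)
explicit = accepted(r"a|(b(c*))", candidates)

# A different grouping denotes a different language: it also accepts "ac".
other_grouping = accepted(r"(a|b)c*", candidates)

print(implicit)        # ['a', 'b', 'bc', 'bcc']
print(explicit)        # ['a', 'b', 'bc', 'bcc']
print(other_grouping)  # ['a', 'b', 'bc', 'bcc', 'ac']
```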

### Representing valid tokens of a language in regular expression

If x is a regular expression, then:

x* means zero or more occurrences of x,

i.e., it can generate { ε, x, xx, xxx, xxxx, … }

x+ means one or more occurrences of x.

Source: **www.tutorialspoint.com**

## What is Specification of Tokens? Regular Expression & Definition

Specification of tokens depends on the pattern of the lexeme. Here, we will use regular expressions to specify patterns that can form tokens.

## Specification of Tokens

2nd March 2022 by Neha T


Regular expressions cannot specify every pattern that forms a token, yet they cover almost all of them.

### Content: Specification of Tokens

- String and Languages
- Operation on Languages
- Regular Expression
- Regular Definition

### String and Languages

### String

A string is a finite sequence of symbols drawn from an alphabet. An alphabet is a finite set of symbols. Symbols can be letters, digits, and punctuation.

**Example 1:**

The set of digits {0, 1} forms the binary alphabet, since it has only two symbols.

The ASCII system, used in almost every computer, represents the letter A as a string over {0, 1}, i.e. A = 01000001.

**Example 2:**

The Unicode system defines an alphabet by assigning a unique number to each character. It has well over 100,000 characters from around the world, including emojis.

### Length of String

The length of a string is the number of symbols in it. If the string is represented by the letter ‘s’, then |s| represents its length. Let’s consider the string:

s = banana

|s| = 6

**Note**: The empty string, i.e. the string of length 0, is represented by ‘ε’.

### Language

A language is a set of strings over some fixed alphabet. For example, the English language is a set of strings over the fixed alphabet ‘a to z’.

### Terms Related to String

**1. Prefix of String**

A prefix of a string s is any string obtained by removing zero or more symbols from the end of s, so the prefixes include ε and the string s itself.

For example: s = abcd

The prefixes of the string abcd: ε, a, ab, abc, abcd

**2. Suffix of String**

A suffix of a string s is any string obtained by removing zero or more symbols from the beginning of s, so the suffixes include ε and the string s itself.

For example: s = abcd

The suffixes of the string abcd: ε, d, cd, bcd, abcd

**3. Proper Prefix of String**

The proper prefixes of a string are all of its prefixes excluding ε and the string s itself.

Proper prefixes of the string abcd: a, ab, abc

**4. Proper Suffix of String**

The proper suffixes of a string are all of its suffixes excluding ε and the string s itself.

Proper suffixes of the string abcd: d, cd, bcd

**5. Substring of String**

A substring of a string s is obtained by deleting any prefix and any suffix from the string.

Substrings of the string abcd: ε, abcd, bcd, abc, …

**6. Proper Substring of String**

The proper substrings of a string s are all the substrings of s excluding ε and the string s itself.

Proper substrings of the string abcd: bcd, abc, cd, ab, …

**7. Subsequence of String**

The subsequence of the string is obtained by eliminating zero or more (not necessarily consecutive) symbols from the string.

A subsequence of the string abcd: abd, bcd, bd, …

**8. Concatenation of String**

If s and t are two strings, then st denotes concatenation.

s = abc, t = def

Concatenation of string s and t i.e. st = abcdef
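The string terms listed above map directly onto slicing and combinations. The following is a minimal sketch; the function names are hypothetical helpers written for this illustration, and the empty Python string `""` stands in for ε.

```python
from itertools import combinations

def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from the empty string up to s itself."""
    return [s[i:] for i in range(len(s), -1, -1)]

def substrings(s):
    """Delete any prefix and any suffix: every contiguous slice of s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def subsequences(s):
    """Delete zero or more symbols, not necessarily consecutive ones."""
    return {"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)}

def proper(strings, s):
    """Drop the empty string and s itself."""
    return [t for t in strings if t not in ("", s)]

s = "abcd"
print(prefixes(s))                  # ['', 'a', 'ab', 'abc', 'abcd']
print(suffixes(s))                  # ['', 'd', 'cd', 'bcd', 'abcd']
print(proper(prefixes(s), s))       # ['a', 'ab', 'abc']
print("abd" in subsequences(s), "abd" in substrings(s))  # True False
print("abc" + "def")                # abcdef
```

The last two lines show the difference between a subsequence and a substring: abd skips a symbol, so it is a subsequence of abcd but not a substring.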

### Operation on Languages

As we have learnt, a language is a set of strings constructed over some fixed alphabet. The operations that can be performed on languages are:

**1. Union**

Union is the most common set operation. Consider the two languages L and M. Then the union of these two languages is denoted by:

L ∪ M = { s | s is in L or s is in M}

That means the string s from the union of two languages can either be from language L or from language M.

If L = {a, b} and M = {c, d}

Then L ∪ M = {a, b, c, d}

**2. Concatenation**

Concatenation links the string from one language to the string of another language in a series in all possible ways. The concatenation of two different languages is denoted by:

L ⋅ M = {st | s is in L and t is in M}

If L = {a, b} and M = {c, d}

Then L ⋅ M = {ac, ad, bc, bd}

**3. Kleene Closure**

Kleene closure of a language L provides you with a set of strings. This set of strings is obtained by concatenating L zero or more times. The Kleene closure of the language L is denoted by:

L* = L⁰ ∪ L¹ ∪ L² ∪ …, where L⁰ = {ε}

If L = {a, b}

L* = {ε, a, b, aa, ab, ba, bb, aaa, …}

**4. Positive Closure**

The positive closure of a language L provides a set of strings. This set of strings is obtained by concatenating ‘L’ one or more times. It is denoted by:

L+ = L¹ ∪ L² ∪ L³ ∪ …

It is similar to the Kleene closure, except for the term L⁰, i.e. L+ excludes ε unless ε is in L itself.

If L = {a, b}

L+ = {a, b, aa, ab, ba, bb, aaa, …}

So, these are the four operations that can be performed on the languages in the lexical analysis phase.
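The four operations can be tried out on small finite languages using Python sets. This is a sketch under one stated assumption: both closures are infinite sets in general, so the hypothetical `max_power` parameter cuts them off after a fixed number of concatenations.

```python
def union(L, M):
    return L | M

def concat(L, M):
    """Every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, n):
    """L concatenated with itself n times; power(L, 0) is {""}, i.e. {epsilon}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def kleene_closure(L, max_power):
    """Finite approximation of L*: the union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result

def positive_closure(L, max_power):
    """Finite approximation of L+: the union of L^1 .. L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"c", "d"}
print(sorted(union(L, M)))             # ['a', 'b', 'c', 'd']
print(sorted(concat(L, M)))            # ['ac', 'ad', 'bc', 'bd']
print(sorted(kleene_closure(L, 2)))    # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
print(sorted(positive_closure(L, 2)))  # ['a', 'aa', 'ab', 'b', 'ba', 'bb']
```

Note that the closures contain mixed strings such as ab and ba, not only runs of a single symbol.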

### Regular Expression

A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression is helpful in describing the languages that can be built using operators such as union, concatenation, and closure over the symbols.

A regular expression ‘r’ that denotes a language L(r) is built recursively from smaller regular expressions using the rules given below.

## Specification of Tokens

There are 3 specifications of tokens: 1) Strings 2) Language 3) Regular expression

## Chapter: **Principles of Compiler Design : Lexical Analysis**


**Strings and Languages**

- An **alphabet** or character class is a finite set of symbols.

- A **string** over an alphabet is a finite sequence of symbols drawn from that alphabet.

- A **language** is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

**Operations on strings**

The following string-related terms are commonly used:

1. A **prefix** of string s is any string obtained by removing zero or more symbols from the end of string s. For example, ban is a prefix of banana.

2. A **suffix** of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana is a suffix of banana.

3. A **substring** of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.

4. The **proper prefixes, suffixes, and substrings** of a string s are those prefixes, suffixes, and substrings, respectively, of s that are neither ε nor equal to s itself.

5. A **subsequence** of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.

**Operations on languages:**

The following are the operations that can be applied to languages:

1. Union 2. Concatenation 3. Kleene closure 4. Positive closure

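The following example shows the operations on languages. Let L = {0, 1} and S = {a, b, c}. Then:

1. Union: L ∪ S = {0, 1, a, b, c}

2. Concatenation: LS = {0a, 0b, 0c, 1a, 1b, 1c}

3. Kleene closure: L* = {ε, 0, 1, 00, 01, 10, 11, …}

4. Positive closure: L+ = {0, 1, 00, 01, 10, 11, …}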

**Regular Expressions**

- Each regular expression r denotes a language L(r).

- Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:

1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.

2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’ in its one position.

3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,

a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).

b) (r)(s) is a regular expression denoting the language L(r)L(s).

c) (r)* is a regular expression denoting (L(r))*.

d) (r) is a regular expression denoting L(r).

4. The unary operator * has highest precedence and is left associative.

5. Concatenation has second highest precedence and is left associative.

6. | has lowest precedence and is left associative.

**Regular set**

A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.

There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms.

For instance, | is commutative: r|s = s|r; and | is associative: r|(s|t) = (r|s)|t.
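One rough way to sanity-check such laws is to compare the sets of short strings that two expressions accept. The sketch below uses Python's `re` syntax; the helpers `language` and `agree` are hypothetical, and agreement on strings up to a fixed length is evidence of equivalence rather than a proof.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """Strings over `alphabet`, up to `max_len` symbols, that `pattern` matches in full."""
    compiled = re.compile(pattern)
    words = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return {w for w in words if compiled.fullmatch(w)}

def agree(r, s):
    """True if r and s accept the same strings within the length bound."""
    return language(r) == language(s)

print(agree("a|b", "b|a"))              # True: | is commutative
print(agree("a|(b|ab)", "(a|b)|ab"))    # True: | is associative
print(agree("(a|b)a", "aa|ba"))         # True: concatenation distributes over |
print(agree("ab", "ba"))                # False: concatenation is not commutative
```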

**Regular Definitions**

Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1

d2 → r2

………

dn → rn

1. Each di is a distinct name.

2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.

Example: Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this set:

letter → A | B | …. | Z | a | b | …. | z

digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *
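A regular definition can be turned into a single regular expression by substituting each earlier name into the later bodies. The sketch below illustrates that idea in Python; the `expand` helper is hypothetical, `letter` and `digit` are written as character classes for brevity, and the substitution assumes names never collide with literal text in a body.

```python
import re

# Ordered definitions: each body may refer only to names defined before it.
definitions = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "letter(letter|digit)*"),
]

def expand(definitions):
    """Substitute earlier names into later bodies; return name -> plain regex."""
    expanded = {}
    for name, body in definitions:
        if expanded:
            names = "|".join(map(re.escape, expanded))
            # Wrap each substituted body in a non-capturing group to keep precedence.
            body = re.sub(rf"\b(?:{names})\b",
                          lambda m: f"(?:{expanded[m.group()]})",
                          body)
        expanded[name] = body
    return expanded

patterns = expand(definitions)
print(patterns["id"])   # (?:[A-Za-z])((?:[A-Za-z])|(?:[0-9]))*

ID = re.compile(patterns["id"])
print(bool(ID.fullmatch("value1")))   # True
print(bool(ID.fullmatch("9lives")))   # False: must begin with a letter
```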

**Shorthands**

Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.

**1. One or more instances (+):**

- The unary postfix operator + means “one or more instances of”.

- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.

- Thus the regular expression a+ denotes the set of all strings of one or more a’s.

- The operator + has the same precedence and associativity as the operator *.

**2. Zero or one instance (?):**

- The unary postfix operator ? means “zero or one instance of”.

- The notation r? is a shorthand for r | ε.

- If ‘r’ is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ { ε }.

**3. Character classes:**

- The notation [abc] where a, b and c are alphabet symbols denotes the regular expression a | b | c.

- A character class such as [a-z] denotes the regular expression a | b | c | d | …. | z.

- We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*
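The three shorthands combine naturally when specifying numeric tokens. The pattern below is an illustrative assumption, not something given in the text: it describes unsigned numbers with an optional fraction and an optional exponent, written in Python's `re` syntax.

```python
import re

# Illustrative pattern for unsigned numbers (an assumption for this sketch):
#   [0-9]+          one or more digits         (character class with +)
#   (\.[0-9]+)?     optional fraction          (? means zero or one instance)
#   (E[+-]?[0-9]+)? optional exponent with an optional sign
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

def is_number(lexeme):
    """True if the whole lexeme matches the NUMBER pattern."""
    return NUMBER.fullmatch(lexeme) is not None

for lexeme in ["100", "3.14", "6.02E23", "1E-9", "3.", ".5", "E5"]:
    print(lexeme, is_number(lexeme))
```

The first four lexemes are accepted. The last three are rejected, because the pattern requires at least one digit before any fraction or exponent and at least one digit after the dot.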

**Non-regular Set**

A language which cannot be described by any regular expression is a non-regular set. Example: The set of all strings of balanced parentheses cannot be described by a regular expression, nor can a set of repeating strings. The set of balanced parentheses can, however, be specified by a context-free grammar.
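An informal way to see why balanced parentheses fall outside the regular sets is that recognizing them requires counting to an arbitrary depth. The sketch below (the name `is_balanced` is hypothetical) uses an unbounded integer counter, which is memory that a recognizer with a fixed, finite number of states does not have.

```python
def is_balanced(text):
    """Check that every '(' has a matching ')' by tracking the nesting depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' appeared before its matching '('
                return False
    return depth == 0            # every '(' must have been closed

print(is_balanced("(()())"))  # True
print(is_balanced("(()"))     # False
print(is_balanced(")("))      # False
```

This is why lexical analyzers leave nesting checks to the syntax analyzer, which works from a context-free grammar.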
