specification of tokens in compiler design
Compiler Design
Compiler Design - Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code produced by language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer: it reads character streams from the source code, checks for legal tokens, and passes them to the syntax analyzer on demand.
Tokens
A lexeme is a sequence of characters in the source program that matches the pattern for a token. Predefined rules determine whether a lexeme is a valid token. These rules are given by the grammar of the language as patterns: a pattern describes what can be a token, and patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols are all tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
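The breakdown above can be reproduced with a small regex-driven scanner. This is only a minimal sketch: the `TOKEN_SPEC` table, the handful of keywords, and the `tokenize` helper are assumptions made for illustration, not part of any real compiler.

```python
import re

# Hypothetical token patterns for a tiny C-like fragment. Order matters:
# the keyword pattern must be tried before the identifier pattern.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:int|float|char|return)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("constant",   r"[0-9]+"),
    ("operator",   r"[=+\-*/]"),
    ("symbol",     r"[;,(){}]"),
    ("skip",       r"\s+"),   # whitespace is discarded
    ("invalid",    r"."),     # any other character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token_type, lexeme) pairs for the source text."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue
        if kind == "invalid":
            raise SyntaxError(f"invalid token {lexeme!r} at offset {match.start()}")
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("int value = 100;"))
```

Each lexeme is paired with the name of the pattern that matched it, and a character that matches no pattern raises an error, mirroring the behaviour described for invalid tokens.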
Specifications of Tokens
Let us see how language theory defines the following terms:
Alphabets
An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols (characters) from an alphabet is called a string. The length of a string is the total number of occurrences of symbols in it, e.g., the length of the string tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string with no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language contains the following symbols:
Arithmetic Symbols: Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
Punctuation: Comma(,), Semicolon(;), Dot(.), Arrow(->)
Assignment: =
Special Assignment: +=, /=, *=, -=
Comparison: ==, !=, <, <=, >, >=
Preprocessor: #
Location Specifier: &
Logical: &, &&, |, ||, !
Shift Operator: >>, >>>, <<, <<<
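Several of these symbols are prefixes of others (`>` of `>=`, `>>` and `>>>`), so a scanner has to prefer the longest lexeme. The sketch below shows one way to get that behaviour from an ordered regex alternation; the `OPERATORS` list is a subset of the table above and `split_operators` is a hypothetical helper.

```python
import re

# A subset of the operators listed above. Sorting longest-first makes the
# ordered alternation try ">>>" before ">>" and ">>" before ">".
OPERATORS = ["+", "-", "%", "*", "/", "=", "+=", "/=", "*=", "-=",
             "==", "!=", "<", "<=", ">", ">=", "&", "&&", "|", "||", "!",
             ">>", ">>>", "<<", "->"]
OPERATOR_RE = re.compile(
    "|".join(re.escape(op) for op in sorted(OPERATORS, key=len, reverse=True))
)


def split_operators(text):
    """Return the operator lexemes found in text, longest match first."""
    return OPERATOR_RE.findall(text)


print(split_operators(">>>="))
print(split_operators("&&&"))
```

Without the length-based sort, an alternation that listed `>` first would split `>>>` into three separate tokens.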
Language
A language is a set of strings over some finite alphabet; the set itself may be finite or infinite. Because computer languages are treated as sets of strings, set operations can be performed on them mathematically. Regular languages, which include every finite language, can be described by means of regular expressions.
Regular Expressions
The lexical analyzer needs to scan and identify only the valid strings (tokens, lexemes) that belong to the language in hand. It searches for the pattern defined by the language rules.
Regular expressions can express regular languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar. The language defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for a set of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
Union of two languages L and M is written as
L ∪ M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
The Kleene Closure of a language L is written as
L* = the set of strings formed by concatenating zero or more strings from L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) ∪ L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Precedence and Associativity
*, concatenation (.), and | (pipe sign) are left associative
* has the highest precedence
Concatenation (.) has the second highest precedence.
| (pipe sign) has the lowest precedence of all.
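Python's `re` module follows the same precedence rules, so they can be observed directly. In this sketch the patterns and the `accepts` helper are chosen only for illustration: `a|bc*` is compared with its fully parenthesised reading and with a different grouping.

```python
import re


def accepts(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None


IMPLICIT = r"a|bc*"               # relies on precedence
EXPLICIT = r"(?:a)|(?:b(?:c*))"   # star first, then concatenation, then union
OTHER = r"(?:a|b)c*"              # a different grouping, a different language

for s in ["a", "b", "bcc", "ac", "bcbc"]:
    print(s, accepts(IMPLICIT, s), accepts(EXPLICIT, s), accepts(OTHER, s))
```

The first two patterns agree on every sample, while the third also accepts `ac`, which shows that `*` applied only to `c` and that concatenation bound tighter than `|`.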
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrences of x.
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
x+ means one or more occurrences of x.
i.e., it can generate { x, xx, xxx, xxxx, … }
Source: www.tutorialspoint.com
What is Specification of Tokens? Regular Expression & Definition
2nd March 2022 by Neha T
Specification of tokens depends on the pattern of the lexeme. Here we will be using regular expressions to specify the different types of patterns that can actually form tokens.
Regular expressions cannot specify every pattern that forms a token, but they cover almost all of the patterns that do.
Content: Specification of Tokens
String and Languages
Operation on Languages
Regular Expression Regular Definition
String and Languages
String
A string is a finite sequence of symbols drawn from an alphabet. An alphabet is a finite set of symbols; symbols can be letters, digits and punctuation.
Example 1: The set of symbols {0, 1} forms the binary alphabet, which has only two symbols.
The ASCII system, used in almost every computer, encodes the letter A as a string over this binary alphabet: A = 01000001.
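The ASCII encoding of A can be checked in a couple of lines. The helper names `to_binary` and `from_binary` are hypothetical, and the fixed 8-bit width is an assumption that matches the example above.

```python
def to_binary(ch):
    """Encode a single character as an 8-symbol string over the alphabet {0, 1}."""
    return format(ord(ch), "08b")


def from_binary(bits):
    """Decode a string over {0, 1} back to the character it encodes."""
    return chr(int(bits, 2))


print(to_binary("A"))           # 01000001
print(from_binary("01000001"))  # A
```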
Example 2: The Unicode system defines an alphabet by assigning a unique number to each character. It contains well over 100,000 characters from writing systems around the world, including emojis.
Length of String
The length of a string is the number of symbols in it. A string is usually written as ‘s’, and |s| denotes its length. Consider the string:
s = banana
|s| = 6
Note: The empty string, i.e. the string of length 0, is denoted by ‘ε’.
Language
A language is a set of strings over some fixed alphabet. For example, the English language is a set of strings over the fixed alphabet ‘a to z’.
Terms Related to String
1. Prefix of String
A prefix of a string s is any string obtained by removing zero or more symbols from the end of s. The prefixes include ε and the string s itself.
For example: s = abcd
Prefixes of the string abcd: ε, a, ab, abc, abcd
2. Suffix of String
A suffix of a string s is any string obtained by removing zero or more symbols from the beginning of s. The suffixes include ε and the string s itself.
For example: s = abcd
Suffixes of the string abcd: ε, d, cd, bcd, abcd
3. Proper Prefix of String
The proper prefixes of a string are all of its prefixes excluding ε and the string s itself.
Proper prefixes of the string abcd: a, ab, abc
4. Proper Suffix of String
The proper suffixes of a string are all of its suffixes excluding ε and the string s itself.
Proper suffixes of the string abcd: d, cd, bcd
5. Substring of String
A substring of a string s is obtained by deleting any prefix and any suffix from the string.
Substrings of the string abcd: ε, abcd, bcd, abc, …
6. Proper Substring of String
The proper substrings of a string s are all the substrings of s excluding ε and the string s itself.
Proper substrings of the string abcd: bcd, abc, cd, ab, …
7. Subsequence of String
A subsequence of a string is obtained by eliminating zero or more (not necessarily consecutive) symbols from the string.
Subsequences of the string abcd: abd, bcd, bd, …
8. Concatenation of String
If s and t are two strings, then st denotes their concatenation.
s = abc
t = def
Concatenation of string s and t i.e. st = abcdef
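The terms above translate almost directly into slicing. The following is a minimal sketch; every function name is made up for this illustration, and the empty string ε is represented by Python's `""`.

```python
from itertools import combinations


def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]


def substrings(s):
    """Every string obtained by deleting a prefix and a suffix of s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def subsequences(s):
    """Every string obtained by deleting zero or more symbols of s."""
    return {"".join(s[i] for i in idx)
            for r in range(len(s) + 1)
            for idx in combinations(range(len(s)), r)}


def proper(strings, s):
    """Drop the empty string and s itself, as the 'proper' variants require."""
    return [t for t in strings if t not in ("", s)]


print(prefixes("abcd"))
print(proper(suffixes("abcd"), "abcd"))
print("abd" in subsequences("abcd"), "abd" in substrings("abcd"))
```

The last line highlights the difference between the two notions: abd is a subsequence of abcd but not a substring, because the deleted symbol c sits in the middle.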
Operation on Languages
As we have learnt, a language is a set of strings constructed over some fixed alphabet. The operations that can be performed on languages are:
1. Union
Union is the most common set operation. Consider two languages L and M. Their union is denoted by:
L ∪ M = { s | s is in L or s is in M }
That is, a string s in the union of the two languages comes either from language L or from language M.
If L = {a, b} and M = {c, d}
Then L ∪ M = {a, b, c, d}
2. Concatenation
Concatenation joins each string of one language to each string of another language, in all possible ways. The concatenation of two languages is denoted by:
L ⋅ M = { st | s is in L and t is in M }
If L = {a, b} and M = {c, d}
Then L ⋅ M = {ac, ad, bc, bd}
3. Kleene Closure
The Kleene closure of a language L is the set of strings obtained by concatenating L zero or more times. It is denoted by:
L* = L⁰ ∪ L¹ ∪ L² ∪ …, where L⁰ = {ε}
If L = {a, b}
L* = {ε, a, b, aa, ab, ba, bb, aaa, …}
4. Positive Closure
The positive closure of a language L is the set of strings obtained by concatenating L one or more times. It is denoted by:
L+ = L¹ ∪ L² ∪ L³ ∪ …
It is similar to the Kleene closure, except for the term L⁰: L+ excludes ε unless ε is in L itself.
If L = {a, b}
L+ = {a, b, aa, ab, ba, bb, aaa, …}
So, these are the four operations that can be performed on the languages in the lexical analysis phase.
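Both closures are infinite sets, so code can only enumerate them up to a bound. The sketch below represents languages as Python sets of strings; `concat`, `power`, `kleene_upto` and `positive_upto` are hypothetical helpers, and the bound `n` is the assumption that keeps the result finite.

```python
def concat(L, M):
    """Concatenation: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, n):
    """L concatenated with itself n times; power(L, 0) is {''}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_upto(L, n):
    """Finite approximation of the Kleene closure: L^0 through L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out


def positive_upto(L, n):
    """Finite approximation of the positive closure: L^1 through L^n."""
    out = set()
    for i in range(1, n + 1):
        out |= power(L, i)
    return out


L, M = {"a", "b"}, {"c", "d"}
print(sorted(L | M))               # union
print(sorted(concat(L, M)))        # concatenation
print(sorted(kleene_upto(L, 2)))   # includes the empty string
print(sorted(positive_upto(L, 2))) # same strings without the empty string
```

Note that mixed strings such as ab and ba appear in both closures, since each concatenation step may pick any string of L.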
Regular Expression
A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression is helpful in describing the languages that can be built using operators such as union, concatenation, and closure over the symbols.
A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular expression using the rules given below.
Specification of Tokens
There are 3 specifications of tokens: 1) Strings 2) Language 3) Regular expression
Chapter: Principles of Compiler Design : Lexical Analysis
Strings and Languages
- An alphabet or character class is a finite set of symbols.
- A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
- A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are neither ε nor equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
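The banana examples in the list above can be checked with simple membership tests. This is a small sketch: `is_subsequence` is a hypothetical helper, while the prefix, suffix and substring checks use Python's built-in string operations.

```python
def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more symbols of s."""
    remaining = iter(s)
    # "ch in remaining" advances the iterator past the first match, so the
    # symbols of t must be found in order, left to right.
    return all(ch in remaining for ch in t)


s = "banana"
print(s.startswith("ban"))        # ban is a prefix
print(s.endswith("nana"))         # nana is a suffix
print("nan" in s)                 # nan is a substring
print(is_subsequence("baan", s))  # baan is a subsequence
print("baan" in s)                # but not a substring
```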
Operations on languages:
The following are the operations that can be applied to languages:
1. Union 2. Concatenation 3. Kleene closure 4. Positive closure
The following example shows the operations on languages: Let L={0,1} and S={a,b,c}
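With L and S as given, the finite operations can be worked out directly. This sketch uses Python sets; the variable names `union`, `LS` and `L4` are chosen here for illustration, and the closures are omitted because they are infinite.

```python
from itertools import product

L = {"0", "1"}
S = {"a", "b", "c"}

union = L | S                                      # L ∪ S
LS = {s + t for s, t in product(L, S)}             # concatenation LS
L4 = {"".join(p) for p in product(L, repeat=4)}    # L concatenated four times

print(sorted(union))
print(sorted(LS))
print(len(L4))
```

The union has five strings, LS has six (one for each pair of symbols), and L4 contains all sixteen binary strings of length four.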
Regular Expressions
· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
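Rules 1 to 3 form a recursive definition, which a recursive function can follow case by case. The sketch below is one possible encoding, not a standard API: the classes `Eps`, `Sym`, `Alt`, `Cat` and `Star` and the function `language` are invented for this illustration, and `max_len` is an assumed bound that keeps L(r) finite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:
    """Rule 1: the regular expression ε."""


@dataclass(frozen=True)
class Sym:
    """Rule 2: a single symbol of the alphabet."""
    ch: str


@dataclass(frozen=True)
class Alt:
    """Rule 3a: (r)|(s)."""
    left: object
    right: object


@dataclass(frozen=True)
class Cat:
    """Rule 3b: (r)(s)."""
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    """Rule 3c: (r)*."""
    inner: object


def language(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Sym):
        return {r.ch} if max_len >= 1 else set()
    if isinstance(r, Alt):
        return language(r.left, max_len) | language(r.right, max_len)
    if isinstance(r, Cat):
        return {s + t
                for s in language(r.left, max_len)
                for t in language(r.right, max_len)
                if len(s + t) <= max_len}
    if isinstance(r, Star):
        base = language(r.inner, max_len) - {""}
        result, frontier = {""}, {""}
        while frontier:  # each round appends one more non-empty piece
            frontier = {s + t for s in frontier for t in base
                        if len(s + t) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


# a(a|b)* restricted to strings of length at most 2
r = Cat(Sym("a"), Star(Alt(Sym("a"), Sym("b"))))
print(sorted(language(r, 2)))
```

Rule 3d needs no class of its own here, because parentheses only group: the nesting of the objects already records the grouping.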
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms.
For instance, | is commutative: r|s = s|r; and | is associative: r|(s|t) = (r|s)|t.
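Such laws can be sanity-checked by comparing the strings two expressions accept up to a fixed length. This is only evidence, not a proof of equivalence; the helpers `bounded_language` and `agree`, the alphabet "ab" and the bound of 4 are all assumptions of this sketch.

```python
import re
from itertools import product


def bounded_language(pattern, alphabet="ab", max_len=4):
    """All strings over alphabet, up to max_len symbols, that pattern matches."""
    compiled = re.compile(pattern)
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)
            if compiled.fullmatch("".join(p))}


def agree(p, q, **kwargs):
    """True if p and q accept exactly the same strings within the bound."""
    return bounded_language(p, **kwargs) == bounded_language(q, **kwargs)


print(agree(r"a|b", r"b|a"))                # | is commutative
print(agree(r"a|(b|ab)", r"(a|b)|ab"))      # | is associative
print(agree(r"ab", r"ba"))                  # concatenation is not commutative
```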
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. Regular definition for this set:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
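A regular definition maps naturally onto named pattern strings, where each later name is built from the earlier ones. In this sketch the Python variable names mirror the definition above, and `is_identifier` is a hypothetical helper.

```python
import re

# Each name may refer only to the names defined before it.
letter = "[A-Za-z]"
digit = "[0-9]"
ident = f"{letter}(?:{letter}|{digit})*"   # id → letter ( letter | digit )*

ID_RE = re.compile(ident)


def is_identifier(s):
    """True if the whole string s matches the id definition."""
    return ID_RE.fullmatch(s) is not None


for s in ["value", "x1", "1x", ""]:
    print(repr(s), is_identifier(s))
```

Because the definition allows only letters and digits, a name such as `a_b` is rejected; real languages usually extend `letter` to include the underscore.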
Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. The unary postfix operator + means “one or more instances of”.
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a’s.
- The operator + has the same precedence and associativity as the operator *.
2. The unary postfix operator ? means “zero or one instance of”.
- The notation r? is a shorthand for r | ε.
- If ‘r’ is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
3. The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a-z] denotes the regular expression a | b | c | d | … | z.
- We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*
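Each shorthand is equivalent to a longer expression built from the basic operators, which can be spot-checked on a few strings. The pattern pairs and sample strings below are chosen for illustration, and `same_on_samples` is a hypothetical helper; agreement on samples is a check, not a proof.

```python
import re

# (shorthand, expansion using only |, concatenation, * and ε)
PAIRS = [
    (r"a+", r"aa*"),         # one or more
    (r"ab?", r"a(?:b|)"),    # zero or one: b? is b | ε
    (r"[abc]", r"a|b|c"),    # character class
]
SAMPLES = ["", "a", "aa", "ab", "abb", "b", "c", "d"]


def same_on_samples(p, q):
    """True if p and q accept exactly the same strings from SAMPLES."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
               for s in SAMPLES)


for shorthand, expansion in PAIRS:
    print(shorthand, expansion, same_on_samples(shorthand, expansion))
```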
Non-regular Set
A language that cannot be described by any regular expression is a non-regular set. Example: The set of all strings of balanced parentheses, and the set of repeating strings, cannot be described by a regular expression. The set of balanced parentheses can, however, be specified by a context-free grammar.
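Recognizing balanced parentheses requires counting to an unbounded depth, which is exactly what a regular expression cannot do. A minimal sketch of such a recognizer follows; `is_balanced` is a hypothetical helper that assumes the input consists only of parentheses and ignores any other character.

```python
def is_balanced(text):
    """True if every '(' in text is closed by a later matching ')'."""
    depth = 0  # number of currently open parentheses; has no fixed upper limit
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


for s in ["(()())", ")(", "(()", ""]:
    print(repr(s), is_balanced(s))
```

The `depth` counter can grow without limit, whereas a finite automaton has only a fixed number of states, which is the intuition behind the claim above.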