max@pseudogen:~$: A Simple Lexical Analyzer .... Enjoy.

I got bored and wrote a simple lexical analyzer and thought about walking through explaining it but in all honesty its rather self explanatory especially with the inclusion of the diagrams I made and the BNF grammar I am supplying to go with the code snippet.

The grammar simply parses words or phrases in a file with white space as a delimiter and requiring that all words or phrases start with a letter and are followed by any combination of letters and digits.

BNF:

G[<Word>]

<Word> ::= <Letter> | <Letter> <LetterDigit>
<LetterDigit> ::= <Letter> | <Digit> | <LetterDigit> <Letter> | <LetterDigit> <Digit>
<Letter> ::= a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

A graphical representation of the grammar:

And the finite state automaton:

Finally, the code ... in python of course :)

Note: If you copy and paste this code it will not work because for one reason or another I can't get blogger to format the tabs correctly, please download the code from the link provided at the bottom.

UPDATE: Now featuring code indentation goodness.

###############################


#!/usr/bin/python
import sys

class SimpleLex:
    """@Author: Adam Miller - Simple lexical anyzer"""
 
    #allowed alphabetic characters in grammar
    alpha = 'abcdefghijklmnopqrstuvwxyz'
 
    #allowed digits in grammar
    digit = '0123456789'
 
    #white space symbolic representation
    wSpace = ' \n \t'
 
    #grammar table of relations
    table = [
            [1,3,0],
            [1,1,2],
            [2,2,2],
            [3,3,3],
            ]
   
    #relations of type and their state index
    relations = {
        "alpha":0, 
 "digit":1,
 "wSpace":2
 }


    def __init__(self, file_name):
        """open a file, extract contents, close file,
        initialize character list, and state variable"""
        self.f = open(file_name, 'r')
        self.lines = self.f.readlines()
        self.f.close()
 self.chars = []
 self.state = 0
  
    def scan(self):
 """process the file's contents one character at a time"""
 for line in self.lines:
     for c in line:
                # 3 is the error state
  if self.state == 3:
      print "Error"
      self.state = 0
      self.chars = []  
  # 2 is the success state  
  elif self.state == 2:
      print ''.join(self.chars)
      self.state = 0
      self.chars = []
     
  #let table drive state transitions 
  self.state = self.table[self.state][self.lex(c)]
  
        """print final character buffer because the loop will end
 execution and not let the last success state be checked"""
 print ''.join(self.chars)
    
    def lex(self, c):
 """return the correct relational index to the type of c"""
 self.chars.append(c)
 if c in self.alpha:
            return self.relations["alpha"]
 elif c in self.digit:
     return self.relations["digit"]
 elif c in self.wSpace:
     return self.relations["wSpace"]
 else:
     return 0

#use the code i just wrote
if len(sys.argv) < 2 or len(sys.argv) > 2:
    sys.stderr.write("Usage: sampleLex  \n")
elif len(sys.argv) == 2:
    lexer = SimpleLex(sys.argv[1])
    lexer.scan()

###############################

For your viewing pleasure, a small example of the "little lexer" in action:

###############################
adam@pseudogen:~$ ./simpleLex.py lexFile.txt
hello
world
from
my
ub3r
l33t
lexical
analyzer

###############################

The code, the diagrams and the file used in this example are all available here.

max@pseudogen:~$

Thursday, September 06, 2007

A Simple Lexical Analyzer .... Enjoy.

1 comment: