alex gaynor's blago-blog

Posts tagged with lex

Using PLY for Parsing Without Using it for Lexing

Posted November 23rd, 2009. Tagged with python, ply, lex, yacc, parse.

Over the past week or so I've been struggling with attempting to write my own parser (or parser generator) by hand. A few days ago I finally decided to give up on this notion (after all the parser isn't my end goal) as it was draining my time from the interesting work to be done. However, I wanted to keep my existing lexer. I wrote the lexer by hand in the method I described in a previous post, it's fast, easy to read, and I rather like my handiwork, so I wanted to keep it if possible. I've used PLY before (as I described last year) so I set out to see if it would be possible to use it for parsing without using it for lexing.

As it turns out PLY expects only a very minimal interface from it's lexer. In fact it only needs one method, token(), which returns a new token (or None at the end). Tokens are expected to have just 4 attributes. Having this knowledge I now set out to write a pair of compatibility classes for my existing lexer and token classes, I wanted to do this without altering the lexer/token API so that if and when I finally write my own parser I don't have to remove legacy compatibility stuff. My compatibility classes are very small, just this:

class PLYCompatLexer(object):
    def __init__(self, text):
        self.text = text
        self.token_stream = Lexer(text).parse()

    def token(self):
        try:
            return PLYCompatToken(self.token_stream.next())
        except StopIteration:
            return None


class PLYCompatToken(object):
    def __init__(self, token):
        self.type = token.name
        self.value = token.value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        return "<Token: %r %r>" % (self.type, self.value)

This is the entirety of the API that PLY needs. Now I can write my parser exactly as I would normally with PLY.

You can find the rest here. There are view comments.

Things College Taught me that the "Real World" Didn't

Posted November 21st, 2009. Tagged with django, python, compile, ply, lex, yacc, c++, response, compiler, pypy, unladen-swallow, programming-languages, parse, college.

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.

Things I learned in college:

  • Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
  • C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
  • Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).

Things I learned in the real world:

  • Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
  • Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
  • APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.

So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>

You can find the rest here. There are view comments.

Writing a Lexer

Posted November 17th, 2009. Tagged with python, compile, lex, programming-languages, parse.

People who have been reading this blog since last year (good lord) may recall that once upon a time I did a short series of posts on lexing and parsing using PLY. Back then I was working on a language named Al. This past week or so I've started working on another personal project (not public yet) and I've once again had the need to lex things, but this time I wrote my lexer by hand, instead of using any sort of generator. This has been an exceptional learning experience, so I'd like to pass some of that on to you.

The first thing to note is that writing a lexer is a great place to TDD (test driven development), I've rewritten various parts of my lexer five or more times, I've needed my tests to keep me sane. Got your tests written? Ok it's time to dive right into our lexer.

I've structured my lexer as a single class that takes an input string, and has a parse method which returns a generator that yields tokens (tokens are just a namedtuple with a name and value field). The parser has two important attributes, state which is a string that says what state the lexer is in (this is used for tokens that are more than one character long), and current_val which is a list containing characters that will eventually become the value for the current token being found.

The parse method iterates through characters in the text and then it checks, if the parser has a state (self.state is not None) it does getattr(self, self.state)(character). Otherwise it calls self.generic(character). Then the various "state methods" are responsible for mutating self.current_val and self.state and returning a Token. So for example the string state looks like this:

def string(self, ch):
    if ch == '"':
        sym = Symbol("STRING", "".join(self.current_val))
        self.current_val = []
        self.state = None
        return sym
    elif ch == "\\":
        self.state = "string_escaped"
    else:
        self.current_val.append(ch)

If the character is a quote then we're closing our string so we return our string Symbol, reset the current_val and state. If the character is a then we switch into a string_escaped state which knows to handle the character as a literal and then go back to string state. If the character is anything else then we just append it to the current_val, it will get handled at the end of the string.

I've found this to be an exceptionally powerful method, and it makes my end result code very readable. Hopefully I'll be able to reveal my project in the coming weeks, as I'm very excited about it, even if it's not ready I'll continue to share these lessons learned as I go.

You can find the rest here. There are view comments.

Getting Started With PLY - Part 3

Posted November 10th, 2008. Tagged with python, ply, lex, yacc.

As promised, today we'll be looking at implementing additional arithmetic operations, dealing with order of operations, and adding variables to our languages, so without further ado, let's jump into the code.

We can replace our old addition rule with this:

import operator
def p_expression_arithmetic(p):
   '''
   expression : expression PLUS expression
              | expression MINUS expression
              | expression TIMES expression
              | expression DIVIDE expression
   '''
   OPS = {
       '+': operator.add,
       '-': operator.sub,
       '*': operator.mul,
       '/': operator.div
   }
   p[0] = OPS[p[2]](p[1], p[3])

Hopefully what this code does is pretty clear, the | operator in the rule is an or option. So if we match any of these, we get the correct function out of our ops dictionary(if you aren't familiar with operator module check it out, it's awesome), and then call it with the two arguments.

This handles the arithmetic correctly, but doesn't handle order of operations, so lets add that in:

precedence = (
   ('left', 'PLUS', 'MINUS'),
   ('left', 'TIMES', 'DIVIDE'),
)

What this says is all these operations are left-associative, and TIMES and DIVIDE have a high precedence than PLUS and MINUS(both groupings have equal precedence, and thus read left to right).

Now that we have a fully functioning calculator, let's add in variables, first we need to add a token for NAMES (variables) and for the assignment operator:

def t_NAME(t):
   r'[a-zA-Z_][a-zA-Z_0-9]*'
   return t

t_EQ = r'='

And of course add NAME and EQ to the list of tokens, and now a few parsing rules:

names = {}

def p_expression_name(p):
   '''
   expression : NAME
   '''
   p[0] = names[p[1]]

def p_assignment(p):
   '''
   assignment : NAME EQ expression
   '''
   names[p[1]] = p[3]

So here we define a names dictionary, it will map variables to values. Hopefully the parse rules are fairly obvious, and everything makes sense.

You can find the rest here. There are view comments.

Getting Started With PLY

Posted November 8th, 2008. Tagged with python, ply, lex.

The other day I mentioned I was using PLY in my post about building a language, so today I'm going to describe getting started with PLY, specifically the tokenization phase. For those who don't know much about parsing a language, the tokenization phase is where we take the source file, and turn it into a series of tokens. For example, turning a = 3 + 4 into, NAME EQUALS 3 PLUS 4. As you can see that simple assignment becomes 5 tokens, each number is a token, both numbers are tokens, and a is a NAME token. So how do we do this in PLY?

PLY's method for defining tokenization rules is very creative. First you define a list of tokens, for example:

tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
)

Here we have defined the types of tokens we will define, what each of these is should be self explanatory. Then we define some rules, they look like this:

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        t.value = 0
    return t

This is probably less obvious. There are 2 ways to define the rules for a token, either as a string, or as a function. Either way they are named t_TOKEN_NAME. For a lot of tokens you can just do the string, those are the ones that don't require processing, and the string is just a regex that matches the token. For things that do need processing, we can define a function. The function takes 1 parameter, which is a lexer object, as you can see in our example, we take in t, and since we are defining a number token we set the value to be the integer for the string representation from the source code. The interesting thing here is how we define the rule for, PLY uses the docstring for a function to get the regex for it.

Now that we have all of our rules set up we need to actually build the lexer object:

lexer = lex.lex()

And then we can use the input() function on a lexer to provide the source code, and the token function to pop the next token off the lexer.

That's all for today, in the future we'll take a look at the other components of building the grammar of a language, and at how we implement it. For more information now, PLY has excellent documentation, available here.

You can find the rest here. There are view comments.