alex gaynor's blago-blog

Posts tagged with parse

Using PLY for Parsing Without Using it for Lexing

Posted November 23rd, 2009. Tagged with python, ply, lex, yacc, parse.

Over the past week or so I've been struggling with attempting to write my own parser (or parser generator) by hand. A few days ago I finally decided to give up on this notion (after all the parser isn't my end goal) as it was draining my time from the interesting work to be done. However, I wanted to keep my existing lexer. I wrote the lexer by hand in the method I described in a previous post, it's fast, easy to read, and I rather like my handiwork, so I wanted to keep it if possible. I've used PLY before (as I described last year) so I set out to see if it would be possible to use it for parsing without using it for lexing.

As it turns out PLY expects only a very minimal interface from it's lexer. In fact it only needs one method, token(), which returns a new token (or None at the end). Tokens are expected to have just 4 attributes. Having this knowledge I now set out to write a pair of compatibility classes for my existing lexer and token classes, I wanted to do this without altering the lexer/token API so that if and when I finally write my own parser I don't have to remove legacy compatibility stuff. My compatibility classes are very small, just this:

class PLYCompatLexer(object):
    def __init__(self, text):
        self.text = text
        self.token_stream = Lexer(text).parse()

    def token(self):
        try:
            return PLYCompatToken(self.token_stream.next())
        except StopIteration:
            return None


class PLYCompatToken(object):
    def __init__(self, token):
        self.type = token.name
        self.value = token.value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        return "<Token: %r %r>" % (self.type, self.value)

This is the entirety of the API that PLY needs. Now I can write my parser exactly as I would normally with PLY.

You can find the rest here. There are view comments.

Things College Taught me that the "Real World" Didn't

Posted November 21st, 2009. Tagged with django, python, compile, ply, lex, yacc, c++, response, compiler, pypy, unladen-swallow, programming-languages, parse, college.

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.

Things I learned in college:

  • Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
  • C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
  • Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).

Things I learned in the real world:

  • Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
  • Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
  • APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.

So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>

You can find the rest here. There are view comments.

Writing a Lexer

Posted November 17th, 2009. Tagged with python, compile, lex, programming-languages, parse.

People who have been reading this blog since last year (good lord) may recall that once upon a time I did a short series of posts on lexing and parsing using PLY. Back then I was working on a language named Al. This past week or so I've started working on another personal project (not public yet) and I've once again had the need to lex things, but this time I wrote my lexer by hand, instead of using any sort of generator. This has been an exceptional learning experience, so I'd like to pass some of that on to you.

The first thing to note is that writing a lexer is a great place to TDD (test driven development), I've rewritten various parts of my lexer five or more times, I've needed my tests to keep me sane. Got your tests written? Ok it's time to dive right into our lexer.

I've structured my lexer as a single class that takes an input string, and has a parse method which returns a generator that yields tokens (tokens are just a namedtuple with a name and value field). The parser has two important attributes, state which is a string that says what state the lexer is in (this is used for tokens that are more than one character long), and current_val which is a list containing characters that will eventually become the value for the current token being found.

The parse method iterates through characters in the text and then it checks, if the parser has a state (self.state is not None) it does getattr(self, self.state)(character). Otherwise it calls self.generic(character). Then the various "state methods" are responsible for mutating self.current_val and self.state and returning a Token. So for example the string state looks like this:

def string(self, ch):
    if ch == '"':
        sym = Symbol("STRING", "".join(self.current_val))
        self.current_val = []
        self.state = None
        return sym
    elif ch == "\\":
        self.state = "string_escaped"
    else:
        self.current_val.append(ch)

If the character is a quote then we're closing our string so we return our string Symbol, reset the current_val and state. If the character is a then we switch into a string_escaped state which knows to handle the character as a literal and then go back to string state. If the character is anything else then we just append it to the current_val, it will get handled at the end of the string.

I've found this to be an exceptionally powerful method, and it makes my end result code very readable. Hopefully I'll be able to reveal my project in the coming weeks, as I'm very excited about it, even if it's not ready I'll continue to share these lessons learned as I go.

You can find the rest here. There are view comments.