Over the past week or so I've been struggling with attempting to write my own parser (or parser generator) by hand. A few days ago I finally decided to give up on this notion (after all the parser isn't my end goal) as it was draining my time from the interesting work to be done. However, I wanted to keep my existing lexer. I wrote the lexer by hand in the method I described in a previous post, it's fast, easy to read, and I rather like my handiwork, so I wanted to keep it if possible. I've used PLY before (as I described last year) so I set out to see if it would be possible to use it for parsing without using it for lexing.
As it turns out PLY expects only a very minimal interface from it's lexer. In fact it only needs one method, token(), which returns a new token (or None at the end). Tokens are expected to have just 4 attributes. Having this knowledge I now set out to write a pair of compatibility classes for my existing lexer and token classes, I wanted to do this without altering the lexer/token API so that if and when I finally write my own parser I don't have to remove legacy compatibility stuff. My compatibility classes are very small, just this:
class PLYCompatLexer(object):
def __init__(self, text):
self.text = text
self.token_stream = Lexer(text).parse()
def token(self):
try:
return PLYCompatToken(self.token_stream.next())
except StopIteration:
return None
class PLYCompatToken(object):
def __init__(self, token):
self.type = token.name
self.value = token.value
self.lineno = None
self.lexpos = None
def __repr__(self):
return "<Token: %r %r>" % (self.type, self.value)
This is the entirety of the API that PLY needs. Now I can write my parser exactly as I would normally with PLY.
You can find the rest here. There are view comments.
A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.
- Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
- C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
- Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).
- Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
- Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
- APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.
So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>
As promised, today we'll be looking at implementing additional arithmetic operations, dealing with order of operations, and adding variables to our languages, so without further ado, let's jump into the code.
We can replace our old addition rule with this:
import operator
def p_expression_arithmetic(p):
'''
expression : expression PLUS expression
| expression MINUS expression
| expression TIMES expression
| expression DIVIDE expression
'''
OPS = {
'+': operator.add,
'-': operator.sub,
'*': operator.mul,
'/': operator.div
}
p[0] = OPS[p[2]](p[1], p[3])
Hopefully what this code does is pretty clear, the | operator in the rule is an or option. So if we match any of these, we get the correct function out of our ops dictionary(if you aren't familiar with operator module check it out, it's awesome), and then call it with the two arguments.
This handles the arithmetic correctly, but doesn't handle order of operations, so lets add that in:
precedence = (
('left', 'PLUS', 'MINUS'),
('left', 'TIMES', 'DIVIDE'),
)
What this says is all these operations are left-associative, and TIMES and DIVIDE have a high precedence than PLUS and MINUS(both groupings have equal precedence, and thus read left to right).
Now that we have a fully functioning calculator, let's add in variables, first we need to add a token for NAMES (variables) and for the assignment operator:
def t_NAME(t):
r'[a-zA-Z_][a-zA-Z_0-9]*'
return t
t_EQ = r'='
And of course add NAME and EQ to the list of tokens, and now a few parsing rules:
names = {}
def p_expression_name(p):
'''
expression : NAME
'''
p[0] = names[p[1]]
def p_assignment(p):
'''
assignment : NAME EQ expression
'''
names[p[1]] = p[3]
So here we define a names dictionary, it will map variables to values. Hopefully the parse rules are fairly obvious, and everything makes sense.
You can find the rest here. There are view comments.
Yesterday we created our tokens, and using these we can parse our language (which right now is a calculator) into some tokens. Unfortunately this isn't very useful. So today we are going to start writing a grammar, and building an interpreter around it.
In PLY, grammar rules are defined similarly to tokens, that is, using docstrings. Here's what a few grammar rules for out language might look like:
def p_expression_plus(p):
'''
expression : expression PLUS expression
'''
p[0] = p[1] + t[3]
def p_expression_number(p):
'''
expression : NUMBER
'''
p[0] = [1]
So the first docstring works is, an expression is defined as expression PLUS expression. Here PLUS is the token we defined earlier, and expression is any other way we've defined expression, so an expression is also a number (which is the token we defined earlier). The way the code works is essentially that p[0] is the result, and each piece of the definition is it's own subscript, so p[1] and p[3] refer to the two expression in the plus expression we defined.
To actually use this parser we've defined we do:
parser = yacc.yacc()
if __name__ == '__main__':
while True:
try:
s = raw_input('calc > ')
except EOFError:
break
if not s:
continue
result = parser.parse(s)
print result
Try it out! As an exercise, the reader can implement other operations (remember the order of operations!), and perhaps variables. Tomorrow, I'll be discussing implementing these. As always, the PLY documentation is excellent, and available here.
You can find the rest here. There are view comments.