Over the past week or so I've been struggling with attempting to write my own parser (or parser generator) by hand. A few days ago I finally decided to give up on this notion (after all the parser isn't my end goal) as it was draining my time from the interesting work to be done. However, I wanted to keep my existing lexer. I wrote the lexer by hand in the method I described in a previous post, it's fast, easy to read, and I rather like my handiwork, so I wanted to keep it if possible. I've used PLY before (as I described last year) so I set out to see if it would be possible to use it for parsing without using it for lexing.
As it turns out PLY expects only a very minimal interface from it's lexer. In fact it only needs one method, token(), which returns a new token (or None at the end). Tokens are expected to have just 4 attributes. Having this knowledge I now set out to write a pair of compatibility classes for my existing lexer and token classes, I wanted to do this without altering the lexer/token API so that if and when I finally write my own parser I don't have to remove legacy compatibility stuff. My compatibility classes are very small, just this:
class PLYCompatLexer(object):
def __init__(self, text):
self.text = text
self.token_stream = Lexer(text).parse()
def token(self):
try:
return PLYCompatToken(self.token_stream.next())
except StopIteration:
return None
class PLYCompatToken(object):
def __init__(self, token):
self.type = token.name
self.value = token.value
self.lineno = None
self.lexpos = None
def __repr__(self):
return "<Token: %r %r>" % (self.type, self.value)
This is the entirety of the API that PLY needs. Now I can write my parser exactly as I would normally with PLY.
You can find the rest here. There are view comments.
A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.
- Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
- C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
- Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).
- Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
- Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
- APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.
So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>
As promised, today we'll be looking at implementing additional arithmetic operations, dealing with order of operations, and adding variables to our languages, so without further ado, let's jump into the code.
We can replace our old addition rule with this:
import operator
def p_expression_arithmetic(p):
'''
expression : expression PLUS expression
| expression MINUS expression
| expression TIMES expression
| expression DIVIDE expression
'''
OPS = {
'+': operator.add,
'-': operator.sub,
'*': operator.mul,
'/': operator.div
}
p[0] = OPS[p[2]](p[1], p[3])
Hopefully what this code does is pretty clear, the | operator in the rule is an or option. So if we match any of these, we get the correct function out of our ops dictionary(if you aren't familiar with operator module check it out, it's awesome), and then call it with the two arguments.
This handles the arithmetic correctly, but doesn't handle order of operations, so lets add that in:
precedence = (
('left', 'PLUS', 'MINUS'),
('left', 'TIMES', 'DIVIDE'),
)
What this says is all these operations are left-associative, and TIMES and DIVIDE have a high precedence than PLUS and MINUS(both groupings have equal precedence, and thus read left to right).
Now that we have a fully functioning calculator, let's add in variables, first we need to add a token for NAMES (variables) and for the assignment operator:
def t_NAME(t):
r'[a-zA-Z_][a-zA-Z_0-9]*'
return t
t_EQ = r'='
And of course add NAME and EQ to the list of tokens, and now a few parsing rules:
names = {}
def p_expression_name(p):
'''
expression : NAME
'''
p[0] = names[p[1]]
def p_assignment(p):
'''
assignment : NAME EQ expression
'''
names[p[1]] = p[3]
So here we define a names dictionary, it will map variables to values. Hopefully the parse rules are fairly obvious, and everything makes sense.
You can find the rest here. There are view comments.
Yesterday we created our tokens, and using these we can parse our language (which right now is a calculator) into some tokens. Unfortunately this isn't very useful. So today we are going to start writing a grammar, and building an interpreter around it.
In PLY, grammar rules are defined similarly to tokens, that is, using docstrings. Here's what a few grammar rules for out language might look like:
def p_expression_plus(p):
'''
expression : expression PLUS expression
'''
p[0] = p[1] + t[3]
def p_expression_number(p):
'''
expression : NUMBER
'''
p[0] = [1]
So the first docstring works is, an expression is defined as expression PLUS expression. Here PLUS is the token we defined earlier, and expression is any other way we've defined expression, so an expression is also a number (which is the token we defined earlier). The way the code works is essentially that p[0] is the result, and each piece of the definition is it's own subscript, so p[1] and p[3] refer to the two expression in the plus expression we defined.
To actually use this parser we've defined we do:
parser = yacc.yacc()
if __name__ == '__main__':
while True:
try:
s = raw_input('calc > ')
except EOFError:
break
if not s:
continue
result = parser.parse(s)
print result
Try it out! As an exercise, the reader can implement other operations (remember the order of operations!), and perhaps variables. Tomorrow, I'll be discussing implementing these. As always, the PLY documentation is excellent, and available here.
You can find the rest here. There are view comments.
The other day I mentioned I was using PLY in my post about building a language, so today I'm going to describe getting started with PLY, specifically the tokenization phase. For those who don't know much about parsing a language, the tokenization phase is where we take the source file, and turn it into a series of tokens. For example, turning a = 3 + 4 into, NAME EQUALS 3 PLUS 4. As you can see that simple assignment becomes 5 tokens, each number is a token, both numbers are tokens, and a is a NAME token. So how do we do this in PLY?
PLY's method for defining tokenization rules is very creative. First you define a list of tokens, for example:
tokens = (
'NUMBER',
'PLUS',
'MINUS',
'TIMES',
'DIVIDE',
)
Here we have defined the types of tokens we will define, what each of these is should be self explanatory. Then we define some rules, they look like this:
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
def t_NUMBER(t):
r'\d+'
try:
t.value = int(t.value)
except ValueError:
t.value = 0
return t
This is probably less obvious. There are 2 ways to define the rules for a token, either as a string, or as a function. Either way they are named t_TOKEN_NAME. For a lot of tokens you can just do the string, those are the ones that don't require processing, and the string is just a regex that matches the token. For things that do need processing, we can define a function. The function takes 1 parameter, which is a lexer object, as you can see in our example, we take in t, and since we are defining a number token we set the value to be the integer for the string representation from the source code. The interesting thing here is how we define the rule for, PLY uses the docstring for a function to get the regex for it.
Now that we have all of our rules set up we need to actually build the lexer object:
lexer = lex.lex()
And then we can use the input() function on a lexer to provide the source code, and the token function to pop the next token off the lexer.
That's all for today, in the future we'll take a look at the other components of building the grammar of a language, and at how we implement it. For more information now, PLY has excellent documentation, available here.
You can find the rest here. There are view comments.
One of my side projects of late has been building a programming language in Python, using the PLY library. PLY is essentially a Python implementation of the classic Lex and Yacc tools. The language, at present, has a syntax almost exactly the same as Python's (the notable difference, in so far as features that have been implemented, is that you are not allowed to have multiple statements on the same line, or to put anything following a colon on the same line). The language(currently called 'Al' although that's more of a working name), is a dynamic language that builds up an syntax tree for the code, and than executes it. However, the long term goal is to have it actually be a compiled language, similar to Lisp or C. Essentially the mechanism for doing this will be the same as how a C++ compiler handles multiple dispatch, which is dynamically at run time.
At present however, this mythical fully compiled language is far from complete, I haven't even began to think about the assembly generation, mostly because I don't know assembly at all, and one of the courses I will be taking next semester is one which covers assembly code. However, the question that has to be asked, are what are the advantages of a compiled language, and what are the costs?
First the benefits:
- It's faster, even a worst case C++ program that fully utilizes multiple dispatch at runtime will go faster than a program using the same algorithms in Python.
- You get an executable at the end. This is a huge advantage for distribution, you don't need to distribute the source code, and you have an exe to give to people.
There are probably others, but I'm assuming the semantics of a language similar to Python, so I haven't included things like compile time type checking. And now the disadvantages:
- You lose some of the dynamicism. Doing things like eval(), or dynamic imports is inherently harder, if not impossible.
- You lose the REPL (interactive interpreter).
So can we overcome those? As far as I can tell the first should be doable, eval() necessitates the inclusion of an interpreter with the language, the thought of this already has to be making people think this is just going to end up as a VM. But, I think this can be overcome, we can know, at compile time, whether or not a user will be using eval, and decide then whether or not to compile the interpreter and link against it. Dynamic imports are, if anything harder, I think, I think this is just an issue of doing run time linking, but I'm not sure. As for the issue of the REPL, this is a non-issue as far as I'm concerned, there is no inherent reason a compiled language can't have a REPL, we just often don't, languages like Common Lisp have long had both.
So now, let's see some code. I hope to have some code to show off, that handles at least a subset of Python, for PyCon 2009, as work begins on assembly generation I will post here. For anyone interested in the code at present, you can see it here.
You can find the rest here. There are view comments.