alex gaynor's blago-blog

Posts tagged with unladen-swallow

PyCon Roundup - Days 2-4

Posted March 8th, 2010. Tagged with pypy, unladen-swallow, pycon.

As I said in my last post PyCon was completely and utterly awesome, but it was also a bit of a blur. Here's my attempt at summarizing everything that happened during the conference itself.

Day 2

Day 2 (the first day of the conference itself) started out with Guido's keynote. He did something rather unorthodox, instead of delivering a formal talk he just took audience questions via Twitter, starting out using Twitterfall, but it was a tad slow so Jacob Kaplan-Moss switched it out for the PyCon Live Stream that Eric Florenzano, Brian Rosner, and I built, switch was very awesome (seeing your creation on four projectors in front of an audience is a pretty good ego boost). Next up I had to deliver my own talk, it went well I think, you can find my slides and a video on the PyCon website. The rest of the day was a bit of a blur, but I enjoyed James Bennet's talk, the Form generator's panel, Jonathan Ellis's talk on database scalability, and Alex Martelli's talk.

That evening I got to visit the Django restaurant, I can't help but imagine what the staff there thought about the crazy groups of programmers visiting (I think we mobbed the place every night of the conference), many of us wearing our Django shirts.

Day 3

Day 3 was really about the VMs. It started with the Iron Python and PyPy keynotes. These were followed by Mark Shuttleworth's keynote, his slides weren't working (the perils of using an operating system alpha release), but he still delivered an awesome talk on software development processes. From there I went to the single best stretch of talks at PyCon. The Speed of PyPy, Unladen Swallow: fewer coconuts, faster Python, and Understanding the Python GIL. Three great topics, three great speakers, three great talks, one room. Simply put, you should drop whatever you're doing and watch each of these talks, they're all great. I was sad to miss Raymond Hettinger's talk for the Django Software Foundation meeting, but I caught the video and it was excellent (as usual for Raymond). The DSF meeting was interesting, but not super exciting, a lot of "action needed" tasks, some of which are already happening (like getting better buildbots running). Finally I caught the tail end of the Neo4j talk, I'd need to watch the full thing, because the ending caught my attention.

I ended up back at the Django restaurant again, this time for the speakers, volunteers, and sponsors dinner. Django wasn't quite equipped to handle the sheer number of us that showed us, but it was a good time nonetheless. Once again I was blown away by how many Google people there were. There were far too many awesome people to list, but suffice it to say that you should volunteer or speak at PyCon, if for no other reason than to get to attend this dinner, there are awesome people to have a conversation with as far as the eyes can see! After dinner I ended up staying up far too late with another group of awesome people. I'm told the testing birds of a feather was also great (I suppose events that aren't completely awesome don't invent things like the Testing Goat).

Day 4

Here I suffered the effects of the aforementioned late night. I missed all of the keynotes, which is unfortunately considering they all looked quite good, I'll have to catch the videos. The poster session was very cool, I only got to see about half of it, but it's definitely something I'm going to look forward to at future PyCons. Next I saw Donovan Preston's talk on eventlet, a talk on teaching compilers with Python. I missed Scott Chacon's talk on hg-git, though I can't remember what for. I'll have to catch it in video because I was really looking forward to it.

After that there was time for a final pizza with friends, and then I had to head home. I sprinted remotely, but it's not quite the same as being there. It's my hope that next year I can attend the sprints in person, but school never seems to work out for me in that respect.

You can find the rest here. There are view comments.

PyCon Roundup - Days 0 and 1

Posted February 26th, 2010. Tagged with unladen-swallow, pycon.

This year's PyCon was completely and utterly awesome, to the point where anything I point in writing won't do it justice. But I'm going to try to anyways.

Day 0

Day 0 was Wednesday, I got in pretty late and didn't do much of anything besides meet up with people in the hotel lobby for an hour or two. Still, it was good to be able to catch up with people, and put some faces to people I knew exclusively via the internet. Finally I had to finish up some homework (in Ruby on Rails of all things) so it wouldn't be an albatross around my neck for the rest of the conference.

Day 1

Day 1 for me was Thursday, for most conference atendees this was the last day of tutorials, however I was going to the language summit instead. The language summit impressed me with the Python community in ways I can barely describe. We had a laundry list of issues to cover, ranging from the state of packaging, to the Unladen Swallow PEP, to what policies for pure Python and C modules should be in the standard library. Besides a heated discussion about setuptools, distutils, and distribute every decision was handled extremely professionally, covering pros, cons, and impact on users. A few of the conclusions (which you've probably already heard about) are that The Hitchhikers Guide to Packaging is awesome, Unladen Swallow PEP accepted, stdlib modules that come with a C variety exclusively for speed must also have a pure Python version, stdlib modules written in C to interface with other systems are ok (but optionally having a ctypes version should be nice for alternate distributions). At lunch I met Antonio Rodriguez (one of the keynote speakers) and two Red Hat (and Fedora) developers. It was interesting to discuss the state of Linux, Ubuntu, and other distributions with people who are professionally involved in them.

I'll be trying to dedicate a full post to each of the remaining days, though as I've said there's a ton to cover. Really the takeaway for you should be: PyCon is awesome and if it's humanly possible for you to make it, you should.

You can find the rest here. There are view comments.

Committer Models of Unladen Swallow, PyPy, and Django

Posted February 25th, 2010. Tagged with pypy, python, unladen-swallow, django, open-source.

During this year's PyCon I became a committer on both PyPy and Unladen Swallow, in addition I've been a contributer to Django for quite a long time (as well as having commit privileges to my branch during the Google Summer of Code). One of the things I've observed is the very different models these projects have for granting commit privileges, and what the expectations and responsibilities are for committers.

Unladen Swallow

Unladen Swallow is a Google funded branch of CPython focused on speed. One of the things I've found is that the developers of this project carry over some of the development process from Google, specifically doing code review on every patch. All patches are posted to Rietveld, and reviewed, often by multiple people in the case of large patches, before being committed. Because there is a high level of review it is possible to grant commit privileges to people without requiring perfection in their patches, as long as they follow the review process the project is well insulated against a bad patch.

PyPy

PyPy is also an implementation of Python, however its development model is based largely around aggressive branching (I've never seen a project handle SVN's branching failures as well as PyPy) as well as sprints and pair programming. By branching aggressively PyPy avoids the overhead of reviewing every single patch, and instead only requires review when something is already believed to be "trunk-ready", further this model encourages experimentation (in the same way git's light weight branches do). PyPy's use of sprints and pair programming are two ways to avoid formal code reviews and instead approach code quality as more of a collaborative effort.

Django

Django is the project I've been involved with for the longest, and also the only one I don't have commit privileges on. Django is extremely conservative in giving out commit privileges (there about a dozen Django committers, and about 500 names in the AUTHORS file). Django's development model is based neither on branching (only changes as large in scope as multiple database support, or an admin UI refactor get their own branch) nor on code review (most commits are reviewed by no one besides the person who commits them). Django's committers maintain a level of autonomy that isn't seen in either of the other two projects. This fact comes from the period before Django 1.0 was released when Django's trunk was often used in production, and the need to keep it stable at all times, combined with the fact that Django has no paid developers who can guarantee time to do code review on patches. Therefore Django has maintained code quality by being extremely conservative in granting commit privileges and allowing developers with commit privileges to exercise their own judgment at all times.

Conclusion

Each of these projects uses different methods for maintaining code quality, and all seem to be successful in doing so. It's not clear whether there's any one model that's better than the others, or that any of these projects could work with another's model. Lastly, it's worth noting that all of these models are fairly orthogonal to the centralized VCS vs. DVCS debate which often surrounds such discussions.

You can find the rest here. There are view comments.

Thoughts on HipHop PHP

Posted February 2nd, 2010. Tagged with c++, python, unladen-swallow, compile, php, compiler, programming-languages.

This morning Facebook announced, as had been rumored for several weeks, a new, faster, implementation of PHP. To someone, like me, who loves dynamic languages and virtual machines this type of announcement is pretty exciting, after all, if they have some new techniques for optimizing dynamic languages they can almost certainly be ported to a language I care about. However, because of everything I've read (and learned about PHP, the language) since the announcement I'm not particularly excited about HipHop.

Firstly, there's the question of what problem HipHop solves. It aims to improve the CPU usage of PHP applications. For all practical purposes PHP exists exclusively to serve websites (yes, it can do other things, no one does them). Almost every single website on the internet is I/O bound, not CPU bound, web applications spend their time waiting on external resources (databases, memcache, other HTTP resources, etc.). So the part of me that develops websites professionally isn't super interested, Facebook is in the exceptionally rare circumstance that they've optimized their I/O to the point that optimizing CPU gives worthwhile returns. However, the part of me that spends his evenings contributing to Unladen Swallow and hanging around PyPy still thought that there might be some interesting VM technology to explore.

The next issue for consideration was the "VM" design Facebook choose. They've elected to compile PHP into C++, and then use a C++ compiler to get a binary out of it. This isn't a particularly new technique, in the Python world projects like Shedskin and Cython have exploited a similar technique to get good speed ups. However, Facebook also noted that in doing so they had dropped support for "some rarely used features — such as eval()". An important question is which features exactly, they had dropped support for. After all, the reason compiling a dynamic language to efficient machine code is difficult is because the dynamicism defeats the compiler's ability to optimize, but if you remove the dynamicism you remove the obstacles to efficient compilation. However, you're also not compiling the same language. PHP without eval(), and whatever else they've removed is quite simply a different language, for this reason I don't consider either Shedskin or Cython to be an implementation of Python, because they don't implement the entire language.

This afternoon, while I was idling in the Unladen Swallow IRC channel a discussion about HipHop came up, and I learned a few things about PHP I hadn't previous realized. The biggest of these is that a name bound to a function in PHP cannot be undefined, or redefined. If you've ever seen Collin Winter give a talk about Unladen Swallow, the canonical example of Python's dynamicism defeating a static compiler is the len() function. For lists, tuples, or dicts a call to len() should be able to be optimized to a single memory read out of a field on the object, plus a call to instantiate an integer object. However, in CPython today it's actually about 3 function calls, and 3 memory reads to get this data (plus the call to instantiate an integer object), plus the dictionary lookup in the global builtins to see what the object named len is. That's a hell of a lot more work than a single memory read (which is one instruction on an x86 CPU). The reason CPython needs to do all that work is that it a) doesn't know what the len object is, and b) when len is called it has no idea what its arguments will be.

As I've written about previously, Unladen Swallow has some creative ways to solve these problems, to avoid the dictionary lookups, and, eventually, to inline the body of the len() into the caller and optimize if for the types it's called with. However, this requires good runtime feedback, since the compiler simply cannot know statically what any of the objects will actually be at runtime. However, if len could be know to be the len() function at compile time Unladen Swallow could inline the body of the function, unconditionally, into the caller. Even with only static specialization for lists, dicts, and tuples like:

if isinstance(obj, list):
    return obj.ob_size
elif isinstance(obj, tuple):
    return obj.ob_size
elif isisinstance(obj, dict):
    return obj.ma_fill
else:
    return obj.__len__()

This would be quite a bit faster than the current amount of indirection. In PHP's case it's actually even easier, it only has one builtin array type, which acts as both a typical array as well as a hash table. Now extend this possibly optimization to not only every builtin, but every single function call. Instead of the dictionary lookups Python has to do for every global these can just become direct function calls.

Because of differences like this (and the fact that PHP only has machine sized integers, not arbitrary sized ones, and not implementing features of the language such as eval()) I believe that the work done on HipHop represents a fundamentally smaller challenge than that taken on by the teams working to improve the implementations of languages like Python, Ruby, or Javascript. At the time of the writing the HipHop source has not been released, however I am interested to see how they handle garbage collection.

You can find the rest here. There are view comments.

Things College Taught me that the "Real World" Didn't

Posted November 21st, 2009. Tagged with pypy, parse, python, unladen-swallow, compile, django, ply, programming-languages, c++, response, lex, compiler, yacc, college.

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.

Things I learned in college:

  • Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
  • C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
  • Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).

Things I learned in the real world:

  • Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
  • Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
  • APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.

So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>

You can find the rest here. There are view comments.

Another Pair of Unladen Swallow Optimizations

Posted November 19th, 2009. Tagged with pypy, python, unladen-swallow, django, programming-languages.

Today a patch of mine was committed to Unladen Swallow. In the past weeks I've described some of the optimizations that have gone into Unladen Swallow, in specific I looked at removing the allocation of an argument tuple for C functions. One of the "on the horizon" things I mentioned was extending this to functions with a variable arity (that is the number of arguments they take can change). This has been implemented for functions that take a finite range of argument numbers (that is, they don't take *args, they just have a few arguments with defaults). This support was used to optimize a number of builtin functions (dict.get, list.pop, getattr for example).

However, there were still a number of functions that weren't updated for this support. I initially started porting any functions I saw, but it wasn't a totally mechanical translation so I decided to do a little profiling to better direct my efforts. I started by using the cProfile module to see what functions were called most frequently in Unladen Swallow's Django template benchmark. Imagine my surprise when I saw that unicode.encode was called over 300,000 times! A quick look at that function showed that it was a perfect contender for this optimization, it was currently designated as a METH_VARARGS, but in fact it's argument count was a finite range. After about of dozen lines of code, to change the argument parsing, I ran the benchmark again, comparing it a control version of Unladen Swallow, and it showed a consistent 3-6% speedup on the Django benchmark. Not bad for 30 minutes of work.

Another optimization I want to look at, which hasn't landed yet, is one of optimize various operations. Right now Unladen Swallow tracks various data about the types seen in the interpreter loop, however for various operators this data isn't actually used. What this patch does is check at JIT compilation time whether the operator site is monomorphic (that is there is only one pair of types ever seen there), and if it is, and it is one of a few pairings that we have optimizations for (int + int, list[int], float - float for example) then optimized code is emitted. This optimized code checks the types of both the arguments that they are the expected ones, if they are then the optimized code is executed, otherwise the VM bails back to the interpreter (various literature has shown that a single compiled optimized path is better than compiling both the fast and slow paths). For simple algorithm code this optimization can show huge improvements.

The PyPy project has recently blogged about the results of the results of some benchmarks from the Computer Language Shootout run on PyPy, Unladen Swallow, and CPython. In these benchmarks Unladen Swallow showed that for highly algorithmic code (read: mathy) it could use some work, hopefully patches like this can help improve the situation markedly. Once this patch lands I'm going to rerun these benchmarks to see how Unladen Swallow improves, I'm also going to add in some of the more macro benchmarks Unladen Swallow uses to see how it compares with PyPy in those. Either way, seeing the tremendous improvements PyPy and Unladen Swallow have over CPython gives me tremendous hope for the future.

You can find the rest here. There are view comments.

Another Unladen Swallow Optimization

Posted November 8th, 2009. Tagged with python, internals, unladen-swallow.

This past week I described a few optimizations that the Unladen Swallow team have done in order to speed up CPython. In particular one of the optimizations I described was to emit direct calls to C functions that took either zero or one argument. This improves the speed of Python when calling functions like len() or repr(), who only take one argument. However, there are plenty of builtin functions that take a fixed number of arguments that is greater than one. This is the source of the latest optimization.

As I discussed previously there were two relevant flags, METH_O and METH_NOARGS. These described functions that take either one or zero arguments. However, this doesn't cover a wide gamut of functions. Therefore the first stage of these optimizations was to replace these two flags with METH_FIXED, which indicates that the function takes a fixed number of arguments. There was also an additional slot added to the struct that holds C functions to store the arity of the function (the number of arguments it takes). Therefore something like:

{"id", builtin_id, METH_O, id_doc}

Which is what the struct for a C function looks like would be replaced with:

{"id", builtin_id, METH_FIXED, id_doc, 1}

This allows Unladen Swallow to emit direct calls to functions that take more than 1 argument, specifically up to 3 arguments. This results in functions like hasattr() and setattr() to be better optimized. This change ultimately results in a 7% speed increase in Unladen Swallow's Django benchmark. Here the speed gains will largely come from avoiding allocating a tuple for the arguments, as Python used to have to do since the functions were defined as METH_VARARGS (which results in it receiving it's arguments as a tuple), as well as avoiding parsing that tuple.

This change isn't as powerful as it could be, specifically it requires that the function always take the same number of arguments. This prevents optimizing calls to getattr() for example, which can take either 2 or 3 arguments. This optimization doesn't hold because C doesn't have any way of expressing default arguments for the function, therefore the CPython runtime must pass all of the needed arguments to a function, which means C functions need to have a way to encode their defaults in a way that CPython can understand. One of the proposed solutions to this problem is to have functions be able to provide the minimum number of arguments they take and then CPython could pad the provided arguments with NULLs to achieve the correct number of arguments to the function (interestingly the C standard allows more arguments to be passed to a function than it takes). This type of optimization would speed up calls to things like dict.get() and getattr().

As you can see the speed of a Python application can be fairly sensitive to how various internal things are handled, in this case the speed increase can be shown to come exclusively from eliminating a tuple allocation and some extra logic on certain function calls. If you're interested in seeing the full changeset it's available on the internet.

You can find the rest here. There are view comments.

Diving into Unladen Swallow's Optimizations

Posted November 3rd, 2009. Tagged with python, unladen-swallow, compile, internals, compiler.

Yesterday I described the general architecture of Unladen Swallow, and I said that just by switching to a JIT compiler and removing the interpretation overhead Unladen Swallow was able to get a performance gain. However, that gain is nowhere near what the engineers at Google are hoping to accomplish, and as such they've been working on building various optimizations into their JIT. Here I'm going to describe two particularly interesting ones they implemented during the 3rd quarter (they're doing quarterly releases).

Before diving into the optimizations themselves I should note there's one piece of the Unladen Swallow architecture I didn't discuss in yesterday's post. The nature of dynamic languages is that given code can do nearly anything depending on the types of the variables present, however in practice usually very few types are seen. Therefore it is necessary to collect information about the types seen in practice in order to perform optimizations. Therefore what Unladen Swallow has done is added data collection to the interpreter while it is executing bytecode. For example the BINARY_ADD opcode records the types of both of it's operands, the CALL_FUNCTION opcode records the function it is calling, and the UNPACK_SEQUENCE opcode records the type of the sequence it's unpacking. This data is then used when the function is compiled to generate optimal code for the most likely scenarios.

The first optimization I'm going to look at is one for the CALL_FUNCTION opcode. Python has a number of flags that functions defined in C can have, the two relevant to this optimization are METH_NOARGS and METHO_O. These flags indicate that the function (or method) in question take either 0 or 1 argument respectively (this is excluding the self argument on methods). Normally when Python calls a function it builds up a tuple of the arguments, and a dictionary for keyword arguments. For functions defined in Python CPython lines up the arguments with those the function takes and then sets them as local variables for the new function. For C functions they are given the tuple and dictionary directly and are responsible for parsing them themselves. By contrast functions with METH_NOARGS or METH_O receive their arguments (or nothing in the case of METH_NOARGS) directly.

Because calling METH_NOARGS and METH_O functions is so much easier than the alternate case (which involves several allocations and complex logic) when possible it is best to special case them in the generated assembly. Therefore, when compiling a CALL_FUNCTION opcode if using the data recorded there is only ever one function called (imagine a call to len, it is going to be the same len function every time), and that function is METH_NOARGS or METH_O then instead of generating a call to the usual function call machinery Unladen Swallow instead emits a check to make sure the function is actually the expected one and if it passes emits a call directly to the function with the correct arguments. If this guard fails then Unladen Swallow jumps back to the regular interpreter, leaving the optimized assembly. The reason for this is that the ultimately generated assembly can be more efficient when it only has to consider one best case scenario, as opposed to needing to deal with a large series of if else statements, which catalogue every single best case and the corresponding normal case. Ultimately, this results in more efficient code for calls to functions like len(), which are basically never redefined.

The next optimization we're going to look at is one for the LOAD_GLOBAL function. The LOAD_GLOBAL opcode is used for getting the value of a global variable, such as a builtin function, an imported class, or a global variable in the same module. In the interpreter the code for this opcode looks something like:

name = OPARG()
try:
    value = globals[name]
except KeyError:
    try:
        value = builtins[name]
    except KeyError:
        raise_exception(KeyError, name)
PUSH(value)

As you can see in the case of a builtin object (something like len, str, or dict) there are two dictionary lookups. While the Python dictionary is an exceptionally optimized datastructure it still isn't fast compared to a lookup of a local value (which is a single lookup in a C array). Therefore the goal of this optimization is to reduce the number of dictionary lookups needed to find the value for a global or builtin.

The way this was done was for code objects (the datastructures that hold the opcodes and various other internal details for functions) to register themselves with the globals and builtin dictionaries. By registering themselves the dictionaries are able to notify the code objects (similar to Django signals) whenever they are modified. The result of this is that the generated assembly for a LOAD_GLOBAL can perform the dictionary lookup once at compilation time and then the resulting assembly will be valid until the globals or builtins dictionary notifies the code object that they have been modified, thus rendering the assembly invalid. In practice this is very efficient because globals and builtins are very rarely modified.

Hopefully you've gotten a sense of the type of work that the people behind Unladen Swallow are doing. If you're interested in reading more on this type of work I'd highly recommend taking a look at the literature listed on the Unladen Swallow wiki, as they note that there is no attempt to do any original research, all the work being done is simply the application of existing, proven techniques to the CPython interpreter.

For the rest of this month I'm going to try to give a preview of the next day's post with each post, that way I can start thinking about it well in advance. Tomorrow I'm going to shift gears a little bit and write about the ManyToManyField refactoring I did over the summer and which was just committed to Django.

You can find the rest here. There are view comments.

Introduction to Unladen Swallow

Posted November 2nd, 2009. Tagged with python, unladen-swallow, compile, internals, compiler.

Unless you've been living under a rock for the past year (or have zero interest in either Python or dynamic languages, it which case why are you here?) you've probably heard of Unladen Swallow. Unladen Swallow is a Google funded branch of the CPython interpreter, with a goal of making CPython significantly faster while retaining both API and ABI compatibility. In this post I'm going to try to explain what it is Unladen Swallow is doing to bring a new burst of speed to the Python world.

In terms of virtual machines there are a few levels of complexity, which roughly correspond to their speed. The simplest type of interpreter is an AST evaluator, these are more or less the lowest of the low on the speed totem pole, up until YARV was merged into the main Ruby interpreter, MRI (Matz Ruby Interpreter) was this type of virtual machine. The next level of VM is a bytecode interpreter, this means that the language is compiled to an intermediary format (bytecode) which is then executed. Strictly speaking this is an exceptionally broad category which encompasses most virtual machines today, however for the purposes of this article I'm going to exclude any VMs with a Just-In-Time compiler from this section (more on them later). The current CPython VM is this type of interpreter. The most complex (and fastest) type of virtual machine is one with a Just-In-Time compiler, this means that the bytecode that the virtual machine interprets is also dynamically compiled into assembly and executed. This type of VM includes modern Javascript interpreters such as V8, Tracemonkey, and Squirellfish, as well as other VMs like the Hotspot Java virtual machine.

Now that we know where CPython is, and what the top of the totem pole looks like it's probably clear what Unladen Swallow is looking to accomplish, however there is a bit of prior art here that's worthy of taking a look. There is actually currently a JIT for CPython, named Psyco. Psyco is pretty commonly used to speed up numerical code, as that's what it's best at, but it can speed up most of the Python language. However, Psyco is extremely difficult to maintain and update. It only recently gained support for modern Python language features like generators, and it still only supports x86 CPUs. For these reasons the developers at Google chose to build their JIT rather than work to improve the existing solution (they also chose not to use one of the alternative Python VMs, I'll be discussing these in another post).

I just said that Unladen Swallow looked to build their own JIT, but that's not entirely true. The developers have chosen not to develop their own JIT (meaning their own assembly generator, and register allocator, and optimizer, and everything else that goes along with a JIT), they have instead chosen to utilize the LLVM (Low Level Virtual Machine) JIT for all the code generation. What this means is that instead of doing all the work I've alluded the devs can instead translate the CPython bytecode into LLVM IR (intermediate representation) and then use LLVM's existing JIT infrastructure to do some of the heavy lifting. This gives the devs more time to focus on the interesting work of how to optimize the Python language.

Now that I've layed out the background I'm going to dive into what exactly it is that Unladen Swallow does. Right now the CPython virtual machine looks something like this:

for opcode in opcodes:
    if opcode == BINARY_ADD:
        x, y = POP(), POP()
        z = x + y
        PUSH(z)
    elif opcode == JUMP_ABSOLUTE:
        pc = OPARG()
    # ...

This is both hugely simplified and translated into a Pythonesque psuedocode, but hopefully it makes the point clear, right now the CPython VM runs through the opcodes and based on what the opcode is executes some C code. This is particularly inefficient because there is a fairly substantial overhead to actually doing the dispatch on the opcode. What Unladen Swallow does is count the number of times a given Python function is called (the heuristic is actually slightly more complicated than this, but it's a good approximation of what happens), and when it reaches 10000 (the same value the JVM uses) it stops to compile the function using LLVM. Here what it does is essentially unrolls the interpreter loop, into the LLVM IR. So if you had the bytecode:

BINARY_ADD

Unladen Swallow would generate code like:

x, y = POP(), POP()
z = x + y
PUSH(z)

This eliminates all of the overhead of the large loop in the interpreter. Unladen Swallow also performs a number of optimizations based on Python's semantics, but I'll be getting into those in another post, for now LLVM run it's optimizers, which can improve the generated code somewhat, and then CPython executes the generated function. Now whenever this function is called in the future the optimized, assembly version of it is called.

This concludes the introduction to Unladen Swallow. Hopefully you've learned something about the CPython VM, Unladen Swallow, or virtual machines in general. In future posts I'm going to be diving in to some of the optimizations Unladen Swallow does, as well as what other players are doing in this space (particularly PyPy).

You can find the rest here. There are view comments.

Optimising compilers are there so that you can be a better programmer

Posted October 10th, 2009. Tagged with pypy, python, unladen-swallow, django, compiler.

In a discussion on the Django developers mailing list I recently commented that the performance impact of having logging infrastructure, in the case where the user doesn't want the logging, could essentially be disregarded because Unladen Swallow (and PyPy) are bringing us a proper optimising (Just in Time) compiler that would essentially remove that consideration. Shortly thereafter someone asked me if I really thought it was the job of the interpreter/compiler to make us not think about performance. And my answer is: the job of a compiler is to let me program with best practices and not suffer performance consequences for doing things the right way.

Let us consider the most common compiler optimisations. A relatively simple one is function inlining, in the case where including the body of the function would be more efficient than actually calling it, a compiler can simply move the functions body into its caller. However, we can actually do this optimisation in our own code. We could rewrite:

def times_2(x):
    return x * 2

def do_some_stuff(i):
    for x in i:
        # stuff
        z = times_2(x)
    # more stuff

as:

def do_some_stuff(i):
    for x in i:
        # stuff
        z = x * 2
    # more stuff

And this is a trivial change to make. However in the case where times_2 is slightly less trivial, and is used a lot in our codebase it would be exceptionally more programming practice to repeat this logic all over the place, what if we needed to change it down the road? Then we'd have to review our entire codebase to make sure we changed it everywhere. Needless to say that would suck. However, we don't want to give up the performance gain from inlining this function either. So here it's the job of the compiler to make sure functions are inlined when possible, that way we get the best possible performance, as well as allowing us to maintain our clean codebase.

Another common compiler optimisation is to transform multiplications by powers of 2 into binary shifts. Thus x * 2 becomes x << 1 A final optimisation we will consider is constant propagation. Many program have constants that are used throughout the codebase. These are often simple global variables. However, once again, inlining them into methods that use them could provide a significant benefit, by not requiring the code to making a lookup in the global scope whenever they are used. But we really don't want to do that by hand, as it makes our code less readable ("Why are we multiplying this value by this random float?", "You mean pi?", "Oh."), and makes it more difficult to update down the road. Once again our compiler is capable of saving the day, when it can detect a value is a constant it can propagate it throughout the code.

So does all of this mean we should never have to think about writing optimal code, the compiler can solve all problems for us? The answer to this is a resounding no. A compiler isn't going to rewrite your insertion sort into Tim sort, nor is it going to fix the fact that you do 700 SQL queries to render your homepage. What the compiler can do is allow you to maintain good programming practices.

So what does this mean for logging in Django? Fundamentally it means that we shouldn't be concerned with possible overhead from calls that do nothing (in the case where we don't care about the logging) since a good compiler will be able to eliminate those for us. In the case where we actually do want to do something (say write the log to a file) the overhead is unavoidable, we have to open a file and write to it, there's no way to optimise it out.

You can find the rest here. There are view comments.