alex gaynor's blago-blog

Posts tagged with compiler

Thoughts on HipHop PHP

Posted February 2nd, 2010. Tagged with python, compile, c++, compiler, unladen-swallow, programming-languages, php.

This morning Facebook announced, as had been rumored for several weeks, a new, faster, implementation of PHP. To someone, like me, who loves dynamic languages and virtual machines this type of announcement is pretty exciting, after all, if they have some new techniques for optimizing dynamic languages they can almost certainly be ported to a language I care about. However, because of everything I've read (and learned about PHP, the language) since the announcement I'm not particularly excited about HipHop.

Firstly, there's the question of what problem HipHop solves. It aims to improve the CPU usage of PHP applications. For all practical purposes PHP exists exclusively to serve websites (yes, it can do other things, no one does them). Almost every single website on the internet is I/O bound, not CPU bound, web applications spend their time waiting on external resources (databases, memcache, other HTTP resources, etc.). So the part of me that develops websites professionally isn't super interested, Facebook is in the exceptionally rare circumstance that they've optimized their I/O to the point that optimizing CPU gives worthwhile returns. However, the part of me that spends his evenings contributing to Unladen Swallow and hanging around PyPy still thought that there might be some interesting VM technology to explore.

The next issue for consideration was the "VM" design Facebook choose. They've elected to compile PHP into C++, and then use a C++ compiler to get a binary out of it. This isn't a particularly new technique, in the Python world projects like Shedskin and Cython have exploited a similar technique to get good speed ups. However, Facebook also noted that in doing so they had dropped support for "some rarely used features — such as eval()". An important question is which features exactly, they had dropped support for. After all, the reason compiling a dynamic language to efficient machine code is difficult is because the dynamicism defeats the compiler's ability to optimize, but if you remove the dynamicism you remove the obstacles to efficient compilation. However, you're also not compiling the same language. PHP without eval(), and whatever else they've removed is quite simply a different language, for this reason I don't consider either Shedskin or Cython to be an implementation of Python, because they don't implement the entire language.

This afternoon, while I was idling in the Unladen Swallow IRC channel a discussion about HipHop came up, and I learned a few things about PHP I hadn't previous realized. The biggest of these is that a name bound to a function in PHP cannot be undefined, or redefined. If you've ever seen Collin Winter give a talk about Unladen Swallow, the canonical example of Python's dynamicism defeating a static compiler is the len() function. For lists, tuples, or dicts a call to len() should be able to be optimized to a single memory read out of a field on the object, plus a call to instantiate an integer object. However, in CPython today it's actually about 3 function calls, and 3 memory reads to get this data (plus the call to instantiate an integer object), plus the dictionary lookup in the global builtins to see what the object named len is. That's a hell of a lot more work than a single memory read (which is one instruction on an x86 CPU). The reason CPython needs to do all that work is that it a) doesn't know what the len object is, and b) when len is called it has no idea what its arguments will be.

As I've written about previously, Unladen Swallow has some creative ways to solve these problems, to avoid the dictionary lookups, and, eventually, to inline the body of the len() into the caller and optimize if for the types it's called with. However, this requires good runtime feedback, since the compiler simply cannot know statically what any of the objects will actually be at runtime. However, if len could be know to be the len() function at compile time Unladen Swallow could inline the body of the function, unconditionally, into the caller. Even with only static specialization for lists, dicts, and tuples like:

if isinstance(obj, list):
    return obj.ob_size
elif isinstance(obj, tuple):
    return obj.ob_size
elif isisinstance(obj, dict):
    return obj.ma_fill
else:
    return obj.__len__()

This would be quite a bit faster than the current amount of indirection. In PHP's case it's actually even easier, it only has one builtin array type, which acts as both a typical array as well as a hash table. Now extend this possibly optimization to not only every builtin, but every single function call. Instead of the dictionary lookups Python has to do for every global these can just become direct function calls.

Because of differences like this (and the fact that PHP only has machine sized integers, not arbitrary sized ones, and not implementing features of the language such as eval()) I believe that the work done on HipHop represents a fundamentally smaller challenge than that taken on by the teams working to improve the implementations of languages like Python, Ruby, or Javascript. At the time of the writing the HipHop source has not been released, however I am interested to see how they handle garbage collection.

You can find the rest here. There are view comments.

A Bit of Benchmarking

Posted November 22nd, 2009. Tagged with django, python, compiler, pypy, programming-languages.

PyPy recently posted some interesting benchmarks from the computer language shootout, and in my last post about Unladen Swallow I described a patch that would hopefully be landing soon. I decided it would be interesting to benchmarks something with this patch. For this I used James Tauber's Mandelbulb application, at both 100x100 and 200x200. I tested CPython, Unladen Swallow Trunk, Unladen Swallow Trunk with the patch, and a recent PyPy trunk (compiled with the JIT). My results were as follows:

VM 100 200
CPython 2.6.4 17s 64s
Unladen Swallow Trunk 16s 52s
Unladen swallow Trunk + Patch 13s 49s
PyPy Trunk 10s 46s

Interesting results. At 100x100 PyPy smokes everything else, and the patch shows a clear benefit for Unladen. However, at 200x200 both PyPy and the patch show diminishing returns. I'm not clear on why this is, but my guess is that something about the increased size causes a change in the parameters that makes the generated code less efficient for some reason.

It's important to note that Unladen Swallow has been far less focussed on numeric benchmarks than PyPy, instead focusing on more web app concerns (like template languages). I plan to benchmark some of these as time goes on, particularly after PyPy merges their "faster-raise" branch, which I'm told improves PyPy's performance on Django's template language dramatically.

You can find the rest here. There are view comments.

Things College Taught me that the "Real World" Didn't

Posted November 21st, 2009. Tagged with django, python, compile, ply, lex, yacc, c++, response, compiler, pypy, unladen-swallow, programming-languages, parse, college.

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.

Things I learned in college:

  • Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
  • C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
  • Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).

Things I learned in the real world:

  • Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
  • Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
  • APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.

So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>

You can find the rest here. There are view comments.

Why I'm not very excited about Go

Posted November 12th, 2009. Tagged with c++, compiler, go, programming-languages.

When I first heard Rob Pike and Ken Thompson had created a new programming language my first instinct is, "I'm sure it'll be cool, these are brilliant guys!" Unfortunately that is not the case, mostly because I think they made some bad design choices, and because the "cool new things" aren't actually that new, or cool.

The first major mistake was using a C derived syntax: braces and semicolons. I know some people don't like significant whitespace, but if you aren't properly indenting your code it's already beyond the point of hopelessness. The parser ought to use the existing structure instead of adding more punctuation. Readability is one of the most important attributes of code, and this syntax detracts from that.

The next mistake is having a separate declaration and assignment operator. I understand that the point of this operator is to reduce the repetition of typing out the types name both in declaring the variable and in initializing the value. Yet it seems the purpose of the := operator is to avoid typos in variable names that are possible by making all assignment an implicit declaration if the variable wasn't already declared. I can see myself making many more typos by forgetting to use the := operator, and in cases where I make a typo in a variable name it would inevitably be caught by the compiler when I attempted to actually use it (the fact that this is an attempted declaration means the variable won't have been declared elsewhere).

The final mistake was not providing generics. C++'s templates is one of the things that make the language head and shoulders more useful for me than C for tasks I need to preform; generics allow one to provide reusable data structures. While Go seems to have something akin to generics with their map data structure, it disappointingly doesn't appear to be exposed in any way to user code. One of the things I've found makes me most productive in Python is that any time I need to perform a task I simply pick the data structure that does what I want and it is efficiently implemented. Without generics, I don't see a way for a statically typed language to offer this same breadth of data structures without each of them being a special case.

In terms of features which I believe are overhyped the most important one is the "goroutine". As best I can tell these are an implementation of fibers. Constructs like concurrency should not be granted their own syntax, especially when they can be cleanly implemented using the other constructs of a language, look at the C library libtask as an example.

Further, the handling of interfaces, though interesting, appears to be an implementation of C++0x's proposed concepts (which won't be in the final standard). I view this feature as something that is most useful in the context of generics, which Go doesn't have. The reason for this is to be able to make compile time assertions about the types that are being templated over. Doing anything else is more clearly implemented as abstract inheritance.

I'm not writing off Go permanently, but I'm not enthused about it, someone will probably have to let me know when I need to look again as I won't be following along. Enjoy your internet travels.

You can find the rest here. There are view comments.

Diving into Unladen Swallow's Optimizations

Posted November 3rd, 2009. Tagged with python, compile, internals, compiler, unladen-swallow.

Yesterday I described the general architecture of Unladen Swallow, and I said that just by switching to a JIT compiler and removing the interpretation overhead Unladen Swallow was able to get a performance gain. However, that gain is nowhere near what the engineers at Google are hoping to accomplish, and as such they've been working on building various optimizations into their JIT. Here I'm going to describe two particularly interesting ones they implemented during the 3rd quarter (they're doing quarterly releases).

Before diving into the optimizations themselves I should note there's one piece of the Unladen Swallow architecture I didn't discuss in yesterday's post. The nature of dynamic languages is that given code can do nearly anything depending on the types of the variables present, however in practice usually very few types are seen. Therefore it is necessary to collect information about the types seen in practice in order to perform optimizations. Therefore what Unladen Swallow has done is added data collection to the interpreter while it is executing bytecode. For example the BINARY_ADD opcode records the types of both of it's operands, the CALL_FUNCTION opcode records the function it is calling, and the UNPACK_SEQUENCE opcode records the type of the sequence it's unpacking. This data is then used when the function is compiled to generate optimal code for the most likely scenarios.

The first optimization I'm going to look at is one for the CALL_FUNCTION opcode. Python has a number of flags that functions defined in C can have, the two relevant to this optimization are METH_NOARGS and METHO_O. These flags indicate that the function (or method) in question take either 0 or 1 argument respectively (this is excluding the self argument on methods). Normally when Python calls a function it builds up a tuple of the arguments, and a dictionary for keyword arguments. For functions defined in Python CPython lines up the arguments with those the function takes and then sets them as local variables for the new function. For C functions they are given the tuple and dictionary directly and are responsible for parsing them themselves. By contrast functions with METH_NOARGS or METH_O receive their arguments (or nothing in the case of METH_NOARGS) directly.

Because calling METH_NOARGS and METH_O functions is so much easier than the alternate case (which involves several allocations and complex logic) when possible it is best to special case them in the generated assembly. Therefore, when compiling a CALL_FUNCTION opcode if using the data recorded there is only ever one function called (imagine a call to len, it is going to be the same len function every time), and that function is METH_NOARGS or METH_O then instead of generating a call to the usual function call machinery Unladen Swallow instead emits a check to make sure the function is actually the expected one and if it passes emits a call directly to the function with the correct arguments. If this guard fails then Unladen Swallow jumps back to the regular interpreter, leaving the optimized assembly. The reason for this is that the ultimately generated assembly can be more efficient when it only has to consider one best case scenario, as opposed to needing to deal with a large series of if else statements, which catalogue every single best case and the corresponding normal case. Ultimately, this results in more efficient code for calls to functions like len(), which are basically never redefined.

The next optimization we're going to look at is one for the LOAD_GLOBAL function. The LOAD_GLOBAL opcode is used for getting the value of a global variable, such as a builtin function, an imported class, or a global variable in the same module. In the interpreter the code for this opcode looks something like:

name = OPARG()
try:
    value = globals[name]
except KeyError:
    try:
        value = builtins[name]
    except KeyError:
        raise_exception(KeyError, name)
PUSH(value)

As you can see in the case of a builtin object (something like len, str, or dict) there are two dictionary lookups. While the Python dictionary is an exceptionally optimized datastructure it still isn't fast compared to a lookup of a local value (which is a single lookup in a C array). Therefore the goal of this optimization is to reduce the number of dictionary lookups needed to find the value for a global or builtin.

The way this was done was for code objects (the datastructures that hold the opcodes and various other internal details for functions) to register themselves with the globals and builtin dictionaries. By registering themselves the dictionaries are able to notify the code objects (similar to Django signals) whenever they are modified. The result of this is that the generated assembly for a LOAD_GLOBAL can perform the dictionary lookup once at compilation time and then the resulting assembly will be valid until the globals or builtins dictionary notifies the code object that they have been modified, thus rendering the assembly invalid. In practice this is very efficient because globals and builtins are very rarely modified.

Hopefully you've gotten a sense of the type of work that the people behind Unladen Swallow are doing. If you're interested in reading more on this type of work I'd highly recommend taking a look at the literature listed on the Unladen Swallow wiki, as they note that there is no attempt to do any original research, all the work being done is simply the application of existing, proven techniques to the CPython interpreter.

For the rest of this month I'm going to try to give a preview of the next day's post with each post, that way I can start thinking about it well in advance. Tomorrow I'm going to shift gears a little bit and write about the ManyToManyField refactoring I did over the summer and which was just committed to Django.

You can find the rest here. There are view comments.

Introduction to Unladen Swallow

Posted November 2nd, 2009. Tagged with python, compile, internals, compiler, unladen-swallow.

Unless you've been living under a rock for the past year (or have zero interest in either Python or dynamic languages, it which case why are you here?) you've probably heard of Unladen Swallow. Unladen Swallow is a Google funded branch of the CPython interpreter, with a goal of making CPython significantly faster while retaining both API and ABI compatibility. In this post I'm going to try to explain what it is Unladen Swallow is doing to bring a new burst of speed to the Python world.

In terms of virtual machines there are a few levels of complexity, which roughly correspond to their speed. The simplest type of interpreter is an AST evaluator, these are more or less the lowest of the low on the speed totem pole, up until YARV was merged into the main Ruby interpreter, MRI (Matz Ruby Interpreter) was this type of virtual machine. The next level of VM is a bytecode interpreter, this means that the language is compiled to an intermediary format (bytecode) which is then executed. Strictly speaking this is an exceptionally broad category which encompasses most virtual machines today, however for the purposes of this article I'm going to exclude any VMs with a Just-In-Time compiler from this section (more on them later). The current CPython VM is this type of interpreter. The most complex (and fastest) type of virtual machine is one with a Just-In-Time compiler, this means that the bytecode that the virtual machine interprets is also dynamically compiled into assembly and executed. This type of VM includes modern Javascript interpreters such as V8, Tracemonkey, and Squirellfish, as well as other VMs like the Hotspot Java virtual machine.

Now that we know where CPython is, and what the top of the totem pole looks like it's probably clear what Unladen Swallow is looking to accomplish, however there is a bit of prior art here that's worthy of taking a look. There is actually currently a JIT for CPython, named Psyco. Psyco is pretty commonly used to speed up numerical code, as that's what it's best at, but it can speed up most of the Python language. However, Psyco is extremely difficult to maintain and update. It only recently gained support for modern Python language features like generators, and it still only supports x86 CPUs. For these reasons the developers at Google chose to build their JIT rather than work to improve the existing solution (they also chose not to use one of the alternative Python VMs, I'll be discussing these in another post).

I just said that Unladen Swallow looked to build their own JIT, but that's not entirely true. The developers have chosen not to develop their own JIT (meaning their own assembly generator, and register allocator, and optimizer, and everything else that goes along with a JIT), they have instead chosen to utilize the LLVM (Low Level Virtual Machine) JIT for all the code generation. What this means is that instead of doing all the work I've alluded the devs can instead translate the CPython bytecode into LLVM IR (intermediate representation) and then use LLVM's existing JIT infrastructure to do some of the heavy lifting. This gives the devs more time to focus on the interesting work of how to optimize the Python language.

Now that I've layed out the background I'm going to dive into what exactly it is that Unladen Swallow does. Right now the CPython virtual machine looks something like this:

for opcode in opcodes:
    if opcode == BINARY_ADD:
        x, y = POP(), POP()
        z = x + y
        PUSH(z)
    elif opcode == JUMP_ABSOLUTE:
        pc = OPARG()
    # ...

This is both hugely simplified and translated into a Pythonesque psuedocode, but hopefully it makes the point clear, right now the CPython VM runs through the opcodes and based on what the opcode is executes some C code. This is particularly inefficient because there is a fairly substantial overhead to actually doing the dispatch on the opcode. What Unladen Swallow does is count the number of times a given Python function is called (the heuristic is actually slightly more complicated than this, but it's a good approximation of what happens), and when it reaches 10000 (the same value the JVM uses) it stops to compile the function using LLVM. Here what it does is essentially unrolls the interpreter loop, into the LLVM IR. So if you had the bytecode:

BINARY_ADD

Unladen Swallow would generate code like:

x, y = POP(), POP()
z = x + y
PUSH(z)

This eliminates all of the overhead of the large loop in the interpreter. Unladen Swallow also performs a number of optimizations based on Python's semantics, but I'll be getting into those in another post, for now LLVM run it's optimizers, which can improve the generated code somewhat, and then CPython executes the generated function. Now whenever this function is called in the future the optimized, assembly version of it is called.

This concludes the introduction to Unladen Swallow. Hopefully you've learned something about the CPython VM, Unladen Swallow, or virtual machines in general. In future posts I'm going to be diving in to some of the optimizations Unladen Swallow does, as well as what other players are doing in this space (particularly PyPy).

You can find the rest here. There are view comments.

Optimising compilers are there so that you can be a better programmer

Posted October 10th, 2009. Tagged with django, python, compiler, pypy, unladen-swallow.

In a discussion on the Django developers mailing list I recently commented that the performance impact of having logging infrastructure, in the case where the user doesn't want the logging, could essentially be disregarded because Unladen Swallow (and PyPy) are bringing us a proper optimising (Just in Time) compiler that would essentially remove that consideration. Shortly thereafter someone asked me if I really thought it was the job of the interpreter/compiler to make us not think about performance. And my answer is: the job of a compiler is to let me program with best practices and not suffer performance consequences for doing things the right way.

Let us consider the most common compiler optimisations. A relatively simple one is function inlining, in the case where including the body of the function would be more efficient than actually calling it, a compiler can simply move the functions body into its caller. However, we can actually do this optimisation in our own code. We could rewrite:

def times_2(x):
    return x * 2

def do_some_stuff(i):
    for x in i:
        # stuff
        z = times_2(x)
    # more stuff

as:

def do_some_stuff(i):
    for x in i:
        # stuff
        z = x * 2
    # more stuff

And this is a trivial change to make. However in the case where times_2 is slightly less trivial, and is used a lot in our codebase it would be exceptionally more programming practice to repeat this logic all over the place, what if we needed to change it down the road? Then we'd have to review our entire codebase to make sure we changed it everywhere. Needless to say that would suck. However, we don't want to give up the performance gain from inlining this function either. So here it's the job of the compiler to make sure functions are inlined when possible, that way we get the best possible performance, as well as allowing us to maintain our clean codebase.

Another common compiler optimisation is to transform multiplications by powers of 2 into binary shifts. Thus x * 2 becomes x << 1 A final optimisation we will consider is constant propagation. Many program have constants that are used throughout the codebase. These are often simple global variables. However, once again, inlining them into methods that use them could provide a significant benefit, by not requiring the code to making a lookup in the global scope whenever they are used. But we really don't want to do that by hand, as it makes our code less readable ("Why are we multiplying this value by this random float?", "You mean pi?", "Oh."), and makes it more difficult to update down the road. Once again our compiler is capable of saving the day, when it can detect a value is a constant it can propagate it throughout the code.

So does all of this mean we should never have to think about writing optimal code, the compiler can solve all problems for us? The answer to this is a resounding no. A compiler isn't going to rewrite your insertion sort into Tim sort, nor is it going to fix the fact that you do 700 SQL queries to render your homepage. What the compiler can do is allow you to maintain good programming practices.

So what does this mean for logging in Django? Fundamentally it means that we shouldn't be concerned with possible overhead from calls that do nothing (in the case where we don't care about the logging) since a good compiler will be able to eliminate those for us. In the case where we actually do want to do something (say write the log to a file) the overhead is unavoidable, we have to open a file and write to it, there's no way to optimise it out.

You can find the rest here. There are view comments.