alex gaynor's blago-blog

Posts tagged with pypy

The compiler rarely knows best

Posted July 12th, 2012. Tagged with pypy, python, response.

This is a response to http://pwang.wordpress.com/2012/07/11/does-the-compiler-know-best/ if you haven't read it yet, start there.

For lack of any other way to say it, I disagree with nearly every premise presented and conclusion derived in Peter's blog post. The post itself doesn't appear to have any coherent theme, besides that PyPy is not the future of Python, so I'll attempt to reply to Peter's statements more or less in order.

First, and perhaps most erroneously, he claims that "PyPy is an even more drastic change to the Python language than Python3". This is wrong. Complete and utterly. PyPy is in fact not a change to Python at all, PyPy faithfully implements the Python language as described by the Python language reference, and as verified by the test suite. Moreover, this is a statement that would apply equally to Jython and IronPython. It is pure, unadulterated FUD. Peter is trying to extend the definition of the Python language to things that it simple doesn't cover, such as the C-API and what he thinks the interpreter should look like (to be discussed more).

Second, he writes, "What is the core precept of PyPy? It’s that “the compiler knows best”." This too, is wrong. First, PyPy's central thesis is, "any task repeatedly performed manually will be done incorrectly", this is why we have things like automatic insertion of the garbage collector, in preference to CPython's "reference counting everywhere", and automatically generating the just in time compiler from the interpreter, in preference to Unladen Swallow's (and almost every other language's) manual construction of it. Second, the PyPy developers would never argue that the compiler knows best, as I alluded to in this post's title. That doesn't mean you should quit trying to write intelligent compilers, 1) the compiler often knows better than the user, just like with C, while it's possible to write better x86 assembler than GCC for specific functions, over the course of a large project GCC will always win, 2) they aren't mutually exclusive, having an intelligent compiler does not prohibit giving the user more control, in fact it's a necessity! There are no pure-python hints that you can give to CPython to improve performance, but these can easily be added with PyPy's JIT.

He follows this by saying that in contrast to PyPy's (nonexistent) principle of "compiler knows best" CPython's strength is that it can communicate with other platforms because its inner workings are simple. These three things have nothing to do with each other. CPython's interoperability with other platforms is a function of it's C-API. You can build an API like this on top of something monstrously complicated too, look at JNI for the JVM. (I don't accept that PyPy is so complex, but that's another post for another time.) In any event, the PyPy developers are deeply committed to interoperability with other platforms, which is why Armin and Maciej have been working on cffi: http://cffi.readthedocs.org/en/latest/index.html

The next paragraph is one of the most bizarre things I've ever read. He suggests that if you do want the free performance gains PyPy promises you should just build a a Python to JS compiler and use Node.js. I have to assume this paragraph is a joke not meant for publication, because it's nonsense. First, I've been told by the scientific Python community (of which Peter is a member) that any solution that isn't backwards compatible with a mountain of older platforms will never be adopted. So naturally his proposed solution is to throw away all existing work. Next, he implies that Google, Mozilla, Apple, and Microsoft are all collaborating on a single Javascript runtime which is untrue, in fact they each have their own VM. And V8, the one runtime specifically alluded to via Node.js, is not, as he writes, designed to be concurrent; Evan Phoenix, lead developer of Rubinius, comments, "It's probably the least concurrent runtime I've seen."

He then moves on to discussing the transparency of the levels involved in a runtime. Here I think he's 100% correct. Being able to understand how a VM is operating, what it's doing, what it's optimizing, how it's executing is enormously important. That's why I'm confused that he's positioning this as an argument against PyPy, as we've made transparency of our system incredibly important. We have the jitviewer, a tool which exposes the exact internal operations and machine code generated for everything PyPy compiles, which can be correlated to a individual line of Python code. We also have a set of hooks into the JIT to be able to programatically inspect what's happening, including writing your own, pure Python, optimization passes: http://pypy.readthedocs.org/en/latest/jit-hooks.html!

That's all I have. Hope you enjoyed.

You can find the rest here. There are view comments.

RCOS NumPy Talk

Posted November 18th, 2011. Tagged with pypy, rcos, numpy.

I just delivered a talk at RCOS (the open source center on campus) on some of the work I've been doing this semester on PyPy's implementation of NumPy. You can find the slides here, they're built using html5slides, which is pretty swell (for a tool that makes me write HTML directly that is).

You can find the rest here. There are view comments.

So you want to write a fast Python?

Posted July 10th, 2011. Tagged with pypy, programming, python.

Thinking about writing your own Python implementation? Congrats, there are plenty out there [1], but perhaps you have something new to bring to the table. Writing a fast Python is a pretty hard task, and there's a lot of stuff you need to keep in mind, but if you're interested in forging ahead, keep reading!

First, you'll need to write yourself an interpreter. A static compiler for Python doesn't have enough information to do the right things [2] [3], and a multi-stage JIT compiler is probably more trouble than it's worth [4]. It doesn't need to be super fast, but it should be within 2x of CPython or so, or you'll have lost too much ground to make up later. You'll probably need to write yourself a garbage collector as well, it should probably be a nice, generational collector [5].

Next you'll need implementations for all the builtins. Be careful here! You need to be every bit as good as CPython's algorithms if you want to stand a chance, this means things like list.sort() keeping up with Timsort [6], str.__contains__ keeping up with fast search [7], and dict.__getitem__ keeping up with the extremely carefully optimized Python dict [8].

Now you've got the core language, take a bow, most people don't make it nearly this far! However, there's still tons of work to go, for example you need the standard library if you want people to actually use this thing. A lot of the stdlib is in Python, so you can just copy that, but some stuff isn't, for that you'll need to reimplement it yourself (you can "cheat" on a lot of stuff and just write it in Python though, rather than C, or whatever language your interpreter is written in).

At this point you should have yourself a complete Python that's basically a drop-in replacement for CPython, but that's a bit slower. Now it's time for the real work to begin. You need to write a Just in Time compiler, and it needs to be a good one. You'll need a great optimizer that can simultaneously understand some of the high level semantics of Python, as well as the low level nitty gritty of your CPU [9].

If you've gotten this far, you deserve a round of applause, not many projects make it this far. But your Python probably still isn't going to be used by the world, you may execute Python code 10x faster, but the Python community is more demanding than that. If you want people to really use this thing you're going to have to make sure their C extensions run. Sure, CPython's C-API was never designed to be run on other platforms, but you can make it work, even if it's not super fast, it might be enough for some people [10].

Finally, remember that standard library you wrote earlier? Did you make sure to take your time to optimize it? You're probably going to need to take a step back and do that now, sure it's huge, and people use every nook and cranny of it, but if you want to be faster, you need it to be faster too. It won't do to have your bz2 module be slower, tarnishing your beautiful speed results [11].

Still with me? Congratulations, you're in a class of your own. You've got a blazing fast Python, a nicely optimized standard library, and you can run anyone's code, Python or C. If this ballad sounds a little familiar, that's because it is, it's the story of PyPy. If you think this was a fun journey, you can join in. There are ways for Python programmers at every level to help us, such as:

  • Contributing to our performance analysis tool, this is actually a web app written using Flask.
  • Contribute to speed.pypy.org which is a Django site.
  • Provide pure Python versions of your C-extensions, to ensure they run on alternative Pythons.
  • Test and benchmark your code on PyPy, let us know if you think we should be faster! (We're always interested in slower code, and we consider it a bug)
  • Contribute to PyPy itself, we've got tons of things to do, you could work on the standard library, the JIT compiler, the GC, or anything in between.

Hope to see you soon [12]!

[1]CPython, IronPython, Jython, PyPy, at least!
[2]http://code.google.com/p/shedskin/
[3]http://cython.org/
[4]http://code.google.com/p/v8/
[5]http://docs.python.org/c-api/refcounting.html
[6]http://hg.python.org/cpython/file/2.7/Objects/listsort.txt
[7]http://effbot.org/zone/stringlib.htm
[8]http://hg.python.org/cpython/file/2.7/Objects/dictnotes.txt
[9]http://code.google.com/p/unladen-swallow/
[10]http://code.google.com/p/ironclad/
[11]https://bugs.pypy.org/issue733
[12]http://pypy.org/contact.html

You can find the rest here. There are view comments.

DjangoCon Europe 2011 Slides

Posted June 7th, 2011. Tagged with talk, djanogcon, pypy, python, django.

I gave a talk at DjangoCon Europe 2011 on using Django and PyPy (with another talk to be delivered tomorrow!), you can get the slides right on bitbucket, and for those who saw the talk, the PyPy compatibility wiki is here.

You can find the rest here. There are view comments.

This Summer

Posted May 6th, 2011. Tagged with pypy, python, self.

For the past nearly two years I've been an independent contractor working primarily for Eldarion, and it's been fantastic, I can't say enough good things. However, this summer I'm shaking things up a bit and heading west. I'll be an intern at Quora this summer. They're a Python shop, and it's going to be my job to make everything zippy. Specifically I'll be making the site run on PyPy, and then tuning the hell out of both their codebase and PyPy itself. I'm super excited about the chance to get a major site running on PyPy, and to contribute as many performance improvements upstream as possible.

You can find the rest here. There are view comments.

My experience with the computer language shootout

Posted April 3rd, 2011. Tagged with pypy, programming, python, programming-languages.

For a long time we, the PyPy developers, have known the Python implementations on the Computer Language Shootout were not optimal under PyPy, and in fact had been ruthlessly microoptimized for CPython, to the detriment of PyPy. But we didn't really care or do anything about it, because we figured those weren't really representative of what people like to do with Python, and who really cares what it says, CPython is over 30 times slower than C, and people use it just the same. However, I've recently have a number of discussions about language implementation speed and people tend to cite the language shootout as the definitive source for cross-language comparisons. So I decided to see what I could do about making it faster.

The first benchmark I took a stab at was reverse-complement, PyPy was doing crappily on it, and it was super obviously optimized for CPython: every loop possible was pushed down into functions known to be implemented in C, various memory allocation tricks are played (e.g. del some_list[:] removes the contents of the list, but doesn't deallocate the memory), and bound method allocation is pulled out of loops. The first one is the most important for PyPy, on PyPy your objective is generally to make sure your hot loops are in Python, the exact opposite of what you want on CPython. So I started coding up my own version, optimized for PyPy, I spent some time with our debugging and profiling tools, and whipped up a nice implementation that was something like 3x faster than the current one on PyPy, which you can see here. Generally the objective here was to make sure the program does as little memory allocation in the hot loops as possible, all of which are in Python. Try that with your average interpreter.

So I went ahead and submitted it, thinking PyPy would be looking 3 times better when I woke up. Naturally I wake up to an email from the shootout, which says that I should provide a Python 3 implementation, and that it doesn't work on CPython. What the hell? I try to run it myself and indeed it doesn't. It turns out on CPython sys.stdout.write(buffer(array.array("c"), 0, idx)) raises an exception. Which is a tad unfortunate because it should be an easy way to print out part of an array of characters without needing to allocate memory. After speaking with some CPython core developers, it appears that it is indeed a bug in CPython. And I noticed on PyPy buffer objects aren't nearly as efficient as they should be, so I set out in search of a new way to work on CPython and PyPy, and be faster if possible. I happened to stuble across the method array.buffer_info which returns a tuple of the memory address of the array's internal storage and its length, and a brilliant hack occurred to me: use ctypes to call libc's write() function. I coded it up, and indeed it worked on PyPy and CPython and was 40% faster on PyPy to boot. Fantastic I thought, I'll just submit this and PyPy will look rocking! Only 3.5x slower than C, not bad for an interpreter, in a language that is notoriously hard to optimize. You can see the implementation right here, it contains a few other performance tricks as well, but nothing too exciting.

So I submitted this, thinking, "Aha! I've done it". Shortly, I had an email saying this has been accepted as an "interesting alternative" because it used ctypes, which is to say it won't be included in the cumulative timings for each implementation, nor will it be listed with the normal implementations for the per-benchmark scores. Well crap, that's no good, the whole point of this was to look good, what's the point if no one is going to see this glorious work. So I sent a message asking why this implementation was considered alternative, since it appeared fairly legitimate. I received a confusing message questioning why this optimization was necessary, followed by a suggestion that perhaps PyPy wasn't compatible enough with (with what I dare not ask, but the answer obviously isn't Python the abstract language, since CPython had the bug!).

Overall it was a pretty crappy experience. The language shootout appears to be governed by arbitrary rules. For example the C implementations use GCC builtins, which are not part of the C standard, making them not implementation portable. The CPython pidigits version uses a C extension which is obviously not implementation portable (by comparison every major Python implementation includes ctypes, only CPython, and to varying extents IronPython and PyPy, support the CPython C-API), although here PyPy was allowed to use ctypes. It's also not possible to send any messages once your ticket has been marked as closed, meaning to dispute a decision you basically need to pray the maintainer reopens it for some reason. The full back and forth is available here. I'm still interested in improving the PyPy submissions there (and further optimizing PyPy where needed). However given the seemingly capricious and painful submission process I'm not really inclined to continue work, nor can I take the shootout seriously as an honest comparison of languages.

You can find the rest here. There are view comments.

PyPy San Francisco Tour Recap

Posted March 9th, 2011. Tagged with python, pypy.

Yesterday was the last day of my (our) tour of the Bay Area, so I figured I'd post a recap of all the cool stuff that happened.

Sprints

I got in on Friday night (technically Saturday morning, thanks for the ride Noah!) and on Saturday and Sunday we held sprints at Noisebridge. They have a very cool location and we had some pretty productive sprints. The turnout was less than expected, based on the people who said they'd show at the Python user group, however we were pretty productive. We fixed a number of remaining issues before we can do a Python 2.7 compatible release. These included: Armin Rigo and I implementing PYTHONIOENCODING (look it up!), Dan Roberts fixing the sqlite3 tests, and Armin fixing a subtle JIT bug. I also spent some time doing some performance profiling with Greg from Quora. Importing pytz was incredibly slow on PyPy, and it turned out the issue was when pytz was in a .egg its attempts to load the timezone files cause repeated reads from the ZIP file, and PyPy wasn't caching the zipfile metadata properly. So we fixed that and now it's much faster.

Google

Monday morning we (Armin Rigo, Maciej Fijalkowski, and myself) gave a tech talk at Google. The first thing I noticed was the Google campus is obscenely gorgeous, and large. Unfortunately, our talk didn't seem to go very well. Part of this was we were unsure if our audience was Python developers looking to make their code run faster, or compiler people who wanted to hear all the gory details of our internals. Even now I am still a little unsure about who showed up to our talk, but they didn't seem very enthused. I'm hoping we can construct some sort of post-mortem on the talk, because Google is precisely the type of company with a lot of Python code and performance aspirations that we think we can help with. (And Google's cafeterias are quite delicious).

Mozilla

After lunch at Google we shuffled over to Mozilla's offices for another talk. The slides we delivered were the same as the Google ones, however this talk went over much better. Our audience was primarily compiler people (who work on the Tracemonkey/Jaegermonkey/other Javascript engines), with a few Python developers who were interested. We had a ton of good questions during and after the talk, and then hung around for some great discussions with Brendan Eich, Andreas Gal, David Mandelin, and Chris Leary, kudos to all those guys. Some of the major topics were:

  • Whether or not we had issues of trace explosion (over specializing traces and thus compiling a huge number of paths that were in practice rarely executed), and why not.
  • The nature of the code we have to optimize, a good bit of real world Python is mostly idiomatic, whereas Mozilla really has to deal with a problem of making any old crap on the internet fast.
  • The fact that for Javascript script execution time is generally very short, whereas the Python applications we target tend to have longer execution times. While we still work to startup quickly (e.g. no 5 minute JVM startup times), it's less of an issue if it takes 30 seconds before the code starts running at top speed. For us this means that targeting browser Javascript is possibly less sensible than server-side Javascript for a possible Javascript VM written on top of our infrastructure.

Overall it was a very positive experience, and I'll be excited to speak with David some more at the VM summit at PyCon.

Recap

Overall the trip was a big success in my view. I believe both the Google and Mozilla talks were recorded, so once I know where they are online hopefully other people can enjoy the talks. Hopefully Armin will blog about some of the talks he gave before I got to the Bay Area. See you at PyCon!

You can find the rest here. There are view comments.

PyOhio Slides

Posted August 2nd, 2010. Tagged with pypy, talk, unladenswallow, python, pyohio.

I've (finally) uploaded the slides from my PyOhio talk (I gave a similar talk at ChiPy). You can get them right here, I'll be putting all my slides on Scribd from here on out, they were much easier to upload to than SlideShare, plus HTML5 is awesome!

You can find the rest here. There are view comments.

PyPy is the Future of Python

Posted May 15th, 2010. Tagged with python, pypy.

Currently the most common implementation of Python is known as CPython, and it's the version of Python you get at python.org, probably 99.9% of Python developers are using it. However, I think over the next couple of years we're going to see a move away from this towards PyPy, Python written in Python. This is going to happen because PyPy offers better speed, more flexibility, and is a better platform for Python's growth, and the most important thing is you can make this transition happen.

The first thing to consider: speed. PyPy is a lot faster than CPython for a lot of tasks, and they've got the benchmarks to prove it. There's room for improvement, but it's clear that for a lot of benchmarks PyPy screams, and it's not just number crunching (although PyPy is good at that too). Although Python performance might not be a bottleneck for a lot of us (especially us web developers who like to push performance down the stack to our database), would you say no to having your code run 2x faster?

The next factor is the flexibility. By writing their interpreter in RPython PyPy can automatically generate C code (like CPython), but also JVM and .NET versions of the interpreter. Instead of writing entirely separate Jython and IronPython implementations of Python, just automatically generate them from one shared codebase. PyPy can also have its binary generated with a stackless option, just like stackless Python, again no separate implementations to maintain. Lastly, PyPy's JIT is almost totally separate from the interpreter, this means changes to the language itself can be made without needing to update the JIT, contrast this with many JITs that need to statically define fast-paths for various operations.

And finally that it's a better platform for growth. The last point is a good example of this: one can keep the speed from the JIT while making changes to the language, you don't need to be an assembly expert to write a new bytecode, or play with the builtin types, the JIT generator takes care of it for you. Also, it's written in Python, it may be RPython which isn't as high level as regular Python, but compare the implementations of of map from CPython and PyPy:

static PyObject *
builtin_map(PyObject *self, PyObject *args)
{
    typedef struct {
        PyObject *it;           /* the iterator object */
        int saw_StopIteration;  /* bool:  did the iterator end? */
    } sequence;

    PyObject *func, *result;
    sequence *seqs = NULL, *sqp;
    Py_ssize_t n, len;
    register int i, j;

    n = PyTuple_Size(args);
    if (n < 2) {
        PyErr_SetString(PyExc_TypeError,
                        "map() requires at least two args");
        return NULL;
    }

    func = PyTuple_GetItem(args, 0);
    n--;

    if (func == Py_None) {
        if (PyErr_WarnPy3k("map(None, ...) not supported in 3.x; "
                           "use list(...)", 1) < 0)
            return NULL;
        if (n == 1) {
            /* map(None, S) is the same as list(S). */
            return PySequence_List(PyTuple_GetItem(args, 1));
        }
    }

    /* Get space for sequence descriptors.  Must NULL out the iterator
     * pointers so that jumping to Fail_2 later doesn't see trash.
     */
    if ((seqs = PyMem_NEW(sequence, n)) == NULL) {
        PyErr_NoMemory();
        return NULL;
    }
    for (i = 0; i < n; ++i) {
        seqs[i].it = (PyObject*)NULL;
        seqs[i].saw_StopIteration = 0;
    }

    /* Do a first pass to obtain iterators for the arguments, and set len
     * to the largest of their lengths.
     */
    len = 0;
    for (i = 0, sqp = seqs; i < n; ++i, ++sqp) {
        PyObject *curseq;
        Py_ssize_t curlen;

        /* Get iterator. */
        curseq = PyTuple_GetItem(args, i+1);
        sqp->it = PyObject_GetIter(curseq);
        if (sqp->it == NULL) {
            static char errmsg[] =
                "argument %d to map() must support iteration";
            char errbuf[sizeof(errmsg) + 25];
            PyOS_snprintf(errbuf, sizeof(errbuf), errmsg, i+2);
            PyErr_SetString(PyExc_TypeError, errbuf);
            goto Fail_2;
        }

        /* Update len. */
        curlen = _PyObject_LengthHint(curseq, 8);
        if (curlen > len)
            len = curlen;
    }

    /* Get space for the result list. */
    if ((result = (PyObject *) PyList_New(len)) == NULL)
        goto Fail_2;

    /* Iterate over the sequences until all have stopped. */
    for (i = 0; ; ++i) {
        PyObject *alist, *item=NULL, *value;
        int numactive = 0;

        if (func == Py_None && n == 1)
            alist = NULL;
        else if ((alist = PyTuple_New(n)) == NULL)
            goto Fail_1;

        for (j = 0, sqp = seqs; j < n; ++j, ++sqp) {
            if (sqp->saw_StopIteration) {
                Py_INCREF(Py_None);
                item = Py_None;
            }
            else {
                item = PyIter_Next(sqp->it);
                if (item)
                    ++numactive;
                else {
                    if (PyErr_Occurred()) {
                        Py_XDECREF(alist);
                        goto Fail_1;
                    }
                    Py_INCREF(Py_None);
                    item = Py_None;
                    sqp->saw_StopIteration = 1;
                }
            }
            if (alist)
                PyTuple_SET_ITEM(alist, j, item);
            else
                break;
        }

        if (!alist)
            alist = item;

        if (numactive == 0) {
            Py_DECREF(alist);
            break;
        }

        if (func == Py_None)
            value = alist;
        else {
            value = PyEval_CallObject(func, alist);
            Py_DECREF(alist);
            if (value == NULL)
                goto Fail_1;
        }
        if (i >= len) {
            int status = PyList_Append(result, value);
            Py_DECREF(value);
            if (status < 0)
                goto Fail_1;
        }
        else if (PyList_SetItem(result, i, value) < 0)
            goto Fail_1;
    }

    if (i < len && PyList_SetSlice(result, i, len, NULL) < 0)
        goto Fail_1;

    goto Succeed;

Fail_1:
    Py_DECREF(result);
Fail_2:
    result = NULL;
Succeed:
    assert(seqs);
    for (i = 0; i < n; ++i)
        Py_XDECREF(seqs[i].it);
    PyMem_DEL(seqs);
    return result;
}

That's a lot of code! It wouldn't be bad, for C code, except for the fact that there's far too much boilerplate: every single call into the C-API needs to check for an exception, and INCREF and DECREF calls are littered throughout the code. Compare this with PyPy's RPython implementation:

def map(space, w_func, collections_w):
    """does 3 separate things, hence this enormous docstring.
       1.  if function is None, return a list of tuples, each with one
           item from each collection.  If the collections have different
           lengths,  shorter ones are padded with None.

       2.  if function is not None, and there is only one collection,
           apply function to every item in the collection and return a
           list of the results.

       3.  if function is not None, and there are several collections,
           repeatedly call the function with one argument from each
           collection.  If the collections have different lengths,
           shorter ones are padded with None
    """
    if not collections_w:
        msg = "map() requires at least two arguments"
        raise OperationError(space.w_TypeError, space.wrap(msg))
    num_collections = len(collections_w)
    none_func = space.is_w(w_func, space.w_None)
    if none_func and num_collections == 1:
        return space.call_function(space.w_list, collections_w[0])
    result_w = []
    iterators_w = [space.iter(w_seq) for w_seq in collections_w]
    num_iterators = len(iterators_w)
    while True:
        cont = False
        args_w = [space.w_None] * num_iterators
        for i in range(len(iterators_w)):
            try:
                args_w[i] = space.next(iterators_w[i])
            except OperationError, e:
                if not e.match(space, space.w_StopIteration):
                    raise
            else:
                cont = True
        w_args = space.newtuple(args_w)
        if cont:
            if none_func:
                result_w.append(w_args)
            else:
                w_res = space.call(w_func, w_args)
                result_w.append(w_res)
        else:
            return space.newlist(result_w)
map.unwrap_spec = [ObjSpace, W_Root, "args_w"]

It's not exactly what you'd write for a pure Python implementation of map, but it's a hell of a lot closer than the C version.

The case for PyPy being the future is strong, I think, however it's not all sunshine are roses, there are a few issues. It lags behind CPython's version (right now Python 2.5 is implemented), C extension compatibility isn't there yet, and not enough people are trying it out yet. But PyPy is getting there, and you can help.

Right now the single biggest way to help for most people is to test their code. Any pure Python code targeting Python 2.5 should run perfectly under PyPy, and if it doesn't: it's a bug, if it's slower than Python: let us know (unless it involves re, we know it's slow). Maybe try out your C-extensions, however cpyext is very alpha and even a segfault isn't surprising (but let us know so we can investigate). Of course help on development is always appreciated, right now most of the effort is going into speeding up the JIT even more, however I believe there is also going to be work on moving up to Python 2.7 (currently pre-release) this summer. If you're interested in helping out with either you should hop into #pypy on irc.freenode.net, or send a message to pypy-dev. PyPy's doing good work, Python doesn't need to be slow, and we don't all need to write C code!

You can find the rest here. There are view comments.

Making Django and PyPy Play Nice (Part 1)

Posted April 16th, 2010. Tagged with pypy, django, python.

If you track Django's commits aggressivly (ok so just me...), you may have noticed that there have been a number of commits to improve the compatibility of Django and PyPy in the last couple of days. In the run up to Django 1.0 there were a ton of commits to make sure Django didn't have too many assumptions that it was running on CPython, and that it could work with systems like Jython and PyPy. Unfortunately, since then, our support has laxed a little, and a number of tests have begun to fail. In the past couple of days I've been working to correct this. Here are some of the things that were wrong

The first issue I ran into was, in various tests, response.context and response.template being None, instead of the lists that were expected. This was a pain to diagnose, but the ultimate source of the bug is that Django registers a signal handler in the test client that listens for templates being rendered. However, it doesn't actually unregister that signal receiver. Instead it relies on the fact that signals are stored as weakrefs, and when the function ends, the receivers that were registered (which were local variables) would be automatically deallocated. On PyPy, Jython, and any other system with a garbage collector more advanced than CPython's reference counting, the local variables aren't guaranteed to be deallocated at the end of the function, and therefore the weakref can still be alive. Truthfully, I'm not 100% sure how this results in the next signal being sent not to store the appropriate data, but it does. The solution is the make sure that the signals are manually disconnected at the end of the run. This was fixed in r12964 of Django.

The next issue was actually a problem in PyPy, specifically it was crashing with a UnicodeDecodeError. When I say crashing I mean crashing in the C sense of the word, not the Python, nice exception and stack trace sense... sort of. PyPy is written in a language named RPython, RPython is Pythonesque (all valid RPython is valid Python), and has exceptions. However if they aren't caught they sort of propagate to the top, they give you a kind-of-ok stacktrace, but its function names are all the generated function names from the C source, not useful ones from the RPython source. Internally, PyPy uses an OperationError to keep track of exceptions at the interpreter level. A trick to debugging RPython is, if running the code on top of CPython works, than running it translated to C will work, and the contrapositive appears true as well, if the C doesn't work, running on CPython won't work. After trying to run the code on CPython, the location of the exception bubbled right to the top, and the fix followed easily.

These are the first two issues I fixed, a couple others have been fixed and committed, and a further few have also been fixed, but not committed yet. I'll be writing about those as I find the time.

You can find the rest here. There are view comments.

Languages Don't Have Speeds, Or Do They?

Posted March 15th, 2010. Tagged with pypy, python, compiler, programming-languages, psyco.

Everyone knows languages don't have speeds, implementations do. Python isn't slow; CPython is. Javascript isn't fast; V8, Squirrelfish, and Tracemonkey are. But what if a language was designed in such a way that it appeared that it was impossible to be implemented efficiently? Would it be fair to say that language is slow, or would we still have to speak in terms of implementations? For a long time I followed the conventional wisdom, that languages didn't have speeds, but lately I've come to believe that we can learn something by thinking about what the limits on how fast a language could possibly be, given a perfect implementation.

For example consider the following Python function:

def f(n):
    i = 0
    while i < n:
        i += 1
        n += i
    return n

And the equivilant C function:

int f(int n) {
    int i = 0;
    while (i < n) {
        i += 1;
        n += i;
    }
    return n;
}

CPython probably runs this code 100 times slower than the GCC compiled version of the C code. But we all know CPython is slow right? PyPy or Psyco probably runs this code 2.5 times slower than the C version (I'm just spitballing here). Psyco and PyPy are, and contain, really good just in time compilers that can profile this code, see that f is always called with an integer, and therefore a much more optimized version can be generated in assembly. For example the optimized version could generate just a few add instructions in the inner loop (plus a few more instructions to check for overflow), this would skip all the indirection of calling the __add__ function on integers, allocating the result on the heap, and the indirection of calling the __lt__ function on integers, and maybe even some other things I missed.

But there's one thing no JIT for Python can do, no matter how brilliant. It can't skip the check if n is an integer, because it can't prove it always will be an integer, someone else could import this function and call it with strings, or some custom type, or anything they felt like, so the JIT must verify that n is an integer before running the optimized version. C doesn't have to do that. It knows that n will always be an integer, even if you do nothing but call f until the end of the earth, GCC can be 100% positive that n is an integer.

The absolute need for at least a few guards in the resulting assembly guarantees that even the most perfect JIT compiler ever, could not generate code that was strictly faster than the GCC version. Of course, a JIT has some advantages over a static compiler, for example, it can inline at dynamic call sites. However, in practice I don't believe this ability is ever likely to beat a static compiler for a real world program. On the other hand I'm not going to stop using Python any time soon, and it's going to continue to get faster, a lot faster.

You can find the rest here. There are view comments.

PyCon Roundup - Days 2-4

Posted March 8th, 2010. Tagged with pypy, pycon, unladen-swallow.

As I said in my last post PyCon was completely and utterly awesome, but it was also a bit of a blur. Here's my attempt at summarizing everything that happened during the conference itself.

Day 2

Day 2 (the first day of the conference itself) started out with Guido's keynote. He did something rather unorthodox, instead of delivering a formal talk he just took audience questions via Twitter, starting out using Twitterfall, but it was a tad slow so Jacob Kaplan-Moss switched it out for the PyCon Live Stream that Eric Florenzano, Brian Rosner, and I built, switch was very awesome (seeing your creation on four projectors in front of an audience is a pretty good ego boost). Next up I had to deliver my own talk, it went well I think, you can find my slides and a video on the PyCon website. The rest of the day was a bit of a blur, but I enjoyed James Bennet's talk, the Form generator's panel, Jonathan Ellis's talk on database scalability, and Alex Martelli's talk.

That evening I got to visit the Django restaurant, I can't help but imagine what the staff there thought about the crazy groups of programmers visiting (I think we mobbed the place every night of the conference), many of us wearing our Django shirts.

Day 3

Day 3 was really about the VMs. It started with the Iron Python and PyPy keynotes. These were followed by Mark Shuttleworth's keynote, his slides weren't working (the perils of using an operating system alpha release), but he still delivered an awesome talk on software development processes. From there I went to the single best stretch of talks at PyCon. The Speed of PyPy, Unladen Swallow: fewer coconuts, faster Python, and Understanding the Python GIL. Three great topics, three great speakers, three great talks, one room. Simply put, you should drop whatever you're doing and watch each of these talks, they're all great. I was sad to miss Raymond Hettinger's talk for the Django Software Foundation meeting, but I caught the video and it was excellent (as usual for Raymond). The DSF meeting was interesting, but not super exciting, a lot of "action needed" tasks, some of which are already happening (like getting better buildbots running). Finally I caught the tail end of the Neo4j talk, I'd need to watch the full thing, because the ending caught my attention.

I ended up back at the Django restaurant again, this time for the speakers, volunteers, and sponsors dinner. Django wasn't quite equipped to handle the sheer number of us that showed us, but it was a good time nonetheless. Once again I was blown away by how many Google people there were. There were far too many awesome people to list, but suffice it to say that you should volunteer or speak at PyCon, if for no other reason than to get to attend this dinner, there are awesome people to have a conversation with as far as the eyes can see! After dinner I ended up staying up far too late with another group of awesome people. I'm told the testing birds of a feather was also great (I suppose events that aren't completely awesome don't invent things like the Testing Goat).

Day 4

Here I suffered the effects of the aforementioned late night. I missed all of the keynotes, which is unfortunately considering they all looked quite good, I'll have to catch the videos. The poster session was very cool, I only got to see about half of it, but it's definitely something I'm going to look forward to at future PyCons. Next I saw Donovan Preston's talk on eventlet, a talk on teaching compilers with Python. I missed Scott Chacon's talk on hg-git, though I can't remember what for. I'll have to catch it in video because I was really looking forward to it.

After that there was time for a final pizza with friends, and then I had to head home. I sprinted remotely, but it's not quite the same as being there. It's my hope that next year I can attend the sprints in person, but school never seems to work out for me in that respect.

You can find the rest here. There are view comments.

Committer Models of Unladen Swallow, PyPy, and Django

Posted February 25th, 2010. Tagged with pypy, python, unladen-swallow, django, open-source.

During this year's PyCon I became a committer on both PyPy and Unladen Swallow, in addition I've been a contributer to Django for quite a long time (as well as having commit privileges to my branch during the Google Summer of Code). One of the things I've observed is the very different models these projects have for granting commit privileges, and what the expectations and responsibilities are for committers.

Unladen Swallow

Unladen Swallow is a Google funded branch of CPython focused on speed. One of the things I've found is that the developers of this project carry over some of the development process from Google, specifically doing code review on every patch. All patches are posted to Rietveld, and reviewed, often by multiple people in the case of large patches, before being committed. Because there is a high level of review it is possible to grant commit privileges to people without requiring perfection in their patches, as long as they follow the review process the project is well insulated against a bad patch.

PyPy

PyPy is also an implementation of Python, however its development model is based largely around aggressive branching (I've never seen a project handle SVN's branching failures as well as PyPy) as well as sprints and pair programming. By branching aggressively PyPy avoids the overhead of reviewing every single patch, and instead only requires review when something is already believed to be "trunk-ready", further this model encourages experimentation (in the same way git's light weight branches do). PyPy's use of sprints and pair programming are two ways to avoid formal code reviews and instead approach code quality as more of a collaborative effort.

Django

Django is the project I've been involved with for the longest, and also the only one I don't have commit privileges on. Django is extremely conservative in giving out commit privileges (there about a dozen Django committers, and about 500 names in the AUTHORS file). Django's development model is based neither on branching (only changes as large in scope as multiple database support, or an admin UI refactor get their own branch) nor on code review (most commits are reviewed by no one besides the person who commits them). Django's committers maintain a level of autonomy that isn't seen in either of the other two projects. This fact comes from the period before Django 1.0 was released when Django's trunk was often used in production, and the need to keep it stable at all times, combined with the fact that Django has no paid developers who can guarantee time to do code review on patches. Therefore Django has maintained code quality by being extremely conservative in granting commit privileges and allowing developers with commit privileges to exercise their own judgment at all times.

Conclusion

Each of these projects uses different methods for maintaining code quality, and all seem to be successful in doing so. It's not clear whether there's any one model that's better than the others, or that any of these projects could work with another's model. Lastly, it's worth noting that all of these models are fairly orthogonal to the centralized VCS vs. DVCS debate which often surrounds such discussions.

You can find the rest here. There are view comments.

A Bit of Benchmarking

Posted November 22nd, 2009. Tagged with pypy, python, django, compiler, programming-languages.

PyPy recently posted some interesting benchmarks from the computer language shootout, and in my last post about Unladen Swallow I described a patch that would hopefully be landing soon. I decided it would be interesting to benchmarks something with this patch. For this I used James Tauber's Mandelbulb application, at both 100x100 and 200x200. I tested CPython, Unladen Swallow Trunk, Unladen Swallow Trunk with the patch, and a recent PyPy trunk (compiled with the JIT). My results were as follows:

VM 100 200
CPython 2.6.4 17s 64s
Unladen Swallow Trunk 16s 52s
Unladen swallow Trunk + Patch 13s 49s
PyPy Trunk 10s 46s

Interesting results. At 100x100 PyPy smokes everything else, and the patch shows a clear benefit for Unladen. However, at 200x200 both PyPy and the patch show diminishing returns. I'm not clear on why this is, but my guess is that something about the increased size causes a change in the parameters that makes the generated code less efficient for some reason.

It's important to note that Unladen Swallow has been far less focussed on numeric benchmarks than PyPy, instead focusing on more web app concerns (like template languages). I plan to benchmark some of these as time goes on, particularly after PyPy merges their "faster-raise" branch, which I'm told improves PyPy's performance on Django's template language dramatically.

You can find the rest here. There are view comments.

Things College Taught me that the "Real World" Didn't

Posted November 21st, 2009. Tagged with pypy, parse, python, unladen-swallow, compile, django, ply, programming-languages, c++, response, lex, compiler, yacc, college.

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.

Things I learned in college:

  • Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
  • C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
  • Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).

Things I learned in the real world:

  • Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
  • Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
  • APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.

So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>

You can find the rest here. There are view comments.

Another Pair of Unladen Swallow Optimizations

Posted November 19th, 2009. Tagged with pypy, python, unladen-swallow, django, programming-languages.

Today a patch of mine was committed to Unladen Swallow. In the past weeks I've described some of the optimizations that have gone into Unladen Swallow, in specific I looked at removing the allocation of an argument tuple for C functions. One of the "on the horizon" things I mentioned was extending this to functions with a variable arity (that is the number of arguments they take can change). This has been implemented for functions that take a finite range of argument numbers (that is, they don't take *args, they just have a few arguments with defaults). This support was used to optimize a number of builtin functions (dict.get, list.pop, getattr for example).

However, there were still a number of functions that weren't updated for this support. I initially started porting any functions I saw, but it wasn't a totally mechanical translation so I decided to do a little profiling to better direct my efforts. I started by using the cProfile module to see what functions were called most frequently in Unladen Swallow's Django template benchmark. Imagine my surprise when I saw that unicode.encode was called over 300,000 times! A quick look at that function showed that it was a perfect contender for this optimization, it was currently designated as a METH_VARARGS, but in fact it's argument count was a finite range. After about of dozen lines of code, to change the argument parsing, I ran the benchmark again, comparing it a control version of Unladen Swallow, and it showed a consistent 3-6% speedup on the Django benchmark. Not bad for 30 minutes of work.

Another optimization I want to look at, which hasn't landed yet, is one of optimize various operations. Right now Unladen Swallow tracks various data about the types seen in the interpreter loop, however for various operators this data isn't actually used. What this patch does is check at JIT compilation time whether the operator site is monomorphic (that is there is only one pair of types ever seen there), and if it is, and it is one of a few pairings that we have optimizations for (int + int, list[int], float - float for example) then optimized code is emitted. This optimized code checks the types of both the arguments that they are the expected ones, if they are then the optimized code is executed, otherwise the VM bails back to the interpreter (various literature has shown that a single compiled optimized path is better than compiling both the fast and slow paths). For simple algorithm code this optimization can show huge improvements.

The PyPy project has recently blogged about the results of the results of some benchmarks from the Computer Language Shootout run on PyPy, Unladen Swallow, and CPython. In these benchmarks Unladen Swallow showed that for highly algorithmic code (read: mathy) it could use some work, hopefully patches like this can help improve the situation markedly. Once this patch lands I'm going to rerun these benchmarks to see how Unladen Swallow improves, I'm also going to add in some of the more macro benchmarks Unladen Swallow uses to see how it compares with PyPy in those. Either way, seeing the tremendous improvements PyPy and Unladen Swallow have over CPython gives me tremendous hope for the future.

You can find the rest here. There are view comments.

Optimising compilers are there so that you can be a better programmer

Posted October 10th, 2009. Tagged with pypy, python, unladen-swallow, django, compiler.

In a discussion on the Django developers mailing list I recently commented that the performance impact of having logging infrastructure, in the case where the user doesn't want the logging, could essentially be disregarded because Unladen Swallow (and PyPy) are bringing us a proper optimising (Just in Time) compiler that would essentially remove that consideration. Shortly thereafter someone asked me if I really thought it was the job of the interpreter/compiler to make us not think about performance. And my answer is: the job of a compiler is to let me program with best practices and not suffer performance consequences for doing things the right way.

Let us consider the most common compiler optimisations. A relatively simple one is function inlining, in the case where including the body of the function would be more efficient than actually calling it, a compiler can simply move the functions body into its caller. However, we can actually do this optimisation in our own code. We could rewrite:

def times_2(x):
    return x * 2

def do_some_stuff(i):
    for x in i:
        # stuff
        z = times_2(x)
    # more stuff

as:

def do_some_stuff(i):
    for x in i:
        # stuff
        z = x * 2
    # more stuff

And this is a trivial change to make. However in the case where times_2 is slightly less trivial, and is used a lot in our codebase it would be exceptionally more programming practice to repeat this logic all over the place, what if we needed to change it down the road? Then we'd have to review our entire codebase to make sure we changed it everywhere. Needless to say that would suck. However, we don't want to give up the performance gain from inlining this function either. So here it's the job of the compiler to make sure functions are inlined when possible, that way we get the best possible performance, as well as allowing us to maintain our clean codebase.

Another common compiler optimisation is to transform multiplications by powers of 2 into binary shifts. Thus x * 2 becomes x << 1 A final optimisation we will consider is constant propagation. Many program have constants that are used throughout the codebase. These are often simple global variables. However, once again, inlining them into methods that use them could provide a significant benefit, by not requiring the code to making a lookup in the global scope whenever they are used. But we really don't want to do that by hand, as it makes our code less readable ("Why are we multiplying this value by this random float?", "You mean pi?", "Oh."), and makes it more difficult to update down the road. Once again our compiler is capable of saving the day, when it can detect a value is a constant it can propagate it throughout the code.

So does all of this mean we should never have to think about writing optimal code, the compiler can solve all problems for us? The answer to this is a resounding no. A compiler isn't going to rewrite your insertion sort into Tim sort, nor is it going to fix the fact that you do 700 SQL queries to render your homepage. What the compiler can do is allow you to maintain good programming practices.

So what does this mean for logging in Django? Fundamentally it means that we shouldn't be concerned with possible overhead from calls that do nothing (in the case where we don't care about the logging) since a good compiler will be able to eliminate those for us. In the case where we actually do want to do something (say write the log to a file) the overhead is unavoidable, we have to open a file and write to it, there's no way to optimise it out.

You can find the rest here. There are view comments.