alex gaynor's blago-blog

Posts tagged with python

There is a flash of light! Your PYTHON has evolved into ...

Posted July 4th, 2014. Tagged with python, open-source.

This year has been marked, for me, by many many discussions of Python versions. Finally, though, I've acquiesced, I've seen the light, and I'm doing what many have suggested. I'm taking the first steps: I'm changing my default Python.

Yes indeed, my global python is now something different:

$ python
Python 2.7.6 (32f35069a16d, Jun 06 2014, 20:12:47)
[PyPy 2.3.1 with GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>>

Yup, my default python is now PyPy.

Why?

Because I believe PyPy is the future, and I want to stamp out every single possible bug and annoyance a person might hit, and the best way to do that is to subject myself to them constantly. If startup performance is too slow, you know for damned sure I'll get pissed off and try to fix it.

I'm only one day into this, but thus far: I've found one bug in Mercurial's setup.py, and lots of my random scripts run faster. But this shouldn't be just me! In today's revolutionary spirit, I want to encourage you too to cast off the shackles of slow Python, and embrace the possibility of performance without compromises!

If you run into any issues at all: packages that won't install, things that are too slow, or take too much memory. You can email me personally. I'm committed to making this the most fantastic Python experience you could ever have.

Technical details

I changed by default Python, or OS X, using pyenv:

$ brew install pyenv
$ # Muck with my fish config
$ pyenv install pypy-2.3.1
$ pyenv global pypy-2.3.1
$ pip install a nice list of utilities I use

Tada.

You can find the rest here. There are view comments.

Quo Vadimus?

Posted May 26th, 2014. Tagged with open-source, community, python.

I've spent just about every single day for the last 6 months doing something with Python 3. Some days it was helping port a library, other days it was helping projects put together their porting strategies, and on others I've written prose on the subject. At this point, I am very very bored of talking about porting, and about the health of our ecosystem.

Most of all, I'm exhausted, particularly from arguing about whether or not the process is going well. So here's what I would like:

I would like to know what the success condition for Python 3 is. If we were writing a test case for this, when would it pass?

And let's do this with objective measures. Here are some ideas I have:

  • Percentage of package downloads from PyPI performed with Python 3 clients
  • Percentage of packages on PyPI which support Python 3
  • Percentage of Python builds on Travis CI which featured a Python 3 builder

I'd like a measurement, and I'd like a schedule: "At present x% of PyPI downloads use Python 3, in 3 months we'd like it to be at y%, in 12 months we'd like it to be at z%". Then we can have some way of judging whether we're on a successful path. And if we miss our goal, we'll know it's time to reevaluate this effort.

Quo vadimus?

You can find the rest here. There are view comments.

Service

Posted May 19th, 2014. Tagged with django, python, open-source, community.

If you've been around an Open Source community for any length of time, you've probably heard someone say, "We're all volunteers here". Often this is given as an explanation for why some feature hasn't been implemented, why a release has been delayed, and in general, why something hasn't happened.

I think when we say these things (and I've said them as much as anyone), often we're being dishonest. Almost always it's not a question of an absolute availability of resources, but rather how we prioritize among the many tasks we could complete. It can explain why we didn't have time to do things, but not why we did them poorly.

Volunteerism does not place us above criticism, nor should it absolve us when we err.

Beyond this however, many Open Source projects (including entirely volunteer driven ones) don't just make their codebases available to others, they actively solicit users, and make the claim that people can depend on this software.

That dependency can take many forms. It usually means an assumption that the software will still exist (and be maintained) tomorrow, that it will handle catastrophic bugs in a reasonable way, that it will be a stable base to build a platform or a business on, and that the software won't act unethically (such as by flagrantly violating expectations about privacy or integrity).

And yet, across a variety of these policy areas, such as security and backwards compatibility we often fail to properly consider the effects of our actions on our users, particularly in a context of "they have bet their businesses on this". Instead we continue to treat these projects as our hobby projects, as things we casually do on the side for fun.

Working on PyCA Cryptography, and security in general, has grealy influenced my thinking on these issues. The nature of cryptography means that when we make mistakes, we put our users' businesses, and potentially their customers' personal information at risk. This responsibility weighs heavily on me. It means we try to have policies that emphasize review, it means we utilize aggressive automated testing, it means we try to design APIs that prevent inadvertent mistakes which affect security, it means we try to write excellent documentation, and it means, should we have a security issue, we'll do everything in our power to protect our users. (I've previous written about what I think Open Source projects' security policies should look like).

Open Source projects of a certain size, scope, and importance need to take seriously the fact that we have an obligation to our users. Whether we are volunteers, or paid, we have a solemn responsibility to consider the impact of our decisions on our users. And too often in the past, we have failed, and acted negligently and recklessly with their trust.

Often folks in the Open Source community (again, myself included!) have asked why large corporations, who use our software, don't give back more. Why don't they employ developers to work on these projects? Why don't they donate money? Why don't they donate other resources (e.g. build servers)?

In truth, my salary is paid by every single user of Python and Django (though Rackspace graciously foots the bill). The software I write for these projects would be worth nothing if it weren't for the community around them, of which a large part is the companies which use them. This community enables me to have a job, to travel the world, and to meet so many people. So while companies, such as Google, don't pay a dime of my salary, I still gain a lot from their usage of Python.

Without our users, we would be nothing, and it's time we started acknowledging a simple truth: our projects exist in service of our users, and not the other way around.

You can find the rest here. There are view comments.

Best of PyCon 2014

Posted April 17th, 2014. Tagged with python, community.

This year was my 7th PyCon, I've been to every one since 2008. The most consistent trend in my attendance has been that over the years, I've gone to fewer and fewer talks, and spent more and more time volunteering. As a result, I can't tell you what the best talks to watch are (though I recommend watching absolutely anything that sounds interesting online). Nonetheless, I wanted to write down the two defining events at PyCon for me.

The first is the swag bag stuffing. This event occurs every year on the Thursday before the conference. Dozens of companies provide swag for PyCon to distribute to our attendees, and we need to get it into over 2,000 bags. This is one of the things that defines the Python community for me. By all rights, this should be terribly boring and monotonous work, but PyCon has turned it into an incredibly fun, and social event. Starting at 11AM, half a dozen of us unpacked box after box from our sponsors, and set the area up. At 3PM, over one hundred volunteers showed up to help us operate the human assembly line, and in less than two and a half hours, we'd filled the bags.

The second event I wanted to highlight was an open space session, on Composition. For over two hours, a few dozen people discussed the problems with inheritance, the need for explicit interface definition, what the most idiomatic ways to use decorators are, and other big picture software engineering topics. We talked about design mistakes we'd all made in our past, and discussed refactoring strategies to improve code.

These events are what make PyCon special for me: community, and technical excellence, in one place.

PS: You should totally watch my two talks. One is about pickle and the other is about performance.

You can find the rest here. There are view comments.

Why Crypto

Posted February 12th, 2014. Tagged with python, open-source.

People who follow me on twitter or github have probably noticed over the past six months or so: I've been talking about, and working on, cryptography a lot. Before this I had basically zero crypto experience. Not a lot of programmers know about cryptography, and many of us (myself included) are frankly a bit scared of it. So how did this happen?

At first it was simple: PyCrypto (probably the most used cryptographic library for Python) didn't work on PyPy, and I needed to perform some simple cryptographic operations on PyPy. Someone else had already started work on a cffi based cryptography library, so I started trying to help out. Unfortunately the maintainer had to stop working on it. At about the same time several other people (some with much more cryptography experience than I) expressed interest in the idea of a new cryptography library for Python, so we got started on it.

It's worth noting that at the same time this was happening, Edward Snowden's disclosures about the NSA's activities were also coming out. While this never directly motivated me to work on cryptography, I also don't think it's a coincidence.

Since then I've been in something of a frenzy, reading and learning everything I can about cryptography. And while originally my motivation was "a thing that works on PyPy", I've now grown considerably more bold:

Programmers are used to being able to pick up domain knowledge as we go. When I worked on a golf website, I learned about how people organized golf outings, when I worked at rdio I learned about music licensing, etc. Programmers will apply their trade to many different domains, so we're used to learning about these different domains with a combination of Google, asking folks for help, and looking at the result of our code and seeing if it looks right.

Unfortunately, this methodology leads us astray: Google for many cryptographic problems leaves you with a pile of wrong answers, very few of us have friends who are cryptography experts to ask for help, and one can't just look at the result of a cryptographic operation and see if it's secure. Security is a property much more subtle than we usually have to deal with:

>>> encrypt(b"a secret message")
b'n frperg zrffntr'

Is the encrypt operation secure? Who knows!

Correctness in this case is dictated by analyzing the algorithms at play, not by looking at the result. And most of us aren't trained by this. In fact we've been actively encouraged not to know how. Programmers are regularly told "don't do your own crypto" and "if you want to do any crypto, talk to a real cryptographer". This culture of ignorance about cryptography hasn't resulted in us all contacting cryptographers, it's resulted in us doing bad crypto:

Usually when we design APIs, our goal is to make it easy to do something. Cryptographic APIs seem to have been designed on the same principle. Unfortunately that something is almost never secure. In fact, with many libraries, the path of least resistance leads you to doing something that is extremely wrong.

So we set out to design a better library, with the following principles:

  • It should never be easier to do the wrong thing than it is to do the right thing.
  • You shouldn't need to be a cryptography expert to use it, our documentation should equip you to make the right decisions.
  • Things which are dangerous should be obviously dangerous, not subtly dangerous.
  • Put our users' safety and security above all else.

I'm very proud of our work so far. You can find our documentation online. We're not done. We have many more types of cryptographic operations left to expose, and more recipes left to write. But the work we've done so far has stayed true to our principles. Please let us know if our documentation ever fails to make something accessible to you.

You can find the rest here. There are view comments.

Why Travis CI is great for the Python community

Posted January 6th, 2014. Tagged with python, open-source.

In the unlikely event you're both reading my blog, and have not heard of Travis CI, it's a CI service which specifically targets open source projects. It integrates nicely with Github, and is generally a pleasure to work with.

I think it's particularly valuable for the Python community, because it makes it easy to test against a variety of Pythons, which maybe you don't have at your fingertips on your own machine, such as Python 3 or PyPy (Editor's note: Why aren't you using PyPy for all the things?).

Travis makes this drop dead simple, in your .travis.yml simply write:

language: python
python:
    - "2.6"
    - "2.7"
    - "3.2"
    - "3.3"
    - "pypy"

And you'll be whisked away into a land of magical cross-Python testing. Or, if like me you're a fan of tox, you can easily run with that:

python: 2.7
env:
    - TOX_ENV=py26
    - TOX_ENV=py27
    - TOX_ENV=py32
    - TOX_ENV=py33
    - TOX_ENV=pypy
    - TOX_ENV=docs
    - TOX_ENV=pep8

script:
    - tox -e $TOX_ENV

This approach makes it easy to include things like linting or checking your docs as well.

Travis is also pretty great because it offers you a workflow. I'm a big fan of code review, and the combination of Travis and Github's pull requests are awesome. For basically every project I work on now, I work in this fashion:

  • Create a branch, write some code, push.
  • Send a pull request.
  • Iteratively code review
  • Check for travis results
  • Merge

And it's fantastic.

Lastly, and perhaps most importantly, Travis CI consistently gets better, without me doing anything.

You can find the rest here. There are view comments.

PyPI Download Statistics

Posted January 3rd, 2014. Tagged with python.

For the past few weeks, I've been spending a bunch of time on a side project, which is to get better insight into who uses packages from PyPI. I don't mean what people, I mean what systems: how many users are on Windows, how many still use Python 2.5, do people install with pip or easy_install, questions like these; which come up all the time for open source projects.

Unfortunately until now there's been basically no way to get this data. So I sat down to solve this, and to do that I went straight to the source. PyPI! Downloads of packages are probably our best source of information about users of packages. So I set up a simple system: process log lines from the web server, parse any information I could out of the logs (user agents have tons of great stuff), and then insert it into a simple PostgreSQL database.

We don't yet have the system in production, but I've started playing with sample datasets, here's my current one:

pypi=> select count(*), min(download_time), max(download_time) from downloads;
  count  |         min         |         max
---------+---------------------+---------------------
 1981765 | 2014-01-02 14:46:42 | 2014-01-03 17:40:04
(1 row)

All of the downloads over the course of about 27 hours. There's a few caveats to the data: it only covers PyPI, packages installed with things like apt-get on Ubuntu/Debian aren't counted. Things like CI servers which frequently install the same package can "inflate" the download count, this isn't a way of directly measuring users. As with all data, knowing how to interpret it and ask good questions is at least as important as having the data.

Eventually I'm looking forwards to making this dataset available to the community; both as a way to ask one off queries ("What version of Python do people install my package with?") and as a whole dataset for running large analysis on ("How long does it take after a release before a new version of Django has widespread uptake?").

Here's a sample query:

pypi=> SELECT
pypi->     substring(python_version from 0 for 4),
pypi->     to_char(100 * COUNT(*)::numeric / (SELECT COUNT(*) FROM downloads), 'FM999.990') || '%' as percent_of_total_downloads
pypi-> FROM downloads
pypi-> GROUP BY
pypi->     substring(python_VERSION from 0 for 4)
pypi-> ORDER BY
pypi->     count(*) DESC;
 substring | percent_of_total_downloads
-----------+----------------------------
 2.7       | 75.533%
 2.6       | 15.960%
           | 5.840%
 3.3       | 2.079%
 3.2       | .350%
 2.5       | .115%
 1.1       | .054%
 2.4       | .052%
 3.4       | .016%
 3.1       | .001%
 2.1       | .000%
 2.0       | .000%
(12 rows)

Here's the schema to give you a sense of what data we have:

                                   Table "public.downloads"
          Column          |            Type             |              Modifiers
--------------------------+-----------------------------+-------------------------------------
 id                       | uuid                        | not null default uuid_generate_v4()
 package_name             | text                        | not null
 package_version          | text                        |
 distribution_type        | distribution_type           |
 python_type              | python_type                 |
 python_release           | text                        |
 python_version           | text                        |
 installer_type           | installer_type              |
 installer_version        | text                        |
 operating_system         | text                        |
 operating_system_version | text                        |
 download_time            | timestamp without time zone | not null
 raw_user_agent           | text                        |

Let your imagination run wild with the questions you can answer now that we have data!

You can find the rest here. There are view comments.

About Python 3

Posted December 30th, 2013. Tagged with python.

Python community, friends, fellow developers, we need to talk. On December 3rd, 2008 Python 3.0 was first released. At the time it was widely said that Python 3 adoption was going to be a long process, it was referred to as a five year process. We've just passed the five year mark.

At the time of Python 3's release, and for years afterwards I was very excited about it, evangelizing it, porting my projects to it, for the past year or two every new projects I've started has had Python 3 support from the get go.

Over the past six months or so, I've been reconsidering this position, and excitement has given way to despair.

For the first few years of the Python 3 migration, the common wisdom was that a few open source projects would need to migrate, and then the flood gates would open. In the Django world, that meant we needed a WSGI specification, we needed database drivers to migrate, and then we could migrate, and then our users could migrate.

By now, all that has happened, Django (and much of the app ecosystem) now supports Python 3, NumPy and the scientific ecosystem supports Python 3, several new releases of Python itself have been released, and users still aren't using it.

Looking at download statistics for the Python Package Index, we can see that Python 3 represents under 2% of package downloads. Worse still, almost no code is written for Python 3. As I said all of my new code supports Python 3, but I run it locally with Python 2, I test it locally with Python 2; Travis CI runs it under Python 3 for me; certainly none of my code is Python 3 only. At companies with large Python code bases I talk to no one is writing Python 3 code, and basically none of them are thinking about migrating their codebases to Python 3.

Since the time of the Python 3.1 it's been regularly said that the new features and additions the standard library would act as carrots to motivate people to upgrade. Don't get me wrong, Python 3.3 has some really cool stuff in it. But 99% of everybody can't actually use it, so when we tell them "that's better in Python 3", we're really telling them "Fuck You", because nothing is getting fixed for them.

Beyond all of this, it has a nasty pernicious effect on the development of Python itself: it means there's no feedback cycle. The fact that Python 3 is being used exclusively by very early adopters means that what little feedback happens on new features comes from users who may not be totally representative of the broader community. And as we get farther and farther in the 3.X series it gets worse and worse. Now we're building features on top of other features and at no level have they been subjected to actual wide usage.

Why aren't people using Python 3?

First, I think it's because of a lack of urgency. Many years ago, before I knew how to program, the decision to have Python 3 releases live in parallel to Python 2 releases was made. In retrospect this was a mistake, it resulted in a complete lack of urgency for the community to move, and the lack of urgency has given way to lethargy.

Second, I think there's been little uptake because Python 3 is fundamentally unexciting. It doesn't have the super big ticket items people want, such as removal of the GIL or better performance (for which many are using PyPy). Instead it has many new libraries (whose need is largely filled by pip install), and small cleanups which many experienced Python developers just avoid by habit at this point. Certainly nothing that would make one stop their development for any length of time to upgrade, not when Python 2 seems like it's going to be here for a while.

So where does this leave us?

Not a happy place. First and foremost I think a lot of us need to be more realistic about the state of Python 3. Particularly the fact that for the last few years, for the average developer Python, the language, has not gotten better.

The divergent paths of Python 2 and Python 3 have been bad for our community. We need to bring them back together.

Here's an idea: let's release a Python 2.8 which backports every new feature from Python 3. It will also deprecate anything which can't be changed in a backwards compatible fashion, for example str + unicode will emit a warning, as will any file which doesn't have from __future__ import unicode_literals. Users need to be able to follow a continuous process for their upgrades, Python 3 broke it, let's fix it.

That's my only idea. We need more ideas. We need to bridge this gap, because with every Python 3 release, it grows wider.

Thanks to Maciej Fijalkowski and several others for their reviews, it goes without saying that all remaining errors are my own.

You can find the rest here. There are view comments.

Security process for Open Source Projects

Posted October 19th, 2013. Tagged with django, python, open-source, community.

This post is intended to describe how open source projects should handle security vulnerabilities. This process is largely inspired by my involvement in the Django project, whose process is in turn largely drawn from the PostgreSQL project's process. For every recommendation I make I'll try to explain why I've made it, and how it serves to protect you and your users. This is largely tailored at large, high impact, projects, but you should able to apply it to any of your projects.

Why do you care?

Security vulnerabilities put your users, and often, in turn, their users at risk. As an author and distributor of software, you have a responsibility to your users to handle security releases in a way most likely to help them avoid being exploited.

Finding out you have a vulnerability

The first thing you need to do is make sure people can report security issues to you in a responsible way. This starts with having a page in your documentation (or on your website) which clearly describes an email address people can report security issues to. It should also include a PGP key fingerprint which reporters can use to encrypt their reports (this ensures that if the email goes to the wrong recipient, that they will be unable to read it).

You also need to describe what happens when someone emails that address. It should look something like this:

  1. You will respond promptly to any reports to that address, this means within 48 hours. This response should confirm that you received the issue, and ideally whether you've been able to verify the issue or more information is needed.
  2. Assuming you're able to reproduce the issue, now you need to figure out the fix. This is the part with a computer and programming.
  3. You should keep in regular contact with the reporter to update them on the status of the issue if it's taking time to resolve for any reason.
  4. Now you need to inform the reporter of your fix and the timeline (more on this later).

Timeline of events

From the moment you get the initial report, you're on the clock. Your goal is to have a new release issued within 2-weeks of getting the report email. Absolutely nothing that occurs until the final step is public. Here are the things that need to happen:

  1. Develop the fix and let the reporter know.
  2. You need to obtain a CVE (Common Vulnerabilities and Exposures) number. This is a standardized number which identifies vulnerabilities in packages. There's a section below on how this works.
  3. If you have downstream packagers (such as Linux distributions) you need to reach out to their security contact and let them know about the issue, all the major distros have contact processes for this. (Usually you want to give them a week of lead time).
  4. If you have large, high visibility, users you probably want a process for pre-notifying them. I'm not going to go into this, but you can read about how Django handles this in our documentation.
  5. You issue a release, and publicize the heck out of it.

Obtaining a CVE

In short, follow these instructions from Red Hat.

What goes in the release announcement

Your release announcement needs to have several things:

  1. A precise and complete description of the issue.
  2. The CVE number
  3. Actual releases using whatever channel is appropriate for your project (e.g. PyPI, RubyGems, CPAN, etc.)
  4. Raw patches against all support releases (these are in addition to the release, some of your users will have modified the software, and they need to be able to apply the patches easily too).
  5. Credit to the reporter who discovered the issue.

Why complete disclosure?

I've recommended that you completely disclose what the issue was. Why is that? A lot of people's first instinct is to want to keep that information secret, to give your users time to upgrade before the bad guys figure it out and start exploiting it.

Unfortunately it doesn't work like that in the real world. In practice, not disclosing gives more power to attackers and hurts your users. Dedicated attackers will look at your release and the diff and figure out what the exploit is, but your average users won't be able to. Even embedding the fix into a larger release with many other things doesn't mask this information.

In the case of yesterday's Node.JS release, which did not practice complete disclosure, and did put the fix in a larger patch, this did not prevent interested individuals from finding out the attack, it took me about five minutes to do so, and any serious individual could have done it much faster.

The first step for users in responding to a security release in something they use is to assess exposure and impact. Exposure means "Am I affected and how?", impact means "What is the result of being affected?". Denying users a complete description of the issue strips them of the ability to answer these questions.

What happens if there's a zero-day?

A zero-day is when an exploit is publicly available before a project has any chance to reply to it. Sometimes this happens maliciously (e.g. a black-hat starts using the exploit against your users) and sometimes it is accidentally (e.g. a user reports a security issue to your mailing list, instead of the security contact). Either way, when this happens, everything goes to hell in a handbasket.

When a zero-day happens basically everything happens in 16x fast-forward. You need to immediately begin preparing a patch and issuing a release. You should be aiming to issue a release on the same day as the issue is made public.

Unfortunately there's no secret to managing zero-days. They're quite simply a race between people who might exploit the issue, and you to issue a release and inform your users.

Conclusion

Your responsibility as a package author or maintainer is to protect your users. The name of the game is keeping your users informed and able to judge their own security, and making sure they have that information before the bad guys do.

You can find the rest here. There are view comments.

Effective Code Review

Posted September 26th, 2013. Tagged with openstack, python, community, django, open-source.

Maybe you practice code review, either as a part of your open source project or as a part of your team at work, maybe you don't yet. But if you're working on a software project with more than one person it is, in my view, a necessary piece of a healthy workflow. The purpose of this piece is to try to convince you its valuable, and show you how to do it effectively.

This is based on my experience doing code review both as a part of my job at several different companies, as well as in various open source projects.

What

It seems only seems fair that before I try to convince you to make code review an integral part of your workflow, I precisely define what it is.

Code review is the process of having another human being read over a diff. It's exactly like what you might do to review someone's blog post or essay, except it's applied to code. It's important to note that code review is about code. Code review doesn't mean an architecture review, a system design review, or anything like that.

Why

Why should you do code review? It's got a few benefits:

  • It raises the bus factor. By forcing someone else to have the familiarity to review a piece of code you guarantee that at least two people understand it.
  • It ensures readability. By getting someone else to provide feedback based on reading, rather than writing, the code you verify that the code is readable, and give an opportunity for someone with fresh eyes to suggest improvements.
  • It catches bugs. By getting more eyes on a piece of code, you increase the chances that someone will notice a bug before it manifests itself in production. This is in keeping with Eric Raymond's maxim that, "given enough eyeballs, all bugs are shallow".
  • It encourages a healthy engineering culture. Feedback is important for engineers to grow in their jobs. By having a culture of "everyone's code gets reviewed" you promote a culture of positive, constructive feedback. In teams without review processes, or where reviews are infrequent, code review tends to be a tool for criticism, rather than learning and growth.

How

So now that I've, hopefully, convinced you to make code review a part of your workflow how do you put it into practice?

First, a few ground rules:

  • Don't use humans to check for things a machine can. This means that code review isn't a process of running your tests, or looking for style guide violations. Get a CI server to check for those, and have it run automatically. This is for two reasons: first, if a human has to do it, they'll do it wrong (this is true of everything), second, people respond to certain types of reviews better when they come from a machine. If I leave the review "this line is longer than our style guide suggests" I'm nitpicking and being a pain in the ass, if a computer leaves that review, it's just doing it's job.
  • Everybody gets code reviewed. Code review isn't something senior engineers do to junior engineers, it's something everyone participates in. Code review can be a great equalizer, senior engineers shouldn't have special privledges, and their code certainly isn't above the review of others.
  • Do pre-commit code review. Some teams do post-commit code review, where a change is reviewed after it's already pushed to master. This is a bad idea. Reviewing a commit after it's already been landed promotes a feeling of inevitability or fait accompli, reviewers tend to focus less on small details (even when they're important!) because they don't want to be seen as causing problems after a change is landed.
  • All patches get code reviewed. Code review applies to all changes for the same reasons as you run your tests for all changes. People are really bad at guessing the implications of "small patches" (there's a near 100% rate of me breaking the build on change that are "so small, I don't need to run the tests"). It also encourages you to have a system that makes code review easy, you're going to be using it a lot! Finally, having a strict "everything gets code reviewed" policy helps you avoid arguments about just how small is a small patch.

So how do you start? First, get yourself a system. Phabricator, Github's pull requests, and Gerrit are the three systems I've used, any of them will work fine. The major benefit of having a tool (over just mailing patches around) is that it'll keep track of the history of reviews, and will let you easily do commenting on a line-by-line basis.

You can either have patch authors land their changes once they're approved, or you can have the reviewer merge a change once it's approved. Either system works fine.

As a patch author

Patch authors only have a few responsibilities (besides writing the patch itself!).

First, they need to express what the patch does, and why, clearly.

Second, they need to keep their changes small. Studies have shown that beyond 200-400 lines of diff, patch review efficacy trails off [1]. You want to keep your patches small so they can be effectively reviewed.

It's also important to remember that code review is a collaborative feedback process if you disagree with a review note you should start a conversation about it, don't just ignore it, or implement it even though you disagree.

As a review

As a patch reviewer, you're going to be looking for a few things, I recommend reviewing for these attributes in this order:

  • Intent - What change is the patch author trying to make, is the bug they're fixing really a bug? Is the feature they're adding one we want?
  • Architecture - Are they making the change in the right place? Did they change the HTML when really the CSS was busted?
  • Implementation - Does the patch do what it says? Is it possibly introducing new bugs? Does it have documentation and tests? This is the nitty-gritty of code review.
  • Grammar - The little things. Does this variable need a better name? Should that be a keyword argument?

You're going to want to start at intent and work your way down. The reason for this is that if you start giving feedback on variable names, and other small details (which are the easiest to notice), you're going to be less likely to notice that the entire patch is in the wrong place! Or that you didn't want the patch in the first place!

Doing reviews on concepts and architecture is harder than reviewing individual lines of code, that's why it's important to force yourself to start there.

There are three different types of review elements:

  • TODOs: These are things which must be addressed before the patch can be landed; for example a bug in the code, or a regression.
  • Questions: These are things which must be addressed, but don't necessarily require any changes; for example, "Doesn't this class already exist in the stdlib?"
  • Suggestions for follow up: Sometimes you'll want to suggest a change, but it's big, or not strictly related to the current patch, and can be done separately. You should still mention these as a part of a review in case the author wants to adjust anything as a result.

It's important to note which type of feedback each comment you leave is (if it's not already obvious).

Conclusion

Code review is an important part of a healthy engineering culture and workflow. Hopefully, this post has given you an idea of either how to implement it for your team, or how to improve your existing workflow.

[1]http://www.ibm.com/developerworks/rational/library/11-proven-practices-for-peer-review/

You can find the rest here. There are view comments.

Doing a release is too hard

Posted September 17th, 2013. Tagged with openstack, django, python, open-source.

I just shipped a new release of alchimia. Here are the steps I went through:

  • Manually edit version numbers in setup.py and docs/conf.py. In theory I could probably centralize this, but then I'd still have a place I need to update manually.
  • Issue a git tag (actually I forgot to do that on this project, oops).
  • python setup.py register sdist upload -s to build and upload some tarballs to PyPi
  • python setup.py register bdist_wheel upload -s to build and upload some wheels to PyPi
  • Bump the version again for the now pre-release status (I never remeber to do this)

Here's how it works for OpenStack projects:

  • git tag VERSION -s (-s) makes it be a GPG signed tag)
  • git push gerrit VERSION this sends the tag to gerrit for review

Once the tag is approved in the code review system, a release will automatically be issue including:

  • Uploading to PyPi
  • Uploading documentation
  • Landing the tag in the official repository

Version numbers are always automatically handled correctly.

This is how it should be. We need to bring this level of automation to all projects.

You can find the rest here. There are view comments.

You guys know who Philo Farnsworth was?

Posted September 15th, 2013. Tagged with django, python, open-source, community.

Friends of mine will know I'm a very big fan of the TV show Sports Night (really any of Aaron Sorkin's writing, but Sports Night in particular). Before you read anything I have to say, take a couple of minutes and watch this clip:

I doubt Sorkin knew it when he scripted this (I doubt he knows it now either), but this piece is about how Open Source happens (to be honest, I doubt he knows what Open Source Software is).

This short clip actually makes two profound observations about open source.

First, most contribution are not big things. They're not adding huge new features, they're not rearchitecting the whole system to address some limitation, they're not even fixing a super annoying bug that affects every single user. Nope, most of them are adding a missing sentence to the docs, fixing a bug in a wacky edge case, or adding a tiny hook so the software is a bit more flexible. And this is fantastic.

The common wisdom says that the thing open source is really bad at is polish. My experience has been the opposite, no one is better at finding increasingly edge case bugs than open source users. And no one is better at fixing edge case bugs than open source contributors (who overlap very nicely with open source users).

The second lesson in that clip is about how to be an effective contributor. Specifically that one of the keys to getting involved effectively is for other people to recognize that you know how to do things (this is an empirical observation, not a claim of how things ought to be). How can you do that?

  • Write good bug reports. Don't just say "it doesn't work", if you've been a programmer for any length of time, you know this isn't a useful bug report. What doesn't work? Show us the traceback, or otherwise unexpected behavior, include a test case or instructions for reproduction.
  • Don't skimp on the details. When you're writing a patch, make sure you include docs, tests, and follow the style guide, don't just throw up the laziest work possible. Attention to detail (or lack thereof) communicates very clearly to someone reviewing your work.
  • Start a dialogue. Before you send that 2,000 line patch with that big new feature, check in on the mailing list. Make sure you're working in a way that's compatible with where the project is headed, give people a chance to give you some feedback on the new APIs you're introducing.

This all works in reverse too, projects need to treat contributors with respect, and show them that the project is worth their time:

  • Follow community standards. In Python this means things like PEP8, having a working setup.py, and using Sphinx for documentation.
  • Have passing tests. Nothing throws me for a loop worse than when I checkout a project to contribute and the tests don't pass.
  • Automate things. Things like running your tests, linters, even state changes in the ticket tracker should all be automated. The alternative is making human beings manually do a bunch of "machine work", which will often be forgotten, leading to a sub-par experience for everyone.

Remember, Soylent Green Open Source is people

That's it, the blog post's over.

You can find the rest here. There are view comments.

Your project doesn't mean your playground

Posted September 8th, 2013. Tagged with community, django, python.

Having your own open source project is awesome. You get to build a thing you like, obviously. But you also get to have your own little playground, a chance to use your favorite tools: your favorite VCS, your favorite test framework, your favorite issue tracker, and so on.

And if the point of your project is to share a thing you're having fun with with the world, that's great, and that's probably all there is to the story (you may stop reading here). But if you're interested in growing a legion of contributors to build your small side project into an amazing thing, you need to forget about all of that and remember these words: Your contributors are more important than you.

Your preferences aren't that important: This means you probably shouldn't use bzr if everyone else is using git. You shouldn't use your own home grown documentation system when everyone is using Sphinx. Your playground is a tiny thing in the giant playground that is the Python (or whatever) community. And every unfamiliar thing a person needs to familiarize themselves with to contribute to your project is another barrier to entry, and another N% of potential contributors who won't actually materialize.

I'm extremely critical of the growing culture of "Github is open source", I think it's ignorant, shortsighted, and runs counter to innovation. But if your primary interest is "having more contributors", you'd be foolish to ignore the benefits of having your project on Github. It's where people are. It has tools that are better than almost anything else you'll potentially use. And most importantly it implies a workflow and toolset with which a huge number of people are familiar.

A successful open source project outgrows the preferences of its creators. It's important to prepare for that by remembering that (if you want contributors) your workflow preferences must always be subservient to those of your community.

You can find the rest here. There are view comments.

Why I support diversity

Posted August 28th, 2013. Tagged with python, diversity, community, programming, django.

I get asked from time to time why I care about diversity in the communities I'm a part of, particularly the Django, Python, and the broader software development and open source community.

There's a lot of good answers. The simplest one, and the one I imagine just about everyone can get behind: diverse groups perform better at creative tasks. A group composed of people from different backgrounds will do better work than a homogeneous group.

But that's not the main reason I care. I care because anyone who knows how to read some statistics knows that it's ridiculous that I'm where I am today. I have a very comfortable job and life, many great friends, and the opportunity to travel and to spend my time on the things I care about. And that's obscenely anomalous for a high school dropout like me.

All of that opportunity is because when I showed up to some open source communities no one cared that I was a high school dropout, they just cared about the fact that I seemed to be interested, wanted to help, and wanted to learn. I particularly benefited from the stereotype of white dropouts, which is considerably more charitable than (for example) the stereotype of African American dropouts.

Unfortunately, our communities aren't universally welcoming, aren't universally nice, and aren't universally thoughtful and caring. Not everyone has the same first experience I did. In particular people who don't look like me, aren't white males, disproportionately don't have this positive experience. But everyone ought to. (This is to say nothing of the fact that I had more access to computers at a younger age then most people.)

That's why I care. Because I benefited from so much, and many aren't able to.

This is why I support the Ada Initiative. I've had the opportunity to see their work up close twice. Once, as a participant in Ada Camp San Francisco's Allies Track. And a second time in getting their advice in writing the Code of Conduct for the Django community. They're doing fantastic work to support more diversity, and more welcoming communities.

Right now they're raising funds to support their operations for the next year, if you accord to, I hope you'll donate: http://supportada.org

You can find the rest here. There are view comments.

Your tests are not a benchmark

Posted July 15th, 2013. Tagged with python.

I get a lot of feedback about people's experiences with PyPy. And a lot of it is really great stuff for example, "We used to leave the simulation running over night, now we take a coffee break". We also get some less successful feedback, however quite a bit of that goes something like, "I ran our test suite under PyPy, not only was it not faster, it was slower!". Unfortunately, for the time being, this is really expected, we're working on improving it, but for now I'd like to explain why that is.

  • Test runs are short: Your test suite takes a few seconds, or a few minutes to run. Your program might run for hours, days, or even weeks. The JIT works by observing what code is run frequently and optimizing that, this takes a bit of time to get through the "observer phase", and during observation PyPy is really slow, once observation is done PyPy gets very very fast, but if your program exits too quickly, it'll never get there.
  • Test code isn't like real code: Your test suite is designed to try to execute each pieces of code in your application exactly once. Your real application repeats the same task over and over and over again. The JIT doesn't kick in until a piece of code has been run over 1000 times, so if you run it just a small handful of times, it won't be fast.
  • Test code really isn't like real code: Your test code probably does things like monkeypatch modules to mock things out. Monkeypatching a module will trigger a bit of deoptimization in PyPy. Your real code won't do this and so it will be fully optimized, but your test suite does hit the deoptimization and so it's slow.
  • Test code spends time where app code doesn't: Things like the setup/teardown functions for your tests tend to be things that are never run in your production app, but sometimes they're huge bottlenecks for your tests.
  • Test suites often have high variability in their runtimes: I don't have an explanation for why this is, but it's something observed over a large number of test suites. Bad statistics make for really bad benchmarks, which makes for bad decision making.

If you want to find out how fast PyPy (or any technology) is, sit down and write some benchmarks, I've got some advice on how to do that.

You can find the rest here. There are view comments.

Thoughts on OpenStack

Posted July 11th, 2013. Tagged with openstack, rackspace, python.

Since I joined Rackspace a little over a month ago, I've gotten involved with OpenStack, learning the APIs, getting involved in discussions, and contributing code. I wanted to write a bit about what some of my experiences have been, particularly with respect to the code and contribution process.

Architecture

Were I to have started to design a system similar to OpenStack, and particularly components like Swift (Object store), the first thing I would have done would be build a (or select an existing) general purpose distributed database. The OpenStack developers did not go in this direction, instead they built tools specific to each of their needs. It's not yet clear to me whether this is a better or worse direction, but it was one of the most striking things to me.

Contribution

Open Stack's contribution process is fantastic. Most open source projects have a group of individuals who are committers (often called "core developers" or similar) and the rest of the community contributes by sending patches, which these committers merge. Eventually members of the community become committers; this model is seen in projects like Django or CPython.

OpenStack flips this on its head. In OpenStack there are no committers. The only thing that commits is a Jenkins instance. Instead, they have "core reviewers". Essentially, to contribute to OpenStack, whether you're a long standing member or brand new, you upload your patch with a git-review script to their Gerrit instance. People who follow that project in Gerrit are notified, and Jenkins CI jobs are kicked off. People will review your patch, and once it has both passing tests and the necessary number of "+1" reviews from core reviewers, your patch is automatically merged.

This process of having no committers, only core reviewers, normalizes the contribution process. Uploading a patch is the same experience for me, a relative new comer, as it is for someone who's been working on the project from the beginning. We just have slightly different review experiences. It also puts an emphasis on code review, which I think is fantastic.

Code

OpenStack comprises a large amount of Python code, and I'm a very opinionated Python developer. For the most part the OpenStack code is of good quality, however there are a few issues I've run across:

  • Most projects monkeypatch __builtin__. Almost every OpenStack project monkeypatches this to add gettext as _. This makes code remarkably difficult to read (you can never tell where it came from), and fragile. If you import files in the wrong order, suddenly you get a NameError. I've been trying to work to remove these, and there seems to be some buy-in from the community on this.
  • Most projects use a lot of global state around configuration. I've spoke about why I dislike this approach before. As with Django, I don't have a good suggestion as to how to fix it incrementally.

Infrastructure vs. Application Services

A lot of OpenStack's code is around what I like to think of as "infrastructure". Things like spinning up VMs, storing disk images, taking snapshots, and handling authentication for all of this. When OpenStack started there was one application service, Swift, which does massively scalable object ("blob") storage. One of the most exciting developments in OpenStack, in my opinion, is the growth to include more application services, things that directly provide utilities to your application. These include:

  • Marconi: Queuing as a service, I'm working on a kombu backend so that if you're deployed in an OpenStack cloud with Marconi, having a job queue system will be a matter of a few seconds work with celery.
  • Trove: Relational databases as a service, spin up database instances, back them up, restore, and monitor.
  • Designate: DNS as a service.
  • Libra: Load balancer as a service.
  • Barbican: Secrets as a service, this will be able to manage things like your SECRET_KEY in Django, to avoid forcing you to put it on disk or in your project's source.

It's worth noting that many of these are still "StackForge" projects, which means it's not guarnteed that they'll become a part of OpenStack, nevertheless I think these are exciting developments.

Of particular value is that, because they're (of course) open source, you're spared some of the lock-in concerns that come from many "as a service" offerings.

Future involvement

Looking towards the future, I'm hoping to be involved with OpenStack primarily in three ways:

  • Making it run, idiotically fast, on PyPy. Right now this means I'm working on making sqlalchemy, which many Open Stack projects use, fast on PyPy.
  • Working on opentls, which is a pure python binding to OpenSSL. This also furthers the first goal of getting OpenStack running on PyPy, as well as hopefully contributing to the overall system security.
  • Getting a kombu backend for Marconi, so Open Stack users can have basically drop-in queuing with celery.

You can find the rest here. There are view comments.

Disambiguating BSON and msgpack

Posted February 16th, 2013. Tagged with python, programming.

I had a fairly fun project at work recently, so I thought I'd write about it. We currently store BSON blobs of data in many places. This is unfortunate, because BSON is bloated and slow (an array is internally stored as a dictionary mapping the strings "0", "1", "2", etc. to values). So we wanted to migrate to msgpack, which I've measured as requiring 46% of the space of BSON, and being significantly faster to deserialize (we aren't concerned with serialization speed, though I'm relatively confident that's faster as well).

The one trick we wanted to pull was to do the migration in place, that is gradually rewrite all the columns' data from BSON to msgpack. This is only possible if the data can be interpreted as one or the other unambiguously. So I was tasked if finding out if this was possible.

The first thing that's important to know about BSON is that the first 4-bytes are the length of the entire document (in bytes) as a signed integer, little endian. msgpack has no specific prefix, the first bytes are merely the typecode for whatever the element is. At Rdio, we know something about our data though, because BSON requires all top-level elements to be dictionaries, and we're just re-serializing the same data, we know that all of these msgpacks will have dictionaries as the top level object.

Because a BSON blob starts with its size, in bytes, we're going to try to find the smallest possible 4-byte starting sequence (interpreted as an integer) one of our payloads could have, in order to determine what the smallest possible ambiguity is.

So the first case is the empty dictionary, in msgpack this is serialized as:

>>> msgpack.packb({})
'\x80'

That's less than 4 bytes, and all BSONs are at least 4 bytes, so that can't be ambiguous. Now let's look at a dictionary with some content. Another thing we know about our payloads is that all the keys in the dictionaries are strings, and that the keys are alphanumeric or underscores. Looking at the msgpack spec, the smallest key (interpreted as its serialized integer value) that could exist is "0", since "0" has the lowest ASCII value of any letter, number, or underscore. Further, from the msgpack spec we know that the number 0 serializes as a single byte, so that will be the key's value. Let's see where this gets us:

>>> msgpack.packb({"0": 0})
'\x81\xa10\x00'

A 4 byte result, perfect, this is the smallest prefix we can generate, let's see how many bytes this would be:

>>> struct.unpack('<l', '\x81\xa10\x00')
(3187073,)

3187073 bytes, or a little over 3 MB. To be honest I'm not sure we have a key that starts with a number, let's try with the key "a":

>>> msgpack.packb({"a": 0})
'\x81\xa1a\x00'
>>> struct.unpack('<l', '\x81\xa1a\x00')
(6398337,)

A little over 6 MB. Since I know that none of the payloads we store are anywhere close to this large, we can safely store either serialization format, and be able to interpret the result unambiguously as one or the other.

So our final detection code looks like:

def deserialize(s):
    if len(s) >= 4 and struct.unpack('<l', s[:4])[0] == len(s):
        return BSON(s).decode()
    else:
        return msgpack.unpackb(s)

If this sounds like a fun kind of the thing to do, you should apply to come work with me at Rdio.

You can find the rest here. There are view comments.

The compiler rarely knows best

Posted July 12th, 2012. Tagged with pypy, python, response.

This is a response to http://pwang.wordpress.com/2012/07/11/does-the-compiler-know-best/ if you haven't read it yet, start there.

For lack of any other way to say it, I disagree with nearly every premise presented and conclusion derived in Peter's blog post. The post itself doesn't appear to have any coherent theme, besides that PyPy is not the future of Python, so I'll attempt to reply to Peter's statements more or less in order.

First, and perhaps most erroneously, he claims that "PyPy is an even more drastic change to the Python language than Python3". This is wrong. Complete and utterly. PyPy is in fact not a change to Python at all, PyPy faithfully implements the Python language as described by the Python language reference, and as verified by the test suite. Moreover, this is a statement that would apply equally to Jython and IronPython. It is pure, unadulterated FUD. Peter is trying to extend the definition of the Python language to things that it simple doesn't cover, such as the C-API and what he thinks the interpreter should look like (to be discussed more).

Second, he writes, "What is the core precept of PyPy? It’s that “the compiler knows best”." This too, is wrong. First, PyPy's central thesis is, "any task repeatedly performed manually will be done incorrectly", this is why we have things like automatic insertion of the garbage collector, in preference to CPython's "reference counting everywhere", and automatically generating the just in time compiler from the interpreter, in preference to Unladen Swallow's (and almost every other language's) manual construction of it. Second, the PyPy developers would never argue that the compiler knows best, as I alluded to in this post's title. That doesn't mean you should quit trying to write intelligent compilers, 1) the compiler often knows better than the user, just like with C, while it's possible to write better x86 assembler than GCC for specific functions, over the course of a large project GCC will always win, 2) they aren't mutually exclusive, having an intelligent compiler does not prohibit giving the user more control, in fact it's a necessity! There are no pure-python hints that you can give to CPython to improve performance, but these can easily be added with PyPy's JIT.

He follows this by saying that in contrast to PyPy's (nonexistent) principle of "compiler knows best" CPython's strength is that it can communicate with other platforms because its inner workings are simple. These three things have nothing to do with each other. CPython's interoperability with other platforms is a function of it's C-API. You can build an API like this on top of something monstrously complicated too, look at JNI for the JVM. (I don't accept that PyPy is so complex, but that's another post for another time.) In any event, the PyPy developers are deeply committed to interoperability with other platforms, which is why Armin and Maciej have been working on cffi: http://cffi.readthedocs.org/en/latest/index.html

The next paragraph is one of the most bizarre things I've ever read. He suggests that if you do want the free performance gains PyPy promises you should just build a a Python to JS compiler and use Node.js. I have to assume this paragraph is a joke not meant for publication, because it's nonsense. First, I've been told by the scientific Python community (of which Peter is a member) that any solution that isn't backwards compatible with a mountain of older platforms will never be adopted. So naturally his proposed solution is to throw away all existing work. Next, he implies that Google, Mozilla, Apple, and Microsoft are all collaborating on a single Javascript runtime which is untrue, in fact they each have their own VM. And V8, the one runtime specifically alluded to via Node.js, is not, as he writes, designed to be concurrent; Evan Phoenix, lead developer of Rubinius, comments, "It's probably the least concurrent runtime I've seen."

He then moves on to discussing the transparency of the levels involved in a runtime. Here I think he's 100% correct. Being able to understand how a VM is operating, what it's doing, what it's optimizing, how it's executing is enormously important. That's why I'm confused that he's positioning this as an argument against PyPy, as we've made transparency of our system incredibly important. We have the jitviewer, a tool which exposes the exact internal operations and machine code generated for everything PyPy compiles, which can be correlated to a individual line of Python code. We also have a set of hooks into the JIT to be able to programatically inspect what's happening, including writing your own, pure Python, optimization passes: http://pypy.readthedocs.org/en/latest/jit-hooks.html!

That's all I have. Hope you enjoyed.

You can find the rest here. There are view comments.

Why del defaultdict()[k] should raise an error

Posted November 28th, 2011. Tagged with python, programming.

Raymond Hettinger recently asked on twitter what people thought del defaultdict()[k] did for a k that didn't exist in the dict. There are two ways of thinking about this, one is, "it's a defaultdict, there's always a value at a key, so it can never raise a KeyError", the other is, "that only applies to reading a value, this should still raise an error". I initially spent several minutes considering which made more sense, but I eventually came around to the second view, I'm going to explain why.

The Zen of Python says, "Errors should never pass silently." Any Java programmer who's seen NullPointerException knows the result of passing around invalid data, rather than propagating an error. There are two cases for trying to delete a key which doesn't exist in a defaultdict. One is: "this algorithm happens to sometimes produce keys that aren't there, not an issue, ignore it", the other is "my algorithm has a bug, it should always produce valid keys". If you don't raise a KeyError the first case has a single line of nice code, if you do raise an error they have a boring try/ except KeyError thing going on, but no big loss. However, if an error isn't raised and your algorithm should never produce nonexistent keys, you'll be silently missing a large bug in your algorithm, which you'll have to hope to catch later.

The inconvenience of ignoring the KeyError to the programmer with the algorithm that produces nonexistent keys is out weighed by the potential for hiding a nasty bug in the algorithm of the programmer who's code should never produce these. Ignoring an exception is easy, trying to find the bug in your algorithm can be a pain in the ass.

You can find the rest here. There are view comments.

The run-time distinction

Posted October 11th, 2011. Tagged with programming, python, programming-languages.

At PyCodeConf I had a very interesting discussion with Nick Coghlan which helped me understand something that had long frustrated me with programming languages. Anyone who's ever taught a new programmer Java knows this, but perhaps hasn't understood it for what it is. This thing that I hadn't been appreciating was the distinction some programming languages make between the language that exists at compile time, and the language that exists at run-time.

Take a look at this piece of Java code:

class MyFirstProgram {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

Most people don't appreciate it, but you're really writing in two programming languages here, one of these languages has things like class and function declarations, and the other has executable statements (and yes, I realize Java has anonymous classes, they don't meaningfully provide anything I'm about to discuss).

Compare that with the (approximately) equivalent Python code:

def main():
    print "Hello World"

if __name__ == "__main__":
    main()

There's a very important thing to note here, we have executable statements at the top level, something that's simply impossible in Java, C, or C++. They make a distinction between the top level and your function's bodies. It follows from this that the function we've defined doesn't have special status by virtue of being at the top level, we could define a function or write a class in any scope. And this is important, because it gives us the ability to express things like decorators (both class and function).

Another example of this distinction that James Tauber pointed out to me is the import statement. In Python is it a line of executable code which invokes machinery in the VM to find a module and load it into the current namespace. In Java it is an indication to the compiler that a certain symbol is in scope, it's never executed.

Why would anyone care about this distinction though? It's clearly possibly to write programs in languages on both ends of the spectrum. It appears to me that the expressiveness of a programming language is really a description of what the distance between the compile time language and the runtime language is. Python stands on one end, with no distinction, whereas C/C++/Java stand on the other, with a grand canyon separating them.

But what about a language in the middle? Much of PyPy's code is written in a language named RPython. It has a fairly interesting property though, its run-time language is pretty close to Java in semantics, it's statically typed (though type inferenced), it's compile time language is Python. In practice this means you get many of the benefits in expressiveness as you do from using Python itself. For example you can write a decorator, or generate a class. A good example of this power is in PyPy's NumPy implementation. We're able to create the code for doing all the operations on different dtypes (NumPy's name for the different datatypes its arrays can represent) dynamically, without needing to resort to code generation or repeating ourselves, we're able to rely on Python as our compile time (or meta-programming) language. The "in-practice" result of this is that writing RPython feels much more like writing Python than it does like writing Java, even though most of your code is written under the same constraints (albeit without the need to explicitly write types).

The distinction between compile-time and run-time in programming languages results in both more pain for programmers, as well as more arcane structures to explain to new users. I believe new languages going forward should make it a goal to either minimize this difference (as Python does) or outfit languages with more powerful compile-time languages which give them the ability to express meta-programming constructs.

You can find the rest here. There are view comments.

So you want to write a fast Python?

Posted July 10th, 2011. Tagged with pypy, programming, python.

Thinking about writing your own Python implementation? Congrats, there are plenty out there [1], but perhaps you have something new to bring to the table. Writing a fast Python is a pretty hard task, and there's a lot of stuff you need to keep in mind, but if you're interested in forging ahead, keep reading!

First, you'll need to write yourself an interpreter. A static compiler for Python doesn't have enough information to do the right things [2] [3], and a multi-stage JIT compiler is probably more trouble than it's worth [4]. It doesn't need to be super fast, but it should be within 2x of CPython or so, or you'll have lost too much ground to make up later. You'll probably need to write yourself a garbage collector as well, it should probably be a nice, generational collector [5].

Next you'll need implementations for all the builtins. Be careful here! You need to be every bit as good as CPython's algorithms if you want to stand a chance, this means things like list.sort() keeping up with Timsort [6], str.__contains__ keeping up with fast search [7], and dict.__getitem__ keeping up with the extremely carefully optimized Python dict [8].

Now you've got the core language, take a bow, most people don't make it nearly this far! However, there's still tons of work to go, for example you need the standard library if you want people to actually use this thing. A lot of the stdlib is in Python, so you can just copy that, but some stuff isn't, for that you'll need to reimplement it yourself (you can "cheat" on a lot of stuff and just write it in Python though, rather than C, or whatever language your interpreter is written in).

At this point you should have yourself a complete Python that's basically a drop-in replacement for CPython, but that's a bit slower. Now it's time for the real work to begin. You need to write a Just in Time compiler, and it needs to be a good one. You'll need a great optimizer that can simultaneously understand some of the high level semantics of Python, as well as the low level nitty gritty of your CPU [9].

If you've gotten this far, you deserve a round of applause, not many projects make it this far. But your Python probably still isn't going to be used by the world, you may execute Python code 10x faster, but the Python community is more demanding than that. If you want people to really use this thing you're going to have to make sure their C extensions run. Sure, CPython's C-API was never designed to be run on other platforms, but you can make it work, even if it's not super fast, it might be enough for some people [10].

Finally, remember that standard library you wrote earlier? Did you make sure to take your time to optimize it? You're probably going to need to take a step back and do that now, sure it's huge, and people use every nook and cranny of it, but if you want to be faster, you need it to be faster too. It won't do to have your bz2 module be slower, tarnishing your beautiful speed results [11].

Still with me? Congratulations, you're in a class of your own. You've got a blazing fast Python, a nicely optimized standard library, and you can run anyone's code, Python or C. If this ballad sounds a little familiar, that's because it is, it's the story of PyPy. If you think this was a fun journey, you can join in. There are ways for Python programmers at every level to help us, such as:

  • Contributing to our performance analysis tool, this is actually a web app written using Flask.
  • Contribute to speed.pypy.org which is a Django site.
  • Provide pure Python versions of your C-extensions, to ensure they run on alternative Pythons.
  • Test and benchmark your code on PyPy, let us know if you think we should be faster! (We're always interested in slower code, and we consider it a bug)
  • Contribute to PyPy itself, we've got tons of things to do, you could work on the standard library, the JIT compiler, the GC, or anything in between.

Hope to see you soon [12]!

[1]CPython, IronPython, Jython, PyPy, at least!
[2]http://code.google.com/p/shedskin/
[3]http://cython.org/
[4]http://code.google.com/p/v8/
[5]http://docs.python.org/c-api/refcounting.html
[6]http://hg.python.org/cpython/file/2.7/Objects/listsort.txt
[7]http://effbot.org/zone/stringlib.htm
[8]http://hg.python.org/cpython/file/2.7/Objects/dictnotes.txt
[9]http://code.google.com/p/unladen-swallow/
[10]http://code.google.com/p/ironclad/
[11]https://bugs.pypy.org/issue733
[12]http://pypy.org/contact.html

You can find the rest here. There are view comments.

DjangoCon Europe 2011 Slides

Posted June 7th, 2011. Tagged with talk, djanogcon, pypy, python, django.

I gave a talk at DjangoCon Europe 2011 on using Django and PyPy (with another talk to be delivered tomorrow!), you can get the slides right on bitbucket, and for those who saw the talk, the PyPy compatibility wiki is here.

You can find the rest here. There are view comments.

This Summer

Posted May 6th, 2011. Tagged with pypy, python, self.

For the past nearly two years I've been an independent contractor working primarily for Eldarion, and it's been fantastic, I can't say enough good things. However, this summer I'm shaking things up a bit and heading west. I'll be an intern at Quora this summer. They're a Python shop, and it's going to be my job to make everything zippy. Specifically I'll be making the site run on PyPy, and then tuning the hell out of both their codebase and PyPy itself. I'm super excited about the chance to get a major site running on PyPy, and to contribute as many performance improvements upstream as possible.

You can find the rest here. There are view comments.

My experience with the computer language shootout

Posted April 3rd, 2011. Tagged with pypy, programming, python, programming-languages.

For a long time we, the PyPy developers, have known the Python implementations on the Computer Language Shootout were not optimal under PyPy, and in fact had been ruthlessly microoptimized for CPython, to the detriment of PyPy. But we didn't really care or do anything about it, because we figured those weren't really representative of what people like to do with Python, and who really cares what it says, CPython is over 30 times slower than C, and people use it just the same. However, I've recently have a number of discussions about language implementation speed and people tend to cite the language shootout as the definitive source for cross-language comparisons. So I decided to see what I could do about making it faster.

The first benchmark I took a stab at was reverse-complement, PyPy was doing crappily on it, and it was super obviously optimized for CPython: every loop possible was pushed down into functions known to be implemented in C, various memory allocation tricks are played (e.g. del some_list[:] removes the contents of the list, but doesn't deallocate the memory), and bound method allocation is pulled out of loops. The first one is the most important for PyPy, on PyPy your objective is generally to make sure your hot loops are in Python, the exact opposite of what you want on CPython. So I started coding up my own version, optimized for PyPy, I spent some time with our debugging and profiling tools, and whipped up a nice implementation that was something like 3x faster than the current one on PyPy, which you can see here. Generally the objective here was to make sure the program does as little memory allocation in the hot loops as possible, all of which are in Python. Try that with your average interpreter.

So I went ahead and submitted it, thinking PyPy would be looking 3 times better when I woke up. Naturally I wake up to an email from the shootout, which says that I should provide a Python 3 implementation, and that it doesn't work on CPython. What the hell? I try to run it myself and indeed it doesn't. It turns out on CPython sys.stdout.write(buffer(array.array("c"), 0, idx)) raises an exception. Which is a tad unfortunate because it should be an easy way to print out part of an array of characters without needing to allocate memory. After speaking with some CPython core developers, it appears that it is indeed a bug in CPython. And I noticed on PyPy buffer objects aren't nearly as efficient as they should be, so I set out in search of a new way to work on CPython and PyPy, and be faster if possible. I happened to stuble across the method array.buffer_info which returns a tuple of the memory address of the array's internal storage and its length, and a brilliant hack occurred to me: use ctypes to call libc's write() function. I coded it up, and indeed it worked on PyPy and CPython and was 40% faster on PyPy to boot. Fantastic I thought, I'll just submit this and PyPy will look rocking! Only 3.5x slower than C, not bad for an interpreter, in a language that is notoriously hard to optimize. You can see the implementation right here, it contains a few other performance tricks as well, but nothing too exciting.

So I submitted this, thinking, "Aha! I've done it". Shortly, I had an email saying this has been accepted as an "interesting alternative" because it used ctypes, which is to say it won't be included in the cumulative timings for each implementation, nor will it be listed with the normal implementations for the per-benchmark scores. Well crap, that's no good, the whole point of this was to look good, what's the point if no one is going to see this glorious work. So I sent a message asking why this implementation was considered alternative, since it appeared fairly legitimate. I received a confusing message questioning why this optimization was necessary, followed by a suggestion that perhaps PyPy wasn't compatible enough with (with what I dare not ask, but the answer obviously isn't Python the abstract language, since CPython had the bug!).

Overall it was a pretty crappy experience. The language shootout appears to be governed by arbitrary rules. For example the C implementations use GCC builtins, which are not part of the C standard, making them not implementation portable. The CPython pidigits version uses a C extension which is obviously not implementation portable (by comparison every major Python implementation includes ctypes, only CPython, and to varying extents IronPython and PyPy, support the CPython C-API), although here PyPy was allowed to use ctypes. It's also not possible to send any messages once your ticket has been marked as closed, meaning to dispute a decision you basically need to pray the maintainer reopens it for some reason. The full back and forth is available here. I'm still interested in improving the PyPy submissions there (and further optimizing PyPy where needed). However given the seemingly capricious and painful submission process I'm not really inclined to continue work, nor can I take the shootout seriously as an honest comparison of languages.

You can find the rest here. There are view comments.

PyPy San Francisco Tour Recap

Posted March 9th, 2011. Tagged with python, pypy.

Yesterday was the last day of my (our) tour of the Bay Area, so I figured I'd post a recap of all the cool stuff that happened.

Sprints

I got in on Friday night (technically Saturday morning, thanks for the ride Noah!) and on Saturday and Sunday we held sprints at Noisebridge. They have a very cool location and we had some pretty productive sprints. The turnout was less than expected, based on the people who said they'd show at the Python user group, however we were pretty productive. We fixed a number of remaining issues before we can do a Python 2.7 compatible release. These included: Armin Rigo and I implementing PYTHONIOENCODING (look it up!), Dan Roberts fixing the sqlite3 tests, and Armin fixing a subtle JIT bug. I also spent some time doing some performance profiling with Greg from Quora. Importing pytz was incredibly slow on PyPy, and it turned out the issue was when pytz was in a .egg its attempts to load the timezone files cause repeated reads from the ZIP file, and PyPy wasn't caching the zipfile metadata properly. So we fixed that and now it's much faster.

Google

Monday morning we (Armin Rigo, Maciej Fijalkowski, and myself) gave a tech talk at Google. The first thing I noticed was the Google campus is obscenely gorgeous, and large. Unfortunately, our talk didn't seem to go very well. Part of this was we were unsure if our audience was Python developers looking to make their code run faster, or compiler people who wanted to hear all the gory details of our internals. Even now I am still a little unsure about who showed up to our talk, but they didn't seem very enthused. I'm hoping we can construct some sort of post-mortem on the talk, because Google is precisely the type of company with a lot of Python code and performance aspirations that we think we can help with. (And Google's cafeterias are quite delicious).

Mozilla

After lunch at Google we shuffled over to Mozilla's offices for another talk. The slides we delivered were the same as the Google ones, however this talk went over much better. Our audience was primarily compiler people (who work on the Tracemonkey/Jaegermonkey/other Javascript engines), with a few Python developers who were interested. We had a ton of good questions during and after the talk, and then hung around for some great discussions with Brendan Eich, Andreas Gal, David Mandelin, and Chris Leary, kudos to all those guys. Some of the major topics were:

  • Whether or not we had issues of trace explosion (over specializing traces and thus compiling a huge number of paths that were in practice rarely executed), and why not.
  • The nature of the code we have to optimize, a good bit of real world Python is mostly idiomatic, whereas Mozilla really has to deal with a problem of making any old crap on the internet fast.
  • The fact that for Javascript script execution time is generally very short, whereas the Python applications we target tend to have longer execution times. While we still work to startup quickly (e.g. no 5 minute JVM startup times), it's less of an issue if it takes 30 seconds before the code starts running at top speed. For us this means that targeting browser Javascript is possibly less sensible than server-side Javascript for a possible Javascript VM written on top of our infrastructure.

Overall it was a very positive experience, and I'll be excited to speak with David some more at the VM summit at PyCon.

Recap

Overall the trip was a big success in my view. I believe both the Google and Mozilla talks were recorded, so once I know where they are online hopefully other people can enjoy the talks. Hopefully Armin will blog about some of the talks he gave before I got to the Bay Area. See you at PyCon!

You can find the rest here. There are view comments.

Django and Python 3 (Take 2)

Posted February 17th, 2011. Tagged with python3, django, python.

About 18 months ago I blogged about Django and Python 3, I gave the official roadmap. Not a lot has changed since then unfortunately, we've dropped 2.3 entirely, 2.4 is on it's way out, but there's no 3.X support in core. One fun thing has changed though: I get to help set the roadmap, which is a) cool, b) means the fact that I care about Python 3 counts for something. I'm not going to get into "Why Py3k", because that's been done to death, but it's a better language, that I'd rather program in, so the only question is how do we get there.

I posted a while ago on Hacker News that I foresaw Python 3 support by the end of this summer. I still think that's reasonable, but there's a tiny unspoken question there: how do we go from 0 to there. I have an answer! Drum roll please... it turns out, as a student, I tend to have a lot of free time over the summer, and Google has this lovely thing called Google Summer of Code, where they give students money to work on open source. So I'd like to make Python 3 support for Django a GSOC project this summer. With myself either mentoring or student-ing. I don't care which role I take, just that someone who's committed to the project takes it on. I think this task is eminently reasonable for the timeline, especially in light of the work done by Martin von Löwis which, even though it probably doesn't apply, lays a lot of the ground work and identifies a lot of the key issues (and features a lot of utilities which will probably be of use).

All that being said any statements here or elsewhere (especially one involving timelines) reflects my personal goals and beliefs, not necessarily any other Django core developers, and certainly not an official project position.

You can find the rest here. There are view comments.

PyCon 2011 is going to be Awesome

Posted January 21st, 2011. Tagged with python, pycon.

If you hang out with Pythonistas you've probably already heard, but if you haven't, come to PyCon! It's going to be awesome. Here are just a few reasons why:

You've got until January 25th to secure the early bird pricing for PyCon tickets, and it is expected to sell-out this year, so get registered!

You can find the rest here. There are view comments.

Announcing VCS Translator

Posted January 21st, 2011. Tagged with python, vcs, software, programming, django, open-source.

For the past month or so I've been using a combination of Google, Stackoverflow, and bugging people on IRC to muddle my way through using various VCS that I'm not very familiar with. And all too often my queries are of the form of "how do I do git foobar -q in mercurial?". A while ago I tweeted that someone should write a VCS translator website. Nobody else did, so when I woke up far too early today I decided I was going to get something online to solve this problem, today! About 6 hours later I tweeted the launch of VCS translator.

This is probably not even a minimum viable product. It doesn't handle a huge range of cases, or version control systems. However, it is open source and it provides a framework for answering these questions. If you're interested I'd encourage you to fork it on github and help me out in fixing some of the most requested translation (I remove them once they're implemented).

My future goals for this are to allow commenting, so users can explain the caveats of the translations (very infrequently are the translations one-to-one) and to add a proper API. Moreover my goal is to make this a useful tool to other programmers who, like myself, have far too many VCS in their lives.

You can find the rest here. There are view comments.

Getting the most out of tox

Posted December 17th, 2010. Tagged with testing, python, taggit, programming, django.

tox is a recent Python testing tool, by Holger Krekel. It's stated purpose is to make testing Python projects against multiple versions of Python (or different interpreters, like PyPy and Jython) much easier. However, it can be used for so much more. Yesterday I set it up for django-taggit, and it's an absolute dream, it automates testing against four different versions of Python, two different versions of Django, and it automates building the docs and checking for any warnings from Sphinx. I'll try to give a run through on what exactly you need to do to set this up with your project.

First create a tox.ini at the root of your project (i.e. in the same directory as your setup.py). Next create a [tox] section, and list out the enviroments you'd like to be tested (i.e. which Pythons):

[tox]
envlist =
    py25, py26 , py27, pypy

The enviroments we've listed out are a few of the ones included with tox, they point at specific versions of Python and use the default testing setup. Now add a [testenv] section which will tell tox how to actually run your tests:

[testenv]
commands =
    python setup.py test
deps =
    django==1.2.3

commands is the list of commands tox will run, and deps specifies any dependencies that are needed to run the tests (tox creates a virtualenv for each enviroment and doesn't include system wide site-packages, so you need to make sure you list everything needed by default here). If you want to use this same python setup.py test formulation you'll need to be using setuptools or distribute for your setup.py and provide the test_suite argument, Eric Holscher provides a good run down for how to do this for Django projects.

Now you should be able to just type tox into your command line and it will try to run your tests in each of the enviroments you specified. Hopefully they're all passing (future test runs will go faster, for the first run it has to install all the dependencies). The next thing you may want to do is get it setup to build your documentation. To do this create a [testenv:docs] section:

[testenv:docs]
changedir = docs
deps =
    sphinx
commands =
    sphinx-build -W -b html -d {envtmpdir}/doctrees . {envtmpdir}/html

This tells tox a few things. First changedir tells it that to run these commands it should cd into the docs/ directory (if you're docs live elsewhere, change as appropriate). Next it has sphinx as a dependency. Finally the commands invoke sphinx-build, -W makes warnings into errors (so you get an approrpiate failure status code), -b html uses the HTML builder, and the rest of the parameters tell Sphinx where the docs live and to put the output in the temporary directory that tox creates for each env.

Now all you need to do is add docs to the envlist, and a tox run will build your documentation.

The last thing you might want to do is set it up to test against multiple versions of a package (such as Django 1.1, Django 1.2, and trunk). To do this create another section whose name includes both the Python version and dependency version, e.g. [testenv:py25-trunk]. In it place:

[testenv:py25-trunk]
basepython = python2.5
deps =
    http://www.djangoproject.com/download/1.3-alpha-1/tarball/

This "inherits" from the default testenv, so it still has its commands, but we specify the basepython indicating this testenv is for python 2.5, and a different set of dependencies, here we're using the Django 1.3 alpha. You'll need to do a bit of copy-paste and create one of these for each version of Python you're testing against, and make sure to add each of these to the envlist.

At this point you should have a lean, mean, testing setup. With one command you can test your package with different dependencies, different pythons, and build your documentation. The tox documentation features tons of examples so you should use it as a reference.

You can find the rest here. There are view comments.

The continuous integration I want

Posted November 2nd, 2010. Tagged with testing, python, tests, django, open-source.

Testing is important, I've been a big advocate of writing tests for a while, however when you've got tests you need to run them. This is a big problem in open source, Django works on something like six versions of Python (2.4, 2.5, 2.6, 2.7, Jython, and PyPy), 4 databases (SQLite, PostgreSQL, MySQL, Oracle, plus the GIS backends, and external backends), and I don't even know how many operating systems (at least the various Linuxes, OS X, and Windows). If I tried to run the tests in all those configurations for every commit I'd go crazy. Reusable applications have it even worse, ideally they should be tested under all those configurations, with each version of Django they support. For a Django application that wants to work on Django 1.1, 1.2, and all of those interpreters, databases, and operating systems you've got over 100 configurations. Crazy. John Resig faced a similar problem with jQuery (5 major browsers, multiple versions, mobile and desktop, different OSs), and the result was Test Swarm (note that at this time it doesn't appear to be up), an automated way for people to volunteer their machines to run tests. We need something like that for Python.

It'd be nice if it were as simple as users pointing their browser at a URL, but that's not practical with Python: the environments we want to test in are more complex than what can be detected, we need to know what external services are available (databases, for example). My suggestion is users should maintain a config file (.ini perhaps) somewhere on their system, it would say what versions of Python are available, and what external services are available (and how they can be accessed, e.g. DB passwords). Then the user downloads a bootstrap script and runs it. This script sees what services the user has available on their machine and queries a central server to see what tests need to be run, given the configuration they have available. The script downloads a test, creates a virtualenv, and does whatever setup it needs to do (e.g. writing a Django settings.py file given the available DB configuration), and runs the tests. Finally it sends the test results back to the central server.

It's very much like a standard buildbot system, except any user can download the script and start running tests. There are a number of problems to be solved, how do you verify that a project's tests aren't malicious (only allow trusted tests to start), how do you verify that the test results are valid, how do you actually write the configuration for a test suite? However, if solved I think this could be an invaluable resource for the Python community. Have a reusable app you want tested? Sign it up for PonySwarm, add a post-commit hook, and users will automatically run the tests for it.

You can find the rest here. There are view comments.

Priorities

Posted October 24th, 2010. Tagged with programming, django, python, open-source.

When you work on something as large and multi-faceted as Django you need a way to prioritize what you work on, without a system how do I decide if I should work on a new feature for the template system, a bugfix in the ORM, a performance improvement to the localization features, or better docs for contrib.auth? There's tons of places to jump in and work on something in Django, and if you aren't a committer you'll eventually need one to commit your work to Django. So if you ever need me to commit something, here's how I prioritize my time on Django:

  1. Things I broke: If I broke a buildbot, or there's a ticket reported against something I committed this is my #1 priority. Though Django no longer has a policy of trunk generally being perfectly stable it's still a very good way to treat it, once it gets out of shape it's hard to get it back into good standing.
  2. Things I need for work: Strictly speaking these don't compete with the other items on this list, in that these happen on my work's time, rather than in my free time. However, practically speaking, this makes them a relatively high priority, since my work time is fixed, as opposed to free time for Django, which is rather elastic.
  3. Things that take me almost no time: These are mostly things like typos in the documentation, or really tiny bugfixes.
  4. Things I think are cool or important: These are either things I personally think are fun to work on, or are in high demand from the community.
  5. Other things brought to my attention: This is the most important category, I can only work on bugs or features that I know exist. Django's trac has about 2000 tickets, way too many for me to ever sift through in one sitting. Therefore, if you want me to take a look at a bug or a proposed patch it needs to be brought to my attention. Just pinging me on IRC is enough, if I have the time I'm almost always willing to take a look.

In actuality the vast majority of my time is spent in the bottom half of this list, it's pretty rare for the build to be broken, and even rarer for me to need something for work, however, there are tons of small things, and even more cool things to work on. An important thing to remember is that the best way to make something show up in category #3 is to have an awesome patch with tests and documentation, if all I need to do is git apply && git commit that saves me a ton of time.

You can find the rest here. There are view comments.

django-taggit 0.9 Released

Posted September 21st, 2010. Tagged with python, taggit, release, django, application.

It's been a long time coming, since taggit 0.8 was released in June, however 0.9 is finally here, and it brings a ton of cool bug fixes, improvements, and cleanups. If you don't already know, django-taggit is an application for django to make tagging simple and awesome. The biggest changes in this release are:

  • The addition of a bunch of locales.
  • Support for custom tag models.
  • Moving taggit.contrib.suggest into an external package.
  • Changed the filter syntax from filter(tags__in=["foo"]) to filter(tags__name__in=["foo"]). This change is backwards incompatible, but was necessary to support lookups on all fields.

You can checkout the docs for complete details. The goals for the 1.0 release are going to be adding some useful widgets for use in the admin and forms, hopefully adding a class based generic view to replace the current one, and adding a migration command to move data from django-tagging into the django-taggit models.

You can get the latest release on PyPi. Enjoy.

You can find the rest here. There are view comments.

PyOhio Slides

Posted August 2nd, 2010. Tagged with pypy, talk, unladenswallow, python, pyohio.

I've (finally) uploaded the slides from my PyOhio talk (I gave a similar talk at ChiPy). You can get them right here, I'll be putting all my slides on Scribd from here on out, they were much easier to upload to than SlideShare, plus HTML5 is awesome!

You can find the rest here. There are view comments.

Testing Utilities in Django

Posted July 6th, 2010. Tagged with testing, python, fixtures, tests, django.

Lately I've had the opportunity to do some test-driven development with Django, which a) is awesome, I love testing, and b) means I've been working up a box full of testing utilities, and I figured I'd share them.

Convenient get() and post() methods

If you've done testing of views with Django you probably have some tests that look like:

def test_my_view(self):
    response = self.client.get(reverse("my_url", kwargs={"pk": 1}))

    response = self.client.post(reverse("my_url", kwargs={"pk": 1}), {
        "key": "value",
    })

This was a tad too verbose for my tastes so I wrote:

def get(self, url_name, *args, **kwargs):
    return self.client.get(reverse(url_name, args=args, kwargs=kwargs))

def post(self, url_name, *args, **kwargs):
    data = kwargs.pop("data", None)
    return self.client.post(reverse(url_name, args=args, kwargs=kwargs), data)

Which are used:

def test_my_view(self):
    response = self.get("my_url", pk=1)

    response = self.post("my_url", pk=1, data={
        "key": "value",
    })

Much nicer.

login() wrapper

The next big issue I had was logging in and out of multiple users was too verbose. I often want to switch between users, either to check different permissions or to test some inter-user workflow. That was solved with a simple context manager:

class login(object):
    def __init__(self, testcase, user, password):
        self.testcase = testcase
        success = testcase.client.login(username=user, password=password)
        self.testcase.assertTrue(
            success,
            "login with username=%r, password=%r failed" % (user, password)
        )

    def __enter__(self):
        pass

    def __exit__(self, *args):
        self.testcase.client.logout()

def login(self, user, password):
    return login(self, user, password)

This is used:

def test_my_view(self):
    with self.login("username", "password"):
        response = self.get("my_url", pk=1)

Again, a lot better.

django-fixture-generator

Not quite a testing utility, but my app django-fixture-generator has made testing a lot easier for me. Fixtures are useful in getting data to work wit, but maintaining them is often a pain, you've got random scripts to generate them, or you just checkin some JSON to your repository with no way to regenerate it sanely (say if you add a new field to your model). django-fixture-generator gives you a clean way to manage the code for generating fixtures.

In general I've found context managers are a pretty awesome tool for writing clean, readable, succinct tests. I'm sure I'll have more utilities as I write more tests, hopefully someone finds these useful.

You can find the rest here. There are view comments.

MultiMethods for Python

Posted June 26th, 2010. Tagged with python, c++.

Every once and a while the topic of multimethods (also known as generic dispatch) comes up in the Python world (see here, and here, here too, and finally here, and probably others). For those of you who aren't familiar with the concept, the idea is that you declare a bunch of functions with the same name, but that take different arguments and the language routes your calls to that function to the correct implementation, based on what types you're calling it with. For example here's a C++ example:

#include <iostream>

void special(int k) {
    std::cout << "I AM THE ALLMIGHTY INTEGER " << k << std::endl;
}

void special(std::string k) {
    std::cout << "I AM THE ALLMIGHTY STRING " << k << std::endl;
}

int main() {
    special(42);
    special("magic");
    return 0;
}

As you can probably guess this will print out:

I AM THE ALLMIGHTY INTEGER 3
I AM THE ALLMIGHTY STRING magic

You, the insightful reader, are no doubt fuming in your seats now, "Alex, you idiot, Python functions don't have type signatures, how can we route our calls based on something that does not exist!", and right you are. However, don't tell me you've never written a function that looks like:

def my_magic_function(o):
    if isinstance(o, basestring):
        return my_magic_function(int(o))
    elif isinstance(o, (int, long)):
        return cache[o]
    else:
        return o

Or something like that, the point is you have one function that has a couple of different behaviors based on the type of it's parameter. Perhaps it'd be nice to separate each of those behaviors into their own function (or not, I don't really care what you do).

I was saying that a bunch of people have already implemented these, why am I? Mostly for fun (that's still a valid reason, right?), but also because a bunch of the implementations make me sad. Some of them use crazy hacks (reading up through stack frames), a few of them have global registrys, and all of them rely on the name of the function to identify a single "function" to be overloaded. However, they also all have one good thing in common: decorators, yay!

My implementation is pretty simple, so I'll present it, and it's test suite without explanation:

class MultiMethod(object):
    def __init__(self):
        self._implementations = {}

    def _get_predicate(self, o):
        if isinstance(o, type):
            return lambda x: isinstance(x, o)
        assert callable(o)
        return o

    def register(self, *args, **kwargs):
        def inner(f):
            key = (
                args,
                tuple(kwargs.items()),
            )
            if key in self._implementations:
                raise TypeError("Duplicate registration for %r" % key)
            self._implementations[key] = f
            return self
        return inner

    def __call__(self, *args, **kwargs):
        for spec, func in self._implementations.iteritems():
            arg_spec, kwarg_spec = spec
            kwarg_spec = dict(kwarg_spec)
            if len(args) != len(arg_spec) or set(kwargs) != set(kwarg_spec):
                continue
            if (all(self._get_predicate(spec)(arg) for spec, arg in zip(arg_spec, args)) and
                all(self._get_predicate(spec)(kwargs[k]) for k, spec in kwarg_spec.iteritems())):
                return func(*args, **kwargs)
        raise TypeError("No implementation with a spec matching: %r, %r" % (
            args, kwargs))

And the tests:

import unittest2 as unittest

from multimethod import MultiMethod


class MultiMethodTestCase(unittest.TestCase):
    def test_basic(self):
        items = MultiMethod()

        @items.register(list)
        def items(l):
            return l

        @items.register(dict)
        def items(d):
            return d.items()

        self.assertEqual(items([1, 2, 3]), [1, 2, 3])
        # TODO: dict ordering dependent, 1 item dict?
        self.assertEqual(items({"a": 1, "b": 2}), [("a", 1), ("b", 2)])

        with self.assertRaises(TypeError):
            items(xrange(3))

    def test_duplicate(self):
        m = MultiMethod()

        @m.register(list)
        def m(o):
            return o

        with self.assertRaises(TypeError):
            @m.register(list)
            def m(o):
                return o


if __name__ == "__main__":
    unittest.main()

Bon appétit.

You can find the rest here. There are view comments.

Hey, could someone write this app for me

Posted June 8th, 2010. Tagged with applications, testing, python, fixtures, reusable, django.

While doing some work today I realized that generating fixtures in Django is way too much of a pain in the ass, and I suspect it's a pain in the ass for a lot of other people as well. I also came up with an API I'd kind of like to see for it, unfortunately I don't really have the time to write the whole thing, however I'm hoping someone else does.

The key problem with writing fixtures is that you want to have a clean enviroment to generate them, and you need to be able to edit them in the future. In addition, I'd personally prefer to have my fixture generation specifically be imperative. I have an API I think I think solves all of these concerns.

Essentially, in every application you can have a fixture_gen.py file, which contains a bunch of functions that can generate fixtures:

from fixture_generator import fixture_generator

from my_app.models import Model1, Model2


@fixture_generator(Model1, Model2, requires=["my_app.other_dataset"])
def some_dataset():
    # Some objects get created here

@fixture_generator(Model1)
def other_dataset():
    # Some objects get created here

Basically you have a bunch of functions, each of which is responsible for creating some objects that will become a fixture. You then decorate them with a decorator that specifies what models need to be included in the fixture that results from them, and finally you can optionally specify dependencies (these are necessary because a dependency could use models which your fixture doesn't).

After you have these functions there's a management command which can be invoked to actually generate the fixtures:

$ ./manage.py generate_fixture my_app.some_dataset --format=json --indent=4

Which actually creates the clean database enviroment, handles the dependencies, calls the functions, and dumps the fixtures to stdout. Then you can redirect that stdout off to a file somewhere, for use in testing or whatever else people use fixtures for.

Hopefully someone else has this problem, and likes the API enough to build this. Failing that I'll try to make some time for it, but no promises when (aka if you want it you should probably build it).

You can find the rest here. There are view comments.

DjangoCon.eu slides

Posted May 24th, 2010. Tagged with python, nosql, django, orm, djangocon.

I just finished giving my talk at DjangoCon.eu on Django and NoSQL (also the topic of my Google Summer of Code project). You can get the slides over at slideshare. My slides from my lightning talk on django-templatetag-sugar are also up on slideshare.

You can find the rest here. There are view comments.

PyPy is the Future of Python

Posted May 15th, 2010. Tagged with python, pypy.

Currently the most common implementation of Python is known as CPython, and it's the version of Python you get at python.org, probably 99.9% of Python developers are using it. However, I think over the next couple of years we're going to see a move away from this towards PyPy, Python written in Python. This is going to happen because PyPy offers better speed, more flexibility, and is a better platform for Python's growth, and the most important thing is you can make this transition happen.

The first thing to consider: speed. PyPy is a lot faster than CPython for a lot of tasks, and they've got the benchmarks to prove it. There's room for improvement, but it's clear that for a lot of benchmarks PyPy screams, and it's not just number crunching (although PyPy is good at that too). Although Python performance might not be a bottleneck for a lot of us (especially us web developers who like to push performance down the stack to our database), would you say no to having your code run 2x faster?

The next factor is the flexibility. By writing their interpreter in RPython PyPy can automatically generate C code (like CPython), but also JVM and .NET versions of the interpreter. Instead of writing entirely separate Jython and IronPython implementations of Python, just automatically generate them from one shared codebase. PyPy can also have its binary generated with a stackless option, just like stackless Python, again no separate implementations to maintain. Lastly, PyPy's JIT is almost totally separate from the interpreter, this means changes to the language itself can be made without needing to update the JIT, contrast this with many JITs that need to statically define fast-paths for various operations.

And finally that it's a better platform for growth. The last point is a good example of this: one can keep the speed from the JIT while making changes to the language, you don't need to be an assembly expert to write a new bytecode, or play with the builtin types, the JIT generator takes care of it for you. Also, it's written in Python, it may be RPython which isn't as high level as regular Python, but compare the implementations of of map from CPython and PyPy:

static PyObject *
builtin_map(PyObject *self, PyObject *args)
{
    typedef struct {
        PyObject *it;           /* the iterator object */
        int saw_StopIteration;  /* bool:  did the iterator end? */
    } sequence;

    PyObject *func, *result;
    sequence *seqs = NULL, *sqp;
    Py_ssize_t n, len;
    register int i, j;

    n = PyTuple_Size(args);
    if (n < 2) {
        PyErr_SetString(PyExc_TypeError,
                        "map() requires at least two args");
        return NULL;
    }

    func = PyTuple_GetItem(args, 0);
    n--;

    if (func == Py_None) {
        if (PyErr_WarnPy3k("map(None, ...) not supported in 3.x; "
                           "use list(...)", 1) < 0)
            return NULL;
        if (n == 1) {
            /* map(None, S) is the same as list(S). */
            return PySequence_List(PyTuple_GetItem(args, 1));
        }
    }

    /* Get space for sequence descriptors.  Must NULL out the iterator
     * pointers so that jumping to Fail_2 later doesn't see trash.
     */
    if ((seqs = PyMem_NEW(sequence, n)) == NULL) {
        PyErr_NoMemory();
        return NULL;
    }
    for (i = 0; i < n; ++i) {
        seqs[i].it = (PyObject*)NULL;
        seqs[i].saw_StopIteration = 0;
    }

    /* Do a first pass to obtain iterators for the arguments, and set len
     * to the largest of their lengths.
     */
    len = 0;
    for (i = 0, sqp = seqs; i < n; ++i, ++sqp) {
        PyObject *curseq;
        Py_ssize_t curlen;

        /* Get iterator. */
        curseq = PyTuple_GetItem(args, i+1);
        sqp->it = PyObject_GetIter(curseq);
        if (sqp->it == NULL) {
            static char errmsg[] =
                "argument %d to map() must support iteration";
            char errbuf[sizeof(errmsg) + 25];
            PyOS_snprintf(errbuf, sizeof(errbuf), errmsg, i+2);
            PyErr_SetString(PyExc_TypeError, errbuf);
            goto Fail_2;
        }

        /* Update len. */
        curlen = _PyObject_LengthHint(curseq, 8);
        if (curlen > len)
            len = curlen;
    }

    /* Get space for the result list. */
    if ((result = (PyObject *) PyList_New(len)) == NULL)
        goto Fail_2;

    /* Iterate over the sequences until all have stopped. */
    for (i = 0; ; ++i) {
        PyObject *alist, *item=NULL, *value;
        int numactive = 0;

        if (func == Py_None && n == 1)
            alist = NULL;
        else if ((alist = PyTuple_New(n)) == NULL)
            goto Fail_1;

        for (j = 0, sqp = seqs; j < n; ++j, ++sqp) {
            if (sqp->saw_StopIteration) {
                Py_INCREF(Py_None);
                item = Py_None;
            }
            else {
                item = PyIter_Next(sqp->it);
                if (item)
                    ++numactive;
                else {
                    if (PyErr_Occurred()) {
                        Py_XDECREF(alist);
                        goto Fail_1;
                    }
                    Py_INCREF(Py_None);
                    item = Py_None;
                    sqp->saw_StopIteration = 1;
                }
            }
            if (alist)
                PyTuple_SET_ITEM(alist, j, item);
            else
                break;
        }

        if (!alist)
            alist = item;

        if (numactive == 0) {
            Py_DECREF(alist);
            break;
        }

        if (func == Py_None)
            value = alist;
        else {
            value = PyEval_CallObject(func, alist);
            Py_DECREF(alist);
            if (value == NULL)
                goto Fail_1;
        }
        if (i >= len) {
            int status = PyList_Append(result, value);
            Py_DECREF(value);
            if (status < 0)
                goto Fail_1;
        }
        else if (PyList_SetItem(result, i, value) < 0)
            goto Fail_1;
    }

    if (i < len && PyList_SetSlice(result, i, len, NULL) < 0)
        goto Fail_1;

    goto Succeed;

Fail_1:
    Py_DECREF(result);
Fail_2:
    result = NULL;
Succeed:
    assert(seqs);
    for (i = 0; i < n; ++i)
        Py_XDECREF(seqs[i].it);
    PyMem_DEL(seqs);
    return result;
}

That's a lot of code! It wouldn't be bad, for C code, except for the fact that there's far too much boilerplate: every single call into the C-API needs to check for an exception, and INCREF and DECREF calls are littered throughout the code. Compare this with PyPy's RPython implementation:

def map(space, w_func, collections_w):
    """does 3 separate things, hence this enormous docstring.
       1.  if function is None, return a list of tuples, each with one
           item from each collection.  If the collections have different
           lengths,  shorter ones are padded with None.

       2.  if function is not None, and there is only one collection,
           apply function to every item in the collection and return a
           list of the results.

       3.  if function is not None, and there are several collections,
           repeatedly call the function with one argument from each
           collection.  If the collections have different lengths,
           shorter ones are padded with None
    """
    if not collections_w:
        msg = "map() requires at least two arguments"
        raise OperationError(space.w_TypeError, space.wrap(msg))
    num_collections = len(collections_w)
    none_func = space.is_w(w_func, space.w_None)
    if none_func and num_collections == 1:
        return space.call_function(space.w_list, collections_w[0])
    result_w = []
    iterators_w = [space.iter(w_seq) for w_seq in collections_w]
    num_iterators = len(iterators_w)
    while True:
        cont = False
        args_w = [space.w_None] * num_iterators
        for i in range(len(iterators_w)):
            try:
                args_w[i] = space.next(iterators_w[i])
            except OperationError, e:
                if not e.match(space, space.w_StopIteration):
                    raise
            else:
                cont = True
        w_args = space.newtuple(args_w)
        if cont:
            if none_func:
                result_w.append(w_args)
            else:
                w_res = space.call(w_func, w_args)
                result_w.append(w_res)
        else:
            return space.newlist(result_w)
map.unwrap_spec = [ObjSpace, W_Root, "args_w"]

It's not exactly what you'd write for a pure Python implementation of map, but it's a hell of a lot closer than the C version.

The case for PyPy being the future is strong, I think, however it's not all sunshine are roses, there are a few issues. It lags behind CPython's version (right now Python 2.5 is implemented), C extension compatibility isn't there yet, and not enough people are trying it out yet. But PyPy is getting there, and you can help.

Right now the single biggest way to help for most people is to test their code. Any pure Python code targeting Python 2.5 should run perfectly under PyPy, and if it doesn't: it's a bug, if it's slower than Python: let us know (unless it involves re, we know it's slow). Maybe try out your C-extensions, however cpyext is very alpha and even a segfault isn't surprising (but let us know so we can investigate). Of course help on development is always appreciated, right now most of the effort is going into speeding up the JIT even more, however I believe there is also going to be work on moving up to Python 2.7 (currently pre-release) this summer. If you're interested in helping out with either you should hop into #pypy on irc.freenode.net, or send a message to pypy-dev. PyPy's doing good work, Python doesn't need to be slow, and we don't all need to write C code!

You can find the rest here. There are view comments.

A Tour of the django-taggit Internals

Posted May 9th, 2010. Tagged with taggit, django, python.

In a previous post I talked about a cool new customization API that django-taggit has. Now I'm going to dive into the internals.

The public API is almost exclusively exposed via a class named TaggableManager, you attach one of these to your model and it has some cool tagging APIs. This class basically masquerades as ManyToManyField, this is how it gets cool things like filtering and forms automatically. If you look at its definition you'll see it has a bunch of attributes that it never actually uses, basically all of these act to emulate the Field interface. This class is also the entry point for the new customization API, exposed via the through parameter. This basically acts as an analogue to the through parameter on actual ManyToManyFields (documented here). The final crucial method is __get__, which turns TaggableManager into a descriptor.

This descriptor exposes an _TaggableManager class, which holds some of the internal logic. This class exposes all of the "managery" type methods, add(), set(), remove(), and clear(). This class is pretty simple, basically it just proxies bewteen the methods called and it's through model. This class is, unlike TaggableManager, actually a subclass of models.Manager, it just defines get_query_set() to return a QuerySet of all the tags for that model, or instance, and then filtering, ordering, and more falls out naturally.

Beyond that there's not too much going on. The code is fairly simple, and it's not particularly long. I've found this to be a pretty good pattern for extensibility, and it really resolves the need to have dozens of parameters, or GenericForeignKeys popping out every which way.

You can find the rest here. There are view comments.

Cool New django-taggit API

Posted May 4th, 2010. Tagged with applications, django, python.

A little while ago I wrote about some of the issues with the reusable application paradigm in Django. Yesterday Carl Meyer pinged me about an issue in django-taggit, it uses an IntegerField for the GenericForeignKey, which is great. Except for when you have a model with a CharField, TextField, or anything else for a primary key. The easy solution is to change the GenericForeignKey to be something else. But that's lame, a pain in the ass, and a hack (more of a hack than a GenericForeignKey in the first place).

The alternate solution we came up with:

from django.db import models

from taggit.managers import TaggableManager
from taggit.models import TaggedItemBase


class TaggedFood(TaggedItemBase):
    content_object = models.ForeignKey('Food')

class Food(models.Model):
    # ... fields here

    tags = TaggableManager(through=TaggedFood)

Custom through models for the taggable relationship! This let's the included GenericForeignKey implementation cater to the common case of integer primary keys, and lets other people provide their own implementations when necessary. Plus it means doing things like, adding a ForeignKey to auth.User or adding the "originally" typed version of the tag (for systems where tags are normalized).

In addition I've finally added some docs, they aren't really complete, but they're a start. I'm planning a release for sometime next week, unless some major issue pops up.

You can find the rest here. There are view comments.

Making Django and PyPy Play Nice (Part 1)

Posted April 16th, 2010. Tagged with pypy, django, python.

If you track Django's commits aggressivly (ok so just me...), you may have noticed that there have been a number of commits to improve the compatibility of Django and PyPy in the last couple of days. In the run up to Django 1.0 there were a ton of commits to make sure Django didn't have too many assumptions that it was running on CPython, and that it could work with systems like Jython and PyPy. Unfortunately, since then, our support has laxed a little, and a number of tests have begun to fail. In the past couple of days I've been working to correct this. Here are some of the things that were wrong

The first issue I ran into was, in various tests, response.context and response.template being None, instead of the lists that were expected. This was a pain to diagnose, but the ultimate source of the bug is that Django registers a signal handler in the test client that listens for templates being rendered. However, it doesn't actually unregister that signal receiver. Instead it relies on the fact that signals are stored as weakrefs, and when the function ends, the receivers that were registered (which were local variables) would be automatically deallocated. On PyPy, Jython, and any other system with a garbage collector more advanced than CPython's reference counting, the local variables aren't guaranteed to be deallocated at the end of the function, and therefore the weakref can still be alive. Truthfully, I'm not 100% sure how this results in the next signal being sent not to store the appropriate data, but it does. The solution is the make sure that the signals are manually disconnected at the end of the run. This was fixed in r12964 of Django.

The next issue was actually a problem in PyPy, specifically it was crashing with a UnicodeDecodeError. When I say crashing I mean crashing in the C sense of the word, not the Python, nice exception and stack trace sense... sort of. PyPy is written in a language named RPython, RPython is Pythonesque (all valid RPython is valid Python), and has exceptions. However if they aren't caught they sort of propagate to the top, they give you a kind-of-ok stacktrace, but its function names are all the generated function names from the C source, not useful ones from the RPython source. Internally, PyPy uses an OperationError to keep track of exceptions at the interpreter level. A trick to debugging RPython is, if running the code on top of CPython works, than running it translated to C will work, and the contrapositive appears true as well, if the C doesn't work, running on CPython won't work. After trying to run the code on CPython, the location of the exception bubbled right to the top, and the fix followed easily.

These are the first two issues I fixed, a couple others have been fixed and committed, and a further few have also been fixed, but not committed yet. I'll be writing about those as I find the time.

You can find the rest here. There are view comments.

Towards Application Objects in Django

Posted March 28th, 2010. Tagged with applications, reusable, django, python.

Django's application paradigm (and the accompanying reusable application environment) have served it exceptionally well, however there are a few well known problems with it. Chief among these is pain in extendability (as exemplified by the User model), and abuse of GenericForeignKeys where a true ForeignKey would suffice (in the name of being generic), there are also smaller issues, such as wanting to install the same application multiple times, or having applications with the same "label" (in Django parlance this means the path.split(".")[-1]). Lately I've been thinking that the solution to these problems is a more holistic approach to application construction.

It's a little difficult to describe precisely what I'm thinking about, so I'll start with an example:

from django.contrib.auth import models as auth_models

class AuthApplication(Application):
    models = auth_models

    def login(self, request, template_name='registration/login.html'):
        pass

    # ... etc

And in settings.py:

from django.core import app

INSTALLED_APPS = [
    app("django.contrib.auth.AuthApplication", label="auth"),
]

The critical elements are that a) all models are referred to be the attribute on the class, so that they can be swapped out by a subclass, b) applications are now installed using an app object that wraps the app class, with a label (to allow multiple apps of the same name to be registered). But how does this allow swapping out the User model, from the perspective of people who are expecting to just be able to use django.contrib.auth.models.User for any purpose? Instead of explicit references to the model these could be replaced with: get_app("auth").models.User.

What about the issue of GenericForeignKeys? To solve these we'd really need something like C++'s templates, or Java's generics, but we'll settle for the next best thing, callables! Imagine a comment app where the models.py looked like:

from django.core import get_app
from django.db import models

def get_models(target_model):
    class Comment(models.Model):
        obj = models.ForeignKey(target_model)
        commenter = models.ForeignKey(get_app("auth").models.User)
        text = models.TextField()

    return [Comment]

Then instead of providing a module to be models on the application class this callable would be provided, and Django would know to call it with the appropriate model class based on either a class attribute (for subclasses) or a parameter from the app object (to allow for easily installing more than one of the comment app, for each object that should allow commenting), in practice I think allowing the same app to be installed multiple times would require some extra parameters to the get_models function, so that things like db_table can be adjusted appropriately.

I think this could be done in a backwards compatible manner, by having strings that are in INSTALLED_APPS automatically generate an app object that was the default "filler" one with just a models module, and the views ignoring self, and a default label. Like I said this is all just a set of ideas floating around my brain at this point, but hopefully by floating this design it'll get people thinking about big architecture ideas like this.

You can find the rest here. There are view comments.

Languages Don't Have Speeds, Or Do They?

Posted March 15th, 2010. Tagged with pypy, python, compiler, programming-languages, psyco.

Everyone knows languages don't have speeds, implementations do. Python isn't slow; CPython is. Javascript isn't fast; V8, Squirrelfish, and Tracemonkey are. But what if a language was designed in such a way that it appeared that it was impossible to be implemented efficiently? Would it be fair to say that language is slow, or would we still have to speak in terms of implementations? For a long time I followed the conventional wisdom, that languages didn't have speeds, but lately I've come to believe that we can learn something by thinking about what the limits on how fast a language could possibly be, given a perfect implementation.

For example consider the following Python function:

def f(n):
    i = 0
    while i < n:
        i += 1
        n += i
    return n

And the equivilant C function:

int f(int n) {
    int i = 0;
    while (i < n) {
        i += 1;
        n += i;
    }
    return n;
}

CPython probably runs this code 100 times slower than the GCC compiled version of the C code. But we all know CPython is slow right? PyPy or Psyco probably runs this code 2.5 times slower than the C version (I'm just spitballing here). Psyco and PyPy are, and contain, really good just in time compilers that can profile this code, see that f is always called with an integer, and therefore a much more optimized version can be generated in assembly. For example the optimized version could generate just a few add instructions in the inner loop (plus a few more instructions to check for overflow), this would skip all the indirection of calling the __add__ function on integers, allocating the result on the heap, and the indirection of calling the __lt__ function on integers, and maybe even some other things I missed.

But there's one thing no JIT for Python can do, no matter how brilliant. It can't skip the check if n is an integer, because it can't prove it always will be an integer, someone else could import this function and call it with strings, or some custom type, or anything they felt like, so the JIT must verify that n is an integer before running the optimized version. C doesn't have to do that. It knows that n will always be an integer, even if you do nothing but call f until the end of the earth, GCC can be 100% positive that n is an integer.

The absolute need for at least a few guards in the resulting assembly guarantees that even the most perfect JIT compiler ever, could not generate code that was strictly faster than the GCC version. Of course, a JIT has some advantages over a static compiler, for example, it can inline at dynamic call sites. However, in practice I don't believe this ability is ever likely to beat a static compiler for a real world program. On the other hand I'm not going to stop using Python any time soon, and it's going to continue to get faster, a lot faster.

You can find the rest here. There are view comments.

Committer Models of Unladen Swallow, PyPy, and Django

Posted February 25th, 2010. Tagged with pypy, python, unladen-swallow, django, open-source.

During this year's PyCon I became a committer on both PyPy and Unladen Swallow, in addition I've been a contributer to Django for quite a long time (as well as having commit privileges to my branch during the Google Summer of Code). One of the things I've observed is the very different models these projects have for granting commit privileges, and what the expectations and responsibilities are for committers.

Unladen Swallow

Unladen Swallow is a Google funded branch of CPython focused on speed. One of the things I've found is that the developers of this project carry over some of the development process from Google, specifically doing code review on every patch. All patches are posted to Rietveld, and reviewed, often by multiple people in the case of large patches, before being committed. Because there is a high level of review it is possible to grant commit privileges to people without requiring perfection in their patches, as long as they follow the review process the project is well insulated against a bad patch.

PyPy

PyPy is also an implementation of Python, however its development model is based largely around aggressive branching (I've never seen a project handle SVN's branching failures as well as PyPy) as well as sprints and pair programming. By branching aggressively PyPy avoids the overhead of reviewing every single patch, and instead only requires review when something is already believed to be "trunk-ready", further this model encourages experimentation (in the same way git's light weight branches do). PyPy's use of sprints and pair programming are two ways to avoid formal code reviews and instead approach code quality as more of a collaborative effort.

Django

Django is the project I've been involved with for the longest, and also the only one I don't have commit privileges on. Django is extremely conservative in giving out commit privileges (there about a dozen Django committers, and about 500 names in the AUTHORS file). Django's development model is based neither on branching (only changes as large in scope as multiple database support, or an admin UI refactor get their own branch) nor on code review (most commits are reviewed by no one besides the person who commits them). Django's committers maintain a level of autonomy that isn't seen in either of the other two projects. This fact comes from the period before Django 1.0 was released when Django's trunk was often used in production, and the need to keep it stable at all times, combined with the fact that Django has no paid developers who can guarantee time to do code review on patches. Therefore Django has maintained code quality by being extremely conservative in granting commit privileges and allowing developers with commit privileges to exercise their own judgment at all times.

Conclusion

Each of these projects uses different methods for maintaining code quality, and all seem to be successful in doing so. It's not clear whether there's any one model that's better than the others, or that any of these projects could work with another's model. Lastly, it's worth noting that all of these models are fairly orthogonal to the centralized VCS vs. DVCS debate which often surrounds such discussions.

You can find the rest here. There are view comments.

Thoughts on HipHop PHP

Posted February 2nd, 2010. Tagged with c++, python, unladen-swallow, compile, php, compiler, programming-languages.

This morning Facebook announced, as had been rumored for several weeks, a new, faster, implementation of PHP. To someone, like me, who loves dynamic languages and virtual machines this type of announcement is pretty exciting, after all, if they have some new techniques for optimizing dynamic languages they can almost certainly be ported to a language I care about. However, because of everything I've read (and learned about PHP, the language) since the announcement I'm not particularly excited about HipHop.

Firstly, there's the question of what problem HipHop solves. It aims to improve the CPU usage of PHP applications. For all practical purposes PHP exists exclusively to serve websites (yes, it can do other things, no one does them). Almost every single website on the internet is I/O bound, not CPU bound, web applications spend their time waiting on external resources (databases, memcache, other HTTP resources, etc.). So the part of me that develops websites professionally isn't super interested, Facebook is in the exceptionally rare circumstance that they've optimized their I/O to the point that optimizing CPU gives worthwhile returns. However, the part of me that spends his evenings contributing to Unladen Swallow and hanging around PyPy still thought that there might be some interesting VM technology to explore.

The next issue for consideration was the "VM" design Facebook choose. They've elected to compile PHP into C++, and then use a C++ compiler to get a binary out of it. This isn't a particularly new technique, in the Python world projects like Shedskin and Cython have exploited a similar technique to get good speed ups. However, Facebook also noted that in doing so they had dropped support for "some rarely used features — such as eval()". An important question is which features exactly, they had dropped support for. After all, the reason compiling a dynamic language to efficient machine code is difficult is because the dynamicism defeats the compiler's ability to optimize, but if you remove the dynamicism you remove the obstacles to efficient compilation. However, you're also not compiling the same language. PHP without eval(), and whatever else they've removed is quite simply a different language, for this reason I don't consider either Shedskin or Cython to be an implementation of Python, because they don't implement the entire language.

This afternoon, while I was idling in the Unladen Swallow IRC channel a discussion about HipHop came up, and I learned a few things about PHP I hadn't previous realized. The biggest of these is that a name bound to a function in PHP cannot be undefined, or redefined. If you've ever seen Collin Winter give a talk about Unladen Swallow, the canonical example of Python's dynamicism defeating a static compiler is the len() function. For lists, tuples, or dicts a call to len() should be able to be optimized to a single memory read out of a field on the object, plus a call to instantiate an integer object. However, in CPython today it's actually about 3 function calls, and 3 memory reads to get this data (plus the call to instantiate an integer object), plus the dictionary lookup in the global builtins to see what the object named len is. That's a hell of a lot more work than a single memory read (which is one instruction on an x86 CPU). The reason CPython needs to do all that work is that it a) doesn't know what the len object is, and b) when len is called it has no idea what its arguments will be.

As I've written about previously, Unladen Swallow has some creative ways to solve these problems, to avoid the dictionary lookups, and, eventually, to inline the body of the len() into the caller and optimize if for the types it's called with. However, this requires good runtime feedback, since the compiler simply cannot know statically what any of the objects will actually be at runtime. However, if len could be know to be the len() function at compile time Unladen Swallow could inline the body of the function, unconditionally, into the caller. Even with only static specialization for lists, dicts, and tuples like:

if isinstance(obj, list):
    return obj.ob_size
elif isinstance(obj, tuple):
    return obj.ob_size
elif isisinstance(obj, dict):
    return obj.ma_fill
else:
    return obj.__len__()

This would be quite a bit faster than the current amount of indirection. In PHP's case it's actually even easier, it only has one builtin array type, which acts as both a typical array as well as a hash table. Now extend this possibly optimization to not only every builtin, but every single function call. Instead of the dictionary lookups Python has to do for every global these can just become direct function calls.

Because of differences like this (and the fact that PHP only has machine sized integers, not arbitrary sized ones, and not implementing features of the language such as eval()) I believe that the work done on HipHop represents a fundamentally smaller challenge than that taken on by the teams working to improve the implementations of languages like Python, Ruby, or Javascript. At the time of the writing the HipHop source has not been released, however I am interested to see how they handle garbage collection.

You can find the rest here. There are view comments.

Dive into Python 3 Review

Posted January 12th, 2010. Tagged with review, python, book.

Disclosure: I received a free review copy of Dive into Python 3 from Apress.

Unlike a ton of people I know in the Python world, my experience in learning Python didn't include the original Dive into Python at all, in fact I didn't encounter it until quite a while later when I was teaching a friend Python and I was looking for example exercises. Since Dive into Python is really a book for people who don't know Python a lot of my views on it are based on how helpful I think it would have been while teaching my friend, since it's pretty difficult to imagine myself not knowing Python as I do.

The first thing to note is that Mark Pilgrim has an absolutely brilliant writing style, even when I was reading about stuff I already knew it was an absolute pleasure. The next thing to note is that this book is squarely targeted at people who are already programmers who want to learn Python, I don't think it would make a good "my first programming book". Dive into Python 3 jumps into Python full steam ahead, it dives into Python's datatypes, generators, unit testing, and interacting with the web.

The book is strongly example based, Mark does a great job of showing code and explaining it clearly. He also does a good job of emphasising best practices such as unit testing. It also covers some external libraries like httplib2, plus there's stuff on porting your existing libraries to Python 3, and a great appendix.

For all these reasons I think Dive into Python 3 makes a good introduction to Python 3. But don't take my word for it, Mark has made a point of releasing all of his books online, free of charge. So if you think you're in the target audience (or even if you aren't) check it out, it doesn't cost you a dime, which Mark goes above and beyond the call of duty to ensure.

You can find the rest here. There are view comments.

Hot Django on WSGI Action (announcing django-wsgi)

Posted January 11th, 2010. Tagged with release, django, python, wsgi.

For a long time it's been possible to deploy a Django project under a WSGI server, and in the run up to Django's 1.0 release a number of bugs were fixed to make Django's WSGI handler as compliant to the standard as possible. However, Django's support for interacting with the world of WSGI application, middleware, and frameworks has been less than stellar. However, I recently got inspired to improve this situation.

WSGI (Web Server Gateway Interface) is a specification for how applications and frameworks in Python can interface with a server. There are tons of servers that support the WSGI interface, most notably mod_wsgi (an Apache plugin), however there are tons of other ones, spawning, twisted, uwsgi, gunicorn, cherrypy, and probably dozens more.

The inspiration for improving Django's integration with the WSGI world was Ruby on Rails 3's improved support for Rack, Rack is the Ruby world's equivilant to WSGI. In Rails 3 every layer of the stack, the entire application, the dispatch, and individual controllers, is exposed as a Rack application. It occured to me that it would be pretty swell if we could do the same thing with Django, allow individual views and URLConfs to be exposed as WSGI application, and the reverse, allowing WSGI application to be deployed inside of Django application (via the standard URLConf mapping system). Another part of this inspiration was discussing gunicorn with Eric Florenzano, gunicorn is an awesome new WSGI server, inspired by Ruby's Unicorn, there's not enough space in this post to cover all the reasons it is awesome, but it is.

The end result of this is a new package, django-wsgi, which aims to bridge the gap between the WSGI world and the Django world. Here's an example of exposing a Django view as a WSGI application:

from django.http import HttpResponse

from django_wsgi import wsgi_application

def my_view(request):
    return HttpResponse("Hello world, I'm a WSGI application!")


application = wsgi_application(my_view)

And now you can point any WSGI server at this and it'll serve it up for you. You can do the same thign with a URLConf:

from django.conf.urls.defaults import patterns
from django.http import HttpResponse

from django_wsgi import wsgi_application

def hello_world(request):
    return HttpResponse("Hello world!")

def hello_you(request, name):
    return HttpResponse("Hello %s!" % name)


urls = patterns("",
    (r"^$", hello_world),
    (r"^(?P<name>\w+)/$", hello_you)
)

application = wsgi_application(urls)

Again all you need to do is point your server at this and it just works. However, the point of all this isn't just to make building single file applications easier (although this definitely does), the real win is that you can take a Django application and mount it inside of another WSGI application through whatever process it supports. Of course you can also go the other direction, mount a WSGI application inside of a Django URLconf:

from django.conf.urls.defaults import *

from django_wsgi import django_view

def my_wsgi_app(environ, start_response):
    start_response("200 OK", [("Content-type", "text/plain")])
    return ["Hello World!"]

urlpatterns = patterns("",
    # other views here
    url("^my_view/$", django_view(my_wsgi_app))
)

And that's all there is to it. Write your apps the way you want and deploy them, plug them in to each other, whatever. There's a lot of work being done in the Django world to play nicer with the rest of the Python ecosystem, and that's definitely a good thing. I'd also like to thank Armin Ronacher for helping me make sure this actually implements WSGI correctly. Please use this, fork it, send me hate mail, improve it, and enjoy it!

You can find the rest here. There are view comments.

You Built a Metaclass for *what*?

Posted November 30th, 2009. Tagged with c++, django, python, metaclass.

Recently I had a bit of an interesting problem, I needed to define a way to represent a C++ API in Python. So, I figured the best way to represent that was one class in Python for each class in C++, with a functions dictionary to track each of the methods on each class. Seems simple enough right, do something like this:

class String(object):
    functions = {
        "size": Function(Integer, []),
    }

We've got a String class with a functions dictionary that maps method names to Function objects. The Function constructor takes a return type and a list of arguments. Unfortunately we run into a problem when we want to do something like this:

class String(object):
    functions = {
        "size": Function(Integer, []),
        "append": Function(None, [String])
    }

If we try to run this code we're going to get a NameError, String isn't defined yet. Django models have a similar issue, with recursive foreign keys. Django's solution is to use the placeholder string "self", and have a metaclass translate it into the right class. Also having a slightly more declarative API might be nice, so something like this:

class String(DeclarativeObject):
    size = Function(Integer, [])
    append = Function(None, ["self"])

So now that we have a nice pretty API we need our metaclass to make it happen:

RECURSIVE_TYPE_CONSTANT = "self"

class DeclarativeObjectMetaclass(type):
    def __new__(cls, name, bases, attrs):
        functions = dict([(n, attr) for n, attr in attrs.iteritems()
            if isinstance(attr, Function)])
        for attr in functions:
            attrs.pop(attr)
        new_cls = super(DeclarativeObjectMetaclass, cls).__new__(cls, name, bases, attrs)
        new_cls.functions = {}
        for name, function in functions.iteritems():
            if function.return_type == RECURSIVE_TYPE_CONSTANT:
                function.return_type = new_cls
            for i, argument in enumerate(function.arguments):
                if argument == RECURSIVE_TYPE_CONSTANT:
                    function.arguments[i] = new_cls
            new_cls.functions[name] = function
        return new_cls

class DeclarativeObject(object):
    __metaclass__ = DeclarativeObjectMetaclass

And that's all their is to it. We take each of the functions on the class out of the attributes, create a normal class instance without the functions, and then we do the replacements on the function objects and stick them in a functions dictionary.

Simple patterns like this can be used to build beautiful APIs, as is seen in Django with the models and forms API.

You can find the rest here. There are view comments.

Getting Started with Testing in Django

Posted November 29th, 2009. Tagged with tests, django, python.

Following yesterday's post another hotly requested topic was testing in Django. Today I wanted to give a simple overview on how to get started writing tests for your Django applications. Since Django 1.1, Django has automatically provided a tests.py file when you create a new application, that's where we'll start.

For me the first thing I want to test with my applications is, "Do the views work?". This makes sense, the views are what the user sees, they need to at least be in a working state (200 OK response) before anything else can happen (business logic). So the most basic thing you can do to start testing is something like this:

from django.tests import TestCase
class MyTests(TestCase):
    def test_views(self):
        response = self.client.get("/my/url/")
        self.assertEqual(response.status_code, 200)

By just making sure you run this code before you commit something you've already eliminated a bunch of errors, syntax errors in your URLs or views, typos, forgotten imports, etc. The next thing I like to test is making sure that all the branches of my code are covered, the most common place my views have branches is in views that handle forms, one branch for GET and one for POST. So I'll write a test like this:

from django.tests import TestCase
class MyTests(TestCase):
    def test_forms(self):
        response = self.client.get("/my/form/")
        self.assertEqual(response.status_code, 200)

        response = self.client.post("/my/form/", {"data": "value"})
        self.assertEqual(response.status_code, 302) # Redirect on form success

        response = self.client.post("/my/form/", {})
        self.assertEqual(response.status_code, 200) # we get our page back with an error

Now I've tested both the GET and POST conditions on this view, as well the form is valid and form is invalid cases. With this strategy you can have a good base set of tests for any application with not a lot of work. The next step is setting up tests for your business logic. These are a little more complicated, you need to make sure models are created and edited in the right cases, emails are sent in the right places, etc. Django's testing documentation is a great place to read more on writing tests for your applications.

You can find the rest here. There are view comments.

Django and Python 3

Posted November 28th, 2009. Tagged with django, python.

Today I'm starting off doing some of the posts people want to see, and the number one item on that list is Django and Python 3. Python 3 has been out for about a year at this point, and so far Django hasn't really started to move towards it (at least at a first glance). However, Django has already begun the long process towards moving to Python 3, this post is going to recap exactly what Django's migration strategy is (most of this post is a recap of a message James Bennett sent to the django-developers mailing list after the 1.0 release, available here).

One of the most important things to recognize in this that though there are many developers using Django for smaller projects, or new projects that want to start these on Python 3, there are also a great many more with legacy (as if we can call recent deployments on Python2.6 and Django 1.1 legacy) deployments that they want to maintain and update. Further, Django's latest release, 1.1, has support for Python releases as old as 2.3, and a migration to Python 3 from 2.3 is nontrivial. However, it is significantly easier to make this migration from Python 2.6. This is the crux of James's plan, people want to move to Python 3.0 and moving towards Python 2.6 makes this easier for them and us. Therefore, since the 1.1 release Django has been removing support for one point version of Python per Django release. So, Django 1.1 will be the last release to support Python 2.3, 1.2 will be the last to support 2.4, etc. This plan isn't guaranteed, if there's a compelling reason to maintain support for a version for longer it will likely override this plan (for example if a particularly common deployment platform only offered Python 2.5 removing support for it might be delayed an additional release).

At the end of this process Django is going to end up only supporting Python 2.6. At this point (or maybe even before), a strategy will need to be devised for how to actually handle the switch. Some possibilities are, 1) having an official breakpoint, only one version is supported at a given time, 2) Python 3 support begins in a branch that tracks trunk and eventually it switches to become trunk once Python 3 is the more common deployment, 3) Python 2.6 and 3 are supported from a single codebase. I'm not sure which one of these is easiest, other projects such as PLY have chosen to go with option 3, however my inclination is that option 2 will be best for Django since issues like bytes vs. string are particularly prominent in Django (since it talks to so many external data sources).

For people who are interested Martin von Löwis actually put together a patch that, at the time, gave Django Python 3 support (at least enough to run the tutorial under SQLite). If you're very interested in Django on Python 3 the best path would probably be to bring that patch up to date (unless it's wildly out of date, I haven't checked), and starting to fix new things that have been introduced since the patch was written. This work isn't likely to get any official support, since maintaining Python 2.4 support and Python 3 would be far too difficult, however there's no reason you can't maintain the patch externally on something like Github or Bitbucket.

You can find the rest here. There are view comments.

Why Meta.using was removed

Posted November 27th, 2009. Tagged with python, models, django, orm, gsoc.

Recently Russell Keith-Magee and I decided that the Meta.using option needed to be removed from the multiple-db work on Django, and so we did. Yesterday someone tweeted that this change caught them off guard, so I wanted to provide a bit of explanation as to why we made that change.

The first thing to note is that Meta.using was very good for one specific use case, horizontal partitioning by model. Meta.using allowed you to tie a specific model to a specific database by default. This meant that if you wanted to do things like have users be in one db and votes in another this was basically trivial. Making this use case this simple was definitely a good thing.

The downside was that this solution was very poorly designed, particularly in light on Django's reusable application philosophy. Django emphasizes the reusability of application, and having the Meta.using option tied your partitioning logic to your models, it also meant that if you wanted to partition a reusable application onto another DB this easily the solution was to go in and edit the source for the reusable application. Because of this we had to go in search of a better solution.

The better solution we've come up with is having some sort of callback you can define that lets you decide what database each query should be executed on. This would let you do simple things like direct all queries on a given model to a specific database, as well as more complex sharding logic like sending queries to the right database depending on which primary key value the lookup is by. We haven't figured out the exact API for this, and as such this probably won't land in time for 1.2, however it's better to have the right solution that has to wait than to implement a bad API that would become deprecated in the very next release.

You can find the rest here. There are view comments.

Final Review of Python Essential Reference

Posted November 25th, 2009. Tagged with review, python, book.

Disclosure: I received a free review copy of the book.

Today I finished reading the Python Essential Reference and I wanted to share my final thoughts on the book. I'll start by saying I still agree with everything I wrote in my initial review, specifically that it's both a great resource as well as a good way to find out what you don't already know. Reading the second half of the book there were a few things that really exemplified this for me.

The first instance of this is the chapter on concurrency. I've done some concurrent programming with Python, but it's mostly been small scripts, a multiprocess and multithreaded web scraper for example, so I'm familiar with the basic APIs for threading and multiprocessing. However, this chapter goes into the full details, really covering the stuff you need to know if you want to build bigger applications that leverage these techniques. Things like shared data for processes or events and condition variables for threads and the kind of things that the book gives a good explanation of, as well as good examples of how to use them.

The other chapter that really stood out for me is the one on network programming and sockets. This chapter describes everything from the low-level select module up through through the included socket servers. The most valuable part is an example of how to build an asynchronous IO system. This example is about 2 pages long and it's a brilliant example of how to use the modules, how to make an asynchronous API feel natural, and what the tradeoffs of asynchronous versus concurrency are. In addition, in the wake of the "* in Unix" posts from a while ago I found the section on the socket module interesting as it's something I've never actually worked directly with.

The rest of the book is a handy reference, but for me these two chapters are the types of things that earns this a place on my bookshelf. The way Python Essential Reference balances depth with conciseness is excellent, it shows you the big picture for everything and gives you super details on the things that are really important. I just got my review copy of Dive into Python 3 today, so I look forward to giving a review of it in the coming days.

You can find the rest here. There are view comments.

Filing a Good Ticket

Posted November 24th, 2009. Tagged with django, python, software.

I read just about every single ticket that's filed in Django's trac, and at this point I'e gotten a pretty good sense of what (subjectively) makes a useful ticket. Specifically there are a few things that can make your ticket no better than spam, and a few that can instantly bump your ticket to the top of my "TODO" list. Hopefully, these will be helpful in both filing ticket's for Django as well as other open source projects.

  • Search for a ticket before filing a new one. Django's trac, for example, has at least 10 tickets describing "Decoupling urls in the tutorial, part 3". These have all been wontfixed (or closed as a duplicate of one of the others). Each time one of these is filed it takes time for someone to read through it, write up an appropriate closing message, and close it. Of course, the creator of the ticket also invested time in filing the ticket. Unfortunately, for both parties this is time that could be better spent doing just about anything else, as the ticket has been decisively dealt with plenty of times.
  • On a related note, please don't reopen a ticket that's been closed before. This one depends more on the policy of the project, in Django's case the policy is that once a ticket has been closed by a core developer the appropriate next step is to start a discussion on the development mailing list. Again this results in some wasted time for everyone, which sucks.
  • Read the contributing documentation. Not every project has something like this, but when a project does it's definitely the right starting point. It will hopefully contain useful general bits of knowledge (like what I'm trying to put here) as well as project specific details, what the processes are, how to dispute a decision, how to check the status of a patch, etc.
  • Provide a minimal test case. If I see a ticket who's description involves a 30 field model, it drops a few rungs on my TODO list. Large blocks of code like this take more time to wrap ones head around, and most of it will be superfluous. If I see just a few lines of code it takes way less time to understand, and it will be easier to spot the origin of the problem. As an extension to this if the test case comes in the form of a patch to Django's test suite it becomes even easier for a developer to dive into the problem.
  • Don't file a ticket advocating a major feature or sweeping change. Pretty much if it's going to require a discussion the right place to start is the mailing list. Trac is lousy at facilitating discussions, mailing lists are designed explicitly for that purpose. A discussion on the mailing list can more clearly outline what needs to happen, and it may turn out that several tickets are needed. For example filing a ticket saying, "Add CouchDB support to the ORM" is pretty useless, this requires a huge amount of underlying changes to make it even possible, and after that a database backend can live external to Django, so there's plenty of design decisions to go around.

These are some of the issues I've found to be most pressing while reviewing tickets for Django. I realize they are mostly in the "don't" category, but filing a good ticket can sometimes be as good as clearly stating what the problem is, and how to reproduce it.

You can find the rest here. There are view comments.

Using PLY for Parsing Without Using it for Lexing

Posted November 23rd, 2009. Tagged with parse, python, lex, ply, yacc.

Over the past week or so I've been struggling with attempting to write my own parser (or parser generator) by hand. A few days ago I finally decided to give up on this notion (after all the parser isn't my end goal) as it was draining my time from the interesting work to be done. However, I wanted to keep my existing lexer. I wrote the lexer by hand in the method I described in a previous post, it's fast, easy to read, and I rather like my handiwork, so I wanted to keep it if possible. I've used PLY before (as I described last year) so I set out to see if it would be possible to use it for parsing without using it for lexing.

As it turns out PLY expects only a very minimal interface from it's lexer. In fact it only needs one method, token(), which returns a new token (or None at the end). Tokens are expected to have just 4 attributes. Having this knowledge I now set out to write a pair of compatibility classes for my existing lexer and token classes, I wanted to do this without altering the lexer/token API so that if and when I finally write my own parser I don't have to remove legacy compatibility stuff. My compatibility classes are very small, just this:

class PLYCompatLexer(object):
    def __init__(self, text):
        self.text = text
        self.token_stream = Lexer(text).parse()

    def token(self):
        try:
            return PLYCompatToken(self.token_stream.next())
        except StopIteration:
            return None


class PLYCompatToken(object):
    def __init__(self, token):
        self.type = token.name
        self.value = token.value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        return "<Token: %r %r>" % (self.type, self.value)

This is the entirety of the API that PLY needs. Now I can write my parser exactly as I would normally with PLY.

You can find the rest here. There are view comments.

A Bit of Benchmarking

Posted November 22nd, 2009. Tagged with pypy, python, django, compiler, programming-languages.

PyPy recently posted some interesting benchmarks from the computer language shootout, and in my last post about Unladen Swallow I described a patch that would hopefully be landing soon. I decided it would be interesting to benchmarks something with this patch. For this I used James Tauber's Mandelbulb application, at both 100x100 and 200x200. I tested CPython, Unladen Swallow Trunk, Unladen Swallow Trunk with the patch, and a recent PyPy trunk (compiled with the JIT). My results were as follows:

VM 100 200
CPython 2.6.4 17s 64s
Unladen Swallow Trunk 16s 52s
Unladen swallow Trunk + Patch 13s 49s
PyPy Trunk 10s 46s

Interesting results. At 100x100 PyPy smokes everything else, and the patch shows a clear benefit for Unladen. However, at 200x200 both PyPy and the patch show diminishing returns. I'm not clear on why this is, but my guess is that something about the increased size causes a change in the parameters that makes the generated code less efficient for some reason.

It's important to note that Unladen Swallow has been far less focussed on numeric benchmarks than PyPy, instead focusing on more web app concerns (like template languages). I plan to benchmark some of these as time goes on, particularly after PyPy merges their "faster-raise" branch, which I'm told improves PyPy's performance on Django's template language dramatically.

You can find the rest here. There are view comments.

Things College Taught me that the "Real World" Didn't

Posted November 21st, 2009. Tagged with pypy, parse, python, unladen-swallow, compile, django, ply, programming-languages, c++, response, lex, compiler, yacc, college.

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.

Things I learned in college:

  • Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
  • C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
  • Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).

Things I learned in the real world:

  • Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
  • Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
  • APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.

So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.</div>

You can find the rest here. There are view comments.

Another Pair of Unladen Swallow Optimizations

Posted November 19th, 2009. Tagged with pypy, python, unladen-swallow, django, programming-languages.

Today a patch of mine was committed to Unladen Swallow. In the past weeks I've described some of the optimizations that have gone into Unladen Swallow, in specific I looked at removing the allocation of an argument tuple for C functions. One of the "on the horizon" things I mentioned was extending this to functions with a variable arity (that is the number of arguments they take can change). This has been implemented for functions that take a finite range of argument numbers (that is, they don't take *args, they just have a few arguments with defaults). This support was used to optimize a number of builtin functions (dict.get, list.pop, getattr for example).

However, there were still a number of functions that weren't updated for this support. I initially started porting any functions I saw, but it wasn't a totally mechanical translation so I decided to do a little profiling to better direct my efforts. I started by using the cProfile module to see what functions were called most frequently in Unladen Swallow's Django template benchmark. Imagine my surprise when I saw that unicode.encode was called over 300,000 times! A quick look at that function showed that it was a perfect contender for this optimization, it was currently designated as a METH_VARARGS, but in fact it's argument count was a finite range. After about of dozen lines of code, to change the argument parsing, I ran the benchmark again, comparing it a control version of Unladen Swallow, and it showed a consistent 3-6% speedup on the Django benchmark. Not bad for 30 minutes of work.

Another optimization I want to look at, which hasn't landed yet, is one of optimize various operations. Right now Unladen Swallow tracks various data about the types seen in the interpreter loop, however for various operators this data isn't actually used. What this patch does is check at JIT compilation time whether the operator site is monomorphic (that is there is only one pair of types ever seen there), and if it is, and it is one of a few pairings that we have optimizations for (int + int, list[int], float - float for example) then optimized code is emitted. This optimized code checks the types of both the arguments that they are the expected ones, if they are then the optimized code is executed, otherwise the VM bails back to the interpreter (various literature has shown that a single compiled optimized path is better than compiling both the fast and slow paths). For simple algorithm code this optimization can show huge improvements.

The PyPy project has recently blogged about the results of the results of some benchmarks from the Computer Language Shootout run on PyPy, Unladen Swallow, and CPython. In these benchmarks Unladen Swallow showed that for highly algorithmic code (read: mathy) it could use some work, hopefully patches like this can help improve the situation markedly. Once this patch lands I'm going to rerun these benchmarks to see how Unladen Swallow improves, I'm also going to add in some of the more macro benchmarks Unladen Swallow uses to see how it compares with PyPy in those. Either way, seeing the tremendous improvements PyPy and Unladen Swallow have over CPython gives me tremendous hope for the future.

You can find the rest here. There are view comments.

Announcing django-admin-histograms

Posted November 19th, 2009. Tagged with admin, release, django, python.

This is just a quick post because it's already past midnight here. Last week I realized some potentially useful code that I extracted from the DjangoDose and typewar codebases. Basically this code let's you get simple histogram reports for models in the admin, grouped by a date field. After I released it David Cramer did some work to make the code slightly more flexible, and to provide a generic templatetag for creating these histograms anywhere. The code can be found on github, and if you're curious what it looks like there's a screenshot on the wiki. Enjoy.

You can find the rest here. There are view comments.

Writing a Lexer

Posted November 17th, 2009. Tagged with parse, python, compile, lex, programming-languages.

People who have been reading this blog since last year (good lord) may recall that once upon a time I did a short series of posts on lexing and parsing using PLY. Back then I was working on a language named Al. This past week or so I've started working on another personal project (not public yet) and I've once again had the need to lex things, but this time I wrote my lexer by hand, instead of using any sort of generator. This has been an exceptional learning experience, so I'd like to pass some of that on to you.

The first thing to note is that writing a lexer is a great place to TDD (test driven development), I've rewritten various parts of my lexer five or more times, I've needed my tests to keep me sane. Got your tests written? Ok it's time to dive right into our lexer.

I've structured my lexer as a single class that takes an input string, and has a parse method which returns a generator that yields tokens (tokens are just a namedtuple with a name and value field). The parser has two important attributes, state which is a string that says what state the lexer is in (this is used for tokens that are more than one character long), and current_val which is a list containing characters that will eventually become the value for the current token being found.

The parse method iterates through characters in the text and then it checks, if the parser has a state (self.state is not None) it does getattr(self, self.state)(character). Otherwise it calls self.generic(character). Then the various "state methods" are responsible for mutating self.current_val and self.state and returning a Token. So for example the string state looks like this:

def string(self, ch):
    if ch == '"':
        sym = Symbol("STRING", "".join(self.current_val))
        self.current_val = []
        self.state = None
        return sym
    elif ch == "\\":
        self.state = "string_escaped"
    else:
        self.current_val.append(ch)

If the character is a quote then we're closing our string so we return our string Symbol, reset the current_val and state. If the character is a then we switch into a string_escaped state which knows to handle the character as a literal and then go back to string state. If the character is anything else then we just append it to the current_val, it will get handled at the end of the string.

I've found this to be an exceptionally powerful method, and it makes my end result code very readable. Hopefully I'll be able to reveal my project in the coming weeks, as I'm very excited about it, even if it's not ready I'll continue to share these lessons learned as I go.

You can find the rest here. There are view comments.

Initial Review: Python Essential Reference

Posted November 15th, 2009. Tagged with review, python, book.

Disclosure: I received a free review copy of Python Essential Reference, Fourth Edition.

I've never really used reference material, I've always loved tutorials, howtos, and guides for learning things, but I've usually shunned reference material in favor of reading the source. Therefore, I didn't think I'd have a huge use for this book. However, so far (I've read about half the book so far) I've found it to be an exceptional resource, and I definitely plan on keeping it on my bookshelf.

The first third or so of the book is a reference on the syntax and other basic constructs of Python, it's probably not the part of the book you'll be consulting very frequently if you're an experienced Python programmer, however the end of this section is a bit of "Testing, Debugging, Profiling, and Tuning", this I can see myself flipping back to, as it extensively documents the doctests, unittests, pdb, cProfile, and dis modules.

The next third of the book is all about the Python library, including both the builtins and the standard library. This section is organized by functionality and I can definitely see myself using it. For example it has sections on "Python Runtime Services" (like atexit, gc, marshal, and weakref), "Data Structures, Algorithms, and Code Simplification" (bisect, collections, heapq for example), "String and Text Handling" (codecs, re, struct), and "Python Database Access" (PEP249, sqlite, and dbm). There's more, but this is as far as I've read. Reading through like a novel each of these sections has exposed me to things I wasn't aware of or don't use as frequently as I should, and I plan on using this book as a resource for exploring them. David Beazley has painstakingly documented the details of these modules, paying particular attention to the functions and classes you are likely to need most.

All in all I've found the Python Essential Reference to be a good book, especially for people who like reference documentation. Depending on how you use Python this book can serve as an excellent eye opener into other parts of the language and standard library, and for me I think that's where a ton of value will come from, as a day to day Python user I don't need a reference for most of the language, but for the bits it's introducing me to, having it handy will be a leg up.

You can find the rest here. There are view comments.

Why jQuery shouldn't be in the admin

Posted November 14th, 2009. Tagged with python, admin, jquery, django, gsoc.

This summer as a part of the Google Summer of Code program Zain Memon worked on improving the UI for Django's admin, specifically he integrated jQuery for various interface improvements. I am opposed to including jQuery in Django's admin, as far as I know I'm the only one. I should note that on a personal level I love jQuery, however I don't think that means it should be included in Django proper. I'm going to try to explain why I think it's a bad idea and possibly even convince you.

The primary reason I'm opposed is because it lowers the pool of people who can contribute to developing Django's admin. I can hear the shouts from the audience, "jQuery makes Javascript easy, how can it LOWER the pool". By using jQuery we prevent people who know Javascript, but not jQuery from contributing to Django's admin. If we use more "raw" Javascript then anyone who knows jQuery should be able to contribute, as well as anyone who knows Mootools, or Dojo, or just vanilla Javascript. I'm sure there are some people who will say, "but it's possible to use jQuery without knowing Javascript", I submit to you that this is a bad thing and certainly shouldn't be encouraged. We need to look no further than Jacob Kaplan-Moss's talks on Django where he speaks of his concern at job postings that look for Django experience with no mention of Python.

The other reason I'm opposed is because selecting jQuery for the admin gives the impression that Django has a blessed Javascript toolkit. I'm normally one to say, "if people make incorrect inferences that's their own damned problem," however in this case I think they would be 100% correct, Django would have blessed a Javascript toolkit. Once again I can hear the calls, "But, it's in contrib, not Django core", and again I disagree, Django's contrib isn't like other projects' contrib directories that are just a dumping ground for random user contributed scripts and other half working features. Django's contrib is every bit as official as parts of Django that live elsewhere in the source tree. Jacob Kaplan-Moss has described what django.contrib is, no part of that description involves it being less official, quite the opposite in fact.

For these reasons I believe Django's admin should avoid selecting a Javascript toolkit, and instead maintain it's own handrolled code. Though this brings an increase burden on developers I believe it is more important to these philosophies than to take small development wins. People saying this stymies the admin's development should note that Django's admin's UI has changed only minimally over the past years, and only a small fraction of that can be attributed to difficulties in Javascript development.

You can find the rest here. There are view comments.

When Django Fails? (A response)

Posted November 11th, 2009. Tagged with django, python, response, rails.

I saw an article on reddit (or was in hacker news?) that asked the question: what happens when newbies make typos following the Rails tutorial, and how good of a job does Rails do at giving useful error messages? I decided it would be interesting to apply this same question to Django, and see what the results are. I didn't have the time to review the entire Django tutorial, so instead I'm going to make the same mistakes the author of that article did and see what the results are, I've only done the first few where the analogs in Django were clear.

Mistake #1: Point a URL at a non-existent view:

I pointed a URL at the view "django_fails.views.homme" when it should have been "home". Let's see what the error is:

ViewDoesNotExist at /
Tried homme in module django_fails.views. Error was: 'module' object has no attribute 'homme'

So the exception name is definitely a good start, combined with the error text I think it's pretty clear that the view doesn't exist.

Mistake #2: misspell url in the mapping file

Instead of doing url("^$" ...) I did urll:

NameError at /
name 'urll' is not defined

The error is a normal Python exception, which for a Python programmer is probably decently helpful, the cake is that if you look at the traceback it points to the exact line, in user code, that has the typo, which is exactly what you need.

Mistake #3: Linking to non-existent pages

I created a template and tried to use the {% url %} tag on a nonexistent view.

TemplateSyntaxError at /
Caught an exception while rendering: Reverse for 'homme' with arguments '()' and keyword arguments '{}' not found.

It points me at the exact line of the template that's giving me the error and it says that the reverse wasn't found, it seems pretty clear to me, but it's been a while since I was new, so perhaps a new users perspective on an error like this would be important.

It seems clear to me that Django does a pretty good job with providing useful exceptions, in particular the tracebacks on template specific exceptions can show you where in your templates the errors are. One issue I'll note that I've experience in my own work is that when you have an exception from within a templatetag it's hard to get the Python level traceback, which is important when you are debugging your own templatetags. However, there's a ticket that's been filed for that in Django's trac.

You can find the rest here. There are view comments.

The State of MultiDB (in Django)

Posted November 10th, 2009. Tagged with python, models, django, gsoc, internals, orm.

As you, the reader, may know this summer I worked for the Django Software Foundation via the Google Summer of Code program. My task was to implement multiple database support for Django. Assisting me in this task were my mentors Russell Keith-Magee and Nicolas Lara (you may recognize them as the people responsible for aggregates in Django). By the standards of the Google Summer of Code project my work was considered a success, however, it's not yet merged into Django's trunk, so I'm going to outline what happened, and what needs to happen before this work is considered complete.

Most of the major things happened, settings were changed from a series of DATABASE_* to a DATABASES setting that's keyed by DB aliases and who's values are dictionaries containing the usual DATABASE* options, QuerySets grew a using() method which takes a DB alias and says what DB the QuerySet should be evaluated against, save() and delete() grew similar using keyword arguments, a using option was added to the inner Meta class for models, transaction support was expanded to include support for multiple databases, as did the testing framework. In terms of internals almost every internal DB related function grew explicit passing of the connection or DB alias around, rather than assuming the global connection object as they used to. As I blogged previously ManyToMany relations were completely refactored. If it sounds like an awful lot got done, that's because it did, I knew going in that multi-db was a big project and it might not all happen within the confines of the summer.

So if all of that stuff got done, what's left? Right before the end of the GSOC time frame Russ and I decided that a fairly radical rearchitecting of the Query class (the internal datastructure that both tracks the state of an operation and forms its SQL) was needed. Specifically the issue was that database backends come in two varieties. One is something like a backend for App Engine, or CouchDB. These have a totally different design than SQL, they need different datastructures to track the relevant information, and they need different code generation. The second type of database backend is one for a SQL database. By contrast these all share the same philosophies and basic structure, in most cases their implementation just involves changing the names of database column types or the law LIMIT/OFFSET is handled. The problem is Django treated all the backends equally. For SQL backends this meant that they got their own Query classes even though they only needed to overide half of the Query functionality, the SQL generation half, as the datastructure part was identical since the underlying model is the same. What this means is that if you make a call to using() on a QuerySet half way through it's construction you need to change the class of the Query representation if you switch to a database with a different backend. This is obviously a poor architecture since the Query class doesn't need to be changed, just the bit at the end that actually constructs the SQL. To solve this problem Russ and I decided that the Query class should be split into two parts, a Query class that stores bits about the current query, and a SQLCompiler which generated SQL at the end of the process. And this is the refactoring that's holding up the merger of my multi-db work primarily.

This work is largely done, however the API needs to be finalized and the Oracle backend ported to the new system. In terms of other work that needs to be done, GeoDjango needs to be shown to shown to still work (or fixed). In my opinion everything else on the TODO list (available here, please don't deface) is optional for multi-db to be merge ready, with the exception of more example documentation.

There are already people using the multi-db branch (some even in production), so I'm confident about it's stability. For the next 6 weeks or so (until the 1.2 feature deadline), my biggest priority is going to be getting this branch into a merge ready state. If this is something that interests you please feel free to get in contact with me (although if you don't come bearing a patch I might tell you that I'll see you in 6 weeks ;)), if you happen to find bugs they can be filed on the Django trac, with version "soc2009/multidb". As always contributors are welcome, you can find the absolute latest work on my Github and a relatively stable version in my SVN branch (this doesn't contain the latest, in progress, refactoring). Have fun.

You can find the rest here. There are view comments.

Another Unladen Swallow Optimization

Posted November 8th, 2009. Tagged with python, internals, unladen-swallow.

This past week I described a few optimizations that the Unladen Swallow team have done in order to speed up CPython. In particular one of the optimizations I described was to emit direct calls to C functions that took either zero or one argument. This improves the speed of Python when calling functions like len() or repr(), who only take one argument. However, there are plenty of builtin functions that take a fixed number of arguments that is greater than one. This is the source of the latest optimization.

As I discussed previously there were two relevant flags, METH_O and METH_NOARGS. These described functions that take either one or zero arguments. However, this doesn't cover a wide gamut of functions. Therefore the first stage of these optimizations was to replace these two flags with METH_FIXED, which indicates that the function takes a fixed number of arguments. There was also an additional slot added to the struct that holds C functions to store the arity of the function (the number of arguments it takes). Therefore something like:

{"id", builtin_id, METH_O, id_doc}

Which is what the struct for a C function looks like would be replaced with:

{"id", builtin_id, METH_FIXED, id_doc, 1}

This allows Unladen Swallow to emit direct calls to functions that take more than 1 argument, specifically up to 3 arguments. This results in functions like hasattr() and setattr() to be better optimized. This change ultimately results in a 7% speed increase in Unladen Swallow's Django benchmark. Here the speed gains will largely come from avoiding allocating a tuple for the arguments, as Python used to have to do since the functions were defined as METH_VARARGS (which results in it receiving it's arguments as a tuple), as well as avoiding parsing that tuple.

This change isn't as powerful as it could be, specifically it requires that the function always take the same number of arguments. This prevents optimizing calls to getattr() for example, which can take either 2 or 3 arguments. This optimization doesn't hold because C doesn't have any way of expressing default arguments for the function, therefore the CPython runtime must pass all of the needed arguments to a function, which means C functions need to have a way to encode their defaults in a way that CPython can understand. One of the proposed solutions to this problem is to have functions be able to provide the minimum number of arguments they take and then CPython could pad the provided arguments with NULLs to achieve the correct number of arguments to the function (interestingly the C standard allows more arguments to be passed to a function than it takes). This type of optimization would speed up calls to things like dict.get() and getattr().

As you can see the speed of a Python application can be fairly sensitive to how various internal things are handled, in this case the speed increase can be shown to come exclusively from eliminating a tuple allocation and some extra logic on certain function calls. If you're interested in seeing the full changeset it's available on the internet.

You can find the rest here. There are view comments.

My Workflow

Posted November 7th, 2009. Tagged with python, pip, virtualenv, virtualenvwrapper, easy_install.

About a year ago I blogged about how I didn't like easy_install, and I alluded to the fact that I didn't really like any programming language specific package managers. I'm happy to say I've changed my tune quite drastically in the past 2 months. Since I started working with Eldarion I've dived head first into the pip and virtualenv system and I'm happy to say it works brilliantly. The nature of the work is that we have lots of different projects all at once, often using wildly different versions of packages in all sorts of incompatible ways. The only way to stay sane is to have isolated environments for each of them. Enter virtualenv stage left.

If you work with multiple Python projects that use different versions of things virtualenv is indispensable. It allows you to have totally isolated execution environments for different projects. I'm also using Doug Hellmann's virtualenvwrapper, which wraps up a few virtualenv commands and gives you some hooks you can use. When I start a new project it looks something like this:

$ git checkout some_repo
$ cd some_repo/
$ mkvirtualenv project_name

The first two steps are probably self explanatory. What mkvirtualenv does is to a new virtual environment, and activate it. I also have a hook set up with virtualenvwrapper to install the latest development version of pip, as well as ipython and ipdb. pip is a tremendous asset to this process. It has a requirements file that makes it very easy to keep track of all the dependencies for a given project, plus pip allows you to install packages out of a version control system which is tremendously useful.

When I want to work on an existing project all I need to do is:

$ workon project_name

This activates the environment for that project. Now the PATH prioritizes stuff installed into that virtualenv, and my Python path only has stuff installed into this virtualenv. I can't imagine what my job would be like without these tools, if I had to manually manage the dependencies for each project I'd probably go crazy within a week. Another advantage is it makes it easy to test things against multiple versions of a library. I can test if something works on Django 1.0 and 1.1 just by switching which environment I'm in.

As promised tomorrow I'll be writing about an optimization that just landed in Unladen Swallow, and I'm keeping Monday's post a secret. I'm not sure what Tuesday's post will be, but I think I'll be writing something Django related, either about my new templatetag library, or the state of my multiple database work. See you around the internet.

You can find the rest here. There are view comments.

Towards a Better Template Tag Definition Syntax

Posted November 6th, 2009. Tagged with django, python, template.

Eric Holscher has blogged a few times this month about various template tag definition syntax ideas. In particular he's looked at a system based on Surlex (which is essentially an alternate syntax for certain parts of regular expressions), and a system based on keywords. I highly recommend giving his posts a read as they explain the ideas he's looked at in far better detail than I could. However, I wasn't particularly satisfied with either of these solution, I love Django's use of regular expressions for URL resolving, however, for whatever reason I don't really like the look of using regular expressions (or an alternate syntax like Surlex) for template tag parsing. Instead I've been thinking about an object based parsing syntax, similar to PyParsing.

This is an idea I've been thinking about for several months now, but Eric's posts finally gave me the kick in the pants I needed to do the work. Therefore, I'm pleased to announce that I've released django-kickass-templatetags. Yes, I'm looking for a better name, it's already been pointed out to me that a name like that won't fly in the US government, or most corporate environments. This library is essentially me putting to code everything I've been thinking about, but enough talking let's take a look at what the template tag definition syntax:

@tag(register, [Constant("for"), Variable(), Optional([Constant("as"), Name()])]):
def example_tag(context, val, asvar=None):

As you can see it's a purely object based syntax, with different classes for different components of a template tag. For example this would parse something like:

{% example_tag for variable %}
{% example_tag for variable as new_var %}

It's probably clear that this is significantly less code than the manual parsing, manual node construction, and manual resolving of variable that you would have needed to do with a raw templatetag definition. Then the function you have gets the resolved values for each of its parameters, and at that point it's basically the same as Node.render, it is expected to either return a string to be inserted into the template or alter the context. I'm looking forward to never writing manual template parsing again. However, there are still a few scenarios it doens't handle, it won't handle something like the logic in the {% if %} tag, and it won't handle tags with {% end %} type tags. I feel like these should both be solvable problems, but since it's a bolt-on addition to the existing tools it ultimately doesn't have to cover every use case, just the common ones (when's the last time you wrote your own implementation of the {% if %} or {% for %} tags).

It's my hope that something like this becomes popular, as a) developers will be happier, b) moving towards a community standard is the first step towards including a solution out of the box. The pain and boilerplate of defining templatetags has long been a complain about Django's template language (especially because the language itself limits your ability to perform any sort of advanced logic), therefore making it as painless as possible ultimately helps make the case for the philosophy of the language itself (which I very much agree with it).

In keeping with my promise I'm giving an overview of what my next post will be, and this time I'm giving a full 3-day forecast :). Tommorow I'm going to blog about pip, virtualenv, and my development workflow. Sunday's post will cover a new optimization that just landed in Unladen Swallow. And finally Monday's post will contain a strange metaphor, and I'm not saying any more :P. Go checkout the code and enjoy.

Edit: Since this article was published the name of the library was changed to be: django-templatetag-sugar. I've updated all the links in this post.

You can find the rest here. There are view comments.

The Pycon Program Committee and my PyCon Talk

Posted November 5th, 2009. Tagged with talk, python, pc, pycon.

Last year I presented at a conference for the first time in my life at PyCon, I moderated a panel on ORMs and I enjoyed it a ton (and based on the feedback I've gotten at least a few people enjoyed attending it). Above and beyond that the two PyCons I've attended have both been amazing conferences, tons of awesome talks, great people to hang out with, and an awesome environment for maximizing both. For both of the last two years I've hung around the PyCon organizers mailing list since the conference was in Chicago and I lived there, however this year I really wanted to increase my contributions to such a great conference. Therefore, I joined the Pycon program committee. This committee is responsible for reviewing all talk submissions and selecting the talks that will ultimately appear at PyCon.

This year the PyCon programming committee had a really gargantuan task. There were more talks submitted then ever before, more than 170 of them, for only 90 or so slots. Unfortunately this meant that we had to reject some really good talks, which always sucks. There's been a fair bit of discussion about the process this year and what can be done to improve it. As a reviewer the one thing I wish I'd known going in was that the votes left on talks were just a first round, and ultimately didn't count for a huge amount. Had I known this I would have been less tepid in giving positive reviews to talks which merely looked interesting.

Another hot topic in the aftermath is whether or not the speaker's identity should factor into a reviewer's decision. My position is that it should wherein the speaker has a reputation, be it good or bad. If I know a speaker is awesome I'm way more likely to give them the +1, likewise if I see a speaker has a history of poor talks I'm more likely to give them a -1. That being said I don't think new speakers, or slightly inexperienced ones should be penalized for that, I was a brand new speaker last time and I'm grateful I was given the chance to present.

To give an example of this one of the talks I'm really looking forward to is Mark Ramm's, "To relate or not to relate, that is the question". Mark and I spoke about this topic for quite a while at PyOhio, and every talk from Mark I've ever seen has been excellent. Therefore I was more than willing to +1 it. However, had I not known the speaker it would still have been a good proposal, and an interesting topic, I just would not have gotten immediately excited about going to see the talk.

As an attendee one of the things I've always found is that speakers who are very passionate about their topics almost always give talks I really enjoy. Thinking back to my first PyCon Maciej Fijalkowski managed to get me excited and interested in PyPy in just 30 minutes, because he was so passionate in speaking about the topic.

All that being said I wanted to share a short list of the talks I'm excited about this year, before I dive into what my own talk will be about:
  • Optimizations And Micro-Optimizations In CPython
  • Unladen Swallow: fewer coconuts, faster Python
  • Managing the world's oldest Django project
  • Understanding the Python GIL
  • The speed of PyPy
  • Teaching compilers with python
  • To relate or not to relate, that is the question
  • Modern version control: Mercurial internals
  • Eventlet: Asynchronous I/O with a synchronous interface
  • Hg and Git : Can't we all just get along?
  • Mastering Team Play: Four powerful examples of composing Python tools

It's a bit of a long list, but compared to the size of the list of accepted talks I'm sure there are quite a few gems I've missed.

The talk I'm going to be giving this year is about the real time web, also known as HTTP push, Comet, or reverse Ajax. All of those are basically synonyms for the server being able to push data to the browser, rather than having the browser constantly poll the server for data. Specifically I'm going to be looking at my experience building three different things, LeafyChat, DjangoDose's DjangoCon stream, and Hurricane.

Leafychat is an IRC client built for the DjangoDash by myself, Leah Culver, and Chris Wanstrath. The DjangoDose DjangoCon stream was a live stream of all the Twitter items about DjangoCon that Eric Florenzano and I built in the week leading up to DjangoCon. Finally, Hurricane is the library Eric Florenzano and I have been working on in order to abstract the lessons learned from our experience building "real time" applications in Python.

In the talk I'm going to try to zero in on what we did for each of these projects, what worked, what didn't, and what I'm taking away from the experience. Finally, Eric Florenzano and I are working to put together a new updated, better version of the DjangoCon stream for PyCon. I'm going to discuss what we do with that project, and why we do it that way in light of the lessons of previous projects.

I'm hoping both my talk, and all of them will be awesome. One thing's for sure, I'm already looking forward to PyCon 2010. Tomorrow I'm going to be writing about my thoughts on a more ideal template tag definition syntax for Django, and hopefully sharing some code if I have time to start working on it. See you then (and in Atlanta ;))!

You can find the rest here. There are view comments.

Django's ManyToMany Refactoring

Posted November 4th, 2009. Tagged with python, models, django, gsoc, internals, orm.

If you follow Django's development, or caught next week's DjangoDose Tracking Trunk episode (what? that's not how time flows you say? too bad) you've seen the recent ManyToManyField refactoring that Russell Keith-Magee committed. This refactoring was one of the results of my work as a Google Summer of Code student this summer. The aim of that work was to bring multiple database support to Django's ORM, however, along the way I ran into refactor the way ManyToManyField's were handled, the exact changes I made are the subject of tonight's post.

If you've looked at django.db.models.fields.related you may have come away asking how code that messy could possibly underlie Django's amazing API for handling related objects, indeed the mess so is so bad that there's a comment which says:

# HACK

which applies to an entire class. However, one of the real travesties of this module was that it contained a large swath of raw SQL in the manager for ManyToMany relations, for example the clear() method's implementation looks like:

cursor = connection.cursor()
cursor.execute("DELETE FROM %s WHERE %s = %%s" % \
    (self.join_table, source_col_name),
    [self._pk_val])
transaction.commit_unless_managed()

As you can see this hits the trifecta, raw SQL, manual transaction handling, and the use of a global connection object. From my perspective the last of these was the biggest issue. One of the tasks in my multiple database branch was to remove all uses of the global connection object, and since this uses it it was a major target for refactoring. However, I really didn't want to rewrite any of the connection logic I'd already implemented in QuerySets. This desire to avoid any new code duplication, coupled with a desire to remove the existing duplication (and flat out ugliness), led me to the simple solution: use the existing machinery.

Since Django 1.0 developers have been able to use a full on model for the intermediary table of a ManyToMany relation, thanks to the work of Eric Florenzano and Russell Keith-Magee. However, that support was only used when the user explicitly provided a through model. This of course leads to a lot of methods that basically have two implementation: one for the through model provided case, and one for the normal case -- which is yet another case of code bloat that I was now looking to eliminate. After reviewing these items my conclusion was that the best course was to use the provided intermediary model if it was there, otherwise create a full fledged model with the same fields (and everything else) as the table that would normally be specially created for the ManyToManyField.

The end result was dynamic class generation for the intermediary model, and simple QuerySet methods for the methods on the Manager, for example the clear() method I showed earlier now looks like this:

self.through._default_manager.filter(**{
    source_field_name: self._pk_val
}).delete()

Short, simple, and totally readable to anyone with familiarity with Python and Django. In addition this move allowed Russell to fix another ticket with just two lines of code. All in all this switch made for cleaner, smaller code and fewer bugs.

Tomorrow I'm going to be writing about both the talk I'm going to be giving at PyCon, as well as my experience as a member of the PyCon program committee. See you then.

You can find the rest here. There are view comments.

Diving into Unladen Swallow's Optimizations

Posted November 3rd, 2009. Tagged with python, unladen-swallow, compile, internals, compiler.

Yesterday I described the general architecture of Unladen Swallow, and I said that just by switching to a JIT compiler and removing the interpretation overhead Unladen Swallow was able to get a performance gain. However, that gain is nowhere near what the engineers at Google are hoping to accomplish, and as such they've been working on building various optimizations into their JIT. Here I'm going to describe two particularly interesting ones they implemented during the 3rd quarter (they're doing quarterly releases).

Before diving into the optimizations themselves I should note there's one piece of the Unladen Swallow architecture I didn't discuss in yesterday's post. The nature of dynamic languages is that given code can do nearly anything depending on the types of the variables present, however in practice usually very few types are seen. Therefore it is necessary to collect information about the types seen in practice in order to perform optimizations. Therefore what Unladen Swallow has done is added data collection to the interpreter while it is executing bytecode. For example the BINARY_ADD opcode records the types of both of it's operands, the CALL_FUNCTION opcode records the function it is calling, and the UNPACK_SEQUENCE opcode records the type of the sequence it's unpacking. This data is then used when the function is compiled to generate optimal code for the most likely scenarios.

The first optimization I'm going to look at is one for the CALL_FUNCTION opcode. Python has a number of flags that functions defined in C can have, the two relevant to this optimization are METH_NOARGS and METHO_O. These flags indicate that the function (or method) in question take either 0 or 1 argument respectively (this is excluding the self argument on methods). Normally when Python calls a function it builds up a tuple of the arguments, and a dictionary for keyword arguments. For functions defined in Python CPython lines up the arguments with those the function takes and then sets them as local variables for the new function. For C functions they are given the tuple and dictionary directly and are responsible for parsing them themselves. By contrast functions with METH_NOARGS or METH_O receive their arguments (or nothing in the case of METH_NOARGS) directly.

Because calling METH_NOARGS and METH_O functions is so much easier than the alternate case (which involves several allocations and complex logic) when possible it is best to special case them in the generated assembly. Therefore, when compiling a CALL_FUNCTION opcode if using the data recorded there is only ever one function called (imagine a call to len, it is going to be the same len function every time), and that function is METH_NOARGS or METH_O then instead of generating a call to the usual function call machinery Unladen Swallow instead emits a check to make sure the function is actually the expected one and if it passes emits a call directly to the function with the correct arguments. If this guard fails then Unladen Swallow jumps back to the regular interpreter, leaving the optimized assembly. The reason for this is that the ultimately generated assembly can be more efficient when it only has to consider one best case scenario, as opposed to needing to deal with a large series of if else statements, which catalogue every single best case and the corresponding normal case. Ultimately, this results in more efficient code for calls to functions like len(), which are basically never redefined.

The next optimization we're going to look at is one for the LOAD_GLOBAL function. The LOAD_GLOBAL opcode is used for getting the value of a global variable, such as a builtin function, an imported class, or a global variable in the same module. In the interpreter the code for this opcode looks something like:

name = OPARG()
try:
    value = globals[name]
except KeyError:
    try:
        value = builtins[name]
    except KeyError:
        raise_exception(KeyError, name)
PUSH(value)

As you can see in the case of a builtin object (something like len, str, or dict) there are two dictionary lookups. While the Python dictionary is an exceptionally optimized datastructure it still isn't fast compared to a lookup of a local value (which is a single lookup in a C array). Therefore the goal of this optimization is to reduce the number of dictionary lookups needed to find the value for a global or builtin.

The way this was done was for code objects (the datastructures that hold the opcodes and various other internal details for functions) to register themselves with the globals and builtin dictionaries. By registering themselves the dictionaries are able to notify the code objects (similar to Django signals) whenever they are modified. The result of this is that the generated assembly for a LOAD_GLOBAL can perform the dictionary lookup once at compilation time and then the resulting assembly will be valid until the globals or builtins dictionary notifies the code object that they have been modified, thus rendering the assembly invalid. In practice this is very efficient because globals and builtins are very rarely modified.

Hopefully you've gotten a sense of the type of work that the people behind Unladen Swallow are doing. If you're interested in reading more on this type of work I'd highly recommend taking a look at the literature listed on the Unladen Swallow wiki, as they note that there is no attempt to do any original research, all the work being done is simply the application of existing, proven techniques to the CPython interpreter.

For the rest of this month I'm going to try to give a preview of the next day's post with each post, that way I can start thinking about it well in advance. Tomorrow I'm going to shift gears a little bit and write about the ManyToManyField refactoring I did over the summer and which was just committed to Django.

You can find the rest here. There are view comments.

Introduction to Unladen Swallow

Posted November 2nd, 2009. Tagged with python, unladen-swallow, compile, internals, compiler.

Unless you've been living under a rock for the past year (or have zero interest in either Python or dynamic languages, it which case why are you here?) you've probably heard of Unladen Swallow. Unladen Swallow is a Google funded branch of the CPython interpreter, with a goal of making CPython significantly faster while retaining both API and ABI compatibility. In this post I'm going to try to explain what it is Unladen Swallow is doing to bring a new burst of speed to the Python world.

In terms of virtual machines there are a few levels of complexity, which roughly correspond to their speed. The simplest type of interpreter is an AST evaluator, these are more or less the lowest of the low on the speed totem pole, up until YARV was merged into the main Ruby interpreter, MRI (Matz Ruby Interpreter) was this type of virtual machine. The next level of VM is a bytecode interpreter, this means that the language is compiled to an intermediary format (bytecode) which is then executed. Strictly speaking this is an exceptionally broad category which encompasses most virtual machines today, however for the purposes of this article I'm going to exclude any VMs with a Just-In-Time compiler from this section (more on them later). The current CPython VM is this type of interpreter. The most complex (and fastest) type of virtual machine is one with a Just-In-Time compiler, this means that the bytecode that the virtual machine interprets is also dynamically compiled into assembly and executed. This type of VM includes modern Javascript interpreters such as V8, Tracemonkey, and Squirellfish, as well as other VMs like the Hotspot Java virtual machine.

Now that we know where CPython is, and what the top of the totem pole looks like it's probably clear what Unladen Swallow is looking to accomplish, however there is a bit of prior art here that's worthy of taking a look. There is actually currently a JIT for CPython, named Psyco. Psyco is pretty commonly used to speed up numerical code, as that's what it's best at, but it can speed up most of the Python language. However, Psyco is extremely difficult to maintain and update. It only recently gained support for modern Python language features like generators, and it still only supports x86 CPUs. For these reasons the developers at Google chose to build their JIT rather than work to improve the existing solution (they also chose not to use one of the alternative Python VMs, I'll be discussing these in another post).

I just said that Unladen Swallow looked to build their own JIT, but that's not entirely true. The developers have chosen not to develop their own JIT (meaning their own assembly generator, and register allocator, and optimizer, and everything else that goes along with a JIT), they have instead chosen to utilize the LLVM (Low Level Virtual Machine) JIT for all the code generation. What this means is that instead of doing all the work I've alluded the devs can instead translate the CPython bytecode into LLVM IR (intermediate representation) and then use LLVM's existing JIT infrastructure to do some of the heavy lifting. This gives the devs more time to focus on the interesting work of how to optimize the Python language.

Now that I've layed out the background I'm going to dive into what exactly it is that Unladen Swallow does. Right now the CPython virtual machine looks something like this:

for opcode in opcodes:
    if opcode == BINARY_ADD:
        x, y = POP(), POP()
        z = x + y
        PUSH(z)
    elif opcode == JUMP_ABSOLUTE:
        pc = OPARG()
    # ...

This is both hugely simplified and translated into a Pythonesque psuedocode, but hopefully it makes the point clear, right now the CPython VM runs through the opcodes and based on what the opcode is executes some C code. This is particularly inefficient because there is a fairly substantial overhead to actually doing the dispatch on the opcode. What Unladen Swallow does is count the number of times a given Python function is called (the heuristic is actually slightly more complicated than this, but it's a good approximation of what happens), and when it reaches 10000 (the same value the JVM uses) it stops to compile the function using LLVM. Here what it does is essentially unrolls the interpreter loop, into the LLVM IR. So if you had the bytecode:

BINARY_ADD

Unladen Swallow would generate code like:

x, y = POP(), POP()
z = x + y
PUSH(z)

This eliminates all of the overhead of the large loop in the interpreter. Unladen Swallow also performs a number of optimizations based on Python's semantics, but I'll be getting into those in another post, for now LLVM run it's optimizers, which can improve the generated code somewhat, and then CPython executes the generated function. Now whenever this function is called in the future the optimized, assembly version of it is called.

This concludes the introduction to Unladen Swallow. Hopefully you've learned something about the CPython VM, Unladen Swallow, or virtual machines in general. In future posts I'm going to be diving in to some of the optimizations Unladen Swallow does, as well as what other players are doing in this space (particularly PyPy).

You can find the rest here. There are view comments.

Optimising compilers are there so that you can be a better programmer

Posted October 10th, 2009. Tagged with pypy, python, unladen-swallow, django, compiler.

In a discussion on the Django developers mailing list I recently commented that the performance impact of having logging infrastructure, in the case where the user doesn't want the logging, could essentially be disregarded because Unladen Swallow (and PyPy) are bringing us a proper optimising (Just in Time) compiler that would essentially remove that consideration. Shortly thereafter someone asked me if I really thought it was the job of the interpreter/compiler to make us not think about performance. And my answer is: the job of a compiler is to let me program with best practices and not suffer performance consequences for doing things the right way.

Let us consider the most common compiler optimisations. A relatively simple one is function inlining, in the case where including the body of the function would be more efficient than actually calling it, a compiler can simply move the functions body into its caller. However, we can actually do this optimisation in our own code. We could rewrite:

def times_2(x):
    return x * 2

def do_some_stuff(i):
    for x in i:
        # stuff
        z = times_2(x)
    # more stuff

as:

def do_some_stuff(i):
    for x in i:
        # stuff
        z = x * 2
    # more stuff

And this is a trivial change to make. However in the case where times_2 is slightly less trivial, and is used a lot in our codebase it would be exceptionally more programming practice to repeat this logic all over the place, what if we needed to change it down the road? Then we'd have to review our entire codebase to make sure we changed it everywhere. Needless to say that would suck. However, we don't want to give up the performance gain from inlining this function either. So here it's the job of the compiler to make sure functions are inlined when possible, that way we get the best possible performance, as well as allowing us to maintain our clean codebase.

Another common compiler optimisation is to transform multiplications by powers of 2 into binary shifts. Thus x * 2 becomes x << 1 A final optimisation we will consider is constant propagation. Many program have constants that are used throughout the codebase. These are often simple global variables. However, once again, inlining them into methods that use them could provide a significant benefit, by not requiring the code to making a lookup in the global scope whenever they are used. But we really don't want to do that by hand, as it makes our code less readable ("Why are we multiplying this value by this random float?", "You mean pi?", "Oh."), and makes it more difficult to update down the road. Once again our compiler is capable of saving the day, when it can detect a value is a constant it can propagate it throughout the code.

So does all of this mean we should never have to think about writing optimal code, the compiler can solve all problems for us? The answer to this is a resounding no. A compiler isn't going to rewrite your insertion sort into Tim sort, nor is it going to fix the fact that you do 700 SQL queries to render your homepage. What the compiler can do is allow you to maintain good programming practices.

So what does this mean for logging in Django? Fundamentally it means that we shouldn't be concerned with possible overhead from calls that do nothing (in the case where we don't care about the logging) since a good compiler will be able to eliminate those for us. In the case where we actually do want to do something (say write the log to a file) the overhead is unavoidable, we have to open a file and write to it, there's no way to optimise it out.

You can find the rest here. There are view comments.

Django-filter 0.5 released!

Posted August 14th, 2009. Tagged with release, django, python, filter.

I've just tagged and uploaded Django-filter 0.5 to PyPi. The biggest change this release brings is that the package name has been changed from filter to django_filters in order to avoid conflicts with the Python builtin filter. Other changes included the addition of an initial argument on Filters, as well as the addition of an AllValuesFilter which is a ChoiceFilter who's choices are any values currently in the DB for that field. Despite the change in package name I will not be changing the name of the project due to the overhead in moving the repository (Github doesn't set up redirects when you change a project's name) and the PyPi package. I hope everyone enjoys this new release, as a lot of it's improvements have come out of my usage on Django-filter in piano-man.

As for what the future holds several people have indicated their interest in the inclusion of django-filter in Django itself as a contrib package, and for usage in the Admin as a new implementation of the list_filter option that is more flexible. Because of this my next work is probably going to be on implementing a custom ModelAdmin class that uses FilterSets for filtering.

You can find the latest release on PyPi and Github.

You can find the rest here. There are view comments.

pyvcs .2 released

Posted July 12th, 2009. Tagged with release, python, vcs.

Hot on the heels of our .1 release (it's only been a week!) I'm pleased to announce the .2 release of pyvcs. This release brings with it lots of new goodies. Most prominent among these are the newly-added Subversion and Bazaar backends. There are also several bug fixes to the code browsing features of the Mercurial backend. This release can be found at:
If you find any bugs with this release please report them at:

Thanks for all the contributions to this release, almost everything you see in this release is because community members contributed to it, hardly any new code in here is originally written by Justin or I.

We are hoping to have some exciting announcements for piano-man coming up in the next couple of weeks.

Enjoy.

You can find the rest here. There are view comments.

Announcing pyvcs, django-vcs, and piano-man

Posted July 5th, 2009. Tagged with release, django, python, vcs.

Justin Lilly and I have just released pyvcs, a lightweight abstraction layer on top of multiple version control systems, and django-vcs, a Django application leveraging pyvcs in order to provide a web interface to version control systems. At this point pyvcs exists exclusively to serve the needs of django-vcs, although it is separate and completely usable on its own. Django-vcs has a robust feature set, including listing recent commits, pretty diff rendering of commits, and code browsing. It also supports multiple projects. Both pyvcs and django-vcs currently support Git and Mercurial, although adding support for a new backend is as simple as implementing four methods and we'd love to be able to support additional VCS like Subversion or Bazaar. Django-vcs comes with some starter templates (as well as CSS to support the pretty diff rendering).

It goes without saying that we'd like to thank the authors of the VCS themselves, in addition we'd like to thank the authors of Dulwich, for providing a pure Python implementation of the Git protocol, as well as the Pocoo guys, for pygments, the syntax highlighting library for Python, as well as the pretty diff rendering which we lifted out of the lodgeit pastbin application.

Having announced what we have already, we'll now be looking towards the future. As such Justin and I plan to be starting a new Django project "piano-man". Piano-man, a reference to Billy Joel, follows the Django tradition of naming things after musicians (although we've branched out a bit, leaving the realm of Jazz in favour of Rock 'n Roll). Piano-man will be a complete web based project management system, similar to Trac or Redmine. There are a number of logistical details that we still need to sort out, such as whether this will be a part of Pinax as a "code edition" or whether it will be a separate project entirely, like Satchmo and Reviewboard.

Some people are inevitably asking why we've chosen to start a new project, instead of working to improve one of the existing ones I alluded to. The reason is, after hearing coworkers complain about poor Git support in Trac (even with external plugins), and friends complain about the need to modify Redmine just to support branches in Git properly I'd become convinced it couldn't possibly be that hard, and I think Justin and I have proven that it isn't. All the work you see in pyvcs and django-vcs took 48 hours to complete, with both of us working evenings and a little bit during the day on these projects.

You can find both django-vcs and pyvcs on PyPi as well on Github under my account (http://github.com/alex/), both are released under the BSD license. We hope you enjoy both these projects and find them helpful, and we'd appreciate and contributions, just file bugs on Github. I'll have another blog post in a few days outlining the plan for piano-man once Justin and I work out the logistics. Enjoy.

Edit: I seem to have forgotten all the relevant links, here they are

Sorry about that.

You can find the rest here. There are view comments.

A response to "Python sucks"

Posted June 4th, 2009. Tagged with python, response.

I recently saw an article on Programming Reddit, titled, "Python sucks: Why Python is not my favourite programming language". I personally like Python quite a lot (see my blog title), but I figured I might read an interesting critique, probably from a very different point of view from mine. Unfortunately that is the opposite of what I found. The post was, at best, a horribly misinformed inaccurate critique, and at worst an intentionally dishonest, misleading, farce. The post can be found here. I felt the need to respond to it precisely because it is so lacking in facts, and reading it one can get impressions that are completely incorrect, and I am hoping I can correct some of these.

The post's initial statements about iterating over a file are accurate. However, he then goes on to say Python supports closures (which is true), and follows this with a piece of code that has absolutely nothing to do with closures, it is actually a callable object (or as C++ calls them, functors). The authors seems to take issue with these (though he doesn't explain why), ignoring the fact that Python has complete support for actual closures, not just callable objects.

The author then claims that Python has many other such arbitrary rules, using as an example the "yield" keyword. The author appears to be claiming the behavior of the yield keyword is arbitrary and poorly defined, however it's very unclear what his point actually is, or what the source of his complaints is. My only response can be to say that the "yield" keyword always turns the function it's used in into a generator, that is to say it returns an iterable that lazy evaluates the function, pausing each time it reaches the yield statement, and returning that object.

The author claims that many of the arbitrary decisions in Python are a result of Guido's insistence on a specific programming style, using as an example crippled lambdas. It is generally accepted that in Python lambda is just syntactic sugar for defining a function within any context (which the author completely ignores in his discussion of closure). To say that lambdas are crippled is to ignore the fact that absolutely nothing is rendered impossible by this, except for unreadable one liners.

The author's final complaint is directed at Python's C-API. This is possibly his least accurate critique. The author compares what is necessary to use a C library from within various programming languages. He shows that in Python all you have to do is import the library like you would for normal Python code. However, he goes on to say that for this to work you need to write lots of C boilerplate, and says that in other programming languages (showing examples from Haskell and PLT Scheme) this boiler plate is unnecessary. However, this is a completely disingenuous comparison. This is because what he is showing for Haskell and Scheme is their foreign function interface, not any actual language level integration. To do what he shows in Python is perfectly possible using the included ctypes library. I'm not familiar with the C-API of either Haskell or PLT Scheme, however I imagine that in order to work seamlessly and have the APIs appear the same as in code in those languages it is still necessary to write boiler plate so that the interpreter can recognize them.

In conclusion that blog post was a critique completely devoid of value, not worth the bytes that are used to store it. This is not to say there aren't any valid criticisms of Python, there are many, as evidenced by any number of recent blog posts discussing "5 things they hate about technology X", where technology X is something the author likes, because no technology is perfect, however no such honest critique was present here.

You can find the rest here. There are view comments.

EuroDjangoCon 2009

Posted May 5th, 2009. Tagged with talk, django, python, forms.

EuroDjangoCon 2009 is still going strong, but I wanted to share the materials from my talk as quickly as possible. My slides are on Slide Share:

And the first examples code follows:

from django.forms.util import ErrorList
from django.utils.datastructures import SortedDict

def multiple_form_factory(form_classes, form_order=None):
   if form_order:
       form_classes = SortedDict([(prefix, form_classes[prefix]) for prefix in
           form_order])
   else:
       form_classes = SortedDict(form_classes)
   return type('MultipleForm', (MultipleFormBase,), {'form_classes': form_classes})

class MultipleFormBase(object):
   def __init__(self, data=None, files=None, auto_id='id_%s', prefix=None,
       initial=None, error_class=ErrorList, label_suffix=':',
       empty_permitted=False):
       if prefix is None:
           prefix = ''
       self.forms = [form_class(data, files, auto_id, prefix + form_prefix,
           initial[i], error_class, label_suffix, empty_permitted) for
           i, (form_prefix, form_class) in enumerate(self.form_classes.iteritems())]

   def __unicode__(self):
       return self.as_table()

   def __iter__(self):
       for form in self.forms:
           for field in form:
               yield field

   def is_valid(self):
       return all(form.is_valid() for form in self.forms)

   def as_table(self):
       return '\n'.join(form.as_table() for form in self.forms)

   def as_ul(self):
       return '\n'.join(form.as_ul() for form in self.forms)

   def as_p(self):
       return '\n'.join(form.as_p() for form in self.forms)

   def is_multipart(self):
       return any(form.is_multipart() for form in self.forms)

   def save(self, commit=True):
       return tuple(form.save(commit) for form in self.forms)
   save.alters_data = True

EuroDjangoCon has been a blast thus far and after the conference I'll do a blogpost that does it justice.

You can find the rest here. There are view comments.

ORM Panel Recap

Posted March 30th, 2009. Tagged with python, alchemy, gae, django, orm, web2py, pycon, object, sql.

Having now completed what I thought was a quite successful panel I thought it would be nice to do a review of some of the decisions I made, that some people had been asking about. For those who missed it you can find a live blog of the event by James Bennett at his blog, and a video should hopefully be going up sometime soon.

Why Google App Engine

As Guido pointed out App Engine does not have an ORM, as App Engine doesn't have a relational datastore. However, it does have something that looks and acts quite a lot like other ORMs, and it does fundamentally try to serve the same purpose, offering a persistence layer. Therefore I decided it was at least in the same class of items I wanted to add. Further, with the rise of non-relational DBs that all fundamentally deal with the same issues as App Engine, and the relationship between ORMs and these new persistence layers I thought it would be advantageous to have one of these, Guido is a knowledgeable and interesting person, and that's how the cookie crumbled.

Why Not ZODB/Storm/A Talking Pony

Time. I would have loved to have as many different ORMs/things like them as exist in the Python eco-sphere, but there just wasn't time. We had 55 minutes to present and as it is that wasn't enough. I ultimately had time to ask 3 questions(one of which was just background), plus 5 shorter audience questions. I was forced to cut out several questions I wanted to ask, but didn't have time to, for those who are interested the major questions I would have liked to ask were:

  • What most often requested feature won't you add to your ORM?
  • What is the connection between an ORM and a schema migration tool. Should they both be part of the same project, should they be tied together, or are they totally orthogonal?
  • What's your support for geographic data? Is this(or other complex data types like it) in scope for the core of an ORM?

Despite these difficulties I thought the panel turned out very well. If there are any other questions about why things were the way they were just ask in the comments and I'll try to post a response.

You can find the rest here. There are view comments.

PyCon Wrapup

Posted March 30th, 2009. Tagged with pycon, django, python.

With PyCon now over for me(and the sprints just begining for those still there) I figured I'd recap it for those who couldn't be there, and compare notes for those who were. I'll be doing a separate post to cover my panel, since I have quite a brain dump there.

Day One

We start off the morning with lightning talks, much better than last year due to the removal of so many sponsor talks(thanks Jacob), by now I've already met up with quite a few Django people. Next I move on to the "Python for CS1" talk, this was particularly interesting for me since they moved away from C++, which is exactly what my school uses. Next, on to "Introduction to CherryPy" with a few other Djangonauts. CherryPy looks interesting as a minimalist framework, however a lot of the callback/plugin architecture was confusing me, so I feel like I'd be ignoring it and just being very close to the metal, which isn't always a bad thing, something to explore in the future. For there I'm off to "How Python is Developed", it's very clear that Django's development model is based of Python's, and that seems to work for both parties.

Now we head to lunch, which was fantastic, thanks to whoever put it together. From there I go to "Panel: Python VMs", this was very informative, though I didn't have a chance to ask my question, "Eveyone on the panel has mentioned that performance is likely to improve greatly in the future as a potential reason to use their alternate VM, should Unladen Swallow succeed in their goal, how does that affect you?". After that I went to the "Twisted, AMQP, and Thrift" talk, it was fairly good, though this feels like a space I have to look at more, or have a real world use case for, to totally understand.

A quick break, and then we're on to Jesse Noller's mutliprocessing talk. Jesse really is the expert in this, and having used the multiprocessing library before it's clear to me there's still tons of cool stuff to explore. Lastly we had the "Behind the scenes of Everyblock.com" from Adrian, it's very cool to see their architecture and it's getting me excited for their going open source in June. I also got a bit out a shoutout here when Adrian was asked about future scalability and he mentioned sharding across multiple databases. We had lightning talks and that was the end of the official conference for the day.

In the evening a big group of us(officially a Pinax meetup, but I think we had more than that) went to a restaurant, as the evening progressed our group shrunk, until they finally kicked us out at closing time. A good time was had by all I think.

Day two

After staying out exceptionally late the night before I ended up sleeping on the floor with friends rather than heading home(thanks!) so, retrospectively, day two should have been tiring, but it was as high energy as could be. I began the morning with lightning talks followed by Guido's keynote, which was a little odd and jumped around a lot, but I found I really enjoyed it. After that I got to sit in one room for 3 talks in a row, "The State of Django", "Pinax", and "State of TurboGears", which I can say, without qualifications, were all great(I received another shoutout during the Django talk, I guess I really need to deliver, assuming the GSOC project is accepted). From there it was off to lunch.

After lunch it was time to take the stage for my "ORM Panel". I'm saving all the details for another blog post, but I think it went well. After this we had a Django Birds of a Feather session, which was very fun. We got to see Honza Kraal show off the cool features of the Ella admin. The highlight had to be having Jacob Kalpan-Moss rotate us around the room to get Djangonauts to meet each other.

After this I heard Jesse Noller's second concurrency talk, and then Ian Bicking's talk on ... stuff, both talks were great, but Ian's is probably in a "you had to be there" category.

After that it was time for lightning talks, followed by what was supposed to be dinner for people interested in Oebfare(Django blog application) to make some design decisions. In the end quite a few more people were there and I got to meet some really cool people like Mark Pilgrim. After dinner it was back to the Hyatt for about an hour of hacking on Django before I called it a night.

Day Three

The final day of the conference. Once again I began my day with some lightning talks. After this was the keynote, by the reddit.com creators. They gave a short, but interesting keynote and answered a lot of questions. I have to say that some of their technical answers were less than satisfying. After this I went to an open space on podcasting where I learned that I'm probably the only person who doesn't mind podcasts that are over an hour long.

After this I went to the Eve Online talk, but I ended up leaving for the testing tools panel. This panel was interesting, but I can't help but feel that the Python community would be best served by getting the testing tools that everyone agrees about into the unittest module in the Python standard library. For my last talk of the conference I saw Jacob Kaplan-Moss's talk on "Django's design decisions", thankfully this didn't step on the toes of the ORM design decision panel, and it was very interesting. From there it was off to lunch, and a final round of lightning talks, capped off my Guido running away with the Django Pony(I hope someone caught this on video), and my conference was done.

This was a really great conference for me, hopefully everyone else enjoyed it as much as I did. I'll be doing a follow up post where I answer some of the questions I've seen about the ORM panel, as well as put the rest of my thoughts on paper(hard disk). And now if you'll excuse me, I have a virtual sprint to attend.

You can find the rest here. There are view comments.

Google Moderator for PyCon ORM Panel

Posted March 15th, 2009. Tagged with python, alchemy, gae, django, orm, web2py, object, sql.

I'm going to be moderating a panel this year at PyCon between 5 of the Python ORMs(Django, SQLAlchemy, SQLObject, Google App Engine, and web2py). To make my job easier, and to make sure the most interesting questions are asked I've setup a Google Moderator page for the panel here. Go ahead and submit your questions, and moderate others to try to ensure we get the best questions possible, even if you can't make it to PyCon(there will be a recording made I believe). I'll be adding my own questions shortly to make sure they are as interesting as I think they are.

Also, if you aren't already, do try to make it out to PyCon, there's still time and the talks look to be really exceptional.

You can find the rest here. There are view comments.

A Second Look at Inheritance and Polymorphism with Django

Posted February 10th, 2009. Tagged with python, models, django, internals, orm, metaclass.

Previously I wrote about ways to handle polymorphism with inheritance in Django's ORM in a way that didn't require any changes to your model at all(besides adding in a mixin), today we're going to look at a way to do this that is a little more invasive and involved, but also can provide much better performance. As we saw previously with no other information we could get the correct subclass for a given object in O(k) queries, where k is the number of subclasses. This means for a queryset with n items, we would need to do O(nk) queries, not great performance, for a queryset with 10 items, and 3 subclasses we'd need to do 30 queries, which isn't really acceptable for most websites. The major problem here is that for each object we simply guess as to which subclass a given object is. However, that's a piece of information we could know concretely if we cached it for later usage, so let's start off there, we're going to be building a mixin class just like we did last time:

from django.db import models

class InheritanceMixIn(models.Model):
    _class = models.CharField(max_length=100)

    class Meta:
        abstract = True

So now we have a simple abstract model that the base of our inheritance trees can subclass that has a field for caching which subclass we are. Now let's add a method to actually cache it and retrieve the subclass:

from django.db import models
from django.db.models.fields import FieldDoesNotExist
from django.db.models.related import RelatedObject

class InheritanceMixIn(models.Model):
    ...
    def save(self, *args, **kwargs):
        if not self.id:
            parent = self._meta.parents.keys()[0]
            subclasses = parent._meta.get_all_related_objects()
            for klass in subclasses:
                if isinstance(klass, RelatedObject) and klass.field.primary_key \
                    and klass.opts == self._meta:
                    self._class = klass.get_accessor_name()
                    break
        return super(InheritanceMixIn, self).save(*args, **kwargs)

    def get_object(self):
        try:
            if self._class and self._meta.get_field_by_name(self._class)[0].opts != self._meta:
                return getattr(self, self._class)
        except FieldDoesNotExist:
            pass
        return self

Our save method is where all the magic really happens. First, we make sure we're only doing this caching if it's the first time a model is being saved. Then we get the first parent class we have (this means this probably won't play nicely with multiple inheritance, that's unfortunate, but not as common a usecase), then we get all the related objects this class has(this includes the reverse relationship the subclasses have). Then for each of the subclasses, if it is a RelatedObject, and it is a primary key on it's model, and the class it points to is the same as us then we cache the accessor name on the model, break out, and do the normal save procedure.

Our get_object function is pretty simple, if we have our class cached, and the model we are cached as isn't of the same type as ourselves we get the attribute with the subclass and return it, otherwise we are the last descendent and just return ourselves. There is one(possible quite large) caveat here, if our inheritance chain is more than one level deep(that is to say our subclasses have subclasses) then this won't return those objects correctly. The class is actually cached correctly, but since the top level object doesn't have an attribute by the name of the 2nd level subclass it doesn't return anything. I believe this can be worked around, but I haven't found a way yet. One idea would be to actually store the full ancestor chain in the CharField, comma separated, and then just traverse it.

There is one thing we can do to make this even easier, which is to have instances automatically become the correct subclass when they are pulled in from the DB. This does have an overhead, pulling in a queryset with n items guarantees O(n) queries. This can be improved(just as it was for the previous solution) by ticket #7270 which allows select_related to traverse reverse relationships. In any event, we can write a metaclass to handle this for us automatically:

from django.db import models
from django.db.models.base import ModelBase
from django.db.models.fields import FieldDoesNotExist
from django.db.models.related import RelatedObject

class InheritanceMetaclass(ModelBase):
    def __call__(cls, *args, **kwargs):
        obj = super(InheritanceMetaclass, cls).__call__(*args, **kwargs)
        return obj.get_object()

class InheritanceMixIn(models.Model):
    __metaclass__ = InheritanceMetaclass
    ...

Here we've created a fairly trivial metaclass that subclasses the default one Django uses for it's models. The only method we've written is __call__, on a metalcass what __call__ does is handle the instantiation of an object, so it would call __init__. What we do is do whatever the default __call__ does, so that we get an instances as normal, and then we call the get_object() method we wrote earlier and return it, and that's all.

We've now looked at 2 ways to handle polymorphism, with this way being more efficient in all cases(ignoring the overhead of having the extra charfield). However, it still isn't totally efficient, and it fails in several edge cases. Whether automating the handling of something like this is a good idea is something that needs to be considered on a project by project basis, as the extra queries can be a large overhead, however, they may not be avoidable in which case automating it is probably advantages.

You can find the rest here. There are view comments.

Building a Magic Manager

Posted January 31st, 2009. Tagged with models, django, orm, python.

A very common pattern in Django is to create methods on a manager to abstract some usage of ones data. Some people take a second step and actually create a custom QuerySet subclass with these methods and have their manager proxy these methods to the QuerySet, this pattern is seen in Eric Florenzano's Django From the Ground Up screencast. However, this requires a lot of repetition, it would be far less verbose if we could just define our methods once and have them available to us on both our managers and QuerySets.

Django's manager class has one hook for providing the QuerySet, so we'll start with this:

from django.db import models

class MagicManager(models.Manager):
   def get_query_set(self):
       qs = super(MagicManager, self).get_query_set()
       return qs

Here we have a very simple get_query_set method, it doesn't do anything but return it's parent's queryset. Now we need to actually get the methods defined on our class onto the queryset:

class MagicManager(models.Manager):
   def get_query_set(self):
       qs = super(MagicManager, self).get_query_set()
       class _QuerySet(qs.__class__):
           pass
       for method in [attr for attr in dir(self) if not attr.startswith('__') and callable(getattr(self, attr)) and not hasattr(_QuerySet, attr)]:
           setattr(_QuerySet, method, getattr(self, method))
       qs.__class__ = _QuerySet
       return qs

The trick here is we dynamically create a subclass of whatever class the call to our parent's get_query_set method returns, then we take each attribute on ourself, and if the queryset doesn't have an attribute by that name, and if that attribute is a method then we assign it to our QuerySet subclass. Finally we set the __class__ attribute of the queryset to be our QuerySet subclass. The reason this works is when Django chains queryset methods it makes the copy of the queryset have the same class as the current one, so anything we add to our manager will not only be available on the immediately following queryset, but on any that follow due to chaining.

Now that we have this we can simply subclass it to add methods, and then add it to our models like a regular manager. Whether this is a good idea is a debatable issue, on the one hand having to write methods twice is a gross violation of Don't Repeat Yourself, however this is exceptionally implicit, which is a major violation of The Zen of Python.

You can find the rest here. There are view comments.

Optimizing a View

Posted January 19th, 2009. Tagged with python, compile, models, django, orm.

Lately I've been playing with a bit of a fun side project. I have about a year and half worth of my own chatlogs with friends(and 65,000 messages total) and I've been playing around with them to find interesting statistics. One facet of my communication with my friends is that we link each other lots of things, and we can always tell when someone is linking something that we've already seen. So I decided an interesting bit of information would be to see who is the worst offender.

So we want to write a function that returns the number of items each person has relinked, excluding items they themselves linked. So I started off with the most simple implementation I could, and this was the end result:

from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message

def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(int)
    for message in Message.objects.all().order_by('-time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                if Message.objects.filter(time__lt=message.time).filter(message__contains=word).exclude(speaker=message.speaker).count():
                    links[message.speaker] += 1
    links = sorted(links.iteritems(), key=itemgetter(1), reverse=True)
    return links

Here I iterated over the messages and for each one I went through each of the words and if any of them started with http(the definition of a link for my purposes) I checked to see if this had ever been linked before by someone other than the author of the current message.

This took about 4 minutes to execute on my dataset, it also executed about 10,000 SQL queries. This is clearly unacceptable, you can't have a view that takes that long to render, or hits your DB that hard. Even with aggressive caching this would have been unmaintainable. Further this algorithm is O(n**2) or thereabouts so as my dataset grows this would have gotten worse exponentially.

By changing this around however I was able to achieve far better results:

from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message

def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(set)
    counts = defaultdict(int)
    for message in Message.objects.all().filter(message__contains="http").order_by('time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                if any([word in links[speaker] for speaker in links if speaker != message.speaker]):
                    counts[message.speaker] += 1
                links[message.speaker].add(word)
    counts = sorted(counts.iteritems(), key=itemgetter(1), reverse=True)
    return counts

Here what I do is go through each of the messages which contain the string "http"(this is already a huge advantage since that means we process about 1/6 of the messages in Python that we originally were), for each message we go through each of the words in it, and for each that is a link we check if any other person has said it by looking in the caches we maintain in Python, and if they do we increment their count, finally we add the link to that persons cache.

By comparison this executes in .3 seconds, executes only 1 SQL query, and it will scale linearly(as well as is possible). For reference both of these functions are compiled using Cython. This ultimately takes almost no work to do and for computationally heavy operations this can provide a huge boon.

You can find the rest here. There are view comments.

Playing with Polymorphism in Django

Posted December 5th, 2008. Tagged with python, models, django, internals, orm.

One of the most common requests of people using inheritance in Django, is to have the a queryset from the baseclass return instances of the derives model, instead of those of the baseclass, as you might see with polymorphism in other languages. This is a leaky abstraction of the fact that our Python classes are actually representing rows in separate tables in a database. Django itself doesn't do this, because it would require expensive joins across all derived tables, which the user probably doesn't want in all situations. For now, however, we can create a function that given an instance of the baseclass returns an instance of the appropriate subclass, be aware that this will preform up to k queries, where k is the number of subclasses we have.

First let's set up some test models to work with:

from django.db import models

class Place(models.Model):
    name = models.CharField(max_length=50)

    def __unicode__(self):
        return u"%s the place" % self.name


class Restaurant(Place):
    serves_pizza = models.BooleanField()

    def __unicode__(self):
        return "%s the restaurant" % self.name

class Bar(Place):
    serves_wings = models.BooleanField()

    def __unicode__(self):
        return "%s the bar" % self.name

These are some fairly simple models that represents a common inheritance pattern. Now what we want to do is be able to get an instance of the correct subclass for a given instance of Place. To do this we'll create a mixin class, so that we can use this with other classes.

class InheritanceMixIn(object):
    def get_object(self):
        ...

class Place(models.Model, InheritanceMixIn):
    ...

So what do we need to do in our get_object method? Basically we need to loop each of the subclasses, try to get the correct attribute and return it if it's there, if none of them are there, we should just return ourself. We start by looping over the fields:

class InheritanceMixIn(object):
    def get_object(self):
        for f in self._meta.get_all_field_names():
            field = self._meta.get_field_by_name(f)[0]

_meta is where Django stores lots of the internal data about a mode, so we get all of the field names, this includes the names of the reverse descriptors that related models provide. Then we get the actual field for each of these names. Now that we have each of the fields we need to test if it's one of the reverse descriptors for the subclasses:

from django.db.models.related import RelatedObject

class InheritanceMixIn(object):
    def get_object(self):
        for f in self._meta.get_all_field_names():
            field = self._meta.get_field_by_name(f)[0]
            if isinstance(field, RelatedObject) and field.field.primary_key:

We first test if the field is a RelatedObject, and if it we see if the field on the other model is a primary key, which it will be if it's a subclass(or technically any one to one that is a primary key). Lastly we need to find what the name of that attribute is on our model and to try to return it:

class InheritanceMixIn(object):
    def get_object(self):
        for f in self._meta.get_all_field_names():
            field = self._meta.get_field_by_name(f)[0]
            if isinstance(field, RelatedObject) and field.field.primary_key:
                try:
                    return getattr(self, field.get_accessor_name())
                except field.model.DoesNotExist:
                    pass
        return self

We try to return the attribute, and if it raises a DoesNotExist exception we move on to the next one, if none of them return anything, we just return ourself.

And that's all it takes. This won't be super efficient, since for a queryset of n objects, this will take O(n*k) given k subclasses. Ticket 7270 deals with allowing select_related() to work across reverse one to one relations as well, which will allow one to optimise this, since the subclasses would already be gotten from the database.

You can find the rest here. There are view comments.

A timeline view in Django

Posted November 24th, 2008. Tagged with python, models, tips, django, orm.

One thing a lot of people want to do in Django is to have a timeline view, that shows all the objects of a given set of models ordered by a common key. Unfortunately the Django ORM doesn't have a way of representing this type of query. There are a few techniques people use to solve this. One is to have all of the models inherit from a common baseclass that stores all the common information, and has a method to get the actual object. The problem with this is that it could execute either O(N) or O(N*k) queries, where N is the number of items and k is the number of models. It's N if your baseclass has the subtype it is stored on it, in which case you can directly grab it, else it's N*k since you have to try each type. Another approach is to use a generic relation, this will also need O(N) queries since you need to get the related object for each generic one. However, there's a better solution.

What we can do is use get a queryset for each of the models we want to display(O(k) queries), sorted on the correct key, and then use a simple merge to combine all of these querysets into a single list, comparing on a given key. While this technically may do more operations than the other methods, it does fewer database queries, and this is often the most difficult portion of your application to scale.

Let's say we have 3 models, new tickets, changesets, and wikipage edits(what you see in a typical Trac install). We can get our querysets and then merge them like so:

def my_view(request):
   tickets = Ticket.objects.order_by('create_date')
   wikis = WikiEdit.objects.order_by('create_date')
   changesets = Changeset.objects.order_by('create_date')
   objs = merge(tickets, wikis, changesets, field='create_date')
   return render_to_response('my_app/template.html', {'objects': objs})

Now we just need to write our merge function:

def merge_lists(left, right, field=None):
    i, j = 0, 0
    result = []
    while i:
        if getattr(left[i], field):
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge(*querysets, **kwargs):
    field = kwargs.pop('field')
    if field is None:
        raise TypeError('you need to provide a key to do comparisons on')
    if len(querysets) == 1:
        return querysets[0]

    qs = [list(x) for x in querysets]
    q1, q2 = qs.pop(), qs.pop()
    result = merge_lists(q1, q2, field)
    for q in qs:
        result = merge_lists(result, q)
    return result

There might be a more efficient way to write our merge function, but for now it merges together an arbitrary number of querysets on a given key.

And that's all their is too it. If you see a good way to make the merge function more efficient let me know, I would have liked to use Python's included heapq module, but it doesn't have a way to use a custom comparison function that I saw.

You can find the rest here. There are view comments.

A quick update

Posted November 23rd, 2008. Tagged with c++, al, python, compile.

I've now set up Al to be using GMP for all integers, and I'll be doing the same for floats once they get implemented. I haven't started benchmarking yet, but it can compile and calculate the factorial of 50000 pretty quickly, and in Python vanilla that would result in a RuntimeError due to a stack overflow, so it's a good starting point. Sorry for such a short post, I'm pretty tired today.

You can find the rest here. There are view comments.

My Programming Language - Status Update

Posted November 21st, 2008. Tagged with c++, al, python, compile.

Over the past few weeks I've been working on compiling my programming language. At present it works by translating the source into C++, and then you compile that with your compiler of choice. It's garbage collected, using the excellent Boehm GC library. At present it can only compile a limited subset of what it can actually parse, or what the interpreter supports. As of today thought it can compile and run a factorial function, however it can't calculate any factorial greater than 12, due to integer overflow issues. To solve this I'm either going to use GMP or roll my own Bignum library, and I'm not sure which yet. On the whole though, progress is good. The generated C++ is about as good as it could be considering the limitations inherent in turning an interpreted language into a compiled one. I haven't started benchmarking it yet, that was originally going to be the point of today's post before I ran into the integer overflow issues, however this is an example of the C++ code that is generated.

Give this Al(also valid Python):

def fact(n):
   if n == 1 or n == 0:
       return 1
   return n * fact(n-1)

print(fact(1))
print(fact(12))

It generated the following C++:

#include "src/base.h"

AlObj *fact;
class f0:public AlFunction
{
public:
 virtual AlObj * operator () (ARG_TYPE args, KWARG_TYPE kwargs)
 {
   AlObj *n = args.back ();
     args.pop_back ();
   if (*
 ((*((*(n)) == (AlObj *) (new AlInt (1))))
  || (*(n)) == (AlObj *) (new AlInt (0))))
     {
 return (AlObj *) (new AlInt (1));;
     }
   ARG_TYPE t0;
   t0.push_back ((*(n)) - (AlObj *) (new AlInt (1)));
   return (*(n)) * (*fact) (t0, KWARG_TYPE ());
 }
};

int
main ()
{
 fact = new f0 ();
 ARG_TYPE t1;
 ARG_TYPE t2;
 t2.push_back ((AlObj *) (new AlInt (1)));
 t1.push_back ((*fact) (t2, KWARG_TYPE ()));
 (*print) (t1, KWARG_TYPE ());
 ARG_TYPE t3;
 ARG_TYPE t4;
 t4.push_back ((AlObj *) (new AlInt (12)));
 t3.push_back ((*fact) (t4, KWARG_TYPE ()));
 (*print) (t3, KWARG_TYPE ());
}

All said and done, I'm pretty impressed! You can get all the code here, all the compilation work is in the code-generation branch.

You can find the rest here. There are view comments.

Why I don't use easy_install

Posted November 20th, 2008. Tagged with ubuntu, easy_install, python.

First things first, this post is not meant as a flame, nor should it indicate to you that you shouldn't use it, unless of course you're priorities are perfectly aligned with my own. That being said, here are the reasons why I don't use easy_install, and how I'd fix them.
  • No easy_uninstall. Zed mentioned this in his PyCon '08 lightning talk, and it's still true. Yes I can simply remove these files, and yeah I could write a script to do it for me. But I shouldn't have to, if I can install packages, I should be able to uninstall packages, without doing any work.
  • I can't update all of my currently installed packages. For any packages I don't have explicitly installed to a particular version(which to it's credit, easy_install makes very easy to do), it should be very to upgrade all of these, because I probably want to have them up to date, and I can always lock them at a specific version if I want.
  • I don't want to have two package managers on my machine. I run Ubuntu, so I already have apt-get, which i find to be a really good system(and doesn't suffer from either of the aforementioned problems). Having two packages managers inherently brings additional confusion, if a package is available in both which do I install it from? It's an extra thing to remember to keep up to date(assuming #2 is fixed), and it's, in general, an extra thing to think about, every time I go to update anything on my machine.

So what's my solution? PyPi is a tremendous resource for Python libraries, and there are great tools in Python for working with it, for example using a setup.py file makes it incredibly easy to get your package up on PyPi, and keep it up to date. So there's no reason to throw all that stuff out the window. My solution would be for someone to set up a server that mirrored all the data from PyPi, regularly, and then offered the packages as .deb's(for Debian/Ubuntu users, and as RPMs for Fedora users, etc...). That way all a user of a given package manager can just add the URL to their sources list, and then install everything that's available from PyPi, plus they derive all of the benefits of their given package manager(for me personally, the ability to uninstall and batch upgrade).

Note: I'm not suggesting everyone use apt-get, I'm merely suggesting everyone use their native package manager, and there's no reason easy_install/pip/virtualenv can't also be used.

You can find the rest here. There are view comments.

Uncoupled code is good, but doesn't exist

Posted November 19th, 2008. Tagged with python, models, django, orm, turbogears.

Code should try to be as decoupled from the code it depends as possible, I want me C++ to work with any compiler, I want my web framework to work with any ORM, I want my ORM to work with any database. While all of these are achievable goals, some of the decoupling people are searching for is simply not possible. At DjangoCon 2008 Mark Ramm made the argument that the Django community was too segregated from the Python community, both in terms of the community itself, and the code, Django for example doesn't take enough advantage of WSGI level middlewear, and has and ORM unto itself. I believe some of these claims to be true, but I ultiamtely thing the level of uncoupling some people want is simply impossible.

One of Django's biggest selling features has always been it's automatically generated admin. The admin requires you to be using Django's models. Some people would like it to be decoupled. To them I ask, how? It's not as if Django's admin has a big if not isinstance(obj, models.Model): raise Exception, it simply expects whatever is passed to it to define the same API as it uses. And this larger conecern, the Django admin is simply an application, it has no hooks within Django itself, it just happens to live in that namespace, the moment any application does Model.objects.all(), it's no longer ORM agnostic, it's already assumed the usage of a Django ORM. However, all this means is that applications themselves are inextricably tied to a given ORM, templating language, and any other module they import, you quite simply can't write resonably code that works just as well with two different modules unless they both define the same API.

Eric Florenzano wrote a great blog post yesterday about how Django could take better advantage of WSGI middleware, and he's absolutely correct. It makes no sense for a Django project to have it's own special middlewear for using Python's profiling modules, when it can be done more generic a level up, all the code is in Python afterall. However, there are also things that you can't abstract out like that, because they require a knowledge of what components you are using, SQLAlchemy has one transation model, Django has another.

The fact that an application is tied to the modules it uses is not an argument against it. A Django application is no tightly coupled to Django's ORM and template system is than a Turbo Gears application that uses SQL Alchemy and Mako, which is to say of course they're tied to it, they import those modules, they use them, and unless the other implements the same API you can't just swap them out. And that's not a bad thing.

You can find the rest here. There are view comments.

What Python learned from economics

Posted November 18th, 2008. Tagged with economics, django, python.

I find economics to be a fairly interesting subject, mind you I'm bored out of my mind about hearing about the stock markets, derivatives, and whatever else is on CNBC, but I find what guys like Steven Levitt and Steve E. Landsburg do to be fascinating. A lot of what they write about is why people do what they do, and how to incentivise people to do the right thing. Yesterday I was reading through David Goodger's Code Like a Pythonista when I got to this portion:

LUKE: Is from module import * better than explicit imports?

YODA: No, not better. Quicker, easier, more seductive.

LUKE: But how will I know why explicit imports are better than the wild-card form?

YODA: Know you will when your code you try to read six months from now.

And I realized that Python had learned a lot from these economists.

It's often dificult for a programmer to see the advantage of doing something the right way, which will be benneficial in six months, over just getting something done now. However, Python enforces doing things the right way, and when doing things the right way is just as easy as doing in the wrong way, you make the intuitive decision of doing things the right way. Almost every code base I've worked with(outside of Python) had some basic indentation rules that the code observed, Python just encodes this into the language, which requires all code to have a certain level of readability.

Django has also learned this lesson. For example, the template language flat out prevents you from putting your business logic inside of it without doing some real work, you don't want to do that work, so you do things the right way and put your business logic in your views. Another example would be database queries, in Django it would be harder to write a query that injected unescaped into your SQL than it would be do the right thing at use parameterized queries.

Ultimately, this is why I like Python. The belief that best practices shouldn't be optional, and that they shouldn't be difficult creates a community where you actively want to go and learn from people's code. Newcomers to the language aren't encouraged to "just get something working, and then clean it up later," the communiity encourages them to do it right in the first place, and save themselves the time later.

You can find the rest here. There are view comments.

Python Things

Posted November 15th, 2008. Tagged with python, tips.

I wasn't really sure what to name today's post, but it's basically going to be nifty things you can do in Python, and general tips.

  • SystemExit, sys.exit() raises SystemExit, if you actually want to keep going, you can just catch this exception, nothing special about it.
  • iter(callable, terminal), basically if you use iter in this way, it will keep calling the callable until the callable returns terminal, than it beaks.
  • a &lt; x &lt; b , in Python you can chain comparison operators like this. That's the same as writing a &lt; x and x &lt; b.
  • dict(), amongst the other ways to instantiate a dictionary in Python, you can give it a list of two tuples, so for example, [('a', 2), ('b', 3')] becomes {'a': 2, 'b': 3}.
  • open(filename), is an iterable, each iteration yields another line.
  • If you don't need ordering, use set() instead of list(). set() has better runtime for just about every operation, so if you don't need the ordering, use it.
  • Python comes with turtle graphics. This probably doesn't matter to most people, but if you want to help get a kid into programming, import turtle can be a great way.
  • pdb, the Python debugger is simply invaluable, try: code that isn't working, except ExceptionThatGetsRaised: import pdb; pdb.set_trace() is all it takes to get started with the itneractive debugger.
  • webbrowser.open(url), this module is just cool, it opens up the users browser to the desired URL.

And those are my tips! Please share yours.

You can find the rest here. There are view comments.

And now for the disclaimer

Posted November 14th, 2008. Tagged with disclaimer, django, python, internals.

I've been discussing portions of how the Django internals work, and this is powerful knowledge for a Django user. However, it's also internals, and unless they are documented internals are not guaranteed to continue to work. That doesn't mean they break very frequently, they don't, but you should be aware that it has no guarantee of compatibility going forward.

Having said that, I've already discussed ways you can use this to do powerful things by using these, and you've probably seen other ways to use these in your own code. In my development I don't really balk at the idea of using the internals, because I track Django's development very aggressively, and I can update my code as necessary, but for a lot of developers that isn't really an options. Once you deploy something you need it to work, so your options are to either lock your code at a specific version of Django, or not using these internals. What happens if you want to update to Django 1.1 for aggregation support, but 1.1 also removed some internal helper function you were using. Something similar to this happened to django-tagging, before the queryset-refactor branch was merged into trunk there was a helper function to parse portions of the query, and django-tagging made use of this. However, queryset-refactor obsoleted this function and removed, and so django-tagging had to update in order to work going forward, needing to either handle this situation in the code itself, or to maintain two separate branches.

In my opinion, while these things may break, they are worth using if you need them, because they let you do very powerful things. This may not be the answer for everyone though. In any event I'm going to continue writing about them, and if they interest you Marty Alchin has a book coming out, named Pro Django, that looks like it will cover a lot of these.

You can find the rest here. There are view comments.

Django Models - Digging a Little Deeper

Posted November 13th, 2008. Tagged with foreignkey, python, models, django, orm, metaclass.

For those of you who read my last post on Django models you probably noticed that I skirted over a few details, specifically for quite a few items I said we, "added them to the new class". But what exactly does that entail? Here I'm going to look at the add_to_class method that's present on the ModelBase metaclass we look at earlier, and the contribute_to_class method that's present on a number of classes throughout Django.

So first, the add_to_class method. This is called for each item we add to the new class, and what it does is if that has a contribute_to_class method than we call that with the new class, and it's name(the name it should attach itself to the new class as) as arguments. Otherwise we simply set that attribute to that value on the new class. So for example new_class.add_to_class('abc', 3), 3 doesn't have a contribute_to_class method, so we just do setattr(new_class, 'abc', 3).

The contribute_to_class method is more common for things you set on your class, like Fields or Managers. The contribute_to_class method on these objects is responsible for doing whatever is necessary to add it to the new class and do it's setup. If you remember from my first blog post about User Foreign Keys, we used the contribute_to_class method to add a new manager to our class. Here we're going to look at what a few of the builtin contribute_to_class methods do.

The first case is a manager. The manager sets it's model attribute to be the model it's added to. Then it checks to see whether or not the model already has an _default_manager attribute, if it doesn't, or if it's creation counter is lower than that of the current creation counter, it sets itself as the default manager on the new class. The creation counter is essentially a way for Django to keep track of which manager was added to the model first. Lastly, if this is an abstract model, it adds itself to the abstract_managers list in _meta on the model.

The next case is if the object is a field, different fields actually do slightly different things, but first we'll cover the general field case. It also, first, sets a few of it's internal attributes, to know what it's name is on the new model, additionally calculating it's column name in the db, and it's verbose_name if one isn't explicitly provided. Next it calls add_field on _meta of the model to add itself to _meta. Lastly, if the field has choices, it sets the get_FIELD_display method on the class.

Another case is for file fields. They do everything a normal field does, plus some more stuff. They also add a FileDescriptor to the new class, and they also add a signal receiver so that when an instance of the model is deleted the file also gets deleted.

The final case is for related fields. This is also the most complicated case. I won't describe exactly what this code does, but it's biggest responsibility is to set up the reverse descriptors on the related model, those are nice things that let you author_obj.books.all().

Hopefully this gives you a good idea of what to do if you wanted to create a new field like object in Django. For another example of using these techniques, take a look at the generic foreign key field in django.contrib.contenttypes, here.

You can find the rest here. There are view comments.

What software do I use?

Posted November 12th, 2008. Tagged with ubuntu, gtk, django, python.

Taking a page from Brain Rosner's book, today I'm going to overview the software I use day to day. I'm only going to cover stuff I use under Ubuntu, I keep Windows XP on my system for gaming, but I'm not going to cover it here.

  • Ubuntu, I've been using the current version, Intrepid Ibex, since Alpha 4, and I love it. You quite simply couldn't get me to go back to Windows.
  • Python, it's my go to language, I fell in love about 14 months ago and I'm never going to leave it.
  • Django, it's my framework of choice, it's simple, clean, and well designed.
  • g++, C++ is the language used in my CS class, so I use my favorite free compiler.
  • gnome-do, this is an incredibly handy application, similar to Quicksilver for OS X, it makes simple things super fast, stuff like spawning the terminal, posting a tweet, searching for a file, or calling on Google's awesome calculator.
  • Firefox, the tried and true free browser, I also have to thank, Gmail Notifier, Firebug, Download them All, and Reload Every.
  • Chatzilla, I figured this extension deserved it's own mention, I use it almost 24/7 and couldn't live without it.
  • Gedit, who would think that the text editor that came with my OS would be so great?
  • VLC and Totem, you guys are both great, VLC is a bit nicer for playing flvs, but I love Totem's ability to search and play movies from Youtube.
  • Skype, makes it easy to get conference calls going with 5 friends, couldn't live without it.

As you can see most of the software I use is open source. I don't imagine anything I use is very outside the mainstream, but all of these projections deserve a round of applause for being great.

You can find the rest here. There are view comments.

How the Heck do Django Models Work

Posted November 10th, 2008. Tagged with models, django, python, metaclass.

Anyone who has used Django for just about any length of time has probably used a Django model, and possibly wondered how it works. The key to the whole thing is what's known as a metaclass, a metaclass is essentially a class that defines how a class is created. All the code for this occurs is here. And without further ado, let's see what this does.

So the first thing to look at is the method, __new__, __new__ is sort of like __init__, except instead of returning an instance of the class, it returns a new class. You can sort of see this is the argument signature, it takes cls, name, bases, and attrs. Where __init__ takes self, __new__ takes cls. Name is a string which is the name of the class, bases is the class that this new class is a subclass of, and attrs is a dictionary mapping names to class attributes.

The first thing the __new__ method does is check if the new class is a subclass of ModelBase, and if it's not, it bails out, and returns a normal class. The next thing is it gets the module of the class, and sets the attribute on the new class(this is going to be a recurring theme, getting something from the original class, and putting it in the right place on the new class). Then it checks if it has a Meta class(where you define your Model level options), it has to look in two places for this, first in the attrs dictionary, this is where it will be if you stick your class Meta inside your class. However, because of inheritance, we also have to check if the class has an _meta attribute already(this is where Django ultimately stores a bunch of internal information), and handle that scenario as well.

Next we get the app_label attribute, for this we either use the app_label attribute in the Meta class, or we pull it out of sys.modules. Lastly(at least for Meta), we build an instance of the Options class(which lives at django.db.models.options.Option) and add it to the new class as _meta. Next, if this class isn't an abstract base class we add the DoesNotExist and MultipleObjectsReturned exceptions to the class, and also inherit the ordering and get_latest_by attributes if we are a subclass.

Now we start getting to adding fields and stuff. First we check if we have an _default_manager attribute, and if not, we set it to None. Next we check if we've already defined the class, and if we have, we just return the class we already created. Now we go through each item that's left in the attrs dictionary and class the add_to_class method with it on the new class. add_to_class is a piece of internals that you may recognize from my first two blog posts, and what exactly it does I'll explain exactly what it does in another most, but at it's most basic level it adds each item in the dictionary to the new class, and each item knows where exactly it needs to get put.

Now we do a bunch of stuff to deal with inherited models. We iterate through ever item in bases, that's also a subclass of models.Model, and do the following: if it doesn't have an _meta attribute, we ignore it. If the parent isn't an abstract base class, if we already have a OneToOne field to it we set it up as a primary key, otherwise we create a new OneToOne field and install it as a primary key for the model. And now, if it is an abstract class, we iterate through the fields, if any of these fields has a name that is already defined on our class, we raise an error, otherwise we add that field to our class. And now we move managers from the parents down to the new class. Essentially we juts copy them over, and we also copy over virtual fields(these are things like GenericForeignKeys, which doesn't actually have a database field, but we still need to pass down and setup appropriately).

And then we do a few final pieces of cleanup. We make sure our new class doesn't have abstract=True in it's _meta, even if it's inherited from an abstract class. We add a few methods(get_next_in_order, and other), we inherit the docstring, or set a new one, and we send the class prepared signal. Finally, we register the model with Django's model loading system, and return the instance in Django's model cache, this is to make sure we don't have duplicate copies of the class floating around.

And that's it! Obviously I've skirted over how exactly somethings occur, but you should have a basic idea of what occurs. As always with Django, the source is an excellent resource. Hopefully you have a better idea of what exactly happens when you subclass models.Model now.

You can find the rest here. There are view comments.

Getting Started With PLY - Part 3

Posted November 10th, 2008. Tagged with lex, python, ply, yacc.

As promised, today we'll be looking at implementing additional arithmetic operations, dealing with order of operations, and adding variables to our languages, so without further ado, let's jump into the code.

We can replace our old addition rule with this:

import operator
def p_expression_arithmetic(p):
   '''
   expression : expression PLUS expression
              | expression MINUS expression
              | expression TIMES expression
              | expression DIVIDE expression
   '''
   OPS = {
       '+': operator.add,
       '-': operator.sub,
       '*': operator.mul,
       '/': operator.div
   }
   p[0] = OPS[p[2]](p[1], p[3])

Hopefully what this code does is pretty clear, the | operator in the rule is an or option. So if we match any of these, we get the correct function out of our ops dictionary(if you aren't familiar with operator module check it out, it's awesome), and then call it with the two arguments.

This handles the arithmetic correctly, but doesn't handle order of operations, so lets add that in:

precedence = (
   ('left', 'PLUS', 'MINUS'),
   ('left', 'TIMES', 'DIVIDE'),
)

What this says is all these operations are left-associative, and TIMES and DIVIDE have a high precedence than PLUS and MINUS(both groupings have equal precedence, and thus read left to right).

Now that we have a fully functioning calculator, let's add in variables, first we need to add a token for NAMES (variables) and for the assignment operator:

def t_NAME(t):
   r'[a-zA-Z_][a-zA-Z_0-9]*'
   return t

t_EQ = r'='

And of course add NAME and EQ to the list of tokens, and now a few parsing rules:

names = {}

def p_expression_name(p):
   '''
   expression : NAME
   '''
   p[0] = names[p[1]]

def p_assignment(p):
   '''
   assignment : NAME EQ expression
   '''
   names[p[1]] = p[3]

So here we define a names dictionary, it will map variables to values. Hopefully the parse rules are fairly obvious, and everything makes sense.

You can find the rest here. There are view comments.

Getting Started With PLY - Part 2

Posted November 9th, 2008. Tagged with yacc, python, ply.

Yesterday we created our tokens, and using these we can parse our language (which right now is a calculator) into some tokens. Unfortunately this isn't very useful. So today we are going to start writing a grammar, and building an interpreter around it.

In PLY, grammar rules are defined similarly to tokens, that is, using docstrings. Here's what a few grammar rules for out language might look like:

def p_expression_plus(p):
    '''
    expression : expression PLUS expression
    '''
    p[0] = p[1] + t[3]

def p_expression_number(p):
    '''
    expression : NUMBER
    '''
    p[0] = [1]

So the first docstring works is, an expression is defined as expression PLUS expression. Here PLUS is the token we defined earlier, and expression is any other way we've defined expression, so an expression is also a number (which is the token we defined earlier). The way the code works is essentially that p[0] is the result, and each piece of the definition is it's own subscript, so p[1] and p[3] refer to the two expression in the plus expression we defined.

To actually use this parser we've defined we do:

parser = yacc.yacc()
if __name__ == '__main__':
   while True:
       try:
           s = raw_input('calc > ')
       except EOFError:
           break
       if not s:
           continue
       result = parser.parse(s)
       print result

Try it out! As an exercise, the reader can implement other operations (remember the order of operations!), and perhaps variables. Tomorrow, I'll be discussing implementing these. As always, the PLY documentation is excellent, and available here.

You can find the rest here. There are view comments.

Getting Started With PLY

Posted November 8th, 2008. Tagged with lex, python, ply.

The other day I mentioned I was using PLY in my post about building a language, so today I'm going to describe getting started with PLY, specifically the tokenization phase. For those who don't know much about parsing a language, the tokenization phase is where we take the source file, and turn it into a series of tokens. For example, turning a = 3 + 4 into, NAME EQUALS 3 PLUS 4. As you can see that simple assignment becomes 5 tokens, each number is a token, both numbers are tokens, and a is a NAME token. So how do we do this in PLY?

PLY's method for defining tokenization rules is very creative. First you define a list of tokens, for example:

tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
)

Here we have defined the types of tokens we will define, what each of these is should be self explanatory. Then we define some rules, they look like this:

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        t.value = 0
    return t

This is probably less obvious. There are 2 ways to define the rules for a token, either as a string, or as a function. Either way they are named t_TOKEN_NAME. For a lot of tokens you can just do the string, those are the ones that don't require processing, and the string is just a regex that matches the token. For things that do need processing, we can define a function. The function takes 1 parameter, which is a lexer object, as you can see in our example, we take in t, and since we are defining a number token we set the value to be the integer for the string representation from the source code. The interesting thing here is how we define the rule for, PLY uses the docstring for a function to get the regex for it.

Now that we have all of our rules set up we need to actually build the lexer object:

lexer = lex.lex()

And then we can use the input() function on a lexer to provide the source code, and the token function to pop the next token off the lexer.

That's all for today, in the future we'll take a look at the other components of building the grammar of a language, and at how we implement it. For more information now, PLY has excellent documentation, available here.

You can find the rest here. There are view comments.

That's not change we can believe in

Posted November 7th, 2008. Tagged with php, django, python, obama.

Yesterday president-elect Obama's campaign unveiled their transitional website, change.gov. So, as someone who's interested in these things I immediately began to look at what language, framework, or software package they were using. The first thing I saw was that they were using Apache, however beyond that there were no distinctive headers. None of the pages had tell-tale extensions like .php or .aspx. However, one thing that struck me was that most pages were at a url in the form of /page/*/, which is the same format of the Obama campaign website, which I knew was powered by Blue State Digital's CMS. On the Obama campaign's site however, there were a few pages with those tell-tale .php extension, so I've come to the conclusion that the new site also uses PHP. And to that I say, that's not change we can believe in.

PHP has been something of a powerhouse in web development for the last few years, noted for it's ease of deployment and quick startup times, it's drawn in legions of new users. However, PHP has several notable flaws. Firstly, it doesn't encourage best practices, ranging from things like code organization (PHP currently has no concept of namespaces), to database security (the included mysql database adapter doesn't feature parameterized queries), and beyond. However, this isn't just another post to bash on PHP (as much as I'd like to do one), there are already plenty of those out there. This post is instead to offer some of the benefits of switching, to Python, or Ruby, or whatever else.

  • You develop faster. Using a framework like Django, or Rails, or TurboGears let's you do things very quickly.
  • You get the benefits of the community, with Django you get all the reusable applications, Rails has plugins, TurboGears has middleware. Things like these quite simply don't exist in the PHP world.
  • You get a philosophy. As far as I can tell, PHP has no philosophy, however both Python and Ruby do, and so do their respective frameworks. Working within a consistant philsophy makes development remarkably more sane.

If you currently are a user of PHP, I beg of you, take a chance, try out Ruby or Python, or whatever else. Give Django, or TurboGears, or Rails a shot. Even if you don't end up liking it, or switching, it's worth giving it a shot.

You can find the rest here. There are view comments.

Building a Programming Language with Python

Posted November 6th, 2008. Tagged with compile, python, ply.

One of my side projects of late has been building a programming language in Python, using the PLY library. PLY is essentially a Python implementation of the classic Lex and Yacc tools. The language, at present, has a syntax almost exactly the same as Python's (the notable difference, in so far as features that have been implemented, is that you are not allowed to have multiple statements on the same line, or to put anything following a colon on the same line). The language(currently called 'Al' although that's more of a working name), is a dynamic language that builds up an syntax tree for the code, and than executes it. However, the long term goal is to have it actually be a compiled language, similar to Lisp or C. Essentially the mechanism for doing this will be the same as how a C++ compiler handles multiple dispatch, which is dynamically at run time.

At present however, this mythical fully compiled language is far from complete, I haven't even began to think about the assembly generation, mostly because I don't know assembly at all, and one of the courses I will be taking next semester is one which covers assembly code. However, the question that has to be asked, are what are the advantages of a compiled language, and what are the costs?

First the benefits:

  • It's faster, even a worst case C++ program that fully utilizes multiple dispatch at runtime will go faster than a program using the same algorithms in Python.
  • You get an executable at the end. This is a huge advantage for distribution, you don't need to distribute the source code, and you have an exe to give to people.

There are probably others, but I'm assuming the semantics of a language similar to Python, so I haven't included things like compile time type checking. And now the disadvantages:

  • You lose some of the dynamicism. Doing things like eval(), or dynamic imports is inherently harder, if not impossible.
  • You lose the REPL (interactive interpreter).

So can we overcome those? As far as I can tell the first should be doable, eval() necessitates the inclusion of an interpreter with the language, the thought of this already has to be making people think this is just going to end up as a VM. But, I think this can be overcome, we can know, at compile time, whether or not a user will be using eval, and decide then whether or not to compile the interpreter and link against it. Dynamic imports are, if anything harder, I think, I think this is just an issue of doing run time linking, but I'm not sure. As for the issue of the REPL, this is a non-issue as far as I'm concerned, there is no inherent reason a compiled language can't have a REPL, we just often don't, languages like Common Lisp have long had both.

So now, let's see some code. I hope to have some code to show off, that handles at least a subset of Python, for PyCon 2009, as work begins on assembly generation I will post here. For anyone interested in the code at present, you can see it here.

You can find the rest here. There are view comments.

PyGTK and Multiprocessing

Posted November 5th, 2008. Tagged with multiprocessing, pygtk, gtk, python.

Yesterday was election day, and for many people that meant long nights following the results, waiting to see who would be declared the next president of the United States of America. Politics is a game of numbers, and it's nice to offload the crunching to our computers. I had written up a simple application for projecting win likelihood for the candidates based on the likelihood of a win in an individual state. If you are interested in the application itself you can see it here. However this post is going to look at the new multiprocessing library, and how I used it with PyGTK.

Part of my application is that whenever you update a probability for a given candidate in a given state it recomputes their win percentage for the election as a whole. To make this as accurate as possible it runs multiple simulations of the scenario to compute the win percentage. Originally I was running these computations in the same thread as the GUI work and I found that I could only do about 250 simulations before it had a drastically negative impact on usability. So the next step was to offload these calculations into another process.

To go about this I created an Updater class which is a subclass of multiprocessing.Process. It takes a pipe as it's only argument, and it's run method just loops forever polling the pipe for new projection results, tabulating them, and then sending the projection back through the pipe.

In the main process the application obviously starts by creating a duplex pipe, spawning the second process (and giving it the pipe). Then, using the facilities of the gobject library, it sets up a method that checks for new projection results and updates the GUI to be executed whenever the main thread is idle(gobject.idle_add). And lastly the signal responder that gets called whenever the user changes some data simply marshals up the necessary data, and sets it through the pipe to the other process.

And that's all, in total I believe it was under 25 lines of code changed to make my application use a separate process for calculation.

Edit: Upon request, this is the diff where I made the original changes, several subsequent commits will better reflect what is described here though.

You can find the rest here. There are view comments.