alex gaynor's blago-blog

Posts tagged with programming

Why I support diversity

Posted August 28th, 2013. Tagged with python, diversity, community, programming, django.

I get asked from time to time why I care about diversity in the communities I'm a part of, particularly the Django, Python, and the broader software development and open source community.

There's a lot of good answers. The simplest one, and the one I imagine just about everyone can get behind: diverse groups perform better at creative tasks. A group composed of people from different backgrounds will do better work than a homogeneous group.

But that's not the main reason I care. I care because anyone who knows how to read some statistics knows that it's ridiculous that I'm where I am today. I have a very comfortable job and life, many great friends, and the opportunity to travel and to spend my time on the things I care about. And that's obscenely anomalous for a high school dropout like me.

All of that opportunity is because when I showed up to some open source communities no one cared that I was a high school dropout, they just cared about the fact that I seemed to be interested, wanted to help, and wanted to learn. I particularly benefited from the stereotype of white dropouts, which is considerably more charitable than (for example) the stereotype of African American dropouts.

Unfortunately, our communities aren't universally welcoming, aren't universally nice, and aren't universally thoughtful and caring. Not everyone has the same first experience I did. In particular, people who don't look like me, who aren't white males, disproportionately don't have this positive experience. But everyone ought to. (This is to say nothing of the fact that I had more access to computers at a younger age than most people.)

That's why I care. Because I benefited from so much, and many aren't able to.

This is why I support the Ada Initiative. I've had the opportunity to see their work up close twice: once as a participant in Ada Camp San Francisco's Allies Track, and a second time in getting their advice while writing the Code of Conduct for the Django community. They're doing fantastic work to support more diversity and more welcoming communities.

Right now they're raising funds to support their operations for the next year; if you can afford to, I hope you'll donate: http://supportada.org


An open letter to the security community

Posted August 3rd, 2013. Tagged with programming, security.

Your community appears to be a disaster. I've already read about the following happening in your community/conferences:

I can only conclude you don't want to be welcoming to new users.

And so I don't participate. Congratulations. You've built a community so unwelcoming that otherwise interested developers won't be a part of it.

Friends of mine in the security community: I beg of you, don't enable this by continuing to attend events like this. Don't validate this behavior with your participation.

Security community, fix your shit.


Disambiguating BSON and msgpack

Posted February 16th, 2013. Tagged with python, programming.

I had a fairly fun project at work recently, so I thought I'd write about it. We currently store BSON blobs of data in many places. This is unfortunate, because BSON is bloated and slow (an array is internally stored as a dictionary mapping the strings "0", "1", "2", etc. to values). So we wanted to migrate to msgpack, which I've measured as requiring 46% of the space of BSON, and being significantly faster to deserialize (we aren't concerned with serialization speed, though I'm relatively confident that's faster as well).

The one trick we wanted to pull was to do the migration in place, that is, gradually rewrite all the columns' data from BSON to msgpack. This is only possible if any given blob can be unambiguously interpreted as one format or the other. So I was tasked with finding out whether this was possible.

The first thing that's important to know about BSON is that the first 4 bytes are the length of the entire document (in bytes) as a little-endian signed integer. msgpack has no such prefix; the first byte is merely the typecode for whatever the element is. At Rdio we know something about our data, though: BSON requires all top-level elements to be dictionaries, and since we're just re-serializing the same data, we know that all of these msgpack payloads will have a dictionary as the top-level object.

Because a BSON blob starts with its size in bytes, we're going to try to find the smallest possible 4-byte starting sequence (interpreted as an integer) that one of our msgpack payloads could have, in order to determine the smallest document size at which an ambiguity is even possible.

So the first case is the empty dictionary; in msgpack this is serialized as:

>>> import msgpack
>>> msgpack.packb({})
'\x80'

That's less than 4 bytes, and all BSONs are at least 4 bytes, so that can't be ambiguous. Now let's look at a dictionary with some content. Another thing we know about our payloads is that all the keys in the dictionaries are strings, and that the keys are alphanumeric or underscores. Looking at the msgpack spec, the smallest key (interpreted as its serialized integer value) that could exist is "0", since "0" has the lowest ASCII value of any letter, number, or underscore. Further, from the msgpack spec we know that the number 0 serializes as a single byte, so that will be the key's value. Let's see where this gets us:

>>> msgpack.packb({"0": 0})
'\x81\xa10\x00'

A 4-byte result, perfect; this is the smallest prefix we can generate. Let's see how large a document this prefix would claim to be if read as a BSON length:

>>> import struct
>>> struct.unpack('<l', '\x81\xa10\x00')
(3187073,)

3187073 bytes, or a little over 3 MB. To be honest, I'm not sure we have any keys that start with a number, so let's also try the key "a":

>>> msgpack.packb({"a": 0})
'\x81\xa1a\x00'
>>> struct.unpack('<l', '\x81\xa1a\x00')
(6398337,)

A little over 6 MB. Since I know that none of the payloads we store are anywhere close to this large, we can safely store either serialization format, and be able to interpret the result unambiguously as one or the other.

So our final detection code looks like:

import struct

import msgpack
from bson import BSON  # pymongo's BSON class

def deserialize(s):
    # a valid BSON blob's first 4 bytes are its total length, little-endian
    if len(s) >= 4 and struct.unpack('<l', s[:4])[0] == len(s):
        return BSON(s).decode()
    return msgpack.unpackb(s)
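
And a quick sanity check of the round trip (assuming pymongo's bson module and msgpack-python are installed):

>>> deserialize(BSON.encode({"a": 0}))
{u'a': 0}
>>> deserialize(msgpack.packb({"a": 0}))
{'a': 0}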

If this sounds like a fun kind of thing to do, you should apply to come work with me at Rdio.


The perils of polyglot programming

Posted December 23rd, 2011. Tagged with programming, djangocon, programming-languages.

Polyglot programming (the practice of knowing and using many programming languages) seems to be all the rage these days. Its adherents claim two benefits:

  1. Using the right tool for every job means everything you do is a little bit easier (or better, or faster, or all of the above).
  2. Knowing multiple programming paradigms expands your mind and makes you better at programming in every language.

I'm not going to dispute either of these. Well, maybe the second I'll argue with a little: I think you can get most of the benefits by using different paradigms within the same multi-paradigm language, and I'm a bit skeptical of the global benefits (unless you're the type of person who likes writing FORTRAN in Javascript). But I digress, like I said, I think those are both fair claims.

What I don't like is the conclusion that this means you should always use the right tool for the job. What, the astute reader asks, does this mean I think we should use the wrong tool for the job? No, that would be idiotic; it means I think sometimes using the less optimal tool for the job carries overall benefits.

So what are the dangers of being a polyglot programmer (or the benefits of not being one, if you will)?

Using multiple languages (or any technology) stresses your operations people. It's another piece they have to maintain. If you've got a nice JVM stack, top to bottom, with nice logging and monitoring, do you think your ops people really want to hear that they need to duplicate that setup so you can run three Ruby cron jobs? No, they're going to tell you to suck it up and either see if JRuby works or use Clojure or something, because 1% of your company's code isn't worth doubling their work.

Another risk is that it raises the requirements for all the other developers on the project. Martin Golding said, "Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live." Imagine when you leave your job and the next guy finds out you decided to write some data analysis scripts in APL (for those of you who don't remember, APL is that lovely language that doesn't use ASCII characters). It's fine to use APL if that's something you can require of new hires, it's not fine when your job says "Python developer" (it may actually work for Perl developers, but I assure you it'll be purely coincidental). Learning a new language is hard, learning to write it effectively is harder. Learning a new language for every script you have to maintain is downright painful, and once you know all of them, context switching isn't free for either humans or computers.

I'm not saying write everything in one language; that'd probably leave you writing a lot of code in very suboptimal languages. But choose two or three, not ten. Your ops people will thank you, and so will the people who have to maintain your code in a decade. At DjangoCon this year Glyph Lefkowitz actually went further: he argued that not just the code you write, but your entire technology stack, should be in one language. But that's a separate discussion; you should watch the video though.

Also, because I'm a big fan of The West Wing, I'd be remiss if I used the word polyglot this many times without linking to a great scene.


Why del defaultdict()[k] should raise an error

Posted November 28th, 2011. Tagged with python, programming.

Raymond Hettinger recently asked on Twitter what people thought del defaultdict()[k] did for a k that didn't exist in the dict. There are two ways of thinking about this. One is, "it's a defaultdict, there's always a value at a key, so it can never raise a KeyError"; the other is, "that only applies to reading a value, this should still raise an error". I initially spent several minutes considering which made more sense, but I eventually came around to the second view, and I'm going to explain why.

The Zen of Python says, "Errors should never pass silently." Any Java programmer who's seen a NullPointerException knows the result of passing around invalid data rather than propagating an error. There are two cases for trying to delete a key which doesn't exist in a defaultdict. One is: "this algorithm happens to sometimes produce keys that aren't there, it's not an issue, ignore it". The other is: "my algorithm has a bug, it should always produce valid keys". If you don't raise a KeyError, the first case gets a single line of nice code; if you do raise an error, they have a boring try/except KeyError block, but no big loss. However, if an error isn't raised and your algorithm should never produce nonexistent keys, you'll be silently missing a large bug in your algorithm, which you'll have to hope to catch later.

The inconvenience of catching the KeyError, for the programmer whose algorithm legitimately produces nonexistent keys, is outweighed by the potential for hiding a nasty bug in the algorithm of the programmer whose code should never produce them. Ignoring an exception is easy; trying to find the bug in your algorithm can be a pain in the ass.
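
For what it's worth, this is exactly how CPython's defaultdict already behaves; a quick interactive session shows both cases:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> d["a"]    # reading a missing key inserts and returns the default
0
>>> del d["b"]    # deleting a missing key still raises
Traceback (most recent call last):
  ...
KeyError: 'b'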


The run-time distinction

Posted October 11th, 2011. Tagged with programming, python, programming-languages.

At PyCodeConf I had a very interesting discussion with Nick Coghlan which helped me understand something that had long frustrated me about programming languages. Anyone who's ever taught a new programmer Java knows this thing, but perhaps hasn't understood it for what it is. What I hadn't been appreciating was the distinction some programming languages make between the language that exists at compile time and the language that exists at run-time.

Take a look at this piece of Java code:

class MyFirstProgram {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

Most people don't appreciate it, but you're really writing in two programming languages here: one of these languages has things like class and function declarations, and the other has executable statements. (And yes, I realize Java has anonymous classes; they don't meaningfully change anything I'm about to discuss.)

Compare that with the (approximately) equivalent Python code:

def main():
    print "Hello World"

if __name__ == "__main__":
    main()

There's a very important thing to note here: we have executable statements at the top level, something that's simply impossible in Java, C, or C++. Those languages make a distinction between the top level and your functions' bodies. It follows that the function we've defined doesn't have special status by virtue of being at the top level; we could define a function or write a class in any scope. And this is important, because it gives us the ability to express things like decorators (both class and function).
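
For example, because def is itself an executable statement, plain Python can run code around a function's definition and even define new functions inside another function's body at run-time. Here's a minimal (hypothetical) logging decorator:

def logged(func):
    # a function defined inside another function, at run-time
    def wrapper(*args, **kwargs):
        print "calling %s" % func.__name__
        return func(*args, **kwargs)
    return wrapper

@logged
def add(x, y):
    return x + y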

Another example of this distinction, which James Tauber pointed out to me, is the import statement. In Python it is a line of executable code which invokes machinery in the VM to find a module and load it into the current namespace. In Java it is an indication to the compiler that a certain symbol is in scope; it's never executed.
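
One concrete consequence: because it's just executable code, a Python import can appear anywhere a statement can, as in the classic fallback idiom:

# executed at run-time: try the faster module, fall back to the stdlib one
try:
    import simplejson as json
except ImportError:
    import json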

Why would anyone care about this distinction though? It's clearly possible to write programs in languages on both ends of the spectrum. It appears to me that the expressiveness of a programming language is really a description of the distance between its compile-time language and its run-time language. Python stands on one end, with no distinction, whereas C/C++/Java stand on the other, with a grand canyon separating them.

But what about a language in the middle? Much of PyPy's code is written in a language named RPython. It has a fairly interesting property: its run-time language is pretty close to Java in semantics, statically typed (though type inferenced), while its compile-time language is Python. In practice this means you get many of the same benefits in expressiveness as you do from using Python itself. For example, you can write a decorator, or generate a class. A good example of this power is in PyPy's NumPy implementation. We're able to create the code for doing all the operations on different dtypes (NumPy's name for the different datatypes its arrays can represent) dynamically, without needing to resort to code generation or repeating ourselves; we're able to rely on Python as our compile-time (or meta-programming) language. The in-practice result of this is that writing RPython feels much more like writing Python than like writing Java, even though most of your code is written under the same constraints (albeit without the need to explicitly write types).
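
To give a flavor of what this enables (an illustrative sketch in plain Python, not PyPy's actual code): at "compile time" an ordinary loop can stamp out one specialized function per dtype from a single template:

import operator

def make_binary_op(name, op):
    # ordinary Python, running at "compile time", builds each function
    def binary_op(lhs, rhs):
        return op(lhs, rhs)
    binary_op.__name__ = name
    return binary_op

# one loop yields an add and a mul for every dtype, with no textual
# code generation and no copy-paste
ops = {}
for dtype in ["int8", "int32", "float64"]:
    for opname, op in [("add", operator.add), ("mul", operator.mul)]:
        ops[dtype, opname] = make_binary_op("%s_%s" % (dtype, opname), op)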

The distinction between compile-time and run-time in programming languages results in both more pain for programmers and more arcane structures that have to be explained to new users. I believe new languages going forward should make it a goal either to minimize this distance (as Python does) or to outfit themselves with more powerful compile-time languages which give them the ability to express meta-programming constructs.


So you want to write a fast Python?

Posted July 10th, 2011. Tagged with pypy, programming, python.

Thinking about writing your own Python implementation? Congrats, there are plenty out there [1], but perhaps you have something new to bring to the table. Writing a fast Python is a pretty hard task, and there's a lot of stuff you need to keep in mind, but if you're interested in forging ahead, keep reading!

First, you'll need to write yourself an interpreter. A static compiler for Python doesn't have enough information to do the right things [2] [3], and a multi-stage JIT compiler is probably more trouble than it's worth [4]. It doesn't need to be super fast, but it should be within 2x of CPython or so, or you'll have lost too much ground to make up later. You'll probably need to write yourself a garbage collector as well, it should probably be a nice, generational collector [5].

Next you'll need implementations for all the builtins. Be careful here! You need to be every bit as good as CPython's algorithms if you want to stand a chance, this means things like list.sort() keeping up with Timsort [6], str.__contains__ keeping up with fast search [7], and dict.__getitem__ keeping up with the extremely carefully optimized Python dict [8].

Now you've got the core language, take a bow, most people don't make it nearly this far! However, there's still tons of work to go, for example you need the standard library if you want people to actually use this thing. A lot of the stdlib is in Python, so you can just copy that, but some stuff isn't, for that you'll need to reimplement it yourself (you can "cheat" on a lot of stuff and just write it in Python though, rather than C, or whatever language your interpreter is written in).

At this point you should have yourself a complete Python that's basically a drop-in replacement for CPython, but that's a bit slower. Now it's time for the real work to begin. You need to write a Just in Time compiler, and it needs to be a good one. You'll need a great optimizer that can simultaneously understand some of the high level semantics of Python, as well as the low level nitty gritty of your CPU [9].

If you've gotten this far, you deserve a round of applause, not many projects make it this far. But your Python probably still isn't going to be used by the world, you may execute Python code 10x faster, but the Python community is more demanding than that. If you want people to really use this thing you're going to have to make sure their C extensions run. Sure, CPython's C-API was never designed to be run on other platforms, but you can make it work, even if it's not super fast, it might be enough for some people [10].

Finally, remember that standard library you wrote earlier? Did you make sure to take your time to optimize it? You're probably going to need to take a step back and do that now, sure it's huge, and people use every nook and cranny of it, but if you want to be faster, you need it to be faster too. It won't do to have your bz2 module be slower, tarnishing your beautiful speed results [11].

Still with me? Congratulations, you're in a class of your own. You've got a blazing fast Python, a nicely optimized standard library, and you can run anyone's code, Python or C. If this ballad sounds a little familiar, that's because it is, it's the story of PyPy. If you think this was a fun journey, you can join in. There are ways for Python programmers at every level to help us, such as:

  • Contribute to our performance analysis tool; it's actually a web app written using Flask.
  • Contribute to speed.pypy.org, which is a Django site.
  • Provide pure Python versions of your C-extensions, to ensure they run on alternative Pythons.
  • Test and benchmark your code on PyPy, and let us know if you think we should be faster! (We're always interested in slower code, and we consider it a bug.)
  • Contribute to PyPy itself; we've got tons of things to do. You could work on the standard library, the JIT compiler, the GC, or anything in between.

Hope to see you soon [12]!

[1]CPython, IronPython, Jython, PyPy, at least!
[2]http://code.google.com/p/shedskin/
[3]http://cython.org/
[4]http://code.google.com/p/v8/
[5]http://docs.python.org/c-api/refcounting.html
[6]http://hg.python.org/cpython/file/2.7/Objects/listsort.txt
[7]http://effbot.org/zone/stringlib.htm
[8]http://hg.python.org/cpython/file/2.7/Objects/dictnotes.txt
[9]http://code.google.com/p/unladen-swallow/
[10]http://code.google.com/p/ironclad/
[11]https://bugs.pypy.org/issue733
[12]http://pypy.org/contact.html


My experience with the computer language shootout

Posted April 3rd, 2011. Tagged with pypy, programming, python, programming-languages.

For a long time we, the PyPy developers, have known that the Python implementations on the Computer Language Shootout were not optimal under PyPy, and in fact had been ruthlessly micro-optimized for CPython, to the detriment of PyPy. But we didn't really care or do anything about it, because we figured those benchmarks weren't really representative of what people like to do with Python, and who really cares what it says; CPython is over 30 times slower than C, and people use it just the same. However, I've recently had a number of discussions about language implementation speed, and people tend to cite the language shootout as the definitive source for cross-language comparisons. So I decided to see what I could do about making it faster.

The first benchmark I took a stab at was reverse-complement. PyPy was doing crappily on it, and it was super obviously optimized for CPython: every loop possible was pushed down into functions known to be implemented in C, various memory allocation tricks were played (e.g. del some_list[:] removes the contents of the list but doesn't deallocate the memory), and bound method allocation was pulled out of loops. The first one is the most important for PyPy: on PyPy your objective is generally to make sure your hot loops are in Python, the exact opposite of what you want on CPython. So I started coding up my own version, optimized for PyPy. I spent some time with our debugging and profiling tools, and whipped up a nice implementation that was something like 3x faster than the current one on PyPy, which you can see here. Generally the objective was to make sure the program does as little memory allocation in the hot loops as possible, all of which are in Python. Try that with your average interpreter.

So I went ahead and submitted it, thinking PyPy would be looking 3 times better when I woke up. Naturally, I woke up to an email from the shootout saying that I should provide a Python 3 implementation, and that it didn't work on CPython. What the hell? I tried to run it myself, and indeed it didn't. It turns out that on CPython sys.stdout.write(buffer(array.array("c"), 0, idx)) raises an exception, which is a tad unfortunate because it should be an easy way to print out part of an array of characters without needing to allocate memory. After speaking with some CPython core developers, it appears that it is indeed a bug in CPython. I also noticed that on PyPy buffer objects aren't nearly as efficient as they should be, so I set out in search of an approach that would work on both CPython and PyPy, and be faster if possible. I happened to stumble across the method array.buffer_info, which returns a tuple of the memory address of the array's internal storage and its length, and a brilliant hack occurred to me: use ctypes to call libc's write() function. I coded it up, and indeed it worked on PyPy and CPython, and was 40% faster on PyPy to boot. Fantastic, I thought, I'll just submit this and PyPy will look rocking! Only 3.5x slower than C, not bad for an interpreter, in a language that is notoriously hard to optimize. You can see the implementation right here; it contains a few other performance tricks as well, but nothing too exciting.
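
The hack, roughly (a sketch from memory, not the actual submission; the libc name here is a platform-specific assumption):

import array
import ctypes

libc = ctypes.CDLL("libc.so.6")  # assumes Linux; adjust for your platform

def write_bytes(buf, length):
    # buffer_info() returns (address of the internal storage, item count),
    # so we can hand libc's write(2) a pointer straight into the array,
    # with no intermediate string allocation
    address, _ = buf.buffer_info()
    libc.write(1, ctypes.c_void_p(address), length)

out = array.array("c", "ACGTACGT")
write_bytes(out, len(out))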

So I submitted this, thinking, "Aha! I've done it." Shortly after, I had an email saying it had been accepted as an "interesting alternative" because it used ctypes, which is to say it wouldn't be included in the cumulative timings for each implementation, nor listed with the normal implementations in the per-benchmark scores. Well, crap, that's no good; the whole point of this was to look good, and what's the point if no one is going to see this glorious work? So I sent a message asking why this implementation was considered alternative, since it appeared fairly legitimate. I received a confusing message questioning why this optimization was necessary, followed by a suggestion that perhaps PyPy wasn't compatible enough (compatible with what, I dare not ask, but the answer obviously isn't Python the abstract language, since CPython had the bug!).

Overall it was a pretty crappy experience. The language shootout appears to be governed by arbitrary rules. For example, the C implementations use GCC builtins, which are not part of the C standard, making them not implementation-portable. The CPython pidigits version uses a C extension, which is obviously not implementation-portable (by comparison, every major Python implementation includes ctypes; only CPython, and to varying extents IronPython and PyPy, support the CPython C-API), although here PyPy was allowed to use ctypes. It's also not possible to send any messages once your ticket has been marked as closed, meaning that to dispute a decision you basically need to pray the maintainer reopens it for some reason. The full back-and-forth is available here. I'm still interested in improving the PyPy submissions there (and further optimizing PyPy where needed). However, given the seemingly capricious and painful submission process, I'm not really inclined to continue the work, nor can I take the shootout seriously as an honest comparison of languages.


Announcing VCS Translator

Posted January 21st, 2011. Tagged with python, vcs, software, programming, django, open-source.

For the past month or so I've been using a combination of Google, Stack Overflow, and bugging people on IRC to muddle my way through using various VCSs that I'm not very familiar with. All too often my queries are of the form "how do I do git foobar -q in mercurial?". A while ago I tweeted that someone should write a VCS translator website. Nobody else did, so when I woke up far too early today I decided I was going to get something online to solve this problem, today! About 6 hours later I tweeted the launch of VCS Translator.

This is probably not even a minimum viable product. It doesn't handle a huge range of cases, or version control systems. However, it is open source and it provides a framework for answering these questions. If you're interested, I'd encourage you to fork it on GitHub and help me out by fixing some of the most requested translations (I remove them once they're implemented).

My future goals for this are to allow commenting, so users can explain the caveats of the translations (very infrequently are the translations one-to-one) and to add a proper API. Moreover my goal is to make this a useful tool to other programmers who, like myself, have far too many VCS in their lives.


Getting the most out of tox

Posted December 17th, 2010. Tagged with testing, python, taggit, programming, django.

tox is a recent Python testing tool by Holger Krekel. Its stated purpose is to make testing Python projects against multiple versions of Python (or different interpreters, like PyPy and Jython) much easier. However, it can be used for so much more. Yesterday I set it up for django-taggit, and it's an absolute dream: it automates testing against four different versions of Python and two different versions of Django, and it automates building the docs and checking for any warnings from Sphinx. I'll try to give a rundown of exactly what you need to do to set this up with your project.

First, create a tox.ini at the root of your project (i.e. in the same directory as your setup.py). Next, create a [tox] section and list the environments you'd like tested (i.e. which Pythons):

[tox]
envlist =
    py25, py26, py27, pypy

The environments we've listed are a few of the ones included with tox; they point at specific versions of Python and use the default testing setup. Now add a [testenv] section, which will tell tox how to actually run your tests:

[testenv]
commands =
    python setup.py test
deps =
    django==1.2.3

commands is the list of commands tox will run, and deps specifies any dependencies needed to run the tests (tox creates a virtualenv for each environment and doesn't include the system-wide site-packages, so you need to list everything needed here). If you want to use this same python setup.py test formulation, you'll need to be using setuptools or distribute for your setup.py and provide the test_suite argument; Eric Holscher provides a good rundown of how to do this for Django projects.

Now you should be able to just type tox at your command line, and it will try to run your tests in each of the environments you specified. Hopefully they're all passing (future test runs will go faster; on the first run it has to install all the dependencies). The next thing you may want to do is set it up to build your documentation. To do this, create a [testenv:docs] section:

[testenv:docs]
changedir = docs
deps =
    sphinx
commands =
    sphinx-build -W -b html -d {envtmpdir}/doctrees . {envtmpdir}/html

This tells tox a few things. First, changedir tells it to cd into the docs/ directory to run these commands (if your docs live elsewhere, change as appropriate). Next, it has sphinx as a dependency. Finally, the commands invoke sphinx-build: -W turns warnings into errors (so you get an appropriate failure status code), -b html uses the HTML builder, and the rest of the parameters tell Sphinx where the docs live and to put the output in the temporary directory that tox creates for each env.

Now all you need to do is add docs to the envlist, and a tox run will build your documentation.
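
The [tox] section would then read something like:

[tox]
envlist =
    py25, py26, py27, pypy, docs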

The last thing you might want to do is set tox up to test against multiple versions of a package (such as Django 1.1, Django 1.2, and trunk). To do this, create another section whose name includes both the Python version and the dependency version, e.g. [testenv:py25-trunk]. In it place:

[testenv:py25-trunk]
basepython = python2.5
deps =
    http://www.djangoproject.com/download/1.3-alpha-1/tarball/

This "inherits" from the default testenv, so it still has its commands, but we specify the basepython indicating this testenv is for python 2.5, and a different set of dependencies, here we're using the Django 1.3 alpha. You'll need to do a bit of copy-paste and create one of these for each version of Python you're testing against, and make sure to add each of these to the envlist.

At this point you should have a lean, mean testing setup. With one command you can test your package against different dependencies and different Pythons, and build your documentation. The tox documentation features tons of examples, so you should use it as a reference.


Programming Languages Terminology

Posted November 19th, 2010. Tagged with programming-languages, programming.

This semester I've been taking a course titled "Programming Languages". Since I'm a big languages and compilers nerd this should be wonderful. Unfortunately every time I'm in this class I'm reminded of just what a cluster-fuck programming language terminology is. What follows are my definitions of a number of terms:

  • Dynamic typing (vs static typing): Values do not have a type until runtime. Has nothing to do with the declaration of types (i.e. a type inferenced language is not dynamically typed).
  • Type safety (vs type unsafety): Operations cannot be performed on values which do not support them; these operations need not be prohibited prior to execution, a run-time exception suffices.
  • Strong typing (vs weak typing): Implicit conversions are not performed. This has nothing to do with static or dynamic typing; rather it refers to whether a language will perform an operation such as '1' + 2. For example, Python raises a TypeError here, whereas PHP returns 3 (see the snippet below). This one is slightly muddied by the fact that in languages with user-defined types (and the ability to implement behaviors on operators), a type can really do anything it likes, so this is less an aspect of the core language and more one of the included types and functions.
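
Here's the Python half of that example, in a Python 2 session:

>>> '1' + 2
Traceback (most recent call last):
  ...
TypeError: cannot concatenate 'str' and 'int' objects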

There are probably some terms I've missed, but for these terms I think my definitions roughly match the consensus on them.


A statically typed language I'd actually want to use

Posted November 4th, 2010. Tagged with programming-languages, programming.

Nearly a year ago I tweeted about, and released on GitHub, a project I'd been working on. It's called shore, and it's the start of a compiler for a statically typed language I'd actually like to use. Programming is, for me, most fun when I can work at the level that suits the problem I have: when I'm building a website I don't like to think about how my file system library is implemented. That means a language needs to support certain abstractions, and it also needs to be sufficiently concise that I'd actually want to write things in it. A good example is "give me the square of all the items in this sequence that are divisible by 3". In Python:

[x * x for x in seq if x % 3 == 0]

And in C++:

std::vector<int> result;
for (std::vector<int>::iterator itr = seq.begin(); itr != seq.end(); ++itr) {
    if (*itr % 3 == 0) {
        result.push_back((*itr) * (*itr));
    }
}

The best I can say for that is: what the hell. There's nothing that's not static about my Python code (assuming the compiler knows that seq is a list of integers), and yet... it's a fifth as many lines of code, and significantly simpler (and requires no changes if I put my integers in a different sequence).

The point of shore was to bring these higher level syntactic constructs, static typing, and support for higher level abstractions into a single programming language. The result was to be an explicitly statically typed, ahead of time compiled, garbage collected language.

The syntax is inspired almost exclusively by Python, and the type system is largely inspired by C++ (except there are no primitives; everything is an object). For example, here's a function which does what those two code snippets do:

list{int} def play_with_seq(list{int} seq):
    return [x * x for x in seq if x % 3 == 0]

As you can see, it supports parametric polymorphism (templating). One important piece of this is the ability to operate on more abstract types. For example, this could be rewritten:

list{int} def play_with_seq(iterable{int} seq):
    return [x * x for x in seq if x % 3 == 0]

iterable is anything that implements the iterator protocol.

I'll be writing more about my thoughts on the language as the month goes on. However, I need to stress that the implementation is both a) untouched since December, and b) nothing you want to look at: it's a working lexer, a mostly working parser, and a terrible translator into C++. Still, I hope this can inspire people to work towards a more perfect statically typed language.


Priorities

Posted October 24th, 2010. Tagged with programming, django, python, open-source.

When you work on something as large and multi-faceted as Django, you need a way to prioritize what you work on. Without a system, how do I decide whether I should work on a new feature for the template system, a bugfix in the ORM, a performance improvement to the localization features, or better docs for contrib.auth? There are tons of places to jump in and work on something in Django, and if you aren't a committer you'll eventually need one to commit your work. So if you ever need me to commit something, here's how I prioritize my time on Django:

  1. Things I broke: If I broke a buildbot, or there's a ticket reported against something I committed, this is my #1 priority. Though Django no longer has a policy of trunk generally being perfectly stable, it's still a very good way to treat it; once it gets out of shape, it's hard to get it back into good standing.
  2. Things I need for work: Strictly speaking these don't compete with the other items on this list, in that these happen on my work's time, rather than in my free time. However, practically speaking, this makes them a relatively high priority, since my work time is fixed, as opposed to free time for Django, which is rather elastic.
  3. Things that take me almost no time: These are mostly things like typos in the documentation, or really tiny bugfixes.
  4. Things I think are cool or important: These are either things I personally think are fun to work on, or are in high demand from the community.
  5. Other things brought to my attention: This is the most important category, because I can only work on bugs or features that I know exist. Django's trac has about 2000 tickets, way too many for me to ever sift through in one sitting. Therefore, if you want me to take a look at a bug or a proposed patch, it needs to be brought to my attention. Just pinging me on IRC is enough; if I have the time, I'm almost always willing to take a look.

In actuality, the vast majority of my time is spent in the bottom half of this list; it's pretty rare for the build to be broken, and even rarer for me to need something for work. However, there are tons of small things, and even more cool things, to work on. An important thing to remember is that the best way to make something show up in category #3 is to have an awesome patch with tests and documentation; if all I need to do is git apply && git commit, that saves me a ton of time.


Dynamic and Static Programming Languages and Teaching

Posted September 29th, 2010. Tagged with education, programming, programming-languages.

Lately I've seen the sentiment "why do all teachers need to prepare their own lesson plans?" in a few different places, and it strikes me as representative of a particular mindset about education and teaching. To me, it raises a fundamental question of pedagogical philosophy, akin to the great programming debate over which is better, dynamic or static languages (although it might be more apt to say compiled versus interpreted).

Static languages usually require an up-front investment of time (compilation), in return for which you get less work at runtime; various things about your program are strictly defined: this function takes arguments of these types, and that's that. This is similar to one way of teaching: a lesson plan is prepared, and then it's taught; the flow of the class itself should stay in line with the lesson plan. In comparison, dynamic languages usually have minimal compilation steps and push more of the work to the runtime, and various things about the program are impossible to determine statically: you can't know what types a function takes. This is akin to a teaching style that reacts to what goes on in the classroom.

In the programming world, which of these is preferable is a giant debate, and I won't even attempt to answer it here (the answer is both, though, if you're curious). However, I think in the realm of teaching the answer is clear: dynamic teaching is always preferable. One of the single most important components necessary for learning to occur is an active interest from the student. It's impossible for this to exist in an environment where the most important thing is keeping to the predefined, static schedule. By responding to the way the class is going and the interests of the students, actual learning, as opposed to rote memorization (to be forgotten as soon as the test is over), can occur. We can see examples of this in everyday classes: the language arts teacher who misses a class on prepositions to discuss why word usage often doesn't match the dictionary definition, the math teacher who skips a day of geometry to discuss practical uses for theorem proving, the ethics professor whose lesson on Greek philosophers gets lost to a debate on the social contract. Especially in the higher grade levels, there's little material that can't be learned for "a dollar fifty in late charges at the public library"; the debate, discussion, and analysis that make it useful, however, are invaluable, and it's the job of an educator, most of all, to shepherd them.

That's not to say that this style of teaching requires no preparation; in fact it probably requires more (oops, there goes the metaphor). Thinking up the different directions a class could go, useful comparisons and analogies, and ways the material relates to the students' lives all take incredible amounts of time and thought, and they're often different for any given class, which inhibits the diligent teacher's ability to simply reuse some other teacher's material, as many seem to propose. Indeed, attempting to better tailor courses to effectively teach the students taking them is far more valuable than attempting to maximize the throughput we can get out of a lesson.
