alex gaynor's blago-blog

Posts tagged with orm

DjangoCon.eu slides

Posted May 24th, 2010. Tagged with python, nosql, django, orm, djangocon.

I just finished giving my talk at DjangoCon.eu on Django and NoSQL (also the topic of my Google Summer of Code project). You can get the slides over at slideshare. My slides from my lightning talk on django-templatetag-sugar are also up on slideshare.


Why Meta.using was removed

Posted November 27th, 2009. Tagged with python, models, django, orm, gsoc.

Recently Russell Keith-Magee and I decided that the Meta.using option needed to be removed from the multiple-db work on Django, and so we did. Yesterday someone tweeted that this change caught them off guard, so I wanted to provide a bit of explanation as to why we made that change.

The first thing to note is that Meta.using was very good for one specific use case: horizontal partitioning by model. Meta.using allowed you to tie a specific model to a specific database by default. This meant that if you wanted to do things like have users in one db and votes in another, it was basically trivial. Making this use case that simple was definitely a good thing.

The downside was that this solution was very poorly designed, particularly in light of Django's reusable-application philosophy. Django emphasizes the reusability of applications, and the Meta.using option tied your partitioning logic to your models; if you wanted to partition a reusable application onto another DB, the only solution was to go in and edit the source of the reusable application. Because of this we had to go in search of a better solution.

The better solution we've come up with is having some sort of callback you can define that lets you decide what database each query should be executed on. This would let you do simple things like direct all queries on a given model to a specific database, as well as more complex sharding logic like sending queries to the right database depending on which primary key value the lookup is by. We haven't figured out the exact API for this, and as such this probably won't land in time for 1.2; however, it's better to have the right solution and have to wait than to implement a bad API that would become deprecated in the very next release.
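
To give a flavor of the idea, here is a sketch of what such a callback might look like. These names and signatures are entirely hypothetical, since as I said the API hasn't been designed yet:

# Hypothetical sketch only -- this is NOT a finalized API, just the shape of the idea.
def db_for_query(model, instance=None):
    # Simple case: pin a model to a database by default.
    if model._meta.app_label == "votes":
        return "votes_db"
    # Sharding case: route based on the primary key being looked up.
    if instance is not None:
        return "shard_%d" % (instance.pk % 4)
    return "default"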


The State of MultiDB (in Django)

Posted November 10th, 2009. Tagged with python, models, django, gsoc, internals, orm.

As you, the reader, may know, this summer I worked for the Django Software Foundation via the Google Summer of Code program. My task was to implement multiple database support for Django. Assisting me in this task were my mentors Russell Keith-Magee and Nicolas Lara (you may recognize them as the people responsible for aggregates in Django). By the standards of the Google Summer of Code program my work was considered a success; however, it's not yet merged into Django's trunk, so I'm going to outline what happened, and what needs to happen before this work is considered complete.

Most of the major things happened: settings were changed from a series of DATABASE_* options to a DATABASES setting that's keyed by DB aliases and whose values are dictionaries containing the usual DATABASE_* options; QuerySets grew a using() method which takes a DB alias and says which DB the QuerySet should be evaluated against; save() and delete() grew similar using keyword arguments; a using option was added to the inner Meta class for models; transaction support was expanded to include support for multiple databases, as was the testing framework. In terms of internals, almost every internal DB-related function grew explicit passing of the connection or DB alias, rather than assuming the global connection object as it used to. As I blogged previously, ManyToMany relations were completely refactored. If it sounds like an awful lot got done, that's because it did; I knew going in that multi-db was a big project and it might not all happen within the confines of the summer.
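
To make that concrete, here's roughly what those pieces look like in use (the aliases, the options, and the Article model are just examples):

# settings.py -- one entry per database alias
DATABASES = {
    "default": {
        "DATABASE_ENGINE": "postgresql_psycopg2",
        "DATABASE_NAME": "primary",
    },
    "archive": {
        "DATABASE_ENGINE": "sqlite3",
        "DATABASE_NAME": "archive.db",
    },
}

# QuerySets can be pointed at a specific alias...
old_articles = Article.objects.using("archive").all()

# ...and so can individual saves and deletes.
article.save(using="archive")
article.delete(using="archive")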

So if all of that stuff got done, what's left? Right before the end of the GSOC time frame Russ and I decided that a fairly radical rearchitecting of the Query class (the internal datastructure that both tracks the state of an operation and forms its SQL) was needed. Specifically, the issue was that database backends come in two varieties. One is something like a backend for App Engine or CouchDB: these have a totally different design than SQL, they need different datastructures to track the relevant information, and they need different code generation. The second type of database backend is one for a SQL database. By contrast these all share the same philosophies and basic structure; in most cases their implementation just involves changing the names of database column types or the way LIMIT/OFFSET is handled. The problem is Django treated all the backends equally. For SQL backends this meant that each got its own Query class even though it only needed to override half of the Query functionality, the SQL generation half; the datastructure half was identical since the underlying model is the same. What this means is that if you make a call to using() on a QuerySet halfway through its construction, you need to change the class of the Query representation if you switch to a database with a different backend. This is obviously a poor architecture, since the Query class doesn't need to be changed, just the bit at the end that actually constructs the SQL. To solve this problem Russ and I decided that the Query class should be split into two parts: a Query class that stores bits about the current query, and a SQLCompiler which generates the SQL at the end of the process. This refactoring is the main thing holding up the merger of my multi-db work.
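
In skeleton form, the split looks something like this. This is a heavily simplified sketch to show the shape of the design, not the actual code:

from django.db import connections

class Query(object):
    # Backend-agnostic description of the operation: model, filters,
    # ordering, limits -- no SQL anywhere in here.
    def __init__(self, model):
        self.model = model
        self.where = []

    def get_compiler(self, using):
        # Only at execution time do we pick a compiler for the connection
        # the query is actually going to run against.
        connection = connections[using]
        return SQLCompiler(self, connection, using)

class SQLCompiler(object):
    # Knows how to turn a Query into SQL; individual SQL backends can
    # subclass this and override just the generation bits they need.
    def __init__(self, query, connection, using):
        self.query = query
        self.connection = connection
        self.using = using

    def as_sql(self):
        raise NotImplementedError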

This work is largely done; however, the API needs to be finalized and the Oracle backend ported to the new system. In terms of other work that needs to be done, GeoDjango needs to be shown to still work (or fixed). In my opinion everything else on the TODO list (available here, please don't deface) is optional for multi-db to be merge-ready, with the exception of more example documentation.

There are already people using the multi-db branch (some even in production), so I'm confident about its stability. For the next 6 weeks or so (until the 1.2 feature deadline), my biggest priority is going to be getting this branch into a merge-ready state. If this is something that interests you please feel free to get in contact with me (although if you don't come bearing a patch I might tell you that I'll see you in 6 weeks ;)). If you happen to find bugs they can be filed on the Django trac, with version "soc2009/multidb". As always contributors are welcome; you can find the absolute latest work on my Github and a relatively stable version in my SVN branch (this doesn't contain the latest, in progress, refactoring). Have fun.


Django's ManyToMany Refactoring

Posted November 4th, 2009. Tagged with python, models, django, gsoc, internals, orm.

If you follow Django's development, or caught next week's DjangoDose Tracking Trunk episode (what? that's not how time flows you say? too bad), you've seen the recent ManyToManyField refactoring that Russell Keith-Magee committed. This refactoring was one of the results of my work as a Google Summer of Code student this summer. The aim of that work was to bring multiple database support to Django's ORM; however, along the way I ended up refactoring the way ManyToManyFields are handled, and the exact changes I made are the subject of tonight's post.

If you've looked at django.db.models.fields.related you may have come away asking how code that messy could possibly underlie Django's amazing API for handling related objects; indeed, the mess is so bad that there's a comment which says:

# HACK

which applies to an entire class. However, one of the real travesties of this module was that it contained a large swath of raw SQL in the manager for ManyToMany relations; for example, the clear() method's implementation looked like:

cursor = connection.cursor()
cursor.execute("DELETE FROM %s WHERE %s = %%s" % \
    (self.join_table, source_col_name),
    [self._pk_val])
transaction.commit_unless_managed()

As you can see this hits the trifecta: raw SQL, manual transaction handling, and the use of a global connection object. From my perspective the last of these was the biggest issue. One of the tasks in my multiple database branch was to remove all uses of the global connection object, and since this code relies on it, it was a major target for refactoring. However, I really didn't want to rewrite any of the connection logic I'd already implemented in QuerySets. This desire to avoid any new code duplication, coupled with a desire to remove the existing duplication (and flat out ugliness), led me to the simple solution: use the existing machinery.

Since Django 1.0, developers have been able to use a full-on model for the intermediary table of a ManyToMany relation, thanks to the work of Eric Florenzano and Russell Keith-Magee. However, that support was only used when the user explicitly provided a through model. This of course led to a lot of methods that basically had two implementations: one for the case where a through model is provided, and one for the normal case -- which is yet another kind of code bloat that I was now looking to eliminate. After reviewing these items my conclusion was that the best course was to use the provided intermediary model if it was there, and otherwise create a full-fledged model with the same fields (and everything else) as the table that would normally be specially created for the ManyToManyField.
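
A rough sketch of what dynamically generating such an intermediary model can look like is below. This is heavily condensed, and the generated field names here are illustrative rather than what actually landed:

from django.db import models

def create_intermediary_model(field, klass):
    # klass is the model the M2M is declared on; field.rel.to is the target.
    name = "%s_%s" % (klass._meta.object_name, field.name)
    meta = type("Meta", (object,), {
        "db_table": field._get_m2m_db_table(klass._meta),
        "app_label": klass._meta.app_label,
    })
    # type() builds a real model class at runtime, with a FK to each side,
    # just as if it had been written out by hand.
    return type(name, (models.Model,), {
        "Meta": meta,
        "__module__": klass.__module__,
        klass._meta.object_name.lower(): models.ForeignKey(klass),
        field.rel.to._meta.object_name.lower(): models.ForeignKey(field.rel.to),
    })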

The end result was dynamic class generation for the intermediary model, plus simple QuerySet methods for the methods on the Manager; for example, the clear() method I showed earlier now looks like this:

self.through._default_manager.filter(**{
    source_field_name: self._pk_val
}).delete()

Short, simple, and totally readable to anyone with familiarity with Python and Django. In addition this move allowed Russell to fix another ticket with just two lines of code. All in all this switch made for cleaner, smaller code and fewer bugs.

Tomorrow I'm going to be writing about both the talk I'm going to be giving at PyCon, as well as my experience as a member of the PyCon program committee. See you then.


ORM Panel Recap

Posted March 30th, 2009. Tagged with python, alchemy, gae, django, orm, web2py, pycon, object, sql.

Having now completed what I thought was a quite successful panel, I thought it would be nice to do a review of some of the decisions I made, which some people had been asking about. For those who missed it you can find a live blog of the event by James Bennett at his blog, and a video should hopefully be going up sometime soon.

Why Google App Engine

As Guido pointed out, App Engine does not have an ORM, as App Engine doesn't have a relational datastore. However, it does have something that looks and acts quite a lot like other ORMs, and it does fundamentally try to serve the same purpose: offering a persistence layer. Therefore I decided it was at least in the same class of items I wanted to add. Further, with the rise of non-relational DBs that all fundamentally deal with the same issues as App Engine, and the relationship between ORMs and these new persistence layers, I thought it would be advantageous to have one of them represented. Plus, Guido is a knowledgeable and interesting person, and that's how the cookie crumbled.

Why Not ZODB/Storm/A Talking Pony

Time. I would have loved to have as many different ORMs/things like them as exist in the Python eco-sphere, but there just wasn't time. We had 55 minutes to present, and as it is that wasn't enough. I ultimately had time to ask 3 questions (one of which was just background), plus 5 shorter audience questions. I was forced to cut several questions I wanted to ask; for those who are interested, the major ones were:

  • What most often requested feature won't you add to your ORM?
  • What is the connection between an ORM and a schema migration tool? Should they both be part of the same project, should they be tied together, or are they totally orthogonal?
  • What's your support for geographic data? Is this (or other complex data types like it) in scope for the core of an ORM?

Despite these difficulties I thought the panel turned out very well. If there are any other questions about why things were the way they were just ask in the comments and I'll try to post a response.


Google Moderator for PyCon ORM Panel

Posted March 15th, 2009. Tagged with python, alchemy, gae, django, orm, web2py, object, sql.

I'm going to be moderating a panel this year at PyCon between 5 of the Python ORMs (Django, SQLAlchemy, SQLObject, Google App Engine, and web2py). To make my job easier, and to make sure the most interesting questions are asked, I've set up a Google Moderator page for the panel here. Go ahead and submit your questions, and moderate others', to try to ensure we get the best questions possible, even if you can't make it to PyCon (there will be a recording made, I believe). I'll be adding my own questions shortly to make sure they are as interesting as I think they are.

Also, if you aren't already, do try to make it out to PyCon, there's still time and the talks look to be really exceptional.


A Second Look at Inheritance and Polymorphism with Django

Posted February 10th, 2009. Tagged with python, models, django, internals, orm, metaclass.

Previously I wrote about ways to handle polymorphism with inheritance in Django's ORM in a way that didn't require any changes to your model at all (besides adding in a mixin). Today we're going to look at a way to do this that is a little more invasive and involved, but can also provide much better performance. As we saw previously, with no other information we could get the correct subclass for a given object in O(k) queries, where k is the number of subclasses. This means for a queryset with n items we would need to do O(nk) queries: for a queryset with 10 items and 3 subclasses we'd need to do 30 queries, which isn't really acceptable for most websites. The major problem here is that for each object we simply guess which subclass it is. However, that's a piece of information we could know concretely if we cached it for later usage, so let's start off there. We're going to be building a mixin class just like we did last time:

from django.db import models

class InheritanceMixIn(models.Model):
    _class = models.CharField(max_length=100)

    class Meta:
        abstract = True

So now we have a simple abstract model that the base of our inheritance trees can subclass that has a field for caching which subclass we are. Now let's add a method to actually cache it and retrieve the subclass:

from django.db import models
from django.db.models.fields import FieldDoesNotExist
from django.db.models.related import RelatedObject

class InheritanceMixIn(models.Model):
    ...
    def save(self, *args, **kwargs):
        if not self.id:
            parent = self._meta.parents.keys()[0]
            subclasses = parent._meta.get_all_related_objects()
            for klass in subclasses:
                if isinstance(klass, RelatedObject) and klass.field.primary_key \
                    and klass.opts == self._meta:
                    self._class = klass.get_accessor_name()
                    break
        return super(InheritanceMixIn, self).save(*args, **kwargs)

    def get_object(self):
        try:
            if self._class and self._meta.get_field_by_name(self._class)[0].opts != self._meta:
                return getattr(self, self._class)
        except FieldDoesNotExist:
            pass
        return self

Our save method is where all the magic really happens. First, we make sure we're only doing this caching the first time a model is saved. Then we get the first parent class we have (this means this probably won't play nicely with multiple inheritance; that's unfortunate, but it's not as common a use case), then we get all the related objects this class has (this includes the reverse relationships the subclasses provide). Then for each of the subclasses, if it is a RelatedObject, and it is a primary key on its model, and the class it points to is the same as us, we cache the accessor name on the model, break out, and do the normal save procedure.

Our get_object function is pretty simple: if we have our class cached, and the model we are cached as isn't of the same type as ourselves, we get the attribute with the subclass name and return it; otherwise we are the last descendant and just return ourselves. There is one (possibly quite large) caveat here: if our inheritance chain is more than one level deep (that is to say, our subclasses have subclasses) then this won't return those objects correctly. The class is actually cached correctly, but since the top-level object doesn't have an attribute by the name of the 2nd-level subclass it doesn't return anything. I believe this can be worked around, but I haven't found a way yet. One idea would be to actually store the full ancestor chain in the CharField, comma separated, and then just traverse it.
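
A quick sketch of that traversal idea (untested, and it would need the matching save-side logic to store the full chain):

from django.core.exceptions import ObjectDoesNotExist

def get_object(self):
    # _class would hold something like "restaurant,fancy_restaurant"
    obj = self
    for accessor in self._class.split(","):
        try:
            obj = getattr(obj, accessor)
        except (AttributeError, ObjectDoesNotExist):
            break
    return obj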

There is one thing we can do to make this even easier, which is to have instances automatically become the correct subclass when they are pulled in from the DB. This does have an overhead: pulling in a queryset with n items guarantees O(n) queries. This can be improved (just as it was for the previous solution) by ticket #7270, which allows select_related to traverse reverse relationships. In any event, we can write a metaclass to handle this for us automatically:

from django.db import models
from django.db.models.base import ModelBase
from django.db.models.fields import FieldDoesNotExist
from django.db.models.related import RelatedObject

class InheritanceMetaclass(ModelBase):
    def __call__(cls, *args, **kwargs):
        obj = super(InheritanceMetaclass, cls).__call__(*args, **kwargs)
        return obj.get_object()

class InheritanceMixIn(models.Model):
    __metaclass__ = InheritanceMetaclass
    ...

Here we've created a fairly trivial metaclass that subclasses the default one Django uses for its models. The only method we've written is __call__; on a metaclass, __call__ handles the instantiation of an object, so it is what invokes __init__. What we do is whatever the default __call__ does, so that we get an instance as normal, then we call the get_object() method we wrote earlier and return the result, and that's all.
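
Putting it together, usage looks something like this sketch, reusing the Place/Restaurant models from my previous post (field values here are made up):

class Place(InheritanceMixIn):
    name = models.CharField(max_length=50)

class Restaurant(Place):
    serves_pizza = models.BooleanField()

Restaurant.objects.create(name="Lucia's", serves_pizza=True)
obj = Place.objects.get(name="Lucia's")
# The metaclass's __call__ handed back the subclass, not the Place.
assert isinstance(obj, Restaurant)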

We've now looked at 2 ways to handle polymorphism, with this way being more efficient in all cases (ignoring the overhead of having the extra CharField). However, it still isn't totally efficient, and it fails in several edge cases. Whether automating the handling of something like this is a good idea needs to be considered on a project-by-project basis, as the extra queries can be a large overhead; however, they may not be avoidable, in which case automating this is probably advantageous.


Building a Magic Manager

Posted January 31st, 2009. Tagged with models, django, orm, python.

A very common pattern in Django is to create methods on a manager to abstract some usage of one's data. Some people take a second step and actually create a custom QuerySet subclass with these methods and have their manager proxy these methods to the QuerySet; this pattern is seen in Eric Florenzano's Django From the Ground Up screencast. However, this requires a lot of repetition; it would be far less verbose if we could just define our methods once and have them available to us on both our managers and QuerySets.

Django's manager class has one hook for providing the QuerySet, so we'll start with this:

from django.db import models

class MagicManager(models.Manager):
    def get_query_set(self):
        qs = super(MagicManager, self).get_query_set()
        return qs

Here we have a very simple get_query_set method; it doesn't do anything but return its parent's queryset. Now we need to actually get the methods defined on our class onto the queryset:

class MagicManager(models.Manager):
    def get_query_set(self):
        qs = super(MagicManager, self).get_query_set()

        class _QuerySet(qs.__class__):
            pass

        for attr in dir(self):
            if attr.startswith('__') or hasattr(_QuerySet, attr):
                continue
            method = getattr(self, attr)
            # Only copy actual methods, and copy the underlying function
            # (im_func) rather than the bound method, so that when it's
            # called on a queryset, self is the queryset and chaining
            # refines the current queryset instead of starting over from
            # the manager.
            if callable(method) and hasattr(method, 'im_func'):
                setattr(_QuerySet, attr, method.im_func)
        qs.__class__ = _QuerySet
        return qs

The trick here is that we dynamically create a subclass of whatever class the call to our parent's get_query_set method returns; then we take each attribute on ourself, and if the queryset doesn't have an attribute by that name, and that attribute is a method, we assign its underlying function to our QuerySet subclass. Finally we set the __class__ attribute of the queryset to be our QuerySet subclass. The reason this works is that when Django chains queryset methods it makes the copy of the queryset have the same class as the current one, so anything we add to our manager will not only be available on the immediately following queryset, but on any that follow due to chaining.
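
For example, a sketch of a manager built on top of this (the model and its fields are made up):

class PostManager(MagicManager):
    def published(self):
        return self.filter(is_published=True)

class Post(models.Model):
    title = models.CharField(max_length=100)
    is_published = models.BooleanField(default=False)

    objects = PostManager()

# published() is defined once, but works on the manager and on any
# queryset in a chain:
Post.objects.published()
Post.objects.filter(title__icontains="django").published()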

Now that we have this we can simply subclass it to add methods, as in the sketch above, and then add it to our models like a regular manager. Whether this is a good idea is debatable: on the one hand, having to write methods twice is a gross violation of Don't Repeat Yourself; on the other, this is exceptionally implicit, which is a major violation of The Zen of Python.


Optimizing a View

Posted January 19th, 2009. Tagged with python, compile, models, django, orm.

Lately I've been playing with a bit of a fun side project. I have about a year and a half's worth of my own chatlogs with friends (about 65,000 messages total) and I've been playing around with them to find interesting statistics. One facet of my communication with my friends is that we link each other lots of things, and we can always tell when someone is linking something that we've already seen. So I decided an interesting bit of information would be to see who is the worst offender.

So we want to write a function that returns the number of items each person has relinked, excluding items they themselves linked. I started off with the simplest implementation I could, and this was the end result:

from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message

def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(int)
    for message in Message.objects.all().order_by('-time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                if Message.objects.filter(time__lt=message.time).filter(message__contains=word).exclude(speaker=message.speaker).count():
                    links[message.speaker] += 1
    links = sorted(links.iteritems(), key=itemgetter(1), reverse=True)
    return links

Here I iterated over the messages, and for each one I went through each of the words; if any of them started with http (my definition of a link for these purposes) I checked to see if it had ever been linked before by someone other than the author of the current message.

This took about 4 minutes to execute on my dataset, and it executed about 10,000 SQL queries. This is clearly unacceptable: you can't have a view that takes that long to render, or that hits your DB that hard. Even with aggressive caching this would have been unmaintainable. Further, this algorithm is O(n**2) or thereabouts, so as my dataset grew this would have gotten worse quadratically.

By changing this around however I was able to achieve far better results:

from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message

def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(set)
    counts = defaultdict(int)
    for message in Message.objects.all().filter(message__contains="http").order_by('time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                if any([word in links[speaker] for speaker in links if speaker != message.speaker]):
                    counts[message.speaker] += 1
                links[message.speaker].add(word)
    counts = sorted(counts.iteritems(), key=itemgetter(1), reverse=True)
    return counts

Here what I do is go through each of the messages which contain the string "http" (this is already a huge advantage, since it means we process only about 1/6 of the messages in Python). For each message we go through each of the words in it, and for each word that is a link we check whether any other person has said it, by looking in the caches we maintain in Python; if someone has, we increment the current speaker's count. Finally we add the link to that person's cache.

By comparison this executes in .3 seconds, executes only 1 SQL query, and it will scale linearly (as well as is possible). For reference, both of these functions are compiled using Cython. That ultimately takes almost no work, and for computationally heavy operations it can provide a huge boon.


PyCon '09, Here I come!

Posted December 15th, 2008. Tagged with alchemy, gae, object, sql, django, orm, web2py.

This past year I attended PyCon 2008 in Chicago, which was a tremendous conference. I had a chance to meet people I knew from the community, listen to some amazing talks, meet new people, and get to sprint. As a result of this tremendous experience I decided for this year to submit a talk proposal. I found out just a few minutes ago that my proposal has been accepted.

I proposed a panel on "Object Relational Mapper Philosophies and Design Decisions". This panel is going to look at the design decisions each of several ORMs made, and what philosophies they had, both with respect to their public APIs and their internal code design. Participating in the panel will be:
  • Jacob Kaplan-Moss, representing Django
  • Ian Bicking, representing SQLObject
  • Mike Bayer, representing SQLAlchemy
  • Guido van Rossum, representing Google App Engine
  • Dr. Massimo Di Pierro, representing web2py

I'm tremendously honored to be able to moderate a panel at PyCon, especially with these five individuals. They are all incredibly smart, and they each bring a different insight and perspective to this panel.

PyCon is a great conference and I would encourage anyone who can to attend.


Playing with Polymorphism in Django

Posted December 5th, 2008. Tagged with python, models, django, internals, orm.

One of the most common requests from people using inheritance in Django is to have a queryset from the baseclass return instances of the derived models, instead of instances of the baseclass, as you might see with polymorphism in other languages. This is a leaky abstraction of the fact that our Python classes are actually representing rows in separate tables in a database. Django itself doesn't do this, because it would require expensive joins across all derived tables, which the user probably doesn't want in all situations. For now, however, we can create a function that, given an instance of the baseclass, returns an instance of the appropriate subclass; be aware that this will perform up to k queries, where k is the number of subclasses we have.

First let's set up some test models to work with:

from django.db import models

class Place(models.Model):
    name = models.CharField(max_length=50)

    def __unicode__(self):
        return u"%s the place" % self.name


class Restaurant(Place):
    serves_pizza = models.BooleanField()

    def __unicode__(self):
        return "%s the restaurant" % self.name

class Bar(Place):
    serves_wings = models.BooleanField()

    def __unicode__(self):
        return "%s the bar" % self.name

These are some fairly simple models that represent a common inheritance pattern. Now what we want to do is be able to get an instance of the correct subclass for a given instance of Place. To do this we'll create a mixin class, so that we can reuse this with other classes.

class InheritanceMixIn(object):
    def get_object(self):
        ...

class Place(models.Model, InheritanceMixIn):
    ...

So what do we need to do in our get_object method? Basically we need to loop over each of the subclasses, try to get the correct attribute, and return it if it's there; if none of them are there, we should just return ourself. We start by looping over the fields:

class InheritanceMixIn(object):
    def get_object(self):
        for f in self._meta.get_all_field_names():
            field = self._meta.get_field_by_name(f)[0]

_meta is where Django stores lots of the internal data about a model, so we get all of the field names; this includes the names of the reverse descriptors that related models provide. Then we get the actual field for each of these names. Now that we have each of the fields we need to test whether it's one of the reverse descriptors for the subclasses:

from django.db.models.related import RelatedObject

class InheritanceMixIn(object):
    def get_object(self):
        for f in self._meta.get_all_field_names():
            field = self._meta.get_field_by_name(f)[0]
            if isinstance(field, RelatedObject) and field.field.primary_key:

We first test whether the field is a RelatedObject, and if it is, we see if the field on the other model is a primary key, which it will be if it's a subclass (or technically any one-to-one that is a primary key). Lastly we need to find what the name of that attribute is on our model and try to return it:

class InheritanceMixIn(object):
    def get_object(self):
        for f in self._meta.get_all_field_names():
            field = self._meta.get_field_by_name(f)[0]
            if isinstance(field, RelatedObject) and field.field.primary_key:
                try:
                    return getattr(self, field.get_accessor_name())
                except field.model.DoesNotExist:
                    pass
        return self

We try to return the attribute, and if it raises a DoesNotExist exception we move on to the next one; if none of them return anything, we just return ourself.
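
Using it then looks like this:

for place in Place.objects.all():
    obj = place.get_object()
    # obj is now a Restaurant, a Bar, or the original Place
    print unicode(obj)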

And that's all it takes. This won't be super efficient, since for a queryset of n objects this will take O(n*k) queries given k subclasses. Ticket 7270 deals with allowing select_related() to work across reverse one-to-one relations as well, which would allow one to optimise this, since the subclasses would already have been fetched from the database.


Fixing up our identity mapper

Posted December 1st, 2008. Tagged with foreignkey, models, django, internals, orm.

The past two days we've been looking at building an identity mapper in Django. Today we're going to implement some of the improvements I mentioned yesterday.

The first improvement we're going to do is it have it execute the query as usual and just cache the results, to prevent needing to execute additional queries. This means changing the __iter__ method on our queryset class:

def __iter__(self):
    for obj in self.iterator():
        try:
            yield get_from_cache(self.model, obj.pk)
        except KeyError:
            cache_instance(obj)
            yield obj

Now we just iterate over self.iterator(), which is a slightly lower-level interface to a queryset's iteration; it bypasses all the caching that normally occurs (this means that, for now at least, if we iterate over our queryset twice we actually execute two queries, whereas Django would normally do just one). However, overall this is a big win, since before, if an item wasn't in the cache, we would do an extra query for it.

The next improvement I proposed was to use Django's built-in caching interfaces. However, this won't work, because the built-in locmem cache backend pickles everything going into the cache and unpickles everything coming out, so we'd end up with different objects (which defeats the point of this exercise).
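
You can see the problem with a quick sketch, reusing the key_for_instance helper from the last post:

from django.core.cache import cache

post = Post.objects.get(pk=1)
cache.set(key_for_instance(post), post)
# locmem pickles on set and unpickles on get, so what comes back is an
# equal-but-distinct object -- exactly what an identity map must avoid.
assert cache.get(key_for_instance(post)) is not post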

The last improvement we can make is to have this work on related objects for which we already know the primary key. The obvious route is to start hacking in django.db.models.fields.related; however, as I've mentioned in a previous post, this area of the code is a bit complex. But if we know a little bit about how this query is executed we can do the optimisation in a far simpler way: as it turns out, the related-object descriptor simply does the query using the default manager's get method. Therefore, we can special case this occurrence in order to optimise it. We also have to make a slight change to our manager, as by default the manager won't be used on related-object queries:

class CachingManager(Manager):
    use_for_related_fields = True
    def get_query_set(self):
        return CachingQuerySet(self.model)

class CachingQuerySet(QuerySet):
    ...
    def get(self, *args, **kwargs):
        if len(kwargs) == 1:
            k = kwargs.keys()[0]
            if k in ('pk', 'pk__exact', '%s' % self.model._meta.pk.attname, '%s__exact' % self.model._meta.pk.attname):
                try:
                    return get_from_cache(self.model, kwargs[k])
                except KeyError:
                    pass
        clone = self.filter(*args, **kwargs)
        objs = list(clone[:2])
        if len(objs) == 1:
            return objs[0]
        if not objs:
            raise self.model.DoesNotExist("%s matching query does not exist."
                             % self.model._meta.object_name)
        raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
                % (self.model._meta.object_name, len(objs), kwargs))

As you can see, we just add one line to the manager, and a few lines to the beginning of the get() method. Basically our logic is: if there is only one kwarg to the get() method, and it is a query on the primary key of the model, we try to return our cached instance; otherwise we fall back to executing the query.

And with this we've improved the efficiency of our identity map. There are almost certainly more places for optimisations, but now we have an identity map in very few lines of code.


A Few More Thoughts on the Identity Mapper

Posted December 1st, 2008. Tagged with foreignkey, models, django, internals, orm.

It's late, and my flight was delayed for several hours, so today is going to be another quick post. With that noted, here are a few thoughts on the identity mapper:
  • We can optimize it to actually execute fewer queries by having it run the query as usual, and then use the primary key to check the cache, else cache the instance we already have.
  • As Doug points out in the comments, there are built-in caching utilities in Django we should probably be taking advantage of. The only qualification is that whatever cache we use needs to be in memory and in process.
  • The cache is actually going to be more efficient than I originally thought. On review of the source, the default manager is used for some related queries, so our manager will actually be used for those.
  • The next place to optimize will actually be on single related objects (foreign keys and one-to-ones). That's because we already have their primary key and so we can check for them in the cache without executing any SQL queries.

And lastly a small note: as you may have noticed, I've been doing the National Blog Everyday for a Month Month; since I started two days late, I'm going to be continuing on for another two days.


Building a simple identity map in Django

Posted November 29th, 2008. Tagged with models, django, orm.

In Django's ticket tracker lies ticket 17, the second-oldest open ticket. It proposes an optimisation: have instances of the same database object be represented by the same object in Python. Essentially that means for this code:

a = Model.objects.get(pk=3)
b = Model.objects.get(pk=3)

a and b would be the same object at the memory level. This can represent a large optimisation in memory usage if your application has the potential to have duplicate objects (for example, related objects). It is possible to implement a very simple identity map without touching the Django source at all.

The first step is to set up some very basic infrastructure; this is going to be almost identical to what Eric Florenzano does in his post, "Drop-dead simple Django caching".

We start with a few helper functions:

_CACHE = {}

def key_for_instance(obj, pk=None):
    if pk is None:
        pk = obj.pk
    return "%s-%s-%s" % (obj._meta.app_label, obj._meta.module_name, pk)

def get_from_cache(klass, pk):
    return _CACHE[key_for_instance(klass, pk)]

def cache_instance(instance):
    _CACHE[key_for_instance(instance)] = instance

We create our cache, which is a Python dictionary, a function to generate the cache key for an object, a function to get an item from the cache, and a function to cache an item. How these work should be pretty simple. Next we need to create some functions to make sure objects get updated in the cache.

from django.db.models.signals import post_save, pre_delete

def post_save_cache(sender, instance, **kwargs):
    cache_instance(instance)
post_save.connect(post_save_cache)

def pre_delete_uncache(sender, instance, **kwargs):
    try:
        del _CACHE[key_for_instance(instance)]
    except KeyError:
        pass
pre_delete.connect(pre_delete_uncache)

Here we set up two signal receivers, when an object is saved we cache it, and when one is deleted we remove it from the cache.

Now we want a way to use our cache the way we already use our connection to the database, this means implementing some sort of hook in a QuerySet, this looks like:

from django.db.models.query import QuerySet

class CachingQuerySet(QuerySet):
    def __iter__(self):
        obj = self.values_list('pk', flat=True)
        for pk in obj:
            try:
                yield get_from_cache(self.model, pk)
            except KeyError:
                instance = QuerySet(self.model).get(pk=pk)
                cache_instance(instance)
                yield instance

    def get(self, *args, **kwargs):
        clone = self.filter(*args, **kwargs)
        objs = list(clone[:2])
        if len(objs) == 1:
            return objs[0]
        if not objs:
            raise self.model.DoesNotExist("%s matching query does not exist."
                             % self.model._meta.object_name)
        raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
                % (self.model._meta.object_name, len(objs), kwargs))

We create a subclass of QuerySet and override its __iter__() and get() methods. By default __iter__ does a fair bit of heavy lifting to internally cache the results and allow the use of multiple iterators properly. We override this to do something simpler: we get the primary keys of each item in the queryset and iterate over them; if the object is in the cache we return it, otherwise we execute a database query to get it, and then cache it. We also override get() to make sure it makes use of the caching we just set up.

To use this on a model we need to create a simple manager:

class CachingManager(Manager):
    def get_query_set(self):
        return CachingQuerySet(self.model)

And then we can use this with our models:

class Post(models.Model):
    title = models.CharField(max_length=100)

    objects = CachingManager()

Post.objects.all()

Now all Posts accessed within the same thread will be cached using the strategy we've implemented.
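
Which gets us exactly the behavior ticket 17 asks for:

a = Post.objects.get(pk=3)
b = Post.objects.get(pk=3)
assert a is b  # one object in memory, not merely two equal ones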

This strategy will not save us database queries; indeed, in some cases it can result in many more queries. It is designed to save memory usage (and to be implemented as simply as possible). It can also be made far more useful by having related objects use this strategy as well: if Post had a foreign key to an author, it would be nice to have all post authors share the same instances, since even if you have a large queryset of Posts where all the Posts are unique, they are likely to have duplicate authors.


Other ORM Goodies

Posted November 29th, 2008. Tagged with models, django, orm.

In addition to the aggregate work, the GSOC student had time to finish ticket 7210, which adds support for expressions in filter() and update(). This means you'll be able to execute queries in the form of:

SELECT * FROM table WHERE height > width;

or similar UPDATE queries. This has a syntax similar to that of Q objects, using a new F object. So the above query would look like:

Model.objects.filter(height__gt=F('width'))

or an update query could look like:

Employee.objects.update(salary=F('salary')*1.1)

These objects support the full range of arithmetic operations. They are slated to be a part of Django 1.1.
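
And since F objects compose with constants and with each other, you can write things like the following (the fields here are made up for illustration):

# Give everyone with a bonus a raise equal to 10% of that bonus.
Employee.objects.filter(bonus__gt=0).update(salary=F('salary') + F('bonus') * 0.1)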


What aggregates are going to look like

Posted November 27th, 2008. Tagged with models, django, orm.

Prior to Django 1.0 there was a lot of discussion of what the syntax for doing aggregate queries would look like. Eventually a syntax was more or less agreed upon, and over the summer Nicolas Lara implemented it for the Google Summer of Code project, mentored by Russell Keith-Magee. This feature is considered a blocker for Django 1.1, so I'm going to outline what the syntax for these aggregates will be.

To facilitate aggregates, two new methods are being added to the queryset: aggregate and annotate. Aggregate is used to perform basic aggregation on the queryset itself, for example getting the MAX, MIN, AVG, COUNT, and SUM for a given field on the model. Annotate is used for getting information about a related model.

For example, if we had a Product model with a price field, we could get the maximum and minimum price by doing the following:

Product.objects.aggregate(Min('price'), Max('price'))

this will return something like this:

{'price__min': 23.45,
 'price__max': 47.89,
}

We can also give the results aliases, so they're easier to read (if no alias is provided it falls back to using fieldname__aggregate):

Product.objects.aggregate(max_price = Max('price'), min_price = Min('price'))
{'min_price': 23.45,
 'max_price': 47.89,
}

You can also do aggregate queries on related fields, but the idea is the same: return a single value for each aggregate.

In my opinion, annotate queries are far more interesting. Annotate queries let us represent queries such as "give me all of the Tags that more than 3 objects have been tagged with", which would look like:

Tag.objects.annotate(num_items=Count('tagged')).filter(num_items__gt=3)

This would return a normal queryset where each Tag object has an attribute named num_items, which is the Count() of all of tagged for it (I'm assuming tagged is a reverse foreign key to a model that represents a tagged relationship). Another query we might want to execute would be to see how many awards the authors at each author's publisher had won, which would look like:

Author.objects.annotate(num_publisher_awards=Count('publisher__authors__awards')).order_by('num_publisher_awards')

This is a little more complicated, but just like when using filter() we can chain the __ syntax. Also, as you've probably noticed, we can filter and order_by these annotated attributes the same as we can with regular fields.

If you're interested in seeing more of how this works, Nicolas Lara has written some documentation and doctests that you can see here. For now none of this is in the Django source tree yet, but there is a patch with the latest work on ticket 366.

Happy thanksgiving!


A timeline view in Django

Posted November 24th, 2008. Tagged with python, models, tips, django, orm.

One thing a lot of people want to do in Django is to have a timeline view that shows all the objects of a given set of models ordered by a common key. Unfortunately the Django ORM doesn't have a way of representing this type of query. There are a few techniques people use to solve this. One is to have all of the models inherit from a common baseclass that stores all the common information and has a method to get the actual object. The problem with this is that it could execute either O(N) or O(N*k) queries, where N is the number of items and k is the number of models: it's N if your baseclass stores the subtype on it, in which case you can directly grab the subclass, else it's N*k since you have to try each type. Another approach is to use a generic relation; this will also need O(N) queries, since you need to get the related object for each generic one. However, there's a better solution.

What we can do is get a queryset for each of the models we want to display (O(k) queries), each sorted on the common key, and then use a simple merge to combine all of these querysets into a single list, comparing on that key. While this technically may do more operations than the other methods, it does fewer database queries, and the database is often the most difficult portion of your application to scale.

Let's say we have 3 models: new tickets, changesets, and wiki page edits (what you see in a typical Trac install). We can get our querysets and then merge them like so:

def my_view(request):
    tickets = Ticket.objects.order_by('create_date')
    wikis = WikiEdit.objects.order_by('create_date')
    changesets = Changeset.objects.order_by('create_date')
    objs = merge(tickets, wikis, changesets, field='create_date')
    return render_to_response('my_app/template.html', {'objects': objs})

Now we just need to write our merge function:

def merge_lists(left, right, field=None):
    i, j = 0, 0
    result = []
    while i < len(left) and j < len(right):
        # Take whichever head item sorts first on the comparison field.
        if getattr(left[i], field) <= getattr(right[j], field):
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge(*querysets, **kwargs):
    field = kwargs.pop('field', None)
    if field is None:
        raise TypeError('you need to provide a key to do comparisons on')
    if len(querysets) == 1:
        return querysets[0]

    qs = [list(x) for x in querysets]
    q1, q2 = qs.pop(), qs.pop()
    result = merge_lists(q1, q2, field)
    for q in qs:
        result = merge_lists(result, q, field)
    return result

There might be a more efficient way to write our merge function, but for now it merges together an arbitrary number of querysets on a given key.

And that's all there is to it. If you see a good way to make the merge function more efficient, let me know; I would have liked to use Python's included heapq module, but I didn't see a way to use a custom comparison function with it.


Uncoupled code is good, but doesn't exist

Posted November 19th, 2008. Tagged with python, models, django, orm, turbogears.

Code should try to be as decoupled from the code it depends on as possible: I want my C++ to work with any compiler, I want my web framework to work with any ORM, I want my ORM to work with any database. While all of these are achievable goals, some of the decoupling people are searching for is simply not possible. At DjangoCon 2008 Mark Ramm made the argument that the Django community was too segregated from the Python community, both in terms of the community itself and the code; Django, for example, doesn't take enough advantage of WSGI-level middleware, and has an ORM unto itself. I believe some of these claims to be true, but I ultimately think the level of uncoupling some people want is simply impossible.

One of Django's biggest selling features has always been its automatically generated admin. The admin requires you to be using Django's models. Some people would like it to be decoupled. To them I ask: how? It's not as if Django's admin has a big if not isinstance(obj, models.Model): raise Exception; it simply expects whatever is passed to it to define the same API as it uses. And this gets at the larger concern: the Django admin is simply an application; it has no hooks within Django itself, it just happens to live in that namespace. The moment any application does Model.objects.all(), it's no longer ORM-agnostic; it has already assumed the usage of the Django ORM. All this means is that applications themselves are inextricably tied to a given ORM, templating language, and any other module they import; you quite simply can't write reasonable code that works just as well with two different modules unless they both define the same API.

Eric Florenzano wrote a great blog post yesterday about how Django could take better advantage of WSGI middleware, and he's absolutely correct. It makes no sense for a Django project to have its own special middleware for using Python's profiling modules when it can be done more generically a level up; all the code is in Python after all. However, there are also things that you can't abstract out like that, because they require knowledge of which components you are using: SQLAlchemy has one transaction model, Django has another.

The fact that an application is tied to the modules it uses is not an argument against it. A Django application is no more tightly coupled to Django's ORM and template system than a TurboGears application that uses SQLAlchemy and Mako is to those, which is to say: of course they're tied to them. They import those modules, they use them, and unless an alternative implements the same API you can't just swap it out. And that's not a bad thing.


Django Models - Digging a Little Deeper

Posted November 13th, 2008. Tagged with foreignkey, python, models, django, orm, metaclass.

For those of you who read my last post on Django models, you probably noticed that I skirted over a few details; specifically, for quite a few items I said we "added them to the new class". But what exactly does that entail? Here I'm going to look at the add_to_class method that's present on the ModelBase metaclass we looked at earlier, and the contribute_to_class method that's present on a number of classes throughout Django.

So first, the add_to_class method. This is called for each item we add to the new class, and what it does is: if that item has a contribute_to_class method, we call it with the new class and its name (the name it should attach itself to the new class as) as arguments; otherwise we simply set that attribute to that value on the new class. So for example with new_class.add_to_class('abc', 3), 3 doesn't have a contribute_to_class method, so we just do setattr(new_class, 'abc', 3).
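
In code, add_to_class is roughly this (condensed):

def add_to_class(cls, name, value):
    # Objects that know how to install themselves (fields, managers, Meta)
    # get to do so; everything else is a plain attribute assignment.
    if hasattr(value, 'contribute_to_class'):
        value.contribute_to_class(cls, name)
    else:
        setattr(cls, name, value)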

The contribute_to_class method is more common for things you set on your class, like Fields or Managers. The contribute_to_class method on these objects is responsible for doing whatever is necessary to add the object to the new class and do its setup. If you remember from my first blog post about User foreign keys, we used the contribute_to_class method to add a new manager to our class. Here we're going to look at what a few of the built-in contribute_to_class methods do.

The first case is a manager. The manager sets its model attribute to be the model it's added to. Then it checks whether the model already has a _default_manager attribute; if it doesn't, or if its own creation counter is lower than that of the current default manager, it sets itself as the default manager on the new class. The creation counter is essentially a way for Django to keep track of which manager was added to the model first. Lastly, if this is an abstract model, it adds itself to the abstract_managers list in _meta on the model.

The next case is if the object is a field; different fields actually do slightly different things, but first we'll cover the general field case. It, too, first sets a few of its internal attributes: what its name is on the new model, its column name in the db, and its verbose_name if one isn't explicitly provided. Next it calls add_field on the model's _meta to add itself there. Lastly, if the field has choices, it sets the get_FIELD_display method on the class.

Another case is file fields, which do everything a normal field does, plus some more: they add a FileDescriptor to the new class, and they also add a signal receiver so that when an instance of the model is deleted the file also gets deleted.

The final case is related fields. This is also the most complicated case. I won't describe exactly what this code does, but its biggest responsibility is to set up the reverse descriptors on the related model; those are the nice things that let you do author_obj.books.all().

Hopefully this gives you a good idea of what to do if you wanted to create a new field like object in Django. For another example of using these techniques, take a look at the generic foreign key field in django.contrib.contenttypes, here.


More Laziness with Foreign Keys

Posted November 4th, 2008. Tagged with foreignkey, models, django, orm.

Yesterday we looked at building a field to make the process of getting a ForeignKey to the User model simpler, and to provide us with some useful tools, like the manager. But this process can be generalized and made more robust. First we want a lazy ForeignKey field for all models (be careful not to confuse the term lazy; here I use it to refer to the fact that I am a lazy person, not the fact that foreign keys are lazily loaded).

A more generic lazy foreign key field might look like:

from django.db.models import ForeignKey, Manager

class LazyForeignKey(ForeignKey):
    def __init__(self, *args, **kwargs):
        model = kwargs.get('to')
        if model is None:
            model = args[0]
        try:
            name = model._meta.object_name.lower()
        except AttributeError:
            # The target was given as a string like "auth.User".
            name = model.split('.')[-1].lower()
        self.manager_name = kwargs.pop('manager_name', 'for_%s' % name)
        super(LazyForeignKey, self).__init__(*args, **kwargs)

    def contribute_to_class(self, cls, name):
        super(LazyForeignKey, self).contribute_to_class(cls, name)

        class MyManager(Manager):
            def __call__(self2, obj):
                return cls._default_manager.filter(**{self.name: obj})

        cls.add_to_class(self.manager_name, MyManager())

As you can see, a lot of the code is the same as before. Most of the new code is in getting the model's name, either through _meta or through the last part of the string (i.e. User in "auth.User"). And now you will have a manager on your class, named either for_X, where X is the lowercased name of the model the foreign key points to, or whatever the manager_name kwarg says.

So if your model has this:

teacher = LazyForeignKey(Teacher)

You would be able to do:

MyModel.for_teacher(Teacher.objects.get(id=3))

That's all for today. Since tonight is election night, tomorrow I'll probably post about my application election-sim, and about PyGTK and PyProcessing (aka multiprocessing).


Lazy User Foreign Keys

Posted November 3rd, 2008. Tagged with foreignkey, models, django, orm.

A very common pattern in Django is for models to have a foreign key to django.contrib.auth.User for the owner (or submitter, or whatever other relation with User), and then to have views that filter this down to the related objects for a specific user (often the currently logged-in user). If we think ahead, we can make a manager with a method to filter down to a specific user. But since we are really lazy, we are going to make a field that automatically generates the foreign key to User and automatically gives us a manager to filter for a specific User, and we can reuse this for all types of models.

So what does the code look like:

from django.db.models import ForeignKey, Manager

from django.contrib.auth.models import User

class LazyUserForeignKey(ForeignKey):
    def __init__(self, **kwargs):
        kwargs['to'] = User
        self.manager_name = kwargs.pop('manager_name', 'for_user')
        super(LazyUserForeignKey, self).__init__(**kwargs)

    def contribute_to_class(self, cls, name):
        super(LazyUserForeignKey, self).contribute_to_class(cls, name)

        class MyManager(Manager):
            def __call__(self2, user):
                return cls._default_manager.filter(**{self.name: user})

        cls.add_to_class(self.manager_name, MyManager())

So now, what does this do?

We are subclassing ForeignKey. In __init__ we make sure to is set to User, and we also set self.manager_name to either the manager_name kwarg, if provided, or 'for_user'. contribute_to_class gets called by the model metaclass to add each item to the model itself. So here we call the parent method, to get the ForeignKey itself set up on the model, and then we create a new subclass of Manager. We define a __call__ method on it; this lets us call an instance as if it were a function. __call__ returns the QuerySet you'd get by filtering the default manager for the class, where the user field equals the given user. Then we add the manager to the class with the name we decided on earlier.

And that's all. Now we can do things like:

MyModel.for_user(request.user)

Next post we'll probably look at making this more generic.
