Python: hash, id and Dictionary Order
TL;DR - really, seriously, no kidding, don’t even think about relying on dictionary traversal order, because it can change as a result of seemingly unrelated code changes.
Today I ran into one of the most bizarre behaviors in Python I’ve ever seen. A co-worker was debugging some code and called me to help. There was a class that polled a few database tables for specific entries and for some reason, it failed to find them (although they existed). So, we added a breakpoint with pdb:
While we were debugging the code, it seemed to work. Huh, we thought to ourselves, let’s remove the pdb statement and see if it works now. It didn’t. Our program only worked with the pdb statement.
Here’s a watered-down version of the code we were debugging:
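What follows is a reconstruction sketch of the shape of that code rather than the exact listing. Only TableA, TableC and poll come from the story itself; TableB, the `entries` lists, the `Poller` class and the `poll_buggy` / `poll_fixed` split are hypothetical stand-ins for illustration.

```python
class TableA(object):
    entries = []            # hypothetical: rows currently in this table


class TableB(object):
    entries = []


class TableC(object):
    entries = ["wanted"]    # the entry we are looking for lives here


class Poller(object):
    def __init__(self):
        # Keys are classes, so their hashes default to a value derived from id().
        self.tables = {TableA: "a", TableB: "b", TableC: "c"}

    def poll_buggy(self, wanted):
        for table, _name in self.tables.items():
            found = [e for e in table.entries if e == wanted]
            return found    # BUG: returns after the first table, found or not

    def poll_fixed(self, wanted):
        for table, _name in self.tables.items():
            found = [e for e in table.entries if e == wanted]
            if found:       # only return once something was actually found
                return found
        return []
```

With `poll_buggy`, whether the entry is found depends entirely on which table the dictionary yields first; `poll_fixed` checks every table regardless of order.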
The real bug was the return statement - we returned regardless of whether we found the entries or not, while we should have returned only if we found them (and otherwise continued the loop to the other entries), but that’s a side issue for now. The result of this bug was that only one table was checked when we called poll. Specifically, the entry we were looking for was in TableC, but the code only polled TableA. When we added the pdb statement, however, it only polled TableC (which seemed to solve the bug). So the order in which the dictionary returned the items was consistent as long as no code changes were introduced. Yep, we also tried to add other function calls like sleep(1) with the same result.
So let’s do some testing. I took the following piece of code (very similar to the code above) and ran it several thousand times:
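A minimal sketch of such a test script, assuming eight empty classes named A..H used as dictionary keys. The helper names `traversal_order` and `low_bits` are made up for illustration. Note that on CPython 3.7 and later, dictionaries preserve insertion order, so the hash-driven ordering discussed in this post only shows up on older interpreters such as Python 2.

```python
class A(object): pass
class B(object): pass
class C(object): pass
class D(object): pass
class E(object): pass
class F(object): pass
class G(object): pass
class H(object): pass

CLASSES = [A, B, C, D, E, F, G, H]


def traversal_order():
    """Build a dict keyed by the classes and return the names in traversal order."""
    table = dict((cls, cls.__name__) for cls in CLASSES)
    # .items() works on Python 2 and 3; on Python 2, iteritems() walks the same order.
    return [name for _cls, name in table.items()]


def low_bits(obj, bits=3):
    """Return the lowest `bits` bits of the object's hash."""
    return hash(obj) & ((1 << bits) - 1)


if __name__ == "__main__":
    print(traversal_order())
    print([low_bits(cls) for cls in CLASSES])
```

Running the script repeatedly and comparing the printed lines is one way to check whether the order and the low hash bits stay fixed between runs.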
The result is that in every run the order is static and happens to be:
That kind of surprises me, so let’s continue with the testing and add some redundant code to the end of our script:
Now see what happens - again, the order is static but it changes from the first version of our code. Now it looks like this:
So let’s figure out what was happening here.
To understand what determines the order of the items in iteritems() we need to first understand a bit about the internals of CPython’s dict implementation [attribution]:
Python dictionaries are implemented as hash tables.
When a new dictionary is initialized it starts with 8 slots. When adding entries to the table, CPython uses a mask to take the x lowest bits of the entry’s hash and uses these bits to determine the entry’s initial location. When the dictionary has 8 slots, x is 3.
When the initial slot for an entry is taken, CPython uses (pseudo) random probing to select a new location, which uses the other bits of the entry’s hash.
The default CPython implementation of the __hash__ method is id(entry)/16.
In turn, the default id implementation is simply the object’s memory address.
The order of the items that iteritems returns is the same as their order in the slots - slot 0 first, then slot 1, etc.
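The rules above can be sketched as a toy model. This is a simplified illustration, not CPython’s actual code: the function names are invented, hashes are assumed to be non-negative, the table never resizes, and the probing recurrence (`i = 5*i + perturb + 1`, with `perturb` shifted right by 5 bits at each step) is written from memory of the scheme in CPython’s dictobject.c, so treat it as an assumption.

```python
def default_hash(address):
    # The default __hash__ as described above: the memory address divided by 16.
    return address // 16


def initial_slot(h, num_slots=8):
    mask = num_slots - 1          # 8 slots -> mask 0b111 -> lowest 3 bits
    return h & mask


def traversal_order(named_hashes, num_slots=8):
    """Insert (name, hash) pairs into a toy table; return the names in slot order."""
    if len(named_hashes) >= num_slots:
        raise ValueError("toy model does not resize; keep at least one free slot")
    mask = num_slots - 1
    slots = [None] * num_slots
    for name, h in named_hashes:
        i = h & mask
        perturb = h
        while slots[i & mask] is not None:   # slot taken: probe elsewhere
            i = 5 * i + perturb + 1          # mixes in the higher bits of the hash
            perturb >>= 5
        slots[i & mask] = name
    # Traversal walks slot 0 first, then slot 1, and so on.
    return [name for name in slots if name is not None]
```

Feeding the same names with different low bits, for example `[("A", 1), ("B", 0)]` versus `[("A", 0), ("B", 1)]`, produces different traversal orders in this model, which is the effect this post is about.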
I found that in every run, every class’ hash was different. That makes sense, because memory addresses depend on what is available to the Python process. However, I also found that in every single run the lower 3 bits of every class’ hash remained the same. These are the values for classes A..H respectively with the first version of the code:
That’s the reason their order remained the same. Can you guess what happens with the second version? The lower three bits have different values than the first version! And this is consistent across multiple runs. And remember, not only did we only add redundant lines of code - it was also after the classes were instantiated, so naively we wouldn’t expect it to have any effect! Here are the values - again, for classes A..H respectively:
So now the hashes are changed in such a way that the lower bits changed as well. Obviously, this also changed the order in which the dict is traversed.
Why is this happening? Good question. Since this gets really specific to Python’s implementation, I can only make an educated guess: Python allocates a segment in memory for code, and that segment is somehow 128-byte-aligned, so that the lower 3 bits of the hash (which is (id/16)%8, as we recall) are always the same. I would love it if someone more intimate with CPython’s implementation could comment and let us know what’s really happening here.
In any case, the final upshot of this is: I know you’ve already been warned not to rely on dictionary order, but even if you thought to yourself “well, this is a small script and I can empirically see that the order remains static so I can just rely on it this time” - don’t. The dictionary order can change for reasons that are neither obvious nor expected. If you do want a dictionary with a predictable and reliable order, use OrderedDict instead.
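A minimal sketch of that, using `collections.OrderedDict` from the standard library; the string keys and values are placeholders for illustration.

```python
from collections import OrderedDict

# Entries are traversed in the order they were inserted,
# independent of the keys' hashes or memory addresses.
tables = OrderedDict()
tables["TableA"] = "first"
tables["TableB"] = "second"
tables["TableC"] = "third"

order = list(tables.keys())
```

Here `order` follows insertion order, so code that walks `tables` visits TableA before TableB before TableC on every run.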