Dictionaries
Dictionaries are a lot more complex than lists and what is under the hood is both terrifying and beautiful at the same time. If you want to review the code for the dictionary object it can be found here: https://github.com/python/cpython/blob/main/Objects/dictobject.c That said, it's nearly 7500 lines and my C is not good enough to unravel the mystery. We can, however, empirically understand dictionaries by watching them operate and knowing a few facts gleaned from the link above. So let's start.
To create and empty dictionary and check its memory usage we need this code:
from sys import getsizeof
from pympler import asizeof
a = {}
print(f'Object size: {getsizeof(a)}, Total size: {asizeof.asizeof(a)}.')
The output is:
Object size: 64, Total size: 64.
This is an empty dictionary with no extra data in memory. Now if we add a single key:value pair we get this:
from sys import getsizeof
from pympler import asizeof
a = {}
a[0] = 1
print(f'Object size: {getsizeof(a)}, Total size: {asizeof.asizeof(a)}.')
The output is:
Object size: 224, Total size: 288.
We know from the investigation of lists that python now resizes the dictionary to allow for future expansion. We can see from line 116 in the dictobject.c code that this minimum size is 8 so there are 8 slots available with one filled. the difference between the object size and the total size is the memory required for the data. In this case it takes 288 - 224 = 64 bytes to store the key:value pair
Growth
Growth is covered in the dictobject.c document between lines 561 and 638. Importantly, line 570 states that usable fractions between 1/2 and 2/3 work well and lines 610 and 619 state the growth rate is used * 3. Spoiler alert, the testing that follows does not quite fit that pattern but line 611 states "... that dicts double in size when growing without deletions ...".
We start by knowing that the initial dictionary has 8 slots, so we start filling until the dictionary resizes.
from sys import getsizeof
from pympler import asizeof
a = {}
key = 0
value = key + 1
while len(a) < 6:
a[key] = value
print(f'Object size: {getsizeof(a)}, Total size: {asizeof.asizeof(a)}, No. of k:v pairs: {len(a)}.')
key += 2
print(a)
Output:
Object size: 224, Total size: 288, No. of k:v pairs: 1.
Object size: 224, Total size: 352, No. of k:v pairs: 2.
Object size: 224, Total size: 416, No. of k:v pairs: 3.
Object size: 224, Total size: 480, No. of k:v pairs: 4.
Object size: 224, Total size: 544, No. of k:v pairs: 5.
Object size: 352, Total size: 736, No. of k:v pairs: 6.
{0: 1, 2: 3, 4: 5, 6: 7, 8: 9, 10: 11}
Note that it is the addition of the 6th pair that causes the growth. If the usable fraction is set to 2/3 (and I think it is) then.
The addition of the 6th pair pushes the load over 0.6667 and the dictionary resizes. We cannot be sure, as yet, how big the new dictionary is. 5 * 3 = 15 and 6 * 3 = 18, neither of these seems like a nice (binary) number. While doubling to 16 does look more reasonable.
We can test our theory by looking for the next resize. If the dictionary has resized to 16 and if the usable fraction is 0.6667 then we should see the next resize occur when the 11th k:v pair is added. Note that 10/16 = 0.625 (below 2/3) and 11/16 = 0.6875 (above 2/3). We test the theory:
from sys import getsizeof
from pympler import asizeof
a = {}
key = 0
while len(a) < 12:
value = key + 1
a[key] = value
print(f'Object size: {getsizeof(a)}, Total size: {asizeof.asizeof(a)}, No. of k:v pairs: {len(a)}.')
key += 2
print(a)
Output:
Object size: 224, Total size: 288, No. of k:v pairs: 1.
Object size: 224, Total size: 352, No. of k:v pairs: 2.
Object size: 224, Total size: 416, No. of k:v pairs: 3.
Object size: 224, Total size: 480, No. of k:v pairs: 4.
Object size: 224, Total size: 544, No. of k:v pairs: 5.
Object size: 352, Total size: 736, No. of k:v pairs: 6.
Object size: 352, Total size: 800, No. of k:v pairs: 7.
Object size: 352, Total size: 864, No. of k:v pairs: 8.
Object size: 352, Total size: 928, No. of k:v pairs: 9.
Object size: 352, Total size: 992, No. of k:v pairs: 10.
Object size: 632, Total size: 1336, No. of k:v pairs: 11.
Object size: 632, Total size: 1400, No. of k:v pairs: 12.
{0: 1, 2: 3, 4: 5, 6: 7, 8: 9, 10: 11, 12: 13, 14: 15, 16: 17, 18: 19, 20: 21, 22: 23}
This doubling when the usage reaches 2/3 can be checked with this code:
from sys import getsizeof
from pympler import asizeof
a = {}
key = 0
slots = 8
while len(a) < 5000:
initial_size = getsizeof(a)
value = key + 1
a[key] = value
final_size = getsizeof(a)
if final_size > initial_size:
print(f'Object size: {getsizeof(a)}, Total size: {asizeof.asizeof(a)}, No. of k:v pairs: {len(a)}.')
print(f'Initial usage: {(len(a)-1) / slots}, Final usage: {len(a) / slots}')
if len(a) > 1:
slots *= 2
print(f'Slots increased to {slots}.')
print(f'')
key += 2
The output is:
Object size: 224, Total size: 288, No. of k:v pairs: 1.
Initial usage: 0.0, Final usage: 0.125
Slots increased to 8.
Object size: 352, Total size: 736, No. of k:v pairs: 6.
Initial usage: 0.625, Final usage: 0.75
Slots increased to 16.
Object size: 632, Total size: 1336, No. of k:v pairs: 11.
Initial usage: 0.625, Final usage: 0.6875
Slots increased to 32.
Object size: 1168, Total size: 2576, No. of k:v pairs: 22.
Initial usage: 0.65625, Final usage: 0.6875
Slots increased to 64.
Object size: 2264, Total size: 5016, No. of k:v pairs: 43.
Initial usage: 0.65625, Final usage: 0.671875
Slots increased to 128.
Object size: 4688, Total size: 10192, No. of k:v pairs: 86.
Initial usage: 0.6640625, Final usage: 0.671875
Slots increased to 256.
Object size: 9304, Total size: 20248, No. of k:v pairs: 171.
Initial usage: 0.6640625, Final usage: 0.66796875
Slots increased to 512.
Object size: 18512, Total size: 40400, No. of k:v pairs: 342.
Initial usage: 0.666015625, Final usage: 0.66796875
Slots increased to 1024.
Object size: 36952, Total size: 80664, No. of k:v pairs: 683.
Initial usage: 0.666015625, Final usage: 0.6669921875
Slots increased to 2048.
Object size: 73808, Total size: 161232, No. of k:v pairs: 1366.
Initial usage: 0.66650390625, Final usage: 0.6669921875
Slots increased to 4096.
Object size: 147544, Total size: 322328, No. of k:v pairs: 2731.
Initial usage: 0.66650390625, Final usage: 0.666748046875
Slots increased to 8192.
This supports the view that the number of available slots doubles whenever the usage rises above 2/3.
External memory requirements
To check how much memory is required outside of the dictionary to store the actual key:value pairs we will used this code:
from sys import getsizeof
from pympler import asizeof
a = {}
key = 0
while len(a) < 16:
value = key + 1
a[key] = value
data_memory = asizeof.asizeof(a) - getsizeof(a)
memory_per_pair = data_memory / len(a)
print(f'Total memory required: {data_memory}, which gives {memory_per_pair} bytes per k:v pair.')
key += 2
The output is:
Total memory required: 64, which gives 64.0 bytes per k:v pair.
Total memory required: 128, which gives 64.0 bytes per k:v pair.
Total memory required: 192, which gives 64.0 bytes per k:v pair.
Total memory required: 256, which gives 64.0 bytes per k:v pair.
Total memory required: 320, which gives 64.0 bytes per k:v pair.
Total memory required: 384, which gives 64.0 bytes per k:v pair.
Total memory required: 448, which gives 64.0 bytes per k:v pair.
Total memory required: 512, which gives 64.0 bytes per k:v pair.
Total memory required: 576, which gives 64.0 bytes per k:v pair.
Total memory required: 640, which gives 64.0 bytes per k:v pair.
Total memory required: 704, which gives 64.0 bytes per k:v pair.
Total memory required: 768, which gives 64.0 bytes per k:v pair.
Total memory required: 832, which gives 64.0 bytes per k:v pair.
Total memory required: 896, which gives 64.0 bytes per k:v pair.
Total memory required: 960, which gives 64.0 bytes per k:v pair.
Total memory required: 1024, which gives 64.0 bytes per k:v pair.
This shows each key:value pair requires 64 bytes of memory.
The hidden inner workings
The dictionary is made up of a hash table (sparse) and an entities list (dense). The key:value pairs are stored sequentially in the entities table. The index to those entities is stored in the hash table at an index determined by the hash of the key. For the following demonstration I will be using the SHA-1 hash rather than the python inbuilt hash function as the seed for that varies each time the code is run so it's hard to do demonstrations.
The initial demo starts with the hash table as a series of empty slots and the entities table as an empty list:
from hashlib import sha1
slots = [[], [], [], [], [], [], [], []] # 8 slots
entries = []
To this we will add the key:value pair of "001":"Andrew". We take the SHA-1 hash of the key as a numerical value.
## new data
key = '001'
value = 'Andrew'
hash = int(sha1(key.encode()).hexdigest(), 16)
print(hash)
The resulting hash is:
1287815081429651280071948832174133786712345661042
We now find the value of hash % number of slots:
location = hash % len(slots)
And the result is 2. The data stored in the entities table is:
(1287815081429651280071948832174133786712345661042, '001', 'Andrew')
And since it is the first entry in that table it has the index 0. That index, 0, is stored in slot 2. The code to achieve all of this looks like this:
from hashlib import sha1
slots = [[], [], [], [], [], [], [], []] # 8 slots
entries = []
## new data
key = '001'
value = 'Andrew'
hash = int(sha1(key.encode()).hexdigest(), 16)
location = hash % len(slots)
data = (hash, key, value)
entries.append(data)
slots[location].append(entries.index(data))
print(slots)
print(entries)
And the output looks like this:
[[], [], [0], [], [], [], [], []]
[(1287815081429651280071948832174133786712345661042, '001', 'Andrew')]
To add another key:value pair, append something like this under the code above:
## next data
key = '002'
value = 'Jane'
hash = int(sha1(key.encode()).hexdigest(), 16)
location = hash % len(slots)
data = (hash, key, value)
entries.append(data)
slots[location].append(entries.index(data))
print(slots)
print(entries)
The hash for '002' is 638190938522342344790495567374591849013679820108 and this value modulo 8 is 4 so the entities table becomes:
[(1287815081429651280071948832174133786712345661042, '001', 'Andrew'),
(638190938522342344790495567374591849013679820108, '002', 'Jane')]
and since the hash of the key is 4 the index 1 is stored in slot 4 of the hash table. it now looks like this:
[[], [], [0], [], [1], [], [], []]
If we add '003':'Simon' to the dictionary we will find that the hash of '003' is 194552376310485427249522880942607198860839170867 The modulo 8 of that hash is 3 so the entities table now looks like:
[(1287815081429651280071948832174133786712345661042, '001', 'Andrew'),
(638190938522342344790495567374591849013679820108, '002', 'Jane'),
(194552376310485427249522880942607198860839170867, '003', 'Simon')]
and the hash table looks like:
[[], [], [0], [2], [1], [], [], []]
The reason for this is to speed up the look up. If you want the value for the key, 003, rather than iterate through the dictionary entities table looking for the key you can just find the hash of the key, modulo with the size of the hash table and find the index of the data in the entities table. Very clever.
Internal memory use
I will admit that at this point I cannot explain how the internal memory is used. The C code is a mystery to me and investigation leads to inexplicable results. Allow me to demonstrate.
The first key:value pair causes the dictionary to increase its capacity to 8 slots, and it takes an additional 224 - 64 = 160 bytes of memory. This appears to be 20 bytes per slot. I've no idea what those 20 bytes are used for but that's what the numbers suggest.
When the dictionary resizes it increases to 16 slots and 352 - 64 = 288 bytes. This works out to 18 bytes per slot.
To see the long term trend we need to create a big dictionary. This code will do that:
from sys import getsizeof
a = {}
key = 0
slots = 8
while len(a) < 10000:
initial_size = getsizeof(a)
value = key + 1
a[key] = value
final_size = getsizeof(a)
if final_size > initial_size:
if len(a) > 1:
slots *= 2
print(f'Slots increased to {slots}.')
print(f'Object size: {getsizeof(a)} bytes.')
print(f'Memory per slot: {(final_size - 64) / slots} bytes.')
print(f'')
key += 2
The output is:
Slots increased to 8.
Object size: 224 bytes.
Memory per slot: 20.0 bytes.
Slots increased to 16.
Object size: 352 bytes.
Memory per slot: 18.0 bytes.
Slots increased to 32.
Object size: 632 bytes.
Memory per slot: 17.75 bytes.
Slots increased to 64.
Object size: 1168 bytes.
Memory per slot: 17.25 bytes.
Slots increased to 128.
Object size: 2264 bytes.
Memory per slot: 17.1875 bytes.
Slots increased to 256.
Object size: 4688 bytes.
Memory per slot: 18.0625 bytes.
Slots increased to 512.
Object size: 9304 bytes.
Memory per slot: 18.046875 bytes.
Slots increased to 1024.
Object size: 18512 bytes.
Memory per slot: 18.015625 bytes.
Slots increased to 2048.
Object size: 36952 bytes.
Memory per slot: 18.01171875 bytes.
Slots increased to 4096.
Object size: 73808 bytes.
Memory per slot: 18.00390625 bytes.
Slots increased to 8192.
Object size: 147544 bytes.
Memory per slot: 18.0029296875 bytes.
Slots increased to 16384.
Object size: 294992 bytes.
Memory per slot: 18.0009765625 bytes.
The long term trend seems to be ~18 bytes per slot. If you can shed any more light on this end send me an email.
Last modified: 03 July 2024