Currently, I use Python mostly for data analysis and modeling. Whenever I can, I take a pipeline-like approach, where data is processed in multiple steps, each implemented in a separate .py file, with cPickle used for data persistence and exchange. You can think of this as a poor man's MapReduce.
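To make this concrete, here is a minimal sketch of what one such step could look like (the function and file names are made up for illustration): it unpickles the previous step's output, transforms it, and pickles the result for the next step.

import cPickle

def run_step(in_path, out_path, transform):
    # Load the previous step's output...
    with open(in_path, "rb") as f:
        data = cPickle.load(f)
    # ...apply this step's processing...
    result = transform(data)
    # ...and persist it for the next step.
    with open(out_path, "wb") as f:
        cPickle.dump(result, f)

# e.g. run_step("filtered.pkl", "normalized.pkl", normalize)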

Development is usually an iterative process: I run and change a given step over and over until I'm happy (at least for the moment) with the result. This means loading the same data with cPickle many times over. Which, as it turns out, I've been doing wrong for a very long time.

So, what was wrong? Consider this code for pickling and unpickling a large numpy array:

import numpy as np
import cPickle
import time

# 1e7 doubles, i.e. roughly 80 MB of data
data = np.random.sample(int(1e7))

t1 = time.time()
with open("data.pkl", "wb") as of_:
    cPickle.dump(data, of_)
print "Write took", time.time() - t1

t1 = time.time()
with open("data.pkl", "rb") as of_:
    cPickle.load(of_)
print "Read took", time.time() - t1

On my laptop, the write and read parts above take 12 and 6.5 seconds respectively. The code is straightforward and nothing looks wrong, except for the fact that I have omitted the pickle protocol version, which, as it turns out, has a dramatic impact on performance. If we set the protocol to the latest and greatest:

with open("data.pkl", "wb") as of_:
    cPickle.dump(data, of_, cPickle.HIGHEST_PROTOCOL)

the write/read times drop to 0.7 and 0.1 seconds respectively, which is nearly two orders of magnitude faster!

It is nothing unusual nowadays to work with data big enough that the total loading time (i.e. summed over all "change it and run" iterations during a day) becomes significant, e.g. half an hour, all of it wasted. If this feels like overreacting, think how a half-minute lag at the start of a program would affect your comfort as a developer.

Of course, we could take another approach: load the data once inside an IPython notebook and do all of the development there. I try to avoid this whenever I can, but that's a topic for another post.

So remember – pickle protocol matters!



  1. You shouldn't pickle NumPy arrays in the first place. It does not make sense at all, especially if you want to store the data for longer.

    Use the IO functions NumPy provides: numpy.save and numpy.load (see the sketch after these comments). If you want to be really professional, use HDF5 (via h5py).

  2. I disagree; in lots of cases it's good enough (though one should be aware of the possible issues). Up to now, long-term persistence has never been a requirement for my projects (at least the numpy-based ones). Anyway, I'll most likely switch to numpy.save, as you pointed out (thanks!). The possible memory improvements look very promising, but the main reason for the switch is rather silly – slightly more compact code with numpy.save/load…

  3. In Python 2, the default pickle protocol is version 0, the human-readable ASCII format and the slowest one. The highest version available in Python 2 is protocol 2, the second binary protocol and, as you found out, much faster. In Python 3, the default is version 3, with version 4 added in Python 3.4. See the docs for more.
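For completeness, here is a minimal sketch of the numpy.save/numpy.load alternative suggested in the comments (the file name is arbitrary):

import numpy as np

data = np.random.sample(int(1e7))

# numpy's .npy format writes the raw array buffer with a small header,
# so there is no pickle protocol to get wrong.
np.save("data.npy", data)
loaded = np.load("data.npy")

assert np.array_equal(data, loaded)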
