Currently, I use Python mostly for data analysis and modeling. Whenever I can, I take a pipeline-like approach, where data is processed in multiple steps. These steps are implemented in separate .py files, with cPickle used for data persistence and exchange. You can think of this as a poor man’s MapReduce.
Development is usually an iterative process: I run and change a given step over and over until I’m happy (at least for a moment) with the result. This means loading the same data with cPickle many times, which, it turns out, I’ve been doing wrong for a very long time.
So, what was wrong? Consider this code for pickling and unpickling a large numpy array:
    import numpy as np
    import cPickle
    import time

    data = np.random.sample(int(1e7))

    # Write the array with the default pickle protocol
    t1 = time.time()
    with open("data.pkl", "wb") as of_:
        cPickle.dump(data, of_)
    print "Write took", time.time()-t1

    # Read it back
    t1 = time.time()
    with open("data.pkl", "rb") as of_:
        cPickle.load(of_)
    print "Read took", time.time()-t1
On my laptop, the write and read parts of the above take 12 and 6.5 seconds respectively. The code is straightforward and nothing looks wrong, except that I have omitted the pickle protocol version. Without it, cPickle falls back to protocol 0, the original text-based format. That, as it turns out, has a dramatic impact on performance. If we set the protocol to the latest and greatest:
    with open("data.pkl", "wb") as of_:
        cPickle.dump(data, of_, cPickle.HIGHEST_PROTOCOL)
the write/read times drop to 0.7 and 0.1 seconds respectively. That is roughly 17 times faster for writing and 65 times faster for reading, so nearly two orders of magnitude on the read side!
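If you want to check the effect on your own machine, here is a minimal sketch of the same comparison. It rests on a few assumptions that are not in the code above: it is written for Python 3 (where cPickle no longer exists and plain pickle uses the C implementation when it is available), it pickles in memory instead of to a file, and it uses a plain list of floats as a stand-in for the numpy array so that it needs nothing outside the standard library. The helper name time_roundtrip is made up for this example. Note that Python 3 already defaults to a binary protocol, so protocol 0 is requested explicitly here to reproduce the slow case.

```python
import pickle
import random
import time


def time_roundtrip(obj, protocol):
    """Pickle and unpickle obj in memory.

    Returns (size_in_bytes, write_seconds, read_seconds, restored_object).
    """
    t0 = time.perf_counter()
    blob = pickle.dumps(obj, protocol)
    t1 = time.perf_counter()
    restored = pickle.loads(blob)
    t2 = time.perf_counter()
    return len(blob), t1 - t0, t2 - t1, restored


# Stand-in data (an assumption of this sketch): a list of floats
# instead of the numpy array used in the post.
values = [random.random() for _ in range(200000)]

sizes = {}
for proto in (0, pickle.HIGHEST_PROTOCOL):
    size, write_s, read_s, restored = time_roundtrip(values, proto)
    assert restored == values  # both protocols must give the data back intact
    sizes[proto] = size
    print("protocol %d: %d bytes, write %.3fs, read %.3fs"
          % (proto, size, write_s, read_s))
```

The exact timings will differ from the ones quoted above, since the data and the Python version are different, but the text-based protocol 0 should come out both larger and slower than the binary one.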
These days it is not unusual to have data big enough that the total loading time, summed over all the “change it and run” iterations in a day, becomes significant, e.g. half an hour. That time is 100% wasted. If this sounds like an overreaction, think how a half-minute lag at every program start would affect your comfort as a developer.
Of course, we could take another approach: load the data once inside an IPython notebook and do all of the development there. I try to avoid this whenever I can, but that’s a topic for another post.
So remember – pickle protocol matters!
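One way to make the protocol hard to forget is to hide it behind a pair of small helpers that every pipeline step uses. The following is only a sketch, again assuming Python 3; the names save_step and load_step and the file name step1.pkl are invented for this example and are not from my actual pipeline.

```python
import os
import pickle
import tempfile


def save_step(obj, path):
    """Write obj to path, always using the highest available pickle protocol."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


def load_step(path):
    """Read back an object written by save_step.

    No protocol argument is needed: pickle detects it from the file.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage: one step hands its result to the next one through a file.
step_output = {"rows": [1, 2, 3], "label": "cleaned"}
path = os.path.join(tempfile.mkdtemp(), "step1.pkl")
save_step(step_output, path)
print(load_step(path))
```

One caveat of this choice: a file written with the highest protocol of a newer Python may not be readable by an older one, so it fits best when all steps run under the same interpreter.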