Being defensive with pickle in an evolving environment

Pickle is Python's built-in object persistence solution. Although it is very useful, care must be taken when using it with class definitions that may change, i.e. that are under active development. Consider the following example:

import cPickle

class Test():
    def __init__(self):
        self.var1 = 1
        self.var2 = 2

t1 = Test()
print t1.__dict__
t_pickle_str = cPickle.dumps(t1)

class Test():
    def __init__(self):
        self.var3 = 3

t2 = cPickle.loads(t_pickle_str)
print t2.__dict__

Both printouts show the var1 and var2 instance variables and no var3, even though the class definition changed in the meantime. This is normal, expected behavior: by default, unpickling restores the saved instance dictionary and does not call __init__.

At some point, I needed to add protection against this to one of my data cleaning algorithms, which was wrapped inside a class. The set of parameters used by the algorithm (stored as instance variables) was too expensive to determine (train) on every run of the production code, but quite cheap to pickle. The algorithm itself was tweaked from time to time, which could lead to subtle, silent bugs if an old pickle file was loaded with a class definition containing the updated algorithm.

To handle this situation, we can exploit the fact that pickle doesn’t serialize class-level variables. During deserialization, they are simply taken from the current class definition. This makes it possible to add class version control to pickling and unpickling. Consider the following mixin:

class Versionable(object):
    def __getstate__(self):
        if not hasattr(self, "_class_version"):
            raise Exception("Your class must define _class_version class variable")
        return dict(_class_version=self._class_version, **self.__dict__)
    def __setstate__(self, dict_):
        version_present_in_pickle = dict_.pop("_class_version")
        if version_present_in_pickle != self._class_version:
            raise Exception("Class versions differ: in pickle file: {}, "
                            "in current class definition: {}"
                            .format(version_present_in_pickle,
                                    self._class_version))
        self.__dict__ = dict_

Here the __getstate__ and __setstate__ methods of the pickling protocol (not to be confused with the pickle protocol version) are provided. The __getstate__ method attaches the current class version to the pickled data (taken from the _class_version class-level variable, which must be defined in the subclass). The __setstate__ method compares the value read from the pickle with the one in the current class definition. If they differ, an exception is raised.
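The premise this mixin relies on, that class-level variables are not stored in the pickle and are looked up on the current class after loading, can be checked in isolation. Below is a minimal sketch, assuming Python 3 (where cPickle has been folded into the pickle module). The Model class and its attribute names are made up for illustration:

```python
import pickle


class Model(object):
    # Class-level variable: lives on the class, not in the instance dict.
    threshold = 0.5

    def __init__(self):
        # Instance variable: stored in __dict__ and therefore pickled.
        self.weights = [1, 2, 3]


payload = pickle.dumps(Model())

# Simulate a code change made after the pickle was written.
Model.threshold = 0.9

restored = pickle.loads(payload)
print(restored.weights)                  # [1, 2, 3]
print(restored.threshold)                # 0.9, taken from the current class
print("threshold" in restored.__dict__)  # False, it was never pickled
```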

The following code shows the Versionable mixin (saved in the versionable.py file) in action:

from versionable import Versionable
import cPickle

class TestVersioning(Versionable):
    _class_version = 1

t1 = TestVersioning()

t_pickle_str = cPickle.dumps(t1)

class TestVersioning(Versionable):
    _class_version = 2

t2 = cPickle.loads(t_pickle_str)

This leads to the following output:

Traceback (most recent call last):
  File "/home/tfruboes/test.py", line 16, in <module>
    t2 = cPickle.loads(t_pickle_str)
  File "/home/tfruboes/versionable.py", line 20, in __setstate__
    self._class_version))
Exception: Class versions differ: in pickle file: 1, in current class definition: 2

So as long as you remember to bump the version number whenever you make incompatible changes, you are safe.
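For the use case described earlier, where the parameters are expensive to train but cheap to pickle, one option is to catch the mismatch and retrain instead of crashing. The following is only a sketch, assuming Python 3. The mixin is restated with a dedicated exception so that only version errors are caught. VersionMismatch, Cleaner and load_or_retrain are hypothetical names, not part of the original code:

```python
import pickle


class VersionMismatch(Exception):
    """Hypothetical dedicated exception; the mixin above raises plain Exception."""


class Versionable(object):
    def __getstate__(self):
        # Store the class version next to the instance variables.
        return dict(_class_version=self._class_version, **self.__dict__)

    def __setstate__(self, state):
        found = state.pop("_class_version")
        if found != self._class_version:
            raise VersionMismatch(
                "Class versions differ: in pickle: {}, in class: {}".format(
                    found, self._class_version))
        self.__dict__ = state


class Cleaner(Versionable):
    _class_version = 1

    def __init__(self):
        # Stand-in for the expensive training step.
        self.params = {"scale": 2.0}


def load_or_retrain(payload):
    """Return a Cleaner from the pickle, retraining if the pickle is stale."""
    try:
        return pickle.loads(payload)
    except VersionMismatch:
        return Cleaner()


payload = pickle.dumps(Cleaner())
Cleaner._class_version = 2          # simulate an incompatible code change
cleaner = load_or_retrain(payload)  # version check fails, so we retrain
```

Whether silent retraining is acceptable depends on the application. In production code, logging the mismatch before falling back is probably wise.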

Some random notes:

  • For the _class_version class variable, you can use anything that can be compared with the “==” operator. So if you want to be more descriptive and provide more than one version number (e.g. major and minor), a tuple or a dict works just as well.
  • An alternative way to implement version safety uses the copy_reg module. See item 44 of the “Effective Python” book (if you haven’t visited the books section yet, now is the right time 🙂 )
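The copy_reg approach from the second note can be sketched roughly as follows. This is an illustration of the general idea rather than the book's code, assuming Python 3, where the module is named copyreg. Settings, CURRENT_VERSION, reduce_settings and rebuild_settings are hypothetical names:

```python
import copyreg
import pickle

CURRENT_VERSION = 2  # hypothetical module-level version number


class Settings(object):
    def __init__(self, scale=1.0, offset=0.0):
        self.scale = scale
        self.offset = offset


def rebuild_settings(state):
    """Called by pickle on load; checks the stored version first."""
    version = state.pop("version")
    if version != CURRENT_VERSION:
        raise ValueError(
            "Versions differ: in pickle: {}, in code: {}".format(
                version, CURRENT_VERSION))
    return Settings(**state)


def reduce_settings(obj):
    """Called by pickle on dump; adds the version to the saved state."""
    state = dict(obj.__dict__, version=CURRENT_VERSION)
    return rebuild_settings, (state,)


# Tell pickle to use reduce_settings for Settings instances.
copyreg.pickle(Settings, reduce_settings)

payload = pickle.dumps(Settings(scale=3.0))
restored = pickle.loads(payload)
print(restored.scale, restored.offset)  # 3.0 0.0
```

Unlike the mixin, this keeps the versioning logic outside the class, at the cost of a registration step for each class.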
