Serialization of stored/cached data

By default, both cache and data files (created using the APIs described in Persistent data) are serialized using cPickle. This is a good compromise between speed and the ability to store arbitrary objects.

When changing or specifying a serializer, use the name under which the serializer is registered with the workflow.manager object.

Warning

When it comes to cache data, it is strongly recommended to stick with the default. cPickle is very fast and fully supports standard Python data structures (dict, list, tuple, set etc.).

If you really must customise the cache data format, you can change the default cache serializer to pickle like this:

from workflow import Workflow

wf = Workflow()
wf.cache_serializer = 'pickle'

Unlike the stored data API, the cached data API can’t determine the format of the cached data. If you change the serializer without clearing the cache, errors will probably result as the serializer tries to load data in a foreign format.
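
If you do switch cache serializers, it is safest to clear the existing cache first so that files written in the old format are never read back with the new serializer. A minimal sketch, assuming your version of the library provides the Workflow.clear_cache() method:

from workflow import Workflow

wf = Workflow()
# Remove cached files written with the previous serializer
wf.clear_cache()
# Then switch the cache format
wf.cache_serializer = 'pickle'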

In the case of stored data, you are free to specify either a global default serializer or one for each individual datastore:

from workflow import Workflow

wf = Workflow()
# Use `pickle` as the global default serializer
wf.data_serializer = 'pickle'

# Example data (placeholder for illustration)
data = {'key': 'value'}

# Use the JSON serializer only for these data
wf.store_data('name', data, serializer='json')

This is primarily so you can create files that are human-readable or usable by other software. The generated JSON is formatted to make it readable.

The stored_data() method can automatically determine the serialization of the stored data (based on the file extension, which is the same as the name the serializer is registered under), provided the corresponding serializer is registered. If it isn’t, a ValueError will be raised.
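
For example, data stored with the json serializer can be read back with stored_data() without naming the format, because the .json extension identifies the serializer. A short sketch (the 'popular-posts' name and the data are just illustrative):

from workflow import Workflow

wf = Workflow()

# Written to a `.json` file in the workflow's data directory
wf.store_data('popular-posts', {'count': 5, 'tags': ['alfred', 'python']},
              serializer='json')

# The `.json` extension tells `stored_data()` which serializer to use
posts = wf.stored_data('popular-posts')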

Built-in serializers

There are 3 built-in, pre-configured serializers:

  • cpickle — the default serializer for both cached and stored data, with very good support for native Python data types;
  • pickle — a more flexible, but much slower alternative to cpickle; and
  • json — a very common data format, but with limited support for native Python data types.

See the built-in cPickle, pickle and json libraries for more information on the serialization formats.

Managing serializers

You can add your own serializer, or replace the built-in ones, using the configured instance of SerializerManager at workflow.manager, e.g. from workflow import manager.
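
For instance, you can look up a serializer by the name it is registered under. A sketch, assuming SerializerManager exposes a serializer() lookup method and a serializers property listing registered names:

from workflow import manager

# Retrieve the object registered under the name 'json'
json_serializer = manager.serializer('json')

# Names of all currently registered serializers (assumed property)
print(manager.serializers)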

A serializer object must have load() and dump() methods that work the same way as in the built-in json and pickle libraries, i.e.:

# Reading
obj = serializer.load(open('filename', 'rb'))
# Writing
serializer.dump(obj, open('filename', 'wb'))

To register a new serializer, call the register() method of the workflow.manager object with the name of the serializer and the object that performs serialization:

from workflow import Workflow, manager


class MySerializer(object):

    @classmethod
    def load(cls, file_obj):
        # load data from file_obj
        pass

    @classmethod
    def dump(cls, obj, file_obj):
        # serialize obj to file_obj
        pass


manager.register('myformat', MySerializer())

Note

The name you specify for your serializer will be the file extension of the stored files.
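
Once registered, the serializer can be selected by name just like the built-in ones. A sketch reusing the MySerializer registration above (the 'items' name and the data are just illustrative):

from workflow import Workflow

wf = Workflow()
data = {'spam': 'eggs'}

# Saved to a `.myformat` file using MySerializer.dump()
wf.store_data('items', data, serializer='myformat')

# Loaded with MySerializer.load(), chosen from the `.myformat` extension
items = wf.stored_data('items')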

Serializer interface

A serializer must conform to this interface (like json and pickle):

serializer.load(file_obj)
serializer.dump(obj, file_obj)

See the Serializers section of the API documentation for more information.