Parallelizing and memorizing Python programs with Joblib

A Library for Many Jobs

In Memory

The previous example made use of a small and practically useless f(x) function. Some functions, however, regardless of whether they are parallelized, perform very time- and resource-consuming computations. If the input values are unknown before starting the program, a function like this might process the same arguments multiple times; this overhead could be unnecessary.

For more complicated functions, therefore, it makes sense to save the results (memorization). If the function is invoked with the same argument, it just grabs the existing result, rather than figuring things out again. Here, too programmers can look to Joblib for help – in the form of the Memory class this time.

Joblib provides a cache() method that serves as the decorator for arbitrary functions with one or more functional arguments. The results of the decorated function are then saved on disk by the memory object. The next time the function is called, it checks to see whether the same argument or the same arguments have been processed and, if so, returns the result directly (Figure 2). Listing 3 shows an implementation, again with a primitive sample function f(x).

Figure 2: Memory stores the function results and delivers them without recomputing when re-requested.

Listing 3

Store Function Results

01 from joblib import Memory
02
03 memory = Memory(cachedir='/tmp/example/')
04
05 @memory.cache
06 def f(x):
07     return x

The computed results are stored on disk in the JOBLIB directory below the directory defined by the cachedir parameter. Here again, each memorized function has its own subdirectory, which, among other things, contains the original Python source code of the function in the func_code.py file.

Memory for Names

A separate subdirectory also exists for each different argument – or, depending on the function, each different combination of several arguments. It is named for a hash value of the arguments passed in and contains two files: input_args.json and output.pkl. The first file shows the input arguments in the human-readable JSON format, and the second is the corresponding result in the binary pickle format used by Python to serialize and store objects.

This structure makes accessing the cached results of the memory function pleasantly transparent. The Python pickle module, for example, parses the results in the Python interpreter:

import pickle
result = pickle.load(open("output.pkl"))

Memory does not clean up on its own at end of the program, which means the stored results are still available the next time the program starts. However, this also means you need to clean up the disk space yourself, if necessary, which is done either by calling the clear() method of the Memory object or simply by deleting the folder.

Additionally, you should note that Memory bases its reading of the stored results exclusively on the function name. If you change the implementation, Memory might erroneously return the results previously generated by the old version of the function the next time you start it. Furthermore, Memory does not work with lambda functions – that is, nameless functions that are defined directly in the call.

As a general rule, use of the Memory class is recommended for functions whose results are so large they stress your computer's RAM. If a frequently called function only generates small results, however, creating a dictionary-based in-memory cache makes more sense. A sample implementation is shown on the ActiveState website [2].

Fast and Frugal

On request, the Memory class uses a procedure that saves a lot of time for large stored objects: memory mapping. The core idea behind this approach is to write a file as a bit-for-bit copy of an object from memory to the hard disk. When the software opens the object again, it copies the relevant part of the file into contiguous memory so that the objects it contains are directly available. This approach keeps the system from having to allocate memory, which can involve many system calls.

Joblib uses the memory mapping method provided by the NumPy [3] Python module. The constructor in the Memory class uses the optional mmap_mode parameter to accept the same arguments as the numpy.memmap class: r+, r, w+, and c.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Parallel Python with Joblib

    The Joblib Python Library handles frequent problems – like parallelization, memorization, and saving and loading objects – in almost no time, giving programmers more freedom to push on with their core tasks.

comments powered by Disqus
Subscribe to our ADMIN Newsletters
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs



Support Our Work

ADMIN content is made possible with support from readers like you. Please consider contributing when you've found an article to be beneficial.

Learn More”>
	</a>

<hr>		    
			</div>
		    		</div>

		<div class=