Optimisation with Python and Dask

3 quick wins to speed up your data processing

Dask is a Python library which makes it easy to run computations in parallel, either on a single machine or on a distributed cluster. Use Dask when your data is too large to fit into memory.

Here are three quick wins to speed up your Dask code.

1: Use built-in xarray methods

If you’re working with large datasets, chances are you’re already using xarray. If you’re not, check it out!

Xarray makes it easy to work with multi-dimensional datasets, also known as data cubes. By default, Xarray uses Numpy arrays under the hood. But it doesn’t have to be this way. It’s trivial to use Dask with Xarray by passing in the chunks keyword arg.
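For example, here’s a minimal sketch of opening a dataset with Dask-backed arrays (the file name and chunk size are just placeholders):

import xarray as xr

# Passing `chunks` tells Xarray to back the dataset with Dask arrays,
# so nothing is loaded into memory until you ask for a result.
ds = xr.open_dataset("data.nc", chunks={"time": 365})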

Now you can use your favourite Xarray methods as normal and you’ll get Dask for free. If part of your code isn’t playing nicely with Dask, check out xr.map_blocks and xr.apply_ufunc.
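As a rough sketch of the apply_ufunc route (the array, its size and the conversion function are all made up for illustration), you can wrap a plain Numpy function and let Dask apply it chunk by chunk:

import dask.array as da
import xarray as xr

# A made-up Dask-backed DataArray of temperatures in Kelvin.
temp = xr.DataArray(
    da.random.random((365, 180, 360), chunks=(90, 180, 360)) * 40 + 260,
    dims=("time", "lat", "lon"),
)

def kelvin_to_celsius(block):
    # A plain Numpy function; it knows nothing about Dask or Xarray.
    return block - 273.15

# dask="parallelized" applies the function lazily, one chunk at a time.
celsius = xr.apply_ufunc(kelvin_to_celsius, temp, dask="parallelized", output_dtypes=[float])
celsius = celsius.compute()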

If you’re not using Xarray, you’re (hopefully!) using Numpy arrays. The developers of Dask have made life easy for you: the Dask Array interface is almost identical to the Numpy array interface. Chances are you can get a long way with in-place substitution: instead of writing np.do_something, use da.do_something.
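As a quick sketch (the array sizes are arbitrary), the swap looks like this:

import dask.array as da

# Same idea as np.random.random followed by np.mean, but chunked and lazy.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = x.mean(axis=0).compute()  # nothing runs until .compute()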

If you’re not using array functions at all - i.e. if your code is full of loops - I’m afraid there aren’t any quick wins for you. Upgrade your code to use array functions first (you’ll see a massive benefit just from that) and then revisit Dask.
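To make the distinction concrete, here’s a toy before-and-after (the data is made up):

import numpy as np

values = np.random.random(1_000_000)

# Loop version: slow, and Dask can't parallelise it for you.
total = 0.0
for v in values:
    total += v ** 2

# Array version: one vectorised call, and trivially swappable for dask.array.
total = (values ** 2).sum()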

2: Use the largest chunk size possible

In a perfect world, if you spread a parallel computation over 4 machines it’s 4 times faster. In practice it doesn’t quite work out that way: there are overheads involved, like moving the data to each machine and combining the results, and these slow the computation down.

Dask has the same problem if you use a tiny chunk size. We spend all our time moving the data around and faffing with it instead of actually running our computation.

The official Dask documentation on chunk size helpfully tells us “Select a good chunk size”. Thanks for that cracking advice, folks.

The vast majority of the time you want to use the largest chunk size your hardware can handle. If you’re not sure, benchmark it!
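Here’s a minimal benchmarking sketch (the array and chunk sizes are arbitrary) you can adapt to your own workload:

import time
import dask.array as da

# Try a few chunk sizes and time the same computation with each.
for chunk in (500, 2_000, 8_000):
    x = da.random.random((16_000, 16_000), chunks=(chunk, chunk))
    start = time.perf_counter()
    x.mean().compute()
    print(f"chunks=({chunk}, {chunk}): {time.perf_counter() - start:.2f}s")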

3: Calculate all outputs in one step

Let’s say we have two outputs we want to generate: a mean and a count. They come from the same base calculation but diverge towards the end.

We could do something like:

# `expensive_computation` is a lazy, Dask-backed object and `save` is a
# stand-in for however you write results to disk.
mean = expensive_computation.mean(dim='time')
mean = mean.compute()  # runs the whole computation graph
save(mean)

count = expensive_computation.count(dim='time')
count = count.compute()  # runs the whole graph again from scratch
save(count)

Here we’re calculating expensive_computation twice. Wouldn’t it be nice if we could cache the result somehow between calls?

Luckily we don’t need to do anything that complex. Just tell Dask to compute both at the same time and it’ll work out that the first part of the computation graph is shared:

import dask

mean = expensive_computation.mean(dim='time')
count = expensive_computation.count(dim='time')
# This is approx 2x faster than the previous snippet: the shared part of
# the graph only runs once
mean, count = dask.compute(mean, count)
save(mean)
save(count)

Easy.

Further Optimisation

We’ve covered the most important ways to get the most out of your Dask-powered code. If you want to squeeze still more performance out of your software, the best place to start is the Dask documentation’s page on best practices.

Daniel Tipping