Optimisation with Python and Dask
3 quick wins to speed up your data processing
Dask is a Python library which makes it easy to run computations in parallel, either on a single machine or on a distributed cluster. Use Dask when your data is too large to fit into memory.
Here are three quick wins to speed up your Dask code.
1: Use built-in xarray methods
If you're working with large datasets, chances are you're already using xarray. If you're not, check it out!
Xarray makes it easy to work with multi-dimensional datasets, also known as data cubes.
By default, Xarray uses Numpy arrays under the hood. But it doesn't have to be this way: it's trivial to use Dask with Xarray by passing in the `chunks` keyword argument.
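As a quick sketch (the array shape, dims, and chunk sizes here are arbitrary examples), you can turn an in-memory DataArray into a Dask-backed one with `.chunk()`, or pass `chunks=` when opening a file:

```python
import numpy as np
import xarray as xr

# A small in-memory DataArray; shape and dims are arbitrary examples
arr = xr.DataArray(np.random.rand(8, 6), dims=("time", "x"))

# .chunk() swaps the Numpy backing for a lazy Dask array
chunked = arr.chunk({"time": 2})
print(type(chunked.data))  # <class 'dask.array.core.Array'>

# When loading from disk, pass chunks= directly instead, e.g.:
# ds = xr.open_dataset("data.nc", chunks={"time": 1000})
```

Nothing is computed at this point; operations on `chunked` build a task graph that only runs when you call `.compute()` (or save the result).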
Now you can use your favourite Xarray methods as normal and you'll get Dask for free. If part of your code isn't playing nicely with Dask, check out the Xarray documentation on parallel computing with Dask.
If you're not using Xarray, you're (hopefully!) using Numpy arrays. The developers of Dask have made life easy for you: the Dask Array interface is almost identical to the Numpy array interface. Chances are you can get a long way with in-place substitution, swapping your `numpy` calls for their `dask.array` equivalents and adding a final `.compute()`.
If you're not using array functions at all - i.e. if your code is full of loops - I'm afraid there aren't any quick wins for you. Upgrade your code to use array functions first (you'll see a massive benefit just from that) and then revisit Dask.
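To illustrate that upgrade with made-up data, here's a Python-level loop replaced by a single array call:

```python
import numpy as np

values = np.arange(1_000_000)

# Slow: a Python-level loop over a million elements
total = 0
for v in values:
    total += v

# Fast: one vectorised call that runs in optimised C
total_vec = values.sum()

assert total == total_vec
```

Once your code is expressed in array operations like `values.sum()`, the in-place substitution trick from tip 1 applies and Dask can parallelise it.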
2: Use the largest chunk size possible
In a perfect world, spreading a parallel computation over 4 machines makes it 4 times faster. In practice it doesn't quite work like that: there are overheads involved, like moving the data to each machine and combining the results, and these slow the computation down.
Dask has the same problem if you use a tiny chunk size. We spend all our time moving the data around and faffing with it instead of actually running our computation.
The official Dask documentation on chunk size helpfully tells us “Select a good chunk size”. Thanks for that cracking advice, folks.
The vast majority of the time you want to use the largest chunk size your hardware can handle. If you're not sure, benchmark it!
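Here's a rough sketch of such a benchmark (the array and chunk sizes are arbitrary; on real data you'd time your actual workload instead of a toy `mean`):

```python
import time
import dask.array as da

def bench(chunks):
    """Time a simple reduction over a random array with the given chunking."""
    x = da.random.random((4_000, 4_000), chunks=chunks)
    start = time.perf_counter()
    x.mean().compute()
    return time.perf_counter() - start

t_small = bench((100, 100))      # 1,600 chunks: lots of scheduling overhead
t_large = bench((2_000, 2_000))  # 4 chunks: far less overhead
print(f"small chunks: {t_small:.2f}s, large chunks: {t_large:.2f}s")
```

On most machines the large-chunk run wins comfortably, but the crossover point depends on your hardware, which is exactly why it's worth benchmarking.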
3: Calculate all outputs in one step
Let's say we have two outputs, a mean and a count, that we want to generate. They come from the same base calculation but diverge towards the end.
We could do something like:
```python
mean = expensive_computation.mean(dim='time')
mean = mean.compute()
save(mean)

count = expensive_computation.count(dim='time')
count = count.compute()
save(count)
```
Here we're calculating `expensive_computation` twice. Wouldn't it be nice if we could cache the result somehow between calls?
Luckily we don't need to do anything that complex. Just tell Dask to compute both at the same time and it'll work out that the first part of the computation graph is shared:
```python
mean = expensive_computation.mean(dim='time')
count = expensive_computation.count(dim='time')

# This is approx 2x faster than the previous snippet
mean, count = dask.compute(mean, count)

save(mean)
save(count)
```
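To see this end to end, here's a self-contained version with a made-up stand-in for `expensive_computation` (the data and chunking are arbitrary):

```python
import dask
import numpy as np
import xarray as xr

# Stand-in for the shared expensive step: a lazy, chunked DataArray
base = xr.DataArray(np.random.rand(100, 10), dims=("time", "x"))
expensive_computation = base.chunk({"time": 25}) * 2  # still lazy

mean = expensive_computation.mean(dim="time")
count = expensive_computation.count(dim="time")

# One compute() call: Dask runs the shared part of the graph once
mean, count = dask.compute(mean, count)
```

`dask.compute` accepts any mix of Dask-backed collections, so the same trick works with plain Dask arrays and dataframes too.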
We've covered the most important ways to get the most out of your Dask-powered code. If you want to squeeze still more performance out of your software, the best place to start is the documentation page on Dask best practices.