Optimised Distributed Satellite Data Processing

Satellite Applications Catapult

In March 2020, we were engaged by the Satellite Applications Catapult, a government-backed space technology and innovation company at the heart of the UK Space sector.

We helped them make their data processing code run more than 10x faster.

Thank you for your outstanding work. We really appreciated your methodology and collaboration approach, and I personally feel that Old Reliable really benefited the project.

Federica Moscato
Software Engineering Manager, Satellite Applications Catapult

Problem

Part of the Catapult’s work is to create data products by processing satellite imagery and other earth observation data. Broadly, developing a data product involves two steps:

  1. Research to determine the scientific approach and base algorithms.
  2. Translation of the scientific code into production software for automated product generation.

Our role was to optimise the Catapult’s scientific code so that the data products could be generated quickly and efficiently. The code was initially synchronous and written in Jupyter Notebooks.

We optimised the notebooks for distributed parallel processing using Python and a Dask cluster backed by Kubernetes. From the optimised notebooks we then derived a set of Python jobs to generate the products automatically.
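The core pattern behind this kind of conversion is to replace a synchronous loop over scenes with lazy Dask tasks that a scheduler can run in parallel. The sketch below is illustrative, not the Catapult's actual code: `process_scene` is a hypothetical stand-in for a per-scene processing step, and we use Dask's local scheduler here, though the same task graph would run unchanged against a Kubernetes-backed cluster via `dask.distributed`.

```python
import dask

# Hypothetical stand-in for a single scene's processing step.
# In the real pipeline this would load and transform earth observation data.
def process_scene(scene_id):
    return scene_id * 2  # placeholder computation

# Wrap each synchronous call in dask.delayed: nothing runs yet,
# we are only building a task graph of independent scenes.
tasks = [dask.delayed(process_scene)(i) for i in range(10)]

# dask.compute executes the graph, running scenes in parallel.
# Pointing a dask.distributed.Client at a Kubernetes cluster would
# distribute these same tasks across workers with no code changes.
results = dask.compute(*tasks)
```

Because the graph is built before anything executes, the scheduler is free to run independent scenes concurrently, which is where the speedup over the original synchronous notebooks comes from.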

Process

We worked in the middle ground between the Catapult’s earth observation scientists and their software developers. It was essential to understand both the scientific requirements for the products as well as the system infrastructure which would be used for automatic generation.

The core tech stack - Python, Jupyter Notebooks, Dask, and Kubernetes - includes some of our favourites, so no surprises there!

The secret to success was clear and open communication between the scientists and developers. This let us write fast, well-optimised code which maintained its original scientific rigour.

We made use of Git and GitHub best practice to help build our knowledge and decision-making process into the Catapult’s codebases.

Solution

The result was a series of Python tasks to automatically generate the Catapult’s data products. Our optimised code ran at least 10x faster than the synchronous code, and between 1.5x and 8x faster than the Catapult’s own first attempt at parallelising the computation.

We left our Dask optimisation methodology with the Catapult in the form of a written report so that they’re able to optimise their own code in the future. You can read the basics of optimisation with Python and Dask on our blog.
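One of the basic ideas from that methodology is chunking: splitting a large array (such as a satellite band) into blocks so that operations run per-chunk in parallel and never need the whole raster in one worker's memory. A minimal sketch, assuming a NumPy array as the data source (the normalisation step is an arbitrary example, not one of the Catapult's algorithms):

```python
import numpy as np
import dask.array as da

# A large raster (e.g. one satellite band) as a Dask array split into
# 250x250 chunks; each chunk can be processed independently.
band = da.from_array(
    np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000),
    chunks=(250, 250),
)

# Operations on Dask arrays are lazy: this builds a task graph of
# per-chunk work rather than computing anything immediately.
normalised = (band - band.mean()) / (band.std() + 1e-9)

# compute() executes the graph, processing chunks in parallel and
# assembling the result back into a NumPy array.
result = normalised.compute()
```

Choosing a sensible chunk size is most of the battle: chunks small enough to fit in worker memory, but large enough that scheduling overhead doesn't dominate the computation.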

Looking Forward

Working with the Catapult was absolutely fantastic and we hope to work closely with them in the future. Their work is part of the backbone of the UK Space sector. We’re happy to have done our part to help!

You can read more about the Satellite Applications Catapult on their website.