Introduction

Introduction

Note

This introduction assumes that you have some familiarity with Dask.

The Dask project provides “collections” which behave as parallelized and/or distributed versions of the core PyData data types:

  • dask.array provides a NumPy like interface for creating task graphs.

  • dask.dataframe provides a Pandas like interface for creating task graphs.

  • dask.bag provides a functional interface for creating task graphs.

  • dask.delayed provides an interface for custom task graphs.

With dask-awkward, we aim to provide an additional interface:

  • dask-awkward provides an Awkward-Array-like interface for creating task graphs.

We accomplish this by creating a new collection type: dask_awkward’s Array class, which is a partitioned representation of a concrete Awkward Array.

Imagine a dataset of multiple line delimited JSON files (data.00.json, data.01.json, and so on). Loading that data, and selecting a subset of the dataset based on the total number of entries in a nested attribute of the data can be done with both awkward and dask-awkward with the same programming style; on the left we operate eagerly with awkward and on the right we operate lazily with dask-awkward (notice the compute() method call):

import awkward._v2 as ak                import dask_awkward as dak
x = ak.from_json("data.00.json")        x = dak.from_json("data.*.json")
x = x[ak.num(x.foo) > 2]                x = x[dak.num(x.foo).compute()

Note

dask-awkward depends on the in-development version 2 of awkward; which exists in the awkward._v2 namespace.

On the left (the eager version) the second line will immediately begin to read data from disk and decode the JSON. Sequentially after that, the selection step will execute.

On the right (the lazy version) the second line will stage the reading of each detected JSON file (task graph creation), the next line will stage the selection (extending the task graph) and then call compute on that graph .Dask will execute the JSON reading and decoding of each file in parallel, and when each reading task is done, the selection tasks will follow. Dask will schedule the tasks itself (and it will attempt to optimize its work).

For example usage of dask-awkward, we have a demo repository.