Data IO

Creating dask-awkward collections typically begins with reading from either local disk or cloud storage. There is built-in support for datasets stored in Parquet or JSON format.

For example:

>>> import dask_awkward as dak
>>> ds1 = dak.from_parquet("s3://path/to/dataset")
>>> ds2 = dak.from_json("/path/to/json-files/*.json")

Both the from_parquet() and from_json() calls create new dask_awkward.Array instances. The Parquet example reads data from Amazon S3; the JSON example reads data from local disk. Notice the wildcard syntax in the JSON path: every JSON file in that directory is discovered, and each file becomes one partition in the collection.

Support for the ROOT file format is provided by the Uproot project.

It’s also possible to instantiate dask_awkward.Array instances from other Dask collections (like dask.array.Array), or from concrete objects like existing awkward Array instances or Python lists.

See the IO API docs page for more information on the possible ways to instantiate a new dask-awkward Array.