Data IO

Data IO#

Creating dask-awkward collections typically begins with reading from either local disk or cloud storage. There is built-in support for datasets stored in Parquet or JSON format, along support for reading text files with each line treated as an element of an array.

Take this code-block for example:

>>> import dask_awkward as dak
>>> ds1 = dak.from_parquet("s3://path/to/dataset")
>>> ds2 = dak.from_json("/path/to/json-files")
>>> ds3 = dak.from_text("s3://some/text/*.txt")

In the Parquet and text examples we will read data from Amazon S3; in the JSON example we’re reading data from local disk. These collections will be partitioned on a per-file basis

Support for the ROOT file format is provided by the Uproot project.

The dask-awkward repository contains a Jupyter notebook tutorial going into more details about IO. You can find that notebook at docs/examples/io-tutorial.

It’s also possible to instantiate dask-awkward dask_awkward.Array instances from other Dask collections (like dask.array.Array), or concrete objects like existing awkward Array instances or Python lists.

See the IO API docs page for more information on the possible ways to instantiate a new dask-awkward Array.