dask_awkward.from_json

dask_awkward.from_json(urlpath, schema=None, highlevel=True, *, blocksize=None, delimiter=None, one_obj_per_file=False, compression='infer', meta=None, behavior=None, derive_meta_kwargs=None, storage_options=None)

Create an Awkward Array collection from JSON data.

There are three styles supported for reading JSON data:

  1. Line-delimited style: file(s) with one JSON object per line. The function argument defaults are set up to handle this style. This method assumes that newline characters are not embedded in JSON values.

  2. Single JSON object per file (this requires one_obj_per_file to be set to True). These objects must be arrays.

  3. Reading some number of bytes at a time. If at least one of blocksize or delimiter is defined, Dask’s read_bytes() function will be used to lazily read bytes (blocksize bytes per partition) and split on delimiter. This method assumes line-delimited JSON without newline characters embedded in JSON values.
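
To make the first two layouts concrete, the following sketch writes the same records in both styles using only the standard library. The record contents and file names are hypothetical, and the from_json calls in the closing comments are shown for orientation only; they are not executed here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records, used only for illustration.
records = [{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.2, 3.3]}]

tmpdir = Path(tempfile.mkdtemp())

# Style 1: line-delimited JSON, one object per line.
line_delimited = tmpdir / "dataset0.json"  # hypothetical file name
line_delimited.write_text("\n".join(json.dumps(r) for r in records) + "\n")

# Style 2: a single JSON object per file; the object must be an array.
single_object = tmpdir / "single0.json"  # hypothetical file name
single_object.write_text(json.dumps(records))

# Both layouts encode the same records.
from_lines = [json.loads(line) for line in line_delimited.read_text().splitlines()]
from_array = json.loads(single_object.read_text())

# With dask_awkward installed, these files would be read as:
#   dak.from_json(str(line_delimited))
#   dak.from_json(str(single_object), one_obj_per_file=True)
```

Note that json.dumps never emits a raw newline inside a value unless asked to indent, which is what keeps the line-delimited file compatible with the assumption stated in style 1.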

Parameters
  • urlpath (str | list[str]) – The source of the JSON dataset.

  • blocksize (int | str, optional) – If defined, each partition will be created from a block of JSON bytes of this size. If delimiter is defined (not None) but this value remains None, a default value of 128 MiB will be used.

  • delimiter (bytes, optional) – If defined (not None), this will be the byte(s) to split on when reading blocks of blocksize bytes. If this is None but blocksize is defined (not None), the default delimiter will be the newline character (b"\n").

  • one_obj_per_file (bool) – If True, each file will be considered a single JSON object.

  • compression (str, optional) – Compression of the files in the dataset.

  • meta (Any, optional) – The metadata for the collection. If None (the default), the metadata will be determined by scanning the beginning of the dataset.

  • derive_meta_kwargs (dict[str, Any], optional) – Dictionary of arguments to be passed to derive_json_meta for determining the collection metadata if meta is None.

  • storage_options (dict[str, Any], optional) – Storage options passed to fsspec.

  • schema (dict | None) –

  • highlevel (bool) –

  • behavior (dict | None) –

Returns

The resulting Dask Awkward Array collection.

Return type

Array

Examples

One partition per file:

>>> import dask_awkward as dak
>>> a = dak.from_json("dataset*.json")

One partition per 200 MB of JSON data:

>>> a = dak.from_json("dataset*.json", blocksize="200 MB")

Same as previous call (explicit definition of the delimiter):

>>> a = dak.from_json(
...     "dataset*.json", blocksize="200 MB", delimiter=b"\n",
... )
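
The interaction between blocksize and delimiter in the last two calls can be pictured with a small pure-Python sketch. This is only an illustration of the idea (blocks of roughly blocksize bytes whose boundaries are moved to a delimiter so that no record is cut in half); it is not Dask's read_bytes() implementation, and the helper name split_on_delimiter is hypothetical.

```python
def split_on_delimiter(
    data: bytes, blocksize: int, delimiter: bytes = b"\n"
) -> list[bytes]:
    """Split data into blocks of about blocksize bytes, each ending at a delimiter.

    Hypothetical helper for illustration; not part of Dask or dask_awkward.
    """
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + blocksize, len(data))
        if end < len(data):
            # Move the boundary forward to the end of the next delimiter
            # so that a record is never split across two blocks.
            search_from = max(start, end - len(delimiter))
            idx = data.find(delimiter, search_from)
            end = len(data) if idx == -1 else idx + len(delimiter)
        blocks.append(data[start:end])
        start = end
    return blocks


# Three line-delimited records of 9 bytes each (newline included).
data = b'{"x": 1}\n{"x": 2}\n{"x": 3}\n'
blocks = split_on_delimiter(data, blocksize=10)
```

With a 10-byte block size the first boundary falls inside the second record, so the first block is extended to the end of that record and the third record forms its own block. This also shows why the approach assumes that the delimiter never appears inside a JSON value.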