dask_awkward.from_json

dask_awkward.from_json(source, *, line_delimited=True, schema=None, nan_string=None, posinf_string=None, neginf_string=None, complex_record_fields=None, buffersize=65536, initial=1024, resize=8, highlevel=True, behavior=None, attrs=None, blocksize=None, delimiter=None, compression='infer', storage_options=None, meta_sample_rows=100, meta_sample_bytes='10 kiB')

Create an Array collection from JSON data.

See ak.from_json() for more information.

Parameters:
  • source (str | list[str]) – Local or remote directory or list of files containing JSON data to load. May contain glob patterns (passed to fsspec).

  • line_delimited (bool) – If True (the default), treat each line in each file as a separate JSON object; if False, each entire file is treated as a single JSON object.

  • schema (str | dict | list, optional) – If defined, the schema will be used by the parser to skip type discovery. If None (the default), dask-awkward’s optimization pass may generate a JSONSchema containing only the parts of the JSON data necessary to build an Array and complete the desired computation. See dask-awkward’s optimization documentation for more information.

  • nan_string (str, optional) – See ak.from_json()

  • posinf_string (str, optional) – See ak.from_json()

  • neginf_string (str, optional) – See ak.from_json()

  • complex_record_fields (tuple[str, str], optional) – See ak.from_json()

  • buffersize (int) – See ak.from_json()

  • initial (int) – See ak.from_json()

  • resize (float) – See ak.from_json()

  • highlevel (bool) – Argument specific to awkward-array that is always True for dask-awkward.

  • behavior (dict, optional) – See ak.from_json()

  • blocksize (int, str, optional) – If None (the default), the collection will be partitioned on a per-file basis. If defined, this sets the size (in bytes) of each partition; it may also be a string of the form "10 MiB".

  • delimiter (bytes, optional) – Delimiter used to separate blocks; if blocksize is defined but this argument is not, the default is the newline bytestring b"\n".

  • compression (str, optional) – Compression of the dataset; the default ("infer") deduces the compression from the file suffix.

  • storage_options (dict[str, Any], optional) – Storage options passed to fsspec.

  • meta_sample_rows (int, optional) – Number of rows to sample from files for determining metadata. When reading files partitioned on a per-file basis, this will be the number of lines extracted from the first file to determine the collection’s metadata.

  • meta_sample_bytes (int | str) – Number of bytes to sample from files for determining metadata. When reading files partitioned on a blocksize basis, this will be the number of bytes sampled from the first partition to determine the collection’s metadata.

  • attrs (Mapping[str, Any] | None) – See ak.from_json()

Returns:

Resulting collection.

Return type:

Array

Examples

An example where the data is stored in S3; this will grab all JSON files under the path, partition them into 50 MB blocks, and sample the first 10 MB to determine the metadata:

>>> import dask_awkward as dak
>>> ds = dak.from_json(
...     "s3://path/to/data",
...     blocksize="50 MB",
...     meta_sample_bytes="10 MB",
... )
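
If the bucket requires credentials or permits anonymous reads, those options can be supplied through storage_options, which is forwarded to fsspec. A minimal sketch, assuming the s3fs backend and a publicly readable bucket:

>>> import dask_awkward as dak
>>> ds = dak.from_json(
...     "s3://path/to/data",
...     blocksize="50 MB",
...     storage_options={"anon": True},  # anonymous access (s3fs option)
... )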

An example where a JSONSchema is pre-defined. In this case dask-awkward’s optimization infrastructure will not attempt to generate a minimal necessary schema; it will use the one provided:

>>> import dask_awkward as dak
>>> my_schema = ...
>>> ds = dak.from_json(["file1.json", "file2.json"], schema=my_schema)
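
The schema follows the JSONSchema specification. As an illustrative sketch only (the field names here are hypothetical, not part of the API), a schema for line-delimited records with a numeric field x and a variable-length integer list y might look like:

>>> my_schema = {
...     "type": "object",
...     "properties": {
...         "x": {"type": "number"},
...         "y": {"type": "array", "items": {"type": "integer"}},
...     },
... }
>>> ds = dak.from_json(["file1.json", "file2.json"], schema=my_schema)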

An example where each discovered file will be treated as a single JSON object when creating the Array collection:

>>> import dask_awkward as dak
>>> ds = dak.from_json("/path/to/files/**.json", line_delimited=False)
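
Compressed inputs are also supported; with the default compression="infer", the compression is deduced from the file suffix. A sketch assuming gzip-compressed, line-delimited files (the path is hypothetical):

>>> import dask_awkward as dak
>>> ds = dak.from_json("/path/to/data/*.json.gz")  # ".gz" suffix implies gzip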