dask_awkward.from_json

dask_awkward.from_json(source, *, line_delimited=True, schema=None, nan_string=None, posinf_string=None, neginf_string=None, complex_record_fields=None, buffersize=65536, initial=1024, resize=8, highlevel=True, behavior=None, blocksize=None, delimiter=None, compression='infer', storage_options=None, meta_sample_rows=100, meta_sample_bytes='10 kiB')[source]

Create an Array collection from JSON data.

See ak.from_json() for more information.

Parameters
  • source (str | list[str]) – Local or remote directory or list of files containing JSON data to load. May contain glob patterns (passed to fsspec).

  • line_delimited (bool) – If True (the default), treat each line in the file as a JSON object; if False, each entire file is treated as a single JSON object.

  • schema (str | dict | list, optional) – If defined, the schema will be used by the parser to skip type discovery. If not defined (None, the default), dask-awkward’s optimization pass may generate a JSONSchema containing only the parts of the JSON data needed to build an Array that completes the desired computation. See dask-awkward’s optimization documentation for more information.

  • nan_string (str, optional) – See ak.from_json()

  • posinf_string (str, optional) – See ak.from_json()

  • neginf_string (str, optional) – See ak.from_json()

  • complex_record_fields (tuple[str, str], optional) – See ak.from_json()

  • buffersize (int) – See ak.from_json()

  • initial (int) – See ak.from_json()

  • resize (float) – See ak.from_json()

  • highlevel (bool) – Argument specific to awkward-array that is always True for dask-awkward.

  • behavior (dict, optional) – See ak.from_json()

  • blocksize (int, str, optional) – If None (the default), the collection will be partitioned on a per-file basis. If defined, this sets the size (in bytes) of each partition; a string of the form "10 MiB" is also accepted.

  • delimiter (bytes, optional) – Delimiter used to separate blocks; if blocksize is defined but this argument is not, the default is the newline bytestring b"\n".

  • compression (str, optional) – The compression of the dataset (the default is to infer it from the file suffix).

  • storage_options (dict[str, Any], optional) – Storage options passed to fsspec.

  • meta_sample_rows (int, optional) – Number of rows to sample from files for determining metadata. When reading files partitioned on a per-file basis, this is the number of lines extracted from the first file to determine the collection’s metadata.

  • meta_sample_bytes (int | str) – Number of bytes to sample from files for determining metadata. When reading files partitioned on a blocksize basis, this is the number of bytes sampled from the first partition to determine the collection’s metadata.

Returns

Resulting collection.

Return type

Array

Examples

An example where the data is stored in S3; this will grab all JSON files under the path, partitioning into 50 MB blocks and sampling the first 10 MB to determine the metadata:

>>> import dask_awkward as dak
>>> ds = dak.from_json(
...     "s3://path/to/data",
...     blocksize="50 MB",
...     meta_sample_bytes="10 MB",
... )
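
A complementary sketch using the default per-file partitioning (blocksize=None), where meta_sample_rows controls how many lines of the first file are read to infer the collection’s metadata (the file names here are illustrative):

>>> import dask_awkward as dak
>>> ds = dak.from_json(
...     ["file1.json", "file2.json"],
...     meta_sample_rows=500,
... )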

An example where a JSONSchema is pre-defined. In this case dask-awkward’s optimization infrastructure will not attempt to generate a minimal necessary schema; it will use the one provided:

>>> import dask_awkward as dak
>>> my_schema = ...
>>> ds = dak.from_json(["file1.json", "file2.json"], schema=my_schema)
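
The schema above is elided; purely as an illustration, a minimal JSONSchema for a hypothetical dataset whose records carry a numeric "x" field and a list-of-strings "y" field might look like:

>>> my_schema = {
...     "type": "object",
...     "properties": {
...         "x": {"type": "number"},
...         "y": {"type": "array", "items": {"type": "string"}},
...     },
... }
>>> ds = dak.from_json(["file1.json", "file2.json"], schema=my_schema)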

An example where each discovered file will be treated as a single JSON object when creating the Array collection:

>>> import dask_awkward as dak
>>> ds = dak.from_json("/path/to/files/**.json", line_delimited=False)