dask_awkward.from_json
- dask_awkward.from_json(source, *, line_delimited=True, schema=None, nan_string=None, posinf_string=None, neginf_string=None, complex_record_fields=None, buffersize=65536, initial=1024, resize=8, highlevel=True, behavior=None, attrs=None, blocksize=None, delimiter=None, compression='infer', storage_options=None, meta_sample_rows=100, meta_sample_bytes='10 kiB')
Create an Array collection from JSON data.
See ak.from_json() for more information.
- Parameters:
  - source (str | list[str]) – Local or remote directory or list of files containing the JSON data to load. May contain glob patterns (passed to fsspec).
  - line_delimited (bool) – If True (the default), treat each line in the file as a JSON object; if False, entire files are treated as single objects.
  - schema (str | dict | list, optional) – If defined, the schema will be used by the parser to skip type discovery. If not defined (None, the default), dask-awkward's optimization capabilities will potentially be used to generate a JSONSchema that contains the minimal necessary parts of the JSON data needed to build an Array for the desired computation. See dask-awkward's optimization documentation for more information.
  - nan_string (str, optional) – See ak.from_json()
  - posinf_string (str, optional) – See ak.from_json()
  - neginf_string (str, optional) – See ak.from_json()
  - complex_record_fields (tuple[str, str], optional) – See ak.from_json()
  - buffersize (int) – See ak.from_json()
  - initial (int) – See ak.from_json()
  - resize (float) – See ak.from_json()
  - highlevel (bool) – Argument specific to awkward-array that is always True for dask-awkward.
  - behavior (dict, optional) – See ak.from_json()
  - blocksize (int | str, optional) – If None (the default), the collection will be partitioned on a per-file basis. If defined, this sets the size (in bytes) of each partition; it can be a string of the form "10 MiB".
  - delimiter (bytes, optional) – Delimiter to use for separating blocks; if blocksize is defined but this argument is not, the default is the newline bytestring b"\n".
  - compression (str, optional) – The compression of the dataset (the default is to infer it from the file suffix).
  - storage_options (dict[str, Any], optional) – Storage options passed to fsspec.
  - meta_sample_rows (int, optional) – Number of rows to sample from the files to determine metadata. When reading files partitioned on a per-file basis, this is the number of lines extracted from the first file to determine the collection's metadata.
  - meta_sample_bytes (int | str) – Number of bytes to sample from the files to determine metadata. When reading files partitioned on a blocksize basis, this is the number of bytes sampled from the first partition to determine the collection's metadata.
- Returns:
  Resulting collection.
- Return type:
  Array

Examples
An example where the data is stored in S3; this will grab all JSON files under the path in blocks of 50 MB, sampling the first 10 MB to determine metadata:
>>> import dask_awkward as dak
>>> ds = dak.from_json(
...     "s3://path/to/data",
...     blocksize="50 MB",
...     meta_sample_bytes="10 MB",
... )
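If the bucket requires credentials, they can be passed along to fsspec via storage_options. A minimal sketch assuming s3fs-style key/secret options; the option names and placeholder values are assumptions, not part of the original example:

>>> ds = dak.from_json(
...     "s3://path/to/data",
...     blocksize="50 MB",
...     storage_options={"key": "<access-key>", "secret": "<secret-key>"},
... )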
An example where a JSONSchema is pre-defined. In this case dask-awkward's optimization infrastructure will not attempt to generate a minimal necessary schema; it will use the one provided:
>>> import dask_awkward as dak
>>> my_schema = ...
>>> ds = dak.from_json(["file1.json", "file2.json"], schema=my_schema)
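For illustration only, a hand-written JSONSchema for line-delimited records with hypothetical fields x (a number) and y (a list of integers) might look like the following; the field names and types are assumptions, not part of the original example:

>>> my_schema = {
...     "type": "object",
...     "properties": {
...         "x": {"type": "number"},
...         "y": {"type": "array", "items": {"type": "integer"}},
...     },
... }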
An example where each discovered file will be treated as a single JSON object when creating the Array collection:
>>> import dask_awkward as dak
>>> ds = dak.from_json("/path/to/files/**.json", line_delimited=False)
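As a further sketch (the paths here are hypothetical), gzip-compressed line-delimited files can be read on a per-file basis with the compression stated explicitly rather than inferred from the file suffix, sampling more rows than the default to determine metadata:

>>> import dask_awkward as dak
>>> ds = dak.from_json(
...     "/path/to/data/*.json.gz",
...     compression="gzip",
...     meta_sample_rows=500,
... )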