dask_awkward.to_parquet

dask_awkward.to_parquet(array, destination, list_to32=False, string_to32=True, bytestring_to32=True, emptyarray_to=None, categorical_as_dictionary=False, extensionarray=False, count_nulls=True, compression='zstd', compression_level=None, row_group_size=67108864, data_page_size=None, parquet_flavor=None, parquet_version='2.4', parquet_page_version='1.0', parquet_metadata_statistics=True, parquet_dictionary_encoding=False, parquet_byte_stream_split=False, parquet_coerce_timestamps=None, parquet_old_int96_timestamps=None, parquet_compliant_nested=False, parquet_extra_options=None, storage_options=None, write_metadata=False, compute=True, prefix=None)

Write data to Parquet format.

This will create one output file per partition.

See the documentation for ak.to_parquet() for more information; there are many optional function arguments that are described in that documentation.

Parameters:
  • array (Array) – The dask_awkward.Array collection to write to disk.

  • destination (str) – Where to store the output; this can be a local filesystem path or a remote filesystem path.

  • list_to32 (bool) – See ak.to_parquet()

  • string_to32 (bool) – See ak.to_parquet()

  • bytestring_to32 (bool) – See ak.to_parquet()

  • emptyarray_to (Any | None) – See ak.to_parquet()

  • categorical_as_dictionary (bool) – See ak.to_parquet()

  • extensionarray (bool) – See ak.to_parquet()

  • count_nulls (bool) – See ak.to_parquet()

  • compression (str | dict | None) – See ak.to_parquet()

  • compression_level (int | dict | None) – See ak.to_parquet()

  • row_group_size (int | None) – See ak.to_parquet()

  • data_page_size (int | None) – See ak.to_parquet()

  • parquet_flavor (Literal['spark'] | None) – See ak.to_parquet()

  • parquet_version (Literal['1.0', '2.4', '2.6']) – See ak.to_parquet()

  • parquet_page_version (Literal['1.0', '2.0']) – See ak.to_parquet()

  • parquet_metadata_statistics (bool | dict) – See ak.to_parquet()

  • parquet_dictionary_encoding (bool | dict) – See ak.to_parquet()

  • parquet_byte_stream_split (bool | dict) – See ak.to_parquet()

  • parquet_coerce_timestamps (Literal['ms', 'us'] | None) – See ak.to_parquet()

  • parquet_old_int96_timestamps (bool | None) – See ak.to_parquet()

  • parquet_compliant_nested (bool) – See ak.to_parquet()

  • parquet_extra_options (dict | None) – See ak.to_parquet()

  • storage_options (dict[str, Any] | None) – Storage options passed to fsspec.

  • write_metadata (bool) – Write Parquet metadata.

  • compute (bool) – If True, immediately compute the result (write the data to disk). If False, a Scalar collection is returned so that compute can be called explicitly later.

  • prefix (str | None) – An additional prefix for output files. If None, all parts inside the destination directory will be named "partN.parquet"; if defined, the names will be f"{prefix}-partN.parquet".

Returns:

If compute is False, a dask_awkward.Scalar object is returned so that it can be computed later. If compute is True, the collection is computed immediately (and the data written to disk) and None is returned.

Return type:

Scalar | None

Examples

>>> import awkward as ak
>>> import dask_awkward as dak
>>> a = ak.Array([{"a": [1, 2, 3]}, {"a": [4, 5]}])
>>> d = dak.from_awkward(a, npartitions=2)
>>> d.npartitions
2
>>> dak.to_parquet(d, "/tmp/my-output", prefix="data")
>>> import os
>>> os.listdir("/tmp/my-output")
['data-part0.parquet', 'data-part1.parquet']