dask_awkward.report_necessary_columns

dask_awkward.report_necessary_columns(*args, traverse=True)

Get the columns necessary to compute a collection.

This function is specific to sources that are columnar (e.g. Parquet).

Parameters:
  • *args (Dask collections or HighLevelGraphs) – The collection (or collection graph) of interest. These can be individual objects, lists, sets, or dictionaries.

  • traverse (bool, optional) – If True (default), built-in Python collections are traversed looking for any Dask collections they might contain.

Returns:

Mapping that pairs each input layer in the graph with the set of necessary IO columns identified by column optimisation of that layer. If a layer is not backed by a columnar source, None appears in place of a set.

Return type:

dict[str, frozenset[str] | None]

Examples

Suppose we have a hypothetical Parquet dataset (ds) with the fields

  • “foo”

  • “bar”

  • “baz”

and the “baz” field has the subfields

  • “x”

  • “y”

The calculation of ds.bar + ds.baz.x will require only the bar and baz.x columns from the Parquet file.

>>> import dask_awkward as dak
>>> ds = dak.from_parquet("some-dataset")
>>> ds.fields
["foo", "bar", "baz"]
>>> ds.baz.fields
["x", "y"]
>>> x = ds.bar + ds.baz.x
>>> dak.report_necessary_columns(x)
{
    "from-parquet-abc123": frozenset({"bar", "baz.x"})
}
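
Multiple collections can be passed in a single call, and with the default traverse=True they may also be nested inside built-in Python containers such as lists or dicts. A minimal sketch, continuing the hypothetical dataset above (the layer name is illustrative):

>>> y = ds.foo
>>> dak.report_necessary_columns([x, y])
{
    "from-parquet-abc123": frozenset({"foo", "bar", "baz.x"})
}

Because x and y share the same input layer, the reported set is the union of the columns each collection requires.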
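
Layers that are not backed by a columnar source map to None, as noted above. A hedged sketch, assuming a collection created with dak.from_lists (an in-memory, non-columnar source; the layer name is again illustrative):

>>> z = dak.from_lists([[1, 2, 3], [4, 5]])
>>> dak.report_necessary_columns(z)
{
    "from-lists-def456": None
}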