[python] Add per-partition bucket pruning for HASH_FIXED tables#7804

Open
TheR1sing3un wants to merge 2 commits into apache:master from TheR1sing3un:py-bucket-pruning-per-partition
TheR1sing3un commented May 10, 2026

Background

PR-5.4 (#7744) added bucket pruning for HASH_FIXED tables but only on
the bucket-key dimension. Predicates that mix a partition column and
a bucket column under a top-level OR — e.g.
(part='a' AND id=1) OR (part='b' AND id=2) — couldn't be pruned:
the OR mixes two dimensions, so the existing logic gave up and read
every bucket in both partitions. PR-5.4 left this as a TODO in the
module docstring.

Effect

The same query now reads exactly one bucket per partition (the bucket
holding id=1 in part='a', the bucket holding id=2 in
part='b'). The selector first evaluates the predicate against each
partition value, so the OR collapses to a single AND inside each
partition, and bucket selection runs on that simplified form.

The soundness contract is unchanged: the bucket set remains a superset
of the buckets that contain matching rows; any error falls open to
"all buckets accept" and never drops a bucket that has matches.

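The fail-open rule above can be sketched as a small wrapper. This is only an illustration of the contract, not the PR's code: `fail_open` and `strict_select` are hypothetical names, and the `(bucket, total_buckets)` argument order is an assumption.

```python
from typing import Callable


def fail_open(select: Callable[..., bool]) -> Callable[..., bool]:
    """Wrap a bucket predicate so that any error accepts the bucket."""
    def wrapped(*args, **kwargs) -> bool:
        try:
            return bool(select(*args, **kwargs))
        except Exception:
            # Reading an extra bucket is only slower; skipping one that
            # holds matching rows would be wrong. So errors accept.
            return True
    return wrapped


def strict_select(bucket: int, total_buckets: int) -> bool:
    # Hypothetical selector for id=1, using modulo as a stand-in hash.
    if total_buckets <= 0:
        raise ValueError("unknown bucket count")
    return bucket == 1 % total_buckets


safe_select = fail_open(strict_select)
```

With four buckets, `safe_select(1, 4)` accepts and `safe_select(2, 4)` rejects; a call that raises inside `strict_select`, such as `safe_select(0, -1)`, accepts instead of propagating the error.
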
The PR has two commits: the helper and the FileScanner wiring. Nine unit
tests cover the predicate-fold walker and the per-partition cache; one
e2e test on a 2-partition × 4-bucket table checks that the mixed-OR
query reads ≤ 2 splits instead of one per (partition, bucket) pair (8 in total).

The first commit adds the predicate-replace + AND/OR fold infrastructure
that lets the bucket selector specialise itself per concrete partition
value, the piece called out as a TODO at the bottom of the
bucket_select_converter module docstring.

Three pieces ship in this commit, all internal:

* ``replace_partition_predicate(predicate, partition_field_names,
  partition_values)``: walker that substitutes partition leaves with
  their evaluated truth value and folds AND/OR. Three-way return —
  ``None`` (cleared / always true), ``False`` (always false), or
  the simplified ``Predicate``.
* ``_Selector`` is now keyed by ``(partition_tuple, total_buckets)``
  and accepts a third positional ``partition`` arg in ``__call__``.
  Two-arg legacy callers (early manifest filter) still work — they
  get the partition-agnostic over-approximation.
* ``create_bucket_selector`` now takes an optional ``partition_fields``
  list. The selector built without it (or with a predicate that does
  not touch any partition column) keeps the existing shape and result.
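
The three-way fold described in the first bullet can be sketched with toy predicate types. `Eq`, `And`, `Or` and `fold_partition` are hypothetical stand-ins, not pypaimon's `Predicate` classes or the real `replace_partition_predicate`; passing the partition values as a dict is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple, Union


@dataclass(frozen=True)
class Eq:
    field: str
    value: object


@dataclass(frozen=True)
class And:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    children: Tuple["Node", ...]


Node = Union[Eq, And, Or]


def fold_partition(pred: Node, partition_fields: Set[str],
                   partition_values: Dict[str, object]):
    """Return None (always true), False (always false), or a simplified node."""
    if isinstance(pred, Eq):
        if pred.field not in partition_fields or pred.field not in partition_values:
            return pred  # not a partition leaf, or value unknown: keep as-is
        return None if partition_values[pred.field] == pred.value else False

    results = [fold_partition(c, partition_fields, partition_values)
               for c in pred.children]
    if isinstance(pred, And):
        if any(r is False for r in results):
            return False  # one false child kills the conjunction
        kept = [r for r in results if r is not None]
        if not kept:
            return None
        return kept[0] if len(kept) == 1 else And(tuple(kept))

    # Or: one always-true child makes the whole disjunction always true.
    if any(r is None for r in results):
        return None
    kept = [r for r in results if r is not False]
    if not kept:
        return False
    return kept[0] if len(kept) == 1 else Or(tuple(kept))


mixed_or = Or((
    And((Eq("part", "a"), Eq("id", 1))),
    And((Eq("part", "b"), Eq("id", 2))),
))
in_part_a = fold_partition(mixed_or, {"part"}, {"part": "a"})  # Eq("id", 1)
in_part_c = fold_partition(mixed_or, {"part"}, {"part": "c"})  # False
```

In this toy version the partition leaf disappears once it evaluates to true, so the branch for `part='a'` reduces to the single bucket-key leaf `id=1`, and a partition that matches no branch folds to `False`.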

This commit does not yet wire the partition into ``FileScanner``;
``_filter_manifest_entry`` still calls the selector with two args, so
all existing pushdown_bucket tests stay green.
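
A rough sketch of how a selector can cache per ``(partition_tuple, total_buckets)`` while still serving two-arg callers. `ToySelector` and `toy_compute` are hypothetical; the real ``_Selector`` internals, its argument order, and its hash function are not shown in this PR description, so modulo is used as a stand-in hash and "accept everything" as the partition-agnostic fallback.

```python
from typing import Callable, Dict, Optional, Set, Tuple

BucketSet = Optional[Set[int]]  # None means "every bucket accepted"


class ToySelector:
    def __init__(self, compute: Callable[[Optional[tuple], int], BucketSet]):
        self._compute = compute
        self._cache: Dict[Tuple[Optional[tuple], int], BucketSet] = {}
        self.compute_calls = 0  # exposed only to make caching observable

    def __call__(self, bucket: int, total_buckets: int,
                 partition: Optional[tuple] = None) -> bool:
        key = (partition, total_buckets)
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self._compute(partition, total_buckets)
        allowed = self._cache[key]
        return True if allowed is None else bucket in allowed


def toy_compute(partition: Optional[tuple], total_buckets: int) -> BucketSet:
    # (part='a' AND id=1) OR (part='b' AND id=2), already folded by hand.
    if partition is None:
        return None  # partition unknown: conservative over-approximation
    wanted_id = {("a",): 1, ("b",): 2}
    if partition not in wanted_id:
        return set()  # unsatisfiable partition: empty bucket set
    return {wanted_id[partition] % total_buckets}


selector = ToySelector(toy_compute)
hit = selector(1, 4, ("a",))      # True
miss = selector(2, 4, ("a",))     # False, served from the cache
legacy = selector(3, 4)           # True: two-arg call accepts every bucket
```

The second call for partition ``("a",)`` reuses the cached bucket set, so `toy_compute` runs once per distinct key rather than once per manifest entry.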

Tests: nine new unit cases covering ``replace_partition_predicate``
folding, the per-partition cache, fall-through when partition is
unknown, and the empty-bucket-set result for an unsatisfiable
partition.

The second commit switches ``_filter_manifest_entry`` to call the bucket selector with
the entry's partition row, and passes the table's partition fields
into ``create_bucket_selector`` so the selector can specialise the
predicate per concrete partition value.

The early manifest filter (``_build_early_bucket_filter``) still uses
the two-arg form because the partition row hasn't been deserialised
at that stage; the selector internally falls back to a sound
partition-agnostic over-approximation there. Per-partition tightening
runs on the late filter once the entry is fully decoded.
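
The two filter stages can be illustrated on the 2-partition × 4-bucket layout. This is a toy model only: `accepts` is a hypothetical selector with hand-computed bucket sets, and the early stage is modelled as accepting everything, which is the most conservative sound choice; the real partition-agnostic approximation may be tighter.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, int]  # (partition value, bucket); stand-in for a manifest entry
TOTAL_BUCKETS = 4

# Hand-computed per-partition bucket sets for
# (part='a' AND id=1) OR (part='b' AND id=2), with modulo as a stand-in hash.
PER_PARTITION = {"a": {1 % TOTAL_BUCKETS}, "b": {2 % TOTAL_BUCKETS}}


def accepts(bucket: int, total_buckets: int,
            partition: Optional[str] = None) -> bool:
    if partition is None:
        return True  # early stage: partition not decoded yet, keep the entry
    return bucket in PER_PARTITION.get(partition, set())


entries: List[Entry] = [(p, b) for p in ("a", "b") for b in range(TOTAL_BUCKETS)]

# Early filter: two-arg call, partition row not yet deserialised.
after_early = [e for e in entries if accepts(e[1], TOTAL_BUCKETS)]

# Late filter: three-arg call once the entry is fully decoded.
after_late = [e for e in after_early if accepts(e[1], TOTAL_BUCKETS, e[0])]
```

Here all 8 entries survive the early stage and only `("a", 1)` and `("b", 2)` survive the late stage, which matches the ≤ 2 splits bound stated for the end-to-end test.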

End-to-end test: ``(part='a' AND id=1) OR (part='b' AND id=2)`` on a
two-partition four-bucket table, asserting both correctness (only the
two matching rows come back) and pruning effectiveness (≤ 2 splits
instead of one per (partition, bucket) combination).