[python] Add per-partition bucket pruning for HASH_FIXED tables by TheR1sing3un · Pull Request #7804 · apache/paimon

TheR1sing3un · 2026-05-10T07:44:45Z

Background

PR-5.4 (#7744) added bucket pruning for HASH_FIXED tables but only on
the bucket-key dimension. Predicates that mix a partition column and
a bucket column under a top-level OR — e.g.
(part='a' AND id=1) OR (part='b' AND id=2) — couldn't be pruned:
the OR mixes two dimensions, so the existing logic gave up and read
every bucket in both partitions. PR-5.4 left this as a TODO in the
module docstring.

Effect

Same query now reads exactly one bucket per partition (the bucket
holding id=1 in part='a', the bucket holding id=2 in
part='b'). The selector evaluates the predicate per partition
value first — the OR collapses to a single AND inside each partition
— and bucket selection runs on that simplified form.

Soundness contract is unchanged: the bucket set remains a superset
of the buckets that contain matching rows; any error falls open to
"all buckets accept", never drops a bucket with matches.

Two commits — helper + FileScanner wiring. 9 unit tests cover the
predicate-fold walker and the per-partition cache; one e2e test on a
2-partition × 4-bucket table proves the mixed-OR query reads ≤ 2
splits instead of one per (partition, bucket).
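
A minimal sketch of the fail-open contract stated in this section, assuming a hypothetical ``compute`` callable that returns the candidate bucket set, or ``None`` when it cannot decide. The name ``select_buckets_fail_open`` and its signature are illustrative only and are not taken from the code in this PR:

```python
from typing import Callable, Iterable, Optional, Set


def select_buckets_fail_open(
    compute: Callable[[int], Optional[Iterable[int]]],
    total_buckets: int,
) -> Set[int]:
    """Return the buckets to read; widen to every bucket on any doubt."""
    all_buckets = set(range(total_buckets))
    try:
        selected = compute(total_buckets)
    except Exception:
        # Any error falls open: reading too much is safe, dropping a bucket is not.
        return all_buckets
    if selected is None:
        # "Cannot decide" is treated the same way as an error.
        return all_buckets
    # Ignore out-of-range bucket ids rather than trusting them.
    return set(selected) & all_buckets


def _broken(total_buckets: int):
    raise ValueError("unsupported predicate shape")


pruned = select_buckets_fail_open(lambda n: {1}, 4)
widened = select_buckets_fail_open(_broken, 4)
```

In this sketch every failure path widens to the full bucket range, so a pruning bug can cost extra reads but cannot hide matching rows.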

Adds the predicate-replace + AND/OR fold infrastructure that lets the bucket selector specialise itself per concrete partition value, the piece called out as a TODO at the bottom of the ``bucket_select_converter`` module docstring.

Three pieces ship in this commit, all internal:

* ``replace_partition_predicate(predicate, partition_field_names, partition_values)``: walker that substitutes partition leaves with their evaluated truth value and folds AND/OR. Three-way return: ``None`` (cleared / always true), ``False`` (always false), or the simplified ``Predicate``.
* ``_Selector`` is now keyed by ``(partition_tuple, total_buckets)`` and accepts a third positional ``partition`` arg in ``__call__``. Two-arg legacy callers (early manifest filter) still work; they get the partition-agnostic over-approximation.
* ``create_bucket_selector`` now takes an optional ``partition_fields`` list. The selector built without it (or with a predicate that does not touch any partition column) keeps the existing shape and result.

This commit does not yet wire the partition into ``FileScanner``; ``_filter_manifest_entry`` still calls the selector with two args, so all existing pushdown_bucket tests stay green.

Tests: nine new unit cases covering ``replace_partition_predicate`` folding, the per-partition cache, fall-through when partition is unknown, and the empty-bucket-set result for an unsatisfiable partition.
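
The three-way fold can be illustrated with a simplified stand-alone sketch. The ``Leaf``, ``And`` and ``Or`` classes and the ``fold_partition`` function below are hypothetical stand-ins (equality leaves only, no negation); they are not the project's ``Predicate`` type or the actual ``replace_partition_predicate`` implementation. Treating an unknown partition value as always true is an assumption made here so that the result stays an over-approximation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Set


@dataclass(frozen=True)
class Leaf:
    """Hypothetical equality leaf: field == value."""
    field: str
    value: Any


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


def fold_partition(pred, partition_fields: Set[str], partition_values: Dict[str, Any]):
    """Return None (always true), False (always false), or a simplified predicate."""
    if isinstance(pred, Leaf):
        if pred.field not in partition_fields:
            return pred  # non-partition leaf (e.g. bucket key): keep as-is
        if pred.field not in partition_values:
            return None  # unknown partition value: widen to always true
        return None if partition_values[pred.field] == pred.value else False

    folded = [fold_partition(c, partition_fields, partition_values) for c in pred.children]
    if isinstance(pred, And):
        if any(c is False for c in folded):
            return False  # one false conjunct kills the AND
        kept = [c for c in folded if c is not None]  # drop always-true conjuncts
        if not kept:
            return None
        return kept[0] if len(kept) == 1 else And(tuple(kept))

    # Or: one always-true branch makes the whole OR always true.
    if any(c is None for c in folded):
        return None
    kept = [c for c in folded if c is not False]  # drop always-false branches
    if not kept:
        return False
    return kept[0] if len(kept) == 1 else Or(tuple(kept))


# (part='a' AND id=1) OR (part='b' AND id=2)
pred = Or((
    And((Leaf("part", "a"), Leaf("id", 1))),
    And((Leaf("part", "b"), Leaf("id", 2))),
))
in_a = fold_partition(pred, {"part"}, {"part": "a"})  # Leaf(field='id', value=1)
in_c = fold_partition(pred, {"part"}, {"part": "c"})  # False
```

Inside partition ``a`` the OR collapses to the single leaf ``id = 1``, which a bucket-key-only selector can then handle; a partition matching neither branch folds to ``False``, corresponding to the empty bucket set.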

Switches ``_filter_manifest_entry`` to call the bucket selector with the entry's partition row, and passes the table's partition fields into ``create_bucket_selector`` so the selector can specialise the predicate per concrete partition value.

The early manifest filter (``_build_early_bucket_filter``) still uses the two-arg form because the partition row hasn't been deserialised at that stage; the selector internally falls back to a sound partition-agnostic over-approximation there. Per-partition tightening runs on the late filter once the entry is fully decoded.

End-to-end test: ``(part='a' AND id=1) OR (part='b' AND id=2)`` on a two-partition four-bucket table, asserting both correctness (only the two matching rows come back) and pruning effectiveness (≤ 2 splits instead of one per (partition, bucket) combination).
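
The difference between the two-arg early call and the three-arg late call can be sketched as follows. ``SelectorSketch`` and ``example_buckets_for`` are hypothetical names, and ``value % total_buckets`` is only a stand-in for the real bucket hash; the sketch assumes nothing about the actual ``_Selector`` beyond the cache key and call shapes described in the commit messages.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple


class SelectorSketch:
    """Caches the allowed bucket set per (partition, total_buckets)."""

    def __init__(self, buckets_for: Callable[[Optional[tuple], int], FrozenSet[int]]):
        self._buckets_for = buckets_for
        self._cache: Dict[Tuple[Optional[tuple], int], FrozenSet[int]] = {}
        self.computations = 0  # exposed only to show the cache at work

    def __call__(self, bucket: int, total_buckets: int, partition: Optional[tuple] = None) -> bool:
        key = (partition, total_buckets)
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = self._buckets_for(partition, total_buckets)
        return bucket in self._cache[key]


def example_buckets_for(partition: Optional[tuple], total_buckets: int) -> FrozenSet[int]:
    # Hard-coded for (part='a' AND id=1) OR (part='b' AND id=2).
    wanted_id = {("a",): 1, ("b",): 2}
    if partition is None:
        # Two-arg form: union over every branch, a partition-agnostic over-approximation.
        return frozenset(v % total_buckets for v in wanted_id.values())
    if partition not in wanted_id:
        return frozenset()  # unsatisfiable in this partition
    return frozenset({wanted_id[partition] % total_buckets})


selector = SelectorSketch(example_buckets_for)
early = [b for b in range(4) if selector(b, 4)]             # two-arg, early filter
late_a = [b for b in range(4) if selector(b, 4, ("a",))]    # three-arg, late filter
late_b = [b for b in range(4) if selector(b, 4, ("b",))]
```

In this sketch the early filter keeps buckets 1 and 2 for every partition, while the late filter narrows that to one bucket each for ``a`` and ``b``; each distinct key is computed once, regardless of how many manifest entries share it.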

TheR1sing3un added 2 commits May 10, 2026 15:33

TheR1sing3un wants to merge 2 commits into apache:master from TheR1sing3un:py-bucket-pruning-per-partition
