[python] Add per-partition bucket pruning for HASH_FIXED tables #7804
Open
TheR1sing3un wants to merge 2 commits into apache:master
Adds the predicate-replace + AND/OR fold infrastructure that lets the bucket selector specialise itself per concrete partition value, the piece called out as a TODO at the bottom of the ``bucket_select_converter`` module docstring.

Three pieces ship in this commit, all internal:

* ``replace_partition_predicate(predicate, partition_field_names, partition_values)``: walker that substitutes partition leaves with their evaluated truth value and folds AND/OR. Three-way return: ``None`` (cleared / always true), ``False`` (always false), or the simplified ``Predicate``.
* ``_Selector`` is now keyed by ``(partition_tuple, total_buckets)`` and accepts a third positional ``partition`` arg in ``__call__``. Two-arg legacy callers (early manifest filter) still work; they get the partition-agnostic over-approximation.
* ``create_bucket_selector`` now takes an optional ``partition_fields`` list. The selector built without it (or with a predicate that does not touch any partition column) keeps the existing shape and result.

This commit does not yet wire the partition into ``FileScanner``; ``_filter_manifest_entry`` still calls the selector with two args, so all existing pushdown_bucket tests stay green.

Tests: nine new unit cases covering ``replace_partition_predicate`` folding, the per-partition cache, fall-through when partition is unknown, and the empty-bucket-set result for an unsatisfiable partition.
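The three-way fold described for ``replace_partition_predicate`` can be pictured with a minimal sketch. This is not the code from the PR: the ``Leaf`` and ``Compound`` classes and the ``fold_partition`` name are hypothetical stand-ins for the real ``Predicate`` type, and only equality leaves are modelled.

```python
from dataclasses import dataclass
from typing import Any, Sequence, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical equality leaf: column == value."""
    column: str
    value: Any


@dataclass(frozen=True)
class Compound:
    """Hypothetical AND/OR node; op is "and" or "or"."""
    op: str
    children: Tuple[Any, ...]


def fold_partition(pred: Union[Leaf, Compound],
                   partition_fields: Sequence[str],
                   partition_values: Sequence[Any]):
    """Return None (always true), False (always false), or a simplified predicate."""
    if isinstance(pred, Leaf):
        if pred.column not in partition_fields:
            return pred  # not a partition leaf: keep it for bucket selection
        actual = partition_values[list(partition_fields).index(pred.column)]
        return None if actual == pred.value else False

    folded = [fold_partition(c, partition_fields, partition_values)
              for c in pred.children]
    if pred.op == "and":
        if any(f is False for f in folded):
            return False  # one false child makes the AND false
        kept = [f for f in folded if f is not None]
        if not kept:
            return None  # every child was always true
    else:  # "or"
        if any(f is None for f in folded):
            return None  # one true child makes the OR true
        kept = [f for f in folded if f is not False]
        if not kept:
            return False  # every child was always false
    return kept[0] if len(kept) == 1 else Compound(pred.op, tuple(kept))


# (part='a' AND id=1) OR (part='b' AND id=2)
mixed = Compound("or", (
    Compound("and", (Leaf("part", "a"), Leaf("id", 1))),
    Compound("and", (Leaf("part", "b"), Leaf("id", 2))),
))

in_a = fold_partition(mixed, ["part"], ["a"])  # Leaf(column='id', value=1)
in_b = fold_partition(mixed, ["part"], ["b"])  # Leaf(column='id', value=2)
in_c = fold_partition(mixed, ["part"], ["c"])  # False: nothing can match
```

Using ``None`` for "always true" and ``False`` for "always false" keeps both outcomes distinguishable from a real predicate object, which is why the checks above use ``is`` rather than truthiness.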
Switches ``_filter_manifest_entry`` to call the bucket selector with the entry's partition row, and passes the table's partition fields into ``create_bucket_selector`` so the selector can specialise the predicate per concrete partition value.

The early manifest filter (``_build_early_bucket_filter``) still uses the two-arg form because the partition row hasn't been deserialised at that stage; the selector internally falls back to a sound partition-agnostic over-approximation there. Per-partition tightening runs on the late filter once the entry is fully decoded.

End-to-end test: ``(part='a' AND id=1) OR (part='b' AND id=2)`` on a two-partition four-bucket table, asserting both correctness (only the two matching rows come back) and pruning effectiveness (≤ 2 splits instead of one per (partition, bucket) combination).
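The difference between the two-arg and three-arg calls can be illustrated with a toy selector. Everything here is an assumption made for illustration: the ``ToySelector`` class, its argument order, the predicate given as a list of equality conjunctions, and ``toy_bucket`` (a plain modulo standing in for the table's real bucket hash) are not taken from the PR.

```python
from typing import Any, Dict, FrozenSet, Optional, Sequence, Tuple


def toy_bucket(value: int, total_buckets: int) -> int:
    # Stand-in only: the real bucket function is NOT a plain modulo.
    return value % total_buckets


class ToySelector:
    """Hypothetical selector over an OR of equality conjunctions."""

    def __init__(self, disjuncts: Sequence[Dict[str, Any]],
                 bucket_key: str, partition_fields: Sequence[str] = ()):
        self._disjuncts = list(disjuncts)
        self._bucket_key = bucket_key
        self._partition_fields = tuple(partition_fields)
        # Cache keyed by (partition_tuple, total_buckets), as described above.
        self._cache: Dict[Tuple[Optional[tuple], int], FrozenSet[int]] = {}

    def __call__(self, bucket: int, total_buckets: int,
                 partition: Optional[Sequence[Any]] = None) -> bool:
        key = (tuple(partition) if partition is not None else None,
               total_buckets)
        if key not in self._cache:
            self._cache[key] = self._compute(partition, total_buckets)
        return bucket in self._cache[key]

    def _compute(self, partition, total_buckets) -> FrozenSet[int]:
        # With no partition row, nothing is known, so no disjunct is dropped
        # and the result is the partition-agnostic over-approximation.
        known = (dict(zip(self._partition_fields, partition))
                 if partition is not None else {})
        selected = set()
        for conj in self._disjuncts:
            if any(col in conj and conj[col] != val
                   for col, val in known.items()):
                continue  # this disjunct is false in the given partition
            if self._bucket_key not in conj:
                return frozenset(range(total_buckets))  # cannot prune
            selected.add(toy_bucket(conj[self._bucket_key], total_buckets))
        return frozenset(selected)


# (part='a' AND id=1) OR (part='b' AND id=2) on a four-bucket table
selector = ToySelector(
    [{"part": "a", "id": 1}, {"part": "b", "id": 2}],
    bucket_key="id",
    partition_fields=["part"],
)
late_a = [b for b in range(4) if selector(b, 4, ("a",))]  # [1]
late_b = [b for b in range(4) if selector(b, 4, ("b",))]  # [2]
early = [b for b in range(4) if selector(b, 4)]           # [1, 2]
```

The two-arg call accepts the union of both partitions' buckets, which is looser but still sound; the three-arg call tightens it to one bucket per partition.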
Background
PR-5.4 (#7744) added bucket pruning for HASH_FIXED tables but only on
the bucket-key dimension. Predicates that mix a partition column and
a bucket column under a top-level OR, e.g.
`(part='a' AND id=1) OR (part='b' AND id=2)`, couldn't be pruned:
the OR mixes two dimensions, so the existing logic gave up and read
every bucket in both partitions. PR-5.4 left this as a TODO in the
module docstring.
Effect
The same query now reads exactly one bucket per partition (the bucket
holding `id=1` in `part='a'`, the bucket holding `id=2` in `part='b'`).
The selector evaluates the predicate per partition value first, so the
OR collapses to a single AND inside each partition, and bucket
selection runs on that simplified form.
Soundness contract is unchanged: the bucket set remains a superset
of the buckets that contain matching rows; any error falls open to
"all buckets accept", never drops a bucket with matches.
Two commits: helper + `FileScanner` wiring. 9 unit tests cover the
predicate-fold walker and the per-partition cache; one e2e test on a
2-partition × 4-bucket table proves the mixed-OR query reads ≤ 2
splits instead of one per (partition, bucket).
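The fall-open rule in the soundness contract above can be sketched as a small wrapper. This is only an illustration of the idea, not the PR's implementation; the `fail_open` decorator and the two example selectors are hypothetical.

```python
import functools
from typing import Callable, FrozenSet, Iterable


def fail_open(select_buckets: Callable[..., Iterable[int]]):
    """Wrap a bucket-set computation so any error accepts every bucket."""

    @functools.wraps(select_buckets)
    def wrapped(total_buckets: int, *args, **kwargs) -> FrozenSet[int]:
        everything = frozenset(range(total_buckets))
        try:
            result = select_buckets(total_buckets, *args, **kwargs)
            # Discard out-of-range ids; they cannot name a real bucket.
            return frozenset(result) & everything
        except Exception:
            # Reading too much is safe; dropping a bucket with matches is not.
            return everything

    return wrapped


@fail_open
def broken_selector(total_buckets: int):
    raise ValueError("unsupported predicate shape")


@fail_open
def working_selector(total_buckets: int):
    return {1}


fallback = broken_selector(4)   # frozenset({0, 1, 2, 3})
pruned = working_selector(4)    # frozenset({1})
```

Falling open trades pruning effectiveness for correctness: a failed selector costs extra reads, but the result set stays complete.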