Simplified filter expression has a Null type instead of Int64 type across the FFI layer #1551

@jwimberl

Describe the bug

A custom table provider for a ParquetSource, with a trivial catalog and an Int64 column, fails when a SQL query has a filter with a literal limit on that column. The error has the form

assertion `left == right` failed: Simplified expression should have the same data type as the original
  left: Null
 right: Int64

The error does not occur with datafusion-python 52, nor when the query runs purely in a Rust SessionContext. The backtrace for the above error also shows it coming from datafusion-ffi code.

Of course, this may not be a bug but rather some bad practice that the version 52 crates tolerate and that is now invalid. An MRE of this custom table provider can be found in the public repo https://github.com/jwimberl/datafusion_python_53_int64filter_repro, which contains

  • a non-working datafusion53 version (in branch main)
  • a baseline working version (in branch datafusion52)

and a canned dummy dataset. The README.md of this repo has more details.

To Reproduce

In the main branch, build the py_repro_provider crate with maturin develop and run python repro.py. This loads the dummy dataset as a table dummy_table and runs two queries

  • SELECT * FROM dummy_table LIMIT 1, which is successful
  • SELECT * FROM dummy_table LIMIT 5, which panics

Its output should be something like

Successful query:
   a   b
0  0  42
Unsuccesful query:

thread '<unnamed>' panicked at /home/jwimberley/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/datafusion-physical-expr-53.1.0/src/simplifier/mod.rs:76:17:
assertion `left == right` failed: Simplified expression should have the same data type as the original
  left: Null
 right: Int64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

followed by backtrace information.
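
For readers unfamiliar with the check that fires here, the following is a toy, standard-library-only model of the invariant named in the panic message: a rewrite of an expression must not change its data type. It is not DataFusion code and not a diagnosis of where the Null type comes from. `Literal`, `checked_simplify`, `keep_type`, and `lose_type` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Literal:
    """Toy stand-in for a typed literal expression; not a DataFusion type."""
    value: object
    data_type: str  # e.g. "Int64" or "Null"


def checked_simplify(expr: Literal, simplify: Callable[[Literal], Literal]) -> Literal:
    """Apply a rewrite and enforce that the data type is unchanged."""
    simplified = simplify(expr)
    if simplified.data_type != expr.data_type:
        raise AssertionError(
            "Simplified expression should have the same data type as the original: "
            f"{simplified.data_type} != {expr.data_type}"
        )
    return simplified


def keep_type(expr: Literal) -> Literal:
    # Hypothetical rewrite that preserves the literal's type.
    return Literal(expr.value, expr.data_type)


def lose_type(expr: Literal) -> Literal:
    # Hypothetical rewrite that drops the type, purely to trigger the check.
    return Literal(None, "Null")


five = Literal(5, "Int64")
assert checked_simplify(five, keep_type) == five

try:
    checked_simplify(five, lose_type)
except AssertionError as err:
    message = str(err)
else:
    message = ""
```

In this sketch, any step that hands the check a literal whose type has been lost would trip it, which matches the shape of the Null versus Int64 mismatch reported above.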

Expected behavior

In the datafusion52 branch, build the py_repro_provider crate with maturin develop and run python repro.py. It runs the same two queries, and its output should be

Successful query:
   a   b
0  0  42
Also successful query:
   a   b
0  0  42
1  1  42
2  2  42
3  3  42
4  4  42
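
The steps for comparing the two branches can be scripted roughly as follows. This is a minimal sketch, assuming that maturin develop is run inside a py_repro_provider directory and that repro.py sits at the repository root; neither location is confirmed here, so check the README.md of the repo. `repro_commands` and `run_repro` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def repro_commands(branch: str) -> list[tuple[list[str], str]]:
    """Return (argv, working directory relative to the repo root) pairs."""
    return [
        (["git", "checkout", branch], "."),
        # Assumption: the crate lives in a directory named after it.
        (["maturin", "develop"], "py_repro_provider"),
        # Assumption: repro.py is at the repository root.
        (["python", "repro.py"], "."),
    ]


def run_repro(repo_root: str, branch: str) -> int:
    """Run each step in order; stop at and return the first non-zero exit code."""
    for argv, cwd in repro_commands(branch):
        result = subprocess.run(argv, cwd=Path(repo_root) / cwd)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `run_repro(path_to_clone, "main")` and then `run_repro(path_to_clone, "datafusion52")` in an environment with maturin installed should exercise the failing and the working version in turn.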

Additional context

In either the main branch or datafusion52 branch, the Rust code for the table provider is in the directory repro_provider, and there is a corresponding cargo test that runs SELECT * FROM dummy_table WHERE a < 5. Without the FFI layer, this is successful with both datafusion 52 and 53.

Labels: bug