This repository is Firebolt's fork of Apache Arrow,
currently based on the apache-arrow-20.0.0 tag. Only the cpp/ subtree is
consumed downstream (by Firebolt's query engine); the other language bindings
and ecosystem files are left in place but are neither built nor maintained here.
The rest of this document describes what this fork changes and why. Most of it would not make sense to upstream: these changes exist to make Arrow fit Firebolt's execution model.
- Base: `apache-arrow-20.0.0`.
- Branching model: we track upstream release tags and carry Firebolt patches on top. When bumping to a new upstream version, rebase (or re-cherry-pick) this set of changes.
- Only `cpp/` is compiled and shipped. Changes outside of `cpp/` (e.g. CI configuration) exist only to keep this repository's own CI working.
Firebolt's external scan flow (S3, GCS, etc., driven by our Buffer Manager)
does not fetch whole files. It fetches them as a sequence of fixed-size
chunks (typically 2 MB) that are non-contiguous in memory and may arrive
out of order. Upstream Arrow's Parquet reader assumes a RandomAccessFile
that hands back contiguous byte ranges, so we had to teach it to read from a
list of slices without copying into a contiguous staging buffer first.
Everything with "corded" in its name exists to support that. It's a bit of a misnomer since the cord data structure (aka rope) is something entirely different.
- `arrow::CordedBuffer` (`cpp/src/arrow/corded_buffer.{h,cc}`): a non-owning, non-contiguous buffer: a `std::span<const Slice>` plus a current read position. Supports `Peek`, `Advance`, zero-copy reads within a single slice, and copying reads that span slices.
- `arrow::io::CordedInputStream` and `arrow::io::CordedRandomAccessFile` (`cpp/src/arrow/io/interfaces.h`, `cpp/src/arrow/io/memory.{h,cc}`): the streaming / random-access file interfaces adapted for corded data.
- Corded-aware decompression: `cpp/src/arrow/util/compression_corded.cc` and `compression_snappy.cc`, plus hooks in `compression.{h,cc}`.
- Parquet page reader reads directly from corded buffers, with a variant in place rather than fake `arrow::Buffer` wrappers around slices. CRC32 checks are preserved.
- Footer / metadata parsing works from corded buffers. The happy path (entire footer in one slice) avoids copying; only a multi-slice footer is copied into a temporary contiguous buffer.
- A `ReaderProperties::corded_buffer` knob turns on the corded code path.
- `FileReader::GetColumnReader(int row_group, int column, ...)` sets up a per-row-group column reader. This complements our chunked fetch model: we can materialize one column of one row group without the whole-file assumptions of the default reader.
The test strategy for all of this is to run the existing Parquet reader/writer
tests through corded buffers at several slice sizes (typically 10 bytes, 42
bytes, 10 KiB) so we exercise both "fits in one slice" and "spans many slices"
paths. See cpp/src/parquet/test_corded_file.{h,cc}.
Separate from the corded work, there is an in-progress C++ implementation of selective Parquet footer parsing, similar to what is described in https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/
- `ReaderProperties::set_firebolt_columns_filter()` takes an `unordered_set<string_view>` of top-level column names. During thrift deserialization of `FileMetaData`, columns not in the set are skipped in the schema, in per-row-group `column_chunks`, and in `column_orders`, without allocating for the names/stats/etc. that we're about to throw away.
- `SchemaElement` gained a `firebolt_leaf_index` so that, even when elements are skipped, the remaining leaf columns still map correctly onto the leaf-indexed row-group metadata.
- Nested column filtering is not yet implemented; the filter is top-level only. A struct with 100 fields still parses all 100 even if you only want one. Noted for a later pass.
- Measured impact: ~8× on a "count(*) over many-column Parquet files" scenario (19 s -> 2.5 s in a customer workload benchmark).
- This is currently entirely name-based, so not suitable for Iceberg scans, which have to resolve columns based on field IDs rather than names.
- Future optimization: wire two sets, one for columns to scan and one for columns to filter on, and read statistics only for those columns to filter on.
To add the primitives the fast-metadata path needed (`readStringView`,
skipping strings without materializing them), we had to patch Thrift. We
previously used a public mirror that we couldn't push to. Now:
- A trimmed-down Apache Thrift C++ source tree is copied into `third_party/thrift/` (started from Apache Thrift commit `2a93df80f27739ccabb5b885cb12a8dc7595ecdf`, then pruned aggressively).
- Thrift sources are compiled into the Parquet library. Arrow's `ThirdpartyToolchain.cmake` no longer treats Thrift as an external dependency; no `thrift::thrift` link target is produced.
- `thrift_internal.h` always uses `TConfiguration` to lift size limits, since we no longer have a generated `thrift/config.h`.
- Additional cleanup of vendored Thrift: dropped `TVirtualProtocol` (unused fallback class), dropped unused network transports, dropped the recursion-depth tracker.
- `FireboltAllocator` / `firebolt_memory_pool` (`cpp/src/arrow/memory_pool.{h,cc}`) is an `arrow::MemoryPool` that routes all allocations through `operator new` / `operator delete`. This is deliberate: it lets Firebolt's `MemoryTracker` (which hooks `new`/`delete`) see every Arrow allocation. jemalloc is still the underlying allocator in non-sanitizer builds, so `ReleaseUnused()` delegates to the jemalloc pool.
- The default "system" pool on Firebolt builds is switched to the Firebolt pool, so Arrow code that uses `default_memory_pool()` is automatically tracked.
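
As a rough illustration of routing allocations through the global `operator new` / `operator delete`, consider the sketch below. It does not implement the `arrow::MemoryPool` interface and is not the fork's `FireboltAllocator`; the class `NewDeletePool`, its methods, and the 64-byte alignment constant are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Illustrative pool that forwards to the global operator new/delete so that
// any process-wide new/delete hook observes each allocation.
// Hypothetical class; not thread-safe as written.
class NewDeletePool {
 public:
  // Assumed alignment for the example.
  static constexpr std::size_t kAlignment = 64;

  // Returns nullptr on failure instead of throwing.
  uint8_t* Allocate(std::size_t size) {
    void* p = ::operator new(size, std::align_val_t{kAlignment}, std::nothrow);
    if (p != nullptr) bytes_allocated_ += size;
    return static_cast<uint8_t*>(p);
  }

  // The size must match the one passed to Allocate.
  void Free(uint8_t* p, std::size_t size) {
    ::operator delete(p, std::align_val_t{kAlignment});
    bytes_allocated_ -= size;
  }

  std::size_t bytes_allocated() const { return bytes_allocated_; }

 private:
  std::size_t bytes_allocated_ = 0;
};
```

Because the sketch calls the replaceable global allocation functions rather than `malloc` or an allocator's own entry points, a tracker that replaces those functions sees these allocations without any pool-specific wiring.
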
Arrow is built with ARROW_ENABLE_THREADS=false as we don't want Arrow spawning
thread pools; our execution engine schedules its own work. But some of Arrow's
single-threaded code paths (notably `SerialExecutor`) rely on static global
state that assumes only one thread at a time calls into Arrow. That's
not true for us: multiple Firebolt threads call into Arrow independently, and
TSAN caught the race.
- Added `ARROW_ENABLE_CONCURRENT_SERIAL_EXECUTOR` (on by default; see `cpp/cmake_modules/DefineOptions.cmake`, `config.h.cmake`). When enabled, the static-globals path in `SerialExecutor` and a few dependents is disabled in favor of code that is safe for concurrent callers.
- Removed `AfterForkState`, as upstream PR `apache/arrow#14594` already documented it as dead, and it broke our gtest death tests.
- Dictionary-encoded booleans (`cpp/src/parquet/decoder.cc` + `encoding_test.cc`). Upstream refuses to read these. Snowflake produces them in Iceberg tables, so we need to be able to read them.
- Unaligned buffers (`cpp/src/arrow/ipc/reader.cc`). Flight / raw socket reads can hand us a `Splice()` that is not aligned to the type it contains. UBSAN complains; old compilers can SIGSEGV even on x86. If a buffer isn't 8-byte aligned, we copy it into a 64-byte-aligned buffer.
- Overly-large IPC allocations (`cpp/src/arrow/ipc/reader.cc`). Added upfront allocation-size checks so the ASan/UBSan fuzzers stop tripping on maliciously crafted IPC frames.
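
The unaligned-buffer rule above ("not 8-byte aligned, so copy into a 64-byte-aligned buffer") can be sketched in isolation. The names `IsAligned`, `CopyIfUnaligned` and `AlignedBytes` are hypothetical, and the sketch uses plain pointers rather than Arrow buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

// Frees memory obtained from the 64-byte-aligned operator new below.
struct AlignedDeleter {
  void operator()(uint8_t* p) const {
    ::operator delete(p, std::align_val_t{64});
  }
};
using AlignedBytes = std::unique_ptr<uint8_t, AlignedDeleter>;

bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Returns a 64-byte-aligned copy when data is not 8-byte aligned, or an
// empty pointer when the input can be used as-is (no copy made).
AlignedBytes CopyIfUnaligned(const uint8_t* data, std::size_t size) {
  if (IsAligned(data, 8)) return nullptr;
  auto* copy =
      static_cast<uint8_t*>(::operator new(size, std::align_val_t{64}));
  std::memcpy(copy, data, size);
  return AlignedBytes(copy);
}
```

Checking against 8 bytes covers the widest primitive element types, so aligned inputs stay zero-copy and only misaligned ones pay for the copy.
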
- Timestamp resolution on import is microseconds, not nanoseconds (`cpp/src/arrow/adapters/orc/util.cc`, `adapter_test.cc`). Our engine's native timestamp type is microseconds; nanos widened unnecessarily and tripped range checks on valid values.
- Fix NULL-timestamp conversion: the previous code touched the value even when the row was null.
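
A minimal sketch of both points, under the assumption that a timestamp batch carries whole seconds, a sub-second nanosecond part, and a per-row validity flag. `TimestampBatch` and `ToMicros` are hypothetical names, not the code in `util.cc`. For context, a signed 64-bit count of nanoseconds since the epoch spans only about 292 years in either direction, whereas microseconds span roughly a thousand times more.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Simplified stand-in for a timestamp column batch.
struct TimestampBatch {
  std::vector<int64_t> seconds;      // whole seconds since the epoch
  std::vector<int64_t> nanoseconds;  // sub-second part, 0..999'999'999
  std::vector<bool> not_null;        // validity flag per row
};

// Converts each valid row to microseconds since the epoch.
std::vector<std::optional<int64_t>> ToMicros(const TimestampBatch& batch) {
  std::vector<std::optional<int64_t>> out(batch.not_null.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    // Null rows may hold garbage; skip them without reading the value slots.
    if (!batch.not_null[i]) continue;
    out[i] = batch.seconds[i] * 1'000'000 + batch.nanoseconds[i] / 1'000;
  }
  return out;
}
```

Checking the validity flag before touching `seconds[i]` or `nanoseconds[i]` is the part that corresponds to the NULL-timestamp fix: arbitrary bytes in a null slot can otherwise overflow the multiplication or fail a range check.
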
None of this is functionally interesting; it exists to keep this repo compilable and its CI green.
- `cpp/` requires C++20 (needed for `std::span`, used by `CordedBuffer`).
- LLVM/clang 18; sanitizer builds run on Ubuntu 24.04 (22.04's boost is too old for clang-18).
- CI is stripped to C++ and Linux only, no macOS, no Windows, no other language bindings, no examples. See `.github/workflows/`.
- A handful of C++20 deprecation warnings are silenced with `#pragma GCC diagnostic ignored` in places where the upstream fix hasn't landed.
- `git log apache-arrow-20.0.0..HEAD -- cpp/` is the authoritative list of Firebolt's changes.
- The largest single change is the corded-buffer work. Read `cpp/src/arrow/corded_buffer.h` first, then the Parquet-side integration in `cpp/src/parquet/file_reader.cc` and `column_reader.cc`.
- For the fast-metadata work, start at `ReaderProperties::set_firebolt_columns_filter` and follow the filter into `cpp/src/generated/parquet_types.tcc`.
Apache Arrow is licensed under Apache 2.0 (see LICENSE.txt and NOTICE.txt). All Firebolt modifications in this fork are also distributed under the Apache 2.0 license.