GH-3561 Harden variants against malformed metadata.#3562
Conversation
636e75d to
9905728
Compare
9905728 to
db1cfe2
Compare
- reject oversized metadata/value declarations - range checking Only low cost checks are made. There's also a depth check consistent with the json parser; it's arguable as to whether that is needed. It will defend against StackOverflowExceptions by anything trying to treewalk, but should that code be the place to do the checks? The new test suite TestHardenedReader can be configured to actually emit the malformed files, to see how applications deal with them. Contains contributions from Claude AI.
db1cfe2 to
27d92e6
Compare
|
@rdblue why not review this PR before worrying about the iceberg one, as its where some things could be clarified and then repeated.
|
|
@Fokko could you look at this while I work on that thrift update? |
Fokko
left a comment
There was a problem hiding this comment.
This look good to me, thanks @steveloughran for working on this. I left a few comments. Should we also test the performance implications for list/dicts? Maybe a small JMH microbenchmark.
| // version=1, offsetSize=1, dictSize=0, single end-offset=0 — minimum well-formed empty metadata | ||
| static final ByteBuffer EMPTY_METADATA = ByteBuffer.wrap(new byte[] {0b1, 0x00, 0x00}); |
There was a problem hiding this comment.
I assume that we don't accept the single byte anymore? Should we still allow that, or make it explicit to the users?
| this.dictSize = 0; | ||
| } | ||
| this.metadataCache = null; | ||
| VariantUtil.validateValueShallow(this.value, dictSize); |
There was a problem hiding this comment.
Since we now validate in the constructor, should we add some Javadoc to this public function? Including the @throws IllegalArgumentException
Rationale for this change
Malformed parquet files could be distruptive enough to not only affect the execution of a single worker thread (which will ultimately reject it), but other threads on the same process. This can be disruptive.
What changes are included in this PR?
Only low cost checks are made, equivalent to arrow variant
try_new_with_metadata_and_shallow_validation()There's no equivalent
with_full_validation()logic is omitted. The caching logic of #3481 may be able to do this when it builds a dictionary, as range checking the increasing dictionary offsets is the key work there.There's also a depth check consistent with the json parser; it's arguable as to whether that is needed. It will defend against StackOverflowExceptions by anything trying to treewalk, but shouldn't that code be the place to do the checks?
Are these changes tested?
The new test suite TestHardenedReader can be configured to actually emit the malformed files, to see how applications deal with them.
Existing tests needed changes because they'd been getting away with malformed byte arrays; now this stuff is being checked the test cases need valid variants.
Are there any user-facing changes?
No
Closes #3561