Optimize Sheet#rows_generator method to use less memory and parse faster#138
Open
connorshea wants to merge 2 commits into
Open
Optimize Sheet#rows_generator method to use less memory and parse faster#138connorshea wants to merge 2 commits into
Sheet#rows_generator method to use less memory and parse faster#138connorshea wants to merge 2 commits into
Conversation
Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string, numeric, and empty cells). Median full-parse time dropped ~1.68x (1.85s -> 1.11s) and allocations dropped ~3x (28.6M -> ~9M), with byte-identical output verified on both default- and prefixed-namespace files across rows, simple_rows, and rows_with_meta_data. Changes, in order of impact: - Resolve the namespace prefix once via a `namespace_resolved` flag instead of re-checking `node.namespaces` on every node. Worksheets use a default namespace, so the old `prefix.empty?` guard never latched and allocated a namespaces hash for every node in the stream. This is the bulk of the wall-clock win. - Hoist `node.name`/`node.node_type` into locals (each was read up to 4x per node in the if/elsif chain) and reorder branches so the hottest nodes (<v>/<c>) are tested first. - Use `node.attribute_hash` instead of `node.attributes` for cell and row nodes. `Reader#attributes` is `attribute_hash.merge(namespaces)`, so it built and merged a namespaces hash on every call; we only need the element's own attributes. This is an allocation/GC win (drops Hash#merge and the per-node namespaces hash) with negligible wall-clock change. Safe because xlsx row/cell elements never declare their own namespaces (the namespace hash is empty for them), so the result is identical -- confirmed byte-for-byte including the self-closing-row + meta-data path. - In fill_in_empty_cells, drop the redundant `.to_a` on the column range and use `delete_suffix` instead of `gsub` to strip the row number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to the rows_generator optimization. Fetch the three cell
attributes with individual node.attribute('t'/'s'/'r') C calls instead of
building a per-cell attribute_hash and indexing it. With hundreds of
thousands of cells the per-cell Hash allocation dominates, so three cheap
lookups are both faster and leaner.
Same 20,000-row x 40-col benchmark, vs the attribute_hash version:
| Cell-attr approach | Median | Allocations |
|------------------------|---------|-------------|
| node.attribute_hash | ~1.10s | 9,226,711 |
| node.attribute() x3 | ~1.08s | 7,626,711 |
=> ~3-4% faster, 1.6M (~17%) fewer allocations (deterministic: ~640K
cells x the per-cell hash no longer built).
Output byte-identical across rows, simple_rows, and rows_with_meta_data on
both default- and prefixed-namespace files. Suite green (47 examples).
The row-opener node keeps attribute_hash (only ~one per row, and it
preserves the exact meta-data row hash).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
After noticing that this method uses a lot of memory/takes a lot of time in the Rails app at work, I used Claude to optimize this and then verify that the output was identical from the released gem vs this change on a number of different files.
This should hopefully help resolve some long-standing issues with the performance/memory usage of the gem.
Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string, numeric, and empty cells). Median full-parse time dropped ~1.72x (1.85s -> ~1.08s) and allocations dropped ~3.75x (28.6M -> 7.63M), with byte-identical output verified on both default- and prefixed-namespace files across
rows,simple_rows, androws_with_meta_data(including the self-closing-row + meta-data path).Changes, in order of impact:
namespace_resolvedflag instead of re-checkingnode.namespaceson every node. Worksheets use a default namespace, so the oldprefix.empty?guard never latched and allocated a namespaces hash for every node in the stream (660,002 calls -> 2). This is the bulk of the wall-clock win.node.name/node.node_typeinto locals (each was read up to 4x per node in the if/elsif chain) and reorder branches so the hottest nodes (<v>/<c>) are tested first.node.attribute('t'/'s'/'r')C calls instead ofnode.attributes.Reader#attributesisattribute_hash.merge(namespaces), so it built and merged a namespaces hash on every call. With hundreds of thousands of cells the per-cell Hash allocation dominates, so three cheap lookups are both faster and leaner than building and indexing a hash.node.attribute_hashfor the row-opener node (only ~one per row): this avoids the namespaces merge thatnode.attributesperforms while preserving the exact meta-data row hash. Safe because xlsx row elements never declare their own namespaces (the namespace hash is empty for them), so the result is identical -- confirmed byte-for-byte including the self-closing-row + meta-data path.fill_in_empty_cells, drop the redundant.to_aon the column range and usedelete_suffixinstead ofgsubto strip the row number.Reader#namespacescallsReader#attributescalls