Optimize `Sheet#rows_generator` method to use less memory and parse faster by connorshea · Pull Request #138 · pythonicrubyist/creek

connorshea · 2026-06-05T22:10:31Z

After noticing that this method uses a lot of memory/takes a lot of time in the Rails app at work, I used Claude to optimize this and then verify that the output was identical from the released gem vs this change on a number of different files.

This should hopefully help resolve some long-standing issues with the performance/memory usage of the gem.

Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string, numeric, and empty cells). Median full-parse time dropped ~1.72x (1.85s -> ~1.08s) and allocations dropped ~3.75x (28.6M -> 7.63M), with byte-identical output verified on both default- and prefixed-namespace files across rows, simple_rows, and rows_with_meta_data (including the self-closing-row + meta-data path).

Changes, in order of impact:

Resolve the namespace prefix once via a namespace_resolved flag instead of re-checking node.namespaces on every node. Worksheets use a default namespace, so the old prefix.empty? guard never latched and allocated a namespaces hash for every node in the stream (660,002 calls -> 2). This is the bulk of the wall-clock win.
Hoist node.name/node.node_type into locals (each was read up to 4x per node in the if/elsif chain) and reorder branches so the hottest nodes (<v>/<c>) are tested first.
Fetch cell attributes with three individual node.attribute('t'/'s'/'r') C calls instead of node.attributes. Reader#attributes is attribute_hash.merge(namespaces), so it built and merged a namespaces hash on every call. With hundreds of thousands of cells the per-cell Hash allocation dominates, so three cheap lookups are both faster and leaner than building and indexing a hash.
Use node.attribute_hash for the row-opener node (only ~one per row): this avoids the namespaces merge that node.attributes performs while preserving the exact meta-data row hash. Safe because xlsx row elements never declare their own namespaces (the namespace hash is empty for them), so the result is identical -- confirmed byte-for-byte including the self-closing-row + meta-data path.
In fill_in_empty_cells, drop the redundant .to_a on the column range and use delete_suffix instead of gsub to strip the row number.

Metric	Baseline	Optimized	Change
Median full-parse time	1.853s	~1.08s	~1.72× faster
Min full-parse time	1.827s	~1.04s	~1.76× faster
Allocations per parse	28.6M	7.63M	~3.75× fewer
`Reader#namespaces` calls	660,002	2	~330,000× fewer
`Reader#attributes` calls	660,000	0	eliminated

Stage	Median	Allocations
Baseline	1.853s	28.6M
+ namespace flag, hoist, reorder	~1.11s	~9.8M
+ attribute_hash (rows)	~1.10s	9.23M
+ node.attribute() per cell attr	~1.08s	7.63M
Total	~1.08s	7.63M

Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string, numeric, and empty cells). Median full-parse time dropped ~1.68x (1.85s -> 1.11s) and allocations dropped ~3x (28.6M -> ~9M), with byte-identical output verified on both default- and prefixed-namespace files across rows, simple_rows, and rows_with_meta_data. Changes, in order of impact: - Resolve the namespace prefix once via a `namespace_resolved` flag instead of re-checking `node.namespaces` on every node. Worksheets use a default namespace, so the old `prefix.empty?` guard never latched and allocated a namespaces hash for every node in the stream. This is the bulk of the wall-clock win. - Hoist `node.name`/`node.node_type` into locals (each was read up to 4x per node in the if/elsif chain) and reorder branches so the hottest nodes (<v>/<c>) are tested first. - Use `node.attribute_hash` instead of `node.attributes` for cell and row nodes. `Reader#attributes` is `attribute_hash.merge(namespaces)`, so it built and merged a namespaces hash on every call; we only need the element's own attributes. This is an allocation/GC win (drops Hash#merge and the per-node namespaces hash) with negligible wall-clock change. Safe because xlsx row/cell elements never declare their own namespaces (the namespace hash is empty for them), so the result is identical -- confirmed byte-for-byte including the self-closing-row + meta-data path. - In fill_in_empty_cells, drop the redundant `.to_a` on the column range and use `delete_suffix` instead of `gsub` to strip the row number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Follow-up to the rows_generator optimization. Fetch the three cell attributes with individual node.attribute('t'/'s'/'r') C calls instead of building a per-cell attribute_hash and indexing it. With hundreds of thousands of cells the per-cell Hash allocation dominates, so three cheap lookups are both faster and leaner. Same 20,000-row x 40-col benchmark, vs the attribute_hash version: | Cell-attr approach | Median | Allocations | |------------------------|---------|-------------| | node.attribute_hash | ~1.10s | 9,226,711 | | node.attribute() x3 | ~1.08s | 7,626,711 | => ~3-4% faster, 1.6M (~17%) fewer allocations (deterministic: ~640K cells x the per-cell hash no longer built). Output byte-identical across rows, simple_rows, and rows_with_meta_data on both default- and prefixed-namespace files. Suite green (47 examples). The row-opener node keeps attribute_hash (only ~one per row, and it preserves the exact meta-data row hash). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

connorshea and others added 2 commits June 5, 2026 16:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `Sheet#rows_generator` method to use less memory and parse faster#138

Optimize `Sheet#rows_generator` method to use less memory and parse faster#138
connorshea wants to merge 2 commits into
pythonicrubyist:masterfrom
connorshea:optimize-rows-generator

connorshea commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

connorshea commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

connorshea commented Jun 5, 2026 •

edited

Loading