Skip to content

Optimize Sheet#rows_generator method to use less memory and parse faster#138

Open
connorshea wants to merge 2 commits into
pythonicrubyist:masterfrom
connorshea:optimize-rows-generator
Open

Optimize Sheet#rows_generator method to use less memory and parse faster#138
connorshea wants to merge 2 commits into
pythonicrubyist:masterfrom
connorshea:optimize-rows-generator

Conversation

@connorshea
Copy link
Copy Markdown

@connorshea connorshea commented Jun 5, 2026

After noticing that this method uses a lot of memory/takes a lot of time in the Rails app at work, I used Claude to optimize this and then verify that the output was identical from the released gem vs this change on a number of different files.

This should hopefully help resolve some long-standing issues with the performance/memory usage of the gem.


Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string, numeric, and empty cells). Median full-parse time dropped ~1.72x (1.85s -> ~1.08s) and allocations dropped ~3.75x (28.6M -> 7.63M), with byte-identical output verified on both default- and prefixed-namespace files across rows, simple_rows, and rows_with_meta_data (including the self-closing-row + meta-data path).

Changes, in order of impact:

  • Resolve the namespace prefix once via a namespace_resolved flag instead of re-checking node.namespaces on every node. Worksheets use a default namespace, so the old prefix.empty? guard never latched and allocated a namespaces hash for every node in the stream (660,002 calls -> 2). This is the bulk of the wall-clock win.
  • Hoist node.name/node.node_type into locals (each was read up to 4x per node in the if/elsif chain) and reorder branches so the hottest nodes (<v>/<c>) are tested first.
  • Fetch cell attributes with three individual node.attribute('t'/'s'/'r') C calls instead of node.attributes. Reader#attributes is attribute_hash.merge(namespaces), so it built and merged a namespaces hash on every call. With hundreds of thousands of cells the per-cell Hash allocation dominates, so three cheap lookups are both faster and leaner than building and indexing a hash.
  • Use node.attribute_hash for the row-opener node (only ~one per row): this avoids the namespaces merge that node.attributes performs while preserving the exact meta-data row hash. Safe because xlsx row elements never declare their own namespaces (the namespace hash is empty for them), so the result is identical -- confirmed byte-for-byte including the self-closing-row + meta-data path.
  • In fill_in_empty_cells, drop the redundant .to_a on the column range and use delete_suffix instead of gsub to strip the row number.

Metric Baseline Optimized Change
Median full-parse time 1.853s ~1.08s ~1.72× faster
Min full-parse time 1.827s ~1.04s ~1.76× faster
Allocations per parse 28.6M 7.63M ~3.75× fewer
Reader#namespaces calls 660,002 2 ~330,000× fewer
Reader#attributes calls 660,000 0 eliminated
Stage Median Allocations
Baseline 1.853s 28.6M
+ namespace flag, hoist, reorder ~1.11s ~9.8M
+ attribute_hash (rows) ~1.10s 9.23M
+ node.attribute() per cell attr ~1.08s 7.63M
Total ~1.08s 7.63M

connorshea and others added 2 commits June 5, 2026 16:01
Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string,
numeric, and empty cells). Median full-parse time dropped ~1.68x
(1.85s -> 1.11s) and allocations dropped ~3x (28.6M -> ~9M), with
byte-identical output verified on both default- and prefixed-namespace
files across rows, simple_rows, and rows_with_meta_data.

Changes, in order of impact:

- Resolve the namespace prefix once via a `namespace_resolved` flag
  instead of re-checking `node.namespaces` on every node. Worksheets use
  a default namespace, so the old `prefix.empty?` guard never latched and
  allocated a namespaces hash for every node in the stream. This is the
  bulk of the wall-clock win.
- Hoist `node.name`/`node.node_type` into locals (each was read up to 4x
  per node in the if/elsif chain) and reorder branches so the hottest
  nodes (<v>/<c>) are tested first.
- Use `node.attribute_hash` instead of `node.attributes` for cell and row
  nodes. `Reader#attributes` is `attribute_hash.merge(namespaces)`, so it
  built and merged a namespaces hash on every call; we only need the
  element's own attributes. This is an allocation/GC win (drops Hash#merge
  and the per-node namespaces hash) with negligible wall-clock change.
  Safe because xlsx row/cell elements never declare their own namespaces
  (the namespace hash is empty for them), so the result is identical --
  confirmed byte-for-byte including the self-closing-row + meta-data path.
- In fill_in_empty_cells, drop the redundant `.to_a` on the column range
  and use `delete_suffix` instead of `gsub` to strip the row number.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to the rows_generator optimization. Fetch the three cell
attributes with individual node.attribute('t'/'s'/'r') C calls instead of
building a per-cell attribute_hash and indexing it. With hundreds of
thousands of cells the per-cell Hash allocation dominates, so three cheap
lookups are both faster and leaner.

Same 20,000-row x 40-col benchmark, vs the attribute_hash version:

  | Cell-attr approach     | Median  | Allocations |
  |------------------------|---------|-------------|
  | node.attribute_hash    | ~1.10s  | 9,226,711   |
  | node.attribute() x3    | ~1.08s  | 7,626,711   |

  => ~3-4% faster, 1.6M (~17%) fewer allocations (deterministic: ~640K
     cells x the per-cell hash no longer built).

Output byte-identical across rows, simple_rows, and rows_with_meta_data on
both default- and prefixed-namespace files. Suite green (47 examples).
The row-opener node keeps attribute_hash (only ~one per row, and it
preserves the exact meta-data row hash).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant