Skip to content

fix(bdata): stop embedding call stack and source code in saved headers#140

Merged
KenyaOtsuka merged 3 commits into
KamitaniLab:devfrom
KenyaOtsuka:chore/remove-callstack-header
Jul 3, 2026
Merged

fix(bdata): stop embedding call stack and source code in saved headers#140
KenyaOtsuka merged 3 commits into
KamitaniLab:devfrom
KenyaOtsuka:chore/remove-callstack-header

Conversation

@KenyaOtsuka

@KenyaOtsuka KenyaOtsuka commented Jun 26, 2026

Copy link
Copy Markdown

Summary

BData.save() previously embedded callstack and callstack_code in every saved
.h5 file's /header. These fields include absolute file paths, line numbers,
and the full source code of files on the call stack, which can leak internal
paths or unpublished code when BData files are shared.

This PR stops writing those fields to saved files by default while keeping
ctime and ctime_epoch.

Closes #131.

Changes

  • Remove the inspect-based call-stack/source-code collection from save().

  • Omit legacy callstack / callstack_code fields from the header written to
    saved files, so old files become clean after load→save.

  • Leave the in-memory BData.header unchanged when omitting those legacy fields
    from the saved file.

  • Emit a UserWarning when legacy fields are present in a header being written
    and are omitted from the saved file.

  • Keep user-set header fields, except that legacy callstack / callstack_code
    are not written to saved files.

Behavior change

Newly saved files no longer contain callstack or callstack_code. Re-saving
older files writes a clean output file without those fields and emits a warning,
but does not remove them from the in-memory BData.header.

Testing

  • Added tests for new saves, legacy header omission, and preserving in-memory
    headers.

  • pytest tests/bdata/test_bdata.py → 22 passed.

BData.save() previously recorded the full call stack (absolute file
paths) and the entire source code of every file on it into the saved
file's header. This is undocumented and can leak internal paths and
unpublished code when BData files are shared or published.

Stop collecting `callstack` / `callstack_code`, keeping only `ctime` /
`ctime_epoch`. To avoid re-writing legacy fields restored on load, strip
any `callstack` / `callstack_code` from the header on save and emit a
UserWarning so the change is visible.

Closes KamitaniLab#131.
@KenyaOtsuka KenyaOtsuka requested a review from Copilot June 26, 2026 01:27

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@micchu micchu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing this.
I believe the main code itself looks fine, but I had some questions about the behavior of the test code, so I left a comment.
I would appreciate it if you could take a look.

Comment thread tests/bdata/test_bdata.py Outdated

# Legacy fields are stripped both in memory and in the saved file,
# while intentionally set header fields survive.
self.assertNotIn('callstack', bdata.header)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past, I have saved BData in the following way in order to avoid saving the header.

bdata_obj._BData__save_h5(save_filepath, header=None)

If __save_h5 is called with header=None, is there any problem with how this line behaves?

Since directly calling a private function is somewhat hacky behavior, I think it would be reasonable to treat that kind of usage as outside the scope of support. On the other hand, it also seems to me that this could be considered a usage pattern that is possible under the specification of __save_h5 itself.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this path. Calling _BData__save_h5(..., header=None) still writes no /header group at all, so this usage should continue to work as before for writing an HDF5 file without any header. In particular, it does not write ctime, ctime_epoch, callstack, or callstack_code.

This is different from the public BData.save() path. BData.save() still writes a header containing ctime and ctime_epoch, but it no longer writes callstack or callstack_code.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misunderstood “this line” as referring to

bdata_obj._BData__save_h5(save_filepath, header=None)

rather than line 298 of the test.

To answer your question directly: with the current implementation, this assertion would not hold if _BData__save_h5(..., header=None) is called directly.

self.assertNotIn('callstack', bdata.header)

In that private path, header=None only means that no /header group is written to the output file. It does not remove callstack / callstack_code from the in-memory bdata.header.

The assertion currently holds only for the public BData.save() path, because the legacy call-stack fields are stripped in save() before calling __save_h5.

I will update this with a small refactor: move the legacy call-stack stripping logic into __save_h5. With that change, direct calls to _BData__save_h5(..., header=None) will still write no /header group, but will also remove callstack / callstack_code from the in-memory bdata.header.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in d414b1d.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the call-stack legacy-field stripping out of save() and into
__save_h5() so it runs on every save path. Previously the documented
private workaround `_BData__save_h5(path, header=None)` wrote no header
group but left `callstack` / `callstack_code` in the in-memory header.
Now that path also strips them (with the same UserWarning), keeping the
in-memory header consistent regardless of how the data is saved.
@KenyaOtsuka KenyaOtsuka force-pushed the chore/remove-callstack-header branch from 7c48f7f to d414b1d Compare June 26, 2026 09:30
@KenyaOtsuka KenyaOtsuka marked this pull request as ready for review June 26, 2026 10:13

@ganow ganow left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much! Mostly LGTM. I've left one comment about the behavior of the save function

Comment thread bdpy/bdata/bdata.py Outdated
Comment on lines +896 to +905
legacy_keys = [k for k in ('callstack', 'callstack_code') if k in self.__header]
if legacy_keys:
warnings.warn(
"Removing legacy header field(s) %s on save for privacy; "
"they will not be written to the file." % ', '.join(legacy_keys),
UserWarning,
stacklevel=3,
)
for k in legacy_keys:
del self.__header[k]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a privacy perspective, I think it is desirable for data saved in the future to automatically drop header information contained in the callstack and callstack_code keys.

On the other hand, I think we should avoid making a function named save erase data held in memory, as that behavior is counterintuitive. Any side effects of a function named save should be limited to data on storage, and the function should not modify data that is currently held in memory.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment. I agree.

In 45df2d1, I changed this so that legacy callstack / callstack_code fields are omitted only from the saved file, while the in-memory header is left unchanged. I also updated the warning wording and tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

…mory

Previously the legacy call-stack stripping deleted `callstack` /
`callstack_code` from the in-memory header (`self.__header`). Issue KamitaniLab#131
only asks to stop *writing* those fields to files, and silently
destroying in-memory data the user holds is an over-reach.

Filter the legacy fields out of a copy used for writing instead, leaving
both `self.__header` and the passed-in `header` dict untouched. The
`header=None` path writes no header group and emits no warning. A
UserWarning is raised only when legacy fields are actually omitted from a
written header.
@KenyaOtsuka

Copy link
Copy Markdown
Author

@micchu @ganow

Thank you for reviewing this PR!

I believe I have addressed all the points raised. If everything looks good to you, could you please approve it?

@ganow ganow left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you very much!

@KenyaOtsuka KenyaOtsuka merged commit 21563e8 into KamitaniLab:dev Jul 3, 2026
3 of 6 checks passed
@KenyaOtsuka KenyaOtsuka deleted the chore/remove-callstack-header branch July 3, 2026 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants