fix(bdata): stop embedding call stack and source code in saved headers#140
Conversation
BData.save() previously recorded the full call stack (absolute file paths) and the entire source code of every file on it into the saved file's header. This is undocumented and can leak internal paths and unpublished code when BData files are shared or published. Stop collecting `callstack` / `callstack_code`, keeping only `ctime` / `ctime_epoch`. To avoid re-writing legacy fields restored on load, strip any `callstack` / `callstack_code` from the header on save and emit a UserWarning so the change is visible. Closes KamitaniLab#131.
micchu
left a comment
There was a problem hiding this comment.
Thank you for addressing this.
I believe the main code itself looks fine, but I had some questions about the behavior of the test code, so I left a comment.
I would appreciate it if you could take a look.
|
|
||
| # Legacy fields are stripped both in memory and in the saved file, | ||
| # while intentionally set header fields survive. | ||
| self.assertNotIn('callstack', bdata.header) |
There was a problem hiding this comment.
In the past, I have saved BData in the following way in order to avoid saving the header.
bdata_obj._BData__save_h5(save_filepath, header=None)
If __save_h5 is called with header=None, is there any problem with how this line behaves?
Since directly calling a private function is somewhat hacky behavior, I think it would be reasonable to treat that kind of usage as outside the scope of support. On the other hand, it also seems to me that this could be considered a usage pattern that is possible under the specification of __save_h5 itself.
There was a problem hiding this comment.
I checked this path. Calling _BData__save_h5(..., header=None) still writes no /header group at all, so this usage should continue to work as before for writing an HDF5 file without any header. In particular, it does not write ctime, ctime_epoch, callstack, or callstack_code.
This is different from the public BData.save() path. BData.save() still writes a header containing ctime and ctime_epoch, but it no longer writes callstack or callstack_code.
There was a problem hiding this comment.
Sorry, I misunderstood “this line” as referring to
bdata_obj._BData__save_h5(save_filepath, header=None)rather than line 298 of the test.
To answer your question directly: with the current implementation, this assertion would not hold if _BData__save_h5(..., header=None) is called directly.
self.assertNotIn('callstack', bdata.header)In that private path, header=None only means that no /header group is written to the output file. It does not remove callstack / callstack_code from the in-memory bdata.header.
The assertion currently holds only for the public BData.save() path, because the legacy call-stack fields are stripped in save() before calling __save_h5.
I will update this with a small refactor: move the legacy call-stack stripping logic into __save_h5. With that change, direct calls to _BData__save_h5(..., header=None) will still write no /header group, but will also remove callstack / callstack_code from the in-memory bdata.header.
There was a problem hiding this comment.
Move the call-stack legacy-field stripping out of save() and into __save_h5() so it runs on every save path. Previously the documented private workaround `_BData__save_h5(path, header=None)` wrote no header group but left `callstack` / `callstack_code` in the in-memory header. Now that path also strips them (with the same UserWarning), keeping the in-memory header consistent regardless of how the data is saved.
7c48f7f to
d414b1d
Compare
ganow
left a comment
There was a problem hiding this comment.
Thank you very much! Mostly LGTM. I've left one comment about the behavior of the save function
| legacy_keys = [k for k in ('callstack', 'callstack_code') if k in self.__header] | ||
| if legacy_keys: | ||
| warnings.warn( | ||
| "Removing legacy header field(s) %s on save for privacy; " | ||
| "they will not be written to the file." % ', '.join(legacy_keys), | ||
| UserWarning, | ||
| stacklevel=3, | ||
| ) | ||
| for k in legacy_keys: | ||
| del self.__header[k] |
There was a problem hiding this comment.
From a privacy perspective, I think it is desirable for data saved in the future to automatically drop header information contained in the callstack and callstack_code keys.
On the other hand, I think we should avoid making a function named save erase data held in memory, as that behavior is counterintuitive. Any side effects of a function named save should be limited to data on storage, and the function should not modify data that is currently held in memory.
There was a problem hiding this comment.
Thank you for the comment. I agree.
In 45df2d1, I changed this so that legacy callstack / callstack_code fields are omitted only from the saved file, while the in-memory header is left unchanged. I also updated the warning wording and tests.
…mory Previously the legacy call-stack stripping deleted `callstack` / `callstack_code` from the in-memory header (`self.__header`). Issue KamitaniLab#131 only asks to stop *writing* those fields to files, and silently destroying in-memory data the user holds is an over-reach. Filter the legacy fields out of a copy used for writing instead, leaving both `self.__header` and the passed-in `header` dict untouched. The `header=None` path writes no header group and emits no warning. A UserWarning is raised only when legacy fields are actually omitted from a written header.
ganow
left a comment
There was a problem hiding this comment.
LGTM! Thank you very much!
Summary
BData.save()previously embeddedcallstackandcallstack_codein every saved.h5file's/header. These fields include absolute file paths, line numbers,and the full source code of files on the call stack, which can leak internal
paths or unpublished code when BData files are shared.
This PR stops writing those fields to saved files by default while keeping
ctimeandctime_epoch.Closes #131.
Changes
Remove the
inspect-based call-stack/source-code collection fromsave().Omit legacy
callstack/callstack_codefields from the header written tosaved files, so old files become clean after load→save.
Leave the in-memory
BData.headerunchanged when omitting those legacy fieldsfrom the saved file.
Emit a
UserWarningwhen legacy fields are present in a header being writtenand are omitted from the saved file.
Keep user-set header fields, except that legacy
callstack/callstack_codeare not written to saved files.
Behavior change
Newly saved files no longer contain
callstackorcallstack_code. Re-savingolder files writes a clean output file without those fields and emits a warning,
but does not remove them from the in-memory
BData.header.Testing
Added tests for new saves, legacy header omission, and preserving in-memory
headers.
pytest tests/bdata/test_bdata.py→ 22 passed.