
perf(lava): WAL journaling for the LAVA database #285

Open
OneSixForensics wants to merge 3 commits into abrignoni:main from OneSixForensics:lava-wal-perf


@OneSixForensics OneSixForensics commented Jun 25, 2026

Contributor

perf(lava): WAL journaling for the LAVA database

_lava_artifacts.db is opened with no journal/synchronous pragmas, i.e. the default rollback journal + synchronous=FULL. The media insert helpers (lava_insert_sqlite_media_item / lava_insert_sqlite_media_references) commit() per row, so a media-heavy artifact (e.g. a device backup with tens of thousands of files) triggers tens of thousands of fsync'd commits — each fsyncing the whole database. On such artifacts the run can take many minutes and looks like a hang.

This sets journal_mode=WAL + synchronous=NORMAL right after sqlite3.connect. Same effective durability for a single tool run (only a power loss during the final checkpoint window could lose the tail), and dramatically faster.

Benchmark

~34,000 per-row commits (≈ a 17k-file media artifact: one media item + one reference each):

| mode | time |
| --- | --- |
| default (rollback, synchronous=FULL) | ~52 s per 8k commits (≈ 220 s) |
| WAL + synchronous=NORMAL | ~0.4 s per 8k commits (≈ 2 s) |

~125× on the commit workload. Benefits every module that registers media, not just one parser.
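For anyone who wants to reproduce the shape of this comparison, here is a minimal timing sketch. It is not the harness used for the numbers above: the `media` table, the `time_per_row_commits` helper, and the file name `bench.db` are made up for illustration, and absolute timings will vary widely with the disk and filesystem.

```python
import os
import sqlite3
import tempfile
import time


def time_per_row_commits(n_rows, use_wal):
    """Insert n_rows with one commit() per row; return (seconds, journal_mode).

    Hypothetical helper for illustration only; it is not part of LEAPP.
    """
    with tempfile.TemporaryDirectory() as tmp:
        db = sqlite3.connect(os.path.join(tmp, "bench.db"))
        try:
            if use_wal:
                db.execute("PRAGMA journal_mode=WAL")
                db.execute("PRAGMA synchronous=NORMAL")
            # Read back the mode SQLite is actually using.
            mode = db.execute("PRAGMA journal_mode").fetchone()[0]
            db.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, path TEXT)")
            db.commit()

            start = time.perf_counter()
            for i in range(n_rows):
                db.execute("INSERT INTO media (path) VALUES (?)", (f"file_{i}.jpg",))
                db.commit()  # one commit per row, the pattern described above
            elapsed = time.perf_counter() - start
        finally:
            db.close()  # close before the temporary directory is removed
    return elapsed, mode
```

Calling `time_per_row_commits(8000, use_wal=False)` and then `time_per_row_commits(8000, use_wal=True)` on a local disk should show the same direction of difference as the table, though not necessarily the same ratio.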

Found while bringing a 17k-file device-backup artifact into LAVA (companion Synchronoss parser PR).

@stark4n6

Collaborator

@OneSixForensics can you resync your fork, there were some base code updates to catch up to the other LEAPP projects.

The media insert helpers (lava_insert_sqlite_media_item /
lava_insert_sqlite_media_references) commit() per row, so a media-heavy
artifact (e.g. a device backup of tens of thousands of files) triggers tens
of thousands of fsync'd commits. The db was opened with default rollback
journal + synchronous=FULL, so each commit fsyncs the whole database and the
run can take many minutes (looks like a hang).

Set journal_mode=WAL + synchronous=NORMAL right after connect. Same
durability for a tool run; ~125x faster on a 34k-commit workload in testing
(52s -> 0.4s per 8k commits). Benefits every module, not just one parser.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@OneSixForensics

Contributor Author

Thanks @stark4n6 — done. Resynced the fork onto current main and rebased all three PR branches (#284, #285, #287); all show MERGEABLE/CLEAN now.

Heads up from the resync: the base-code update moved file resolution into Context.get_source_file_path(), which verifies candidates with Path.match (glob). That still drops real filenames containing glob metacharacters like IMG_0347[1].jpg — so I folded a small fix for it into #284 (closes #286), and reworked the Synchronoss/Kik parsers to use the stock check_in_media rather than a local helper. Re-validated both against the new base.
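To illustrate the glob-metacharacter problem in isolation, here is a small sketch using only the standard library. It is not the fix that went into #284; `matches_literally` is a hypothetical helper and the `DCIM/100APPLE` directory is an invented example path. It only shows why an unescaped filename used as a `Path.match` pattern fails to match itself, and one possible way around that with `glob.escape`.

```python
import glob
from pathlib import PurePosixPath


def matches_literally(candidate, filename):
    """Return True if candidate ends with a component named exactly filename.

    glob.escape wraps *, ? and [ in brackets so they are matched literally.
    """
    return PurePosixPath(candidate).match(glob.escape(filename))


name = "IMG_0347[1].jpg"
candidate = "DCIM/100APPLE/IMG_0347[1].jpg"

# Unescaped, "[1]" is a character class that matches the single character "1",
# so the pattern matches IMG_03471.jpg rather than the real filename.
unescaped_hit = PurePosixPath(candidate).match(name)
wrong_file_hit = PurePosixPath("DCIM/100APPLE/IMG_03471.jpg").match(name)

# Escaped, the brackets are treated as ordinary characters.
escaped_hit = matches_literally(candidate, name)
```

Here `unescaped_hit` is `False`, `wrong_file_hit` is `True`, and `escaped_hit` is `True`.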

@JamesHabben

Collaborator

my biggest concern is around LEAPP output going directly to network drives.

WAL does not work over a network filesystem.
https://www.sqlite.org/wal.html

WAL isn't guaranteed to fail, as much as that statement suggests, but there is definitely a risk of it. this can cause corrupt files or even crash LEAPP. FULL seems to be the most compatible mode when writing to network storage. we have had complaints in the past about sqlite issues on network storage.

at the very least, I think we should test the journal mode after setting before switching sync mode. this doesn't truly detect network paths, but provides a little bit of safety. something along these lines:

mode = lava_db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
if mode.lower() == "wal":
    lava_db.execute("PRAGMA synchronous=NORMAL")

it would be better to implement a network path detection routine (can we?) that can be used as part of the decision before enabling this mode. open to thoughts.
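A runnable version of the check above, as a rough sketch. `enable_wal_if_supported` is a made-up name, not an existing LEAPP function. An in-memory database is used as a stand-in for "SQLite declined the request", since SQLite ignores a WAL request on in-memory databases and reports `memory` instead. As noted, a successful switch says nothing about whether the path is on network storage.

```python
import os
import sqlite3
import tempfile


def enable_wal_if_supported(db):
    """Request WAL; relax synchronous only if SQLite reports WAL is active.

    Hypothetical helper. Returns True when WAL took effect, False otherwise.
    """
    mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        return False  # request declined: leave synchronous untouched
    db.execute("PRAGMA synchronous=NORMAL")
    return True


def demo():
    """Return (file_db_result, memory_db_result) for a quick comparison."""
    with tempfile.TemporaryDirectory() as tmp:
        file_db = sqlite3.connect(os.path.join(tmp, "demo.db"))
        try:
            on_file = enable_wal_if_supported(file_db)
        finally:
            file_db.close()
    mem_db = sqlite3.connect(":memory:")
    try:
        on_memory = enable_wal_if_supported(mem_db)
    finally:
        mem_db.close()
    return on_file, on_memory
```

On a normal local disk `demo()` should return `(True, False)`.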

OneSixForensics and others added 2 commits June 25, 2026 17:31
…aths

WAL is not safe over a network filesystem (https://www.sqlite.org/wal.html),
and examiners commonly write LEAPP output straight to NAS/mapped drives, where
unconditional WAL risks DB corruption or a crashed run.

Add scripts/storage_safety.py with best-effort network-path detection
(Windows GetDriveType + UNC; Linux /proc/mounts fstype) and have initialize_lava
enable WAL + synchronous=NORMAL only when output_path is confirmed local. On
network or undetermined paths it stays on the network-safe rollback journal.
After requesting WAL it verifies the mode actually took effect before tuning
synchronous, per review feedback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LEAPP prepends the Windows extended-length prefix \\?\ to every drive-letter
output path (rleapp.py), so initialize_lava receives e.g. \\?\X:\Reports\...
_is_unc_path keyed on a leading \\, so it misclassified that local path as a
UNC network share and disabled WAL on a local drive (the slow path).

Strip the extended-length prefix before the UNC test:
  \\?\UNC\srv\share -> \\srv\share  (still network)
  \\?\X:\dir        -> X:\dir        (local; GetDriveType decides)
Found while validating WAL gating against a real ICAC return written to X:.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@OneSixForensics

Contributor Author

Thanks @JamesHabben — good catch, and you're right to flag it. I went with the network-path detection route you floated rather than just the sync-mode guard.

Added scripts/storage_safety.py: it determines whether the LAVA DB's output path is local or network-backed (Windows: GetDriveType on the drive root + UNC \\server\share detection; Linux: longest-match mount in /proc/mounts against a known set of network fstypes — nfs/cifs/smbfs/sshfs/etc.). initialize_lava then enables WAL + synchronous=NORMAL only when the path is affirmatively local. On a network path — or anything we can't classify with confidence — it stays on the default rollback journal, which is safe everywhere. I also kept your backstop: after requesting WAL it verifies the journal mode actually took effect before touching synchronous.
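For reviewers who want to see the Linux side in miniature, here is a rough sketch of the longest-match idea. It is not the code in scripts/storage_safety.py: the function names and the fstype set are assumptions for illustration, and it takes the mounts table as text so it can be exercised without reading /proc/mounts. Unlike the PR, this sketch treats any fstype outside the set as local, which is less conservative.

```python
import posixpath

# Assumed set of network filesystem types; the PR's actual list may differ.
NETWORK_FSTYPES = {"nfs", "nfs4", "cifs", "smbfs", "smb3", "sshfs", "fuse.sshfs"}


def _unescape_mount_field(field):
    """Decode the octal escapes /proc/mounts uses for space, tab, newline, backslash."""
    for escaped, char in (("\\040", " "), ("\\011", "\t"),
                          ("\\012", "\n"), ("\\134", "\\")):
        field = field.replace(escaped, char)
    return field


def fstype_for_path(path, mounts_text):
    """Return the fstype of the longest mount point containing path, or None."""
    target = posixpath.normpath(path)  # callers should pass an absolute path
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point = _unescape_mount_field(fields[1])
        # Compare on whole components so /mnt/nas does not claim /mnt/nasty.
        prefix = mount_point.rstrip("/") + "/"
        if (target + "/").startswith(prefix) and len(mount_point) >= best_len:
            # ">=" lets a later mount on the same point override an earlier one.
            best_len, best_type = len(mount_point), fields[2]
    return best_type


def is_confirmed_local(path, mounts_text):
    """True only when a mount was found and its fstype is not a known network type."""
    fstype = fstype_for_path(path, mounts_text)
    if fstype is None:
        return False  # undetermined: do not claim the path is local
    return fstype.lower() not in NETWORK_FSTYPES
```

In real use the second argument would be the contents of /proc/mounts, and the path would be resolved to an absolute path first.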

Validated against a real ~32 GB provider return, same input, two output targets:

| Output | Journal | Result | Wall time |
| --- | --- | --- | --- |
| Network share (SMB/NAS) | WAL declined → rollback journal | completed, no corruption | 2h13m (I/O-bound copying ~17k media files, not DB-bound) |
| Local disk | WAL enabled | completed | 2m31s (~53× faster) |

One thing that fell out of testing worth a separate look: LEAPP prepends the Windows extended-length prefix \\?\ to every drive-letter output path (rleapp.py), so initialize_lava actually receives \\?\X:\.... My first cut keyed UNC detection on a leading \\ and so misread that local path as a network share — which would have silently denied WAL to every local-drive run. Fixed by normalizing the \\?\ (and \\?\UNC\) prefix before the UNC test. Both commits are pushed.
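The normalization can be sketched roughly as follows. This is an illustration of the idea rather than the committed code: `strip_extended_prefix` and `looks_like_unc` are hypothetical names, and only the two prefix forms mentioned above are handled.

```python
EXTENDED_UNC_PREFIX = "\\\\?\\UNC\\"  # the literal text \\?\UNC\
EXTENDED_PREFIX = "\\\\?\\"           # the literal text \\?\


def strip_extended_prefix(path):
    """Drop a Windows extended-length prefix, keeping UNC paths recognizable."""
    if path.upper().startswith(EXTENDED_UNC_PREFIX):
        # \\?\UNC\srv\share -> \\srv\share
        return "\\\\" + path[len(EXTENDED_UNC_PREFIX):]
    if path.startswith(EXTENDED_PREFIX):
        # \\?\X:\dir -> X:\dir
        return path[len(EXTENDED_PREFIX):]
    return path


def looks_like_unc(path):
    """True if the path, once normalized, starts with two backslashes."""
    normalized = strip_extended_prefix(path).replace("/", "\\")
    return normalized.startswith("\\\\")
```

With this ordering, `\\?\X:\Reports\case` is no longer reported as UNC, while `\\?\UNC\srv\share` and plain `\\srv\share` still are. A drive-letter result would then go on to the GetDriveType check, which this sketch does not cover.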

Happy to tweak the network fstype list or the detection heuristics if you'd rather be more/less conservative.

3 participants