Reduce DEME Jitify startup overhead #66
Open
yvrob wants to merge 1 commit into
Summary
This PR reduces DEME startup/Jitify overhead in two places:
Default behavior stays conservative: upstream Jitify header loading is used unless DEME_PERSISTENT_JITIFY_CACHE is set.
Note about the large jitify.hpp diff
The large file in this PR is intentional, but it deserves a warning. Most of that diff is the full NVIDIA Jitify header copied into DEM-Engine as src/jitify/jitify.hpp; the actual change inside it is much smaller and adds an optional persistent source-header cache.
I did this because DEM-Engine currently gets Jitify through a submodule. Patching only the submodule would require either a separate Jitify fork/PR and a submodule pointer update, or a dependency on a local external patch that is harder to review and reproduce. Keeping the header copy in this PR makes the performance fix self-contained and lets DEM-Engine keep the old behavior by default.
That said, if maintainers would prefer a different route, such as carrying this in a Jitify fork, upstreaming it to NVIDIA/Jitify first, applying a smaller local patch during the build, or hiding it behind a different CMake layout, feedback is very welcome. The important part for our use case is to avoid paying the repeated Jitify/NVRTC header-discovery cost every time a fresh DEME process starts.
Persistent cache usage
Automatic per-user cache path:
export DEME_PERSISTENT_JITIFY_CACHE=1
On Linux/WSL this uses:
/tmp/deme_jitify_header_cache_$USER.bin
Explicit cache path:
Disable / default behavior:
The cache follows the CUDA/toolchain setup, not simulation inputs. Changing material properties, geometry, particle counts, or timesteps does not invalidate it. If CUDA versions, include paths, or compiler options change, DEME ignores the old cache file and fills it again during that run.
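The invalidation rule above can be illustrated with a minimal Python sketch. This is not the code in src/jitify/jitify.hpp: the function names, the fingerprint fields, and the use of SHA-256 are all assumptions made for illustration. It only shows the idea that the cache key is derived from toolchain settings, so simulation inputs never enter it.

```python
import hashlib
import json


def toolchain_fingerprint(cuda_version, include_paths, compiler_options):
    """Hash the settings that should invalidate a header cache (hypothetical helper)."""
    payload = json.dumps(
        {
            "cuda_version": cuda_version,
            # Lists keep their order, since include order affects header lookup.
            "include_paths": list(include_paths),
            "compiler_options": list(compiler_options),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_is_usable(stored_fingerprint, cuda_version, include_paths, compiler_options):
    """Reuse a stored cache only if its fingerprint matches the current toolchain."""
    current = toolchain_fingerprint(cuda_version, include_paths, compiler_options)
    return stored_fingerprint == current


# Example values are placeholders, not taken from the PR.
saved = toolchain_fingerprint("12.4", ["/usr/local/cuda/include"], ["-std=c++17"])
same_toolchain = cache_is_usable(saved, "12.4", ["/usr/local/cuda/include"], ["-std=c++17"])
new_cuda = cache_is_usable(saved, "12.6", ["/usr/local/cuda/include"], ["-std=c++17"])
```

In this sketch a mismatch simply means the old file is ignored and a fresh cache is written, which mirrors the behavior described for the run after a CUDA or option change.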
Motivation and observed timings
In profiling on a V100S host, a cold no-mesh DEME micro-startup spent most of its time on host-side Jitify/NVRTC header discovery rather than GPU work.
Observed timings from the profiling run:
Initialize() about 111.8 s.
94.8 s.
Initialize() about 3.1 s.
500,000 spheres initialized in 4.18 s.
Validation
git diff --check
/tmp/deme_pr_cmake_check
coreDEM
GPU smoke validation was also run in the downstream pyDEME environment:
Initialize() about 3.26 s.
N=100: success, Initialize() about 4.2 s.
N=500000: success, Initialize() about 4.18 s.