Register SyncBatchNorm as quantization module#1491
Conversation
Signed-off-by: Bryce Ferenczi <bryce.ferenczi@Arkeus.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR adds Changesnn.SyncBatchNorm Quantization Support
🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
@5had3z thanks for the PR. Could you share some background on the use cases for BatchNorm quantization? |
|
@meenchen EDIT: Sorry, I thought I had comments of where the problems occur in the example code. Maybe I tweaked something in the test script, re-copied it over and was missing the comments. EDIT2: Ahh no yes I have comments in the first block, just not the second that is the full end-to-end test code. |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Small, well-scoped bug fix that registers nn.SyncBatchNorm analogously to the existing BatchNorm1d/2d/3d registrations. The two YAML configs that enumerate BatchNorm variants are both updated consistently, and the existing parameterized batchnorm tests are extended to cover SyncBatchNorm across test_no_quant, test_fake_quant_per_tensor, and test_fake_quant_per_channel. Explicit registration is necessary since QuantModuleRegistry uses exact class matching. The PR body documents a clear repro and notes the back-compat implication for checkpoints saved before this fix.
meenchen
left a comment
There was a problem hiding this comment.
Correctness — Incomplete Coverage of Exclusion Lists
[BLOCKER] SyncBatchNorm not added to _default_disabled_quantizer_cfg in config.py
The hardcoded list at modelopt/torch/quantization/config.py:211-213 excludes nn.BatchNorm{1,2,3}d but not nn.SyncBatchNorm. This list is used by FP8_DEFAULT_CFG, INT8_DEFAULT_CFG, and other built-in *_CFG dicts. After this PR, any user using these configs with a SyncBN-containing model will silently have SyncBN layers quantized — changing behavior for non-DDP users too.
[BLOCKER] 5 general PTQ recipes still have inline BatchNorm exclusions without SyncBatchNorm
The PR updates the $import unit (default_disabled_quantizers.yaml) and one model-specific recipe (Step3.5-Flash). But these recipes have inline exclusion lists that don't yet use the unit:
modelopt_recipes/general/ptq/fp8_default-fp8_kv.ymlmodelopt_recipes/general/ptq/nvfp4_default-fp8_kv.ymlmodelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.ymlmodelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.ymlmodelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml
Each needs a - parent_class: 'nn.SyncBatchNorm' entry. Without these updates, SyncBN layers will be quantized when these recipes are used — inconsistent behavior depending on recipe choice.
All of these configs should have imported the changes I have already made in imports:
default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers |
What does this PR do?
Type of change: Bug fix
Registers
nn.SyncBatchNormlayer for quantization. If a model is setup for distributed training before PTQ, none of the SyncBatchNorm layers are recognised and quantized. On loading of a checkpoint there is now a mismatch between the modelopt state of a model that hasn't had DDP/SyncBN applied to it and the checkpoint trained with DDP/SyncBN.Performing PTQ and then applying DDP/SyncBN for QAT works fine, but considering that unwrapping DDP is handled properly for either ordering of the steps, SyncBN conversion should be able to be performed in either order as well.
Usage
Testing
Added
nn.SyncBatchNormto the quantization tests where other BatchNorm layers appear..Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅Additional Information
Code for testing issue, run with
python3 script.pyortorchrun --nproc-per-node=2 script.py.Summary by CodeRabbit
New Features
Tests
Chores