fix: resolve NPU OOM with default training config #620
Read mask_token_id from draft_config.dflash_config before falling back to tokenizer.mask_token_id or adding a new special token. Apply the same fallback in both train_dflash.py and train_domino.py for consistency. Closes sgl-project#500
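The fallback order described above could be sketched roughly as follows. This is an illustration only, not the code from train_dflash.py or train_domino.py: the helper name resolve_mask_token_id, the placeholder token string, and the assumption that dflash_config is dict-like with a "mask_token_id" key are all hypothetical. The tokenizer is assumed to expose mask_token_id and add_special_tokens in the style of Hugging Face tokenizers.

```python
def resolve_mask_token_id(draft_config, tokenizer, new_mask_token="<|MASK|>"):
    """Hypothetical helper: pick the mask token id in priority order."""
    # 1. Prefer an explicit value in draft_config.dflash_config
    #    (assumed here to be dict-like; the real type may differ).
    dflash_config = getattr(draft_config, "dflash_config", None) or {}
    mask_token_id = dflash_config.get("mask_token_id")
    if mask_token_id is not None:
        return mask_token_id

    # 2. Fall back to the tokenizer's own mask token, if it defines one.
    if getattr(tokenizer, "mask_token_id", None) is not None:
        return tokenizer.mask_token_id

    # 3. Last resort: register a new special token. The token string is a
    #    placeholder chosen for this sketch, not taken from the PR.
    tokenizer.add_special_tokens({"mask_token": new_mask_token})
    return tokenizer.mask_token_id
```

The checks compare against None rather than relying on truthiness, so a mask token id of 0 is still honored. If the third branch adds a token, the model's embedding table may also need resizing; that step is outside this sketch.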
Code Review
This pull request updates the --num-anchors parameter to 186 in both the run_qwen3.5_4b_dflash_online_npu.sh and run_qwen3.5_4b_domino_online_npu.sh example scripts. There are no review comments, so I have no feedback to provide.
I think 512 is the default setting for these two methods? I would like to keep them intact if possible. But you have a good point: perhaps we should log a warning that tells users to lower this value when GPU memory is low.
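One way such a warning could look, as a minimal sketch: the helper name warn_if_anchors_risky and both thresholds are hypothetical. The defaults (186 anchors, 64 GiB) are simply the figures mentioned in this PR, not measured limits, and how the device memory size is queried is left to the caller.

```python
import logging

logger = logging.getLogger("train")


def warn_if_anchors_risky(num_anchors, device_mem_gib,
                          anchor_limit=186, mem_limit_gib=64.0):
    """Hypothetical check: log a warning and return True when OOM looks likely."""
    # Flag only the combination of a large anchor count and a small device.
    risky = num_anchors > anchor_limit and device_mem_gib <= mem_limit_gib
    if risky:
        logger.warning(
            "--num-anchors=%d may not fit in %.0f GiB of device memory; "
            "consider lowering it (for example to %d).",
            num_anchors, device_mem_gib, anchor_limit,
        )
    return risky
```

A check like this would let the example scripts keep 512 as the default while still pointing users on 64GB cards toward a smaller value.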
Motivation
The default NPU training examples for Qwen3.5-4B DFlash use --num-anchors values (512) that cause out-of-memory errors on common 64GB Ascend NPU cards such as the 910B (A2 node) and 910C (A3 node). This PR lowers the default to a value that fits within the available device memory while keeping the examples runnable out of the box.
Modifications
examples/run_qwen3.5_4b_dflash_online_npu.sh: --num-anchors from 512 to 186
examples/run_qwen3.5_4b_domino_online_npu.sh: --num-anchors from 16 to 186
Both scripts now use the same --num-anchors 186 default, which avoids OOM on 64GB NPU devices.
Related Issues
N/A
Accuracy Test
Not applicable — this change only adjusts a training hyper-parameter default in example launch scripts. No model architecture or kernel code is modified.
Benchmark & Profiling
Not applicable — the change reduces memory usage for the default NPU example configuration.
Checklist