Checklist
Motivation
SpecForge Training Configuration
Target Model: Qwen3.6-35B-A3B
- 40 layers, Qwen3.5 MoE architecture
- Hidden size: 5120, 32 attention heads, 8 KV heads
Draft Model Config: Based on qwen3.5-35b-a3b-dflash.json with modifications:
- 8 decoder layers
- Block size: 16
Training Script:
python -m torch.distributed.run --standalone --nproc_per_node 8 \
scripts/train_dflash.py \
--train-data-path cache/dataset/opc_train_regen_first_turn.jsonl \
--num-epochs 10 \
--batch-size 8 \
--learning-rate 6e-4 \
--warmup-ratio 0.04 \
--max-grad-norm 1.0 \
--max-length 4096 \
--chat-template qwen3.5 \
--num-anchors 512 \
--loss-decay-gamma 7.0 \
--target-model-backend sglang \
--block-size 16 \
--embedding-key model.language_model.embed_tokens.weight \
--trust-remote-code
Dataset: opc_train_regen_first_turn.jsonl (2,508,380 samples)
- Custom training data, first-turn only, regenerated
Training Progress: Currently at epoch 0, step ~140,000, which is around 1 epoch (training is ongoing).
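To sanity-check the "around 1 epoch" estimate, here is a rough sketch of the step arithmetic. It is not taken from SpecForge: whether `--batch-size` is a global or a per-GPU value, and whether all 8 ranks are data-parallel, are assumptions, so both readings are computed.

```python
import math

def steps_per_epoch(num_samples: int, samples_per_step: int) -> int:
    """Steps needed to see every sample once, counting the last partial batch."""
    return math.ceil(num_samples / samples_per_step)

NUM_SAMPLES = 2_508_380  # from the dataset line above
BATCH_SIZE = 8           # --batch-size
NUM_GPUS = 8             # --nproc_per_node
CURRENT_STEP = 140_000

# Assumption A: --batch-size is the global number of samples per step.
# Assumption B: --batch-size is per GPU and all 8 ranks are data-parallel.
scenarios = {
    "A (global batch of 8)": BATCH_SIZE,
    "B (8 per GPU x 8 GPUs)": BATCH_SIZE * NUM_GPUS,
}

for label, samples_per_step in scenarios.items():
    spe = steps_per_epoch(NUM_SAMPLES, samples_per_step)
    print(f"{label}: {spe} steps/epoch, "
          f"step {CURRENT_STEP} = {CURRENT_STEP / spe:.2f} epochs")
```

Under assumption A, step 140,000 is about 0.45 epochs; under assumption B it is about 3.57 epochs. Neither matches "around 1 epoch" exactly, so the effective number of samples per step (and any filtering of the dataset) is worth confirming from the training logs.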
Benchmark Results
Evaluation Setup:
- SGLang 0.5.6.post2, TP=4, fa3 attention backend
- Draft window size: 4096
- MTPBench: bsz=64, output_len=28000, datasets: code/math/mtbench
- GSM8K: 128 prompts, concurrency=16
- SWE: 64 samples, 16 workers
- Accept length includes bonus token (+1)
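For clarity on the metric, this is a minimal sketch of how an accept length with the bonus token can be computed. The function name and the per-step counts are hypothetical and only illustrate the "+1" convention noted above; this is not SGLang's own code.

```python
def accept_length(accepted_draft_tokens: list[int]) -> float:
    """Mean tokens emitted per verification step.

    Each entry is the number of draft tokens the target model accepted in
    one verification step; the +1 is the bonus token the target model
    produces itself in that step.
    """
    if not accepted_draft_tokens:
        raise ValueError("need at least one verification step")
    return sum(n + 1 for n in accepted_draft_tokens) / len(accepted_draft_tokens)

# Hypothetical counts for four verification steps.
example_steps = [1, 2, 0, 3]
print(accept_length(example_steps))
```

With this convention the minimum possible value is 1.0 (no draft token accepted in any step), so an accept length of about 2 means roughly one draft token is accepted per step.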
SpecForge Results (Current)
| Training Step | Code | Math | MTBench | GSM8K |
|---|---|---|---|---|
| 24,000 | 1.96 | 2.07 | 1.99 | 2.55 |
| 122,000 | 2.65 | 2.83 | 2.24 | - |
| 140,000 | 2.77 | 2.78 | 2.32 | 2.89 |
Official DFlash Pair (Reference)
| Target | Draft | Code | Math | MTBench | GSM8K |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Official DFlash | 4.45 | 5.24 | 3.23 | 6.84 |
Are these results normal? After ~140K steps (~1 epoch), the SpecForge-trained draft reaches an accept_length of only 2.89 on GSM8K, compared with 6.84 for the official DFlash pair. On Code, Math, and MTBench it reaches 2.32 to 2.78, compared with 3.23 to 5.24 for the official pair. Should we expect further improvement with 2-3 more epochs?
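To put the gap in numbers, the sketch below computes, per benchmark, the step-140,000 accept length as a fraction of the official DFlash value. It uses only the figures from the two tables above; the variable names are illustrative.

```python
# Accept lengths at training step 140,000 (SpecForge) and for the official pair.
specforge = {"Code": 2.77, "Math": 2.78, "MTBench": 2.32, "GSM8K": 2.89}
official = {"Code": 4.45, "Math": 5.24, "MTBench": 3.23, "GSM8K": 6.84}

# Fraction of the official accept length reached so far, per benchmark.
ratio = {name: specforge[name] / official[name] for name in specforge}

for name, value in ratio.items():
    print(f"{name}: {specforge[name]:.2f} / {official[name]:.2f} = {value:.0%}")
```

The gap is smallest on MTBench (about 72% of the official value) and largest on GSM8K (about 42%).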
Related resources
No response