[Feature] Reproducing Qwen3.6 35B-A3B DFlash Training

### Checklist

- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.

### Motivation

## SpecForge Training Configuration

**Target Model:** Qwen3.6-35B-A3B
- 40 layers, Qwen3.5 MoE architecture
- Hidden size: 5120, 32 attention heads, 8 KV heads

**Draft Model Config:** Based on `qwen3.5-35b-a3b-dflash.json` with modifications:
- 8 decoder layers
- Block size: 16


**Training Script:**
```bash
python -m torch.distributed.run --standalone --nproc_per_node 8 \
    scripts/train_dflash.py \
    --train-data-path cache/dataset/opc_train_regen_first_turn.jsonl \
    --num-epochs 10 \
    --batch-size 8 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 4096 \
    --chat-template qwen3.5 \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --target-model-backend sglang \
    --block-size 16 \
    --embedding-key model.language_model.embed_tokens.weight \
    --trust-remote-code
```

**Dataset:** `opc_train_regen_first_turn.jsonl` (2,508,380 samples)
- Custom training data, first-turn only, regenerated

**Training Progress:** Currently at step ~140,000, roughly 1 epoch in (still logged as epoch 0); training is ongoing.
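
As a sanity check on the "~1 epoch" estimate, here is a minimal sketch of the steps-per-epoch arithmetic. It only uses the numbers quoted above; whether `--batch-size` is per GPU or global, and whether the logged step counts optimizer steps, are assumptions I have not verified against `train_dflash.py`.

```python
import math

NUM_SAMPLES = 2_508_380   # dataset size quoted above
NUM_GPUS = 8              # --nproc_per_node
BATCH_SIZE = 8            # --batch-size
OBSERVED_STEP = 140_000   # approximate step reported above


def steps_per_epoch(num_samples: int, samples_per_step: int) -> int:
    """Number of steps needed to see every sample once."""
    return math.ceil(num_samples / samples_per_step)


# Assumption A: --batch-size is per GPU, so one step consumes 8 * 8 samples.
per_gpu = steps_per_epoch(NUM_SAMPLES, BATCH_SIZE * NUM_GPUS)

# Assumption B: --batch-size is the global batch, so one step consumes 8 samples.
global_ = steps_per_epoch(NUM_SAMPLES, BATCH_SIZE)

# Samples per step implied by "140,000 steps is about 1 epoch".
implied_samples_per_step = NUM_SAMPLES / OBSERVED_STEP

print(f"per-GPU batch reading: {per_gpu:,} steps/epoch")
print(f"global batch reading:  {global_:,} steps/epoch")
print(f"implied samples/step:  {implied_samples_per_step:.1f}")
```

Neither simple reading lands on 140,000 steps per epoch, so the epoch estimate may depend on filtering, sharding, or how the step counter is logged.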

<img width="2080" height="1280" alt="Image" src="https://github.com/user-attachments/assets/8ef45baf-0c50-4eeb-923c-0f5fa26aa68f" />

---

## Benchmark Results

**Evaluation Setup:**
- SGLang 0.5.6.post2, TP=4, fa3 attention backend
- Draft window size: 4096
- MTPBench: bsz=64, output_len=28000, datasets: code/math/mtbench
- GSM8K: 128 prompts, concurrency=16
- SWE: 64 samples, 16 workers
- Accept length includes bonus token (+1)
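
To make the metric explicit, here is a minimal sketch of how I understand accept length with the bonus token included. The function name and the per-step counts are hypothetical illustrations, not SGLang's actual implementation or real benchmark data.

```python
from statistics import mean


def accept_length(accepted_draft_tokens: list[int]) -> float:
    """Mean tokens emitted per verification step.

    Each entry is the number of draft tokens accepted in one verification
    step; the +1 is the bonus token the target model always contributes.
    """
    if not accepted_draft_tokens:
        raise ValueError("need at least one verification step")
    return mean(accepted_draft_tokens) + 1


# Made-up counts for four verification steps (illustration only).
example_steps = [2, 0, 3, 1]
print(accept_length(example_steps))  # 2.5
```

Under this definition, an accept length of 1.0 would mean no draft tokens are ever accepted.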

### SpecForge Results (Current)

| Training Step | Code | Math | MTBench | GSM8K |
|---|---|---|---|---|
| 24,000 | 1.96 | 2.07 | 1.99 | 2.55 |
| 122,000 | 2.65 | 2.83 | 2.24 | - |
| 140,000 | 2.77 | 2.78 | 2.32 | 2.89 |


### Official DFlash Pair (Reference)

| Target | Draft | Code | Math | MTBench | GSM8K |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Official DFlash | 4.45 | 5.24 | 3.23 | 6.84 |



Are these results expected? After ~140K steps (~1 epoch), the SpecForge-trained draft reaches an accept length of only 2.89 on GSM8K, against 6.84 for the official DFlash pair, and it trails on every other benchmark too (2.32-2.78 versus 3.23-5.24). Should we expect further improvement with 2-3 more epochs?
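
For reference, a small sketch that puts the gap in relative terms, using only the step-140,000 row and the official row from the tables above:

```python
# Accept lengths copied from the two tables above.
current = {"Code": 2.77, "Math": 2.78, "MTBench": 2.32, "GSM8K": 2.89}
official = {"Code": 4.45, "Math": 5.24, "MTBench": 3.23, "GSM8K": 6.84}

# Fraction of the official accept length reached at step 140,000.
ratios = {name: current[name] / official[name] for name in current}

for name, ratio in ratios.items():
    print(f"{name}: {current[name]:.2f} / {official[name]:.2f} = {ratio:.0%}")
```

The gap is smallest on MTBench and largest on GSM8K.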



### Related resources

_No response_
