[Feature] Reproducing Qwen3.6 35B-A3B DFlash Training #560

@Sylvan820

Motivation

SpecForge Training Configuration

Target Model: Qwen3.6-35B-A3B

  • 40 layers, Qwen3.5 MoE architecture
  • Hidden size: 5120, 32 attention heads, 8 KV heads

Draft Model Config: Based on qwen3.5-35b-a3b-dflash.json with modifications:

  • 8 decoder layers
  • Block size: 16

Training Script:

python -m torch.distributed.run --standalone --nproc_per_node 8 \
    scripts/train_dflash.py \
    --train-data-path cache/dataset/opc_train_regen_first_turn.jsonl \
    --num-epochs 10 \
    --batch-size 8 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 4096 \
    --chat-template qwen3.5 \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --target-model-backend sglang \
    --block-size 16 \
    --embedding-key model.language_model.embed_tokens.weight \
    --trust-remote-code
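
To make the effect of --loss-decay-gamma 7.0 with --block-size 16 concrete, here is a minimal sketch of how per-position loss weights inside one draft block would fall off. It assumes the flag controls an exponential decay of the form exp(-k / gamma) over the position k within a block; that form, and the helper name block_loss_weights, are assumptions for illustration and are not taken from the SpecForge source.

```python
import math


def block_loss_weights(block_size: int, gamma: float) -> list[float]:
    """Hypothetical per-position weights exp(-k / gamma), k = 0 .. block_size - 1.

    ASSUMPTION: the exponential form is a guess at what --loss-decay-gamma means;
    check scripts/train_dflash.py for the actual definition.
    """
    return [math.exp(-k / gamma) for k in range(block_size)]


weights = block_loss_weights(block_size=16, gamma=7.0)
total = sum(weights)

# Show how much of the total loss weight each block position would receive.
for k, w in enumerate(weights):
    print(f"position {k:2d}: weight {w:.3f} ({w / total:.1%} of total)")
```

Under this assumption the last position of a 16-token block is weighted at roughly 0.12 relative to the first, so the loss is dominated by early positions, which are also the ones that matter most for accept length.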

Dataset: opc_train_regen_first_turn.jsonl (2,508,380 samples)

  • Custom training data, first-turn only, regenerated

Training Progress: Currently at epoch 0, step ~140,000, which is roughly one epoch. Training is ongoing.
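
As a sanity check on the step and epoch counts, the sketch below estimates steps per epoch from the dataset size and the launch command. It covers two readings of --batch-size 8 (global, or per GPU across 8 processes) and ignores gradient accumulation and any length filtering. Which reading SpecForge uses is an assumption here, not something confirmed from the code.

```python
import math

NUM_SAMPLES = 2_508_380   # from the dataset description
NUM_GPUS = 8              # --nproc_per_node 8
BATCH_SIZE = 8            # --batch-size 8
STEPS_DONE = 140_000      # approximate current step


def steps_per_epoch(num_samples: int, samples_per_step: int) -> int:
    """Optimizer steps needed to see every sample once (last partial batch kept)."""
    return math.ceil(num_samples / samples_per_step)


# ASSUMPTION: two possible meanings of --batch-size; neither is confirmed.
scenarios = {
    "batch size is global": BATCH_SIZE,
    "batch size is per GPU": BATCH_SIZE * NUM_GPUS,
}

for label, samples_per_step in scenarios.items():
    per_epoch = steps_per_epoch(NUM_SAMPLES, samples_per_step)
    epochs_done = STEPS_DONE / per_epoch
    print(f"{label}: {per_epoch:,} steps/epoch, {epochs_done:.2f} epochs at step {STEPS_DONE:,}")
```

Neither reading lands at one epoch for 140,000 steps (about 0.45 and about 3.6 epochs respectively), so it may be worth confirming from the trainer logs how a step is counted and how many samples survive preprocessing.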

Benchmark Results

Evaluation Setup:

  • SGLang 0.5.6.post2, TP=4, fa3 attention backend
  • Draft window size: 4096
  • MTPBench: bsz=64, output_len=28000, datasets: code/math/mtbench
  • GSM8K: 128 prompts, concurrency=16
  • SWE: 64 samples, 16 workers
  • Accept length includes bonus token (+1)
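
For clarity on the last point, this is a minimal sketch of the accept-length convention used in the tables: each verification step contributes the number of accepted draft tokens plus one bonus token from the target model. The function name and the example counts are hypothetical and only illustrate the arithmetic; they are not SGLang output.

```python
def mean_accept_length(accepted_draft_tokens: list[int]) -> float:
    """Average tokens emitted per verification step, counting the bonus token.

    accepted_draft_tokens[i] is the number of draft tokens the target model
    accepted at step i; the target always contributes one extra token.
    """
    if not accepted_draft_tokens:
        raise ValueError("need at least one verification step")
    return sum(n + 1 for n in accepted_draft_tokens) / len(accepted_draft_tokens)


# Hypothetical per-step acceptance counts, for illustration only.
example_steps = [0, 3, 1, 2]
print(mean_accept_length(example_steps))  # (1 + 4 + 2 + 3) / 4 = 2.5
```

With this convention the minimum possible value is 1.0 (no draft token ever accepted), so an accept length of 2.89 corresponds to 1.89 accepted draft tokens per step on average.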

SpecForge Results (Current)

| Training Step | Code | Math | MTBench | GSM8K |
|---|---|---|---|---|
| 24,000 | 1.96 | 2.07 | 1.99 | 2.55 |
| 122,000 | 2.65 | 2.83 | 2.24 | - |
| 140,000 | 2.77 | 2.78 | 2.32 | 2.89 |

Official DFlash Pair (Reference)

| Target | Draft | Code | Math | MTBench | GSM8K |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Official DFlash | 4.45 | 5.24 | 3.23 | 6.84 |

Are these results normal? After ~140K steps (~1 epoch), the SpecForge accept length reaches only 2.89 on GSM8K, versus 6.84 for the official DFlash pair. The gap holds on the other benchmarks too: 2.32-2.78 for SpecForge versus 3.23-5.24 for the official draft on Code, Math, and MTBench. Should we expect further improvement with 2-3 more epochs?
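
To quantify the gap per benchmark, the snippet below divides the step-140,000 accept lengths by the official DFlash values, using only numbers from the two tables above.

```python
# Accept lengths at step 140,000 (SpecForge) and for the official DFlash draft.
specforge_140k = {"Code": 2.77, "Math": 2.78, "MTBench": 2.32, "GSM8K": 2.89}
official = {"Code": 4.45, "Math": 5.24, "MTBench": 3.23, "GSM8K": 6.84}

# Fraction of the official accept length reached so far, per benchmark.
ratios = {name: specforge_140k[name] / official[name] for name in official}

for name, ratio in ratios.items():
    print(f"{name}: {specforge_140k[name]:.2f} / {official[name]:.2f} = {ratio:.0%}")
```

On these numbers the current checkpoint sits between roughly 42% (GSM8K) and 72% (MTBench) of the official accept length.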

Related resources

No response
