SDAR: Synergy of Diffusion and AutoRegression

Shuang Cheng1,2,*, Yihan Bian3,*, Dawei Liu1,4,*, Yuhua Jiang1,5, Yihao Liu1,5,
Linfeng Zhang4, Wenhai Wang1, Qipeng Guo1, Kai Chen1,
Biqing Qi1,†, Bowen Zhou1,5
1Shanghai AI Lab, 2Zhejiang University, 3University of Maryland, College Park,
4Shanghai Jiao Tong University, 5Tsinghua University
*Equal Contribution, †Project Leader



TL;DR

We propose SDAR (Synergy of Diffusion and AutoRegression), a large-scale diffusion language model that unites the complementary strengths of autoregressive and discrete diffusion modeling.
We have open-sourced the model weights for our dense models (1.7B, 4B, 8B) and for our 30B MoE models (SDAR-30B-A3B-Chat and SDAR-30B-A3B-Sci).
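If the released checkpoints follow standard Hugging Face packaging, loading them should look roughly like the sketch below. The repo id is hypothetical and the need for trust_remote_code is an assumption (custom diffusion decoding is usually shipped as remote code); check the repository for the actual checkpoint names and recommended usage.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical repo id; see the project repository for the real checkpoint names.
repo_id = "JetAstra/SDAR-1.7B-Chat"

# Assumption: SDAR ships custom modeling/decoding code, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```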

Highlights

  • 🚀 Training: Low-cost adaptation from AR to block diffusion (decoding loop sketched after this list)
  • ⚡ Speed: 2-4× faster inference
  • 🧠 Performance: Strong results on science reasoning benchmarks, e.g., high scores on GPQA and ChemBench, and the top PHYSICS score among the models in Table 2
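The training highlight refers to adapting a pretrained AR model into a block-diffusion decoder: blocks are still generated left-to-right (AR across blocks), while the positions inside each block are committed over a few parallel denoising steps (diffusion within a block). The sketch below illustrates only that decoding pattern; the forward pass returning logits, the mask-token handling, and the most-confident-first schedule are our assumptions, not the released implementation.

```python
import torch

def block_diffusion_generate(model, prompt_ids, num_blocks=8, block_size=16,
                             steps_per_block=4, mask_id=0):
    """Illustrative blockwise decoding: AR across blocks, diffusion within."""
    seq = prompt_ids  # shape (1, prompt_len), dtype long
    for _ in range(num_blocks):
        # Append a fully masked block, conditioned on everything decoded so far.
        block = torch.full((1, block_size), mask_id, dtype=seq.dtype)
        seq = torch.cat([seq, block], dim=-1)
        for _ in range(steps_per_block):
            masked = seq[:, -block_size:] == mask_id
            if not masked.any():
                break
            # Assumed HF-style forward pass that returns per-position logits.
            logits = model(seq).logits[:, -block_size:, :]
            conf, pred = logits.softmax(-1).max(-1)
            # Commit the k most confident still-masked positions this step.
            k = max(1, int(masked.sum()) // steps_per_block)
            conf = conf.masked_fill(~masked, -1.0)
            idx = conf.topk(k, dim=-1).indices
            seq[:, -block_size:].scatter_(1, idx, pred.gather(1, idx))
    return seq
```

Because each block conditions only on completed text to its left, prefix KV caching across blocks remains possible, which makes the static-speed-comparable-to-AR claim below plausible.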

Results

Figure 1. Accuracy versus speedup for static and dynamic inference; the dynamic points sweep the confidence threshold, with speedup measured relative to static decoding.

  • SDAR's dynamic decoding delivers a >2× speedup over static decoding with negligible accuracy loss (selection rule sketched after this list); its static speed is comparable to that of AR models.
  • The speedup scales with model size, making SDAR increasingly favorable for larger models.
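The static/dynamic distinction comes down to how many masked positions are committed per model call. Below is a minimal sketch of that per-step selection rule; the function name, arguments, and the one-token fallback are our assumptions rather than the released code.

```python
import torch

def select_unmask(conf: torch.Tensor, masked: torch.Tensor,
                  threshold: float | None = None, per_step: int = 1) -> torch.Tensor:
    """Pick which masked positions of a block to commit this denoising step.

    conf:   (block_size,) confidence of the current prediction per position
    masked: (block_size,) True where the position still holds the mask token
    """
    conf = conf.masked_fill(~masked, float("-inf"))
    if threshold is None:
        # Static schedule: fixed token budget per step, so the number of
        # model calls per block is constant regardless of difficulty.
        k = min(per_step, int(masked.sum()))
        return conf.topk(k).indices
    # Dynamic schedule: commit everything above the threshold; fall back to
    # the single best position so each step still makes progress.
    hits = (conf >= threshold).nonzero(as_tuple=True)[0]
    return hits if hits.numel() > 0 else conf.argmax(keepdim=True)
```

Sweeping `threshold` traces curves like those in Figure 1: a high threshold is conservative and commits few tokens per call, while a lower threshold commits more tokens per call and yields the larger speedups.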

Table 1. SDAR vs. diffusion vs. Qwen3 models across general benchmarks. Parenthesized deltas compare SDAR against the size-matched Qwen3 AR-SFT baseline; "–" marks scores that are not reported.

SDAR vs. Diffusion vs. Qwen3 Baselines

| Benchmark | SDAR-1.7B-Chat | SDAR-4B-Chat | SDAR-8B-Chat | SDAR-30B-A3B-Chat | LLaDA-8B | Dream-7B | Qwen3-1.7B-Base | Qwen3-1.7B-AR-SFT | Qwen3-30B-Base | Qwen3-30B-AR-SFT |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU | 62.9 (-0.9) | 74.9 | 78.6 | 82.8 (+0.6) | 65.9 | 69.5 | 62.6 | 63.8 | 81.4 | 82.2 |
| GSM8K | 80.1 (-1.0) | 89.9 | 91.3 | 91.4 (-1.3) | 78.6 | 81.0 | 75.4 | 81.1 | 91.8 | 92.7 |
| Math500 | 63.2 (+1.2) | 72.8 | 78.6 | 77.8 (+1.0) | 43.5 | – | – | 62.0 | 59.0 | 76.8 |
| MathBench | 63.6 (+3.1) | 74.7 | 76.9 | 79.3 (+0.9) | – | – | – | 60.5 | – | 78.4 |
| HumanEval | 61.6 (-4.3) | 72.0 | 78.7 | 87.2 (+2.4) | 47.6 | 55.5 | – | 65.9 | – | 84.8 |
| MBPP | 61.1 (-0.8) | 65.4 | 72.0 | 71.6 (-3.5) | 34.2 | 58.8 | 55.4 | 61.9 | 74.4 | 75.1 |
| IFEval | 43.4 (+0.1) | 56.6 | 61.4 | 60.6 (+2.9) | 59.9 | 62.5 | – | 43.3 | – | 57.7 |

Table 2. SDAR-Sci compared with external models (external scores taken from InternLM/Intern-S1). Parenthesized deltas compare SDAR-30B-A3B-Sci against the AR-30B-A3B-Sci baseline; "–" marks scores that are not reported.

SDAR-Sci vs. Others

| Benchmark | AR-30B-A3B-Sci | SDAR-30B-A3B-Sci (greedy) | SDAR-30B-A3B-Sci (sample) | Intern-S1 (235B-A22B) | InternVL3-78B | Qwen2.5-VL-72B | DeepSeek-R1-0528 | Qwen3-235B-A22B | Kimi-K2-Instruct | Gemini-2.5 Pro | o3 | Grok-4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 78.3 | 80.2 (+1.9) | 80.6 (+2.3) | 83.5 | 73.0 | 72.1 | 83.4 | 82.2 | 82.7 | 86.0 | 85.0 | 85.9 |
| GPQA-Diamond | 61.2 | 73.7 (+12.5) | 71.8 (+10.6) | 77.3 | 49.9 | 49.0 | 80.6 | 71.1 | 77.8 | 83.8 | 83.3 | 87.5 |
| AIME 2024 | 74.9 | 73.3 (-1.6) | 76.2 (+1.3) | – | – | – | – | – | – | – | – | – |
| AIME 2025 | 60.7 | 63.3 (+2.6) | 62.2 (+1.5) | 86.0 | 10.7 | 10.9 | 87.5 | 81.5 | 51.4 | 83.0 | 88.9 | 91.7 |
| LiveMathBench-Hard | 55.4 | 60.7 (+5.3) | 57.9 (+2.5) | – | – | – | – | – | – | – | – | – |
| LiveCodeBench-v5 | 51.5 | 40.7 (-10.8) | 49.1 (-2.4) | – | – | – | – | – | – | – | – | – |
| LiveCodeBench-v6 | 46.3 | 42.3 (-4.0) | 51.4 (+5.1) | – | – | – | – | – | – | – | – | – |
| ChemBench | 60.5 | 75.1 (+14.6) | 75.1 (+14.6) | 83.4 | 61.3 | 61.6 | 75.6 | 75.8 | 75.3 | 82.8 | 81.6 | 83.3 |
| PHYSICS | 39.0 | 52.9 (+13.9) | 55.6 (+16.6) | 44.0 | 23.1 | 15.7 | – | – | – | 40.0 | 47.9 | 42.8 |
| ProteinLMBench | 59.5 | 60.7 (+1.2) | 60.0 (+0.5) | 63.1 | 61.6 | 61.0 | 61.4 | 59.8 | 66.7 | 62.9 | 67.7 | 66.2 |

BibTeX

@misc{JetAstra2025,
  title={SDAR: A Synergistic Diffusion–AutoRegression Paradigm for Scalable Sequence Generation},
  author={Shuang Cheng and Yihan Bian and Dawei Liu and Yuhua Jiang and Yihao Liu and Linfeng Zhang and Wenhai Wang and Qipeng Guo and Kai Chen and Biqing Qi and Bowen Zhou},
  year={2025},
  institution={Shanghai AI Lab},
  url={https://github.com/JetAstra/SDAR}
}