SDAR: Synergy of Diffusion and AutoRegression

Shuang Cheng1,2,*, Yihan Bian3,*, Dawei Liu1,4,*, Yuhua Jiang1,5, Yihao Liu1,5,
Linfeng Zhang4, Wenhai Wang1, Qipeng Guo1, Kai Chen1,
Biqing Qi1,†, Bowen Zhou1,5
1Shanghai AI Lab, 2Zhejiang University, 3University of Maryland, College Park,
4Shanghai Jiao Tong University, 5Tsinghua University
*Equal Contribution, †Project Leader



TL;DR

We propose SDAR (Synergy of Diffusion and AutoRegression), a large-scale diffusion language model that unites the complementary strengths of autoregressive and discrete diffusion modeling.
We have open-sourced the model weights for our dense models (1.7B, 4B, 8B) and for our 30B MoE models (SDAR-30B-A3B-Chat and SDAR-30B-A3B-Sci).
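If the released checkpoints follow standard Hugging Face packaging, loading them should look roughly like the sketch below. The repo id is hypothetical and the need for trust_remote_code is an assumption (custom diffusion decoding is usually shipped as remote code); check the repository for the actual checkpoint names and recommended usage.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical repo id; see the project repository for the real checkpoint names.
repo_id = "JetAstra/SDAR-1.7B-Chat"

# Assumption: SDAR ships custom modeling/decoding code, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```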

Highlights

  • 🚀 Training: Low-cost adaptation from AR to block diffusion (decoding loop sketched after this list)
  • ⚡ Speed: 2-4× faster inference
  • 🧠 Performance: Strong results on science reasoning benchmarks, e.g., high scores on GPQA and ChemBench, and the top PHYSICS score among the models in Table 2
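The training highlight refers to adapting a pretrained AR model into a block-diffusion decoder: blocks are still generated left-to-right (AR across blocks), while the positions inside each block are committed over a few parallel denoising steps (diffusion within a block). The sketch below illustrates only that decoding pattern; the forward pass returning logits, the mask-token handling, and the most-confident-first schedule are our assumptions, not the released implementation.

```python
import torch

def block_diffusion_generate(model, prompt_ids, num_blocks=8, block_size=16,
                             steps_per_block=4, mask_id=0):
    """Illustrative blockwise decoding: AR across blocks, diffusion within."""
    seq = prompt_ids  # shape (1, prompt_len), dtype long
    for _ in range(num_blocks):
        # Append a fully masked block, conditioned on everything decoded so far.
        block = torch.full((1, block_size), mask_id, dtype=seq.dtype)
        seq = torch.cat([seq, block], dim=-1)
        for _ in range(steps_per_block):
            masked = seq[:, -block_size:] == mask_id
            if not masked.any():
                break
            # Assumed HF-style forward pass that returns per-position logits.
            logits = model(seq).logits[:, -block_size:, :]
            conf, pred = logits.softmax(-1).max(-1)
            # Commit the k most confident still-masked positions this step.
            k = max(1, int(masked.sum()) // steps_per_block)
            conf = conf.masked_fill(~masked, -1.0)
            idx = conf.topk(k, dim=-1).indices
            seq[:, -block_size:].scatter_(1, idx, pred.gather(1, idx))
    return seq
```

Because each block conditions only on completed text to its left, prefix KV caching across blocks remains possible, which makes the static-speed-comparable-to-AR claim below plausible.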

Results

Figure 1. Accuracy versus speedup for static and dynamic inference; the dynamic points sweep the confidence threshold, with speedup measured relative to static decoding.

  • SDAR's dynamic decoding delivers a >2× speedup over static decoding with negligible accuracy loss (selection rule sketched after this list); its static speed is comparable to that of AR models.
  • The speedup scales with model size, making SDAR increasingly favorable for larger models.
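The static/dynamic distinction comes down to how many masked positions are committed per model call. Below is a minimal sketch of that per-step selection rule; the function name, arguments, and the one-token fallback are our assumptions rather than the released code.

```python
import torch

def select_unmask(conf: torch.Tensor, masked: torch.Tensor,
                  threshold: float | None = None, per_step: int = 1) -> torch.Tensor:
    """Pick which masked positions of a block to commit this denoising step.

    conf:   (block_size,) confidence of the current prediction per position
    masked: (block_size,) True where the position still holds the mask token
    """
    conf = conf.masked_fill(~masked, float("-inf"))
    if threshold is None:
        # Static schedule: fixed token budget per step, so the number of
        # model calls per block is constant regardless of difficulty.
        k = min(per_step, int(masked.sum()))
        return conf.topk(k).indices
    # Dynamic schedule: commit everything above the threshold; fall back to
    # the single best position so each step still makes progress.
    hits = (conf >= threshold).nonzero(as_tuple=True)[0]
    return hits if hits.numel() > 0 else conf.argmax(keepdim=True)
```

Sweeping `threshold` traces curves like those in Figure 1: a high threshold is conservative and commits few tokens per call, while a lower threshold commits more tokens per call and yields the larger speedups.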

Table 1. SDAR vs. diffusion vs. Qwen3 models across general benchmarks. Parenthesized deltas compare SDAR against the size-matched Qwen3 AR-SFT baseline; "–" marks scores that are not reported.

SDAR vs. Diffusion vs. Qwen3 Baselines

| Benchmark | SDAR-1.7B-Chat | SDAR-4B-Chat | SDAR-8B-Chat | SDAR-30B-A3B-Chat | LLaDA-8B | Dream-7B | Qwen3-1.7B-Base | Qwen3-1.7B-AR-SFT | Qwen3-30B-Base | Qwen3-30B-AR-SFT |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU | 62.9 (-0.9) | 74.9 | 78.6 | 82.8 (+0.6) | 65.9 | 69.5 | 62.6 | 63.8 | 81.4 | 82.2 |
| GSM8K | 80.1 (-1.0) | 89.9 | 91.3 | 91.4 (-1.3) | 78.6 | 81.0 | 75.4 | 81.1 | 91.8 | 92.7 |
| Math500 | 63.2 (+1.2) | 72.8 | 78.6 | 77.8 (+1.0) | 43.5 | – | – | 62.0 | 59.0 | 76.8 |
| MathBench | 63.6 (+3.1) | 74.7 | 76.9 | 79.3 (+0.9) | – | – | – | 60.5 | – | 78.4 |
| HumanEval | 61.6 (-4.3) | 72.0 | 78.7 | 87.2 (+2.4) | 47.6 | 55.5 | – | 65.9 | – | 84.8 |
| MBPP | 61.1 (-0.8) | 65.4 | 72.0 | 71.6 (-3.5) | 34.2 | 58.8 | 55.4 | 61.9 | 74.4 | 75.1 |
| IFEval | 43.4 (+0.1) | 56.6 | 61.4 | 60.6 (+2.9) | 59.9 | 62.5 | – | 43.3 | – | 57.7 |

Table 2. SDAR-Sci compared with external models (external scores taken from InternLM/Intern-S1). Parenthesized deltas compare SDAR-30B-A3B-Sci against the AR-30B-A3B-Sci baseline; "–" marks scores that are not reported.

SDAR-Sci vs. Others

| Benchmark | AR-30B-A3B-Sci | SDAR-30B-A3B-Sci (greedy) | SDAR-30B-A3B-Sci (sample) | Intern-S1 (235B-A22B) | InternVL3-78B | Qwen2.5-VL-72B | DeepSeek-R1-0528 | Qwen3-235B-A22B | Kimi-K2-Instruct | Gemini-2.5 Pro | o3 | Grok-4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 78.3 | 80.2 (+1.9) | 80.6 (+2.3) | 83.5 | 73.0 | 72.1 | 83.4 | 82.2 | 82.7 | 86.0 | 85.0 | 85.9 |
| GPQA-Diamond | 61.2 | 73.7 (+12.5) | 71.8 (+10.6) | 77.3 | 49.9 | 49.0 | 80.6 | 71.1 | 77.8 | 83.8 | 83.3 | 87.5 |
| AIME 2024 | 74.9 | 73.3 (-1.6) | 76.2 (+1.3) | – | – | – | – | – | – | – | – | – |
| AIME 2025 | 60.7 | 63.3 (+2.6) | 62.2 (+1.5) | 86.0 | 10.7 | 10.9 | 87.5 | 81.5 | 51.4 | 83.0 | 88.9 | 91.7 |
| LiveMathBench-Hard | 55.4 | 60.7 (+5.3) | 57.9 (+2.5) | – | – | – | – | – | – | – | – | – |
| LiveCodeBench-v5 | 51.5 | 40.7 (-10.8) | 49.1 (-2.4) | – | – | – | – | – | – | – | – | – |
| LiveCodeBench-v6 | 46.3 | 42.3 (-4.0) | 51.4 (+5.1) | – | – | – | – | – | – | – | – | – |
| ChemBench | 60.5 | 75.1 (+14.6) | 75.1 (+14.6) | 83.4 | 61.3 | 61.6 | 75.6 | 75.8 | 75.3 | 82.8 | 81.6 | 83.3 |
| PHYSICS | 39.0 | 52.9 (+13.9) | 55.6 (+16.6) | 44.0 | 23.1 | 15.7 | – | – | – | 40.0 | 47.9 | 42.8 |
| ProteinLMBench | 59.5 | 60.7 (+1.2) | 60.0 (+0.5) | 63.1 | 61.6 | 61.0 | 61.4 | 59.8 | 66.7 | 62.9 | 67.7 | 66.2 |

BibTeX

@misc{JetAstra2025,
  title={SDAR: A Synergistic Diffusion–AutoRegression Paradigm for Scalable Sequence Generation},
  author={Shuang Cheng and Yihan Bian and Dawei Liu and Yuhua Jiang and Yihao Liu and Linfeng Zhang and Wenhai Wang and Qipeng Guo and Kai Chen and Biqing Qi and Bowen Zhou},
  year={2025},
  institution={Shanghai AI Lab},
  url={https://github.com/JetAstra/SDAR}
}