We successfully replicated the model stitching effect of DeepSeek-R1T-Chimera in the Qwen3-30B-A3B series.

Model Highlights:

  • Merge method: model stitching

  • Highest precision: bfloat16

  • Model size: 31B parameters

  • Context length: 262,144 tokens natively, extendable up to 1,010,000

Parameter Settings:

Thinking Mode

Temperature = 0.6, TopP = 0.95, TopK = 20, MinP = 0.
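For reference, these settings map directly onto the standard sampling arguments of Hugging Face transformers `generate()`. A minimal sketch (the prompt and token budget are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YOYO-AI/Qwen3-30B-A3B-YOYO-Thinking-Chimera"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended thinking-mode sampling settings from above.
output = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```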

An interesting observation:

Due to differences in model architecture and parameter scale, we found that when replicating the paper's results with the Qwen3-30B-A3B series, simply replacing the instruct model's routed-expert tensors with those of the thinking model failed to produce complete reasoning tags.

Therefore, we replaced both the routed-expert tensors and the attention tensors with those from the thinking model, while keeping all other tensors entirely from the instruct model.
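A minimal sketch of this selective replacement, assuming the Hugging Face Qwen3-MoE tensor naming (routed experts live under `mlp.experts`, attention under `self_attn`); the two source checkpoint IDs are illustrative, and a production merge would stream safetensors shards rather than hold both models in memory:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint IDs; substitute the actual instruct/thinking pair.
INSTRUCT_ID = "Qwen/Qwen3-30B-A3B-Instruct-2507"
THINKING_ID = "Qwen/Qwen3-30B-A3B-Thinking-2507"

instruct = AutoModelForCausalLM.from_pretrained(INSTRUCT_ID, torch_dtype=torch.bfloat16)
thinking = AutoModelForCausalLM.from_pretrained(THINKING_ID, torch_dtype=torch.bfloat16)

merged = instruct.state_dict()
donor = thinking.state_dict()

for name in merged:
    # Take routed-expert tensors AND attention tensors from the thinking
    # model; everything else (embeddings, router gates, norms, lm_head, ...)
    # stays from the instruct model.
    if "mlp.experts" in name or "self_attn" in name:
        merged[name] = donor[name]

instruct.load_state_dict(merged)
instruct.save_pretrained("Qwen3-30B-A3B-YOYO-Thinking-Chimera")
```

Note that the `"mlp.experts"` filter deliberately excludes the router gate (`mlp.gate`), so expert selection still comes from the instruct model; only the experts themselves and the attention blocks are donated by the thinking model.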

We hope this insight proves valuable to others!
