We successfully replicated the model stitching effect of DeepSeek-R1T-Chimera in the Qwen3-30B-A3B series.

Model Highlights:

Merge method: model stitching
Highest precision: bfloat16
Context length: 262,144 & 1,010,000

Parameter Settings:

Thinking Mode

Temperature=0.6, TopP=0.95, TopK=20, MinP=0.
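As a minimal sketch, the thinking-mode settings above can be collected into one dictionary of sampling arguments. The key names (`temperature`, `top_p`, `top_k`, `min_p`, `do_sample`) follow the keyword arguments of Hugging Face Transformers' `generate`, which is an assumption about the inference stack; other runtimes may name these parameters differently. The helper name `thinking_mode_sampling_kwargs` is hypothetical.

```python
def thinking_mode_sampling_kwargs() -> dict:
    """Return the recommended thinking-mode sampling settings.

    Key names are assumed to match Hugging Face Transformers'
    `generate` keyword arguments; rename them for other runtimes.
    """
    return {
        "do_sample": True,   # sampling must be enabled for the values below to apply
        "temperature": 0.6,  # Temperature=0.6
        "top_p": 0.95,       # TopP=0.95
        "top_k": 20,         # TopK=20
        "min_p": 0.0,        # MinP=0
    }


if __name__ == "__main__":
    print(thinking_mode_sampling_kwargs())
```

With Transformers, the dictionary could be unpacked into a call such as `model.generate(**inputs, **thinking_mode_sampling_kwargs())`, assuming `model` and `inputs` were prepared in the usual way.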

An interesting observation:

Because of differences in model architecture and parameter scale, we found that, when replicating the paper's results with the Qwen3-30B-A3B series, simply replacing the routed expert tensors of the instruction model with those of the thinking model did not produce complete reasoning tags.

We therefore took both the routed expert tensors and the attention tensors from the thinking model, and kept every other tensor from the instruction model.
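The selection rule described above can be sketched as a filter over tensor names. This is an illustration, not the script used to build the model: the name fragments `.mlp.experts.` and `.self_attn.` are assumptions about how the checkpoint labels its expert and attention tensors, and the helpers `source_for` and `stitch` are hypothetical. The sketch also assumes the router gate (for example `mlp.gate.weight`) counts among the "other" tensors and stays with the instruction model.

```python
# Assumed name fragments; verify them against the real checkpoint's tensor names.
THINKING_MARKERS = (".mlp.experts.", ".self_attn.")


def source_for(tensor_name: str) -> str:
    """Return which parent model a tensor should be copied from."""
    if any(marker in tensor_name for marker in THINKING_MARKERS):
        return "thinking"
    return "instruct"


def stitch(instruct_tensors: dict, thinking_tensors: dict) -> dict:
    """Build a merged name -> tensor mapping from two same-shaped checkpoints."""
    if instruct_tensors.keys() != thinking_tensors.keys():
        raise ValueError("both models must expose the same tensor names")
    return {
        name: (thinking_tensors if source_for(name) == "thinking" else instruct_tensors)[name]
        for name in instruct_tensors
    }


if __name__ == "__main__":
    # Toy stand-ins for real tensors, just to show where each entry comes from.
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.gate.weight",
        "model.layers.0.mlp.experts.3.up_proj.weight",
    ]
    merged = stitch({n: "instruct" for n in names}, {n: "thinking" for n in names})
    for name, origin in merged.items():
        print(f"{name}: {origin}")
```

The function works on any mapping from names to tensors, so it could be applied to state dictionaries loaded shard by shard rather than holding both full models in memory at once.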

We hope this insight proves valuable to others!