We successfully replicated the model stitching effect of DeepSeek-R1T-Chimera in the Qwen3-30B-A3B series.
Model Highlights:
- Merge method: model stitching
- Highest precision: bfloat16
- Context length: 262,144 & 1,010,000
Parameter Settings:
Thinking Mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0.
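For convenience, the settings above can be written as sampling keyword arguments. This is a minimal sketch that assumes a Hugging Face `transformers`-style `generate()` call, where the corresponding argument names are `temperature`, `top_p`, `top_k`, and `min_p`. The variable name `THINKING_MODE_SAMPLING` is hypothetical, and other inference engines may use different names.

```python
# Sampling settings for Thinking Mode, copied from the values listed above.
# Key names follow the transformers `generate()` convention (an assumption;
# adapt them to whichever inference engine you use).
THINKING_MODE_SAMPLING = {
    "do_sample": True,   # sampling must be on for the values below to apply
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
}

# Typical use, assuming `model` and `inputs` already exist:
# output_ids = model.generate(**inputs, **THINKING_MODE_SAMPLING)
```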
An interesting observation:
Due to differences in model architecture and parameter scale, replicating the paper's results with the Qwen3-30B-A3B series required a change: simply replacing the routed expert tensors of the instruction model with those of the thinking model failed to generate complete reasoning tags.
Therefore, we replaced both the routed expert tensors and the attention tensors with those from the thinking model, while keeping all other tensor components from the instruction model.
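The selection rule described above can be sketched as a per-tensor choice of source checkpoint. This is an illustration, not our exact merge script. It assumes tensor names of the form `...mlp.experts.<n>...` for expert weights and `...self_attn...` for attention weights, which should be verified against the actual checkpoint. It also assumes the router gate (`mlp.gate`) counts as one of the "other" components. The helpers `pick_source` and `stitch` are hypothetical names.

```python
import re

# Assumed name patterns; check them against the real checkpoint's tensor names.
EXPERT_PATTERN = re.compile(r"\.mlp\.experts\.\d+\.")
ATTENTION_PATTERN = re.compile(r"\.self_attn\.")


def pick_source(tensor_name: str) -> str:
    """Return which checkpoint a tensor should be taken from."""
    if EXPERT_PATTERN.search(tensor_name) or ATTENTION_PATTERN.search(tensor_name):
        return "thinking"
    # Embeddings, norms, router gate, LM head, etc. stay with the instruction model.
    return "instruct"


def stitch(instruct_state: dict, thinking_state: dict) -> dict:
    """Build a merged state dict from two checkpoints with identical tensor names.

    Raises KeyError if the thinking checkpoint lacks a tensor that should come from it.
    """
    merged = {}
    for name, tensor in instruct_state.items():
        if pick_source(name) == "thinking":
            merged[name] = thinking_state[name]
        else:
            merged[name] = tensor
    return merged
```

In practice the two dictionaries would hold tensors loaded shard by shard. Plain values work equally well for checking the selection logic before running a full merge.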
We hope this insight proves valuable to others!