---
datasets:
- zwhe99/DeepMath-103K
base_model:
- openai/gpt-oss-120b
---
# AutoDeco
Official Implementation of "[The End of Manual Decoding: Towards Truly End-to-End Language Models](https://arxiv.org/abs/2510.26697)"

**AutoDeco** is a framework that adds token-level adaptive decoding parameter prediction to Large Language Models (LLMs). By adding lightweight prediction heads on top of a pre-trained model, AutoDeco dynamically predicts the optimal temperature and top-p for each token during decoding.
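
To make this concrete, here is a minimal sketch of what token-level decoding looks like at sampling time, assuming the heads have already produced a `temperature` and `top_p` value for the current step. Names are illustrative, not the repo's API:

```python
import torch

def sample_next_token(logits, temperature, top_p):
    """Sample one token using the *predicted* temperature and top-p for this step."""
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # always keeps the top token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()

    return sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]

# Because the heads emit fresh values at every step, temperature and top_p
# vary token by token instead of being fixed for the whole generation.
next_token = sample_next_token(torch.randn(32000), temperature=0.8, top_p=0.95)
```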

## 🎯 Key Features

- **Token-Level Decoding Parameter Prediction**: Dynamically predicts decoding parameters (temperature and top-p) for each generated token
- **Lightweight Design**: Adds only two small MLP prediction heads (~5MB), without modifying the base model
- **Universal Architecture**: Supports multiple mainstream LLM architectures (Llama, Qwen2/2.5, Qwen3, MoE models, etc.)
- **End-to-End Training**: Trained entirely through the cross-entropy loss, with gradients flowing implicitly into the heads
- **Flexible Training**: Supports training the temperature head and the top-p head independently or jointly
- **Efficient Deployment**: Only the AutoDeco head weights are saved during training; they are merged with the base model for decoding

## 🏗️ Architecture

The AutoDeco framework consists of two core components:

![AutoDeco Architecture](figure/arch.png)

### Model Workflow

```
Input Tokens
     ↓
Base LLM (frozen during head training)
     ↓
Hidden States
     ├──→ LM Head  → Logits
     ├──→ TempHead → Temperature
     └──→ TopPHead → Top-P
```

During training, the base LLM parameters are frozen, and only the two prediction heads are trained.
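
As a rough illustration, the sketch below shows what one of these heads could look like in PyTorch. The layer sizes and activation are assumptions (chosen so a head lands near the ~1M-parameter scale of the table in the next section), not the repo's exact module:

```python
import torch
import torch.nn as nn

class AutoDecoHead(nn.Module):
    """Small MLP mapping each hidden state to one scalar per token (illustrative)."""

    def __init__(self, hidden_size=4096, head_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, head_dim),
            nn.SiLU(),
            nn.Linear(head_dim, 1),
        )

    def forward(self, hidden_states):               # (batch, seq, hidden)
        return self.mlp(hidden_states).squeeze(-1)  # (batch, seq)

# Stand-in for base-LLM hidden states; only the heads would receive gradients.
hidden = torch.randn(1, 8, 4096)
temp_head, top_p_head = AutoDecoHead(), AutoDecoHead()
temperature = nn.functional.softplus(temp_head(hidden))  # positive
top_p = torch.sigmoid(top_p_head(hidden))                # in (0, 1)
```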

## 🤖 Supported Models

AutoDeco supports all current autoregressive LLMs; the architectures below are unified under a single `AutoDecoModelForCausalLM` interface.

<div align="center">

| **Base Model** | **#Base Params** | **#AutoDeco Params** | **Download** |
| :------------: | :------------: | :------------: | :------------: |
| Llama-3.1-Nemotron-Nano-8B-v1 | 8B | 2.1M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Llama-Nemotron-8B) |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 1.84M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-R1-Distill-Qwen-7B) |
| Qwen3-30B-A3B-Instruct-2507 | 30B | 1.05M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Qwen3-30B-A3B-Instruct-2507) |
| OpenAI-GPT-OSS-20B | 20B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-20B) |
| OpenAI-GPT-OSS-120B | 120B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-120B) |
| Qwen3-235B-A22B-Thinking | 235B | 2.1M | [🤗 HuggingFace](https://huggingface.co/zacks917/AutoDeco-Qwen3-235B-A22B-Thinking-2507) |
| DeepSeek-V3.1-Terminus | 671B | - | Coming Soon |

</div>

## 🚀 Installation

### Recommended Requirements

- Python >= 3.10
- PyTorch >= 2.0
- CUDA >= 12.0 (recommended for training)

### Install Dependencies

```bash
# Clone the repository, then enter it
cd AutoDeco

# Install core dependencies
pip install -r requirements.txt

# Optional: for training monitoring
pip install wandb
```

## 💡 Quick Start

### Initialize AutoDeco Model

```bash
python script/construct_autodeco.py \
    --base_model_name_or_path path_to_your_base_LLM \
    --output_dir path_to_your_AutoDeco_model
```

<!-- ### 2. Inference

```python
from transformers import AutoTokenizer

# `model` is an AutoDeco model loaded beforehand (see Quick Start)
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
inputs = tokenizer("What is the meaning of life?", return_tensors="pt")

# Forward pass to get predictions
outputs = model(**inputs)

# outputs contains:
# - outputs.logits: Regular language model logits
# - outputs.temp_logits: Predicted temperature values
# - outputs.top_p_logits: Predicted top-p values
```

### 3. Efficient Inference with vLLM

We have integrated AutoDeco with vLLM for efficient batch inference:

- Install vLLM from source first
```bash
cd vllm
pip install -e .
```

- Inference
```bash
python llm_eval.py \
    --model_name_or_path path/to/autodeco_model \
    --dataset aime24 \
    --temp 1.0 \
    --top_p 1.0 \
    --k 16 \
    --tp_size 4
``` -->

## 🔥 Training

### Prepare Training Data

Training data should be in JSONL format, with one sample per line. AutoDeco supports the standard conversation format:

```json
{
    "prompt": "formatted prompt text",
    "completion": "expected completion"
}
```

For example:

```json
{
    "prompt": "<|im_start|>user\nEvaluate the limit:$$\\lim_{(x, y) \\to (1, 2)} \\frac{(x-1)(y-2)-x+3}{x^2-2x+y^2-4}$$\nMake sure you output the final answer within \\boxed{}<|im_end|>\n<|im_start|>assistant\n",
    "completion": "......### ✅ Final Answer:\n$$\n\\boxed{-1}\n$$"
}
```

### Train AutoDeco Heads

Use the provided training script:

```bash
# Edit script/trl_train.sh to configure parameters
# Key parameters:
# - MODEL_NAME_OR_PATH: Your initialized AutoDeco model path
# - DATA_NAME: Training data filename (in the data directory)
# - MAX_LENGTH: Maximum sequence length
# - train_temp: Whether to train the temperature head
# - train_top_p: Whether to train the top-p head

bash script/trl_train.sh
```

Training configuration example:

```bash
# Train only the temperature head
accelerate launch trl_train.py \
    --model_name_or_path AutoDeco-Llama-3.1-8B \
    --dataset_name train_data.jsonl \
    --train_temp true \
    --train_top_p false \
    --learning_rate 5e-6 \
    --num_train_epochs 1 \
    --output_dir ckpt/llama3_temp_head
```
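
To see why plain cross-entropy suffices to train the heads, note that the predicted temperature rescales the logits *inside* the loss, so the gradient reaches the head even though the base model is frozen. A minimal, self-contained sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 16, 32000)                        # frozen base model output
temperature = (torch.rand(2, 16) + 0.5).requires_grad_()  # stand-in for head output

# Temperature enters the loss differentiably, so d(loss)/d(temperature) exists.
scaled = logits / temperature.unsqueeze(-1)
labels = torch.randint(0, 32000, (2, 16))
loss = F.cross_entropy(scaled.flatten(0, 1), labels.flatten())
loss.backward()

print(temperature.grad.shape)  # torch.Size([2, 16]): gradient reaches the head
```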

## 📊 Inference

### Batch Evaluation with vLLM

```bash
# Single evaluation
python llm_eval.py \
    --model_name_or_path ckpt/autodeco_model \
    --dataset aime24 \
    --temp 1.0 \
    --top_p 1.0 \
    --k 16 \
    --seed 42

# Batch evaluation with the script (automatically runs multiple random seeds)
bash script/test_generation.sh aime24 1.0 1.0 -1 1.0 path/to/model
```

Evaluation results are saved in the `generation_log/` directory, including:
- Pass@K metrics
- Average accuracy
- Detailed generation results for each sample
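
If you want to recompute Pass@K from the saved logs, the standard unbiased estimator (Chen et al., 2021) is straightforward; note the repo's exact computation may differ:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples drawn from
    n total generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 generations per problem (--k 16), 6 of them correct:
print(pass_at_k(n=16, c=6, k=8))
```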

### Deploy with vLLM

```bash
# Example: serve a merged AutoDeco checkpoint (see "Merge AutoDeco Heads" below)
vllm serve path_to_your_full_model
```
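
Since vLLM exposes an OpenAI-compatible server (by default at `http://localhost:8000/v1`), the served model can be queried with the standard `openai` client; the model name and prompt below are placeholders:

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI API; the key is ignored but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="path_to_your_full_model",  # the name registered by `vllm serve`
    prompt="Evaluate 2 + 2 and put the answer in \\boxed{}.",
    max_tokens=128,
)
print(response.choices[0].text)
```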

## 📁 Project Structure
```
AutoDeco/
├── model/                    # Model definitions
│   └── templlm_auto.py       # Unified AutoDeco model (recommended)
├── trainer/                  # Trainers
│   └── trl_Temp.py           # AutoDeco trainer
├── script/                   # Scripts
│   ├── trl_train.sh          # Training launch script
│   ├── test_generation.sh    # Batch evaluation script
│   └── merge_autodeco.py     # Merge or split heads
├── config/                   # Configuration files
│   └── deepspeed/            # DeepSpeed configuration
│       └── deepspeed_zero3_gradaccu4.yaml
├── trl_train.py              # Training main program
├── llm_eval.py               # Evaluation main program (vLLM)
├── boxed_extract.py          # Answer extraction tool
├── requirements.txt          # Dependencies
└── README.md                 # This document
```

## 🔧 Advanced Usage

### 1. Extract AutoDeco Heads from an AutoDeco Model

```bash
python merge_autodeco.py split \
    --full-checkpoint path_to_your_full_model \
    --output path_to_split_head
```

This generates a lightweight checkpoint (~5MB) containing:
- `config.json`: AutoDeco configuration (including `base_model_name_or_path`)
- `autodeco_heads.safetensors`: Head weights
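
To sanity-check an extracted checkpoint, you can list the stored head tensors with `safetensors`; the exact tensor names depend on the head implementation, so this sketch only inspects them:

```python
from safetensors.torch import load_file

# Load the extracted head weights and list what they contain.
heads = load_file("path_to_split_head/autodeco_heads.safetensors")
for name, tensor in heads.items():
    print(name, tuple(tensor.shape))
print("total params:", sum(t.numel() for t in heads.values()))
```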

### 2. Merge AutoDeco Heads into the Base Model (for vLLM Deployment)

If you need a complete model checkpoint with heads for inference engines like vLLM:

```bash
python merge_autodeco.py merge \
    --autodeco-path path_to_autodeco_heads \
    --base-model-path path_to_base_LLM \
    --output path_to_your_full_model
```

## 📝 Citation

If you use AutoDeco in your research, please cite:

```bibtex
@misc{wang2025endmanualdecodingtruly,
      title={The End of Manual Decoding: Towards Truly End-to-End Language Models},
      author={Zhichao Wang and Dongyang Ma and Xinting Huang and Deng Cai and Tian Lan and Jiahao Xu and Haitao Mi and Xiaoying Tang and Yan Wang},
      year={2025},
      eprint={2510.26697},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.26697},
}
```

<!-- ## Acknowledgments

- Built on [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl)
- Training framework uses [DeepSpeed](https://github.com/microsoft/DeepSpeed)
- Inference optimization uses [vLLM](https://github.com/vllm-project/vllm) -->