Qwen3-Embedding Redundancy Detector
A LoRA fine-tuned version of Qwen3-Embedding-0.6B for detecting semantic redundancy in LLM reasoning steps.
Model Description
Problem: Long-reasoning LLMs (e.g., DeepSeek-R1) generate extensive reasoning chains that often contain redundant steps: repeating similar logic, restating conclusions, or circling back to previous ideas. This wastes tokens and increases inference cost.
Solution: This model computes embeddings for reasoning steps and uses cosine similarity to detect redundancy. When consecutive steps are semantically similar to previous steps, early stopping can be triggered to save tokens.
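At inference time the check reduces to a thresholded cosine similarity over unit-normalized step embeddings. A minimal sketch of that decision rule (assuming a generic list of previously computed embeddings; the full model-loading code is given under Usage):

import numpy as np

def is_redundant(step_emb: np.ndarray, prev_embs: list, threshold: float = 0.45) -> bool:
    """Flag a step as redundant if it is too similar to any recent step.

    Embeddings are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    if not prev_embs:
        return False
    return max(float(step_emb @ p) for p in prev_embs) > threshold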
Training Details
Base Model
- Qwen3-Embedding-0.6B: A lightweight 0.6B parameter embedding model
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
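For reference, the adapter hyperparameters above correspond to a peft LoraConfig along these lines (a sketch of the training-side setup; the exact MS-Swift invocation is not reproduced here):

from peft import LoraConfig

# Adapter hyperparameters taken from the table above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)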
Training Setup
| Parameter | Value |
|---|---|
| Framework | MS-Swift |
| Loss Function | InfoNCE (contrastive learning) |
| Batch Size | 16 (effective 64 with grad_accum=4) |
| Learning Rate | 1e-4 |
| Epochs | 5 |
| Hardware | NVIDIA H200 GPU |
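The InfoNCE objective pulls each anchor step toward its redundant paraphrase and pushes it away from novel steps. A generic PyTorch sketch of this loss (not MS-Swift's internal implementation; the temperature value is an assumption):

import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor,
                  positive: torch.Tensor,
                  negatives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive InfoNCE loss over normalized embeddings.

    anchor:    (B, D) anchor step embeddings
    positive:  (B, D) redundant-paraphrase embeddings
    negatives: (B, N, D) novel-step embeddings
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logits = (anchor * positive).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", anchor, negatives)  # (B, N)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    # The positive sits at index 0 of each row of logits
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)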
Training Data
- 43,389 samples combining:
- v4 merged diverse base samples with semantic labels
- 22k GPT-augmented paraphrase pairs (GPT-4.1, GPT-4o, GPT-4o-mini, GPT-4.1-mini)
- Data format: (anchor_step, positive_redundant_step, negative_novel_steps)
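A hypothetical record in this format (the step texts are illustrative, not actual training data) could look like:

sample = {
    "anchor_step": "So the walking time is 9/s hours.",
    "positive_redundant_step": "In other words, the time spent walking equals 9 divided by s.",
    "negative_novel_steps": [
        "Including the coffee break, the total time is 9/s + t/60 hours.",
        "Setting up the equation: 9/s + t/60 = 4.",
    ],
}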
Evaluation Results
Test Data: 30 AIME 2024 questions, 2,549 step pairs (404 redundant, 2,145 novel)
Similarity Analysis:
| Metric | Redundant Steps | Novel Steps |
|---|---|---|
| Prev Similarity (K=1) | 0.539 ± 0.138 | 0.308 ± 0.135 |
Best Configuration (K=1, threshold=0.45):
| Metric | Value |
|---|---|
| F1 Score | 0.6010 |
| Recall | 74.75% |
| Precision | 50.25% |
| Accuracy | 84.27% |
Confusion Matrix:
| | Predicted Redundant | Predicted Novel | Total |
|---|---|---|---|
| Actual Redundant | 302 (TP) | 102 (FN) | 404 |
| Actual Novel | 299 (FP) | 1846 (TN) | 2145 |
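The headline metrics follow directly from these counts:

tp, fn, fp, tn = 302, 102, 299, 1846

precision = tp / (tp + fp)                          # 302 / 601  ≈ 0.5025
recall = tp / (tp + fn)                             # 302 / 404  ≈ 0.7475
accuracy = (tp + tn) / (tp + fn + fp + tn)          # 2148 / 2549 ≈ 0.8427
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.6010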
Window Size Comparison:
| K | Best Threshold | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | 0.45 | 50.25% | 74.75% | 0.6010 |
| 3 | 0.50 | 40.66% | 70.05% | 0.5145 |
| 6 | 0.55 | 37.99% | 62.62% | 0.4729 |
| 12 | 0.55 | 33.94% | 69.06% | 0.4551 |
| all | 0.60 | 33.86% | 63.37% | 0.4414 |
Key Finding: K=1 (comparing only against the immediately preceding step) achieves the best F1 score.
Detailed Threshold Analysis (K=1)
| Threshold | Precision | Recall | F1 | F2 |
|---|---|---|---|---|
| 0.30 | 28.6% | 96.8% | 0.442 | 0.655 |
| 0.35 | 36.3% | 94.1% | 0.523 | 0.713 |
| 0.40 | 44.1% | 87.4% | 0.586 | 0.731 |
| 0.45 | 50.2% | 74.8% | 0.601 | 0.681 |
| 0.50 | 56.2% | 59.2% | 0.577 | 0.585 |
| 0.55 | 60.0% | 44.6% | 0.511 | 0.470 |
| 0.60 | 60.3% | 28.2% | 0.384 | 0.316 |
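This sweep can be reproduced from per-pair similarity scores and ground-truth labels. A sketch (assuming similarities have already been computed with the model; labels are 1 for redundant and 0 for novel):

import numpy as np

def sweep_thresholds(similarities, labels,
                     thresholds=(0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60)):
    """Compute precision, recall, F1, and F2 at each threshold."""
    sims = np.asarray(similarities)
    labels = np.asarray(labels)
    rows = []
    for t in thresholds:
        preds = sims > t
        tp = int(np.sum(preds & (labels == 1)))
        fp = int(np.sum(preds & (labels == 0)))
        fn = int(np.sum(~preds & (labels == 1)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f2 = 5 * precision * recall / (4 * precision + recall) if (4 * precision + recall) else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall, "f1": f1, "f2": f2})
    return rows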
Parameter Selection Guide
Why Recall matters: Recall determines how many of the redundant steps are actually detected, which directly drives token savings; higher recall means more tokens saved. For example, 404 of the 2,549 test step pairs (~16%) are redundant, so at ~75% recall (K=1, threshold=0.45) roughly 12% of all steps would be correctly flagged as redundant.
| Use Case | Config | Recall | Precision | F1 |
|---|---|---|---|---|
| Maximum Token Savings | K=1, threshold=0.35 | 94% | 36% | 0.52 |
| Balanced (Recommended) | K=1, threshold=0.45 | 75% | 50% | 0.60 |
| Conservative | K=1, threshold=0.55 | 45% | 60% | 0.51 |
| Extreme Recall | K=12, threshold=0.30 | 99.5% | 17% | 0.29 |
Guideline:
- Token savings priority → lower threshold (0.35)
- Answer quality priority → higher threshold (0.55)
- Default: K=1, threshold=0.45
Usage
Installation
pip install transformers peft torch safetensors
# Optional: for faster inference
pip install vllm
Method 1: Transformers Backend
import json
import torch
import torch.nn.functional as F
import numpy as np
from pathlib import Path
from transformers import AutoModel, AutoTokenizer
from safetensors import safe_open
from peft import get_peft_model, LoraConfig
def last_token_pool(last_hidden_states, attention_mask):
"""Pool the last token's hidden state for embedding."""
left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
if left_padding:
return last_hidden_states[:, -1]
else:
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = last_hidden_states.shape[0]
return last_hidden_states[
torch.arange(batch_size, device=last_hidden_states.device),
sequence_lengths
]
def load_model(lora_path: str):
    """Load the base model with LoRA weights merged.

    `lora_path` may be a local adapter directory or a Hugging Face repo id;
    a repo id is resolved to a local snapshot first.
    """
    if not Path(lora_path).exists():
        # Download the adapter from the Hub when a repo id is given
        from huggingface_hub import snapshot_download
        lora_path = snapshot_download(lora_path)
    lora_path = Path(lora_path)
# Load adapter config
with open(lora_path / "adapter_config.json", 'r') as f:
adapter_config = json.load(f)
base_model_path = adapter_config.get('base_model_name_or_path')
# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(
base_model_path,
trust_remote_code=True,
padding_side='left'
)
base_model = AutoModel.from_pretrained(
base_model_path,
trust_remote_code=True
)
# Create PEFT model
lora_config = LoraConfig(
r=adapter_config.get('r', 32),
lora_alpha=adapter_config.get('lora_alpha', 64),
target_modules=adapter_config.get('target_modules', []),
lora_dropout=adapter_config.get('lora_dropout', 0.1),
bias="none",
)
model = get_peft_model(base_model, lora_config)
# Load LoRA weights
adapter_file = lora_path / "adapter_model.safetensors"
state_dict = {}
with safe_open(str(adapter_file), framework="pt", device="cpu") as f:
for key in f.keys():
tensor = f.get_tensor(key)
new_key = key.replace("base_model.model.model.", "base_model.model.")
new_key = new_key.replace(".lora_A.weight", ".lora_A.default.weight")
new_key = new_key.replace(".lora_B.weight", ".lora_B.default.weight")
state_dict[new_key] = tensor
model.load_state_dict(state_dict, strict=False)
model = model.merge_and_unload()
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
return model, tokenizer, device
def get_embedding(model, tokenizer, text: str, device, max_length: int = 8192):
"""Get normalized embedding for a single text."""
inputs = tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=max_length,
padding=True
)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
embedding = last_token_pool(
outputs.last_hidden_state,
inputs['attention_mask']
)
embedding = F.normalize(embedding, p=2, dim=1)
return embedding.squeeze().cpu().numpy()
# Usage
model, tokenizer, device = load_model("ZhishanQ/qwen3-embedding-redundancy-detector")
step1 = "First, I need to calculate the total distance: 9 kilometers."
step2 = "The distance Aya walks is 9 km, so I'll use that in my calculation."
emb1 = get_embedding(model, tokenizer, step1, device)
emb2 = get_embedding(model, tokenizer, step2, device)
similarity = float(emb1 @ emb2.T)
print(f"Similarity: {similarity:.4f}")
# If similarity > 0.45, steps are likely redundant
Method 2: vLLM Backend (Faster)
import numpy as np
from vllm import LLM
from vllm.lora.request import LoRARequest
class EmbeddingModelVLLM:
def __init__(self, lora_path: str, base_model: str = "Qwen/Qwen3-Embedding-0.6B"):
self.llm = LLM(
model=base_model,
task="embed",
enable_lora=True,
max_lora_rank=32,
gpu_memory_utilization=0.8,
trust_remote_code=True,
)
self.lora_request = LoRARequest(
lora_name="redundancy_detector",
lora_int_id=1,
lora_path=lora_path,
)
def get_embedding(self, text: str) -> np.ndarray:
outputs = self.llm.embed([text], lora_request=self.lora_request)
embedding = np.array(outputs[0].outputs.embedding)
embedding = embedding / (np.linalg.norm(embedding) + 1e-9)
return embedding
def get_embeddings_batch(self, texts: list) -> np.ndarray:
outputs = self.llm.embed(texts, lora_request=self.lora_request)
embeddings = np.array([o.outputs.embedding for o in outputs])
norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9
return embeddings / norms
# Usage
model = EmbeddingModelVLLM("ZhishanQ/qwen3-embedding-redundancy-detector")
emb = model.get_embedding("Let me calculate the total...")
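Continuing the example above, the batch helper makes it easy to score a whole list of steps at once and inspect their pairwise similarities:

steps = [
    "First, compute the walking time: 9/s hours.",
    "The walking time equals 9 divided by s.",
    "Now add the coffee break: total time is 9/s + t/60.",
]
embs = model.get_embeddings_batch(steps)   # shape (3, D), rows are L2-normalized
sim_matrix = embs @ embs.T                 # pairwise cosine similarities
print(np.round(sim_matrix, 3))
# Off-diagonal entries above the chosen threshold (e.g., 0.45) indicate likely redundancy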
Redundancy Detection Pipeline
class RedundancyDetector:
"""
Detect redundant reasoning steps using embedding similarity.
Args:
model: Embedding model
tokenizer: Tokenizer
device: torch device
threshold: Similarity threshold (default: 0.45)
window_size: Number of previous steps to compare (default: 1)
consecutive_k: Trigger exit after K consecutive redundant steps (default: 3)
"""
def __init__(self, model, tokenizer, device,
threshold=0.45, window_size=1, consecutive_k=3):
self.model = model
self.tokenizer = tokenizer
self.device = device
self.threshold = threshold
self.window_size = window_size
self.consecutive_k = consecutive_k
self.reset()
def reset(self):
"""Reset state for new question."""
self.history = []
self.consecutive_count = 0
def check_step(self, step_text: str) -> dict:
"""
Check if a reasoning step is redundant.
Returns:
dict with keys:
- is_redundant: bool
- similarity: float (max similarity to previous steps)
- should_exit: bool (True if should stop early)
"""
current_emb = get_embedding(self.model, self.tokenizer, step_text, self.device)
is_redundant = False
max_sim = 0.0
if len(self.history) > 0:
window = self.history[-self.window_size:]
similarities = [float(current_emb @ h) for h in window]
max_sim = max(similarities)
is_redundant = max_sim > self.threshold
if is_redundant:
self.consecutive_count += 1
else:
self.consecutive_count = 0
self.history.append(current_emb)
return {
"is_redundant": is_redundant,
"similarity": max_sim,
"should_exit": self.consecutive_count >= self.consecutive_k
}
# Usage example
model, tokenizer, device = load_model("ZhishanQ/qwen3-embedding-redundancy-detector")
detector = RedundancyDetector(model, tokenizer, device, threshold=0.45, window_size=1, consecutive_k=3)
reasoning_steps = [
"Let me understand the problem first...",
"The distance is 9 km and speed is s km/h...",
"So the time for walking is 9/s hours...",
"Wait, the time is 9 divided by s, which is 9/s...", # Redundant!
"Yes, walking time = 9/s hours as I said...", # Redundant!
"Let me recalculate: time = 9/s...", # Redundant! -> Exit
]
for i, step in enumerate(reasoning_steps):
result = detector.check_step(step)
print(f"Step {i+1}: redundant={result['is_redundant']}, sim={result['similarity']:.3f}")
if result["should_exit"]:
print(f"Early stopping at step {i+1}!")
break
Recommended Parameters
| Use Case | Window Size (K) | Threshold | Consecutive K |
|---|---|---|---|
| Balanced (F1) | 1 | 0.45 | 3 |
| High Recall | 12 | 0.45 | 3 |
| Conservative | 1 | 0.50 | 5 |
File Structure
qwen3-embedding-redundancy-detector/
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (~80MB)
└── README.md                    # This file
Limitations
- Trained primarily on mathematical reasoning (AIME-style problems)
- May not generalize well to other domains without fine-tuning
- Threshold values may need adjustment for different LLMs
Citation
@misc{qwen3-redundancy-detector,
title={Qwen3-Embedding LoRA for Reasoning Redundancy Detection},
author={ZhishanQ},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/ZhishanQ/qwen3-embedding-redundancy-detector}
}
License
Apache 2.0