Qwen3-Embedding Redundancy Detector
A LoRA fine-tuned version of Qwen3-Embedding-0.6B for detecting semantic redundancy in LLM reasoning steps.
Model Description
Problem: Long-reasoning LLMs (e.g., DeepSeek-R1) generate extensive reasoning chains that often contain redundant steps: repeating similar logic, restating conclusions, or circling back to previous ideas. This wastes tokens and increases inference cost.
Solution: This model computes embeddings for reasoning steps and uses cosine similarity to detect redundancy. When consecutive steps are semantically similar to previous steps, early stopping can be triggered to save tokens.
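At inference time the check reduces to a thresholded cosine similarity over unit-normalized step embeddings. A minimal sketch of that decision rule (assuming a generic list of previously computed embeddings; the full model-loading code is given under Usage):

import numpy as np

def is_redundant(step_emb: np.ndarray, prev_embs: list, threshold: float = 0.45) -> bool:
    """Flag a step as redundant if it is too similar to any recent step.

    Embeddings are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    if not prev_embs:
        return False
    return max(float(step_emb @ p) for p in prev_embs) > threshold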
Training Details
Base Model
- Qwen3-Embedding-0.6B: A lightweight 0.6B parameter embedding model
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
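For reference, the adapter hyperparameters above correspond to a peft LoraConfig along these lines (a sketch of the training-side setup; the exact MS-Swift invocation is not reproduced here):

from peft import LoraConfig

# Adapter hyperparameters taken from the table above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)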
Training Setup
| Parameter | Value |
|---|---|
| Framework | MS-Swift |
| Loss Function | InfoNCE (contrastive learning) |
| Batch Size | 16 (effective 64 with grad_accum=4) |
| Learning Rate | 1e-4 |
| Epochs | 5 |
| Hardware | NVIDIA H200 GPU |
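The InfoNCE objective pulls each anchor step toward its redundant paraphrase and pushes it away from novel steps. A generic PyTorch sketch of this loss (not MS-Swift's internal implementation; the temperature value is an assumption):

import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor,
                  positive: torch.Tensor,
                  negatives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive InfoNCE loss over normalized embeddings.

    anchor:    (B, D) anchor step embeddings
    positive:  (B, D) redundant-paraphrase embeddings
    negatives: (B, N, D) novel-step embeddings
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logits = (anchor * positive).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", anchor, negatives)  # (B, N)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    # The positive sits at index 0 of each row of logits
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)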
Training Data
- 43,389 samples combining:
- v4 merged diverse base samples with semantic labels
- 22k GPT-augmented paraphrase pairs (GPT-4.1, GPT-4o, GPT-4o-mini, GPT-4.1-mini)
- Data format: (anchor_step, positive_redundant_step, negative_novel_steps)
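A hypothetical record in this format (the step texts are illustrative, not actual training data) could look like:

sample = {
    "anchor_step": "So the walking time is 9/s hours.",
    "positive_redundant_step": "In other words, the time spent walking equals 9 divided by s.",
    "negative_novel_steps": [
        "Including the coffee break, the total time is 9/s + t/60 hours.",
        "Setting up the equation: 9/s + t/60 = 4.",
    ],
}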
Evaluation Results
Test Data: 30 AIME 2024 questions, 2,549 step pairs (404 redundant, 2,145 novel)
Similarity Analysis:
| Metric | Redundant Steps | Novel Steps |
|---|---|---|
| Prev Similarity (K=1) | 0.539 ± 0.138 | 0.308 ± 0.135 |
Best Configuration (K=1, threshold=0.45):
| Metric | Value |
|---|---|
| F1 Score | 0.6010 |
| Recall | 74.75% |
| Precision | 50.25% |
| Accuracy | 84.27% |
Confusion Matrix:
| | Predicted Redundant | Predicted Novel | Total |
|---|---|---|---|
| Actual Redundant | 302 (TP) | 102 (FN) | 404 |
| Actual Novel | 299 (FP) | 1846 (TN) | 2145 |
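The headline metrics follow directly from these counts:

tp, fn, fp, tn = 302, 102, 299, 1846

precision = tp / (tp + fp)                          # 302 / 601  ≈ 0.5025
recall = tp / (tp + fn)                             # 302 / 404  ≈ 0.7475
accuracy = (tp + tn) / (tp + fn + fp + tn)          # 2148 / 2549 ≈ 0.8427
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.6010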
Window Size Comparison:
| K | Best Threshold | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | 0.45 | 50.25% | 74.75% | 0.6010 |
| 3 | 0.50 | 40.66% | 70.05% | 0.5145 |
| 6 | 0.55 | 37.99% | 62.62% | 0.4729 |
| 12 | 0.55 | 33.94% | 69.06% | 0.4551 |
| all | 0.60 | 33.86% | 63.37% | 0.4414 |
Key Finding: K=1 (comparing only against the immediately preceding step) achieves the best F1 score.
Detailed Threshold Analysis (K=1)
| Threshold | Precision | Recall | F1 | F2 |
|---|---|---|---|---|
| 0.30 | 28.6% | 96.8% | 0.442 | 0.655 |
| 0.35 | 36.3% | 94.1% | 0.523 | 0.713 |
| 0.40 | 44.1% | 87.4% | 0.586 | 0.731 |
| 0.45 | 50.2% | 74.8% | 0.601 | 0.681 |
| 0.50 | 56.2% | 59.2% | 0.577 | 0.585 |
| 0.55 | 60.0% | 44.6% | 0.511 | 0.470 |
| 0.60 | 60.3% | 28.2% | 0.384 | 0.316 |
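This sweep can be reproduced from per-pair similarity scores and ground-truth labels. A sketch (assuming similarities have already been computed with the model; labels are 1 for redundant and 0 for novel):

import numpy as np

def sweep_thresholds(similarities, labels,
                     thresholds=(0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60)):
    """Compute precision, recall, F1, and F2 at each threshold."""
    sims = np.asarray(similarities)
    labels = np.asarray(labels)
    rows = []
    for t in thresholds:
        preds = sims > t
        tp = int(np.sum(preds & (labels == 1)))
        fp = int(np.sum(preds & (labels == 0)))
        fn = int(np.sum(~preds & (labels == 1)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f2 = 5 * precision * recall / (4 * precision + recall) if (4 * precision + recall) else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall, "f1": f1, "f2": f2})
    return rows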
Parameter Selection Guide
Why Recall matters: Recall determines how many of the redundant steps are actually detected, which directly drives token savings; higher recall means more tokens saved. For example, 404 of the 2,549 test step pairs (~16%) are redundant, so at ~75% recall (K=1, threshold=0.45) roughly 12% of all steps would be correctly flagged as redundant.
| Use Case | Config | Recall | Precision | F1 |
|---|---|---|---|---|
| Maximum Token Savings | K=1, threshold=0.35 | 94% | 36% | 0.52 |
| Balanced (Recommended) | K=1, threshold=0.45 | 75% | 50% | 0.60 |
| Conservative | K=1, threshold=0.55 | 45% | 60% | 0.51 |
| Extreme Recall | K=12, threshold=0.30 | 99.5% | 17% | 0.29 |
Guideline:
- Token savings priority → lower threshold (0.35)
- Answer quality priority → higher threshold (0.55)
- Default: K=1, threshold=0.45
Usage
Installation
pip install transformers peft torch safetensors
# Optional: for faster inference
pip install vllm
Method 1: Transformers Backend
import json
import torch
import torch.nn.functional as F
import numpy as np
from pathlib import Path
from transformers import AutoModel, AutoTokenizer
from safetensors import safe_open
from peft import get_peft_model, LoraConfig
def last_token_pool(last_hidden_states, attention_mask):
"""Pool the last token's hidden state for embedding."""
left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
if left_padding:
return last_hidden_states[:, -1]
else:
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = last_hidden_states.shape[0]
return last_hidden_states[
torch.arange(batch_size, device=last_hidden_states.device),
sequence_lengths
]
def load_model(lora_path: str):
    """Load the base model with LoRA weights merged.

    `lora_path` may be a local adapter directory or a Hugging Face repo id;
    a repo id is resolved to a local snapshot first.
    """
    if not Path(lora_path).exists():
        # Download the adapter from the Hub when a repo id is given
        from huggingface_hub import snapshot_download
        lora_path = snapshot_download(lora_path)
    lora_path = Path(lora_path)
# Load adapter config
with open(lora_path / "adapter_config.json", 'r') as f:
adapter_config = json.load(f)
base_model_path = adapter_config.get('base_model_name_or_path')
# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(
base_model_path,
trust_remote_code=True,
padding_side='left'
)
base_model = AutoModel.from_pretrained(
base_model_path,
trust_remote_code=True
)
# Create PEFT model
lora_config = LoraConfig(
r=adapter_config.get('r', 32),
lora_alpha=adapter_config.get('lora_alpha', 64),
target_modules=adapter_config.get('target_modules', []),
lora_dropout=adapter_config.get('lora_dropout', 0.1),
bias="none",
)
model = get_peft_model(base_model, lora_config)
# Load LoRA weights
adapter_file = lora_path / "adapter_model.safetensors"
state_dict = {}
with safe_open(str(adapter_file), framework="pt", device="cpu") as f:
for key in f.keys():
tensor = f.get_tensor(key)
new_key = key.replace("base_model.model.model.", "base_model.model.")
new_key = new_key.replace(".lora_A.weight", ".lora_A.default.weight")
new_key = new_key.replace(".lora_B.weight", ".lora_B.default.weight")
state_dict[new_key] = tensor
model.load_state_dict(state_dict, strict=False)
model = model.merge_and_unload()
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
return model, tokenizer, device
def get_embedding(model, tokenizer, text: str, device, max_length: int = 8192):
"""Get normalized embedding for a single text."""
inputs = tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=max_length,
padding=True
)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
embedding = last_token_pool(
outputs.last_hidden_state,
inputs['attention_mask']
)
embedding = F.normalize(embedding, p=2, dim=1)
return embedding.squeeze().cpu().numpy()
# Usage
model, tokenizer, device = load_model("ZhishanQ/qwen3-embedding-redundancy-detector")
step1 = "First, I need to calculate the total distance: 9 kilometers."
step2 = "The distance Aya walks is 9 km, so I'll use that in my calculation."
emb1 = get_embedding(model, tokenizer, step1, device)
emb2 = get_embedding(model, tokenizer, step2, device)
similarity = float(emb1 @ emb2.T)
print(f"Similarity: {similarity:.4f}")
# If similarity > 0.45, steps are likely redundant
Method 2: vLLM Backend (Faster)
import numpy as np
from vllm import LLM
from vllm.lora.request import LoRARequest
class EmbeddingModelVLLM:
def __init__(self, lora_path: str, base_model: str = "Qwen/Qwen3-Embedding-0.6B"):
self.llm = LLM(
model=base_model,
task="embed",
enable_lora=True,
max_lora_rank=32,
gpu_memory_utilization=0.8,
trust_remote_code=True,
)
self.lora_request = LoRARequest(
lora_name="redundancy_detector",
lora_int_id=1,
lora_path=lora_path,
)
def get_embedding(self, text: str) -> np.ndarray:
outputs = self.llm.embed([text], lora_request=self.lora_request)
embedding = np.array(outputs[0].outputs.embedding)
embedding = embedding / (np.linalg.norm(embedding) + 1e-9)
return embedding
def get_embeddings_batch(self, texts: list) -> np.ndarray:
outputs = self.llm.embed(texts, lora_request=self.lora_request)
embeddings = np.array([o.outputs.embedding for o in outputs])
norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9
return embeddings / norms
# Usage
model = EmbeddingModelVLLM("ZhishanQ/qwen3-embedding-redundancy-detector")
emb = model.get_embedding("Let me calculate the total...")
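Continuing the example above, the batch helper makes it easy to score a whole list of steps at once and inspect their pairwise similarities:

steps = [
    "First, compute the walking time: 9/s hours.",
    "The walking time equals 9 divided by s.",
    "Now add the coffee break: total time is 9/s + t/60.",
]
embs = model.get_embeddings_batch(steps)   # shape (3, D), rows are L2-normalized
sim_matrix = embs @ embs.T                 # pairwise cosine similarities
print(np.round(sim_matrix, 3))
# Off-diagonal entries above the chosen threshold (e.g., 0.45) indicate likely redundancy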
Redundancy Detection Pipeline
class RedundancyDetector:
"""
Detect redundant reasoning steps using embedding similarity.
Args:
model: Embedding model
tokenizer: Tokenizer
device: torch device
threshold: Similarity threshold (default: 0.45)
window_size: Number of previous steps to compare (default: 1)
consecutive_k: Trigger exit after K consecutive redundant steps (default: 3)
"""
def __init__(self, model, tokenizer, device,
threshold=0.45, window_size=1, consecutive_k=3):
self.model = model
self.tokenizer = tokenizer
self.device = device
self.threshold = threshold
self.window_size = window_size
self.consecutive_k = consecutive_k
self.reset()
def reset(self):
"""Reset state for new question."""
self.history = []
self.consecutive_count = 0
def check_step(self, step_text: str) -> dict:
"""
Check if a reasoning step is redundant.
Returns:
dict with keys:
- is_redundant: bool
- similarity: float (max similarity to previous steps)
- should_exit: bool (True if should stop early)
"""
current_emb = get_embedding(self.model, self.tokenizer, step_text, self.device)
is_redundant = False
max_sim = 0.0
if len(self.history) > 0:
window = self.history[-self.window_size:]
similarities = [float(current_emb @ h) for h in window]
max_sim = max(similarities)
is_redundant = max_sim > self.threshold
if is_redundant:
self.consecutive_count += 1
else:
self.consecutive_count = 0
self.history.append(current_emb)
return {
"is_redundant": is_redundant,
"similarity": max_sim,
"should_exit": self.consecutive_count >= self.consecutive_k
}
# Usage example
model, tokenizer, device = load_model("ZhishanQ/qwen3-embedding-redundancy-detector")
detector = RedundancyDetector(model, tokenizer, device, threshold=0.45, window_size=1, consecutive_k=3)
reasoning_steps = [
"Let me understand the problem first...",
"The distance is 9 km and speed is s km/h...",
"So the time for walking is 9/s hours...",
"Wait, the time is 9 divided by s, which is 9/s...", # Redundant!
"Yes, walking time = 9/s hours as I said...", # Redundant!
"Let me recalculate: time = 9/s...", # Redundant! -> Exit
]
for i, step in enumerate(reasoning_steps):
result = detector.check_step(step)
print(f"Step {i+1}: redundant={result['is_redundant']}, sim={result['similarity']:.3f}")
if result["should_exit"]:
print(f"Early stopping at step {i+1}!")
break
Recommended Parameters
| Use Case | Window Size (K) | Threshold | Consecutive K |
|---|---|---|---|
| Balanced (F1) | 1 | 0.45 | 3 |
| High Recall | 12 | 0.45 | 3 |
| Conservative | 1 | 0.50 | 5 |
File Structure
qwen3-embedding-redundancy-detector/
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (~80MB)
└── README.md                    # This file
Limitations
- Trained primarily on mathematical reasoning (AIME-style problems)
- May not generalize well to other domains without fine-tuning
- Threshold values may need adjustment for different LLMs
Citation
@misc{qwen3-redundancy-detector,
title={Qwen3-Embedding LoRA for Reasoning Redundancy Detection},
author={ZhishanQ},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/ZhishanQ/qwen3-embedding-redundancy-detector}
}
License
Apache 2.0