Qwen3-Embedding Redundancy Detector

A LoRA fine-tuned version of Qwen3-Embedding-0.6B for detecting semantic redundancy in LLM reasoning steps.

Model Description

Problem: Long-reasoning LLMs (e.g., DeepSeek-R1) generate extensive reasoning chains that often contain redundant steps: repeating similar logic, restating conclusions, or circling back to earlier ideas. This wastes tokens and increases inference cost.

Solution: This model computes embeddings for reasoning steps and uses cosine similarity to detect redundancy. When consecutive steps are semantically similar to previous steps, early stopping can be triggered to save tokens.
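
At its core the check is a thresholded cosine similarity between step embeddings. A minimal sketch of the idea, assuming a placeholder `embed` function that returns unit-normalized vectors (the full pipeline, including windowing and the consecutive-step exit logic, is under Usage):

import numpy as np

def is_redundant(step: str, prev_step: str, embed, threshold: float = 0.45) -> bool:
    # embed() is assumed to return a unit-normalized 1-D vector,
    # so the dot product below equals cosine similarity.
    sim = float(np.dot(embed(step), embed(prev_step)))
    return sim > threshold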

Training Details

Base Model

  • Qwen3-Embedding-0.6B: A lightweight 0.6B parameter embedding model

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
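
For reference, the table above corresponds to roughly this peft LoraConfig (mirroring adapter_config.json):

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)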

Training Setup

| Parameter | Value |
|---|---|
| Framework | MS-Swift |
| Loss Function | InfoNCE (contrastive learning) |
| Batch Size | 16 (effective 64 with grad_accum=4) |
| Learning Rate | 1e-4 |
| Epochs | 5 |
| Hardware | NVIDIA H200 GPU |

Training Data

  • 43,389 samples combining:
    • v4 merged diverse base samples with semantic labels
    • 22k GPT-augmented paraphrase pairs (GPT-4.1, GPT-4o, GPT-4o-mini, GPT-4.1-mini)
  • Data format: (anchor_step, positive_redundant_step, negative_novel_steps)
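
For intuition, here is a minimal sketch of the InfoNCE objective applied to such triplets; the temperature value is illustrative, not the value used in training:

import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """anchor: (B, D), positive: (B, D), negatives: (B, N, D); all L2-normalized."""
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum("bd,bnd->bn", anchor, negatives)      # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1 + N)
    # For every anchor, the redundant paraphrase (index 0) is the correct "class".
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)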

Evaluation Results

Test Data: 30 AIME 2024 questions, 2,549 step pairs (404 redundant, 2,145 novel)

Similarity Analysis:

| Metric | Redundant Steps | Novel Steps |
|---|---|---|
| Prev Similarity (K=1) | 0.539 ± 0.138 | 0.308 ± 0.135 |

Best Configuration (K=1, threshold=0.45):

| Metric | Value |
|---|---|
| F1 Score | 0.6010 |
| Recall | 74.75% |
| Precision | 50.25% |
| Accuracy | 84.27% |

Confusion Matrix:

                    Predicted Redundant | Predicted Novel
Actual Redundant           302 (TP)     |     102 (FN)    | Total:  404
Actual Novel               299 (FP)     |    1846 (TN)    | Total: 2145
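
The headline metrics follow directly from this matrix (F2, reported in the threshold table below, weights recall more heavily than precision):

tp, fn, fp, tn = 302, 102, 299, 1846

precision = tp / (tp + fp)                               # 0.5025
recall = tp / (tp + fn)                                  # 0.7475
accuracy = (tp + tn) / (tp + fn + fp + tn)               # 0.8427
f1 = 2 * precision * recall / (precision + recall)       # 0.6010
f2 = 5 * precision * recall / (4 * precision + recall)   # 0.681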

Window Size Comparison:

| K | Best Threshold | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | 0.45 | 50.25% | 74.75% | 0.6010 |
| 3 | 0.50 | 40.66% | 70.05% | 0.5145 |
| 6 | 0.55 | 37.99% | 62.62% | 0.4729 |
| 12 | 0.55 | 33.94% | 69.06% | 0.4551 |
| all | 0.60 | 33.86% | 63.37% | 0.4414 |

Key Finding: K=1 (comparing only with immediate previous step) achieves the best F1 score.

Detailed Threshold Analysis (K=1)

| Threshold | Precision | Recall | F1 | F2 |
|---|---|---|---|---|
| 0.30 | 28.6% | 96.8% | 0.442 | 0.655 |
| 0.35 | 36.3% | 94.1% | 0.523 | 0.713 |
| 0.40 | 44.1% | 87.4% | 0.586 | 0.731 |
| 0.45 | 50.2% | 74.8% | 0.601 | 0.681 |
| 0.50 | 56.2% | 59.2% | 0.577 | 0.585 |
| 0.55 | 60.0% | 44.6% | 0.511 | 0.470 |
| 0.60 | 60.3% | 28.2% | 0.384 | 0.316 |

Parameter Selection Guide

Why Recall matters: Recall determines how many redundant steps we detect, directly impacting token savings. Higher Recall = more token savings.

| Use Case | Config | Recall | Precision | F1 |
|---|---|---|---|---|
| Maximum Token Savings | K=1, threshold=0.35 | 94% | 36% | 0.52 |
| Balanced (Recommended) | K=1, threshold=0.45 | 75% | 50% | 0.60 |
| Conservative | K=1, threshold=0.55 | 45% | 60% | 0.51 |
| Extreme Recall | K=12, threshold=0.30 | 99.5% | 17% | 0.29 |

Guideline:

  • Token savings priority → lower threshold (0.35)
  • Answer quality priority → higher threshold (0.55)
  • Default: K=1, threshold=0.45
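
One way to encode these presets in code (the names are ours; the values come from the table above and plug straight into the RedundancyDetector shown under Usage):

PRESETS = {
    "max_token_savings": {"window_size": 1, "threshold": 0.35},
    "balanced":          {"window_size": 1, "threshold": 0.45},
    "conservative":      {"window_size": 1, "threshold": 0.55},
}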

Usage

Installation

pip install transformers peft torch safetensors
# Optional: for faster inference
pip install vllm

Method 1: Transformers Backend

import json
import torch
import torch.nn.functional as F
import numpy as np
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoModel, AutoTokenizer
from safetensors import safe_open
from peft import get_peft_model, LoraConfig


def last_token_pool(last_hidden_states, attention_mask):
    """Pool the last token's hidden state for embedding."""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[
            torch.arange(batch_size, device=last_hidden_states.device),
            sequence_lengths
        ]


def load_model(lora_path: str):
    """Load base model with LoRA weights merged."""
    lora_path = Path(lora_path)

    # If a Hub repo ID is passed instead of a local directory, download the adapter first
    if not lora_path.exists():
        lora_path = Path(snapshot_download(repo_id=str(lora_path)))

    # Load adapter config
    with open(lora_path / "adapter_config.json", 'r') as f:
        adapter_config = json.load(f)

    base_model_path = adapter_config.get('base_model_name_or_path')

    # Load tokenizer and base model
    tokenizer = AutoTokenizer.from_pretrained(
        base_model_path,
        trust_remote_code=True,
        padding_side='left'
    )
    base_model = AutoModel.from_pretrained(
        base_model_path,
        trust_remote_code=True
    )

    # Create PEFT model
    lora_config = LoraConfig(
        r=adapter_config.get('r', 32),
        lora_alpha=adapter_config.get('lora_alpha', 64),
        target_modules=adapter_config.get('target_modules', []),
        lora_dropout=adapter_config.get('lora_dropout', 0.1),
        bias="none",
    )
    model = get_peft_model(base_model, lora_config)

    # Load LoRA weights, remapping saved key names to PEFT's internal "default"-adapter layout
    adapter_file = lora_path / "adapter_model.safetensors"
    state_dict = {}
    with safe_open(str(adapter_file), framework="pt", device="cpu") as f:
        for key in f.keys():
            tensor = f.get_tensor(key)
            new_key = key.replace("base_model.model.model.", "base_model.model.")
            new_key = new_key.replace(".lora_A.weight", ".lora_A.default.weight")
            new_key = new_key.replace(".lora_B.weight", ".lora_B.default.weight")
            state_dict[new_key] = tensor

    model.load_state_dict(state_dict, strict=False)
    model = model.merge_and_unload()
    model.eval()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    return model, tokenizer, device


def get_embedding(model, tokenizer, text: str, device, max_length: int = 8192):
    """Get normalized embedding for a single text."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        embedding = last_token_pool(
            outputs.last_hidden_state,
            inputs['attention_mask']
        )
        embedding = F.normalize(embedding, p=2, dim=1)

    return embedding.squeeze().cpu().numpy()


# Usage
model, tokenizer, device = load_model("ZhishanQ/qwen3-embedding-redundancy-detector")

step1 = "First, I need to calculate the total distance: 9 kilometers."
step2 = "The distance Aya walks is 9 km, so I'll use that in my calculation."

emb1 = get_embedding(model, tokenizer, step1, device)
emb2 = get_embedding(model, tokenizer, step2, device)

similarity = float(emb1 @ emb2)  # dot product of unit vectors = cosine similarity
print(f"Similarity: {similarity:.4f}")
# If similarity > 0.45, steps are likely redundant

Method 2: vLLM Backend (Faster)

import numpy as np
from vllm import LLM
from vllm.lora.request import LoRARequest


class EmbeddingModelVLLM:
    def __init__(self, lora_path: str, base_model: str = "Qwen/Qwen3-Embedding-0.6B"):
        self.llm = LLM(
            model=base_model,
            task="embed",
            enable_lora=True,
            max_lora_rank=32,
            gpu_memory_utilization=0.8,
            trust_remote_code=True,
        )
        self.lora_request = LoRARequest(
            lora_name="redundancy_detector",
            lora_int_id=1,
            lora_path=lora_path,
        )

    def get_embedding(self, text: str) -> np.ndarray:
        outputs = self.llm.embed([text], lora_request=self.lora_request)
        embedding = np.array(outputs[0].outputs.embedding)
        embedding = embedding / (np.linalg.norm(embedding) + 1e-9)
        return embedding

    def get_embeddings_batch(self, texts: list) -> np.ndarray:
        outputs = self.llm.embed(texts, lora_request=self.lora_request)
        embeddings = np.array([o.outputs.embedding for o in outputs])
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9
        return embeddings / norms


# Usage
model = EmbeddingModelVLLM("ZhishanQ/qwen3-embedding-redundancy-detector")
emb = model.get_embedding("Let me calculate the total...")
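
The batch helper can score several steps in one call; a small sketch comparing each step with its immediate predecessor (the step texts are illustrative):

steps = [
    "The distance is 9 km and speed is s km/h...",
    "So the time for walking is 9/s hours...",
    "Wait, the time is 9 divided by s, which is 9/s...",
]
embs = model.get_embeddings_batch(steps)     # (len(steps), dim), unit-normalized rows
sims = (embs[1:] * embs[:-1]).sum(axis=1)    # cosine similarity of each step with the previous one
for i, sim in enumerate(sims, start=1):
    print(f"step {i} -> step {i+1}: {sim:.3f}")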

Redundancy Detection Pipeline

class RedundancyDetector:
    """
    Detect redundant reasoning steps using embedding similarity.

    Args:
        model: Embedding model
        tokenizer: Tokenizer
        device: torch device
        threshold: Similarity threshold (default: 0.45)
        window_size: Number of previous steps to compare (default: 1)
        consecutive_k: Trigger exit after K consecutive redundant steps (default: 3)
    """

    def __init__(self, model, tokenizer, device,
                 threshold=0.45, window_size=1, consecutive_k=3):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.threshold = threshold
        self.window_size = window_size
        self.consecutive_k = consecutive_k
        self.reset()

    def reset(self):
        """Reset state for new question."""
        self.history = []
        self.consecutive_count = 0

    def check_step(self, step_text: str) -> dict:
        """
        Check if a reasoning step is redundant.

        Returns:
            dict with keys:
                - is_redundant: bool
                - similarity: float (max similarity to previous steps)
                - should_exit: bool (True if should stop early)
        """
        current_emb = get_embedding(self.model, self.tokenizer, step_text, self.device)

        is_redundant = False
        max_sim = 0.0

        if len(self.history) > 0:
            window = self.history[-self.window_size:]
            similarities = [float(current_emb @ h) for h in window]
            max_sim = max(similarities)
            is_redundant = max_sim > self.threshold

        if is_redundant:
            self.consecutive_count += 1
        else:
            self.consecutive_count = 0

        self.history.append(current_emb)

        return {
            "is_redundant": is_redundant,
            "similarity": max_sim,
            "should_exit": self.consecutive_count >= self.consecutive_k
        }


# Usage example
model, tokenizer, device = load_model("ZhishanQ/qwen3-embedding-redundancy-detector")
detector = RedundancyDetector(model, tokenizer, device, threshold=0.45, window_size=1, consecutive_k=3)

reasoning_steps = [
    "Let me understand the problem first...",
    "The distance is 9 km and speed is s km/h...",
    "So the time for walking is 9/s hours...",
    "Wait, the time is 9 divided by s, which is 9/s...",  # Redundant!
    "Yes, walking time = 9/s hours as I said...",         # Redundant!
    "Let me recalculate: time = 9/s...",                  # Redundant! -> Exit
]

for i, step in enumerate(reasoning_steps):
    result = detector.check_step(step)
    print(f"Step {i+1}: redundant={result['is_redundant']}, sim={result['similarity']:.3f}")

    if result["should_exit"]:
        print(f"Early stopping at step {i+1}!")
        break

Recommended Parameters

| Use Case | Window Size (K) | Threshold | Consecutive K |
|---|---|---|---|
| Balanced (F1) | 1 | 0.45 | 3 |
| High Recall | 12 | 0.45 | 3 |
| Conservative | 1 | 0.50 | 5 |

File Structure

qwen3-embedding-redundancy-detector/
├── adapter_config.json       # LoRA configuration
├── adapter_model.safetensors  # LoRA weights (~80MB)
└── README.md                 # This file

Limitations

  • Trained primarily on mathematical reasoning (AIME-style problems)
  • May not generalize well to other domains without fine-tuning
  • Threshold values may need adjustment for different LLMs
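
If you apply the detector to a different model or domain, one simple way to re-tune the threshold is a sweep over a small set of labeled step pairs; a sketch, assuming you have already scored pairs as (similarity, is_redundant) tuples with this model:

def best_threshold(pairs, candidates=(0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60)):
    """pairs: list of (similarity, is_redundant) tuples; returns (threshold, F1)."""
    best = (None, -1.0)
    for t in candidates:
        tp = sum(1 for s, y in pairs if s > t and y)
        fp = sum(1 for s, y in pairs if s > t and not y)
        fn = sum(1 for s, y in pairs if s <= t and y)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best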

Citation

@misc{qwen3-redundancy-detector,
  title={Qwen3-Embedding LoRA for Reasoning Redundancy Detection},
  author={ZhishanQ},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/ZhishanQ/qwen3-embedding-redundancy-detector}
}

License

Apache 2.0
