---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- quantized
- moe
base_model: zai-org/GLM-4.7
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# GLM-4.7 — **Quantized** (compressed-tensors for vLLM, MoE finetune)

This repository provides **quantized runtime builds** of **zai-org/GLM-4.7** (a Mixture-of-Experts model), repackaged for **vLLM** using the **compressed-tensors** format.

> **Why this quant is different (MoE-aware calibration)**
> - During calibration we **activate all experts** (not only the few picked by the router for each token).
> - This captures **worst-case activations** across experts, yielding **more robust scales** and **lower quant error drift** at inference—especially under prompts that trigger atypical experts.
> - The accompanying PR to **llm-compressor** adds GLM-family modeling hooks to support this MoE-aware flow (router/expert handling aligned with GLM architectures).
> - Net effect: **cleaner perplexity and stability**, fewer edge-case artifacts, and better **domain transfer** when novel experts fire.

> **TL;DR**
> - **Quantized** with **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
> - Three branches by **group size**: **W4A16_GS32**, **W4A16_GS64**, **W4A16_GS128**.
> - Calibration: **512** chat samples, **2048** max sequence length, dataset **`neuralmagic/LLM_compression_calibration`** (messages rendered with the model’s chat template).
> - Weight-only **AWQ**; `lm_head` kept high-precision; exported with `save_compressed=True`.

---

## Revisions & Branches

> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

- **main** — placeholder / landing page
- **W4A16_GS32** — INT4 weights, **group size 32** (highest fidelity; most scales)
- **W4A16_GS64** — INT4, **group size 64** (balanced default)
- **W4A16_GS128** — INT4, **group size 128** (leanest scales; fastest/lowest VRAM)

**Quick links**
- main: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/main
- W4A16_GS32: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/W4A16_GS32
- W4A16_GS64: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/W4A16_GS64
- W4A16_GS128: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/W4A16_GS128

---

## What’s inside (per revision)

- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the finetune’s chat style)

> Exact file lists may differ between branches — see **Files and versions** for each revision.

---

## Quantization & calibration details (MoE-aware; same recipe family as recent cards)

**Method / flow**
- `llmcompressor` **oneshot** pipeline with an **AWQModifier** (weight-only).

**MoE handling (GLM-4.7)**
- Quantize **Linear** layers across **all experts** and shared projections.
- Router/gating **Linear** modules are quantized like other Linear layers.
- **Expert activation during calibration:** for each calibration batch, **activate all experts** to gather representative activation ranges across the full mixture (not just top-k). This improves scale robustness when rare experts are triggered at inference (a toy sketch of the idea follows below).
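The real hooks for this live in `llm-compressor` (see the PR mentioned above), and GLM-4.7’s actual module layout differs; purely to illustrate the idea, the toy sketch below shows a router temporarily widened so that every expert receives the calibration activations. All names here (`ToyMoEGate`, `all_experts_active`) are hypothetical and not taken from the model or the library.

```python
# Toy illustration of MoE-aware calibration: temporarily route every token to
# every expert so activation ranges are observed for the full mixture.
# Names are hypothetical; the real GLM-4.7 modules and llm-compressor hooks differ.
from contextlib import contextmanager

import torch
import torch.nn as nn


class ToyMoEGate(nn.Module):
    """Minimal top-k router over a pool of experts (illustrative only)."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Routing weights, zeroed everywhere except each token's top-k experts.
        scores = self.router(hidden_states).softmax(dim=-1)
        _, topk_idx = scores.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(scores)
        mask.scatter_(-1, topk_idx, 1.0)
        return scores * mask  # (batch, seq, num_experts)


@contextmanager
def all_experts_active(gate: ToyMoEGate):
    """Widen the router to top_k == num_experts for the calibration pass,
    then restore the original sparse routing afterwards."""
    saved_top_k = gate.top_k
    gate.top_k = gate.num_experts
    try:
        yield gate
    finally:
        gate.top_k = saved_top_k


if __name__ == "__main__":
    gate = ToyMoEGate(hidden_size=64, num_experts=8, top_k=2)
    calibration_batch = torch.randn(1, 16, 64)

    sparse = gate(calibration_batch)    # normal inference routing
    with all_experts_active(gate):      # MoE-aware calibration pass
        dense = gate(calibration_batch)

    print("experts active per token (inference):  ", (sparse > 0).sum(-1).float().mean().item())
    print("experts active per token (calibration):", (dense > 0).sum(-1).float().mean().item())
```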
**Targets / exclusions**
- Targets: `["Linear"]` (MHA/FFN and MoE expert linears).
- **Ignore** `lm_head` (kept high-precision).

**Weights / grouping**
- **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`)
- Strategy: `"group"` with **group size** ∈ {**32**, **64**, **128**} depending on branch
- **Activations are not quantized** (runtime **A16**: BF16/FP16)

**Calibration dataset & preprocessing**
- Dataset: **`neuralmagic/LLM_compression_calibration`**, split **`train`**
- **NUM_CALIBRATION_SAMPLES = 512** (random subset with fixed seed)
- **MAX_SEQUENCE_LENGTH = 2048**
- Each sample’s `messages` is rendered via the tokenizer’s `apply_chat_template(..., tokenize=False)`, then tokenized with:
  - `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`

**Compression call**
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset

**Export for vLLM**
- Saved with **`save_compressed=True`** so **vLLM** loads the **compressed-tensors** runtime layout directly

---

## Why **group size** matters in AWQ (W4A16)

- **Definition:** Group size controls how many consecutive weights share one set of quantization scales.
- **Trade-offs (accuracy ↔ throughput/VRAM):**
  - **GS32 (smallest groups):** Most scale sets → **highest fidelity** (often best perplexity/task scores), but **larger scale metadata**, a bit more **bandwidth**, and slightly **lower throughput**.
  - **GS64 (middle ground):** **Balanced** quality and performance; strong default choice.
  - **GS128 (largest groups):** Fewest scale sets → **leanest/fastest** (less bandwidth/metadata), with **slightly higher quantization error**; good for throughput-critical serving.
- **MoE note:** Smaller groups can especially help when different experts exhibit diverse activation statistics; **GS32** tends to preserve expert-specific nuances best.

---

## Context length

- **Calibration context:** up to **2048 tokens** per sample (as above).
- **Model context window:** inherited from **zai-org/GLM-4.7**; quantization does **not** change RoPE/position encodings—only the numeric representation of the weights.

---

## Quickstart — vLLM (compressed-tensors)

Install vLLM (recent version recommended):

```bash
pip install vllm
```

Serve (pick a quant branch via `--revision`, since `main` holds no weights; adjust to your hardware):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/GLM-4.7_Compressed-Tensors \
  --revision W4A16_GS64 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```

Example Chat Completions:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are GLM-4.7 — helpful, precise, and safe."},
      {"role":"user","content":"Outline a plan for multi-document retrieval with MoE models."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```

> **Note:** `compressed-tensors` is a **vLLM runtime** format. Loading directly with vanilla 🤗 Transformers is **not supported**.
> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision finetune.

---

## Prompting / chat template

This package follows the **finetuned parent’s** chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages.

Guidelines:
- Keep the **system** message concise (behavior, tone, safety constraints).
- Provide clear **user** instructions; for multi-step tasks, list steps explicitly.
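If you need the formatted prompt outside vLLM’s chat endpoint (for example, to inspect what the template produces or to call the plain completions API), the tokenizer can be loaded on its own. A minimal sketch, assuming the quant branches ship the tokenizer files and chat template (the `W4A16_GS64` revision is used here as an example; add `trust_remote_code=True` if the tokenizer class requires it):

```python
# Render a conversation with the packaged chat template (tokenizer only;
# the quantized weights themselves are loadable only through vLLM).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
    revision="W4A16_GS64",  # pick the branch you deploy
)

messages = [
    {"role": "system", "content": "You are GLM-4.7 — helpful, precise, and safe."},
    {"role": "user", "content": "Summarize the trade-offs between GS32 and GS128."},
]

# tokenize=False returns the formatted prompt string; add_generation_prompt
# appends the assistant header so the model starts its reply cleanly.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```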
---

## Intended use & safety

This quantization:
- **Does not** intentionally change model behavior or content tendencies (small numerical deviations from the full-precision parent are expected).
- **Only** changes how the weights are stored, for efficient inference.

Apply appropriate **content filters / policies** for your deployment context.

---

## Lineage

- **Finetuned parent:** https://huggingface.co/zai-org/GLM-4.7
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)

---

## Hardware tips

- 100B-class MoE models benefit from **multi-GPU** tensor parallelism; interconnect bandwidth (NVLink/IB) matters.
- Long contexts are **KV-cache** heavy — tune `--max-model-len` and batch size.
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA Graphs if stable in your environment.

---

## Changelog

- **v1 (current)** — Initial **compressed-tensors W4A16** quantization with **512-sample / 2048-token** MoE-aware calibration; branches **W4A16_GS32 / W4A16_GS64 / W4A16_GS128** published; vLLM-ready packaging.