---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- quantized
- moe
base_model: zai-org/GLM-4.7
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# GLM-4.7 — **Quantized** (compressed-tensors for vLLM, MoE finetune)

This repository provides **quantized runtime builds** of
**zai-org/GLM-4.7** (a Mixture-of-Experts model), repackaged for **vLLM** using the **compressed-tensors** format.

> **Why this quant is different (MoE-aware calibration)**
> - During calibration we **activate all experts** (not only the few picked by the router for each token).
> - This captures **worst-case activations** across experts, yielding **more robust scales** and **lower quant error drift** at inference—especially under prompts that trigger atypical experts.
> - The accompanying PR to **llm-compressor** adds GLM-family modeling hooks to support this MoE-aware flow (router/expert handling aligned with GLM architectures).
> - Net effect: **cleaner perplexity and stability**, fewer edge-case artifacts, and better **domain transfer** when novel experts fire.

> **TL;DR**
> - **Quantized** with **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
> - Three branches by **group size**: **W4A16_GS32**, **W4A16_GS64**, **W4A16_GS128**.
> - Calibration: **512** chat samples, **2048** max sequence length, dataset **`neuralmagic/LLM_compression_calibration`** (messages rendered with the model’s chat template).
> - Weight-only **AWQ**; `lm_head` kept high-precision; exported with `save_compressed=True`.

---

## Revisions & Branches

> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

- **main** — placeholder / landing page
- **W4A16_GS32** — INT4 weights, **group size 32** (highest fidelity; most scales)
- **W4A16_GS64** — INT4, **group size 64** (balanced default)
- **W4A16_GS128** — INT4, **group size 128** (leanest scales; fastest/lowest VRAM)

**Quick links**

- main: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/main
- W4A16_GS32: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/W4A16_GS32
- W4A16_GS64: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/W4A16_GS64
- W4A16_GS128: https://huggingface.co/TheHouseOfTheDude/GLM-4.7_Compressed-Tensors/tree/W4A16_GS128
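
Because the runnable weights live on per-quant branches rather than `main`, pin a branch (revision) when downloading. A minimal sketch using `huggingface_hub` (the `local_dir` path is just an illustrative example):

```python
# Sketch: download one quant branch locally before serving it with vLLM.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
    revision="W4A16_GS64",             # or W4A16_GS32 / W4A16_GS128
    local_dir="./GLM-4.7-W4A16_GS64",  # illustrative path
)
print(local_path)  # point vLLM at this directory, or use --revision when serving
```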

---

## What’s inside (per revision)

- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the finetune’s chat style)

> Exact file lists may differ between branches — see **Files and versions** for each revision.

---

## Quantization & calibration details (MoE-aware; same recipe family as recent cards)

**Method / flow**
- `llmcompressor` **oneshot** pipeline with an **AWQModifier** (weight-only).

**MoE handling (GLM-4.7)**
- Quantize **Linear** layers across **all experts** and shared projections.
- Router/gating **Linear** modules are quantized like other Linear layers.
- **Expert activation during calibration:** for each calibration batch, **activate all experts** to gather representative activation ranges across the full mixture (not just top-k). This improves scale robustness when rare experts are triggered at inference.
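
To make the "activate all experts" step concrete, here is a conceptual sketch only. The real hook lives in the GLM-aware `llm-compressor` changes mentioned above; the config field names used here (`num_experts_per_tok`, `n_routed_experts`) are assumptions borrowed from similar MoE configs and may not match GLM-4.7’s modeling code exactly.

```python
# Conceptual sketch only: force the router to select every expert while
# calibration batches run, then restore normal top-k routing afterwards.
# Field names are assumed, and whether the gate re-reads the config at run
# time depends on the modeling code (the actual hook is in the llm-compressor
# changes referenced in this card).
from contextlib import contextmanager

@contextmanager
def activate_all_experts(model):
    cfg = model.config
    original_top_k = cfg.num_experts_per_tok
    cfg.num_experts_per_tok = cfg.n_routed_experts   # top-k == all experts
    try:
        yield model
    finally:
        cfg.num_experts_per_tok = original_top_k     # restore normal routing

# Usage during calibration (hypothetical helper name):
# with activate_all_experts(model):
#     run_calibration_forward_passes(model, calibration_batches)
```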

**Targets / exclusions**
- Targets: `["Linear"]` (MHA/FFN and MoE expert linears).
- **Ignore** `lm_head` (kept high-precision).

**Weights / grouping**
- **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`)
- Strategy: `"group"` with **group size** ∈ {**32**, **64**, **128**} depending on branch
- **Activations are not quantized** (runtime **A16**: BF16/FP16)

**Calibration dataset & preprocessing**
- Dataset: **`neuralmagic/LLM_compression_calibration`**, split **`train`**
- **NUM_CALIBRATION_SAMPLES = 512** (random subset with fixed seed)
- **MAX_SEQUENCE_LENGTH = 2048**
- Each sample’s `messages` is rendered via the tokenizer’s
  `apply_chat_template(..., tokenize=False)`, then tokenized with:
  - `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`

**Compression call**
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset

**Export for vLLM**
- Saved with **`save_compressed=True`** so **vLLM** loads the **compressed-tensors** runtime layout directly
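
Putting the pieces above together, the flow looks roughly like the sketch below. This is a reconstruction, not the exact script used for these branches: the stock `W4A16` scheme defaults to group size 128, so the GS32/GS64 branches override the group size via a custom quantization config, and the MoE all-expert calibration relies on the GLM-aware `llm-compressor` changes mentioned earlier. Argument names may also shift between `llm-compressor` versions.

```python
# Reconstruction sketch of the W4A16 AWQ recipe described above (not the exact script).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "zai-org/GLM-4.7"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: render chat messages with the model's template, then tokenize.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_SEQUENCE_LENGTH, truncation=True,
                     padding=False, add_special_tokens=False)

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Weight-only AWQ: INT4 symmetric grouped weights; lm_head left high-precision.
# "W4A16" defaults to group size 128; the GS32/GS64 branches use a group-size
# override (not shown here).
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    tokenizer=tokenizer,
)

SAVE_DIR = "GLM-4.7-W4A16"  # example output path
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```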

---

## Why **group size** matters in AWQ (W4A16)

- **Definition:** Group size controls how many consecutive weights share one set of quantization scales.
- **Trade-offs (accuracy ↔ throughput/VRAM):**
  - **GS32 (smallest groups):** Most scale sets → **highest fidelity** (often best perplexity/task scores), but **larger scale metadata**, a bit more **bandwidth**, and slightly **lower throughput**.
  - **GS64 (middle ground):** **Balanced** quality and performance; strong default choice.
  - **GS128 (largest groups):** Fewest scale sets → **leanest/faster** (less bandwidth/metadata), with **slightly higher quantization error**; good for throughput-critical serving.
- **MoE note:** Smaller groups can especially help when different experts exhibit diverse activation statistics; **GS32** tends to preserve expert-specific nuances best.
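
For a rough sense of the metadata cost behind this trade-off, here is a back-of-the-envelope calculation, assuming one 16-bit scale per group of symmetric INT4 weights (no stored zero-points) and ignoring packing details:

```python
# Approximate storage cost per weight for grouped, symmetric INT4 quantization.
WEIGHT_BITS = 4
SCALE_BITS = 16  # assumption: one fp16/bf16 scale shared by each group

for group_size in (32, 64, 128):
    overhead_bits = SCALE_BITS / group_size          # extra bits per weight for scales
    total_bits = WEIGHT_BITS + overhead_bits
    print(f"GS{group_size:<4} ~{total_bits:.3f} bits/weight "
          f"({overhead_bits / WEIGHT_BITS:.1%} overhead vs. pure INT4)")

# GS32  ~4.500 bits/weight (12.5% overhead vs. pure INT4)
# GS64  ~4.250 bits/weight (6.2% overhead vs. pure INT4)
# GS128 ~4.125 bits/weight (3.1% overhead vs. pure INT4)
```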

---

## Context length

- **Calibration context:** up to **2048 tokens** per sample (as above).
- **Model context window:** inherited from **zai-org/GLM-4.7**; quantization does **not** change RoPE/position encodings—only the numeric representation of the weights.

---

## Quickstart — vLLM (compressed-tensors)

Install vLLM (recent version recommended):

    pip install vllm

Serve (adjust to your hardware; select one of the quant branches with `--revision`, since `main` is only a landing page):

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    vllm serve TheHouseOfTheDude/GLM-4.7_Compressed-Tensors \
      --revision W4A16_GS64 \
      --quantization compressed-tensors \
      --tensor-parallel-size 8 \
      --max-model-len 2048 \
      --gpu-memory-utilization 0.70 \
      --dtype bfloat16

Example Chat Completions request:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
        "messages": [
          {"role":"system","content":"You are GLM-4.7 — helpful, precise, and safe."},
          {"role":"user","content":"Outline a plan for multi-document retrieval with MoE models."}
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95
      }'
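
The same endpoint can also be called from Python with any OpenAI-compatible client. A minimal sketch using the `openai` package (the `api_key` value is a placeholder; vLLM ignores it unless you configure one):

```python
# Query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are GLM-4.7 — helpful, precise, and safe."},
        {"role": "user", "content": "Outline a plan for multi-document retrieval with MoE models."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
```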

> **Note:** `compressed-tensors` is a **vLLM runtime** format. Loading directly with vanilla 🤗 Transformers is **not supported**.
> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision finetune.

---

## Prompting / chat template

This package follows the **finetuned parent’s** chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages.

Guidelines:
- Keep the **system** message concise (behavior, tone, safety constraints).
- Provide clear **user** instructions; for multi-step tasks, list steps explicitly.
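
For reference, rendering a conversation through the bundled template with 🤗 Transformers looks like the sketch below (tokenizer artifacts only; the compressed-tensors weights themselves still require vLLM). The `revision` value is an example pointing at one of the quant branches:

```python
# Sketch: render messages with the repo's chat template (tokenizer only;
# the quantized weights still need vLLM for inference).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
    revision="W4A16_GS64",  # example: any of the quant branches
)

messages = [
    {"role": "system", "content": "You are GLM-4.7 — helpful, precise, and safe."},
    {"role": "user", "content": "Summarize the trade-offs between GS32 and GS128."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the exact string the server sees after templating
```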

---

## Intended use & safety

This quantization:
- **Does not** change underlying behavior or content tendencies.
- **Only** changes weight storage for efficient inference.

Apply appropriate **content filters / policies** for your deployment context.

---

## Lineage

- **Finetuned parent:** https://huggingface.co/zai-org/GLM-4.7
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)

---

## Hardware tips

- 100B-class MoE models benefit from **multi-GPU** tensor parallelism; interconnect bandwidth (NVLink/IB) matters.
- Long contexts are **KV-cache** heavy — tune `--max-model-len` and batch size.
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA Graphs if stable in your environment.
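
As a concrete starting point for these knobs, here is an illustrative offline-inference sketch with the vLLM Python API (values are examples to tune, not recommendations for specific hardware; vLLM keeps CUDA graphs enabled unless `enforce_eager=True`):

```python
# Illustrative settings mirroring the tips above; tune for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/GLM-4.7_Compressed-Tensors",
    revision="W4A16_GS64",              # pick one of the quant branches
    quantization="compressed-tensors",
    tensor_parallel_size=8,             # spread the MoE across GPUs
    dtype="bfloat16",                   # prefer BF16 where supported
    max_model_len=2048,                 # longer contexts grow the KV cache
    gpu_memory_utilization=0.70,
    enforce_eager=False,                # False keeps CUDA graphs enabled
)

outputs = llm.generate(
    ["Briefly explain tensor parallelism."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```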

---

## Changelog

- **v1 (current)** — Initial **compressed-tensors W4A16** quantization with **512-sample / 2048-token** MoE-aware calibration; branches **W4A16_GS32 / W4A16_GS64 / W4A16_GS128** published; vLLM-ready packaging.