---
license: other
license_name: hyperclovax
license_link: LICENSE
library_name: transformers
---

# Overview
HyperCLOVA X SEED 8B Omni is a unified multimodal model built on an auto-regressive Transformer architecture that brings text, vision, and speech together for consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities within a 32K context window: established text capabilities, vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward **Any-to-Any-Korean-First** intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and a reference point for future development and scaling.
---
# Technical Report
- [HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)](./HyperCLOVA_X_8B_Omni.pdf)
---
# Basic Information
- **Architecture**: Transformer-based omni-model architecture (Dense Model)
- **Parameters**: 8B
- **Input Format**: Text/Image/Video/Audio (Speech)
- **Output Format**: Text/Image/Audio (Speech)
- **Context Length**: 32K
- **Knowledge Cutoff**: May 2025
---
# Benchmarks

- **Text-to-Text**: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
- **Vision-to-Text**: SEED-IMG, AI2D, K-MMBench
- **Text-to-Vision**: GenEval, ImgEdit
- **Audio-to-Text**: LibriSpeech, KsponSpeech
- **Audio-to-Audio**: Fleurs en2ko, Fleurs ko2en
---
# Examples
## Text-to-Image Generation

## Text-based Image Editing



---
# Inference
We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.
## Capabilities
- **Inputs**: Text, Image, Audio, Video
- **Outputs**: Text, Image, Audio (no video generation)
## Requirements
- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
- S3-compatible storage (for image/audio output)
## Installation
```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
  --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
  --input ./models/HyperCLOVAX-SEED-Omni-8B \
  --output ./track_b \
  --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
```
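Once the containers are up, a quick way to confirm the server is ready is to list the served models through the OpenAI-compatible endpoint. The snippet below is a minimal readiness check, assuming OmniServe exposes the standard `/models` route under the Track B prefix (`/b/v1`).

```python
from openai import OpenAI

# Point the client at the Track B endpoint started by docker compose
client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")

# Listing models succeeds only once the weights have finished loading;
# this assumes the server implements the standard OpenAI /models route.
for model in client.models.list():
    print(model.id)  # expected to include "track_b_model"
```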
## Basic Usage
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```
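Plain text chat goes through the same endpoint; the `content` field can simply be a string instead of a list of typed parts. A minimal text-only request, reusing the client above:

```python
# Text-only chat completion (no image, audio, or video parts)
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{"role": "user", "content": "Summarize what HyperCLOVA X SEED 8B Omni can do in one sentence."}],
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
)
print(response.choices[0].message.content)
```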
## More Examples
### Text to Image
```python
import json

SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
```
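The tool call returns the image as a serialized string of discrete vision tokens rather than pixels. If you want to sanity-check the payload before decoding it, the sketch below parses the string against the format given in the tool description (framing tokens, aspect-ratio marker, and `<|vision...|>` tokens); the helper name and regular expressions are illustrative, not part of OmniServe, and `args` is reused from the example above.

```python
import re

def inspect_discrete_image_token(token_str: str) -> None:
    """Illustrative helper: checks framing tokens and counts vision tokens."""
    # Framing tokens per the tool's parameter description
    if not (token_str.startswith("<|discrete_image_start|>") and token_str.endswith("<|discrete_image_end|>")):
        raise ValueError("unexpected framing tokens")

    # Aspect-ratio marker, e.g. <|vision_ratio_4:3|>
    ratio = re.search(r"<\|vision_ratio_([0-9]+:[0-9]+)\|>", token_str)
    # Individual discrete vision tokens such as <|visionaaaaa|>
    vision_tokens = re.findall(r"<\|vision[a-z]+\|>", token_str)

    print(f"aspect ratio: {ratio.group(1) if ratio else 'unknown'}")
    print(f"vision tokens: {len(vision_tokens)}")

inspect_discrete_image_token(args["discrete_image_token"])
```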
### Text to Audio
```python
import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    # message.audio.data carries a base64-encoded URL to the generated audio file
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
```
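The decoded value is an object-storage URL produced by the audio decoder, so fetching the waveform is an ordinary HTTP download. A small sketch, assuming the returned S3 URL is reachable (e.g. presigned) from your machine:

```python
import urllib.request

# Download the generated audio referenced by audio_url from the example above;
# the local file name and extension here are illustrative.
with urllib.request.urlopen(audio_url) as resp, open("generated_audio.wav", "wb") as f:
    f.write(resp.read())
print("saved to generated_audio.wav")
```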
### Audio Input
```python
import base64

# The audio is referenced by URL; the URL string itself is base64-encoded
audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```
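The example above passes the audio as a base64-encoded URL string, which is the URL-reference form shown throughout these examples. If you instead want to send a local file using the standard OpenAI `input_audio` convention (base64-encoded raw bytes), something like the following may work; whether OmniServe accepts raw bytes in addition to URL references is an assumption, so treat this as a sketch.

```python
import base64

# Assumption: the server also accepts base64-encoded raw audio bytes,
# as in the standard OpenAI input_audio format (not confirmed for OmniServe).
with open("local_audio.mp3", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
)
print(response.choices[0].message.content)
```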
### Video Input
```python
# Videos are passed through the image_url content type
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```
### Image to Image
```python
import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
```
### Audio to Audio
```python
import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    # message.audio.data carries a base64-encoded URL to the generated audio file
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
```
### Using curl
```bash
# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'
```
## Architecture
```
User Request
(Image/Audio/Video/Text)
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ OmniServe │
│ POST /b/v1/chat/completions │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [1] INPUT ENCODING │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Vision Encoder │ │ Audio Encoder │ │ │
│ │ └────────┬────────┘ └────────┬────────┘ │ │
│ │ │ │ │ │
│ │ └────────────┬────────────────────┘ │ │
│ │ │ embeddings │ │
│ └──────────────────────────┼───────────────────────────────────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ LLM (8B) │◀──── text │
│ └──────┬───────┘ │
│ │ │
│ ┌─────────────────────────┼────────────────────────────────────────┐ │
│ │ [2] OUTPUT DECODING │ │
│ │ │ │ │
│ │ ┌──────────────┼──────────────┐ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Text │ │ Vision │ │ Audio │ │ │
│ │ │ │ │ Decoder │ │ Decoder │ │ │
│ │ └───────────┘ └─────┬─────┘ └─────┬─────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ Image URL Audio URL │ │
│ │ (S3) (S3) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
Response
(Text / Image URL / Audio URL)
```
## Hardware Requirements
| Component | GPU | VRAM |
|-----------|-----|------|
| Vision Encoder | 1x | ~8GB |
| Audio Encoder | (shared) | ~4GB |
| LLM (8B) | 1x | ~16GB |
| Vision Decoder | 1x | ~16GB |
| Audio Decoder | (shared) | ~4GB |
| **Total** | **3x** | **~48GB** |
## Key Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| `chat_template_kwargs.skip_reasoning` | Skip reasoning | `true` |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |
| `tools` | Required for image generation | - |
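
These parameters map directly onto request fields. For instance, a request that samples more conservatively and keeps the model's reasoning in the output might look like the sketch below; that `skip_reasoning: False` makes the reasoning visible is inferred from the parameter name, not confirmed here.

```python
# Combining the key parameters in one request; skip_reasoning=False is assumed
# to let the model emit its reasoning before the final answer.
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=512,
    temperature=0.2,
    extra_body={"chat_template_kwargs": {"skip_reasoning": False}},
)
print(response.choices[0].message.content)
```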
## S3 Configuration
Required for image/audio generation:
```bash
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
```
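Before starting the stack, it can be worth confirming that the credentials in `.env` actually reach the bucket. A minimal connectivity check using `boto3` (an extra dependency, not part of OmniServe); the environment variable names match the `.env` keys above:

```python
import os
import boto3

# Reads the same variables as OmniServe's .env file
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["NCP_S3_ENDPOINT"],
    aws_access_key_id=os.environ["NCP_S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["NCP_S3_SECRET_KEY"],
)
# head_bucket raises if the bucket is missing or the credentials are rejected
s3.head_bucket(Bucket=os.environ["NCP_S3_BUCKET_NAME"])
print("S3 configuration looks valid")
```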
For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).
---
# Citation
TBU (Technical Report)
---
# Questions
For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.
---
# License
The model is licensed under the [HyperCLOVA X SEED 8B Omni Model License Agreement](./LICENSE).