---
license: other
license_name: hyperclovax
license_link: LICENSE
library_name: transformers
---

![image](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/3gaPG3_F4Fxn-SOZWrmfU.png)

# Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, built on an auto-regressive Transformer architecture for consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities within a 32K context window: established text capabilities, vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward **Any-to-Any-Korean-First** intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.

---

# Technical Report

- [HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)](./HyperCLOVA_X_8B_Omni.pdf)

---

# Basic Information

- **Architecture**: Transformer-based omni-model architecture (dense model)
- **Parameters**: 8B
- **Input Format**: Text / Image / Video / Audio (Speech)
- **Output Format**: Text / Image / Audio (Speech)
- **Context Length**: 32K
- **Knowledge Cutoff**: May 2025

---

# Benchmarks

![Benchmark results](https://cdn-uploads.huggingface.co/production/uploads/646acf46086023e36edce4c4/x1IvD9Rt_NK71CklecpN2.png)

- **Text-to-Text**: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
- **Vision-to-Text**: SEED-IMG, AI2D, K-MMBench
- **Text-to-Vision**: GenEval, ImgEdit
- **Audio-to-Text**: LibriSpeech, KsponSpeech
- **Audio-to-Audio**: FLEURS en2ko, FLEURS ko2en

---

# Examples

## Text-to-Image Generation

![hf_img01](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/6fRekMbt_9ab5I80GTkdG.png)

## Text-based Image Editing

![hf_img02](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/aoecU357A0fVvR8uerozh.png)

![hf_img03](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/0fpcq--rj1kqPa9m8DYgt.png)

![hf_img04](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/Z24JUQZSmeaVNrhDMYG6K.png)

---

# Inference

We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.
## Capabilities

- **Inputs**: Text, Image, Audio, Video
- **Outputs**: Text, Image, Audio (no video generation)

## Requirements

- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
- S3-compatible storage (for image/audio output)

## Installation

```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
  --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
  --input ./models/HyperCLOVAX-SEED-Omni-8B \
  --output ./track_b \
  --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
```

## Basic Usage

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```

## More Examples
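### Text to Text

Plain text chat goes through the same OpenAI-compatible endpoint with no multimodal content parts. The snippet below is a minimal sketch, reusing the server address, model name, and `skip_reasoning` flag from Basic Usage; the prompt itself is only an illustration.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")

# Text-only request: the content field is a plain string, no image/audio parts
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{"role": "user", "content": "Explain what an omni-modal model is in two sentences."}],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```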
### Text to Image

```python
import json

SYSTEM_PROMPT = """You are an AI assistant that generates images.
When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image.
Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
```
### Text to Audio

```python
import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
```
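Generated audio comes back as a URL pointing at the configured S3-compatible storage (see the Architecture and S3 Configuration sections below). Continuing from the example above, the following sketch downloads that URL with `requests` and writes it to disk; the filename and `.wav` extension are assumptions, since the actual format depends on the server configuration.

```python
import requests

# Hypothetical follow-up: fetch the generated audio from the returned S3 URL
# (assumes `audio_url` from the previous example and that the URL is reachable)
resp = requests.get(audio_url, timeout=60)
resp.raise_for_status()

with open("output.wav", "wb") as f:  # extension is an assumption
    f.write(resp.content)
print("Saved generated audio to output.wav")
```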
### Audio Input

```python
import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```
### Video Input

```python
# Video input is passed via the image_url content type
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
```
### Image to Image

```python
import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images.
When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image.
Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
```
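The `t2i_model_generation` tool definition is identical for text-to-image generation and image editing, so it can be factored out. The sketch below is an illustrative refactoring of the two examples above; `T2I_TOOL` and `request_image_generation` are hypothetical names, not part of OmniServe.

```python
import json

# Shared tool schema, copied from the generation/editing examples above
# (the discrete_image_token description is abbreviated here; use the full
#  format string from the examples above in practice)
T2I_TOOL = {
    "type": "function",
    "function": {
        "name": "t2i_model_generation",
        "description": "Generates an RGB image based on the provided discrete image representation.",
        "parameters": {
            "type": "object",
            "required": ["discrete_image_token"],
            "properties": {
                "discrete_image_token": {
                    "type": "string",
                    "description": "A serialized string of discrete vision tokens, encapsulated by special tokens.",
                    "minLength": 1
                }
            }
        }
    }
}

def request_image_generation(client, system_prompt, user_content, max_tokens=7000):
    """Hypothetical helper: send a tool-enabled request and return the discrete image token, or None."""
    response = client.chat.completions.create(
        model="track_b_model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        tools=[T2I_TOOL],
        max_tokens=max_tokens,
        extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
    )
    tool_calls = response.choices[0].message.tool_calls
    if not tool_calls:
        return None
    return json.loads(tool_calls[0].function.arguments)["discrete_image_token"]
```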
### Audio to Audio

```python
import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
```
### Using curl

```bash
# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'
```
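Because responses follow the standard OpenAI chat-completions JSON layout, the generated text can also be pulled out directly on the command line. The sketch below assumes `jq` is installed; otherwise it mirrors the text request above.

```bash
# Extract only the generated text from the JSON response (requires jq)
curl -s -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
    "max_tokens": 128,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }' | jq -r '.choices[0].message.content'
```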
## Architecture

```
User Request (Image / Audio / Video / Text)
                    │
                    ▼
   OmniServe: POST /b/v1/chat/completions
                    │
                    ▼
        [1] INPUT ENCODING
        Vision Encoder ──┐
        Audio Encoder ───┴──▶ embeddings
                    │
                    ▼
              LLM (8B) ◀── text
                    │
                    ▼
        [2] OUTPUT DECODING
        Text    Vision Decoder    Audio Decoder
                      │                 │
                      ▼                 ▼
               Image URL (S3)     Audio URL (S3)
                    │
                    ▼
Response (Text / Image URL / Audio URL)
```

## Hardware Requirements

| Component | GPU | VRAM |
|-----------|-----|------|
| Vision Encoder | 1x | ~8GB |
| Audio Encoder | (shared) | ~4GB |
| LLM (8B) | 1x | ~16GB |
| Vision Decoder | 1x | ~16GB |
| Audio Decoder | (shared) | ~4GB |
| **Total** | **3x** | **~48GB** |

## Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `chat_template_kwargs.skip_reasoning` | Skip reasoning | `true` |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |
| `tools` | Required for image generation | - |

## S3 Configuration

Required for image/audio generation:

```bash
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
```

For more details, see the [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).

---

# Citation

TBU (Technical Report)

---

# Questions

For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.

---

# License

The model is licensed under the [HyperCLOVA X SEED 8B Omni Model License Agreement](./LICENSE).