GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Paper: arXiv:2503.14734
This is a finetuned version of NVIDIA's GR00T-N1.5-3B foundation model, specifically adapted for SO-101 robot arm table cleanup tasks. The model has been trained on a custom dataset of 80 episodes (47,513 frames) demonstrating table cleanup behaviors using dual-camera observations.
The model adapts the GR00T-N1.5-3B backbone to a custom SO-101 embodiment (registered as new_embodiment) using the so100_dualcam dual-camera data configuration. The example below shows how to load the finetuned policy and run inference:
from gr00t.model.policy import Gr00tPolicy
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP
import numpy as np

# Build the modality config and transforms for the dual-camera SO-101 setup
# (same data config as passed to the inference script below)
data_config = DATA_CONFIG_MAP["so100_dualcam"]
modality_config = data_config.modality_config()
transforms = data_config.transform()

# Load the finetuned model
policy = Gr00tPolicy(
    model_path="path/to/your/finetuned/model",
    modality_config=modality_config,
    modality_transform=transforms,
    embodiment_tag=EmbodimentTag.NEW_EMBODIMENT,  # Custom SO-101 embodiment
    device="cuda",
)
# Prepare observation in correct format
obs = {
"video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
"state.gripper": np.array([[-2.5943134e-10]]),
"annotation.human.task_description": ["pick up the cup from the table"],
}
# Run inference
action_chunk = policy.get_action(obs)
# Returns: {"action.single_arm": (16, 5), "action.gripper": (16,)}
# Start inference server
python scripts/inference_service.py \
--model-path /path/to/your/finetuned/model \
--server \
--data-config so100_dualcam \
--embodiment-tag new_embodiment
# Run client for evaluation
python scripts/inference_service.py --client
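For remote deployment, a lightweight Python client can send observations to the running server instead of loading the policy locally. A minimal sketch, assuming the ZMQ-based ExternalRobotInferenceClient from the Isaac-GR00T evaluation utilities and the default host/port; the exact import path and constructor arguments may differ in your version of the repository:

import numpy as np
from gr00t.eval.service import ExternalRobotInferenceClient  # assumed location in Isaac-GR00T

# Connect to the inference server started above (host and port are assumptions)
client = ExternalRobotInferenceClient(host="localhost", port=5555)

# Same observation format as for the local policy
obs = {
    "video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.zeros((1, 5)),
    "state.gripper": np.zeros((1, 1)),
    "annotation.human.task_description": ["pick up the cup from the table"],
}
action_chunk = client.get_action(obs)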
Input Format:
obs = {
"video.front": np.ndarray, # (1, 480, 640, 3) front camera frames
"video.wrist": np.ndarray, # (1, 480, 640, 3) wrist camera frames
"state.single_arm": np.ndarray, # (1, 5) normalized joint positions [pan, lift, elbow, wrist_flex, wrist_roll]
"state.gripper": np.ndarray, # (1, 1) normalized gripper position
"annotation.human.task_description": [str] # Language instruction as list
}
Example Input:
obs = {
"video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
"state.gripper": np.array([[-2.5943134e-10]]),
"annotation.human.task_description": ["pick up the cup from the table"],
}
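Before calling the policy, it can help to sanity-check that an observation matches the documented schema. A minimal sketch based only on the shapes listed above; check_obs is an illustrative helper, not part of the GR00T API:

import numpy as np

def check_obs(obs):
    # Validate an SO-101 dual-camera observation against the documented schema
    assert obs["video.front"].shape == (1, 480, 640, 3) and obs["video.front"].dtype == np.uint8
    assert obs["video.wrist"].shape == (1, 480, 640, 3) and obs["video.wrist"].dtype == np.uint8
    assert obs["state.single_arm"].shape == (1, 5)  # [pan, lift, elbow, wrist_flex, wrist_roll]
    assert obs["state.gripper"].shape == (1, 1)
    assert isinstance(obs["annotation.human.task_description"], list)

check_obs(obs)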
Output Format:
{
"action.single_arm": np.ndarray, # (16, 5) action horizon × joint deltas
"action.gripper": np.ndarray # (16,) gripper actions over horizon
}
Example Output:
{
"action.single_arm": np.array([
[ -8.586029, -19.553513, 21.609299, 62.635612, 1.1289978],
[ -7.937084, -20.358055, 23.032593, 63.238243, 2.4241562],
# ... 14 more timesteps
[ -6.097515, -48.115795, 44.834297, 64.58016, 3.4453125]
]), # Shape: (16, 5)
"action.gripper": np.array([
15.843709, 14.624388, 11.970064, 11.224205, 10.709647, 10.889213,
7.1574492, 7.342965, 4.87047, 3.5736556, 2.6360812, 0.70475096,
1.0765471, 0.68617123, 0.5908453, 0.559207
]) # Shape: (16,)
}
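The 16-step action chunk is typically executed open-loop before the policy is queried again with a fresh observation. A minimal sketch of that loop; send_joint_command is a hypothetical stand-in for your SO-101 driver, not part of this model or the GR00T API:

import numpy as np

def send_joint_command(command):
    # Hypothetical placeholder for the SO-101 driver; replace with your robot's API
    print("sending command:", np.round(command, 3))

action_chunk = policy.get_action(obs)
arm_actions = action_chunk["action.single_arm"]   # (16, 5) arm joint targets
gripper_actions = action_chunk["action.gripper"]  # (16,) gripper targets

for t in range(arm_actions.shape[0]):
    # Combine the five arm joints and the gripper into one 6-DoF command per step
    command = np.concatenate([arm_actions[t], gripper_actions[t:t + 1]])
    send_joint_command(command)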
If you use this model, please cite the original GR00T paper:
@article{gr00tn1_2025,
  title={GR00T N1: An Open Foundation Model for Generalist Humanoid Robots},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2503.14734},
  year={2025}
}
This model is released under the Apache License 2.0. See the LICENSE file for details.
Base model: nvidia/GR00T-N1.5-3B