GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Paper: arXiv:2503.14734
This is a finetuned version of NVIDIA's GR00T-N1.5-3B foundation model, specifically adapted for SO-101 robot arm table cleanup tasks. The model has been trained on a custom dataset of 80 episodes (47,513 frames) demonstrating table cleanup behaviors using dual-camera observations.
The model adapts the GR00T-N1.5-3B backbone to a custom SO-101 embodiment (registered as new_embodiment) using the so100_dualcam dual-camera data configuration. The example below shows how to load the finetuned policy and run inference:
from gr00t.model.policy import Gr00tPolicy
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP
import numpy as np

# Build the modality config and transforms for the dual-camera SO-101 setup
# (same data config as passed to the inference script below)
data_config = DATA_CONFIG_MAP["so100_dualcam"]
modality_config = data_config.modality_config()
transforms = data_config.transform()

# Load the finetuned model
policy = Gr00tPolicy(
    model_path="path/to/your/finetuned/model",
    modality_config=modality_config,
    modality_transform=transforms,
    embodiment_tag=EmbodimentTag.NEW_EMBODIMENT,  # Custom SO-101 embodiment
    device="cuda",
)
# Prepare observation in correct format
obs = {
"video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
"state.gripper": np.array([[-2.5943134e-10]]),
"annotation.human.task_description": ["pick up the cup from the table"],
}
# Run inference
action_chunk = policy.get_action(obs)
# Returns: {"action.single_arm": (16, 5), "action.gripper": (16,)}
# Start inference server
python scripts/inference_service.py \
--model-path /path/to/your/finetuned/model \
--server \
--data-config so100_dualcam \
--embodiment-tag new_embodiment
# Run client for evaluation
python scripts/inference_service.py --client
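For remote deployment, a lightweight Python client can send observations to the running server instead of loading the policy locally. A minimal sketch, assuming the ZMQ-based ExternalRobotInferenceClient from the Isaac-GR00T evaluation utilities and the default host/port; the exact import path and constructor arguments may differ in your version of the repository:

import numpy as np
from gr00t.eval.service import ExternalRobotInferenceClient  # assumed location in Isaac-GR00T

# Connect to the inference server started above (host and port are assumptions)
client = ExternalRobotInferenceClient(host="localhost", port=5555)

# Same observation format as for the local policy
obs = {
    "video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.zeros((1, 5)),
    "state.gripper": np.zeros((1, 1)),
    "annotation.human.task_description": ["pick up the cup from the table"],
}
action_chunk = client.get_action(obs)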
Input Format:
obs = {
"video.front": np.ndarray, # (1, 480, 640, 3) front camera frames
"video.wrist": np.ndarray, # (1, 480, 640, 3) wrist camera frames
"state.single_arm": np.ndarray, # (1, 5) normalized joint positions [pan, lift, elbow, wrist_flex, wrist_roll]
"state.gripper": np.ndarray, # (1, 1) normalized gripper position
"annotation.human.task_description": [str] # Language instruction as list
}
Example Input:
obs = {
"video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
"state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
"state.gripper": np.array([[-2.5943134e-10]]),
"annotation.human.task_description": ["pick up the cup from the table"],
}
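Before calling the policy, it can help to sanity-check that an observation matches the documented schema. A minimal sketch based only on the shapes listed above; check_obs is an illustrative helper, not part of the GR00T API:

import numpy as np

def check_obs(obs):
    # Validate an SO-101 dual-camera observation against the documented schema
    assert obs["video.front"].shape == (1, 480, 640, 3) and obs["video.front"].dtype == np.uint8
    assert obs["video.wrist"].shape == (1, 480, 640, 3) and obs["video.wrist"].dtype == np.uint8
    assert obs["state.single_arm"].shape == (1, 5)  # [pan, lift, elbow, wrist_flex, wrist_roll]
    assert obs["state.gripper"].shape == (1, 1)
    assert isinstance(obs["annotation.human.task_description"], list)

check_obs(obs)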
Output Format:
{
"action.single_arm": np.ndarray, # (16, 5) action horizon × joint deltas
"action.gripper": np.ndarray # (16,) gripper actions over horizon
}
Example Output:
{
"action.single_arm": np.array([
[ -8.586029, -19.553513, 21.609299, 62.635612, 1.1289978],
[ -7.937084, -20.358055, 23.032593, 63.238243, 2.4241562],
# ... 14 more timesteps
[ -6.097515, -48.115795, 44.834297, 64.58016, 3.4453125]
]), # Shape: (16, 5)
"action.gripper": np.array([
15.843709, 14.624388, 11.970064, 11.224205, 10.709647, 10.889213,
7.1574492, 7.342965, 4.87047, 3.5736556, 2.6360812, 0.70475096,
1.0765471, 0.68617123, 0.5908453, 0.559207
]) # Shape: (16,)
}
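The 16-step action chunk is typically executed open-loop before the policy is queried again with a fresh observation. A minimal sketch of that loop; send_joint_command is a hypothetical stand-in for your SO-101 driver, not part of this model or the GR00T API:

import numpy as np

def send_joint_command(command):
    # Hypothetical placeholder for the SO-101 driver; replace with your robot's API
    print("sending command:", np.round(command, 3))

action_chunk = policy.get_action(obs)
arm_actions = action_chunk["action.single_arm"]   # (16, 5) arm joint targets
gripper_actions = action_chunk["action.gripper"]  # (16,) gripper targets

for t in range(arm_actions.shape[0]):
    # Combine the five arm joints and the gripper into one 6-DoF command per step
    command = np.concatenate([arm_actions[t], gripper_actions[t:t + 1]])
    send_joint_command(command)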
If you use this model, please cite the original GR00T paper:
@article{gr00tn1_2025,
  title={GR00T N1: An Open Foundation Model for Generalist Humanoid Robots},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2503.14734},
  year={2025}
}
This model is released under the Apache License 2.0. See the LICENSE file for details.
Base model: nvidia/GR00T-N1.5-3B