VLA Visualization Demo#
This demo provides a complete pipeline for egocentric video hand action recognition and LeRobot dataset export. It extracts frames from ego-view videos, estimates camera intrinsics and poses, reconstructs 3D hands, computes hand actions, and exports the results as a LeRobot v2.0 dataset.
Pipeline Overview#
Video Input
│
▼
VideoExtractFramesMapper # Extract video keyframes
│
▼
VideoCameraCalibrationMogeMapper # MoGe-2 camera calibration + depth estimation
│
▼
VideoHandReconstructionHaworMapper # HaWoR 3D hand reconstruction
│
▼
VideoCameraPoseMegaSaMMapper # MegaSaM camera pose estimation (⚠️ requires separate conda env)
│
▼
VideoHandActionComputeMapper # Compute 7-DoF actions + 8-dim states
│
▼
VideoActionCaptioningMapper # Action instruction captioning
│
▼
ExportToLeRobotMapper # Export to LeRobot v2.0 dataset
Output Format#
Action: 7-dim [dx, dy, dz, droll, dpitch, dyaw, gripper]
State: 8-dim [x, y, z, roll, pitch, yaw, pad, gripper]
Gripper: 1.0 (open) to -1.0 (closed), estimated from finger joint angles
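To make the two vector layouts concrete, the following is a minimal sketch of how an 8-dim state and a 7-dim action could be assembled. It is not the implementation used by VideoHandActionComputeMapper. It assumes that actions are plain frame-to-frame differences of consecutive states, that angles are in radians, that the pad slot holds 0.0, and that the action's gripper value is taken from the next state. The helper names are hypothetical.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def make_state(xyz, rpy, gripper):
    """Build an 8-dim state [x, y, z, roll, pitch, yaw, pad, gripper].

    The pad slot is filled with 0.0 here (assumption).
    """
    return [*xyz, *rpy, 0.0, gripper]


def make_action(prev_state, next_state):
    """Build a 7-dim action [dx, dy, dz, droll, dpitch, dyaw, gripper].

    Deltas are simple differences between consecutive states (assumption);
    rotation deltas are wrapped so a jump across +/-pi stays small.
    """
    dpos = [n - p for p, n in zip(prev_state[:3], next_state[:3])]
    drot = [wrap_angle(n - p) for p, n in zip(prev_state[3:6], next_state[3:6])]
    return [*dpos, *drot, next_state[7]]


s0 = make_state([0.0, 0.0, 0.0], [0.0, 0.0, 3.0], 1.0)   # hand open
s1 = make_state([0.1, 0.0, 0.0], [0.0, 0.0, -3.0], -1.0)  # hand closed
action = make_action(s0, s1)
```

In this example the yaw moves from 3.0 to -3.0 rad, and wrapping turns the raw difference of -6.0 into a short rotation of about 0.28 rad.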
Prerequisites#
1. Base Environment#
Build a Docker image from the provided Dockerfile.
The VideoCameraPoseMegaSaMMapper operator depends on MegaSaM (based on DROID-SLAM). Its CUDA-compiled components (droid_backends, lietorch, torch-scatter) conflict with the main environment, so the operator must run in a separate conda environment.
Note: This environment is activated automatically at runtime via Ray's runtime_env={"conda": "mega-sam"} mechanism. You do not need to switch environments manually. All other operators run in the default environment.
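As an illustration of the mechanism described in the note, the sketch below shows how a Ray task can be pinned to an existing conda environment by name. It is not the pipeline's own code. The helper names `megasam_runtime_env` and `probe_interpreter` are hypothetical, and the probe assumes Ray is installed and a conda environment called mega-sam already exists on the node.

```python
def megasam_runtime_env(env_name="mega-sam"):
    """Build a Ray runtime_env dict that selects an existing conda env by name."""
    return {"conda": env_name}


def probe_interpreter(env_name="mega-sam"):
    """Return the Python executable that a Ray task sees inside the conda env.

    Ray is imported lazily so the helper above stays usable without it.
    """
    import ray

    @ray.remote(runtime_env=megasam_runtime_env(env_name))
    def _which_python():
        import sys
        return sys.executable

    # Ray starts or connects to a cluster on the first remote call if needed.
    return ray.get(_which_python.remote())
```

Calling `probe_interpreter()` on a working setup should return an interpreter path inside the mega-sam environment, which is a quick way to confirm the environment is picked up without switching to it by hand.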
2. Ray Cluster#
The pipeline runs on Ray, so start a Ray cluster before launching it.
3. MANO Hand Model#
Download MANO v1.2 from the MANO website. Then update mano_right_path and mano_left_path in the config or script so that they point to your MANO_RIGHT.pkl and MANO_LEFT.pkl files.
Running the Demo#
Option 1: Python Script (Recommended)#
cd demos/ego_hand_action_annotation
python vla_pipeline.py
Option 2: YAML Config#
python tools/process_data.py --config demos/ego_hand_action_annotation/configs/vla_pipeline.yaml
Input Data Format#
Each sample is a JSON object containing a video path list:
{
"videos": ["./data/1018.mp4"],
"text": "",
"__dj__meta__": {}
}
The demo includes two sample videos: data/1018.mp4 and data/1034.mp4.
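A minimal sketch for generating input samples in the shape shown above, one per video. It assumes the samples are stored as JSON Lines (one object per line); the function name and the file name `demo_input.jsonl` are hypothetical and not part of the demo.

```python
import json
from pathlib import Path


def write_samples(video_paths, out_file):
    """Write one JSON object per line, one sample per video path."""
    out_file = Path(out_file)
    with out_file.open("w", encoding="utf-8") as f:
        for path in video_paths:
            # Same keys as the sample above: a video path list, empty text, empty meta.
            sample = {"videos": [path], "text": "", "__dj__meta__": {}}
            f.write(json.dumps(sample) + "\n")
    return out_file


# Example usage (commented out so importing this snippet writes no files):
# write_samples(["./data/1018.mp4", "./data/1034.mp4"], "demo_input.jsonl")
```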
Output Structure#
output/
├── frames/ # Extracted video frames
├── lerobot_dataset/ # LeRobot v2.0 dataset
│ ├── data/
│ │ └── chunk-000/
│ │ ├── episode_000000.parquet
│ │ └── ...
│ ├── videos/
│ │ └── chunk-000/
│ │ ├── observation.images.main/
│ │ │ ├── episode_000000.mp4
│ │ │ └── ...
│ ├── meta/
│ │ ├── info.json
│ │ ├── episodes.jsonl
│ │ ├── stats.json
│ │ └── tasks.jsonl
│ └── modality.json
└── *.parquet # Ray output results
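After a run, a quick sanity check of the exported layout can help. The sketch below walks the tree shown above using only the standard library; it lists episode files and reports which metadata files are missing. The function name is hypothetical, and the check only looks at file names, not at file contents.

```python
from pathlib import Path

# Metadata files expected under meta/ according to the tree above.
EXPECTED_META = ("info.json", "episodes.jsonl", "stats.json", "tasks.jsonl")


def inspect_lerobot_dataset(dataset_root):
    """Return (episode names, missing meta file names) for an exported dataset."""
    root = Path(dataset_root)
    episodes = sorted(
        p.stem for p in (root / "data").glob("chunk-*/episode_*.parquet")
    )
    missing_meta = [
        name for name in EXPECTED_META if not (root / "meta" / name).is_file()
    ]
    return episodes, missing_meta


# Example usage:
# episodes, missing = inspect_lerobot_dataset("output/lerobot_dataset")
```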
Visualization Tools#
Two visualization scripts are provided for inspecting processing results:
Action Annotation Verification (vis_hand_action_demo.py)#
Verify hand action annotations with hand trajectory, state, and action value overlays:
python vis_hand_action_demo.py --data_path output/xxx.parquet
Troubleshooting#
MegaSaM compilation fails#
Ensure the mega-sam conda environment has a CUDA toolkit that matches the CUDA version of your PyTorch build. Check the toolkit version with nvcc --version and compare it with torch.version.cuda.
MANO model loading fails#
Check that mano_right_path and mano_left_path point to valid files. MANO models must be downloaded separately from the official website.
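The path check can be scripted before launching the pipeline. The following is a small sketch using only the standard library; the function name is hypothetical, and it verifies only that the files exist and carry a .pkl suffix, not that they are valid MANO models.

```python
from pathlib import Path


def check_mano_paths(mano_right_path, mano_left_path):
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    for label, raw in (
        ("mano_right_path", mano_right_path),
        ("mano_left_path", mano_left_path),
    ):
        path = Path(raw).expanduser()
        if not path.is_file():
            problems.append(f"{label}: no file at {path}")
        elif path.suffix != ".pkl":
            problems.append(f"{label}: expected a .pkl file, got {path.name}")
    return problems


# Example usage (the paths are placeholders for your own download location):
# for problem in check_mano_paths("~/mano/MANO_RIGHT.pkl", "~/mano/MANO_LEFT.pkl"):
#     print(problem)
```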
Ray GPU resource exhaustion#
Several operators require a GPU. By default each requests 0.1 GPU, so up to 10 operators can share one GPU. Adjust num_gpus or add more GPUs if needed.
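The arithmetic behind the fractional GPU setting can be sketched as follows. This is a back-of-the-envelope helper, not Ray's scheduler: the function names are hypothetical, and it assumes every operator holds its GPU fraction at the same time. Fractional requests are a logical reservation, so actual GPU memory use still has to fit on the device.

```python
import math

_EPS = 1e-9  # guards against floating-point noise such as 7 * 0.1 = 0.7000000000000001


def operators_per_gpu(num_gpus_per_op=0.1):
    """How many operators with this GPU fraction fit on one GPU."""
    return math.floor(1.0 / num_gpus_per_op + _EPS)


def gpus_needed(n_operators, num_gpus_per_op=0.1):
    """How many whole GPUs are needed if all operators run concurrently."""
    return math.ceil(n_operators * num_gpus_per_op - _EPS)


default_capacity = operators_per_gpu()   # default fraction of 0.1
heavier_setup = gpus_needed(10, 0.25)    # ten operators at a quarter GPU each
```

With the default of 0.1, ten operators fit on one GPU; raising the fraction to 0.25 for ten operators would call for three GPUs.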