VLA Visualization Demo#

This demo provides a complete pipeline for egocentric video hand action recognition and LeRobot dataset export. It extracts frames from ego-view videos, estimates camera intrinsics and poses, reconstructs 3D hands, computes hand actions, and exports the results as a LeRobot v2.0 dataset.

Pipeline Overview#

Video Input
  │
  ▼
VideoExtractFramesMapper          # Extract video keyframes
  │
  ▼
VideoCameraCalibrationMogeMapper  # MoGe-2 camera calibration + depth estimation
  │
  ▼
VideoHandReconstructionHaworMapper # HaWoR 3D hand reconstruction
  │
  ▼
VideoCameraPoseMegaSaMMapper      # MegaSaM camera pose estimation (⚠️ requires separate conda env)
  │
  ▼
VideoHandActionComputeMapper      # Compute 7-DoF actions + 8-dim states
  │
  ▼
VideoActionCaptioningMapper       # Action instruction captioning
  │
  ▼
ExportToLeRobotMapper             # Export to LeRobot v2.0 dataset

Output Format#

  • Action: 7-dim [dx, dy, dz, droll, dpitch, dyaw, gripper]

  • State: 8-dim [x, y, z, roll, pitch, yaw, pad, gripper]

  • Gripper: ranges from 1.0 (open) to -1.0 (closed), estimated from finger joint angles
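To make the relationship between the two vectors concrete, here is a minimal sketch that derives a 7-dim action from two consecutive 8-dim states. It assumes that actions are per-frame pose deltas with the next frame's gripper value appended, and that angles are in radians; neither assumption is confirmed by this page. The function names are hypothetical and this is not the operator's actual implementation.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians into [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def action_from_states(prev_state, next_state):
    """Build [dx, dy, dz, droll, dpitch, dyaw, gripper] from two states.

    State layout (from this page): [x, y, z, roll, pitch, yaw, pad, gripper].
    Assumption: the action is the pose delta plus the next gripper value.
    """
    if len(prev_state) != 8 or len(next_state) != 8:
        raise ValueError("states must be 8-dim")
    # Position deltas are plain differences.
    d_pos = [next_state[i] - prev_state[i] for i in range(3)]
    # Angle deltas are wrapped so a jump across +/-pi stays small.
    d_rot = [wrap_angle(next_state[i] - prev_state[i]) for i in range(3, 6)]
    # Index 6 is padding; index 7 is the gripper value.
    return d_pos + d_rot + [next_state[7]]
```

Wrapping the angle differences keeps a yaw change from 3.0 rad to -3.0 rad from being reported as a rotation of nearly a full turn.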

Prerequisites#

1. Base Environment#

Build a Docker image from the Dockerfile.

The VideoCameraPoseMegaSaMMapper operator depends on MegaSaM (based on DROID-SLAM). Its CUDA-compiled components (droid_backends, lietorch, torch-scatter) conflict with the main environment, so the operator must run in a separate conda environment.

Note: This environment is automatically activated at runtime via Ray's runtime_env={"conda": "mega-sam"} mechanism. You do not need to manually switch environments. All other operators run in the default environment.
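For readers unfamiliar with the mechanism, the sketch below shows how a Ray task can be pinned to a named conda environment through `runtime_env`. The helper names are hypothetical, and the 0.1 GPU default is taken from the troubleshooting notes on this page; how the pipeline wires this up internally is not shown here.

```python
def megasam_remote_options(num_gpus=0.1, conda_env="mega-sam"):
    """Options for ray.remote: GPU share plus the conda env to activate."""
    return {"num_gpus": num_gpus, "runtime_env": {"conda": conda_env}}


def run_in_megasam_env(fn, *args):
    """Run fn(*args) as a Ray task inside the mega-sam conda environment.

    Ray is imported lazily so this module can be loaded without Ray installed.
    Assumes a Ray cluster is reachable from this process.
    """
    import ray

    # ray.remote(**options) returns a decorator that wraps a plain function.
    remote_fn = ray.remote(**megasam_remote_options())(fn)
    return ray.get(remote_fn.remote(*args))
```

Only the worker that executes the task uses the `mega-sam` environment; the calling process stays in the default environment.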

2. Ray Cluster#

The pipeline runs on Ray, so start a Ray cluster before launching it.

3. MANO Hand Model#

Download MANO v1.2 from the MANO website. Update the mano_right_path and mano_left_path in the config or script to point to your MANO_RIGHT.pkl and MANO_LEFT.pkl files.

Running the Demo#

Option 2: YAML Config#

python tools/process_data.py --config demos/ego_hand_action_annotation/configs/vla_pipeline.yaml

Input Data Format#

Each sample is a JSON object containing a list of video paths:

{
    "videos": ["./data/1018.mp4"],
    "text": "",
    "__dj__meta__": {}
}

The demo includes two sample videos: data/1018.mp4 and data/1034.mp4.
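A small sketch of how such samples could be generated for your own videos. It assumes the dataset file is in JSON Lines form (one sample object per line), which this page does not state explicitly; the function names and output path are hypothetical.

```python
import json
from pathlib import Path


def make_sample(video_path):
    """Build one sample in the format shown above."""
    return {"videos": [str(video_path)], "text": "", "__dj__meta__": {}}


def write_samples(video_paths, out_path):
    """Write one JSON object per line (assumed JSONL layout)."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for video_path in video_paths:
            f.write(json.dumps(make_sample(video_path)) + "\n")
    return out_path
```

For the two bundled videos, `write_samples(["./data/1018.mp4", "./data/1034.mp4"], "data/samples.jsonl")` would produce a two-line file.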

Output Structure#

output/
├── frames/                    # Extracted video frames
├── lerobot_dataset/           # LeRobot v2.0 dataset
│   ├── data/
│   │   └── chunk-000/
│   │       ├── episode_000000.parquet
│   │       └── ...
│   ├── videos/
│   │   └── chunk-000/
│   │       ├── observation.images.main/
│   │       │   ├── episode_000000.mp4
│   │       │   └── ...
│   ├── meta/
│   │   ├── info.json
│   │   ├── episodes.jsonl
│   │   ├── stats.json
│   │   └── tasks.jsonl
│   └── modality.json
└── *.parquet                  # Ray output results
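After a run, a quick check like the following can confirm that the export produced the layout above. This is a minimal sketch that relies only on the file names shown in the tree; the helper name is hypothetical.

```python
from pathlib import Path

EXPECTED_META = ("info.json", "episodes.jsonl", "stats.json", "tasks.jsonl")


def missing_lerobot_paths(output_dir):
    """Return the expected dataset entries that are absent under output_dir."""
    root = Path(output_dir) / "lerobot_dataset"
    expected = [root / "meta" / name for name in EXPECTED_META]
    expected.append(root / "modality.json")
    missing = [p.relative_to(root).as_posix() for p in expected if not p.is_file()]

    # At least one episode file should exist for both data and video.
    data_pattern = "data/chunk-*/episode_*.parquet"
    video_pattern = "videos/chunk-*/observation.images.main/episode_*.mp4"
    for pattern in (data_pattern, video_pattern):
        if not any(root.glob(pattern)):
            missing.append(pattern)
    return missing
```

An empty list means every entry from the tree was found; otherwise the returned names point to the stage that likely failed.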

Visualization Tools#

Two visualization scripts are provided for inspecting processing results:

Action Annotation Verification (vis_hand_action_demo.py)#

Verify hand action annotations with hand trajectory, state, and action value overlays:

python vis_hand_action_demo.py --data_path output/xxx.parquet

Troubleshooting#

MegaSaM compilation fails#

Ensure the CUDA toolkit in the mega-sam conda environment matches the CUDA version your PyTorch build was compiled against. Check the toolkit version with nvcc --version.
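The comparison can be scripted. The sketch below parses the `release X.Y` field from `nvcc --version` output and compares it with a PyTorch CUDA version string such as the one in `torch.version.cuda`. The helper names are hypothetical, and requiring an exact major.minor match is a conservative assumption, not a documented requirement of MegaSaM.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from text containing 'release X.Y', else None."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def nvcc_release():
    """Run nvcc --version and parse it; None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


def matches_torch_cuda(nvcc_rel, torch_cuda):
    """Compare an nvcc release tuple with a string like '12.1'.

    torch_cuda is None for CPU-only PyTorch builds, which never match.
    """
    if nvcc_rel is None or not torch_cuda:
        return False
    major, minor = (int(x) for x in torch_cuda.split(".")[:2])
    return nvcc_rel == (major, minor)
```

Inside the mega-sam environment, `matches_torch_cuda(nvcc_release(), torch.version.cuda)` after `import torch` gives a quick yes or no.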

MANO model loading fails#

Check that mano_right_path and mano_left_path point to valid files. MANO models must be downloaded separately from the official website.
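A minimal sketch for checking both paths before launching the pipeline. The helper name is hypothetical; the parameter names mirror the config keys mentioned above.

```python
from pathlib import Path


def missing_mano_files(mano_right_path, mano_left_path):
    """Return {config_key: path} for each MANO path that is not a file."""
    paths = {
        "mano_right_path": mano_right_path,
        "mano_left_path": mano_left_path,
    }
    return {key: str(p) for key, p in paths.items() if not Path(p).is_file()}
```

An empty dict means both files exist; it does not verify that the pickles are valid MANO v1.2 models.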

Ray GPU resource exhaustion#

Several operators require a GPU. By default each requests 0.1 GPU, so up to 10 operators can share one GPU. If GPU resources run out, adjust num_gpus or add more GPUs.
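The arithmetic behind that default can be sketched as follows. It assumes Ray packs fractional GPU requests purely by summing them up to 1.0 per device and ignores actual GPU memory use; the function names are hypothetical.

```python
import math


def operators_per_gpu(num_gpus_per_op=0.1):
    """How many operators with this fractional request fit on one GPU."""
    if num_gpus_per_op <= 0:
        raise ValueError("num_gpus_per_op must be positive")
    # round() guards against floating-point noise such as 9.999999999.
    return math.floor(round(1.0 / num_gpus_per_op, 9))


def gpus_needed(num_operators, num_gpus_per_op=0.1):
    """Minimum whole GPUs needed to schedule all operators at once."""
    if num_gpus_per_op <= 0:
        raise ValueError("num_gpus_per_op must be positive")
    return math.ceil(round(num_operators * num_gpus_per_op, 9))
```

With the default of 0.1, ten concurrent operators fit on one GPU and an eleventh requires a second one. Raising num_gpus gives each operator a larger logical share but reduces how many can run side by side.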