# VLA Visualization Demo

This demo provides a complete pipeline for **egocentric video hand action recognition and LeRobot dataset export**. It extracts frames from ego-view videos, estimates camera intrinsics and poses, reconstructs 3D hands, computes hand actions, and exports the results as a [LeRobot v2.0](https://github.com/huggingface/lerobot) dataset.

## Pipeline Overview

```
Video Input
  │
  ▼
VideoExtractFramesMapper          # Extract video keyframes
  │
  ▼
VideoCameraCalibrationMogeMapper  # MoGe-2 camera calibration + depth estimation
  │
  ▼
VideoHandReconstructionHaworMapper # HaWoR 3D hand reconstruction
  │
  ▼
VideoCameraPoseMegaSaMMapper      # MegaSaM camera pose estimation (⚠️ requires separate conda env)
  │
  ▼
VideoHandActionComputeMapper      # Compute 7-DoF actions + 8-dim states
  │
  ▼
VideoActionCaptioningMapper       # Generate action instruction captions
  │
  ▼
ExportToLeRobotMapper             # Export to LeRobot v2.0 dataset
```

## Output Format

- **Action**: 7-dim `[dx, dy, dz, droll, dpitch, dyaw, gripper]`
- **State**: 8-dim `[x, y, z, roll, pitch, yaw, pad, gripper]`
- **Gripper**: 1.0 (open) to -1.0 (closed), estimated from finger joint angles
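
The `d` prefixes suggest that each action is the change between two consecutive states, although the computation itself is not spelled out here. The following is a minimal sketch under that assumption. It also assumes angles in radians, a per-axis Euler difference (a simplification of a true relative rotation), and a gripper command taken from the later state. `wrap_angle` and `states_to_action` are hypothetical helpers, not part of the pipeline.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def states_to_action(prev_state, next_state):
    """Build a 7-dim action from two consecutive 8-dim states.

    State layout:  [x, y, z, roll, pitch, yaw, pad, gripper]
    Action layout: [dx, dy, dz, droll, dpitch, dyaw, gripper]
    """
    if len(prev_state) != 8 or len(next_state) != 8:
        raise ValueError("states must have 8 dimensions")
    # Position delta: plain difference.
    dpos = [n - p for p, n in zip(prev_state[:3], next_state[:3])]
    # Rotation delta: wrapped so that a jump across +/-pi stays small.
    drot = [wrap_angle(n - p) for p, n in zip(prev_state[3:6], next_state[3:6])]
    # Assumption: gripper command = later state's gripper; the pad slot (index 6) is dropped.
    return dpos + drot + [next_state[7]]


prev = [0.0, 0.0, 0.0, 0.0, 0.0, 3.1, 0.0, 1.0]
nxt = [0.1, 0.0, -0.05, 0.0, 0.0, -3.1, 0.0, -1.0]
action = states_to_action(prev, nxt)
```

In this example the yaw crosses the ±π boundary, so the wrapped `dyaw` is a small positive step rather than a jump of about -6.2 rad.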

## Prerequisites

### 1. Base Environment

Build a Docker image from the provided Dockerfile.

The `VideoCameraPoseMegaSaMMapper` operator depends on MegaSaM (based on DROID-SLAM). Its CUDA-compiled components (`droid_backends`, `lietorch`, `torch-scatter`) **conflict with the main environment**, so this operator must run in a separate conda environment named `mega-sam`.

> **Note**: Ray activates this environment automatically at runtime via `runtime_env={"conda": "mega-sam"}`, so you do not need to switch environments manually. All other operators run in the default environment.
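
To illustrate the idea of per-operator environments, here is a small sketch that maps operator names to Ray options. How the framework wires this up internally is not shown in this README. `ray_options_for` is a hypothetical helper, and only the `runtime_env={"conda": "mega-sam"}` value comes from the note above.

```python
MEGASAM_OP = "VideoCameraPoseMegaSaMMapper"


def ray_options_for(op_name, conda_env="mega-sam"):
    """Return keyword options for a Ray task or actor running `op_name`."""
    if op_name == MEGASAM_OP:
        # Only MegaSaM needs the isolated conda environment.
        return {"runtime_env": {"conda": conda_env}}
    # Every other operator runs in the default environment.
    return {}


options = {
    name: ray_options_for(name)
    for name in ("VideoHandReconstructionHaworMapper", MEGASAM_OP)
}
```

With Ray installed, the MegaSaM entry corresponds to a call such as `some_task.options(runtime_env={"conda": "mega-sam"})`, where `some_task` is a placeholder for a remote function.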


### 2. Ray Cluster

The pipeline runs on Ray, so start a Ray cluster before launching it. On a single machine, `ray start --head` starts a local head node.

### 3. MANO Hand Model

Download MANO v1.2 from the [MANO website](https://mano.is.tue.mpg.de/). Then set `mano_right_path` and `mano_left_path` in the YAML config or the Python script to the locations of your `MANO_RIGHT.pkl` and `MANO_LEFT.pkl` files.


## Running the Demo

### Option 1: Python Script (Recommended)

```bash
cd demos/ego_hand_action_annotation
python vla_pipeline.py
```

### Option 2: YAML Config

Run this from the directory that contains both `tools/` and `demos/`:

```bash
python tools/process_data.py --config demos/ego_hand_action_annotation/configs/vla_pipeline.yaml
```

## Input Data Format

Each sample is a JSON object whose `videos` field holds a list of video paths:

```json
{
    "videos": ["./data/1018.mp4"],
    "text": "",
    "__dj__meta__": {}
}
```

The demo includes two sample videos: `data/1018.mp4` and `data/1034.mp4`.
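
If you want to build an input file for your own videos, a sketch along the following lines may help. It assumes the pipeline reads one JSON object per line (JSON Lines), which is not stated above. `build_samples` and `write_jsonl` are hypothetical helpers.

```python
import json
from pathlib import Path


def build_samples(video_paths):
    """Create one sample per video, mirroring the JSON layout above."""
    return [{"videos": [str(p)], "text": "", "__dj__meta__": {}} for p in video_paths]


def write_jsonl(samples, out_path):
    """Write one JSON object per line and return the output path."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
    return out_path


samples = build_samples(["./data/1018.mp4", "./data/1034.mp4"])
```

Calling `write_jsonl(samples, "my_input.jsonl")` would then produce a two-line file; the file name is only a placeholder.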

## Output Structure

```
output/
├── frames/                    # Extracted video frames
├── lerobot_dataset/           # LeRobot v2.0 dataset
│   ├── data/
│   │   └── chunk-000/
│   │       ├── episode_000000.parquet
│   │       └── ...
│   ├── videos/
│   │   └── chunk-000/
│   │       ├── observation.images.main/
│   │       │   ├── episode_000000.mp4
│   │       │   └── ...
│   ├── meta/
│   │   ├── info.json
│   │   ├── episodes.jsonl
│   │   ├── stats.json
│   │   └── tasks.jsonl
│   └── modality.json
└── *.parquet                  # Ray output results
```
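
After a run, a quick sanity check of the exported tree can be done with the standard library alone. The sketch below relies only on the directory layout shown above and makes no assumption about the keys inside `info.json`. `summarize_lerobot_dataset` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_lerobot_dataset(root):
    """Collect basic facts about an exported dataset directory."""
    root = Path(root)
    info_path = root / "meta" / "info.json"
    episodes_path = root / "meta" / "episodes.jsonl"

    info = {}
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))

    episodes = []
    if episodes_path.exists():
        with episodes_path.open(encoding="utf-8") as f:
            # One JSON object per non-empty line.
            episodes = [json.loads(line) for line in f if line.strip()]

    return {
        "info_keys": sorted(info),
        "num_episodes_in_meta": len(episodes),
        "parquet_files": sorted(p.name for p in root.glob("data/chunk-*/episode_*.parquet")),
        "video_files": sorted(p.name for p in root.glob("videos/chunk-*/*/episode_*.mp4")),
    }
```

For example, `summarize_lerobot_dataset("output/lerobot_dataset")` lets you compare the number of episodes listed in `meta/episodes.jsonl` with the number of Parquet and video files on disk.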

## Visualization Tools

Visualization scripts are provided for inspecting processing results:

### Action Annotation Verification (vis_hand_action_demo.py)

Verify hand action annotations by overlaying the hand trajectory, state, and action values. Pass one of the Parquet result files under `output/` as `--data_path`:

```bash
python vis_hand_action_demo.py --data_path output/xxx.parquet
```

## Troubleshooting

### MegaSaM compilation fails
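Ensure the `mega-sam` conda environment has a CUDA toolkit that matches the CUDA version your PyTorch build was compiled against. Check the toolkit with `nvcc --version` and PyTorch with `python -c "import torch; print(torch.version.cuda)"`.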

### MANO model loading fails
Check that `mano_right_path` and `mano_left_path` point to valid files. MANO models must be downloaded separately from the official website.
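
A quick pre-flight check can catch a wrong path before the pipeline starts. The sketch below only tests that both files exist and are non-empty; it does not validate the pickle contents. `check_mano_paths` is a hypothetical helper.

```python
from pathlib import Path


def check_mano_paths(mano_right_path, mano_left_path):
    """Return a list of problems; an empty list means both files look usable."""
    problems = []
    candidates = (
        ("mano_right_path", mano_right_path),
        ("mano_left_path", mano_left_path),
    )
    for name, value in candidates:
        path = Path(value).expanduser()
        if not path.is_file():
            problems.append(f"{name}: {path} is not a file")
        elif path.stat().st_size == 0:
            problems.append(f"{name}: {path} is empty")
    return problems
```

Calling it with the two values from your config and printing the returned list shows which setting needs attention.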

### Ray GPU resource exhaustion
Several operators require a GPU. By default, each requests 0.1 GPU, so up to 10 operator instances can share one GPU. If GPUs run out, adjust `num_gpus` or add more GPUs.
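
The arithmetic behind fractional GPU requests can be sketched as follows. It assumes each request is at most one GPU and that a single request is never split across GPUs. `actors_per_gpu` and `gpus_needed` are hypothetical helpers for estimating capacity, not pipeline settings.

```python
import math


def actors_per_gpu(gpu_per_actor):
    """How many actors fit on one GPU at a given fractional request."""
    if not 0 < gpu_per_actor <= 1:
        raise ValueError("gpu_per_actor must be in (0, 1]")
    # Round first so that values like 1 / 0.1 are not hurt by float error.
    return math.floor(round(1.0 / gpu_per_actor, 9))


def gpus_needed(num_actors, gpu_per_actor):
    """Minimum number of whole GPUs needed to schedule all actors at once."""
    return math.ceil(num_actors / actors_per_gpu(gpu_per_actor))
```

With the default of 0.1, `actors_per_gpu(0.1)` is 10, and an eleventh concurrent actor would need a second GPU. Note that a fractional request only governs scheduling; it does not cap the GPU memory an operator actually uses, so a lower sharing ratio may still be necessary for large models.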