data_juicer.ops.mapper.export_to_lerobot_mapper module

class data_juicer.ops.mapper.export_to_lerobot_mapper.ExportToLeRobotMapper(*args, **kwargs)

Bases: Mapper

Export processed video data to LeRobot v2.0 dataset format (LIBERO-style).

Designed for Ray distributed execution: each actor writes files independently using UUID-based names (no cross-process coordination). After all actors finish, call finalize_dataset() once to assign sequential episode indices, rename files, and generate metadata.

Processing phase (parallel, per actor):

staging/
├── data/{uuid}.parquet
├── videos/{uuid}.mp4
└── meta/episodes_{uuid}.jsonl

After finalize_dataset() (single-threaded):

dataset_dir/
├── data/chunk-{NNN}/episode_XXXXXX.parquet
├── videos/chunk-{NNN}/observation.images.image/episode_XXXXXX.mp4
└── meta/
    ├── info.json
    ├── tasks.jsonl
    ├── episodes.jsonl
    └── modality.json
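
The chunked layout above implies a mapping from an episode index to its final file paths. The following is a minimal sketch of that mapping, not the module's own code. It assumes the chunk number is `episode_index // chunks_size` zero-padded to three digits, and the episode index is zero-padded to six digits, which is consistent with `chunk-{NNN}` and `episode_XXXXXX` above but is not confirmed by this page. `episode_paths` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def episode_paths(episode_index: int, chunks_size: int = 1000,
                  video_key: str = "observation.images.image"):
    """Return (parquet_path, video_path) relative to the dataset root.

    Hypothetical helper: assumes chunk = episode_index // chunks_size.
    """
    if episode_index < 0 or chunks_size <= 0:
        raise ValueError("episode_index must be >= 0 and chunks_size > 0")
    chunk = f"chunk-{episode_index // chunks_size:03d}"
    name = f"episode_{episode_index:06d}"
    # Build file names with f-strings; the video key itself contains dots,
    # so suffix-based path helpers would be error-prone here.
    data = PurePosixPath("data") / chunk / f"{name}.parquet"
    video = PurePosixPath("videos") / chunk / video_key / f"{name}.mp4"
    return str(data), str(video)
```

Under these assumptions, with the default `chunks_size` of 1000, episodes 0 through 999 land in `chunk-000` and episode 1000 starts `chunk-001`.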

__init__(output_dir: str = './lerobot_output', hand_action_field: str = 'hand_action_tags', fps: int = 10, robot_type: str = 'egodex_hand', chunks_size: int = 1000, segment_field: str = None, frame_field: str = 'video_frames', *args, **kwargs)

Initialization method.

Parameters:
  • output_dir -- Root directory for the LeRobot dataset output.

  • hand_action_field -- Meta field with action/state data. Used in whole-video mode (segment_field=None).

  • fps -- Frames per second for the dataset.

  • robot_type -- Robot type identifier for info.json.

  • chunks_size -- Max episodes per chunk directory (default 1000).

  • segment_field -- Meta field storing atomic action segments. When set, each segment becomes a separate episode with its own caption as task description. When None (default), falls back to whole-video export via hand_action_field.

  • frame_field -- Sample field with extracted frame image paths. Used in segment mode to create per-segment videos.
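
The two export modes differ only in whether `segment_field` is set. The sketch below spells out keyword arguments for each mode using the parameter names and defaults from the signature above. The field name `"atomic_action_segments"` and the helper `export_mode` are hypothetical and serve only as illustration.

```python
# Keyword arguments taken from the __init__ signature above.
# Whole-video mode: segment_field stays None, so action/state data
# is read from hand_action_field.
whole_video_kwargs = {
    "output_dir": "./lerobot_output",
    "hand_action_field": "hand_action_tags",
    "fps": 10,
    "robot_type": "egodex_hand",
    "chunks_size": 1000,
    "segment_field": None,
    "frame_field": "video_frames",
}

# Segment mode: each segment becomes its own episode.
# "atomic_action_segments" is a hypothetical meta field name.
segment_kwargs = {**whole_video_kwargs, "segment_field": "atomic_action_segments"}


def export_mode(kwargs: dict) -> str:
    """Name the export mode implied by the arguments (illustrative only)."""
    return "whole-video" if kwargs.get("segment_field") is None else "segment"
```

Where Data-Juicer is installed, either dictionary could be passed as `ExportToLeRobotMapper(**segment_kwargs)`.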

process_single(sample=None, rank=None)

Sample-level processing: takes one sample and returns the processed sample.

Parameters:

sample -- sample to process

Returns:

processed sample

static finalize_dataset(output_dir, fps=10, robot_type='egodex_hand', chunks_size=1000)

Merge staged files into final LeRobot dataset structure.

Must be called exactly once, after all Ray actors have finished. This step runs single-threaded, so there are no concurrency issues.

Steps:
  1. Collect all episode metadata fragments from staging

  2. Sort by UUID for deterministic ordering

  3. Assign sequential episode_index (0, 1, 2, ...)

  4. Rewrite parquet files with correct episode_index / index

  5. Move video files to chunk directories

  6. Write episodes.jsonl, tasks.jsonl, info.json

  7. Clean up staging directory
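
Steps 1 to 3 above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the actual implementation: it assumes fragments live at `<staging_dir>/meta/episodes_<uuid>.jsonl` and that each non-empty line is a JSON object describing one episode. The function name `assign_episode_indices` is hypothetical.

```python
import json
from pathlib import Path

PREFIX = "episodes_"


def assign_episode_indices(staging_dir):
    """Collect staged metadata fragments, sort by UUID, number them 0..N-1.

    Assumed layout: <staging_dir>/meta/episodes_<uuid>.jsonl, with one
    JSON object per line and one episode per object.
    """
    meta_dir = Path(staging_dir) / "meta"
    # Step 1 + 2: collect fragments and sort by the UUID part of the
    # file name so the resulting order is deterministic across runs.
    fragments = sorted(
        meta_dir.glob(f"{PREFIX}*.jsonl"),
        key=lambda p: p.stem[len(PREFIX):],
    )
    episodes = []
    for fragment in fragments:
        with fragment.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                record = json.loads(line)
                # Step 3: sequential episode_index (0, 1, 2, ...).
                record["episode_index"] = len(episodes)
                episodes.append(record)
    return episodes
```

Sorting on the UUID rather than on file-system listing order is what makes the numbering reproducible: the same set of staged files always yields the same episode indices, regardless of which actor finished first.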