data_juicer.ops.mapper.export_to_lerobot_mapper module
- class data_juicer.ops.mapper.export_to_lerobot_mapper.ExportToLeRobotMapper(*args, **kwargs)
Bases: Mapper
Export processed video data to LeRobot v2.0 dataset format (LIBERO-style).
Designed for Ray distributed execution: each actor writes files independently using UUID-based names (no cross-process coordination). After all actors finish, call finalize_dataset() once to assign sequential episode indices, rename files, and generate metadata.
- Processing phase (parallel, per actor):
staging/
├── data/{uuid}.parquet
├── videos/{uuid}.mp4
└── meta/episodes_{uuid}.jsonl
- After finalize_dataset() (single-threaded):
dataset_dir/
├── data/chunk-{NNN}/episode_XXXXXX.parquet
├── videos/chunk-{NNN}/observation.images.image/episode_XXXXXX.mp4
└── meta/
    ├── info.json
    ├── tasks.jsonl
    ├── episodes.jsonl
    └── modality.json
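As a rough illustration of the two layouts above, the sketch below builds the staging paths for one episode and the final paths for a given episode index. The helper names `staging_paths` and `final_paths` are hypothetical, and the zero-padding widths (three digits for the chunk, six for the episode) are assumptions read off the `{NNN}` and `XXXXXX` placeholders; this is not the library's own code.

```python
import uuid
from pathlib import Path


def staging_paths(staging_dir: str, episode_id: str) -> dict:
    """Per-actor paths during the parallel phase, keyed by a UUID-based name."""
    root = Path(staging_dir)
    return {
        "data": root / "data" / f"{episode_id}.parquet",
        "video": root / "videos" / f"{episode_id}.mp4",
        "meta": root / "meta" / f"episodes_{episode_id}.jsonl",
    }


def final_paths(dataset_dir: str, episode_index: int, chunks_size: int = 1000) -> dict:
    """Paths after finalization; the chunk is assumed to be index // chunks_size."""
    root = Path(dataset_dir)
    chunk = f"chunk-{episode_index // chunks_size:03d}"
    name = f"episode_{episode_index:06d}"
    return {
        "data": root / "data" / chunk / f"{name}.parquet",
        "video": root / "videos" / chunk / "observation.images.image" / f"{name}.mp4",
    }


if __name__ == "__main__":
    episode_id = uuid.uuid4().hex  # unique per episode, no coordination needed
    print(staging_paths("staging", episode_id)["meta"])
    print(final_paths("dataset_dir", 1234)["video"])
```

Because the staging names depend only on a locally generated UUID, two actors can never collide on a file name, which is what lets the sequential numbering be deferred to a single final pass.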
- __init__(output_dir: str = './lerobot_output', hand_action_field: str = 'hand_action_tags', fps: int = 10, robot_type: str = 'egodex_hand', chunks_size: int = 1000, segment_field: str = None, frame_field: str = 'video_frames', *args, **kwargs)
Initialization method.
- Parameters:
output_dir – Root directory for the LeRobot dataset output.
hand_action_field – Meta field with action/state data. Used in whole-video mode (segment_field=None).
fps – Frames per second for the dataset.
robot_type – Robot type identifier written to info.json.
chunks_size – Maximum number of episodes per chunk directory (default 1000).
segment_field – Meta field storing atomic action segments. When set, each segment becomes a separate episode with its own caption as the task description. When None (default), the mapper falls back to whole-video export via hand_action_field.
frame_field – Sample field with extracted frame image paths. Used in segment mode to create per-segment videos.
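The interaction between `segment_field` and `hand_action_field` can be sketched as a small dispatch. This only illustrates the rule described above: the helper `plan_episodes` is hypothetical, and both the assumed segment layout (a list of dicts with a `caption` key) and the placement of the two fields directly in one `meta` dict are guesses, not the mapper's documented schema.

```python
from typing import Any, Dict, List, Optional


def plan_episodes(
    meta: Dict[str, Any],
    hand_action_field: str = "hand_action_tags",
    segment_field: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return one dict per episode that would be exported (hypothetical helper)."""
    if segment_field is None:
        # Whole-video mode: a single episode built from the action/state data.
        return [{"mode": "whole_video", "actions": meta[hand_action_field]}]
    # Segment mode: one episode per segment, with its caption as the task.
    return [
        {"mode": "segment", "task": seg["caption"], "segment": seg}
        for seg in meta[segment_field]
    ]
```

With the default `segment_field=None`, a sample yields exactly one episode; setting it yields as many episodes as there are segments.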
- process_single(sample=None, rank=None)
Sample-level processing: sample -> sample.
- Parameters:
sample – The sample to process.
- Returns:
processed sample
- static finalize_dataset(output_dir, fps=10, robot_type='egodex_hand', chunks_size=1000)
Merge staged files into final LeRobot dataset structure.
Must be called ONCE after all Ray actors have finished. This step is single-threaded, so there are no concurrency issues.
- Steps:
1. Collect all episode metadata fragments from staging
2. Sort by UUID for deterministic ordering
3. Assign sequential episode_index (0, 1, 2, …)
4. Rewrite parquet files with correct episode_index / index
5. Move video files to chunk directories
6. Write episodes.jsonl, tasks.jsonl, info.json
7. Clean up staging directory
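The steps above can be sketched in simplified form using only the standard library. This is an illustration under stated assumptions, not the actual `finalize_dataset()` implementation: the function name `finalize_sketch` is hypothetical, staging is assumed to live at `<output_dir>/staging`, each metadata fragment is assumed to hold one JSON object on its first line, and the parquet rewrite plus the `tasks.jsonl` and `info.json` outputs are left out.

```python
import json
import shutil
from pathlib import Path


def finalize_sketch(output_dir: str, chunks_size: int = 1000) -> list:
    """Simplified finalization: index episodes and move videos into chunks."""
    root = Path(output_dir)
    staging = root / "staging"  # assumed location of the staging directory

    # Collect fragments and sort by UUID (embedded in the file name).
    fragments = sorted((staging / "meta").glob("episodes_*.jsonl"),
                       key=lambda p: p.stem)

    episodes = []
    for episode_index, fragment in enumerate(fragments):
        uid = fragment.stem[len("episodes_"):]
        record = json.loads(fragment.read_text().splitlines()[0])
        record["episode_index"] = episode_index  # sequential index
        episodes.append(record)

        # Move the staged video into its chunk directory.
        chunk = f"chunk-{episode_index // chunks_size:03d}"
        target = root / "videos" / chunk / "observation.images.image"
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(staging / "videos" / f"{uid}.mp4"),
                    str(target / f"episode_{episode_index:06d}.mp4"))

    # Write the merged episode list.
    meta_dir = root / "meta"
    meta_dir.mkdir(parents=True, exist_ok=True)
    with open(meta_dir / "episodes.jsonl", "w") as f:
        for record in episodes:
            f.write(json.dumps(record) + "\n")

    shutil.rmtree(staging)  # clean up staging
    return episodes
```

Sorting by UUID before numbering is what makes the result reproducible: the order in which actors happened to finish has no effect on which episode receives which index.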