data_juicer.ops.mapper.vggt_mapper module

class data_juicer.ops.mapper.vggt_mapper.VggtMapper(vggt_model_path: str = 'facebook/VGGT-1B', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'vggt_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_output_camera_parameters: bool = True, if_output_depth_maps: bool = True, if_output_point_maps_from_projection: bool = True, if_output_point_maps_from_unprojection: bool = True, if_output_point_tracks: bool = True, *args, **kwargs)

Bases: Mapper

Takes a video of a single scene as input and uses VGGT to extract camera poses, depth maps, point maps, and 3D point tracks. Outputting point tracks requires the user to provide query points.
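
To make the relationship between depth maps, camera parameters, and point maps concrete, the following is a minimal NumPy sketch of unprojecting one depth map into world-space points. It is not the operator's implementation. The function name unproject_depth is hypothetical, and the sketch assumes a pinhole intrinsic matrix and a 3x4 world-to-camera extrinsic matrix [R | t]; check the convention of the actual outputs before relying on it.

```python
import numpy as np


def unproject_depth(depth: np.ndarray,
                    intrinsic: np.ndarray,
                    extrinsic: np.ndarray) -> np.ndarray:
    """Turn an (H, W) depth map into an (H, W, 3) map of world-space points.

    Assumptions (not taken from the operator's source):
      * intrinsic is a 3x3 pinhole camera matrix,
      * extrinsic is a 3x4 world-to-camera matrix [R | t].
    """
    height, width = depth.shape
    # Pixel grid: u runs along x (columns), v runs along y (rows).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project each pixel to a ray with z = 1 in camera space,
    # then scale the ray by the depth value.
    rays = pixels @ np.linalg.inv(intrinsic).T
    cam_points = rays * depth[..., None]

    # Invert the world-to-camera transform: X_world = R^T (X_cam - t).
    # With row vectors, R^T x is written as x @ R.
    rotation, translation = extrinsic[:, :3], extrinsic[:, 3]
    return (cam_points - translation) @ rotation


# Toy example: 3x3 image, focal length 2, principal point (1, 1).
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
E = np.hstack([np.eye(3), np.zeros((3, 1))])  # camera at the world origin
points = unproject_depth(np.full((3, 3), 4.0), K, E)
```

In this toy setup the center pixel lies on the optical axis, so its point sits at depth 4 straight ahead of the camera.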

__init__(vggt_model_path: str = 'facebook/VGGT-1B', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'vggt_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_output_camera_parameters: bool = True, if_output_depth_maps: bool = True, if_output_point_maps_from_projection: bool = True, if_output_point_maps_from_unprojection: bool = True, if_output_point_tracks: bool = True, *args, **kwargs)

Initialization method.

Parameters:
  • vggt_model_path – The path to the VGGT model.

  • frame_num – The number of frames to extract uniformly from the video. If 1, only the middle frame is extracted. If 2, only the first and last frames are extracted. If greater than 2, the first and last frames are extracted along with additional frames spaced uniformly across the video duration. If duration > 0, frame_num is the number of frames extracted per segment.

  • duration – The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is split into segments of this length, and frames are extracted from each segment.

  • tag_field_name – The field name under which the tags are stored. Defaults to "vggt_tags".

  • frame_dir – Output directory to save extracted frames.

  • if_output_camera_parameters – Whether to output camera parameters.

  • if_output_depth_maps – Whether to output depth maps.

  • if_output_point_maps_from_projection – Whether to output point maps directly inferred by VGGT.

  • if_output_point_maps_from_unprojection – Whether to output point maps constructed from depth maps and camera parameters.

  • if_output_point_tracks – Whether to output point tracks. If point tracks are required, the user should provide a list of 2D query points with shape (N, 2). Each point is given as [x, y] in non-normalized coordinates relative to the top-left corner.

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.
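
The frame_num and duration rules above can be sketched in plain Python. This is only an illustration of the documented behavior, not Data-Juicer's extraction code: the helper names uniform_frame_indices and segment_bounds are hypothetical, working in frame indices is an assumption, and so is the handling of a final segment shorter than duration.

```python
from typing import List, Tuple


def uniform_frame_indices(total_frames: int, frame_num: int) -> List[int]:
    """Pick frame indices following the frame_num rules (hypothetical helper)."""
    if total_frames <= 0 or frame_num <= 0:
        raise ValueError("total_frames and frame_num must be positive")
    if frame_num == 1:
        # Only the middle frame.
        return [total_frames // 2]
    if frame_num >= total_frames:
        # Not enough frames to subsample: keep all of them (assumption).
        return list(range(total_frames))
    # First and last frames, plus evenly spaced frames in between.
    last = total_frames - 1
    step = last / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]


def segment_bounds(video_seconds: float,
                   duration: float) -> List[Tuple[float, float]]:
    """Split a video into (start, end) segments in seconds (hypothetical helper)."""
    if duration <= 0:
        # duration == 0 means the whole video is one segment.
        return [(0.0, video_seconds)]
    bounds = []
    start = 0.0
    while start < video_seconds:
        # Assumption: the last segment is kept even if it is shorter.
        bounds.append((start, min(start + duration, video_seconds)))
        start += duration
    return bounds


indices = uniform_frame_indices(total_frames=101, frame_num=3)
segments = segment_bounds(video_seconds=10.0, duration=4.0)
```

With duration > 0, frame_num frames would be picked inside each returned segment rather than once over the whole video.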

process_single(sample=None, rank=None)

Processes a single sample and returns the processed sample (sample -> sample).

Parameters:

sample – sample to process

Returns:

processed sample
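
Because point tracking depends on user-supplied query points in the format described for if_output_point_tracks, it can help to check them before processing. The sketch below is a hypothetical validator, not part of Data-Juicer. It assumes the frame width and height are known, and it makes no claim about which sample field the operator reads the points from.

```python
from typing import List, Sequence, Tuple


def validate_query_points(points: Sequence[Sequence[float]],
                          width: int,
                          height: int) -> List[Tuple[float, float]]:
    """Check that points form an (N, 2) list of [x, y] pixel coordinates.

    Coordinates are non-normalized and measured from the top-left corner,
    so valid values satisfy 0 <= x < width and 0 <= y < height.
    """
    validated = []
    for index, point in enumerate(points):
        if len(point) != 2:
            raise ValueError(f"point {index} must be [x, y], got {point!r}")
        x, y = float(point[0]), float(point[1])
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(
                f"point {index} = ({x}, {y}) lies outside a "
                f"{width}x{height} frame; normalized coordinates are not accepted"
            )
        validated.append((x, y))
    return validated


query_points = validate_query_points([[10, 20], [639, 479]], width=640, height=480)
```

A point such as [0.5, 0.5] passes this check as a pixel near the top-left corner, so the validator cannot detect coordinates that were normalized by mistake unless they fall outside the frame.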