data_juicer.ops.mapper.vggt_mapper module
- class data_juicer.ops.mapper.vggt_mapper.VggtMapper(vggt_model_path: str = 'facebook/VGGT-1B', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'vggt_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_output_camera_parameters: bool = True, if_output_depth_maps: bool = True, if_output_point_maps_from_projection: bool = True, if_output_point_maps_from_unprojection: bool = True, if_output_point_tracks: bool = True, *args, **kwargs)
Bases: Mapper
Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks.
The operator processes a video and extracts frames based on the specified frame number and duration.
It uses the VGGT model to analyze the extracted frames and generate various outputs such as camera parameters, depth maps, point maps, and 3D point tracks.
If 3D point tracks are required, the user must provide query points in the format [x, y], relative to the top-left corner.
The results are stored in the sample’s metadata under the specified tag field name, which defaults to ‘vggt_tags’.
The operator can output camera parameters, depth maps, point maps from projection, point maps from unprojection, and 3D point tracks, depending on the configuration.
The VGGT model is loaded from the provided path, and the operator runs in CUDA mode if available.
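To make the term "point maps from unprojection" concrete, the following is a minimal sketch of lifting a depth map to 3D points with a pinhole camera model, expressed in camera coordinates. The function name `unproject_depth_map` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions; this is not the operator's or VGGT's own code, and it omits the extrinsic transform into world coordinates.

```python
def unproject_depth_map(depth, fx, fy, cx, cy):
    """Lift an H x W depth map to camera-space 3D points (pinhole model).

    Hypothetical helper for illustration only.
    depth: list of rows, each a list of depth values (z along the optical axis).
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Returns an H x W nested list of (x, y, z) tuples.
    """
    point_map = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        out_row = []
        for u, z in enumerate(row):        # u: pixel column (image x)
            x = (u - cx) * z / fx          # back-project along the camera x axis
            y = (v - cy) * z / fy          # back-project along the camera y axis
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


# Tiny 2 x 2 depth map with made-up intrinsics.
depth = [[2.0, 2.0],
         [4.0, 4.0]]
point_map = unproject_depth_map(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(point_map[0][0])  # point for the top-left pixel
print(point_map[1][1])  # point for the bottom-right pixel
```

Pixel coordinates here follow the same convention as the query points: [x, y] measured from the top-left corner.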
- __init__(vggt_model_path: str = 'facebook/VGGT-1B', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'vggt_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_output_camera_parameters: bool = True, if_output_depth_maps: bool = True, if_output_point_maps_from_projection: bool = True, if_output_point_maps_from_unprojection: bool = True, if_output_point_tracks: bool = True, *args, **kwargs)
Initialization method.
- Parameters:
vggt_model_path – The path to the VGGT model.
frame_num – The number of frames to be extracted uniformly from the video. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If “duration” > 0, frame_num is the number of frames per segment.
duration – The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.
tag_field_name – The field name under which the tags are stored. Defaults to “vggt_tags”.
frame_dir – Output directory to save extracted frames.
if_output_camera_parameters – Determines whether to output camera parameters.
if_output_depth_maps – Determines whether to output depth maps.
if_output_point_maps_from_projection – Determines whether to output point maps directly inferred by VGGT.
if_output_point_maps_from_unprojection – Determines whether to output point maps constructed from depth maps and camera parameters.
if_output_point_tracks – Determines whether to output point tracks. If point tracks are required, the user should provide a list where each element consists of 2D point coordinates (list shape: (N, 2)). The point coordinates should be specified in the format [x, y], relative to the top-left corner, where x/y values are non-normalized.
args – Extra positional arguments.
kwargs – Extra keyword arguments.
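The interaction between frame_num and duration described above can be sketched as follows. This is a hedged illustration of the documented rules, not the operator's actual implementation: the helper name `sample_timestamps` is hypothetical, it returns timestamps in seconds rather than decoded frames, and it assumes the same per-segment rules apply when duration > 0 and that a final, shorter segment is kept.

```python
import math


def sample_timestamps(video_duration, frame_num=3, duration=0.0):
    """Return sampling timestamps (seconds) following the documented rules.

    Hypothetical helper for illustration only.
    frame_num == 1 -> the middle of each segment
    frame_num == 2 -> the first and last positions of each segment
    frame_num  > 2 -> first, last, and uniformly spaced positions in between
    duration == 0  -> the whole video is treated as one segment
    """
    if frame_num <= 0:
        raise ValueError("frame_num must be greater than 0")

    # Split the video into segments of length `duration` (assumption: the
    # last segment may be shorter than the others).
    if duration > 0:
        n_segments = math.ceil(video_duration / duration)
        segments = [
            (i * duration, min((i + 1) * duration, video_duration))
            for i in range(n_segments)
        ]
    else:
        segments = [(0.0, video_duration)]

    timestamps = []
    for seg_start, seg_end in segments:
        if frame_num == 1:
            timestamps.append((seg_start + seg_end) / 2)
        else:
            step = (seg_end - seg_start) / (frame_num - 1)
            timestamps.extend(seg_start + i * step for i in range(frame_num))
    return timestamps


print(sample_timestamps(10.0, frame_num=1))                # middle only
print(sample_timestamps(10.0, frame_num=3))                # first, middle, last
print(sample_timestamps(10.0, frame_num=2, duration=5.0))  # two frames per 5 s segment
```

In a real decoder the final timestamp would typically be clamped to the last available frame, since a frame exactly at the end of the video may not exist.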