vggt_mapper

Takes a video of a single scene as input and uses VGGT to extract camera poses, depth maps, point maps, and 3D point tracks.

The operator processes a video and extracts frames based on the specified frame number and duration.
It uses the VGGT model to analyze the extracted frames and generate various outputs such as camera parameters, depth maps, point maps, and 3D point tracks.
If 3D point tracks are required, the user must provide query points as [x, y] pixel coordinates (non-normalized), measured from the top-left corner.
The results are stored in the sample’s metadata under the specified tag field name, which defaults to ‘vggt_tags’.
The operator can output camera parameters, depth maps, point maps from projection, point maps from unprojection, and 3D point tracks, depending on the configuration.
The VGGT model is loaded from the provided path, and the operator runs on CUDA if it is available.
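The frame-selection rule for `frame_num` and `duration` can be sketched as follows. This is a minimal illustration, not the operator's actual implementation: the helper names `uniform_frame_indices` and `segment_frame_indices` are hypothetical, and the handling of rounding and of a short final segment is an assumption.

```python
from typing import List


def uniform_frame_indices(start: int, end: int, frame_num: int) -> List[int]:
    """Pick frame_num indices from the inclusive frame range [start, end]."""
    if frame_num <= 0:
        raise ValueError("frame_num must be greater than 0")
    if end < start:
        raise ValueError("end must not be smaller than start")
    if frame_num == 1:
        # Only the middle frame.
        return [(start + end) // 2]
    if frame_num == 2:
        # Only the first and the last frames.
        return [start, end]
    # First and last frames plus evenly spaced frames in between.
    step = (end - start) / (frame_num - 1)
    return [start + round(i * step) for i in range(frame_num)]


def segment_frame_indices(
    total_frames: int, fps: float, frame_num: int, duration: float = 0
) -> List[List[int]]:
    """Return one list of frame indices per segment.

    duration <= 0 treats the whole video as a single segment (assumption:
    segments are cut by frame count, and the last one may be shorter).
    """
    if duration <= 0:
        return [uniform_frame_indices(0, total_frames - 1, frame_num)]
    frames_per_segment = max(1, int(round(duration * fps)))
    segments = []
    for seg_start in range(0, total_frames, frames_per_segment):
        seg_end = min(seg_start + frames_per_segment, total_frames) - 1
        segments.append(uniform_frame_indices(seg_start, seg_end, frame_num))
    return segments
```

Under these assumptions, a 300-frame video at 30 fps with `frame_num=2` and `duration=2` is split into five segments and yields ten frames in total.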

Type: mapper

Tags: gpu, video

🔧 Parameter Configuration

name	type	default	desc
`vggt_model_path`	`str`	`'facebook/VGGT-1B'`	The path to the VGGT model.
`frame_num`	`typing.Annotated[int, Gt(gt=0)]`	`3`	The number of frames to extract uniformly from the video. If 1, only the middle frame is extracted. If 2, only the first and last frames are extracted. If greater than 2, the first and last frames are extracted along with additional frames spaced uniformly across the video duration. If `duration` > 0, `frame_num` is the number of frames per segment.
`duration`	`float`	`0`	The duration of each segment in seconds. If 0, frames are extracted from the entire video. If `duration` > 0, the video is split into segments of this length, and frames are extracted from each segment.
`tag_field_name`	`str`	`'vggt_tags'`	The field name under which the tags are stored. Defaults to `'vggt_tags'`.
`frame_dir`	`str`	`DATA_JUICER_ASSETS_CACHE`	Output directory for the extracted frames.
`if_output_camera_parameters`	`bool`	`True`	Whether to output camera parameters.
`if_output_depth_maps`	`bool`	`True`	Whether to output depth maps.
`if_output_point_maps_from_projection`	`bool`	`True`	Whether to output point maps directly inferred by VGGT.
`if_output_point_maps_from_unprojection`	`bool`	`True`	Whether to output point maps constructed from depth maps and camera parameters.
`if_output_point_tracks`	`bool`	`True`	Whether to output point tracks. If point tracks are required, the user should provide a list of 2D query points with shape (N, 2). Each point is given as [x, y] in non-normalized coordinates, relative to the top-left corner.
`args`		`''`	Extra args.
`kwargs`		`''`	Extra keyword args.
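To illustrate what "point maps constructed from depth maps and camera parameters" means, the sketch below unprojects depth maps into world-space points with NumPy. It is an illustration under stated assumptions, not the operator's own code: the function name `unproject_depth_maps` is hypothetical, the intrinsics are assumed to describe a zero-skew pinhole camera, and the extrinsics are assumed to be world-to-camera `[R | t]` matrices.

```python
import numpy as np


def unproject_depth_maps(depth, extrinsic, intrinsic):
    """Unproject depth maps to world-space point maps.

    depth:     (S, H, W, 1) or (S, H, W) depth along the camera z-axis
    extrinsic: (S, 3, 4) world-to-camera [R | t] (assumed convention)
    intrinsic: (S, 3, 3) pinhole intrinsics, zero skew assumed
    returns:   (S, H, W, 3) points in world coordinates
    """
    depth = np.asarray(depth, dtype=np.float64)
    if depth.ndim == 4:
        depth = depth[..., 0]
    extrinsic = np.asarray(extrinsic, dtype=np.float64)
    intrinsic = np.asarray(intrinsic, dtype=np.float64)
    _, height, width = depth.shape

    # Pixel grid: u runs along the width (x), v along the height (y).
    u, v = np.meshgrid(np.arange(width), np.arange(height))

    fx = intrinsic[:, 0, 0][:, None, None]
    fy = intrinsic[:, 1, 1][:, None, None]
    cx = intrinsic[:, 0, 2][:, None, None]
    cy = intrinsic[:, 1, 2][:, None, None]

    # Back-project each pixel into the camera frame.
    x_cam = (u[None] - cx) * depth / fx
    y_cam = (v[None] - cy) * depth / fy
    cam_points = np.stack([x_cam, y_cam, depth], axis=-1)  # (S, H, W, 3)

    # Invert X_cam = R @ X_world + t, i.e. X_world = R^T @ (X_cam - t).
    rotation = extrinsic[:, :, :3]
    translation = extrinsic[:, :, 3]
    shifted = cam_points - translation[:, None, None, :]
    return np.einsum("sji,shwj->shwi", rotation, shifted)
```

The result has shape (S, H, W, 3) with no leading batch dimension, which is consistent with the shape listed for `point_maps_from_unprojection` in the example output.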

📊 Effect demonstration

test

VggtMapper(vggt_model_path='facebook/VGGT-1B', frame_num=2, duration=2, frame_dir=DATA_JUICER_ASSETS_CACHE, if_output_camera_parameters=True, if_output_depth_maps=True, if_output_point_maps_from_projection=True, if_output_point_maps_from_unprojection=True, if_output_point_tracks=True)

📥 input data

Sample 1: 1 video

video11.mp4:

query_points
[[320.0, 200.0], [500.72, 100.94]]

Sample 2: 1 video

video10.mp4:

query_points
[[50.72, 100.94]]
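Because `query_points` must be a list of [x, y] pairs in non-normalized coordinates, it can be useful to check the list before running the operator. The helper below is a hypothetical pre-check, not part of the operator; rejecting negative values and the optional `width`/`height` bounds are assumptions about what counts as a valid point.

```python
from numbers import Real
from typing import List, Optional


def validate_query_points(
    points, width: Optional[int] = None, height: Optional[int] = None
) -> List[List[float]]:
    """Check that points is a non-empty list of [x, y] pairs and return floats."""
    if not isinstance(points, (list, tuple)) or len(points) == 0:
        raise ValueError("query_points must be a non-empty list of [x, y] pairs")
    cleaned = []
    for i, point in enumerate(points):
        if not isinstance(point, (list, tuple)) or len(point) != 2:
            raise ValueError(f"query_points[{i}] must contain exactly two values")
        x, y = point
        for value in (x, y):
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                raise ValueError(f"query_points[{i}] must contain numbers")
        # Coordinates are measured from the top-left corner, so they
        # should not be negative (assumption).
        if x < 0 or y < 0:
            raise ValueError(f"query_points[{i}] must be non-negative")
        # Optional bounds check when the frame size is known.
        if width is not None and x >= width:
            raise ValueError(f"query_points[{i}] x is outside the frame")
        if height is not None and y >= height:
            raise ValueError(f"query_points[{i}] y is outside the frame")
        cleaned.append([float(x), float(y)])
    return cleaned
```

Both query point lists from the samples above pass this check when no frame size is given.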

📤 output data

Sample 1: empty

camera_parameters_extrinsic
[1, 10, 3, 4]
camera_parameters_intrinsic
[1, 10, 3, 3]
depth_maps_depth_maps
[1, 10, 294, 518, 1]
depth_maps_depth_conf
[1, 10, 294, 518]
point_maps_from_projection_point_map
[1, 10, 294, 518, 3]
point_maps_from_projection_point_conf
[1, 10, 294, 518]
point_maps_from_unprojection_point_maps_from_unprojection
[10, 294, 518, 3]
point_tracks_track_list
[1, 10, 2, 2]
point_tracks_vis_score
[1, 10, 2]
point_tracks_conf_score
[1, 10, 2]

Sample 2: empty

camera_parameters_extrinsic
[1, 18, 3, 4]
camera_parameters_intrinsic
[1, 18, 3, 3]
depth_maps_depth_maps
[1, 18, 392, 518, 1]
depth_maps_depth_conf
[1, 18, 392, 518]
point_maps_from_projection_point_map
[1, 18, 392, 518, 3]
point_maps_from_projection_point_conf
[1, 18, 392, 518]
point_maps_from_unprojection_point_maps_from_unprojection
[18, 392, 518, 3]
point_tracks_track_list
[1, 18, 1, 2]
point_tracks_vis_score
[1, 18, 1]
point_tracks_conf_score
[1, 18, 1]
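The shapes listed above follow a regular pattern in the number of extracted frames S, the frame size H x W, and the number of query points N. The sketch below reproduces that pattern; the function name `expected_vggt_shapes` is hypothetical, the key names are copied from the listing above, and reading the leading 1 as a batch dimension is an assumption.

```python
from typing import Dict, List


def expected_vggt_shapes(
    num_frames: int, height: int, width: int, num_query_points: int
) -> Dict[str, List[int]]:
    """Return the output shapes implied by the example listing."""
    s, h, w, n = num_frames, height, width, num_query_points
    return {
        "camera_parameters_extrinsic": [1, s, 3, 4],
        "camera_parameters_intrinsic": [1, s, 3, 3],
        "depth_maps_depth_maps": [1, s, h, w, 1],
        "depth_maps_depth_conf": [1, s, h, w],
        "point_maps_from_projection_point_map": [1, s, h, w, 3],
        "point_maps_from_projection_point_conf": [1, s, h, w],
        # The unprojected point maps carry no leading dimension of size 1.
        "point_maps_from_unprojection_point_maps_from_unprojection": [s, h, w, 3],
        "point_tracks_track_list": [1, s, n, 2],
        "point_tracks_vis_score": [1, s, n],
        "point_tracks_conf_score": [1, s, n],
    }


# Sample 1: 10 frames of 294 x 518 with 2 query points.
sample_1 = expected_vggt_shapes(10, 294, 518, 2)
# Sample 2: 18 frames of 392 x 518 with 1 query point.
sample_2 = expected_vggt_shapes(18, 392, 518, 1)
```

Comparing actual output shapes against such a table is a quick way to confirm that the frame count and the number of query points are what the configuration intended.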

✨ explanation

The VggtMapper operator processes video data to extract 3D information such as camera parameters, depth maps, point maps, and point tracks. Each input sample consists of one video and a list of query points. The output values shown above are the shapes of the returned arrays, not the arrays themselves. In this example, each sample contains a single video; the first sample has two query points and the second has one, which is reflected in the third dimension of the point-track shapes.