data_juicer.ops.mapper.video_captioning_from_frames_mapper module
- class data_juicer.ops.mapper.video_captioning_from_frames_mapper.VideoCaptioningFromFramesMapper(*args, **kwargs)
Bases: Mapper
Generates video captions from sampled frames using an image-to-text model. Captions from different frames are concatenated into a single string.
- Uses a Hugging Face image-to-text model to generate captions for sampled video frames.
- Supports two frame sampling methods: 'all_keyframes' or 'uniform'.
- Can apply horizontal and vertical flips to the frames before captioning.
- Offers multiple strategies for retaining generated captions: 'random_any', 'similar_one_simhash', or 'all'.
- Optionally keeps the original sample in the final dataset.
- Allows setting a global prompt or per-sample prompts to guide caption generation.
- Generates a specified number of candidate captions per video, which can be reduced based on the selected retention strategy.
- The number of output samples depends on the retention strategy and whether original samples are kept.
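The three retention strategies can be illustrated with a small, library-independent sketch. The names `reduce_captions`, `simhash64`, and `hamming` are hypothetical, and the toy 64-bit SimHash below is an assumption made for illustration; it is not Data-Juicer's implementation and only shows the intent of each mode.

```python
import hashlib
import random


def simhash64(text):
    """Toy 64-bit SimHash over whitespace tokens (illustrative assumption)."""
    weights = [0] * 64
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def reduce_captions(original, candidates, mode="random_any", rng=None):
    """Return the candidate captions kept under the given mode (hypothetical helper)."""
    if mode == "all":
        return list(candidates)
    if mode == "random_any":
        rng = rng or random.Random()
        return [rng.choice(candidates)]
    if mode == "similar_one_simhash":
        ref = simhash64(original)
        # Keep the candidate whose fingerprint is closest to the original caption.
        return [min(candidates, key=lambda c: hamming(simhash64(c), ref))]
    raise ValueError(f"unknown keep_candidate_mode: {mode!r}")


candidates = ["stock market report for tuesday", "a dog runs on the beach"]
kept = reduce_captions("a dog runs on the beach", candidates, mode="similar_one_simhash")
```

With 'random_any' and 'similar_one_simhash' exactly one candidate survives per video, while 'all' keeps every candidate, which is why the output sample count depends on the mode.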
- __init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_field: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, text_update_strategy: str = 'rewrite', caption_field: str | None = None, legacy_split_by_text_token: bool = True, *args, **kwargs)
Initialization method.
- Parameters:
hf_img2seq – name of the Hugging Face model used to generate captions
trust_remote_code – whether to trust the remote code of HF models.
caption_num – how many candidate captions to generate for each video
keep_candidate_mode –
retain strategy for the generated $caption_num$ candidates.
- 'random_any': Retain a random one from the generated captions
- 'similar_one_simhash': Retain the generated one that is most similar to the original caption
- 'all': Retain all generated captions by concatenation
Note
This is a batched_OP whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote caption_num as $M$. For 'random_any' and 'similar_one_simhash' modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when keep_original_sample is False. For 'all' mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when keep_original_sample is False.
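The counts in the note above can be checked with a small helper. This is a sketch of the arithmetic only: the function name `expected_sample_count` is hypothetical, and it reads the note as giving $2Nb$ / $Nb$ for the single-caption modes and $(1+M)Nb$ / $MNb$ for 'all' mode.

```python
def expected_sample_count(num_lists, batch_size, caption_num,
                          keep_candidate_mode="random_any",
                          keep_original_sample=True):
    """Total samples after generation, following the formulas in the note.

    num_lists is N, batch_size is b, caption_num is M.
    """
    if keep_candidate_mode in ("random_any", "similar_one_simhash"):
        generated_per_sample = 1  # a single caption is retained
    elif keep_candidate_mode == "all":
        generated_per_sample = caption_num  # all M candidates are retained
    else:
        raise ValueError(f"unknown keep_candidate_mode: {keep_candidate_mode!r}")
    originals_per_sample = 1 if keep_original_sample else 0
    return (originals_per_sample + generated_per_sample) * num_lists * batch_size


# Example with N=3 input lists, batch size b=4, and M=5 candidates.
single_kept = expected_sample_count(3, 4, 5, "random_any", True)   # 2Nb
all_kept = expected_sample_count(3, 4, 5, "all", True)             # (1+M)Nb
```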
- Parameters:
keep_original_sample – whether to keep the original sample. If it's set to False, there will be only generated captions in the final dataset and the original captions will be removed. It's True by default.
prompt – a string prompt to guide the image-to-text model's generation for all samples globally. It's None by default, which means no prompt is provided.
prompt_key – the key name of the field in samples that stores the prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in parameter 'prompt' is used. It's None by default.
frame_field – the field name of the video frames to generate captions from. If frame_field is None, frames are extracted from the video field.
frame_sampling_method – sampling method for extracting frames from the videos. Should be one of ['all_keyframes', 'uniform']. Only works when 'frame_field' is None. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: 'all_keyframes'.
frame_num – the number of frames to be extracted uniformly from the video frames. Only works when 'frame_sampling_method' is 'uniform' or 'frame_field' is given. If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
horizontal_flip – flip video frames horizontally (left to right).
vertical_flip – flip video frames vertically (top to bottom).
text_update_strategy – strategy to update the text field after caption generation. Can be one of ['keep_origin', 'rewrite']. 'keep_origin': keep the original text unchanged. 'rewrite': rewrite the text field with the generated captions concatenated by special tokens.
caption_field – the field name to save the generated captions.
legacy_split_by_text_token – whether to split by special tokens (e.g. <__dj__video>) in the text field and read videos in order, or use the 'videos' or 'frames' field directly.
args – extra args
kwargs – extra args
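The frame_num rule described above (1 gives the middle frame, 2 gives the first and last frames, larger values add evenly spaced frames in between) can be sketched in terms of frame indices. The helper `uniform_frame_indices` is hypothetical and works on a frame count rather than on video duration; the library's actual sampling code may differ.

```python
from typing import List


def uniform_frame_indices(total_frames: int, frame_num: int) -> List[int]:
    """Pick frame indices following the frame_num rule (illustrative sketch)."""
    if total_frames <= 0 or frame_num <= 0:
        raise ValueError("total_frames and frame_num must be positive")
    # Assumption: never request more frames than the video contains.
    frame_num = min(frame_num, total_frames)
    if frame_num == 1:
        return [total_frames // 2]  # only the middle frame
    last = total_frames - 1
    step = last / (frame_num - 1)
    # First and last frames are always included; the rest are evenly spaced.
    return [int(i * step + 0.5) for i in range(frame_num)]


middle_only = uniform_frame_indices(101, 1)
first_and_last = uniform_frame_indices(101, 2)
five_frames = uniform_frame_indices(101, 5)
```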
- process_batched(samples, rank=None, context=False)
- Parameters:
samples
- Returns:
Note
This is a batched_OP whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote caption_num as $M$. When keep_original_sample is True, the total number of samples after generation is $2Nb$ for 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ for 'all' mode.