data_juicer.ops.mapper.image_captioning_mapper module#

class data_juicer.ops.mapper.image_captioning_mapper.ImageCaptioningMapper(*args, **kwargs)[source]#

Bases: Mapper

Generates image captions using a Hugging Face model and appends them to samples.

This operator generates captions for images in the input samples using a specified Hugging Face model. It can generate multiple captions per image and apply different strategies to retain the generated captions. The operator supports three retention modes: ‘random_any’, ‘similar_one_simhash’, and ‘all’. In ‘random_any’ mode, a random caption is retained. In ‘similar_one_simhash’ mode, the most similar caption to the original text (based on SimHash) is retained. In ‘all’ mode, all generated captions are concatenated and retained. The operator can also keep or discard the original sample based on the keep_original_sample parameter. If both prompt and prompt_key are set, the prompt_key takes precedence.
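A minimal usage sketch (argument values are illustrative; the import path follows this module's name):

from data_juicer.ops.mapper.image_captioning_mapper import ImageCaptioningMapper

# Generate 3 candidate captions per image and keep a random one
# for each generated sample.
op = ImageCaptioningMapper(
    hf_img2seq='Salesforce/blip2-opt-2.7b',
    caption_num=3,
    keep_candidate_mode='random_any',
)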

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, gpu_batch_size: Annotated[int, Gt(gt=0)] = 8, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_img2seq – model name on Hugging Face used to generate captions

  • trust_remote_code – whether to trust the remote code of HF models.

  • caption_num – how many candidate captions to generate for each image

  • keep_candidate_mode –

    retain strategy for the generated caption_num candidates.

    ‘random_any’: retain a random one from the generated captions.

    ‘similar_one_simhash’: retain the generated caption that is most similar (based on SimHash) to the original caption.

    ‘all’: retain all generated captions by concatenation.

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ batches of input samples, each with batch size $b$, and denote caption_num as $M$. For ‘random_any’ and ‘similar_one_simhash’ modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False. For ‘all’ mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False. For example, with one batch of $b = 4$ samples and $M = 3$, ‘random_any’ yields 8 samples and ‘all’ yields 16 samples when keep_original_sample is True.

Parameters:
  • keep_original_sample – whether to keep the original sample. If set to False, only the generated samples remain in the final dataset and the original samples are removed. Defaults to True.

  • prompt – a string prompt to guide the generation of the blip2 model for all samples globally. Defaults to None, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It's used to set different prompts for different samples. If it's None, the global prompt from the ‘prompt’ parameter is used. Defaults to None.

  • gpu_batch_size – the batch size for GPU inference. This controls how many images are processed together in a single GPU forward pass. Useful when the dataset batch size is larger than what the GPU can handle. Default is 8.

  • args – extra args

  • kwargs – extra args
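For illustration, a sketch of the two prompting modes (the field name 'my_prompt' is hypothetical):

# A single global prompt applied to every sample.
op = ImageCaptioningMapper(prompt='a photo of')

# Per-sample prompts read from a field of each sample; when both are
# set, prompt_key takes precedence over prompt.
op = ImageCaptioningMapper(prompt_key='my_prompt')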

process_batched(samples, rank=None)[source]#

Process a batch of samples with true GPU batching for caption generation.

This method collects all images from all samples in the batch, generates captions for them in GPU-efficient sub-batches, and then distributes the captions back to their respective samples.

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ batches of input samples, each with batch size $b$, and denote caption_num as $M$. With keep_original_sample set to True, the total number of samples after generation is $2Nb$ for ‘random_any’ and ‘similar_one_simhash’ modes, and $(1+M)Nb$ for ‘all’ mode.

Parameters:
  • samples – Dict of lists containing the batch of samples.

  • rank – Optional GPU rank for distributed processing.

Returns:

Dict of lists containing the processed samples with generated captions.
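A sketch of invoking the operator on one batch, continuing from the op built above (the 'text'/'images' field names follow Data-Juicer's default sample schema; the image path and the '<__dj__image>' placeholder token are assumptions for illustration):

samples = {
    'text': ['<__dj__image> an antique photo of a street'],
    'images': [['path/to/image.jpg']],
}
result = op.process_batched(samples)
# result is a dict of lists; with keep_original_sample=True it holds the
# original sample followed by the sample(s) carrying generated captions.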