detect_main_character_mapper

Extract all main character names based on the given image and its caption.

This operator uses a multimodal language model to generate a description of the main characters in the given image, then parses the generated JSON to extract the list of main characters. Samples with fewer main characters than filter_min_character_num are filtered out. By default, the multimodal language model is the Hugging Face model llava-hf/llava-v1.6-vicuna-7b-hf with fixed generation parameters (see mllm_mapper_args). The resulting list, main_character_list, is stored in the sample's metadata.
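The parse, filter, and store steps can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the operator's actual implementation: the helper names parse_main_characters and process_sample are hypothetical, the JSON shape {"main_character": [...]} and the "meta" dictionary key are assumptions, and returning None stands in for however the framework really drops a sample. Only main_character_list and filter_min_character_num come from this page.

```python
import json
from typing import Any, Dict, List, Optional

# Metric name from this page; the enclosing "meta" key is hypothetical.
META_FIELD = "meta"
METRIC_KEY = "main_character_list"


def parse_main_characters(model_output: str) -> List[str]:
    """Pull character names out of the model's raw answer.

    Assumes (hypothetically) that the answer contains a JSON object such as
    {"main_character": ["girl", "boy"]}, possibly surrounded by extra text.
    """
    start, end = model_output.find("{"), model_output.rfind("}")
    if start == -1 or end <= start:
        return []  # no JSON object in the answer
    try:
        data = json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return []  # malformed JSON is treated as "no characters"
    if not isinstance(data, dict):
        return []
    names = data.get("main_character", [])
    if not isinstance(names, list):
        return []
    # Strip whitespace, drop empty entries, deduplicate in first-seen order.
    cleaned = [n.strip() for n in names if isinstance(n, str) and n.strip()]
    return list(dict.fromkeys(cleaned))


def process_sample(sample: Dict[str, Any], model_output: str,
                   filter_min_character_num: int = 0) -> Optional[Dict[str, Any]]:
    """Store the parsed list in the sample's metadata, or return None to drop it."""
    characters = parse_main_characters(model_output)
    if len(characters) < filter_min_character_num:
        return None  # too few main characters: filter the sample out
    sample.setdefault(META_FIELD, {})[METRIC_KEY] = characters
    return sample


if __name__ == "__main__":
    demo = {"text": "Two kids fly a kite.", "images": ["kite.jpg"]}
    answer = 'Answer: {"main_character": ["girl", "boy"]}'
    print(process_sample(demo, answer, filter_min_character_num=1))
```

Treating unparseable output as an empty list keeps the sketch total: with the default threshold of 0 such a sample is kept with an empty list, while any positive threshold filters it out.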

Type: mapper

Tags: gpu

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| mllm_mapper_args | typing.Optional[typing.Dict] | {} | Arguments for the multimodal language model mapper, controlling how it generates the main-character description. The default empty dict uses fixed values: max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, hf_model="llava-hf/llava-v1.6-vicuna-7b-hf". |
| filter_min_character_num | int | 0 | Samples whose number of main characters is less than this threshold are filtered out. With the default of 0, no samples are filtered. |
| args | | '' | |
| kwargs | | '' | |
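The fallback to fixed defaults described for mllm_mapper_args can be sketched as follows. This is a minimal illustration, not the operator's actual code: the function name resolve_mllm_args is hypothetical, and it assumes that keys supplied by the user override the documented defaults while unspecified keys keep their default values.

```python
from typing import Any, Dict, Optional

# Default generation settings documented for mllm_mapper_args.
DEFAULT_MLLM_ARGS: Dict[str, Any] = {
    "max_new_tokens": 256,
    "temperature": 0.2,
    "top_p": None,
    "num_beams": 1,
    "hf_model": "llava-hf/llava-v1.6-vicuna-7b-hf",
}


def resolve_mllm_args(mllm_mapper_args: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the effective arguments (assumption: user keys override defaults)."""
    resolved = dict(DEFAULT_MLLM_ARGS)  # copy so the defaults are never mutated
    resolved.update(mllm_mapper_args or {})  # None and {} both mean "use defaults"
    return resolved


if __name__ == "__main__":
    print(resolve_mllm_args({"temperature": 0.0, "max_new_tokens": 128}))
```

Under this assumption, overriding only temperature and max_new_tokens still leaves top_p, num_beams, and hf_model at their documented values.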

📊 Effect Demonstration

Not available.