data_juicer.ops.mapper#

class data_juicer.ops.mapper.AudioAddGaussianNoiseMapper(min_amplitude: float = 0.001, max_amplitude: float = 0.015, p: float = 0.5, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Mapper to add Gaussian noise to audio samples.

This operator adds Gaussian noise to audio data with a specified probability. The amplitude of the noise is randomly chosen between min_amplitude and max_amplitude. If save_dir is provided, the modified audio files are saved in that directory; otherwise, they are saved in the same directory as the input files. The p parameter controls the probability of applying this transformation to each sample. If no audio is present in the sample, it is returned unchanged.

__init__(min_amplitude: float = 0.001, max_amplitude: float = 0.015, p: float = 0.5, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • min_amplitude – float, unit: linear amplitude. Default: 0.001. Minimum noise amplification factor.

  • max_amplitude – float, unit: linear amplitude. Default: 0.015. Maximum noise amplification factor.

  • p – float, range: [0.0, 1.0]. Default: 0.5. The probability of applying this transform.

  • save_dir – str. Default: None. The directory where generated audio files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

process_single(sample, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
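The core transform can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the operator's actual implementation (which loads audio arrays from the sample's audio files); the function name and list-of-floats representation are chosen here for demonstration.

```python
import random

def add_gaussian_noise(samples, min_amplitude=0.001, max_amplitude=0.015,
                       p=0.5, rng=None):
    """Add Gaussian noise to a sequence of audio samples with probability p.

    The noise amplitude is drawn uniformly from [min_amplitude, max_amplitude],
    mirroring the parameters documented above.
    """
    rng = rng or random.Random()
    if rng.random() >= p:
        return samples  # transform skipped for this sample
    amplitude = rng.uniform(min_amplitude, max_amplitude)
    return [s + rng.gauss(0.0, 1.0) * amplitude for s in samples]

clean = [0.0] * 16
noisy = add_gaussian_noise(clean, p=1.0, rng=random.Random(0))
```

With `p=1.0` the noise is always applied; with `p=0.0` the input is returned unchanged, matching the documented probability semantics.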

class data_juicer.ops.mapper.AudioFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Wraps FFmpeg audio filters for processing audio files in a dataset.

This operator applies specified FFmpeg audio filters to the audio files in the dataset. It supports passing custom filter parameters and global arguments to the FFmpeg command line. The processed audio files are saved to a specified directory or the same directory as the input files if no save directory is provided. The DJ_PRODUCED_DATA_DIR environment variable can also be used to set the save directory. If no filter name is provided, the audio files remain unmodified. The operator updates the source file paths in the dataset after processing.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • filter_name – ffmpeg audio filter name.

  • filter_kwargs – keyword arguments passed to the ffmpeg filter.

  • global_args – list of arguments passed to the ffmpeg command line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite the output file.

  • save_dir – the directory where generated audio files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
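Conceptually, `filter_name`, `filter_kwargs`, and `global_args` map onto an ffmpeg command line. The sketch below shows one plausible argv construction for that mapping; it is illustrative only (the operator drives ffmpeg through Python bindings, not by assembling argv strings, and `build_ffmpeg_cmd` is a hypothetical helper).

```python
def build_ffmpeg_cmd(in_path, out_path, filter_name=None,
                     filter_kwargs=None, global_args=None,
                     overwrite_output=True):
    """Assemble an ffmpeg argv list roughly equivalent to the wrapped call."""
    cmd = ["ffmpeg", "-i", in_path]
    if filter_name:
        kwargs = filter_kwargs or {}
        spec = filter_name
        if kwargs:
            # ffmpeg filter options use key=value pairs joined by ':'
            spec += "=" + ":".join(f"{k}={v}" for k, v in kwargs.items())
        cmd += ["-af", spec]
    if global_args:
        cmd += list(global_args)
    if overwrite_output:
        cmd.append("-y")  # overwrite output file without asking
    cmd.append(out_path)
    return cmd

cmd = build_ffmpeg_cmd("in.wav", "out.wav", filter_name="atempo",
                       filter_kwargs={"tempo": 2.0})
```

When no `filter_name` is given, no `-af` option is emitted, which corresponds to the documented behavior of leaving the audio unmodified.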

class data_juicer.ops.mapper.CalibrateQAMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Calibrates question-answer pairs based on reference text using an API model.

This operator uses a specified API model to calibrate question-answer pairs, making them more detailed and accurate. It constructs the input prompt by combining the reference text and the question-answer pair, then sends it to the API for calibration. The output is parsed to extract the calibrated question and answer. The operator retries the API call and parsing up to a specified number of times in case of errors. The default system prompt, input templates, and output pattern can be customized. The operator supports additional parameters for model initialization and sampling.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对【问题】和【回答】进行校准，使其更加详细、准确。\n按照以下格式输出：\n【问题】\n校准后的问题\n【回答】\n校准后的回答'#
DEFAULT_INPUT_TEMPLATE = '{reference}\n{qa_pair}'#
DEFAULT_REFERENCE_TEMPLATE = '【参考信息】\n{}'#
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'#
DEFAULT_OUTPUT_PATTERN = '【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'#
__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt – System prompt for the calibration task.

  • input_template – Template for building the model input.

  • reference_template – Template for formatting the reference text.

  • qa_pair_template – Template for formatting question-answer pairs.

  • output_pattern – Regular expression for parsing the model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.

build_input(sample)[source]#
parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
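The parsing step can be sketched with the default output pattern, whose markers are Chinese for "question" (【问题】) and "answer" (【回答】). This is a simplified stand-in for `parse_output`, assuming the model followed the output format requested by the system prompt.

```python
import re

# Default output pattern of CalibrateQAMapper; DOTALL lets the captured
# question and answer span multiple lines.
DEFAULT_OUTPUT_PATTERN = '【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'

def parse_qa(raw_output):
    match = re.match(DEFAULT_OUTPUT_PATTERN, raw_output.strip(), re.DOTALL)
    if match is None:
        return None, None  # the operator would retry, up to try_num times
    return match.group(1), match.group(2)

q, a = parse_qa("【问题】\nWhat is Data-Juicer?\n【回答】\n"
                "A data processing system for LLMs.")
```

A `None` result corresponds to a parsing failure, which is what triggers the documented retry behavior.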

class data_juicer.ops.mapper.CalibrateQueryMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: CalibrateQAMapper

Calibrate query in question-answer pairs based on reference text.

This operator adjusts the query (question) in a question-answer pair to be more detailed and accurate, while ensuring it can still be answered by the original answer. It uses a reference text to inform the calibration process. The calibration is guided by a system prompt, which instructs the model to refine the question without adding extraneous information. The output is parsed to extract the calibrated query, with any additional content removed.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对问答对中的【问题】进行校准，        使其更加详细、准确，且仍可以由原答案回答。只输出校准后的问题，不要输出多余内容。'#
parse_output(raw_output)[source]#
class data_juicer.ops.mapper.CalibrateResponseMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: CalibrateQAMapper

Calibrate response in question-answer pairs based on reference text.

This mapper calibrates the 'response' part of a question-answer pair using a reference text. It aims to make the response more detailed and accurate while ensuring it still answers the original question. The calibration process uses a default system prompt, which can be customized. The output is stripped of any leading or trailing whitespace.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对问答对中的【回答】进行校准，        使其更加详细、准确，且仍可以回答原问题。只输出校准后的回答，不要输出多余内容。'#
parse_output(raw_output)[source]#
class data_juicer.ops.mapper.ChineseConvertMapper(mode: str = 's2t', *args, **kwargs)[source]#

Bases: Mapper

Mapper to convert Chinese text between Traditional, Simplified, and Japanese Kanji.

This operator converts Chinese text based on the specified mode. It supports conversions between Simplified Chinese, Traditional Chinese (including Taiwan and Hong Kong variants), and Japanese Kanji. The conversion is performed using a pre-defined set of rules. The available modes include 's2t' for Simplified to Traditional, 't2s' for Traditional to Simplified, and other specific variants like 's2tw', 'tw2s', 's2hk', 'hk2s', 's2twp', 'tw2sp', 't2tw', 'tw2t', 'hk2t', 't2hk', 't2jp', and 'jp2t'. The operator processes text in batches and applies the conversion to the specified text key in the samples.

__init__(mode: str = 's2t', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • mode –

    Choose the mode to convert Chinese:

    s2t: Simplified Chinese to Traditional Chinese,

    t2s: Traditional Chinese to Simplified Chinese,

    s2tw: Simplified Chinese to Traditional Chinese (Taiwan Standard),

    tw2s: Traditional Chinese (Taiwan Standard) to Simplified Chinese,

    s2hk: Simplified Chinese to Traditional Chinese (Hong Kong variant),

    hk2s: Traditional Chinese (Hong Kong variant) to Simplified Chinese,

    s2twp: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom,

    tw2sp: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom,

    t2tw: Traditional Chinese to Traditional Chinese (Taiwan Standard),

    tw2t: Traditional Chinese (Taiwan Standard) to Traditional Chinese,

    hk2t: Traditional Chinese (Hong Kong variant) to Traditional Chinese,

    t2hk: Traditional Chinese to Traditional Chinese (Hong Kong variant),

    t2jp: Traditional Chinese Characters (Kyūjitai) to New Japanese Kanji (Shinjitai),

    jp2t: New Japanese Kanji (Shinjitai) to Traditional Chinese Characters (Kyūjitai).

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
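As a toy illustration of what the 's2t' mode does, the sketch below applies a character-level Simplified-to-Traditional mapping. This is purely didactic: the real operator delegates to a conversion library (OpenCC, as an assumption), which also handles multi-character words and regional variants, and `S2T_DEMO` here covers only a handful of hand-picked characters.

```python
# Tiny, hand-picked demo mapping for the 's2t' direction.
S2T_DEMO = {"学": "學", "习": "習", "机": "機", "区": "區"}

def s2t_demo(text):
    """Convert text character by character; unknown characters pass through."""
    return "".join(S2T_DEMO.get(ch, ch) for ch in text)

converted = s2t_demo("机器学习")  # "machine learning" in Simplified Chinese
```

Characters shared by both scripts (such as 器) pass through unchanged, which is also how the real conversion behaves for invariant characters.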
class data_juicer.ops.mapper.CleanCopyrightMapper(*args, **kwargs)[source]#

Bases: Mapper

Cleans copyright comments at the beginning of text samples.

This operator removes copyright comments from the start of text samples. It identifies and strips multiline comments that contain the word "copyright" using a regular expression. It also greedily removes lines starting with comment markers like //, #, or -- at the beginning of the text, as these are often part of copyright headers. The operator processes each sample individually but can handle batches for efficiency.

__init__(*args, **kwargs)[source]#

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
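The two-stage cleaning described above can be sketched as follows. This is a simplified approximation (it only handles `/* ... */` block comments, and the helper name is hypothetical); the operator's own regular expressions are more general.

```python
import re

def clean_copyright(text):
    """Drop a leading block comment mentioning 'copyright', then greedily
    drop leading line comments (//, #, --) and blank lines."""
    block = re.search(r"/\*.*?\*/", text, flags=re.DOTALL)
    if block and "copyright" in block.group(0).lower():
        text = text.replace(block.group(0), "", 1)
    lines = text.splitlines()
    while lines and (not lines[0].strip() or re.match(r"\s*(//|#|--)", lines[0])):
        lines.pop(0)
    return "\n".join(lines)

code = "/* Copyright 2024 Acme */\n// author: someone\nint main() { return 0; }"
cleaned = clean_copyright(code)
```

Note the greedy second stage: it keeps removing leading comment lines until it hits real content, which is why non-copyright header comments can also be stripped.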
class data_juicer.ops.mapper.CleanEmailMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]#

Bases: Mapper

Cleans email addresses from text samples using a regular expression.

This operator removes or replaces email addresses in the text based on a regular expression pattern. By default, it uses a standard pattern to match email addresses, but a custom pattern can be provided. The matched email addresses are replaced with a specified replacement string, which defaults to an empty string. The operation is applied to each text sample in the batch. If no email address is found in a sample, it remains unchanged.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
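The batched substitution boils down to `re.sub` over each text. The pattern below is a typical email regex written for this sketch; it is not necessarily the operator's exact built-in default.

```python
import re

# Illustrative email pattern; the operator ships its own default.
EMAIL_PATTERN = r"[A-Za-z0-9.\-+_]+@[A-Za-z0-9.\-+_]+\.[A-Za-z]+"

def clean_emails(texts, pattern=EMAIL_PATTERN, repl=""):
    """Replace every match of pattern with repl in each text of the batch."""
    return [re.sub(pattern, repl, t) for t in texts]

out = clean_emails(["contact me at jane.doe@example.com please",
                    "no address here"])
```

Texts without a match come back unchanged, matching the documented behavior.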
class data_juicer.ops.mapper.CleanHtmlMapper(*args, **kwargs)[source]#

Bases: Mapper

Cleans HTML code from text samples, converting HTML to plain text.

This operator processes text samples by removing HTML tags and converting HTML elements to a more readable format. Specifically, it replaces <li> and <ol> tags with newline and bullet points. The Selectolax HTML parser is used to extract the text content from the HTML. This operation is performed in a batched manner, making it efficient for large datasets.

__init__(*args, **kwargs)[source]#

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.CleanIpMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]#

Bases: Mapper

Cleans IPv4 and IPv6 addresses from text samples.

This operator removes or replaces IPv4 and IPv6 addresses in the text. It uses a regular expression to identify and clean the IP addresses. By default, it replaces the IP addresses with an empty string, effectively removing them. The operator can be configured with a custom pattern and replacement string. If no pattern is provided, a default pattern for both IPv4 and IPv6 addresses is used. The operator processes samples in batches.


__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
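For the IPv4 half of the default behavior, the cleaning reduces to a single `re.sub` per text. The pattern here is a standard dotted-quad regex chosen for this sketch; the operator's actual default additionally covers IPv6.

```python
import re

# Illustrative IPv4 pattern (0-255 per octet); IPv6 is omitted for brevity.
IPV4_PATTERN = (r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
                r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b")

def clean_ips(texts, pattern=IPV4_PATTERN, repl=""):
    """Replace every IP-address match with repl in each text of the batch."""
    return [re.sub(pattern, repl, t) for t in texts]

result = clean_ips(["server at 192.168.0.1 responded"])
```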
class data_juicer.ops.mapper.CleanLinksMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]#

Bases: Mapper

Mapper to clean links like http/https/ftp in text samples.

This operator removes or replaces URLs and other web links in the text. It uses a regular expression pattern to identify and remove links. By default, it replaces the identified links with an empty string, effectively removing them. The operator can be customized with a different pattern and replacement string. It processes samples in batches and modifies the text in place. If no links are found in a sample, it is left unchanged.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
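The link-stripping step follows the same `re.sub` shape. The scheme-based pattern below is a deliberately simple stand-in; the operator's built-in default is more comprehensive.

```python
import re

# Illustrative pattern for http/https/ftp URLs: a scheme followed by
# any run of non-whitespace characters.
LINK_PATTERN = r"(?:https?|ftp)://\S+"

def clean_links(texts, pattern=LINK_PATTERN, repl=""):
    """Replace every link match with repl in each text of the batch."""
    return [re.sub(pattern, repl, t) for t in texts]

result = clean_links(["docs: https://example.com/a?b=1 end"])
```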
class data_juicer.ops.mapper.DetectCharacterAttributesMapper(detect_character_locations_mapper_args: Dict | None = {}, *args, **kwargs)[source]#

Bases: Mapper

Takes an image, a caption, and main character names as input to extract the characters' attributes.

Extracts and classifies attributes of main characters in an image using a combination of object detection, image-text matching, and language model inference. It first locates the main characters in the image using YOLOE and then uses a Hugging Face tokenizer and a LLaMA-based model to classify each character into categories like 'object', 'animal', 'person', 'text', or 'other'. The operator also extracts detailed features such as color, material, and action for each character. The final output includes bounding boxes and a list of characteristics for each main character. The results are stored in the 'main_character_attributes_list' field under the 'meta' key.

__init__(detect_character_locations_mapper_args: Dict | None = {}, *args, **kwargs)[source]#

Initialization method.

Parameters:

detect_character_locations_mapper_args – Arguments for the underlying character-location detection. Controls the thresholds for locating the main characters. The default empty dict will use fixed values: default mllm_mapper_args, default image_text_matching_filter_args, yoloe_path="yoloe-11l-seg.pt", iou_threshold=0.7, matching_score_threshold=0.4.

process_single(samples, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
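The shape of the result can be sketched as follows. The field name comes from the description above, but the per-character record layout (`name`, `bbox`, `category`, `characteristics`) and all values are placeholders for illustration, not the operator's exact schema.

```python
# Hypothetical layout of the per-character records stored under the
# 'meta' key; all values are placeholders.
sample = {"text": "a red car parked next to a dog", "meta": {}}

detected = [
    {"name": "car", "bbox": [12, 30, 220, 180],
     "category": "object", "characteristics": ["red", "parked"]},
    {"name": "dog", "bbox": [240, 90, 330, 200],
     "category": "animal", "characteristics": ["standing"]},
]
sample["meta"]["main_character_attributes_list"] = detected
```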

class data_juicer.ops.mapper.DetectCharacterLocationsMapper(mllm_mapper_args: Dict | None = {}, image_text_matching_filter_args: Dict | None = {}, yoloe_path='yoloe-11l-seg.pt', iou_threshold=0.7, matching_score_threshold=0.4, *args, **kwargs)[source]#

Bases: Mapper

Given an image and a list of main character names, extract the bounding boxes for each present character.

Detects and extracts bounding boxes for main characters in an image. This operator uses a YOLOE model to detect the presence of these characters, then generates and refines bounding boxes for each detected character using a multimodal language model and an image-text matching filter. The final bounding boxes are stored in the metadata under 'main_character_locations_list'. Two bounding boxes are considered overlapping if their Intersection over Union (IoU) score exceeds a specified threshold. Additionally, a matching score threshold determines whether a cropped image region matches the character's name. The operator utilizes a Hugging Face tokenizer and a BLIP model for image-text matching.

__init__(mllm_mapper_args: Dict | None = {}, image_text_matching_filter_args: Dict | None = {}, yoloe_path='yoloe-11l-seg.pt', iou_threshold=0.7, matching_score_threshold=0.4, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • mllm_mapper_args – Arguments for the multimodal language model mapper. Controls the generation of captions for bounding box regions. The default empty dict will use fixed values: max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, hf_model="llava-hf/llava-v1.6-vicuna-7b-hf".

  • image_text_matching_filter_args – Arguments for the image-text matching filter. Controls the matching between cropped image regions and text descriptions. The default empty dict will use fixed values: min_score=0.1, max_score=1.0, hf_blip="Salesforce/blip-itm-base-coco", num_proc=1.

  • yoloe_path – The path to the YOLOE model.

  • iou_threshold – Two bounding boxes from different models are considered overlapping when their IoU score is higher than the iou_threshold.

  • matching_score_threshold – If the matching score between the cropped image and the character's name exceeds the matching_score_threshold, they are considered a match.

iou_cal(bbox1, bbox2)[source]#
process_single(samples, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
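The iou_cal step is a standard Intersection-over-Union computation. The sketch below assumes axis-aligned boxes in `[x1, y1, x2, y2]` corner format, which is an assumption about the operator's convention rather than a documented fact.

```python
def iou(bbox1, bbox2):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] corner format."""
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(bbox1[0], bbox2[0]), max(bbox1[1], bbox2[1])
    ix2, iy2 = min(bbox1[2], bbox2[2]), min(bbox1[3], bbox2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    union = area1 + area2 - inter
    return inter / union if union else 0.0

score = iou([0, 0, 10, 10], [5, 5, 15, 15])  # intersection 25, union 175
```

With iou_threshold=0.7, these two boxes (IoU ≈ 0.14) would not be treated as overlapping.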

class data_juicer.ops.mapper.DetectMainCharacterMapper(mllm_mapper_args: Dict | None = {}, filter_min_character_num: int = 0, *args, **kwargs)[source]#

Bases: Mapper

Extract all main character names based on the given image and its caption.

This operator uses a multimodal language model to generate a description of the main characters in the given image. It then parses the generated JSON to extract the list of main characters. The operator filters out samples where the number of main characters is less than the specified threshold. The default arguments for the multimodal language model include using a Hugging Face model with specific generation parameters. The key metric, main_character_list, is stored in the sample's metadata.

__init__(mllm_mapper_args: Dict | None = {}, filter_min_character_num: int = 0, *args, **kwargs)[source]#

Initialization.

Parameters:
  • mllm_mapper_args – Arguments for the multimodal language model mapper. Controls the generation of captions for bounding box regions. The default empty dict will use fixed values: max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, hf_model="llava-hf/llava-v1.6-vicuna-7b-hf".

  • filter_min_character_num – Filters out samples where the number of main characters in the image is less than this threshold.

process_single(samples, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
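The parse-then-filter logic can be sketched as below. The assumption that the model emits a bare JSON list of names is illustrative; the actual schema produced by the MLLM may differ.

```python
import json

def parse_main_characters(raw_output, filter_min_character_num=0):
    """Parse a JSON list of character names and apply the minimum-count
    filter. Returns None when the sample would be filtered out."""
    characters = json.loads(raw_output)
    if len(characters) < filter_min_character_num:
        return None
    return characters

kept = parse_main_characters('["a red car", "a dog"]',
                             filter_min_character_num=2)
dropped = parse_main_characters('["a dog"]', filter_min_character_num=2)
```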

class data_juicer.ops.mapper.DialogIntentDetectionMapper(api_model: str = 'gpt-4o', intent_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_intent_labels', analysis_key: str = 'dialog_intent_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Generates the user's intent labels in a dialog by analyzing the history, query, and response.

This operator processes a dialog to identify and label the user's intent. It uses a predefined system prompt and templates to build input prompts for an API call. The API model (e.g., GPT-4o) is used to analyze the dialog and generate intent labels and analysis. The results are stored in the meta field under 'dialog_intent_labels' and 'dialog_intent_labels_analysis'. The operator supports customizing the system prompt, templates, and patterns for parsing the API response. If intent candidates are provided, they are included in the input prompt. The operator retries the API call up to a specified number of times if there are errors.

DEFAULT_SYSTEM_PROMPT = '请判断用户和LLM多轮对话中用户的意图。\n要求：\n- 需要先进行分析，然后列出用户所具有的意图，下面是一个样例，请模仿样例格式输出。\n用户：你好，我最近对人工智能很感兴趣，能给我讲讲什么是机器学习吗？\n意图分析：用户在请求信息，希望了解有关机器学习的基础知识。\n意图类别：信息查找\nLLM：你好！当然可以。机器学习是一种人工智能方法，允许计算机通过数据自动改进和学习。\n用户：听起来很有趣，有没有推荐的入门书籍或资料？\n意图分析：用户在请求建议，希望获取关于机器学习的入门资源。\n意图类别：请求建议\nLLM：有很多不错的入门书籍和资源。一本常被推荐的书是《Python机器学习实践》（Python Machine Learning），它涵盖了基础知识和一些实际案例。此外，您还可以参考Coursera或edX上的在线课程，这些课程提供了系统的学习路径。\n用户：谢谢你的建议！我还想知道，学习机器学习需要什么样的数学基础？\n意图分析：用户在寻求信息，希望了解学习机器学习所需的前提条件，特别是在数学方面。\n意图类别：信息查找\nLLM：学习机器学习通常需要一定的数学基础，特别是线性代数、概率论和统计学。这些数学领域帮助理解算法的工作原理和数据模式分析。如果您对这些主题不太熟悉，建议先从相关基础书籍或在线资源开始学习。\n用户：明白了，我会先补习这些基础知识。再次感谢你的帮助！\n意图分析：用户表达感谢，并表示计划付诸行动来补充所需的基础知识。\n意图类别：其他'#
DEFAULT_QUERY_TEMPLATE = '用户：{query}\n'#
DEFAULT_RESPONSE_TEMPLATE = 'LLM：{response}\n'#
DEFAULT_CANDIDATES_TEMPLATE = '备选意图类别：[{candidate_str}]'#
DEFAULT_ANALYSIS_TEMPLATE = '意图分析：{analysis}\n'#
DEFAULT_LABELS_TEMPLATE = '意图类别：{labels}\n'#
DEFAULT_ANALYSIS_PATTERN = '意图分析：(.*?)\n'#
DEFAULT_LABELS_PATTERN = '意图类别：(.*?)($|\n)'#
__init__(api_model: str = 'gpt-4o', intent_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_intent_labels', analysis_key: str = 'dialog_intent_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • intent_candidates – The output intent candidates. Uses the intent labels of the open domain if it is None.

  • max_round – The maximum number of rounds in the dialog used to build the prompt.

  • labels_key – The key name in the meta field to store the output labels. Defaults to 'dialog_intent_labels'.

  • analysis_key – The key name in the meta field to store the corresponding analysis. Defaults to 'dialog_intent_labels_analysis'.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt – System prompt for the task.

  • query_template – Template for the query part to build the input prompt.

  • response_template – Template for the response part to build the input prompt.

  • candidate_template – Template for intent candidates to build the input prompt.

  • analysis_template – Template for the analysis part to build the input prompt.

  • labels_template – Template for the labels to build the input prompt.

  • analysis_pattern – Pattern to parse the returned intent analysis.

  • labels_pattern – Pattern to parse the returned intent labels.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.

build_input(history, query)[source]#
parse_output(response)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
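The output-parsing step can be sketched with the operator's default patterns, whose markers are Chinese for "intent analysis" (意图分析) and "intent category" (意图类别). This is a simplified stand-in for `parse_output`.

```python
import re

# Default parsing patterns of DialogIntentDetectionMapper.
ANALYSIS_PATTERN = '意图分析：(.*?)\n'
LABELS_PATTERN = '意图类别：(.*?)($|\n)'

def parse_intent(response):
    """Extract the analysis and label lines from the model response."""
    analysis_match = re.search(ANALYSIS_PATTERN, response)
    labels_match = re.search(LABELS_PATTERN, response)
    analysis = analysis_match.group(1) if analysis_match else ""
    labels = labels_match.group(1) if labels_match else ""
    return analysis, labels

analysis, labels = parse_intent("意图分析：用户在请求信息。\n意图类别：信息查找")
```

An empty result for either field corresponds to a parsing failure, which is what triggers the documented retries.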

class data_juicer.ops.mapper.DialogSentimentDetectionMapper(api_model: str = 'gpt-4o', sentiment_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_sentiment_labels', analysis_key: str = 'dialog_sentiment_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Generates sentiment labels and analysis for user queries in a dialog.

This operator processes a dialog to detect and label the sentiments expressed by the user. It uses the provided history, query, and response keys to construct prompts for an API call. The API returns sentiment analysis and labels, which are then parsed and stored in the sample's metadata under the 'dialog_sentiment_labels' and 'dialog_sentiment_labels_analysis' keys. The operator supports custom templates and patterns for prompt construction and output parsing. If no sentiment candidates are provided, it uses open-domain sentiment labels. The operator retries the API call up to a specified number of times in case of errors.

DEFAULT_SYSTEM_PROMPT = '请判断用户和LLM多轮对话中用户所具有的情绪。\n要求：\n- 需要先进行分析，然后罗列用户所具有的情绪，下面是一个样例，请模仿样例格式输出。\n用户：最近工作压力好大，我觉得整个人都快被压垮了。\n情感分析：用户的言语中透露出明显的压力和疲惫感，可能还夹杂着一些无助和焦虑。\n情感类别：压力、疲惫、无助、焦虑\nLLM：听起来你真的承受了很多，面临这种情况确实不容易。有没有考虑过找一些放松的方式，比如听音乐或者散步来减轻压力呢？\n用户：试过了，但是好像没什么效果，每天的事情都堆积如山。\n情感分析：用户感到无力解决现状，有挫败感，并且对尝试放松的方式失去信心。\n情感类别：无力、挫败\nLLM：我理解你的感受，有时候压力积累到一定程度确实让人难以承受。或许你可以尝试规划一下时间，把任务分成小块来完成，这样可能会减少一些压力感。\n用户：这个主意不错，我会试着让自己更有条理一些，谢谢你的建议。\n情感分析：用户对建议表现出认同和感激，同时展现出试图积极面对问题的态度。\n情感类别：认同、感激、积极\nLLM：不用谢，我很高兴能帮到你。记得给自己一些时间去适应新的计划，有任何需要随时可以跟我说哦！\n'#
DEFAULT_QUERY_TEMPLATE = '用户：{query}\n'#
DEFAULT_RESPONSE_TEMPLATE = 'LLM：{response}\n'#
DEFAULT_CANDIDATES_TEMPLATE = '备选情感类别：[{candidate_str}]'#
DEFAULT_ANALYSIS_TEMPLATE = '情感分析：{analysis}\n'#
DEFAULT_LABELS_TEMPLATE = '情感类别：{labels}\n'#
DEFAULT_ANALYSIS_PATTERN = '情感分析：(.*?)\n'#
DEFAULT_LABELS_PATTERN = '情感类别：(.*?)($|\n)'#
__init__(api_model: str = 'gpt-4o', sentiment_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_sentiment_labels', analysis_key: str = 'dialog_sentiment_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model โ€“ API model name.

  • sentiment_candidates โ€“ The output sentiment candidates. Use open-domain sentiment labels if it is None.

  • max_round โ€“ The max num of round in the dialog to build the prompt.

  • labels_key โ€“ The key name in the meta field to store the output labels. It is โ€˜dialog_sentiment_labelsโ€™ in default.

  • analysis_key โ€“ The key name in the meta field to store the corresponding analysis. It is โ€˜dialog_sentiment_labels_analysisโ€™ in default.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt โ€“ System prompt for the task.

  • query_template โ€“ Template for query part to build the input prompt.

  • response_template โ€“ Template for response part to build the input prompt.

  • candidate_template โ€“ Template for sentiment candidates to build the input prompt.

  • analysis_template โ€“ Template for analysis part to build the input prompt.

  • labels_template โ€“ Template for labels part to build the input prompt.

  • analysis_pattern – Pattern to parse the returned sentiment analysis.

  • labels_pattern – Pattern to parse the returned sentiment labels.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • model_params โ€“ Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

build_input(history, query)[source]#
parse_output(response)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.DialogSentimentIntensityMapper(api_model: str = 'gpt-4o', max_round: Annotated[int, Ge(ge=0)] = 10, *, intensities_key: str = 'dialog_sentiment_intensity', analysis_key: str = 'dialog_sentiment_intensity_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, analysis_template: str | None = None, intensity_template: str | None = None, analysis_pattern: str | None = None, intensity_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Mapper to predict userโ€™s sentiment intensity in a dialog, ranging from -5 to 5.

This operator analyzes the sentiment of user queries in a dialog and outputs a list of sentiment intensities and corresponding analyses. The sentiment intensity ranges from -5 (extremely negative) to 5 (extremely positive), with 0 indicating a neutral sentiment. The analysis is based on the provided history, query, and response keys. The default system prompt and templates guide the sentiment analysis process. The results are stored in the meta field under โ€˜dialog_sentiment_intensityโ€™ for intensities and โ€˜dialog_sentiment_intensity_analysisโ€™ for analyses. The operator uses an API model to generate the sentiment analysis, with configurable retry attempts and sampling parameters.

DEFAULT_SYSTEM_PROMPT = '่ฏทๅˆคๆ–ญ็”จๆˆทๅ’ŒLLMๅคš่ฝฎๅฏน่ฏไธญ็”จๆˆท็š„ๆƒ…็ปชๅ˜ๅŒ–ใ€‚\n่ฆๆฑ‚๏ผš\n- ็”จๆˆทๆƒ…็ปชๅ€ผๆ˜ฏ-5ๅˆฐ5ไน‹้—ด็š„ๆ•ดๆ•ฐ๏ผŒ-5่กจ็คบๆžๅบฆ่ดŸ้ข๏ผŒ5่กจ็คบๆžๅบฆๆญฃ้ข๏ผŒ-5ๅˆฐ5ไน‹้—ดๆ•ฐๅ€ผ่กจ็คบๆƒ…็ปชไปŽ่ดŸ้ข้€ๆธๅˆฐๆญฃ้ข็š„ๅ˜ๅŒ–่ฟ‡็จ‹๏ผŒ0ไปฃ่กจๆƒ…็ปชๅ‘ˆไธญๆ€งใ€‚\n- ๅช่พ“ๅ‡บๅฝ“่ฝฎๅฏน่ฏ็š„ๅˆ†ๆž๏ผŒไธ่ฆ็ปง็ปญๆž„้€ ๅฏน่ฏใ€‚\n- ้œ€่ฆๅ…ˆ่ฟ›่กŒๅˆ†ๆž๏ผŒ็„ถๅŽ็กฎๅฎš็”จๆˆท็š„ๆƒ…็ปชๅ€ผ๏ผŒไธ‹้ขๆ˜ฏไธ€ไธชๆ ทไพ‹๏ผŒ่ฏทๆจกไปฟๆ ทไพ‹ๆ ผๅผ่พ“ๅ‡บใ€‚\n็”จๆˆท๏ผšไฝ ๅฅฝ๏ผŒๆˆ‘ๅฏนๅฏๆŒ็ปญๅ‘ๅฑ•็š„ๅฎšไน‰ๆœ‰็‚นๆจก็ณŠ๏ผŒๅธฎๆˆ‘่งฃ้‡Šไธ€ไธ‹๏ผŸ\nๆƒ…็ปชๅˆ†ๆž๏ผšๅˆšๅผ€ๅง‹๏ผŒ่ฟ˜ๆฒกๅพ—ๅˆฐLLMๅ›žๅค๏ผŒ็”จๆˆทๆƒ…็ปชๅ‘ˆไธญๆ€งใ€‚\nๆƒ…็ปชๅ€ผ๏ผš0\nLLM๏ผšๅฝ“็„ถๅฏไปฅ๏ผๅฏๆŒ็ปญๅ‘ๅฑ•ๆ˜ฏๆŒ‡ๅœจๆปก่ถณๅฝ“ไปฃไบบ็š„้œ€ๆฑ‚็š„ๅŒๆ—ถ๏ผŒไธๆŸๅฎณๅญๅญ™ๅŽไปฃๆปก่ถณๅ…ถ่‡ช่บซ้œ€ๆฑ‚็š„่ƒฝๅŠ›็š„ๅ‘ๅฑ•ๆจกๅผใ€‚ๅฎƒๅŒ…ๆ‹ฌ็ปๆตŽๅ‘ๅฑ•ใ€็คพไผšๅ‘ๅฑ•ๅ’Œ็ŽฏๅขƒไฟๆŠคไธ‰ไธชไธป่ฆๆ–น้ขใ€‚้€š่ฟ‡ๅˆ็†ๅˆฉ็”จ่ต„ๆบๅ’ŒไฟๆŠค็Žฏๅขƒ๏ผŒๆˆ‘ไปฌๅฏไปฅ็กฎไฟๆœชๆฅ็š„ไธ–ไปฃไนŸ่ƒฝไบซๆœ‰ๅฅๅ…จ็š„็”Ÿๆ€็ณป็ปŸๅ’Œ็ปๆตŽๅˆถๅบฆใ€‚\n็”จๆˆท๏ผš่ฐข่ฐขไฝ ็š„่งฃ้‡Š๏ผ้‚ฃไฝ ่ƒฝๅ‘Š่ฏ‰ๆˆ‘ไธ€ไบ›ๆ™ฎ้€šไบบๅฏไปฅ้‡‡ๅ–็š„ๅฏๆŒ็ปญ็”Ÿๆดปๆ–นๅผๅ—๏ผŸ\nๆƒ…็ปชๅˆ†ๆž๏ผšๅฏนๅ›ž็ญ”ๆ„Ÿๅˆฐๆปกๆ„๏ผŒๆƒ…็ปชๆญฃ้ขใ€‚\nๆƒ…็ปชๅ€ผ๏ผš1\nLLM๏ผšๅฝ“็„ถๅฏไปฅ๏ผŒๆ™ฎ้€šไบบๅฏไปฅ้€š่ฟ‡ๅ‡ๅฐ‘ไธ€ๆฌกๆ€งไบงๅ“็š„ไฝฟ็”จใ€้€‰ๆ‹ฉๅ…ฌๅ…ฑไบค้€šๆˆ–ๆ‹ผ่ฝฆใ€่Š‚็บฆ็”จๆฐดใ€ไปฅๅŠๆ”ฏๆŒๆœฌๅœฐๅ’ŒๅฏๆŒ็ปญๅ‘ๅฑ•็š„ไผไธš็ญ‰ๆ–นๅผๆฅ่ทต่กŒๅฏๆŒ็ปญ็”Ÿๆดปใ€‚ๆญคๅค–๏ผŒๅ…ณๆณจๅžƒๅœพๅˆ†็ฑปๅ’Œๅคš็”จ็”ตๅญ่ดฆๅ•ไนŸๆ˜ฏไธ้”™็š„้€‰ๆ‹ฉใ€‚\n็”จๆˆท๏ผšไฝ ๆๅˆฐๆ”ฏๆŒๆœฌๅœฐไผไธš๏ผŒ่ฟ™ไธ€็‚นๆˆ‘ๅพˆๆ„Ÿๅ…ด่ถฃใ€‚่ƒฝ่ฏฆ็ป†่ฏด่ฏดไธบไป€ไนˆ่ฟ™ๅฏนๅฏๆŒ็ปญๅ‘ๅฑ•ๆœ‰ไฟƒ่ฟ›ไฝœ็”จๅ—๏ผŸ\nๆƒ…็ปชๅˆ†ๆž๏ผš่ง‰ๅพ—ๅ›ž็ญ”ๅฎž็”จไธ”ๅ…ทไฝ“๏ผŒๆƒ…็ปช่ฟ›ไธ€ๆญฅ่ฝฌๅฅฝใ€‚\nๆƒ…็ปชๅ€ผ๏ผš2\nLLM๏ผšๅ‘ƒ๏ผŒๆˆ‘ๆœ€่ฟ‘ๅ‘็Žฐไบ†ไธ€้ƒจๆ–ฐ็”ตๅฝฑ๏ผŒ่ฎฒ่ฟฐไบ†ไธ€ไธชๅ…ณไบŽๅค–ๆ˜Ÿไบบๅ’Œๅœฐ็ƒๅœŸ่‘—ๅˆไฝœไฟๆŠค็Žฏๅขƒ็š„ๆ•…ไบ‹ใ€‚่™ฝ็„ถๅฎƒๆ˜ฏ็ง‘ๅนป็‰‡๏ผŒไฝ†ๅพˆๆœ‰ๅฏๅ‘ๆ€ง๏ผŒๆŽจ่ไฝ ๅŽป็œ‹็œ‹ใ€‚\n็”จๆˆท๏ผšไป€ไนˆๅ—๏ผŒๆ นๆœฌๆ˜ฏ็ญ”้žๆ‰€้—ฎใ€‚\nๆƒ…็ปชๅˆ†ๆž๏ผšLLMๆฒกๆœ‰ๅ›žๅบ”้—ฎ้ข˜่€Œๆ˜ฏๆๅˆฐๆ— ๅ…ณๅ†…ๅฎน๏ผŒๅฏผ่‡ด็”จๆˆทๆƒ…็ปช็›ด็บฟไธ‹้™ใ€‚\nๆƒ…็ปชๅ€ผ๏ผš-2\nLLM๏ผšๆŠฑๆญ‰ๅˆšๆ‰็š„ๅ้ข˜๏ผๆ”ฏๆŒๆœฌๅœฐไผไธšๆœ‰ๅŠฉไบŽๅ‡ๅฐ‘้•ฟ้€”่ฟ่พ“ไบง็”Ÿ็š„็ขณ่ถณ่ฟน๏ผŒไฝฟไพ›ๅบ”้“พๆ›ดๅŠ ็Žฏไฟใ€‚ๆญคๅค–๏ผŒๆœฌๅœฐไผไธšไนŸๆ›ดๆœ‰ๅฏ่ƒฝ้‡‡็”จๅฏๆŒ็ปญ็š„็”Ÿไบงๆ–นๅผ๏ผŒๅŒๆ—ถไฟƒ่ฟ›็คพๅŒบ็ปๆตŽ็š„็น่ฃใ€‚\n็”จๆˆท๏ผš่ฟ˜่กŒๅง๏ผŒ็ฎ—ไฝ ่ƒฝๅคŸๆŽฐๅ›žๆฅใ€‚\nๆƒ…็ปชๅˆ†ๆž๏ผš้—ฎ้ข˜ๅพ—ๅˆฐ่งฃ็ญ”๏ผŒ้—ฎ้ข˜ๅ้ข˜ๅพ—ๅˆฐ็บ ๆญฃ๏ผŒๆƒ…็ปช็จๆœ‰ๅฅฝ่ฝฌใ€‚\nๆƒ…็ปชๅ€ผ๏ผš-1\n'#
DEFAULT_QUERY_TEMPLATE = '็”จๆˆท๏ผš{query}\n'#
DEFAULT_RESPONSE_TEMPLATE = 'LLM๏ผš{response}\n'#
DEFAULT_ANALYSIS_TEMPLATE = 'ๆƒ…็ปชๅˆ†ๆž๏ผš{analysis}\n'#
DEFAULT_INTENSITY_TEMPLATE = 'ๆƒ…็ปชๅ€ผ๏ผš{intensity}\n'#
DEFAULT_ANALYSIS_PATTERN = 'ๆƒ…็ปชๅˆ†ๆž๏ผš(.*?)\n'#
DEFAULT_INTENSITY_PATTERN = 'ๆƒ…็ปชๅ€ผ๏ผš(.*?)($|\n)'#
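The intensity pattern above extracts the rating from a reply, which can then be converted to an integer and range-checked against the documented [-5, 5] scale (the reply text is hypothetical):

```python
import re

# Default parsing pattern of this operator, as shown above.
INTENSITY_PATTERN = 'ๆƒ…็ปชๅ€ผ๏ผš(.*?)($|\n)'

# A hypothetical model reply in the prompted format.
response = 'ๆƒ…็ปชๅˆ†ๆž๏ผš็”จๆˆทๅฏนๅ›ž็ญ”ๆ„Ÿๅˆฐๆปกๆ„ใ€‚\nๆƒ…็ปชๅ€ผ๏ผš2\n'

intensity = int(re.search(INTENSITY_PATTERN, response).group(1))
print(intensity)  # 2, within the documented [-5, 5] range
```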
__init__(api_model: str = 'gpt-4o', max_round: Annotated[int, Ge(ge=0)] = 10, *, intensities_key: str = 'dialog_sentiment_intensity', analysis_key: str = 'dialog_sentiment_intensity_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, analysis_template: str | None = None, intensity_template: str | None = None, analysis_pattern: str | None = None, intensity_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model โ€“ API model name.

  • max_round – The maximum number of dialog rounds used to build the prompt.

  • intensities_key – The key name in the meta field to store the output sentiment intensities. Defaults to ‘dialog_sentiment_intensity’.

  • analysis_key – The key name in the meta field to store the corresponding analysis. Defaults to ‘dialog_sentiment_intensity_analysis’.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt โ€“ System prompt for the task.

  • query_template โ€“ Template for query part to build the input prompt.

  • response_template โ€“ Template for response part to build the input prompt.

  • analysis_template โ€“ Template for analysis part to build the input prompt.

  • intensity_template โ€“ Template for intensity part to build the input prompt.

  • analysis_pattern – Pattern to parse the returned sentiment analysis.

  • intensity_pattern – Pattern to parse the returned sentiment intensity.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • model_params โ€“ Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

build_input(history, query)[source]#
parse_output(response)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.DialogTopicDetectionMapper(api_model: str = 'gpt-4o', topic_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_topic_labels', analysis_key: str = 'dialog_topic_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Generates userโ€™s topic labels and analysis in a dialog.

This operator processes a dialog to detect and label the topics discussed by the user. It takes input from history_key, query_key, and response_key and outputs lists of labels and analysis for each query in the dialog. The operator uses a predefined system prompt and templates to build the input prompt for the API call. It supports customizing the system prompt, templates, and patterns for parsing the API response. The results are stored in the meta field under the keys specified by labels_key and analysis_key. If these keys already exist in the meta field, the operator skips processing. The operator retries the API call up to try_num times in case of errors.

DEFAULT_SYSTEM_PROMPT = '่ฏทๅˆคๆ–ญ็”จๆˆทๅ’ŒLLMๅคš่ฝฎๅฏน่ฏไธญ็”จๆˆทๆ‰€่ฎจ่ฎบ็š„่ฏ้ข˜ใ€‚\n่ฆๆฑ‚๏ผš\n- ้’ˆๅฏน็”จๆˆท็š„ๆฏไธชquery๏ผŒ้œ€่ฆๅ…ˆ่ฟ›่กŒๅˆ†ๆž๏ผŒ็„ถๅŽๅˆ—ๅ‡บ็”จๆˆทๆญฃๅœจ่ฎจ่ฎบ็š„่ฏ้ข˜๏ผŒไธ‹้ขๆ˜ฏไธ€ไธชๆ ทไพ‹๏ผŒ่ฏทๆจกไปฟๆ ทไพ‹ๆ ผๅผ่พ“ๅ‡บใ€‚\n็”จๆˆท๏ผšไฝ ๅฅฝ๏ผŒไปŠๅคฉๆˆ‘ไปฌๆฅ่Š่Š็งฆๅง‹็š‡ๅงใ€‚\n่ฏ้ข˜ๅˆ†ๆž๏ผš็”จๆˆทๆๅˆฐ็งฆๅง‹็š‡๏ผŒ่ฟ™ๆ˜ฏไธญๅ›ฝๅކๅฒไธŠ็ฌฌไธ€ไฝ็š‡ๅธใ€‚\n่ฏ้ข˜็ฑปๅˆซ๏ผšๅކๅฒ\nLLM๏ผšๅฝ“็„ถๅฏไปฅ๏ผŒ็งฆๅง‹็š‡ๆ˜ฏไธญๅ›ฝๅކๅฒไธŠ็ฌฌไธ€ไธช็ปŸไธ€ๅ…จๅ›ฝ็š„็š‡ๅธ๏ผŒไป–ๅœจๅ…ฌๅ…ƒๅ‰221ๅนดๅปบ็ซ‹ไบ†็งฆๆœ๏ผŒๅนถ้‡‡ๅ–ไบ†ไธ€็ณปๅˆ—้‡่ฆ็š„ๆ”น้ฉๆŽชๆ–ฝ๏ผŒๅฆ‚็ปŸไธ€ๆ–‡ๅญ—ใ€ๅบฆ้‡่กกๅ’Œ่ดงๅธ็ญ‰ใ€‚\n็”จๆˆท๏ผš็งฆๅง‹็š‡ไฟฎๅปบ็š„้•ฟๅŸŽๅ’Œ็Žฐๅœจ็š„้•ฟๅŸŽๆœ‰ไป€ไนˆๅŒบๅˆซ๏ผŸ\n่ฏ้ข˜ๅˆ†ๆž๏ผš็”จๆˆทๆๅˆฐ็งฆๅง‹็š‡ไฟฎๅปบ็š„้•ฟๅŸŽ๏ผŒๅนถๅฐ†ๅ…ถไธŽ็Žฐไปฃ้•ฟๅŸŽ่ฟ›่กŒๆฏ”่พƒ๏ผŒๆถ‰ๅŠๅปบ็ญ‘ๅކๅฒๅ’Œๅœฐ็†ไฝ็ฝฎใ€‚\n่ฏ้ข˜็ฑปๅˆซ๏ผšๅކๅฒ\nLLM๏ผš็งฆๅง‹็š‡ๆ—ถๆœŸไฟฎๅปบ็š„้•ฟๅŸŽไธป่ฆๆ˜ฏไธบไบ†ๆŠตๅพกๅŒ—ๆ–นๆธธ็‰งๆฐ‘ๆ—็š„ๅ…ฅไพต๏ผŒๅฎƒ็š„่ง„ๆจกๅ’ŒไฟฎๅปบๆŠ€ๆœฏ็›ธๅฏน่พƒไธบ็ฎ€้™‹ใ€‚็Žฐไปฃไบบๆ‰€็œ‹ๅˆฐ็š„้•ฟๅŸŽๅคง้ƒจๅˆ†ๆ˜ฏๆ˜Žๆœๆ—ถๆœŸไฟฎๅปบๅ’Œๆ‰ฉๅปบ็š„๏ผŒๆ˜Ž้•ฟๅŸŽไธไป…่ง„ๆจกๆ›ดๅคงใ€็ป“ๆž„ๆ›ดๅšๅ›บ๏ผŒ่€Œไธ”ไฟๅญ˜ๅพ—ๆฏ”่พƒๅฎŒๅฅฝใ€‚\n็”จๆˆท๏ผšๆœ‰ๆ„ๆ€๏ผŒ้‚ฃไนˆ้•ฟๅŸŽ็š„ๅ…ทไฝ“ไฝ็ฝฎๅœจๅ“ชไบ›็œไปฝๅ‘ข๏ผŸ\n่ฏ้ข˜ๅˆ†ๆž๏ผš็”จๆˆท่ฏข้—ฎ้•ฟๅŸŽ็š„ๅ…ทไฝ“ไฝ็ฝฎ๏ผŒๆถ‰ๅŠๅˆฐๅœฐ็†็Ÿฅ่ฏ†ใ€‚\n่ฏ้ข˜็ฑปๅˆซ๏ผšๅœฐ็†\nLLM๏ผš้•ฟๅŸŽๆจช่ทจไธญๅ›ฝๅŒ—ๆ–นๅคšไธช็œไปฝ๏ผŒไธป่ฆๅŒ…ๆ‹ฌๆฒณๅŒ—ใ€ๅฑฑ่ฅฟใ€ๅ†…่’™ๅคใ€ๅฎๅคใ€้™•่ฅฟใ€็”˜่‚ƒๅ’ŒๅŒ—ไบฌ็ญ‰ใ€‚ๆฏไธ€ๆฎต้•ฟๅŸŽ้ƒฝๅปบๅœจๅ…ณ้”ฎ็š„ๆˆ˜็•ฅไฝ็ฝฎ๏ผŒไปฅไพฟๆœ€ๅคง้™ๅบฆๅœฐๅ‘ๆŒฅๅ…ถ้˜ฒๅพกไฝœ็”จใ€‚\n'#
DEFAULT_QUERY_TEMPLATE = '็”จๆˆท๏ผš{query}\n'#
DEFAULT_RESPONSE_TEMPLATE = 'LLM๏ผš{response}\n'#
DEFAULT_CANDIDATES_TEMPLATE = 'ๅค‡้€‰่ฏ้ข˜็ฑปๅˆซ๏ผš[{candidate_str}]'#
DEFAULT_ANALYSIS_TEMPLATE = '่ฏ้ข˜ๅˆ†ๆž๏ผš{analysis}\n'#
DEFAULT_LABELS_TEMPLATE = '่ฏ้ข˜็ฑปๅˆซ๏ผš{labels}\n'#
DEFAULT_ANALYSIS_PATTERN = '่ฏ้ข˜ๅˆ†ๆž๏ผš(.*?)\n'#
DEFAULT_LABELS_PATTERN = '่ฏ้ข˜็ฑปๅˆซ๏ผš(.*?)($|\n)'#
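The query/response templates above are combined round by round to form the model input. A plausible sketch of this assembly with `max_round` truncation (simplified; the operator's actual build_input may differ):

```python
# Default templates of this operator, as shown above.
QUERY_TEMPLATE = '็”จๆˆท๏ผš{query}\n'
RESPONSE_TEMPLATE = 'LLM๏ผš{response}\n'

def build_dialog_prompt(history, query, max_round=10):
    # Keep only the last `max_round` (query, response) pairs of history.
    recent = history[-max_round:] if max_round > 0 else []
    prompt = ''
    for q, r in recent:
        prompt += QUERY_TEMPLATE.format(query=q)
        prompt += RESPONSE_TEMPLATE.format(response=r)
    # The current query is appended last, with no response yet.
    return prompt + QUERY_TEMPLATE.format(query=query)

print(build_dialog_prompt([('ไฝ ๅฅฝ', 'ไฝ ๅฅฝ๏ผŒๆœ‰ไป€ไนˆๅฏไปฅๅธฎไฝ ๏ผŸ')], 'ไป‹็ปไธ€ไธ‹้•ฟๅŸŽ'))
```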
__init__(api_model: str = 'gpt-4o', topic_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_topic_labels', analysis_key: str = 'dialog_topic_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model โ€“ API model name.

  • topic_candidates – The candidate topic labels for the output. If None, open-domain topic labels are used.

  • max_round – The maximum number of dialog rounds used to build the prompt.

  • labels_key – The key name in the meta field to store the output labels. Defaults to ‘dialog_topic_labels’.

  • analysis_key – The key name in the meta field to store the corresponding analysis. Defaults to ‘dialog_topic_labels_analysis’.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt โ€“ System prompt for the task.

  • query_template โ€“ Template for query part to build the input prompt.

  • response_template โ€“ Template for response part to build the input prompt.

  • candidate_template โ€“ Template for topic candidates to build the input prompt.

  • analysis_template โ€“ Template for analysis part to build the input prompt.

  • labels_template โ€“ Template for labels part to build the input prompt.

  • analysis_pattern – Pattern to parse the returned topic analysis.

  • labels_pattern – Pattern to parse the returned topic labels.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • model_params โ€“ Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

build_input(history, query)[source]#
parse_output(response)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.Difference_Area_Generator_Mapper(image_pair_similarity_filter_args: Dict | None = {}, image_segment_mapper_args: Dict | None = {}, image_text_matching_filter_args: Dict | None = {}, *args, **kwargs)[source]#

Bases: Mapper

Generates and filters bounding boxes for image pairs based on similarity, segmentation, and text matching.

This operator processes image pairs to identify and filter regions with significant differences. It applies a sequence of operations:

  • Filters out image pairs with large differences.

  • Segments the images to identify potential objects.

  • Crops sub-images based on bounding boxes.

  • Determines whether the sub-images contain valid objects using image-text matching.

  • Filters out sub-images that are too similar.

  • Removes overlapping bounding boxes.

  • Uses Hugging Face models for similarity and text matching, and FastSAM for segmentation.

  • Caches intermediate results in DATA_JUICER_ASSETS_CACHE.

  • Returns the filtered bounding boxes in the MetaKeys.bbox_tag field.

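The overlap-removal step can be illustrated with a plain intersection-over-union check. This is a simplified sketch, not the operator's actual code; the [x1, y1, x2, y2] box format and the greedy keep strategy are assumptions for illustration:

```python
def iou(a, b):
    # Intersection-over-union of two boxes in [x1, y1, x2, y2] format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def drop_overlapping(boxes, threshold=0.5):
    # Greedily keep a box only if it does not overlap an already-kept box.
    kept = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept

print(drop_overlapping([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]))
# โ†’ [[0, 0, 10, 10], [20, 20, 30, 30]]
```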
__init__(image_pair_similarity_filter_args: Dict | None = {}, image_segment_mapper_args: Dict | None = {}, image_text_matching_filter_args: Dict | None = {}, *args, **kwargs)[source]#

Initialization.

Parameters:
  • image_pair_similarity_filter_args โ€“ Arguments for image pair similarity filter. Controls the similarity filtering between image pairs. Default empty dict will use fixed values: min_score_1=0.1, max_score_1=1.0, min_score_2=0.1, max_score_2=1.0, hf_clip=โ€openai/clip-vit-base-patch32โ€, num_proc=1.

  • image_segment_mapper_args โ€“ Arguments for image segmentation mapper. Controls the image segmentation process. Default empty dict will use fixed values: imgsz=1024, conf=0.05, iou=0.5, model_path=โ€FastSAM-x.ptโ€.

  • image_text_matching_filter_args โ€“ Arguments for image-text matching filter. Controls the matching between cropped image regions and text descriptions. Default empty dict will use fixed values: min_score=0.1, max_score=1.0, hf_blip=โ€Salesforce/blip-itm-base-cocoโ€, num_proc=1.

process_single(samples, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.Difference_Caption_Generator_Mapper(mllm_mapper_args: Dict | None = {}, image_text_matching_filter_args: Dict | None = {}, text_pair_similarity_filter_args: Dict | None = {}, *args, **kwargs)[source]#

Bases: Mapper

Generates difference captions for bounding box regions in two images.

This operator processes pairs of images and generates captions for the differences in their bounding box regions. It uses a multi-step process:

  • Describes the content of each bounding box region using a Hugging Face model.

  • Crops the bounding box regions from both images.

  • Checks whether the cropped regions match the generated captions.

  • Determines whether there are differences between the two captions.

  • Marks the difference area with a red box.

  • Generates difference captions for the marked areas.

  • The key metric is the similarity score between the captions, computed using a CLIP model.

  • If no valid bounding boxes or differences are found, it returns empty captions and zeroed bounding boxes.

  • Uses ‘cuda’ as the accelerator if any of the fused operations support it.

  • Caches temporary images during processing and clears them afterward.

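The caption-pair comparison boils down to a similarity score between text embeddings. As a minimal sketch, a cosine similarity on hypothetical embedding vectors (real embeddings come from the CLIP model configured in text_pair_similarity_filter_args; the vectors below are made up):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Hypothetical caption embeddings; real ones come from a CLIP text encoder.
emb_caption_1 = [0.2, 0.8, 0.1]
emb_caption_2 = [0.25, 0.75, 0.05]
score = cosine_similarity(emb_caption_1, emb_caption_2)
print(round(score, 3))  # close to 1.0 for near-identical captions
```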
__init__(mllm_mapper_args: Dict | None = {}, image_text_matching_filter_args: Dict | None = {}, text_pair_similarity_filter_args: Dict | None = {}, *args, **kwargs)[source]#

Initialization.

Parameters:
  • mllm_mapper_args โ€“ Arguments for multimodal language model mapper. Controls the generation of captions for bounding box regions. Default empty dict will use fixed values: max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, hf_model=โ€llava-hf/llava-v1.6-vicuna-7b-hfโ€.

  • image_text_matching_filter_args โ€“ Arguments for image-text matching filter. Controls the matching between cropped regions and generated captions. Default empty dict will use fixed values: min_score=0.1, max_score=1.0, hf_blip=โ€Salesforce/blip-itm-base-cocoโ€, num_proc=1.

  • text_pair_similarity_filter_args โ€“ Arguments for text pair similarity filter. Controls the similarity comparison between caption pairs. Default empty dict will use fixed values: min_score=0.1, max_score=1.0, hf_clip=โ€openai/clip-vit-base-patch32โ€, text_key_second=โ€target_textโ€, num_proc=1.

process_single(samples, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.DownloadFileMapper(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, *args, **kwargs)[source]#

Bases: Mapper

Mapper to download URL files to local files or load them into memory.

This operator downloads files from URLs and can either save them to a specified directory or load the contents directly into memory. It supports downloading multiple files concurrently and can resume downloads if the resume_download flag is set. The operator processes nested lists of URLs, flattening them for batch processing and then reconstructing the original structure in the output. If neither save_dir nor save_field is specified, it defaults to storing the content under the key image_bytes. The operator logs any failed download attempts and provides error messages for troubleshooting.

__init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • save_dir โ€“ The directory to save downloaded files.

  • download_field – The field name from which to get the URL to download.

  • save_field – The field name used to save the downloaded file content.

  • resume_download – Whether to resume downloads. If True, skip the sample if its file already exists.

  • timeout โ€“ Timeout for download.

  • max_concurrent โ€“ Maximum concurrent downloads.

  • args โ€“ extra args

  • kwargs โ€“ extra args

download_files_async(urls, return_contents, save_dir=None, **kwargs)[source]#
download_nested_urls(nested_urls: List[str | List[str]], save_dir=None, save_field_contents=None)[source]#
process_batched(samples)[source]#
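The nested-URL handling described above can be sketched as a flatten/rebuild pair. This is an illustration of the idea behind download_nested_urls, not the operator's actual code:

```python
def flatten_nested_urls(nested):
    # Flatten List[str | List[str]] and remember each item's origin index.
    flat, index = [], []
    for i, item in enumerate(nested):
        if isinstance(item, list):
            for url in item:
                index.append((i, True))
                flat.append(url)
        else:
            index.append((i, False))
            flat.append(item)
    return flat, index

def rebuild(results, index, size):
    # Restore the original List[str | List[str]] structure from flat results.
    out = [None] * size
    for res, (i, is_list) in zip(results, index):
        if is_list:
            out[i] = (out[i] or []) + [res]
        else:
            out[i] = res
    return out

urls = ['http://a', ['http://b', 'http://c']]
flat, index = flatten_nested_urls(urls)
assert rebuild(flat, index, len(urls)) == urls  # structure round-trips
```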
class data_juicer.ops.mapper.ExpandMacroMapper(*args, **kwargs)[source]#

Bases: Mapper

Expands macro definitions in the document body of LaTeX samples.

This operator processes LaTeX documents to expand user-defined macros in the text. It supports newcommand and def macros without arguments; macros with arguments are currently not supported. Macros are identified and expanded in the text, ensuring they are not part of longer alphanumeric words. The processed text is updated in the samples.
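The expansion idea can be sketched with regular expressions. This is a minimal illustration, not the operator's actual implementation: it handles only argument-free \newcommand definitions and uses a lookahead so a macro name never matches inside a longer alphanumeric word:

```python
import re

# Matches argument-free definitions like \newcommand{\name}{body}.
NEWCOMMAND = r'\\newcommand\{\\(\w+)\}\{([^}]*)\}\s*'

def expand_macros(text):
    # Collect the definitions, then drop them from the text.
    macros = dict(re.findall(NEWCOMMAND, text))
    body = re.sub(NEWCOMMAND, '', text)
    for name, value in macros.items():
        # Expand \name only when not followed by another letter or digit,
        # so \dj does not fire inside \djx.
        body = re.sub(r'\\' + name + r'(?![A-Za-z0-9])',
                      value.replace('\\', r'\\'), body)
    return body

sample = r'\newcommand{\dj}{Data-Juicer} We use \dj daily, but keep \djx intact.'
print(expand_macros(sample))
# โ†’ We use Data-Juicer daily, but keep \djx intact.
```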

__init__(*args, **kwargs)[source]#

Initialization method.

Parameters:
  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.ExtractEntityAttributeMapper(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Extracts attributes for given entities from the text and stores them in the sampleโ€™s metadata.

This operator uses an API model to extract specified attributes for given entities from the input text. It constructs prompts based on provided templates and parses the modelโ€™s output to extract attribute descriptions and supporting text. The extracted data is stored in the sampleโ€™s metadata under the specified keys. If the required metadata fields already exist, the operator skips processing for that sample. The operator retries the API call and parsing up to a specified number of times in case of errors. The default system prompt, input template, and parsing patterns are used if not provided.

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '็ป™ๅฎšไธ€ๆฎตๆ–‡ๆœฌ๏ผŒไปŽๆ–‡ๆœฌไธญๆ€ป็ป“{entity}็š„{attribute}๏ผŒๅนถไธ”ไปŽๅŽŸๆ–‡ๆ‘˜ๅฝ•ๆœ€่ƒฝ่ฏดๆ˜Ž่ฏฅ{attribute}็š„ไปฃ่กจๆ€ง็คบไพ‹ใ€‚\n่ฆๆฑ‚๏ผš\n- ๆ‘˜ๅฝ•็š„็คบไพ‹ๅบ”่ฏฅ็ฎ€็Ÿญใ€‚\n- ้ตๅพชๅฆ‚ไธ‹็š„ๅ›žๅคๆ ผๅผ๏ผš\n# {entity}\n## {attribute}๏ผš\n...\n### ไปฃ่กจๆ€ง็คบไพ‹ๆ‘˜ๅฝ•1๏ผš\n```\n...\n```\n### ไปฃ่กจๆ€ง็คบไพ‹ๆ‘˜ๅฝ•2๏ผš\n```\n...\n```\n...\n'#
DEFAULT_INPUT_TEMPLATE = '# ๆ–‡ๆœฌ\n```\n{text}\n```\n'#
DEFAULT_ATTR_PATTERN_TEMPLATE = '\\#\\#\\s*{attribute}๏ผš\\s*(.*?)(?=\\#\\#\\#|\\Z)'#
DEFAULT_DEMON_PATTERN = '\\#\\#\\#\\s*ไปฃ่กจๆ€ง็คบไพ‹ๆ‘˜ๅฝ•(\\d+)๏ผš\\s*```\\s*(.*?)```\\s*(?=\\#\\#\\#|\\Z)'#
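As an illustration, the two default patterns above can parse a reply in the required format with Python's `re` module. The reply text and attribute name are hypothetical, and re.DOTALL is needed because descriptions and excerpts may span several lines:

```python
import re

FENCE = '`' * 3  # the triple-backtick marker used in the reply format

# Default patterns of this operator, as shown above.
ATTR_PATTERN_TEMPLATE = r'\#\#\s*{attribute}๏ผš\s*(.*?)(?=\#\#\#|\Z)'
DEMO_PATTERN = (r'\#\#\#\s*ไปฃ่กจๆ€ง็คบไพ‹ๆ‘˜ๅฝ•(\d+)๏ผš\s*' + FENCE
                + r'\s*(.*?)' + FENCE + r'\s*(?=\#\#\#|\Z)')

# A hypothetical model reply following the required format.
raw_output = (
    '# ๅญ”ๅญ\n'
    '## ๆ€ๆƒณไธปๅผ ๏ผš\n'
    'ไปฅไปไธบๆœฌ๏ผŒๅ€กๅฏผ็คผๆฒปใ€‚\n'
    '### ไปฃ่กจๆ€ง็คบไพ‹ๆ‘˜ๅฝ•1๏ผš\n'
    + FENCE + '\nๅทฑๆ‰€ไธๆฌฒ๏ผŒๅ‹ฟๆ–ฝไบŽไบบใ€‚\n' + FENCE + '\n'
)

attr_pattern = ATTR_PATTERN_TEMPLATE.format(attribute='ๆ€ๆƒณไธปๅผ ')
description = re.search(attr_pattern, raw_output, re.DOTALL).group(1).strip()
demos = [d.strip() for _, d in re.findall(DEMO_PATTERN, raw_output, re.DOTALL)]
print(description)  # ไปฅไปไธบๆœฌ๏ผŒๅ€กๅฏผ็คผๆฒปใ€‚
print(demos)        # ['ๅทฑๆ‰€ไธๆฌฒ๏ผŒๅ‹ฟๆ–ฝไบŽไบบใ€‚']
```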
__init__(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model โ€“ API model name.

  • query_entities โ€“ Entity list to be queried.

  • query_attributes โ€“ Attribute list to be queried.

  • entity_key – The key name in the meta field to store the given main entity for attribute extraction. Defaults to ‘main_entities’.

  • attribute_key – The key name in the meta field to store the given attribute to be extracted. Defaults to ‘attributes’.

  • attribute_desc_key – The key name in the meta field to store the extracted attribute description. Defaults to ‘attribute_descriptions’.

  • support_text_key – The key name in the meta field to store the attribute support text extracted from the raw text. Defaults to ‘attribute_support_texts’.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt_template โ€“ System prompt template for the task. Need to be specified by given entity and attribute.

  • input_template โ€“ Template for building the model input.

  • attr_pattern_template โ€“ Pattern for parsing the attribute from output. Need to be specified by given attribute.

  • demo_pattern โ€“ Pattern for parsing the demonstration from output to support the attribute.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params โ€“ Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

parse_output(raw_output, attribute_name)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ExtractEntityRelationMapper(api_model: str = 'gpt-4o', entity_types: List[str] = None, *, entity_key: str = 'entity', relation_key: str = 'relation', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, tuple_delimiter: str | None = None, record_delimiter: str | None = None, completion_delimiter: str | None = None, max_gleaning: Annotated[int, Ge(ge=0)] = 1, continue_prompt: str | None = None, if_loop_prompt: str | None = None, entity_pattern: str | None = None, relation_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Extracts entities and relations from text to build a knowledge graph.

  • Identifies entities based on specified types and extracts their names, types, and descriptions.

  • Identifies relationships between the entities, including source and target entities, relationship descriptions, keywords, and strength scores.

  • Uses a Hugging Face tokenizer and a predefined prompt template to guide the extraction process.

  • Outputs entities and relations in a structured format, using delimiters for separation.

  • Caches the results in the sampleโ€™s metadata under the keys โ€˜entityโ€™ and โ€˜relationโ€™.

  • Supports multiple retries and gleaning to ensure comprehensive extraction.

  • The default entity types include โ€˜organizationโ€™, โ€˜personโ€™, โ€˜geoโ€™, and โ€˜eventโ€™.

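A sketch of splitting such structured output into entity and relation records. The delimiter strings below are illustrative assumptions, since the actual defaults are resolved internally when tuple_delimiter, record_delimiter, and completion_delimiter are left as None:

```python
# Illustrative delimiter choices (assumptions, not the operator's defaults).
TUPLE_DELIM, RECORD_DELIM, DONE_DELIM = '<|>', '##', '<|COMPLETE|>'

# A hypothetical model output in the prompted record format.
raw = (
    '("entity"<|>"Alex"<|>"person"<|>"Alex is a team leader.")##'
    '("relationship"<|>"Alex"<|>"Taylor"<|>"Alex observes Taylor."<|>"power dynamics"<|>7)##'
    '<|COMPLETE|>'
)

entities, relations = [], []
for record in raw.replace(DONE_DELIM, '').split(RECORD_DELIM):
    record = record.strip().strip('()')
    if not record:
        continue
    fields = [f.strip('"') for f in record.split(TUPLE_DELIM)]
    if fields[0] == 'entity':
        entities.append(fields[1:])   # [name, type, description]
    elif fields[0] == 'relationship':
        relations.append(fields[1:])  # [src, dst, description, keywords, strength]

print(entities)   # [['Alex', 'person', 'Alex is a team leader.']]
print(relations)  # [['Alex', 'Taylor', 'Alex observes Taylor.', 'power dynamics', '7']]
```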
DEFAULT_PROMPT_TEMPLATE = '-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity\n- entity_type: One of the following types: [{entity_types}]\n- entity_description: Comprehensive description of the entity\'s attributes and activities\nFormat each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>\n\n2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity\n- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details\nFormat each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)\n\n3. Return output in the language of the given text as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.\n\n4. When finished, output {completion_delimiter}\n\n######################\n-Examples-\n######################\nExample 1:\n\nEntity_types: [person, technology, mission, organization, location]\nText:\n```\nwhile Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor\'s authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan\'s shared commitment to discovery was an unspoken rebellion against Cruz\'s narrowing vision of control and order.\n\nThen Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”\n\nThe underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor\'s, a wordless clash of wills softening into an uneasy truce.\n\nIt was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths\n```\n################\nOutput:\n("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}\n("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}\n("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}\n("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}\n("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor\'s authoritarian certainty and observes changes in Taylor\'s attitude towards the device."{tuple_delimiter}"power dynamics, perspective shift"{tuple_delimiter}7){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz\'s vision."{tuple_delimiter}"shared goals, rebellion"{tuple_delimiter}6){record_delimiter}\n("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}\n("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan\'s commitment to discovery is in rebellion against Cruz\'s vision of control and order."{tuple_delimiter}"ideological conflict, rebellion"{tuple_delimiter}5){record_delimiter}\n("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}"reverence, technological significance"{tuple_delimiter}9){record_delimiter}\n#############################\nExample 2:\n\nEntity_types: [ไบบ็‰ฉ, ๆŠ€ๆœฏ, ไปปๅŠก, ็ป„็ป‡, ๅœฐ็‚น]\nText:\n```\nไป–ไปฌไธๅ†ๆ˜ฏๅ•็บฏ็š„ๆ‰ง่กŒ่€…๏ผ›ไป–ไปฌๅทฒๆˆไธบๆŸไธช่ถ…่ถŠๆ˜Ÿ่พฐไธŽๆก็บน็š„้ข†ๅŸŸ็š„ไฟกๆฏๅฎˆๆŠค่€…ใ€‚่ฟ™ไธ€ไฝฟๅ‘ฝ็š„ๆๅ‡ไธ่ƒฝ่ขซ่ง„ๅˆ™ๅ’Œๆ—ขๅฎšๅ่ฎฎๆ‰€ๆŸ็ผš——ๅฎƒ้œ€่ฆไธ€็งๆ–ฐ็š„่ง†่ง’๏ผŒไธ€็งๆ–ฐ็š„ๅ†ณๅฟƒใ€‚\n\n้š็€ไธŽๅŽ็››้กฟ็š„้€š่ฎฏๅœจ่ƒŒๆ™ฏไธญๅ—กๅ—กไฝœๅ“๏ผŒๅฏน่ฏไธญ็š„็ดงๅผ ๆƒ…็ปช้€š่ฟ‡ๅ˜Ÿๅ˜Ÿๅฃฐๅ’Œ้™็”ตๅ™ช้Ÿณ่ดฏ็ฉฟๅง‹็ปˆใ€‚ๅ›ข้˜Ÿ็ซ™็ซ‹็€๏ผŒไธ€่‚กไธ็ฅฅ็š„ๆฐ”ๆฏ็ฌผ็ฝฉ็€ไป–ไปฌใ€‚ๆ˜พ็„ถ๏ผŒไป–ไปฌๅœจๆŽฅไธ‹ๆฅๅ‡ ไธชๅฐๆ—ถๅ†…ๅšๅ‡บ็š„ๅ†ณๅฎšๅฏ่ƒฝไผš้‡ๆ–ฐๅฎšไน‰ไบบ็ฑปๅœจๅฎ‡ๅฎ™ไธญ็š„ไฝ็ฝฎ๏ผŒๆˆ–่€…ๅฐ†ไป–ไปฌ็ฝฎไบŽๆ— ็Ÿฅๅ’Œๆฝœๅœจๅฑ้™ฉไน‹ไธญใ€‚\n\n้š็€ไธŽๆ˜Ÿ่พฐ็š„่”็ณปๅ˜ๅพ—ๆ›ดๅŠ ็‰ขๅ›บ๏ผŒๅฐ็ป„ๅผ€ๅง‹ๅค„็†้€ๆธๆˆๅฝข็š„่ญฆๅ‘Š๏ผŒไปŽ่ขซๅŠจๆŽฅๅ—่€…่ฝฌๅ˜ไธบ็งฏๆžๅ‚ไธŽ่€…ใ€‚ๆข…็‘ŸๅŽๆฅ็š„็›ด่ง‰ๅ ๆฎไบ†ไธŠ้ฃŽ——ๅ›ข้˜Ÿ็š„ไปปๅŠกๅทฒ็ปๆผ”ๅ˜๏ผŒไธๅ†ไป…ไป…ๆ˜ฏ่ง‚ๅฏŸๅ’ŒๆŠฅๅ‘Š๏ผŒ่€Œๆ˜ฏไบ’ๅŠจๅ’Œๅ‡†ๅค‡ใ€‚ไธ€ๅœบ่œ•ๅ˜ๅทฒ็ปๅผ€ๅง‹๏ผŒ่€Œ“ๆœๅฐ”ๅกž่กŒๅŠจ”ๅˆ™ไปฅไป–ไปฌๅคง่ƒ†็š„ๆ–ฐ้ข‘็އ้œ‡ๅŠจ๏ผŒ่ฟ™็งๅŸบ่ฐƒไธๆ˜ฏ็”ฑไธ–ไฟ—่ฎพๅฎš็š„\n```\n#############\nOutput:\n("entity"{tuple_delimiter}"ๅŽ็››้กฟ"{tuple_delimiter}"ๅœฐ็‚น"{tuple_delimiter}"ๅŽ็››้กฟๆ˜ฏๆญฃๅœจๆŽฅๆ”ถ้€š่ฎฏ็š„ๅœฐๆ–น๏ผŒ่กจๆ˜Žๅ…ถๅœจๅ†ณ็ญ–่ฟ‡็จ‹ไธญ็š„้‡่ฆๆ€งใ€‚"){record_delimiter}\n("entity"{tuple_delimiter}"ๆœๅฐ”ๅกž่กŒๅŠจ"{tuple_delimiter}"ไปปๅŠก"{tuple_delimiter}"ๆœๅฐ”ๅกž่กŒๅŠจ่ขซๆ่ฟฐไธบไธ€้กนๅทฒๆผ”ๅ˜ไธบไบ’ๅŠจๅ’Œๅ‡†ๅค‡็š„ไปปๅŠก๏ผŒๆ˜พ็คบๅ‡บ็›ฎๆ ‡ๅ’ŒๆดปๅŠจ็š„้‡ๅคง่ฝฌๅ˜ใ€‚"){record_delimiter}\n("entity"{tuple_delimiter}"ๅ›ข้˜Ÿ"{tuple_delimiter}"็ป„็ป‡"{tuple_delimiter}"ๅ›ข้˜Ÿ่ขซๆ็ป˜ๆˆไธ€็พคไปŽ่ขซๅŠจ่ง‚ๅฏŸ่€…่ฝฌๅ˜ไธบ็งฏๆžๅ‚ไธŽ่€…็š„ไบบ๏ผŒๅฑ•็คบไบ†ไป–ไปฌ่ง’่‰ฒ็š„ๅŠจๆ€ๅ˜ๅŒ–ใ€‚"){record_delimiter}\n("relationship"{tuple_delimiter}"ๅ›ข้˜Ÿ"{tuple_delimiter}"ๅŽ็››้กฟ"{tuple_delimiter}"ๅ›ข้˜Ÿๆ”ถๅˆฐๆฅ่‡ชๅŽ็››้กฟ็š„้€š่ฎฏ๏ผŒ่ฟ™ๅฝฑๅ“ไบ†ไป–ไปฌ็š„ๅ†ณ็ญ–่ฟ‡็จ‹ใ€‚"{tuple_delimiter}"ๅ†ณ็ญ–ใ€ๅค–้ƒจๅฝฑๅ“"{tuple_delimiter}7){record_delimiter}\n("relationship"{tuple_delimiter}"ๅ›ข้˜Ÿ"{tuple_delimiter}"ๆœๅฐ”ๅกž่กŒๅŠจ"{tuple_delimiter}"ๅ›ข้˜Ÿ็›ดๆŽฅๅ‚ไธŽๆœๅฐ”ๅกž่กŒๅŠจ๏ผŒๆ‰ง่กŒๅ…ถๆผ”ๅ˜ๅŽ็š„็›ฎๆ ‡ๅ’ŒๆดปๅŠจใ€‚"{tuple_delimiter}"ไปปๅŠกๆผ”ๅ˜ใ€็งฏๆžๅ‚ไธŽ"{tuple_delimiter}9){completion_delimiter}\n#############################\nExample 3:\n\nEntity_types: [person, role, technology, organization, event, location, concept]\nText:\n```\ntheir voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.\n\n"It\'s like it\'s learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers\' a whole new meaning."\n\nAlex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."\n\nTogether, they stood on the edge of the unknown, forging humanity\'s response to a message from the heavens. The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.\n\nThe encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation\n```\n#############\nOutput:\n("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}\n("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}\n("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}\n("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}\n("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}\n("entity"{tuple_delimiter}"Humanity\'s Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity\'s Response is the collective action taken by Alex\'s team in response to a message from an unknown 
intelligence."){record_delimiter}\n("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}"communication, learning process"{tuple_delimiter}9){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}"leadership, exploration"{tuple_delimiter}10){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity\'s Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity\'s Response to the unknown intelligence."{tuple_delimiter}"collective action, cosmic significance"{tuple_delimiter}8){record_delimiter}\n("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}"power dynamics, autonomy"{tuple_delimiter}7){record_delimiter}\n#############################\n-Real Data-\n######################\nEntity_types: [{entity_types}]\nText:\n```\n{input_text}\n```\n######################\nOutput:\n'#
DEFAULT_CONTINUE_PROMPT = 'MANY entities were missed in the last extraction.  Add them below using the same format:\n'#
DEFAULT_IF_LOOP_PROMPT = 'It appears some entities may have still been missed.  Answer YES | NO if there are still entities that need to be added.\n'#
DEFAULT_ENTITY_TYPES = ['organization', 'person', 'geo', 'event']#
DEFAULT_TUPLE_DELIMITER = '<|>'#
DEFAULT_RECORD_DELIMITER = '##'#
DEFAULT_COMPLETION_DELIMITER = '<|COMPLETE|>'#
DEFAULT_ENTITY_PATTERN = '\\("entity"(.*?)\\)'#
DEFAULT_RELATION_PATTERN = '\\("relationship"(.*?)\\)'#
__init__(api_model: str = 'gpt-4o', entity_types: List[str] = None, *, entity_key: str = 'entity', relation_key: str = 'relation', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, tuple_delimiter: str | None = None, record_delimiter: str | None = None, completion_delimiter: str | None = None, max_gleaning: Annotated[int, Ge(ge=0)] = 1, continue_prompt: str | None = None, if_loop_prompt: str | None = None, entity_pattern: str | None = None, relation_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • entity_types – Pre-defined entity types for the knowledge graph.

  • entity_key – The key name to store the entities in the meta field. It is “entity” by default.

  • relation_key – The key name to store the relations between entities in the meta field. It is “relation” by default.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • prompt_template โ€“ The template of input prompt.

  • tuple_delimiter โ€“ Delimiter to separate items in outputs.

  • record_delimiter โ€“ Delimiter to separate records in outputs.

  • completion_delimiter โ€“ To mark the end of the output.

  • max_gleaning – The maximum number of extra LLM calls to glean additional entities and relations.

  • continue_prompt – The prompt used to glean more entities and relations.

  • if_loop_prompt – The prompt used to decide whether to stop gleaning.

  • entity_pattern โ€“ Regular expression for parsing entity record.

  • relation_pattern โ€“ Regular expression for parsing relation record.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text from the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

parse_output(raw_output)[source]#
add_message(messages, role, content)[source]#
light_rag_extraction(messages, rank=None)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
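The default delimiters and patterns above are plain regular expressions, so their behavior can be checked outside the operator. The helper below is an illustrative sketch (not the class's actual `parse_output`) that splits a raw model output into entity and relationship tuples:

```python
import re

# Default markers, as documented above.
TUPLE_DELIMITER = "<|>"
ENTITY_PATTERN = r'\("entity"(.*?)\)'
RELATION_PATTERN = r'\("relationship"(.*?)\)'

def parse_records(raw_output):
    """Split a raw LLM output into lists of entity and relationship fields."""
    def fields(match):
        # Drop the empty field before the leading delimiter, strip quotes.
        return [f.strip().strip('"') for f in match.split(TUPLE_DELIMITER) if f.strip()]
    entities = [fields(m) for m in re.findall(ENTITY_PATTERN, raw_output, re.DOTALL)]
    relations = [fields(m) for m in re.findall(RELATION_PATTERN, raw_output, re.DOTALL)]
    return entities, relations

raw = ('("entity"<|>"Alex"<|>"person"<|>"Alex is a character.")##'
      '("relationship"<|>"Alex"<|>"Taylor"<|>"Alex observes Taylor."<|>"power dynamics"<|>7)')
entities, relations = parse_records(raw)
# entities  -> [['Alex', 'person', 'Alex is a character.']]
# relations -> [['Alex', 'Taylor', 'Alex observes Taylor.', 'power dynamics', '7']]
```

The record delimiter `##` separates records but is not needed for parsing here, since each record is matched individually.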

class data_juicer.ops.mapper.ExtractEventMapper(api_model: str = 'gpt-4o', *, event_desc_key: str = 'event_description', relevant_char_key: str = 'relevant_characters', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Extracts events and relevant characters from the text.

This operator uses an API model to summarize the text into multiple events and extract the relevant characters for each event. The summary and character extraction follow a predefined format. The operator retries the API call up to a specified number of times if there is an error. The extracted events and characters are stored in the meta field of the samples. If no events are found, the original samples are returned. The operator can optionally drop the original text after processing.

DEFAULT_SYSTEM_PROMPT = '็ป™ๅฎšไธ€ๆฎตๆ–‡ๆœฌ๏ผŒๅฏนๆ–‡ๆœฌ็š„ๆƒ…่Š‚่ฟ›่กŒๅˆ†็‚นๆ€ป็ป“๏ผŒๅนถๆŠฝๅ–ไธŽๆƒ…่Š‚็›ธๅ…ณ็š„ไบบ็‰ฉใ€‚\n่ฆๆฑ‚๏ผš\n- ๅฐฝ้‡ไธ่ฆ้—ๆผๅ†…ๅฎน๏ผŒไธ่ฆๆทปๅŠ ๆ–‡ๆœฌไธญๆฒกๆœ‰็š„ๆƒ…่Š‚๏ผŒ็ฌฆๅˆๅŽŸๆ–‡ไบ‹ๅฎž\n- ่”็ณปไธŠไธ‹ๆ–‡่ฏดๆ˜Žๅ‰ๅ› ๅŽๆžœ๏ผŒไฝ†ไป็„ถ้œ€่ฆ็ฌฆๅˆไบ‹ๅฎž\n- ไธ่ฆๅŒ…ๅซไธป่ง‚็œ‹ๆณ•\n- ๆณจๆ„่ฆๅฐฝๅฏ่ƒฝไฟ็•™ๆ–‡ๆœฌ็š„ไธ“ๆœ‰ๅ่ฏ\n- ๆณจๆ„็›ธๅ…ณไบบ็‰ฉ้œ€่ฆๅœจๅฏนๅบ”ๆƒ…่Š‚ไธญๅ‡บ็Žฐ\n- ๅชๆŠฝๅ–ๆƒ…่Š‚ไธญ็š„ไธป่ฆไบบ็‰ฉ๏ผŒไธ่ฆ้—ๆผๆƒ…่Š‚็š„ไธป่ฆไบบ็‰ฉ\n- ๆ€ป็ป“ๆ ผๅผๅฆ‚ไธ‹๏ผš\n### ๆƒ…่Š‚1๏ผš\n- **ๆƒ…่Š‚ๆ่ฟฐ**๏ผš ...\n- **็›ธๅ…ณไบบ็‰ฉ**๏ผšไบบ็‰ฉ1๏ผŒไบบ็‰ฉ2๏ผŒไบบ็‰ฉ3๏ผŒ...\n### ๆƒ…่Š‚2๏ผš\n- **ๆƒ…่Š‚ๆ่ฟฐ**๏ผš ...\n- **็›ธๅ…ณไบบ็‰ฉ**๏ผšไบบ็‰ฉ1๏ผŒไบบ็‰ฉ2๏ผŒ...\n### ๆƒ…่Š‚3๏ผš\n- **ๆƒ…่Š‚ๆ่ฟฐ**๏ผš ...\n- **็›ธๅ…ณไบบ็‰ฉ**๏ผšไบบ็‰ฉ1๏ผŒ...\n...\n'#
DEFAULT_INPUT_TEMPLATE = '# ๆ–‡ๆœฌ\n```\n{text}\n```\n'#
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*ๆƒ…่Š‚(\\d+)๏ผš\\s*\n        -\\s*\\*\\*ๆƒ…่Š‚ๆ่ฟฐ\\*\\*\\s*๏ผš\\s*(.*?)\\s*\n        -\\s*\\*\\*็›ธๅ…ณไบบ็‰ฉ\\*\\*\\s*๏ผš\\s*(.*?)(?=\\#\\#\\#|\\Z)\n    '#
__init__(api_model: str = 'gpt-4o', *, event_desc_key: str = 'event_description', relevant_char_key: str = 'relevant_characters', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • event_desc_key – The key name to store the event descriptions in the meta field. It is “event_description” by default.

  • relevant_char_key – The key name to store the characters relevant to the events in the meta field. It is “relevant_characters” by default.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt โ€“ System prompt for the task.

  • input_template โ€“ Template for building the model input.

  • output_pattern โ€“ Regular expression for parsing model output.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text from the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

parse_output(raw_output)[source]#
process_batched(samples, rank=None)[source]#
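The `DEFAULT_OUTPUT_PATTERN` above is written in verbose regex style; compiling it with `re.VERBOSE | re.DOTALL` (an assumption in this sketch) parses the summarized events and their relevant characters from a response in the documented format:

```python
import re

# The documented default pattern; re.VERBOSE | re.DOTALL is assumed here.
EVENT_PATTERN = r'''
    \#\#\#\s*情节(\d+)：\s*
    -\s*\*\*情节描述\*\*\s*：\s*(.*?)\s*
    -\s*\*\*相关人物\*\*\s*：\s*(.*?)(?=\#\#\#|\Z)
'''

raw = (
    "### 情节1：\n"
    "- **情节描述**：主角发现了一台神秘装置。\n"
    "- **相关人物**：Alex，Taylor\n"
    "### 情节2：\n"
    "- **情节描述**：团队决定启动应对计划。\n"
    "- **相关人物**：Jordan\n"
)

# Each match yields (event index, event description, character list).
events = [
    (idx, desc, [c.strip() for c in chars.strip().split("，")])
    for idx, desc, chars in re.findall(EVENT_PATTERN, raw, re.VERBOSE | re.DOTALL)
]
```

Each parsed event would then populate `event_description` and `relevant_characters` in one sample's meta field.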
class data_juicer.ops.mapper.ExtractKeywordMapper(api_model: str = 'gpt-4o', *, keyword_key: str = 'keyword', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, completion_delimiter: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Generate keywords for the text.

This operator uses a specified API model to generate high-level keywords that summarize the main concepts, themes, or topics of the input text. The generated keywords are stored in the meta field under the key specified by keyword_key. The operator retries the API call up to try_num times in case of errors. If drop_text is set to True, the original text is removed from the sample after processing. The operator uses a default prompt template and completion delimiter, which can be customized. The output is parsed using a regular expression to extract the keywords.

DEFAULT_PROMPT_TEMPLATE = '-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.\nFormat the content-level key words as ("content_keywords" <high_level_keywords>)\n\n3. Return output in the language of the given text.\n\n4. When finished, output {completion_delimiter}\n\n######################\n-Examples-\n######################\nExample 1:\n\nText:\n```\nwhile Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor\'s authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan\'s shared commitment to discovery was an unspoken rebellion against Cruz\'s narrowing vision of control and order.\n\nThen Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. โ€œIf this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.โ€\n\nThe underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor\'s, a wordless clash of wills softening into an uneasy truce.\n\nIt was a small transformation, barely perceptible, but one that Alex noted with an inward nod. 
They had all been brought here by different paths\n```\n################\nOutput:\n("content_keywords" "power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}\n#############################\nExample 2:\n\nText:\n```\nไป–ไปฌไธๅ†ๆ˜ฏๅ•็บฏ็š„ๆ‰ง่กŒ่€…๏ผ›ไป–ไปฌๅทฒๆˆไธบๆŸไธช่ถ…่ถŠๆ˜Ÿ่พฐไธŽๆก็บน็š„้ข†ๅŸŸ็š„ไฟกๆฏๅฎˆๆŠค่€…ใ€‚่ฟ™ไธ€ไฝฟๅ‘ฝ็š„ๆๅ‡ไธ่ƒฝ่ขซ่ง„ๅˆ™ๅ’Œๆ—ขๅฎšๅ่ฎฎๆ‰€ๆŸ็ผšโ€”โ€”ๅฎƒ้œ€่ฆไธ€็งๆ–ฐ็š„่ง†่ง’๏ผŒไธ€็งๆ–ฐ็š„ๅ†ณๅฟƒใ€‚\n\n้š็€ไธŽๅŽ็››้กฟ็š„้€š่ฎฏๅœจ่ƒŒๆ™ฏไธญๅ—กๅ—กไฝœๅ“๏ผŒๅฏน่ฏไธญ็š„็ดงๅผ ๆƒ…็ปช้€š่ฟ‡ๅ˜Ÿๅ˜Ÿๅฃฐๅ’Œ้™็”ตๅ™ช้Ÿณ่ดฏ็ฉฟๅง‹็ปˆใ€‚ๅ›ข้˜Ÿ็ซ™็ซ‹็€๏ผŒไธ€่‚กไธ็ฅฅ็š„ๆฐ”ๆฏ็ฌผ็ฝฉ็€ไป–ไปฌใ€‚ๆ˜พ็„ถ๏ผŒไป–ไปฌๅœจๆŽฅไธ‹ๆฅๅ‡ ไธชๅฐๆ—ถๅ†…ๅšๅ‡บ็š„ๅ†ณๅฎšๅฏ่ƒฝไผš้‡ๆ–ฐๅฎšไน‰ไบบ็ฑปๅœจๅฎ‡ๅฎ™ไธญ็š„ไฝ็ฝฎ๏ผŒๆˆ–่€…ๅฐ†ไป–ไปฌ็ฝฎไบŽๆ— ็Ÿฅๅ’Œๆฝœๅœจๅฑ้™ฉไน‹ไธญใ€‚\n\n้š็€ไธŽๆ˜Ÿ่พฐ็š„่”็ณปๅ˜ๅพ—ๆ›ดๅŠ ็‰ขๅ›บ๏ผŒๅฐ็ป„ๅผ€ๅง‹ๅค„็†้€ๆธๆˆๅฝข็š„่ญฆๅ‘Š๏ผŒไปŽ่ขซๅŠจๆŽฅๅ—่€…่ฝฌๅ˜ไธบ็งฏๆžๅ‚ไธŽ่€…ใ€‚ๆข…็‘ŸๅŽๆฅ็š„็›ด่ง‰ๅ ๆฎไบ†ไธŠ้ฃŽโ€”โ€”ๅ›ข้˜Ÿ็š„ไปปๅŠกๅทฒ็ปๆผ”ๅ˜๏ผŒไธๅ†ไป…ไป…ๆ˜ฏ่ง‚ๅฏŸๅ’ŒๆŠฅๅ‘Š๏ผŒ่€Œๆ˜ฏไบ’ๅŠจๅ’Œๅ‡†ๅค‡ใ€‚ไธ€ๅœบ่œ•ๅ˜ๅทฒ็ปๅผ€ๅง‹๏ผŒ่€Œโ€œๆœๅฐ”ๅกž่กŒๅŠจโ€ๅˆ™ไปฅไป–ไปฌๅคง่ƒ†็š„ๆ–ฐ้ข‘็އ้œ‡ๅŠจ๏ผŒ่ฟ™็งๅŸบ่ฐƒไธๆ˜ฏ็”ฑไธ–ไฟ—่ฎพๅฎš็š„\n```\n#############\nOutput:\n("content_keywords" "ไปปๅŠกๆผ”ๅ˜, ๅ†ณ็ญ–ๅˆถๅฎš, ็งฏๆžๅ‚ไธŽ, ๅฎ‡ๅฎ™ๆ„ไน‰"){completion_delimiter}\n#############################\nExample 3:\n\nEntity_types: [person, role, technology, organization, event, location, concept]\nText:\n```\ntheir voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.\n\n"It\'s like it\'s learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. 
"This gives talking to strangers\' a whole new meaning."\n\nAlex surveyed his teamโ€”each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."\n\nTogether, they stood on the edge of the unknown, forging humanity\'s response to a message from the heavens. The ensuing silence was palpableโ€”a collective introspection about their role in this grand cosmic play, one that could rewrite human history.\n\nThe encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation\n```\n#############\nOutput:\n("content_keywords" "first contact, control, communication, cosmic significance"){completion_delimiter}\n-Real Data-\n######################\nText:\n```\n{input_text}\n```\n######################\nOutput:\n'#
DEFAULT_COMPLETION_DELIMITER = '<|COMPLETE|>'#
DEFAULT_OUTPUT_PATTERN = '\\("content_keywords"(.*?)\\)'#
__init__(api_model: str = 'gpt-4o', *, keyword_key: str = 'keyword', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, completion_delimiter: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • keyword_key – The key name to store the keywords in the meta field. It is “keyword” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • prompt_template โ€“ The template of input prompt.

  • completion_delimiter โ€“ To mark the end of the output.

  • output_pattern โ€“ Regular expression for parsing keywords.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text from the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
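The default output pattern can be applied directly to a model response; a minimal sketch of extracting keywords from output that follows the documented format:

```python
import re

OUTPUT_PATTERN = r'\("content_keywords"(.*?)\)'  # documented default
COMPLETION_DELIMITER = "<|COMPLETE|>"            # documented default

raw = '("content_keywords" "first contact, control, communication")' + COMPLETION_DELIMITER
match = re.search(OUTPUT_PATTERN, raw, re.DOTALL)
# Strip the surrounding quotes, then split the comma-separated keywords.
keywords = [k.strip() for k in match.group(1).strip().strip('"').split(",")]
# keywords -> ['first contact', 'control', 'communication']
```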

class data_juicer.ops.mapper.ExtractNicknameMapper(api_model: str = 'gpt-4o', *, nickname_key: str = 'nickname', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Extracts nickname relationships in the text using a language model.

This operator uses a language model to identify and extract nickname relationships from the input text. It follows specific instructions to ensure accurate extraction, such as identifying the speaker, the person being addressed, and the nickname used. The extracted relationships are stored in the meta field under the specified key. The operator uses a default system prompt, input template, and output pattern, but these can be customized. The results are parsed and validated to ensure they meet the required format. If the text already contains the nickname information, it is not processed again. The operator retries the API call a specified number of times if an error occurs.

DEFAULT_SYSTEM_PROMPT = '็ป™ๅฎšไฝ ไธ€ๆฎตๆ–‡ๆœฌ๏ผŒไฝ ็š„ไปปๅŠกๆ˜ฏๅฐ†ไบบ็‰ฉไน‹้—ด็š„็งฐๅ‘ผๆ–นๅผ๏ผˆๆ˜ต็งฐ๏ผ‰ๆๅ–ๅ‡บๆฅใ€‚\n่ฆๆฑ‚๏ผš\n- ้œ€่ฆ็ป™ๅ‡บ่ฏด่ฏไบบๅฏน่ขซ็งฐๅ‘ผไบบ็š„็งฐๅ‘ผ๏ผŒไธ่ฆๆžๅไบ†ใ€‚\n- ็›ธๅŒ็š„่ฏด่ฏไบบๅ’Œ่ขซ็งฐๅ‘ผไบบๆœ€ๅคš็ป™ๅ‡บไธ€ไธชๆœ€ๅธธ็”จ็š„็งฐๅ‘ผใ€‚\n- ่ฏทไธ่ฆ่พ“ๅ‡บไบ’็›ธๆฒกๆœ‰ๆ˜ต็งฐ็š„็งฐๅ‘ผๆ–นๅผใ€‚\n- ่พ“ๅ‡บๆ ผๅผๅฆ‚ไธ‹๏ผš\n```\n### ็งฐๅ‘ผๆ–นๅผ1\n- **่ฏด่ฏไบบ**๏ผš...\n- **่ขซ็งฐๅ‘ผไบบ**๏ผš...\n- **...ๅฏน...็š„ๆ˜ต็งฐ**๏ผš...\n### ็งฐๅ‘ผๆ–นๅผ2\n- **่ฏด่ฏไบบ**๏ผš...\n- **่ขซ็งฐๅ‘ผไบบ**๏ผš...\n- **...ๅฏน...็š„ๆ˜ต็งฐ**๏ผš...\n### ็งฐๅ‘ผๆ–นๅผ3\n- **่ฏด่ฏไบบ**๏ผš...\n- **่ขซ็งฐๅ‘ผไบบ**๏ผš...\n- **...ๅฏน...็š„ๆ˜ต็งฐ**๏ผš...\n...\n```\n'#
DEFAULT_INPUT_TEMPLATE = '# ๆ–‡ๆœฌ\n```\n{text}\n```\n'#
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*็งฐๅ‘ผๆ–นๅผ(\\d+)\\s*\n        -\\s*\\*\\*่ฏด่ฏไบบ\\*\\*\\s*๏ผš\\s*(.*?)\\s*\n        -\\s*\\*\\*่ขซ็งฐๅ‘ผไบบ\\*\\*\\s*๏ผš\\s*(.*?)\\s*\n        -\\s*\\*\\*(.*?)ๅฏน(.*?)็š„ๆ˜ต็งฐ\\*\\*\\s*๏ผš\\s*(.*?)(?=\\#\\#\\#|\\Z) # for double check\n    '#
__init__(api_model: str = 'gpt-4o', *, nickname_key: str = 'nickname', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • nickname_key – The key name to store the nickname relationships in the meta field. It is “nickname” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt โ€“ System prompt for the task.

  • input_template โ€“ Template for building the model input.

  • output_pattern โ€“ Regular expression for parsing model output.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text from the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
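The default output pattern captures the speaker and addressee twice (once in their own bullets and once in the final bullet), which allows a consistency check when parsing. A sketch, assuming the pattern is compiled with `re.VERBOSE | re.DOTALL`:

```python
import re

# The documented default pattern; re.VERBOSE | re.DOTALL is assumed here.
NICKNAME_PATTERN = r'''
    \#\#\#\s*称呼方式(\d+)\s*
    -\s*\*\*说话人\*\*\s*：\s*(.*?)\s*
    -\s*\*\*被称呼人\*\*\s*：\s*(.*?)\s*
    -\s*\*\*(.*?)对(.*?)的昵称\*\*\s*：\s*(.*?)(?=\#\#\#|\Z)
'''

raw = (
    "### 称呼方式1\n"
    "- **说话人**：宝玉\n"
    "- **被称呼人**：林黛玉\n"
    "- **宝玉对林黛玉的昵称**：颦儿\n"
)

results = []
for idx, speaker, addressee, spk2, adr2, nickname in re.findall(
        NICKNAME_PATTERN, raw, re.VERBOSE | re.DOTALL):
    # The repeated names in the third bullet act as a double check.
    if speaker == spk2 and addressee == adr2:
        results.append((speaker, addressee, nickname.strip()))
```

Records that fail the double check are simply dropped in this sketch.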

class data_juicer.ops.mapper.ExtractSupportTextMapper(api_model: str = 'gpt-4o', *, summary_key: str = 'event_description', support_text_key: str = 'support_text', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Extracts a supporting sub-text from the original text based on a given summary.

This operator uses an API model to identify and extract a segment of the original text that best matches the provided summary. It leverages a system prompt and input template to guide the extraction process. The extracted support text is stored in the specified meta field key. If the extraction fails or returns an empty string, the original summary is used as a fallback. The operator retries the extraction up to a specified number of times in case of errors.

DEFAULT_SYSTEM_PROMPT = 'ไฝ ๅฐ†ๆ‰ฎๆผ”ไธ€ไธชๆ–‡ๆœฌๆ‘˜ๅฝ•ๅŠฉๆ‰‹็š„่ง’่‰ฒใ€‚ไฝ ็š„ไธป่ฆไปปๅŠกๆ˜ฏๅŸบไบŽ็ป™ๅฎš็š„ๆ–‡็ซ ๏ผˆ็งฐไธบโ€œๅŽŸๆ–‡โ€๏ผ‰ไปฅๅŠๅฏนๅŽŸๆ–‡ๆŸไธช้ƒจๅˆ†็š„็ฎ€็Ÿญๆ่ฟฐๆˆ–ๆ€ป็ป“๏ผˆ็งฐไธบโ€œๆ€ป็ป“โ€๏ผ‰๏ผŒๅ‡†็กฎๅœฐ่ฏ†ๅˆซๅนถๆๅ–ๅ‡บไธŽ่ฏฅๆ€ป็ป“็›ธๅฏนๅบ”็š„ๅŽŸๆ–‡็‰‡ๆฎตใ€‚\n่ฆๆฑ‚๏ผš\n- ไฝ ้œ€่ฆๅฐฝๅฏ่ƒฝ็ฒพ็กฎๅœฐๅŒน้…ๅˆฐๆœ€็ฌฆๅˆๆ€ป็ป“ๅ†…ๅฎน็š„้‚ฃ้ƒจๅˆ†ๅ†…ๅฎน\n- ๅฆ‚ๆžœๅญ˜ๅœจๅคšไธชๅฏ่ƒฝ็š„็ญ”ๆกˆ๏ผŒ่ฏท้€‰ๆ‹ฉๆœ€่ดด่ฟ‘ๆ€ป็ป“ๆ„ๆ€็š„้‚ฃไธช\n- ไธ‹้ขๆ˜ฏไธ€ไธชไพ‹ๅญๅธฎๅŠฉ็†่งฃ่ฟ™ไธ€่ฟ‡็จ‹๏ผš\n### ๅŽŸๆ–‡๏ผš\nใ€Š็บขๆฅผๆขฆใ€‹ๆ˜ฏไธญๅ›ฝๅคๅ…ธๅฐ่ฏดๅ››ๅคงๅ่‘—ไน‹ไธ€๏ผŒ็”ฑๆธ…ไปฃไฝœๅฎถๆ›น้›ช่Šนๅˆ›ไฝœใ€‚ๅฎƒ่ฎฒ่ฟฐไบ†่ดพๅฎ็މใ€ๆž—้ป›็މ็ญ‰ไบบ็š„็ˆฑๆƒ…ๆ•…ไบ‹ๅŠๅ››ๅคงๅฎถๆ—็š„ๅ…ด่กฐๅކ็จ‹ใ€‚ไนฆไธญ้€š่ฟ‡ๅคๆ‚็š„ไบบ็‰ฉๅ…ณ็ณปๅฑ•็Žฐไบ†ๅฐๅปบ็คพไผš็š„ๅ„็ง็Ÿ›็›พๅ†ฒ็ชใ€‚ๅ…ถไธญๅ…ณไบŽ่ดพๅบœๅ†…้ƒจๆ–—ไบ‰็š„้ƒจๅˆ†ๅฐคๅ…ถ็ฒพๅฝฉ๏ผŒ็‰นๅˆซๆ˜ฏ็Ž‹็†™ๅ‡คไธŽๅฐคไบŒๅงไน‹้—ด็š„ไบ‰ๆ–—๏ผŒ็”ŸๅŠจๆ็ป˜ไบ†ๆƒๅŠ›ไบ‰ๅคบไธ‹็š„ๅฅณๆ€งๅฝข่ฑกใ€‚ๆญคๅค–๏ผŒใ€Š็บขๆฅผๆขฆใ€‹่ฟ˜ไปฅๅ…ถ็ฒพ็พŽ็š„่ฏ—่ฏ้—ปๅ๏ผŒ่ฟ™ไบ›่ฏ—่ฏไธไป…ๅขžๆทปไบ†ๆ–‡ๅญฆ่‰ฒๅฝฉ๏ผŒไนŸๆทฑๅˆปๅๆ˜ ไบ†ไบบ็‰ฉ็š„ๆ€งๆ ผ็‰น็‚นๅ’Œๅ‘ฝ่ฟ่ตฐๅ‘ใ€‚\n\n### ๆ€ป็ป“๏ผš\nๆ่ฟฐไบ†ไนฆไธญ็š„ไธคไธชๅฅณๆ€ง่ง’่‰ฒไน‹้—ดๅ›ด็ป•ๆƒๅŠ›ๅฑ•ๅผ€็š„็ซžไบ‰ใ€‚\n\n### ๅŽŸๆ–‡ๆ‘˜ๅฝ•๏ผš\nๅ…ถไธญๅ…ณไบŽ่ดพๅบœๅ†…้ƒจๆ–—ไบ‰็š„้ƒจๅˆ†ๅฐคๅ…ถ็ฒพๅฝฉ๏ผŒ็‰นๅˆซๆ˜ฏ็Ž‹็†™ๅ‡คไธŽๅฐคไบŒๅงไน‹้—ด็š„ไบ‰ๆ–—๏ผŒ็”ŸๅŠจๆ็ป˜ไบ†ๆƒๅŠ›ไบ‰ๅคบไธ‹็š„ๅฅณๆ€งๅฝข่ฑกใ€‚'#
DEFAULT_INPUT_TEMPLATE = '### ๅŽŸๆ–‡๏ผš\n{text}\n\n### ๆ€ป็ป“๏ผš\n{summary}\n\n### ๅŽŸๆ–‡ๆ‘˜ๅฝ•๏ผš\n'#
__init__(api_model: str = 'gpt-4o', *, summary_key: str = 'event_description', support_text_key: str = 'support_text', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • summary_key – The key name to store the input summary in the meta field. It is “event_description” by default.

  • support_text_key – The key name to store the output support text for the summary in the meta field. It is “support_text” by default.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt โ€“ System prompt for the task.

  • input_template โ€“ Template for building the model input.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text from the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs โ€“ Extra keyword arguments.

process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
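The default input template is an ordinary format string, and the documented fallback (an empty extraction falls back to the summary) is easy to sketch:

```python
# Documented default template for building the model input.
DEFAULT_INPUT_TEMPLATE = '### 原文：\n{text}\n\n### 总结：\n{summary}\n\n### 原文摘录：\n'

def choose_support_text(model_output, summary):
    # Fallback described above: an empty or failed extraction uses the summary.
    return model_output.strip() or summary

prompt = DEFAULT_INPUT_TEMPLATE.format(
    text="《红楼梦》是中国古典小说四大名著之一。",
    summary="介绍了《红楼梦》的文学地位。",
)
```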

class data_juicer.ops.mapper.ExtractTablesFromHtmlMapper(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Extracts tables from HTML content and stores them in a specified field.

This operator processes HTML content to extract tables. It can either retain or remove HTML tags based on the retain_html_tags parameter. If retain_html_tags is False, it can also include or exclude table headers based on the include_header parameter. The extracted tables are stored in the tables_field_name field within the sampleโ€™s metadata. If no tables are found, an empty list is stored. If the tables have already been extracted, the operator will not reprocess the sample.

__init__(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • tables_field_name – Field name to store the extracted tables.

  • retain_html_tags – If True, retains HTML tags in the tables; otherwise, removes them.

  • include_header – If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False and applies solely to the extracted table content.

process_single(sample)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
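The tag-stripping behavior (`retain_html_tags=False`) can be illustrated with only the standard library. This is a rough, hypothetical sketch of the idea, not the operator's implementation:

```python
from html.parser import HTMLParser

class TableTextExtractor(HTMLParser):
    """Collect cell text per table, optionally skipping header (<th>) cells."""

    def __init__(self, include_header=True):
        super().__init__()
        self.include_header = include_header
        self.tables = []          # one list of rows per <table>
        self._row = None
        self._in_cell = False
        self._in_header_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._in_header_cell = tag == "th"

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows whose cells were all filtered out
                self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and (self.include_header or not self._in_header_cell):
            self._row.append(data.strip())

parser = TableTextExtractor(include_header=False)
parser.feed("<table><tr><th>name</th><th>age</th></tr>"
            "<tr><td>Alex</td><td>30</td></tr></table>")
# parser.tables -> [[['Alex', '30']]]   (header row excluded)
```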

class data_juicer.ops.mapper.FixUnicodeMapper(normalization: str = None, *args, **kwargs)[source]#

Bases: Mapper

Fixes unicode errors in text samples.

This operator corrects common unicode errors and normalizes the text to a specified Unicode normalization form. The default normalization form is โ€˜NFCโ€™, but it can be set to โ€˜NFKCโ€™, โ€˜NFDโ€™, or โ€˜NFKDโ€™ during initialization. It processes text samples in batches, applying the specified normalization to each sample. If an unsupported normalization form is provided, a ValueError is raised.

__init__(normalization: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • normalization – the Unicode normalization form to apply, one of [‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’]; default ‘NFC’.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
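The normalization step maps directly onto the standard library's `unicodedata.normalize`. A minimal sketch of the batched behavior (note the real operator also repairs unicode errors such as mojibake before normalizing, which this sketch omits):

```python
import unicodedata

def fix_unicode_batched(texts, normalization="NFC"):
    """Normalize each text sample; unsupported forms raise ValueError."""
    if normalization not in ("NFC", "NFKC", "NFD", "NFKD"):
        raise ValueError(f"Unsupported normalization form: {normalization}")
    return [unicodedata.normalize(normalization, t) for t in texts]

# 'e' + combining acute accent composes into the single code point 'é' under NFC.
[fixed] = fix_unicode_batched(["cafe\u0301"])
# fixed -> 'café' (4 code points)
```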
class data_juicer.ops.mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: Mapper

Generates question and answer pairs from examples using a Hugging Face model.

This operator generates QA pairs based on provided seed examples. The number of generated samples is determined by the length of the empty dataset configured in the YAML file. The operator uses a Hugging Face model to generate new QA pairs, which are then filtered based on their similarity to the seed examples. Samples with a similarity score below the specified threshold are kept. The similarity is computed using the ROUGE-L metric. The operator requires a seed file in chatml format, which provides the initial QA examples. The generated QA pairs must follow specific formatting rules, such as maintaining the same format as the input examples and ensuring that questions and answers are paired correctly.

DEFAULT_SYSTEM_PROMPT = '่ฏทไฝ ไป”็ป†่ง‚ๅฏŸๅคšไธช็คบไพ‹ๆ•ฐๆฎ็š„่พ“ๅ…ฅๅ’Œ่พ“ๅ‡บ๏ผŒๆŒ‰็…งไฝ ็š„็†่งฃ๏ผŒๆ€ป็ป“ๅ‡บ็›ธๅบ”่ง„็Ÿฉ๏ผŒ็„ถๅŽๅ†™ๅ‡บไธ€ไธชๆ–ฐ็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘ใ€‚ๆณจๆ„๏ผŒๆ–ฐ็”Ÿๆˆ็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘้œ€่ฆๆปก่ถณๅฆ‚ไธ‹่ฆๆฑ‚๏ผš\n1. ็”Ÿๆˆ็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘ไธ่ƒฝไธŽ่พ“ๅ…ฅ็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘ไธ€่‡ด๏ผŒไฝ†ๆ˜ฏ้œ€่ฆไฟๆŒๆ ผๅผ็›ธๅŒใ€‚\n2. ็”Ÿๆˆ็š„ใ€้—ฎ้ข˜ใ€‘ไธไธ€ๅฎš่ฆๅฑ€้™ไบŽ่พ“ๅ…ฅใ€้—ฎ้ข˜ใ€‘็š„่ฏ้ข˜ๆˆ–้ข†ๅŸŸ๏ผŒ็”Ÿๆˆ็š„ใ€ๅ›ž็ญ”ใ€‘้œ€่ฆๆญฃ็กฎๅ›ž็ญ”็”Ÿๆˆ็š„ใ€้—ฎ้ข˜ใ€‘ใ€‚\n3. ๆไพ›็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘ๅฏ่ƒฝๆ˜ฏๅคš่ฝฎๅฏน่ฏ๏ผŒ็”Ÿๆˆ็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘ไนŸๅฏไปฅๆ˜ฏๅคš่ฝฎ๏ผŒไฝ†ๆ˜ฏ้œ€่ฆไฟๆŒๆ ผๅผ็›ธๅŒใ€‚\n4. ็”Ÿๆˆ็š„ใ€้—ฎ้ข˜ใ€‘ๅ’Œใ€ๅ›ž็ญ”ใ€‘ๅฟ…้กปๆˆๅฏนๅ‡บ็Žฐ๏ผŒ่€Œไธ”ใ€้—ฎ้ข˜ใ€‘้œ€่ฆๅœจใ€ๅ›ž็ญ”ใ€‘ไน‹ๅ‰ใ€‚\n'#
DEFAULT_INPUT_TEMPLATE = '{}'#
DEFAULT_EXAMPLE_TEMPLATE = '\nๅฆ‚ไธ‹ๆ˜ฏไธ€ๆก็คบไพ‹ๆ•ฐๆฎ๏ผš\n{}'#
DEFAULT_QA_PAIR_TEMPLATE = 'ใ€้—ฎ้ข˜ใ€‘\n{}\nใ€ๅ›ž็ญ”ใ€‘\n{}\n'#
DEFAULT_OUTPUT_PATTERN = 'ใ€้—ฎ้ข˜ใ€‘(.*?)ใ€ๅ›ž็ญ”ใ€‘(.*?)(?=ใ€้—ฎ้ข˜ใ€‘|$)'#
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model โ€“ Huggingface model ID.

  • seed_file โ€“ Path to the seed file in chatml format.

  • example_num – The number of selected examples. Randomly selects this many examples from seed_file and puts them into the prompt as QA examples.

  • similarity_threshold โ€“ The similarity score threshold between the generated samples and the seed examples. Range from 0 to 1. Samples with similarity score less than this threshold will be kept.

  • system_prompt โ€“ System prompt for guiding the generation task.

  • input_template – template for building the input prompt. It must include one placeholder '{}', which will be replaced by example_num formatted examples defined by example_template.

  • example_template – template for formatting one QA example. It must include one placeholder '{}', which will be replaced by one formatted qa_pair.

  • qa_pair_template – template for formatting a single QA pair within each example. Must include two placeholders '{}' for the question and answer.

  • output_pattern โ€“ Regular expression pattern to extract questions and answers from model response.

  • enable_vllm โ€“ Whether to use vllm for inference acceleration.

  • model_params โ€“ Parameters for initializing the model.

  • sampling_params – sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs โ€“ Extra keyword arguments.

build_input(qa_examples)[source]#
parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.GenerateQAFromTextMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: Mapper

Generates question and answer pairs from text using a specified model.

This operator uses a Hugging Face model to generate QA pairs from the input text. It supports both Hugging Face and vLLM models for inference. The recommended models, such as โ€˜alibaba-pai/pai-llama3-8b-doc2qaโ€™, are trained on Chinese data and are suitable for Chinese text. The operator can limit the number of generated QA pairs per text and allows custom output patterns for parsing the modelโ€™s response. By default, it uses a regular expression to extract questions and answers from the modelโ€™s output. If no QA pairs are extracted, a warning is logged.

__init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model โ€“ Huggingface model ID.

  • max_num – the maximum number of QA pairs returned for each text. No limit if it is None.

  • output_pattern โ€“ Regular expression pattern to extract questions and answers from model response.

  • enable_vllm โ€“ Whether to use vllm for inference acceleration.

  • model_params โ€“ Parameters for initializing the model.

  • sampling_params – sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs โ€“ Extra keyword arguments.

The default data format parsed by this interface is as follows:

Model Input:

่’™ๅคๅ›ฝ็š„้ฆ–้ƒฝๆ˜ฏไนŒๅ…ฐๅทดๆ‰˜๏ผˆUlaanbaatar๏ผ‰ ๅ†ฐๅฒ›็š„้ฆ–้ƒฝๆ˜ฏ้›ทๅ…‹้›…ๆœชๅ…‹๏ผˆReykjavik๏ผ‰

Model Output:

่’™ๅคๅ›ฝ็š„้ฆ–้ƒฝๆ˜ฏไนŒๅ…ฐๅทดๆ‰˜๏ผˆUlaanbaatar๏ผ‰ ๅ†ฐๅฒ›็š„้ฆ–้ƒฝๆ˜ฏ้›ทๅ…‹้›…ๆœชๅ…‹๏ผˆReykjavik๏ผ‰ Human: ่ฏท้—ฎ่’™ๅคๅ›ฝ็š„้ฆ–้ƒฝๆ˜ฏๅ“ช้‡Œ๏ผŸ Assistant: ไฝ ๅฅฝ๏ผŒๆ นๆฎๆไพ›็š„ไฟกๆฏ๏ผŒ่’™ๅคๅ›ฝ็š„้ฆ–้ƒฝๆ˜ฏไนŒๅ…ฐๅทดๆ‰˜๏ผˆUlaanbaatar๏ผ‰ใ€‚ Human: ๅ†ฐๅฒ›็š„้ฆ–้ƒฝๆ˜ฏๅ“ช้‡Œๅ‘ข๏ผŸ Assistant: ๅ†ฐๅฒ›็š„้ฆ–้ƒฝๆ˜ฏ้›ทๅ…‹้›…ๆœชๅ…‹๏ผˆReykjavik๏ผ‰ใ€‚ โ€ฆ

parse_output(raw_output)[source]#
process_batched(samples, rank=None)[source]#
class data_juicer.ops.mapper.HumanPreferenceAnnotationMapper(label_config_file: str = None, answer1_key: str = 'answer1', answer2_key: str = 'answer2', prompt_key: str = 'prompt', chosen_key: str = 'chosen', rejected_key: str = 'rejected', **kwargs)[source]#

Bases: LabelStudioAnnotationMapper

Operator for human preference annotation using Label Studio.

This operator formats and presents pairs of answers to a prompt for human evaluation. It uses a default or custom Label Studio configuration to display the prompt and answer options. The operator processes the annotations to determine the preferred answer, updating the sample with the chosen and rejected answers. The operator requires specific keys in the samples for the prompt and answer options. If these keys are missing, it logs warnings and uses placeholder text. The annotated results are processed to update the sample with the chosen and rejected answers.

DEFAULT_LABEL_CONFIG = '\n    <View className="root">\n      <Style>\n        .root {\n          box-sizing: border-box;\n          margin: 0;\n          padding: 0;\n          font-family: \'Roboto\',\n            sans-serif;\n          line-height: 1.6;\n          background-color: #f0f0f0;\n        }\n\n        .container {\n          margin: 0 auto;\n          padding: 20px;\n          background-color: #ffffff;\n          border-radius: 5px;\n          box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);\n        }\n\n        .prompt {\n          padding: 20px;\n          background-color: #0084ff;\n          color: #ffffff;\n          border-radius: 5px;\n          margin-bottom: 20px;\n          box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);\n        }\n\n        .answers {\n          display: flex;\n          justify-content: space-between;\n          flex-wrap: wrap;\n          gap: 20px;\n        }\n\n        .answer-box {\n          flex-basis: 49%;\n          padding: 20px;\n          background-color: rgba(44, 62, 80, 0.9);\n          color: #ffffff;\n          border-radius: 5px;\n          box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);\n        }\n\n        .answer-box p {\n          word-wrap: break-word;\n        }\n\n        .answer-box:hover {\n          background-color: rgba(52, 73, 94, 0.9);\n          cursor: pointer;\n          transition: all 0.3s ease;\n        }\n\n        .lsf-richtext__line:hover {\n          background: unset;\n        }\n\n        .answer-box .lsf-object {\n          padding: 20px\n        }\n      </Style>\n      <View className="container">\n        <View className="prompt">\n          <Text name="prompt" value="$prompt" />\n        </View>\n        <View className="answers">\n          <Pairwise name="comparison" toName="answer1,answer2"\n                    selectionStyle="background-color: #27ae60; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.2); border: 2px solid #2ecc71; cursor: pointer; transition: all 0.3s ease;"\n                    leftChoiceValue="answer1" rightChoiceValue="answer2" />\n          <View className="answer-box">\n            <Text name="answer1" value="$answer1" />\n          </View>\n          <View className="answer-box">\n            <Text name="answer2" value="$answer2" />\n          </View>\n        </View>\n      </View>\n    </View>\n    '#
__init__(label_config_file: str = None, answer1_key: str = 'answer1', answer2_key: str = 'answer2', prompt_key: str = 'prompt', chosen_key: str = 'chosen', rejected_key: str = 'rejected', **kwargs)[source]#

Initialize the human preference annotation operator.

Parameters:
  • label_config_file โ€“ Path to the label config file

  • answer1_key โ€“ Key for the first answer

  • answer2_key โ€“ Key for the second answer

  • prompt_key โ€“ Key for the prompt/question

  • chosen_key โ€“ Key for the chosen answer

  • rejected_key โ€“ Key for the rejected answer

class data_juicer.ops.mapper.ImageBlurMapper(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Blurs images in the dataset with a specified probability and blur type.

This operator blurs images using one of three types: mean, box, or Gaussian. The probability of an image being blurred is controlled by the p parameter. The blur effect is applied using a kernel with a specified radius. Blurred images are saved to a directory, which can be specified or defaults to the input directory. If the save directory is not provided, the DJ_PRODUCED_DATA_DIR environment variable can be used to set it. The operator ensures that the blur type is one of the supported options and that the radius is non-negative.

__init__(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • p โ€“ Probability of the image being blurred.

  • blur_type โ€“ Type of blur kernel, including [โ€˜meanโ€™, โ€˜boxโ€™, โ€˜gaussianโ€™].

  • radius โ€“ Radius of blur kernel.

  • save_dir โ€“ The directory where generated image files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args
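The probability gate and the three blur types can be sketched with Pillow. The mapping of blur_type to specific Pillow filters is an assumption made for illustration; the operator's internal implementation may differ:

```python
import random
from PIL import Image, ImageFilter

def blur_image(img, p=0.2, blur_type='gaussian', radius=2):
    """Blur `img` with probability `p` using the requested kernel type.

    The blur_type -> Pillow filter mapping here is illustrative only.
    """
    if random.random() > p:
        return img  # leave the image untouched
    if blur_type == 'gaussian':
        flt = ImageFilter.GaussianBlur(radius)
    elif blur_type == 'box':
        flt = ImageFilter.BoxBlur(radius)
    elif blur_type == 'mean':
        flt = ImageFilter.BLUR  # fixed-size mean kernel
    else:
        raise ValueError(f'unsupported blur_type: {blur_type}')
    return img.filter(flt)

# With p=1.0 the blur is always applied; image dimensions are preserved.
img = Image.new('RGB', (32, 32), color=(255, 0, 0))
out = blur_image(img, p=1.0, blur_type='gaussian', radius=2)
```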

process_single(sample, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageCaptioningFromGPT4VMapper(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]#

Bases: Mapper

Generates text captions for images using the GPT-4 Vision model.

This operator generates text based on the provided images and specified parameters. It supports different modes of text generation, including โ€˜reasoningโ€™, โ€˜descriptionโ€™, โ€˜conversationโ€™, and โ€˜customโ€™. The generated text can be added to the original sample or replace it, depending on the keep_original_sample parameter. The operator uses a Hugging Face tokenizer and the GPT-4 Vision API to generate the text. The any_or_all parameter determines whether all or any of the images in a sample must meet the generation criteria for the sample to be kept. If user_prompt_key is set, it will use the prompt from the sample; otherwise, it will use the user_prompt parameter. If both are set, user_prompt_key takes precedence.

__init__(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • mode โ€“ mode of text generated from images, can be one of [โ€˜reasoningโ€™, โ€˜descriptionโ€™, โ€˜conversationโ€™, โ€˜customโ€™]

  • api_key โ€“ the API key to authenticate the request.

  • max_token โ€“ the maximum number of tokens to generate. Default is 500.

  • temperature – controls the randomness of the output (range from 0 to 1). Default is 1.0.

  • system_prompt – a string prompt used to set the context of the conversation and provide global guidance or rules for GPT-4 Vision so that it can generate responses in the expected way. This parameter is only used when mode is set to 'custom'.

  • user_prompt – a string prompt to guide the generation of GPT-4 Vision for each sample. It is "" by default, which means no prompt is provided.

  • user_prompt_key – the key name of the field in samples that stores the prompt for each sample. It is used to set different prompts for different samples. If it is None, the prompt from the user_prompt parameter is used. It is None by default.

  • keep_original_sample โ€“ whether to keep the original sample. If itโ€™s set to False, there will be only generated text in the final datasets and the original text will be removed. Itโ€™s True in default.

  • any_or_all โ€“ keep this sample with โ€˜anyโ€™ or โ€˜allโ€™ strategy of all images. โ€˜anyโ€™: keep this sample if any images meet the condition. โ€˜allโ€™: keep this sample only if all images meet the condition.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.ImageCaptioningMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]#

Bases: Mapper

Generates image captions using a Hugging Face model and appends them to samples.

This operator generates captions for images in the input samples using a specified Hugging Face model. It can generate multiple captions per image and apply different strategies to retain the generated captions. The operator supports three retention modes: โ€˜random_anyโ€™, โ€˜similar_one_simhashโ€™, and โ€˜allโ€™. In โ€˜random_anyโ€™ mode, a random caption is retained. In โ€˜similar_one_simhashโ€™ mode, the most similar caption to the original text (based on SimHash) is retained. In โ€˜allโ€™ mode, all generated captions are concatenated and retained. The operator can also keep or discard the original sample based on the keep_original_sample parameter. If both prompt and prompt_key are set, the prompt_key takes precedence.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_img2seq โ€“ model name on huggingface to generate caption

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • caption_num โ€“ how many candidate captions to generate for each image

  • keep_candidate_mode –

    retain strategy for the generated caption_num candidates.

    'random_any': retain a random one of the generated captions

    'similar_one_simhash': retain the generated caption that is most similar to the original caption

    'all': retain all generated captions by concatenation

Note

This is a batched OP whose input and output are both lists. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. For the 'random_any' and 'similar_one_simhash' modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False. For the 'all' mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample โ€“ whether to keep the original sample. If itโ€™s set to False, there will be only generated captions in the final datasets and the original captions will be removed. Itโ€™s True in default.

  • prompt โ€“ a string prompt to guide the generation of blip2 model for all samples globally. Itโ€™s None in default, which means no prompt provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It is used to set different prompts for different samples. If it is None, the prompt in the prompt parameter is used. It is None by default.

  • args โ€“ extra args

  • kwargs โ€“ extra args
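The sample-count rules in the note above can be written out as a small helper. The function name is hypothetical and the code only sketches the arithmetic, not the operator itself:

```python
def total_output_samples(n_lists, batch_size, caption_num, mode, keep_original):
    """Total samples produced by the captioning mapper, per the rules above:
    one retained caption per sample for 'random_any' / 'similar_one_simhash',
    all caption_num captions for 'all', plus the originals if kept."""
    nb = n_lists * batch_size  # Nb: total input samples
    if mode in ('random_any', 'similar_one_simhash'):
        generated = nb
    elif mode == 'all':
        generated = caption_num * nb
    else:
        raise ValueError(f'unknown mode: {mode}')
    return generated + (nb if keep_original else 0)
```

For example, with $N=2$ lists of batch size $b=4$ and $M=3$ captions, 'random_any' with originals kept yields $2Nb = 16$ samples, while 'all' yields $(1+M)Nb = 32$.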

process_batched(samples, rank=None)[source]#

Note

This is a batched OP whose input and output are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ for the 'all' mode.

Parameters:

samples

Returns:

class data_juicer.ops.mapper.ImageDetectionYoloMapper(imgsz=640, conf=0.05, iou=0.5, model_path='yolo11n.pt', *args, **kwargs)[source]#

Bases: Mapper

Perform object detection using YOLO on images and return bounding boxes and class labels.

This operator uses a YOLO model to detect objects in images. It processes each image in the sample, returning the bounding boxes and class labels for detected objects. The operator sets the bbox_tag and class_label_tag fields in the sampleโ€™s metadata. If no image is present or no objects are detected, it sets bbox_tag to an empty array and class_label_tag to -1. The operator uses a confidence score threshold and IoU (Intersection over Union) score threshold to filter detections.

__init__(imgsz=640, conf=0.05, iou=0.5, model_path='yolo11n.pt', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • imgsz โ€“ resolution for image resizing

  • conf โ€“ confidence score threshold

  • iou โ€“ IoU (Intersection over Union) score threshold

  • model_path โ€“ the path to the YOLO model.

process_single(sample, rank=None, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageDiffusionMapper(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.8, guidance_scale: float = 7.5, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Generate images using a diffusion model based on provided captions.

This operator uses a Hugging Face diffusion model to generate images from given captions. It supports different modes for retaining generated samples, including random selection, similarity-based selection, and retaining all. The operator can also generate captions if none are provided, using a Hugging Face image-to-sequence model. The strength parameter controls the extent of transformation from the reference image, and the guidance scale influences how closely the generated images match the text prompt. Generated images can be saved in a specified directory or the same directory as the input files. This is a batched operation, processing multiple samples at once and producing a specified number of augmented images per sample.

__init__(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.8, guidance_scale: float = 7.5, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_diffusion โ€“ diffusion model name on huggingface to generate the image.

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • torch_dtype โ€“ the floating point type used to load the diffusion model. Can be one of [โ€˜fp32โ€™, โ€˜fp16โ€™, โ€˜bf16โ€™]

  • revision โ€“ The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier allowed by Git.

  • strength – indicates the extent to transform the reference image. Must be between 0 and 1. The reference image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps, so a value of 1 essentially ignores the reference image.

  • guidance_scale – a higher guidance scale value encourages the model to generate images closely linked to the text prompt, at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • aug_num – the number of images to be produced per sample by the stable diffusion model.

  • keep_original_sample – whether to keep the original sample. If it is set to False, only the generated samples will remain in the final dataset and the original samples will be removed. It's True by default.

  • caption_key – the key name of the field in samples that stores the caption for each image. It can be a string if there is only one image in each sample; otherwise, it should be a list. If it is None, ImageDiffusionMapper will produce a caption for each image.

  • hf_img2seq โ€“ model name on huggingface to generate caption if caption_key is None.

  • save_dir โ€“ The directory where generated image files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.
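The relationship between strength and the number of denoising steps described above can be illustrated with the convention used by typical img2img diffusion pipelines (e.g. the diffusers library). This is an assumption about standard pipeline behavior, not a statement about this operator's exact internals:

```python
def effective_denoising_steps(num_inference_steps, strength):
    """Sketch of typical img2img scheduling: higher strength adds more
    initial noise, so more of the scheduled steps are actually run.
    Illustrative only; real pipelines may round or clamp differently."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError('strength must be between 0 and 1')
    return min(int(num_inference_steps * strength), num_inference_steps)
```

At strength 1.0 all scheduled steps run (the reference image is effectively ignored); at strength 0.8 only 80% of them do, preserving more of the original image.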

process_batched(samples, rank=None, context=False)[source]#

Note

This is a batched OP whose input and output are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote aug_num as $M$. The total number of samples after generation is $(1+M)Nb$.

Parameters:

samples

Returns:

class data_juicer.ops.mapper.ImageFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: Annotated[float, Ge(ge=0)] = 2, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Mapper to blur faces detected in images.

This operator uses an OpenCV classifier to detect faces in images and applies a specified blur type to the detected face regions. The blur types supported are โ€˜meanโ€™, โ€˜boxโ€™, and โ€˜gaussianโ€™. The radius of the blur kernel can be adjusted. If no save directory is provided, the modified images will be saved in the same directory as the input files.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: Annotated[float, Ge(ge=0)] = 2, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • cv_classifier โ€“ OpenCV classifier path for face detection. By default, we will use โ€˜haarcascade_frontalface_alt.xmlโ€™.

  • blur_type โ€“ Type of blur kernel, including [โ€˜meanโ€™, โ€˜boxโ€™, โ€˜gaussianโ€™].

  • radius โ€“ Radius of blur kernel.

  • save_dir โ€“ The directory where generated image files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageRemoveBackgroundMapper(alpha_matting: bool = False, alpha_matting_foreground_threshold: int = 240, alpha_matting_background_threshold: int = 10, alpha_matting_erode_size: int = 10, bgcolor: Tuple[int, int, int, int] | None = None, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Mapper to remove the background of images.

This operator processes each image in the sample, removing its background. It uses the rembg library to perform the background removal. If alpha_matting is enabled, it applies alpha matting with specified thresholds and erosion size. The resulting images are saved in PNG format. The bgcolor parameter can be set to specify a custom background color for the cutout image. The processed images are stored in the directory specified by save_dir, or in the same directory as the input files if save_dir is not provided. The source_file field in the sample is updated to reflect the new file paths.

__init__(alpha_matting: bool = False, alpha_matting_foreground_threshold: int = 240, alpha_matting_background_threshold: int = 10, alpha_matting_erode_size: int = 10, bgcolor: Tuple[int, int, int, int] | None = None, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • alpha_matting โ€“ (bool, optional) Flag indicating whether to use alpha matting. Defaults to False.

  • alpha_matting_foreground_threshold โ€“ (int, optional) Foreground threshold for alpha matting. Defaults to 240.

  • alpha_matting_background_threshold โ€“ (int, optional) Background threshold for alpha matting. Defaults to 10.

  • alpha_matting_erode_size โ€“ (int, optional) Erosion size for alpha matting. Defaults to 10.

  • bgcolor โ€“ (Optional[Tuple[int, int, int, int]], optional) Background color for the cutout image. Defaults to None.

  • save_dir โ€“ The directory where generated image files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args – Additional positional arguments.

  • kwargs – Additional keyword arguments.

process_single(sample, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageSegmentMapper(imgsz=1024, conf=0.05, iou=0.5, model_path='FastSAM-x.pt', *args, **kwargs)[source]#

Bases: Mapper

Perform segment-anything on images and return the bounding boxes.

This operator uses a FastSAM model to detect and segment objects in images, returning their bounding boxes. It processes each image in the sample, and stores the bounding boxes in the โ€˜bbox_tagโ€™ field under the โ€˜metaโ€™ key. If no images are present in the sample, an empty array is stored instead. The operator allows setting the image resolution, confidence threshold, and IoU (Intersection over Union) score threshold for the segmentation process. Bounding boxes are represented as N x M x 4 arrays, where N is the number of images, M is the number of detected boxes, and 4 represents the coordinates.

__init__(imgsz=1024, conf=0.05, iou=0.5, model_path='FastSAM-x.pt', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • imgsz โ€“ resolution for image resizing

  • conf โ€“ confidence score threshold

  • iou โ€“ IoU (Intersection over Union) score threshold

  • model_path โ€“ the path to the FastSAM model. Model name should be one of [โ€˜FastSAM-x.ptโ€™, โ€˜FastSAM-s.ptโ€™].

process_single(sample, rank=None, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageTaggingMapper(tag_field_name: str = 'image_tags', *args, **kwargs)[source]#

Bases: Mapper

Generates image tags for each image in the sample.

This operator processes images to generate descriptive tags. It uses a Hugging Face model to analyze the images and produce relevant tags. The tags are stored in the specified field, defaulting to โ€˜image_tagsโ€™. If the tags are already present in the sample, the operator will not recompute them. For samples without images, an empty tag array is assigned. The generated tags are sorted by frequency and stored as a list of strings.

__init__(tag_field_name: str = 'image_tags', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • tag_field_name – the field name to store the tags. It's 'image_tags' by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.MllmMapper(hf_model: str = 'llava-hf/llava-v1.6-vicuna-7b-hf', max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, *args, **kwargs)[source]#

Bases: Mapper

Mapper to use MLLMs for visual question answering tasks.

This operator uses a Hugging Face model to generate answers based on input text and images. It supports models like llava-hf/llava-v1.6-vicuna-7b-hf and Qwen/Qwen2-VL-7B-Instruct. The operator processes each sample, loading and processing images, and generating responses using the specified model. The generated responses are appended to the sample's text field. The key parameters include the model ID, maximum new tokens, temperature, top-p sampling, and beam search size, which control the generation process.

__init__(hf_model: str = 'llava-hf/llava-v1.6-vicuna-7b-hf', max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • max_new_tokens – the maximum number of new tokens generated by the model.
  • temperature โ€“ used to control the randomness of generated text. The higher the temperature, the more random and creative the generated text will be.

  • top_p โ€“ randomly select the next word from the group of words whose cumulative probability reaches p.

  • num_beams – the beam size for beam search. Larger values generally yield higher-quality generated text at the cost of slower generation.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample=None, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.NlpaugEnMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]#

Bases: Mapper

Augments English text samples using various methods from the nlpaug library.

This operator applies a series of text augmentation techniques to generate new samples. It supports both word-level and character-level augmentations, such as deleting, swapping, and inserting words or characters. The number of augmented samples can be controlled, and the original samples can be kept or removed. When multiple augmentation methods are enabled, they can be applied sequentially or independently. Sequential application means each sample is augmented by all enabled methods in sequence, while independent application generates multiple augmented samples for each method. We recommend using 1-3 augmentation methods at a time to avoid significant changes in sample semantics.

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]#

Initialization method. All augmentation methods use their default parameters. We recommend using only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might be changed significantly.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If it's True, a sample will be augmented by all enabled augmentation methods sequentially. If it's False, each enabled augmentation method generates its augmented samples independently.

  • aug_num – the number of augmented samples to be generated. If sequential is True, a total of aug_num augmented samples will be generated. If it's False, (aug_num * number_of_enabled_methods) augmented samples will be generated.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated texts will remain in the final dataset and the original texts will be removed. It's True by default.

  • delete_random_word – whether to enable the augmentation method of deleting random words from the original texts, e.g. "I love LLM" --> "I LLM"

  • swap_random_word – whether to enable the augmentation method of swapping random contiguous words in the original texts, e.g. "I love LLM" --> "Love I LLM"

  • spelling_error_word – whether to enable the augmentation method of simulating spelling errors for words in the original texts, e.g. "I love LLM" --> "Ai love LLM"

  • split_random_word – whether to enable the augmentation method of splitting words randomly with whitespace in the original texts, e.g. "I love LLM" --> "I love LL M"

  • keyboard_error_char – whether to enable the augmentation method of simulating keyboard errors for characters in the original texts, e.g. "I love LLM" --> "I ;ov4 LLM"

  • ocr_error_char – whether to enable the augmentation method of simulating OCR errors for characters in the original texts, e.g. "I love LLM" --> "I 10ve LLM"

  • delete_random_char – whether to enable the augmentation method of deleting random characters from the original texts, e.g. "I love LLM" --> "I oe LLM"

  • swap_random_char – whether to enable the augmentation method of swapping random contiguous characters in the original texts, e.g. "I love LLM" --> "I ovle LLM"

  • insert_random_char – whether to enable the augmentation method of inserting random characters into the original texts, e.g. "I love LLM" --> "I ^lKove LLM"

  • args โ€“ extra args

  • kwargs โ€“ extra args
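The sample-count bookkeeping described for sequential, aug_num, and keep_original_sample can be sketched in plain Python (a toy model of the documented behaviour, not the operator's implementation):

```python
def output_count(aug_num, num_enabled_methods, sequential, keep_original_sample):
    """Number of output samples produced per input sample."""
    if sequential:
        generated = aug_num                        # one sequential pipeline
    else:
        generated = aug_num * num_enabled_methods  # each method works alone
    return generated + (1 if keep_original_sample else 0)

# 3 enabled methods, aug_num = 2:
print(output_count(2, 3, sequential=True, keep_original_sample=True))   # 3
print(output_count(2, 3, sequential=False, keep_original_sample=True))  # 7
```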

process_batched(samples)[source]#
class data_juicer.ops.mapper.NlpcdaZhMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]#

Bases: Mapper

Augments Chinese text samples using the nlpcda library.

This operator applies various augmentation methods to Chinese text, such as replacing similar words, homophones, deleting random characters, swapping characters, and replacing equivalent numbers. The number of augmented samples generated can be controlled by the aug_num parameter. If sequential is set to True, the augmentation methods are applied in sequence; otherwise, they are applied independently. The original sample can be kept or removed based on the keep_original_sample flag. It is recommended to use 1-3 augmentation methods at a time to avoid significant changes in the semantics of the samples. Some augmentation methods may not work for special texts, resulting in no augmented samples being generated.

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]#

Initialization method. All augmentation methods use their default parameters. We recommend enabling only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly. Note: some augmentation methods may not work on certain special texts, in which case no augmented texts are generated.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If True, a sample is augmented by all enabled augmentation methods in sequence. If False, each enabled augmentation method generates its augmented samples independently.

  • aug_num – number of augmented samples to generate. If sequential is True, a total of aug_num augmented samples are generated. If False, (aug_num * number of enabled augmentation methods) augmented samples are generated.

  • keep_original_sample – whether to keep the original sample. If set to False, only the generated texts remain in the final dataset and the original texts are removed. It is True by default.

  • replace_similar_word – whether to enable replacing random words with their similar words in the original texts, e.g. "这里一共有5种不同的数据增强方法" --> "这边一共有5种不同的数据增强方法"

  • replace_homophone_char – whether to enable replacing random characters with their homophones in the original texts, e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有5种不同的濖据增强方法"

  • delete_random_char – whether to enable deleting random characters from the original texts, e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有5种不同的数据增强"

  • swap_random_char – whether to enable swapping random contiguous characters in the original texts, e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有5种不同的数据强增方法"

  • replace_equivalent_num – whether to enable replacing random numbers with their equivalent representations in the original texts. Note: only applies to numbers for now. e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有伍种不同的数据增强方法"

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.OptimizePromptMapper(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', gen_num: Annotated[int, Gt(gt=0)] = 3, max_example_num: Annotated[int, Gt(gt=0)] = 3, keep_original_sample: bool = True, retry_num: int = 3, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, prompt_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: Mapper

Optimize prompts based on existing ones in the same batch.

This operator uses the existing prompts and newly optimized prompts as examples to generate better prompts. It supports using a Hugging Face model or an API for text generation. The operator can be configured to keep the original samples or replace them with the generated ones. The optimization process involves multiple retries if the generated prompt is empty. The operator operates in batch mode and can leverage vLLM for inference acceleration on CUDA devices.

  • Uses existing and newly generated prompts to optimize future prompts.

  • Supports both Hugging Face models and API-based text generation.

  • Can keep or replace original samples with generated ones.

  • Retries up to a specified number of times if the generated prompt is empty.

  • Operates in batch mode and can use vLLM for acceleration on CUDA.

  • References: https://doc.agentscope.io/v0/en/build_tutorial/prompt_optimization.html

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例提示词,按照你的理解,总结出相应规矩,然后写出一个新的更好的提示词,以让模型更好地完成指定任务。注意,新生成的【提示词】需要满足如下要求:\n1. 生成的【提示词】不能与输入的【提示词】完全一致,但是需要保持格式类似。\n2. 生成的【提示词】相比于输入的【提示词】不能有很大的变化,更多应该是关键词、核心参数等方面的微调。\n3. 生成时只需生成带有【提示词】前缀的提示词,不需生成其他任何额外信息。\n'#
DEFAULT_INPUT_TEMPLATE = '{}'#
DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据:\n{}'#
DEFAULT_PROMPT_TEMPLATE = '【提示词】\n{}\n'#
DEFAULT_OUTPUT_PATTERN = '【提示词】(.*?)(?=【|$)'#
__init__(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', gen_num: Annotated[int, Gt(gt=0)] = 3, max_example_num: Annotated[int, Gt(gt=0)] = 3, keep_original_sample: bool = True, retry_num: int = 3, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, prompt_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Initialization method.

Parameters:
  • api_or_hf_model – API or Hugging Face model name.

  • gen_num – The number of new prompts to generate.

  • max_example_num – Maximum number of example prompts to include as context when generating new optimized prompts.

  • keep_original_sample – whether to keep the original samples. If set to False, only the generated texts remain in the final dataset and the original texts are removed. It is True by default.

  • retry_num – how many times to retry generation if the parsed generated prompt is empty. It is 3 by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the input prompt. It must include one placeholder '{}', which is replaced by max_example_num formatted examples defined by example_template.

  • example_template – Template for formatting one prompt example. It must include one placeholder '{}', which is replaced by one formatted prompt.

  • prompt_template – Template for formatting a single prompt within each example. It must include one placeholder '{}' for the prompt text.

  • output_pattern – Regular expression pattern to extract the optimized prompts from the model response.

  • enable_vllm – Whether to use vLLM for inference acceleration.

  • is_hf_model – If True, use Transformers to load a Hugging Face or local LLM.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.

build_input(prompt_examples)[source]#
parse_output(raw_output)[source]#
generate_one_prompt(model, input_prompt_samples)[source]#
process_batched(samples, rank=None, *args, **kwargs)[source]#
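The default output pattern simply grabs everything after a 【提示词】 prefix. A standalone sketch with Python's re module shows how optimized prompts are pulled out of a model response (the response string here is invented for illustration):

```python
import re

DEFAULT_OUTPUT_PATTERN = '【提示词】(.*?)(?=【|$)'

raw_output = '【提示词】请写一首关于秋天的五言绝句【提示词】请写一首关于冬天的七言绝句'
prompts = [p.strip() for p in re.findall(DEFAULT_OUTPUT_PATTERN, raw_output, re.DOTALL)]
# prompts -> ['请写一首关于秋天的五言绝句', '请写一首关于冬天的七言绝句']
```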
class data_juicer.ops.mapper.OptimizeQAMapper(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: Mapper

Mapper to optimize question-answer pairs.

This operator refines and enhances the quality of question-answer pairs. It uses a Hugging Face model or an API to generate more detailed and accurate questions and answers. The input is formatted using a template, and the output is parsed using a regular expression. The system prompt, input template, and output pattern can be customized. If vLLM is enabled, the operator accelerates inference on CUDA devices.

DEFAULT_SYSTEM_PROMPT = '请优化输入的问答对,使【问题】和【回答】都更加详细、准确。必须按照以下标记格式,直接输出优化后的问答对:\n【问题】\n优化后的问题\n【回答】\n优化后的回答'#
DEFAULT_INPUT_TEMPLATE = '以下是原始问答对:\n{}'#
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'#
DEFAULT_OUTPUT_PATTERN = '.*?【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'#
__init__(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Initialization method.

Parameters:
  • api_or_hf_model – API or Hugging Face model name.

  • is_hf_model – If True, use a Hugging Face model. Otherwise, use an API.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt – System prompt for guiding the optimization task.

  • input_template – Template for building the model input. Make sure the template contains one placeholder '{}', which corresponds to the question-answer pair formatted by qa_pair_template.

  • qa_pair_template – Template for formatting the question-answer pair. Make sure the template contains two '{}' placeholders, for the question and the answer.

  • output_pattern – Regular expression pattern to extract the question and answer from the model response.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • enable_vllm – Whether to use vLLM for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.

build_input(sample)[source]#
parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
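The default templates and output pattern fit together as follows; this standalone sketch (with an invented model response) shows how a parsed question-answer pair is recovered:

```python
import re

DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'
DEFAULT_OUTPUT_PATTERN = r'.*?【问题】\s*(.*?)\s*【回答】\s*(.*)'

raw_output = DEFAULT_QA_PAIR_TEMPLATE.format(
    '什么是数据增强?',
    '数据增强是通过对已有样本做变换来扩充训练数据的方法。')
match = re.match(DEFAULT_OUTPUT_PATTERN, raw_output, re.DOTALL)
question, answer = match.group(1), match.group(2)
# question -> '什么是数据增强?'
```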

class data_juicer.ops.mapper.OptimizeQueryMapper(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: OptimizeQAMapper

Optimize queries in question-answer pairs to make them more specific and detailed.

This mapper refines the questions in a QA pair, making them more specific and detailed while ensuring that the original answer can still address the optimized question. It uses a predefined system prompt for the optimization process. The optimized query is extracted from the raw output by stripping any leading or trailing whitespace. The mapper utilizes a CUDA accelerator for faster processing.

DEFAULT_SYSTEM_PROMPT = '优化问答对中的【问题】,将其更加详细具体,但仍可以由原答案回答。只输出优化后的【问题】,不要输出多余内容。'#
parse_output(raw_output)[source]#
class data_juicer.ops.mapper.OptimizeResponseMapper(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: OptimizeQAMapper

Optimize response in question-answer pairs to be more detailed and specific.

This operator enhances the responses in question-answer pairs, making them more detailed and specific while ensuring they still address the original question. It uses a predefined system prompt for optimization. The optimized response is stripped of any leading or trailing whitespace before being returned. This mapper leverages a Hugging Face model for the optimization process, which is accelerated using CUDA.

DEFAULT_SYSTEM_PROMPT = '请优化问答对中的回答,将其更加详细具体,但仍可以回答原问题。只输出优化后的回答,不要输出多余内容。'#
parse_output(raw_output)[source]#
class data_juicer.ops.mapper.PairPreferenceMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, rejected_key: str = 'rejected_response', reason_key: str = 'reason', try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Mapper to construct paired preference samples by generating a rejected response and its reason.

This operator uses an API model to generate a new response that is opposite in style, factuality, or stance to the original response. A default system prompt and input template are provided but can be customized. The output is parsed with a regular expression to extract the new response and the reason for it; if parsing fails, the operator retries up to a specified number of times. The generated response and reason are stored in the sample under the keys 'rejected_response' and 'reason' by default.

DEFAULT_SYSTEM_PROMPT = '你的任务是根据参考信息修改问答对中的回答,在语言风格、事实性、人物身份、立场等任一方面与原回答相反。必须按照以下标记格式输出,不要输出其他多余内容。\n【回答】\n生成的新回答\n【原因】\n生成该回答的原因'#
DEFAULT_INPUT_TEMPLATE = '【参考信息】\n{reference}\n\n以下是原始问答对:\n【问题】\n{query}\n【回答】\n{response}'#
DEFAULT_OUTPUT_PATTERN = '.*?【回答】\\s*(.*?)\\s*【原因】\\s*(.*)'#
__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, rejected_key: str = 'rejected_response', reason_key: str = 'reason', try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method.

Parameters:
  • api_model – API model name.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the model input. It must contain the placeholders '{query}' and '{response}', and can optionally include '{reference}'.

  • output_pattern – Regular expression for parsing the model output.

  • rejected_key – The field name in the sample to store the generated rejected response. Defaults to 'rejected_response'.

  • reason_key – The field name in the sample to store the reason for generating the response. Defaults to 'reason'.

  • try_num – The number of retries for the API call in case of response parsing failure. Defaults to 3.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.

build_input(sample)[source]#
parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
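A standalone sketch of the parsing-and-storing step (the raw model output here is invented; the key names are the documented defaults):

```python
import re

DEFAULT_OUTPUT_PATTERN = r'.*?【回答】\s*(.*?)\s*【原因】\s*(.*)'

raw_output = '【回答】\n地球是平的。\n【原因】\n与原回答在事实性上相反。'
rejected, reason = re.match(DEFAULT_OUTPUT_PATTERN, raw_output, re.DOTALL).groups()

sample = {'query': '地球是什么形状?', 'response': '地球是一个近似球体。'}
sample['rejected_response'] = rejected   # default rejected_key
sample['reason'] = reason                # default reason_key
```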

class data_juicer.ops.mapper.PunctuationNormalizationMapper(*args, **kwargs)[source]#

Bases: Mapper

Normalizes unicode punctuations to their English equivalents in text samples.

This operator processes a batch of text samples and replaces any unicode punctuation with its English equivalent. The mapping includes common substitutions such as ',' to ',', '。' to '.', and '“' to '"'. It iterates over each character in the text, replacing it if it is found in the predefined punctuation map. The result is a set of text samples with consistent punctuation formatting.

__init__(*args, **kwargs)[source]#

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
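The character-by-character replacement can be sketched as follows; only a few illustrative entries of the operator's much larger mapping table are shown:

```python
# A handful of unicode -> English punctuation substitutions, for illustration.
PUNCTUATION_MAP = {
    ',': ',', '。': '.', '、': ',',
    '“': '"', '”': '"', '‘': "'", '’': "'",
    '!': '!', '?': '?', ':': ':', ';': ';',
}

def normalize_punctuation(text):
    # Replace each character found in the map; leave everything else alone.
    return ''.join(PUNCTUATION_MAP.get(char, char) for char in text)

print(normalize_punctuation('你好,世界!'))  # 你好,世界!
```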
class data_juicer.ops.mapper.PythonFileMapper(file_path: str = '', function_name: str = 'process_single', batched: bool = False, **kwargs)[source]#

Bases: Mapper

Executes a Python function defined in a file on input data.

This operator loads a specified Python function from a given file and applies it to the input data. The function must take exactly one argument and return a dictionary. The operator can process data either sample by sample or in batches, depending on the batched parameter. If no file path is provided, the operator acts as an identity function, returning the input sample unchanged. The function is loaded dynamically, and its name and file path are configurable. Important notes:

  • The file must be a valid Python file (.py).

  • The function must be callable and accept exactly one argument.

  • The function's return value must be a dictionary.

__init__(file_path: str = '', function_name: str = 'process_single', batched: bool = False, **kwargs)[source]#

Initialization method.

Parameters:
  • file_path – The path to the Python file containing the function to be executed.

  • function_name – The name of the function defined in the file to be executed.

  • batched – A boolean indicating whether to process input data in batches.

  • kwargs – Additional keyword arguments passed to the parent class.

process_single(sample)[source]#

Invoke the loaded function with the provided sample.

process_batched(samples)[source]#

Invoke the loaded function with the provided samples.
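A minimal user-defined file that satisfies the constraints above (the file and function names are illustrative): the function takes exactly one argument and returns a dictionary.

```python
# Contents of a hypothetical my_clean_op.py, loadable by PythonFileMapper.
def process_single(sample):
    """Takes exactly one argument (a sample dict) and returns a dict."""
    sample['text'] = sample['text'].strip().lower()
    return sample

print(process_single({'text': '  Hello World  '}))  # {'text': 'hello world'}
```

It would then be wired up as, e.g., PythonFileMapper(file_path='my_clean_op.py', function_name='process_single').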

class data_juicer.ops.mapper.PythonLambdaMapper(lambda_str: str = '', batched: bool = False, **kwargs)[source]#

Bases: Mapper

Mapper for applying a Python lambda function to data samples.

This operator allows users to define a custom transformation using a Python lambda function. The lambda function is applied to each sample, and the result must be a dictionary. If the batched parameter is set to True, the lambda function will process a batch of samples at once. If no lambda function is provided, the identity function is used, which returns the input sample unchanged. The operator validates the lambda function to ensure it has exactly one argument and compiles it safely.

__init__(lambda_str: str = '', batched: bool = False, **kwargs)[source]#

Initialization method.

Parameters:
  • lambda_str – A string representation of the lambda function to be executed on data samples. If empty, the identity function is used.

  • batched – A boolean indicating whether to process input data in batches.

  • kwargs – Additional keyword arguments passed to the parent class.

process_single(sample)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

process_batched(samples)[source]#
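The lambda-string contract can be illustrated without the operator itself; the operator compiles and validates a string like the one below (plain eval here only stands in for that validated step):

```python
# A lambda string with exactly one argument that returns a dict.
lambda_str = "lambda sample: {**sample, 'text': sample['text'].upper()}"

fn = eval(lambda_str)  # PythonLambdaMapper performs a validated equivalent
print(fn({'text': 'hello'}))  # {'text': 'HELLO'}
```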
class data_juicer.ops.mapper.QuerySentimentDetectionMapper(hf_model: str = 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_sentiment_label', score_key: str = 'query_sentiment_label_score', **kwargs)[source]#

Bases: Mapper

Predicts the user's sentiment label ('negative', 'neutral', 'positive') for a query.

This mapper takes input from the specified query key and outputs the predicted sentiment label and its corresponding score. The results are stored in the Data-Juicer meta field under 'query_sentiment_label' and 'query_sentiment_label_score'. It uses a Hugging Face model for sentiment detection. If a Chinese-to-English translation model is provided, the query is first translated from Chinese to English before sentiment analysis.

__init__(hf_model: str = 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_sentiment_label', score_key: str = 'query_sentiment_label_score', **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID to predict the sentiment label.

  • zh_to_en_hf_model – Translation model from Chinese to English. If not None, the query is first translated from Chinese to English.

  • model_params – Model parameters for hf_model.

  • zh_to_en_model_params – Model parameters for zh_to_en_hf_model.

  • label_key – The key name in the meta field to store the output label. It is 'query_sentiment_label' by default.

  • score_key – The key name in the meta field to store the corresponding label score. It is 'query_sentiment_label_score' by default.

  • kwargs – Extra keyword arguments.

process_batched(samples, rank=None)[source]#
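The translate-then-classify flow can be modelled with plain Python and stub functions (no real models are involved, and the meta-field layout is simplified for the sketch):

```python
def detect_query_sentiment(sample, classify, translate=None,
                           label_key='query_sentiment_label',
                           score_key='query_sentiment_label_score'):
    query = sample['query']
    if translate is not None:             # optional zh -> en step
        query = translate(query)
    label, score = classify(query)
    meta = sample.setdefault('meta', {})  # simplified stand-in for DJ's meta field
    meta[label_key] = label
    meta[score_key] = score
    return sample

# Stub classifier standing in for the Hugging Face pipeline:
out = detect_query_sentiment({'query': 'Great earnings report!'},
                             classify=lambda q: ('positive', 0.97))
print(out['meta']['query_sentiment_label'])  # positive
```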
class data_juicer.ops.mapper.QueryIntentDetectionMapper(hf_model: str = 'bespin-global/klue-roberta-small-3i4k-intent-classification', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_intent_label', score_key: str = 'query_intent_label_score', **kwargs)[source]#

Bases: Mapper

Predicts the user's intent label and corresponding score for a given query. The operator uses a Hugging Face model to classify the intent of the input query. If the query is in Chinese, it can optionally be translated to English using another Hugging Face translation model before classification. The predicted intent label and its confidence score are stored in the meta field with the keys 'query_intent_label' and 'query_intent_label_score', respectively. If these keys already exist in the meta field, the operator skips processing for those samples.

__init__(hf_model: str = 'bespin-global/klue-roberta-small-3i4k-intent-classification', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_intent_label', score_key: str = 'query_intent_label_score', **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID to predict the intent label.

  • zh_to_en_hf_model – Translation model from Chinese to English. If not None, the query is first translated from Chinese to English.

  • model_params – Model parameters for hf_model.

  • zh_to_en_model_params – Model parameters for zh_to_en_hf_model.

  • label_key – The key name in the meta field to store the output label. It is 'query_intent_label' by default.

  • score_key – The key name in the meta field to store the corresponding label score. It is 'query_intent_label_score' by default.

  • kwargs – Extra keyword arguments.

process_batched(samples, rank=None)[source]#
class data_juicer.ops.mapper.QueryTopicDetectionMapper(hf_model: str = 'dstefa/roberta-base_topic_classification_nyt_news', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_topic_label', score_key: str = 'query_topic_label_score', **kwargs)[source]#

Bases: Mapper

Predicts the topic label and its corresponding score for a given query. The input is taken from the specified query key. The output, which includes the predicted topic label and its score, is stored in the 'query_topic_label' and 'query_topic_label_score' fields of the Data-Juicer meta field. This operator uses a Hugging Face model for topic classification. If a Chinese-to-English translation model is provided, the query is first translated from Chinese to English before topic prediction.

  • Uses a Hugging Face model for topic classification.

  • Optionally translates Chinese queries to English using another Hugging Face model.

  • Stores the predicted topic label in 'query_topic_label'.

  • Stores the corresponding score in 'query_topic_label_score'.

__init__(hf_model: str = 'dstefa/roberta-base_topic_classification_nyt_news', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_topic_label', score_key: str = 'query_topic_label_score', **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID to predict the topic label.

  • zh_to_en_hf_model – Translation model from Chinese to English. If not None, the query is first translated from Chinese to English.

  • model_params – Model parameters for hf_model.

  • zh_to_en_model_params – Model parameters for zh_to_en_hf_model.

  • label_key – The key name in the meta field to store the output label. It is 'query_topic_label' by default.

  • score_key – The key name in the meta field to store the corresponding label score. It is 'query_topic_label_score' by default.

  • kwargs – Extra keyword arguments.

process_batched(samples, rank=None)[source]#
class data_juicer.ops.mapper.RelationIdentityMapper(api_model: str = 'gpt-4o', source_entity: str = None, target_entity: str = None, *, output_key: str = 'role_relation', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Bases: Mapper

Identify the relation between two entities in a given text.

This operator uses an API model to analyze the relationship between two specified entities in the text. It constructs a prompt with the provided system and input templates, then sends it to the API model for analysis. The output is parsed using a regular expression to extract the relationship. If the two entities are the same person, the relationship is identified as "another identity" (另一个身份). The result is stored in the meta field under the key 'role_relation' by default. The operator retries the API call up to a specified number of times in case of errors. If drop_text is set to True, the original text is removed from the sample after processing.

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定关于{entity1}和{entity2}的文本信息。判断{entity1}和{entity2}之间的关系。\n要求:\n- 关系用一个或多个词语表示,必要时可以加一个形容词来描述这段关系\n- 输出关系时不要参杂任何标点符号\n- 需要你进行合理的推理才能得出结论\n- 如果两个人物身份是同一个人,输出关系为:另一个身份\n- 输出格式为:\n分析推理:...\n所以{entity2}是{entity1}的:...\n- 注意输出的是{entity2}是{entity1}的什么关系,而不是{entity1}是{entity2}的什么关系'#
DEFAULT_INPUT_TEMPLATE = '关于{entity1}和{entity2}的文本信息:\n```\n{text}\n```\n'#
DEFAULT_OUTPUT_PATTERN_TEMPLATE = '\n        \\s*分析推理:\\s*(.*?)\\s*\n        \\s*所以{entity2}是{entity1}的:\\s*(.*?)\\Z\n    '#
__init__(api_model: str = 'gpt-4o', source_entity: str = None, target_entity: str = None, *, output_key: str = 'role_relation', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]#

Initialization method. :param api_model: API model name. :param source_entity: The source entity of the relation to be

identified.

Parameters:
  • target_entity โ€“ The target entity of the relation to be identified.

  • output_key – The output key in the meta field of the samples. It is 'role_relation' by default.

  • api_endpoint โ€“ URL endpoint for the API.

  • response_path โ€“ Path to extract content from the API response. Defaults to โ€˜choices.0.message.contentโ€™.

  • system_prompt_template โ€“ System prompt template for the task.

  • input_template โ€“ Template for building the model input.

  • output_pattern_template โ€“ Regular expression template for parsing model output.

  • try_num โ€“ The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params โ€“ Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g., {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs โ€“ Extra keyword arguments.

parse_output(raw_output)[source]#
process_single(sample, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.RemoveBibliographyMapper(*args, **kwargs)[source]#

Bases: Mapper

Removes bibliography sections at the end of LaTeX documents.

This operator identifies and removes bibliography sections in LaTeX documents. It uses a regular expression to match common bibliography commands such as appendix, begin{references}, begin{thebibliography}, and bibliography. The matched sections are removed from the text. The operator processes samples in batch mode for efficiency.
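The removal can be sketched with a single regular expression that cuts everything from the first bibliography-style command to the end of the document. The pattern below is an illustrative approximation of the documented commands, not the operator's exact source:

```python
import re

# Illustrative sketch (not data-juicer's exact pattern): drop everything
# from the first bibliography-style command to the end of the document.
BIB_PATTERN = re.compile(
    r"(\\appendix|\\begin\{references\}|\\begin\{thebibliography\}"
    r"|\\bibliography\b).*$",
    flags=re.DOTALL,
)

def remove_bibliography(text: str) -> str:
    return BIB_PATTERN.sub("", text)

doc = "Body text.\n\\begin{thebibliography}{9}\n\\bibitem{a} A.\n\\end{thebibliography}\n"
print(remove_bibliography(doc))  # -> "Body text.\n"
```

The re.DOTALL flag lets `.*$` consume the rest of the document across newlines, which is why a single substitution suffices.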

__init__(*args, **kwargs)[source]#

Initialization method.

Parameters:
  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveCommentsMapper(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Removes comments from documents, currently supporting only โ€˜texโ€™ format.

This operator removes inline and multiline comments from text samples. It supports both inline and multiline comment removal, controlled by the inline and multiline parameters. Currently, it is designed to work with โ€˜texโ€™ documents. The operator processes each sample in the batch and applies regular expressions to remove comments. The processed text is then updated in the original samples.

  • Inline comments are removed using the pattern [^\]%.+$.

  • Multiline comments are removed using the pattern ^%.*\n?.

Important notes:

  • Only 'tex' document type is supported at present.

  • The operator processes the text in place and updates the original samples.
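A minimal sketch of the two-pass removal, assuming the patterns above (they mirror the documented ones but are not copied from the data-juicer source). Note that the inline pattern must match one non-backslash character before `%` so that escaped `\%` survives, and that character has to be re-inserted after the substitution:

```python
import re

# Assumed patterns mirroring the documented behavior for 'tex' documents.
INLINE = re.compile(r"[^\\]%.+$", flags=re.MULTILINE)   # % comment after content
MULTILINE = re.compile(r"^%.*\n?", flags=re.MULTILINE)  # whole-line % comments

def remove_tex_comments(text: str, inline: bool = True, multiline: bool = True) -> str:
    if multiline:
        text = MULTILINE.sub("", text)
    if inline:
        # Keep the non-% character that the inline pattern consumed.
        text = INLINE.sub(lambda m: m.group(0)[0], text)
    return text

print(remove_tex_comments("x % c\n% line\ny\n"))  # -> "x \ny\n"
```

Escaped percent signs such as `50\%` are untouched, because `[^\\]%` requires a non-backslash character immediately before the `%`.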

__init__(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • doc_type – Type of document to remove comments from.

  • inline โ€“ Whether to remove inline comments.

  • multiline โ€“ Whether to remove multiline comments.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveHeaderMapper(drop_no_head: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Removes headers at the beginning of documents in LaTeX samples.

This operator identifies and removes headers such as chapter, part, section, subsection, subsubsection, paragraph, and subparagraph. It uses a regular expression to match these headers. If a sample does not contain any headers and drop_no_head is set to True, the sample text will be removed. Otherwise, the sample remains unchanged. The operator processes samples in batches for efficiency.

__init__(drop_no_head: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • drop_no_head โ€“ whether to drop sample texts without headers.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveLongWordsMapper(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]#

Bases: Mapper

Mapper to remove long words within a specific range.

This operator filters out words in the text that are either shorter than the specified minimum length or longer than the specified maximum length. Words are first checked with their original length, and if they do not meet the criteria, they are stripped of special characters and re-evaluated. The key metric used is the character-based length of each word. The processed text retains only the words that fall within the defined length range. This operator processes text in batches for efficiency.
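The check-then-strip-then-recheck logic can be sketched as follows; function names and the choice of string.punctuation for "special characters" are illustrative, not the operator's exact implementation:

```python
import string

def should_keep(word: str, min_len: int = 1, max_len: int = 15) -> bool:
    # First test the raw word; if it fails, strip edge punctuation and retest.
    if min_len <= len(word) <= max_len:
        return True
    stripped = word.strip(string.punctuation)
    return min_len <= len(stripped) <= max_len

def filter_long_words(text: str, min_len: int = 1, max_len: int = 15) -> str:
    return " ".join(w for w in text.split() if should_keep(w, min_len, max_len))

print(filter_long_words("a reasonably pneumonoultramicroscopic!! word", max_len=12))
# -> "a reasonably word"
```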

__init__(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • min_len โ€“ The min mapper word length in this op, words will be filtered if their length is below this parameter.

  • max_len โ€“ The max mapper word length in this op, words will be filtered if their length exceeds this parameter.

  • args โ€“ extra args

  • kwargs โ€“ extra args

should_keep_long_word(word)[source]#
process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveNonChineseCharacterlMapper(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Removes non-Chinese characters from text samples.

This mapper removes all characters that are not part of the Chinese character set.

  • It can optionally keep alphabets, numbers, and punctuation based on the configuration.

  • The removal is done using a regular expression pattern.

  • The pattern is constructed to exclude or include alphabets, numbers, and punctuation as specified.

  • The key metric for this operation is the presence of non-Chinese characters, which are removed.

  • The operator processes samples in a batched manner.
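The pattern construction can be sketched as a negated character class built from the keep flags. The Unicode range and punctuation set below are common approximations (CJK unified ideographs), not necessarily the operator's exact choices:

```python
import re

# Assumed ranges: \u4e00-\u9fa5 covers the common CJK unified ideographs.
def build_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    keep = "\u4e00-\u9fa5"  # Chinese characters
    if keep_alphabet:
        keep += "A-Za-z"
    if keep_number:
        keep += "0-9"
    if keep_punc:
        keep += re.escape(".,!?;:，。！？；：")  # illustrative punctuation set
    return re.compile(f"[^{keep}]")

pat = build_pattern(keep_alphabet=False, keep_number=False, keep_punc=False)
print(pat.sub("", "Hello, 世界 123!"))  # -> "世界"
```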

__init__(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • keep_alphabet โ€“ whether to keep alphabet

  • keep_number โ€“ whether to keep number

  • keep_punc โ€“ whether to keep punctuation

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]#

Bases: Mapper

Mapper to remove repeat sentences in text samples.

This operator processes text samples to remove duplicate sentences. It splits the text into lines and then further splits each line into sentences. Sentences are considered duplicates if they are identical after optional case normalization and special character removal. The operator uses a hash set to track unique sentences. Sentences shorter than min_repeat_sentence_length are not deduplicated. If ignore_special_character is enabled, special characters (all except Chinese, letters, and numbers) are ignored when checking for duplicates. The resulting text is reassembled with unique sentences.
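A minimal sketch of the deduplication, using a set of normalized sentence keys and keeping the first occurrence. The sentence-splitting regex and normalization are illustrative assumptions:

```python
import re

def remove_repeat_sentences(text, lowercase=False, ignore_special=True, min_len=2):
    seen = set()
    kept = []
    # Split after common sentence terminators (illustrative splitter).
    for sent in re.split(r"(?<=[.!?。！？])", text):
        if not sent:
            continue
        key = sent.lower() if lowercase else sent
        if ignore_special:
            # Keep only Chinese characters, letters, and numbers for the key.
            key = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", key)
        if len(key) < min_len or key not in seen:
            seen.add(key)
            kept.append(sent)
    return "".join(kept)

print(remove_repeat_sentences("Hi there. Hi there. Bye."))  # -> "Hi there. Bye."
```

Sentences whose normalized key is shorter than min_len are always kept, matching the documented behavior.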

__init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • lowercase โ€“ Whether to convert sample text to lower case

  • ignore_special_character โ€“ Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters and numbers.

  • min_repeat_sentence_length โ€“ Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, then special characters are not included in this length.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveSpecificCharsMapper(chars_to_remove: str | List[str] = 'โ—†โ—โ– โ–บโ–ผโ–ฒโ–ดโˆ†โ–ปโ–ทโ–โ™กโ–ก', *args, **kwargs)[source]#

Bases: Mapper

Removes specific characters from text samples.

This operator removes specified characters from the text. The characters to be removed can be provided as a string or a list of strings. If no characters are specified, the default set includes special and non-alphanumeric characters. The operator processes the text using a regular expression pattern that matches any of the specified characters and replaces them with an empty string. This is done in a batched manner for efficiency.
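The core of the operator is a character-class substitution; this sketch uses a short illustrative default rather than the operator's full default set:

```python
import re

def remove_chars(text, chars_to_remove="◆●■►▼▲"):
    # Accept either a string or a list of characters, as documented.
    if isinstance(chars_to_remove, list):
        chars_to_remove = "".join(chars_to_remove)
    pattern = "[" + re.escape(chars_to_remove) + "]"
    return re.sub(pattern, "", text)

print(remove_chars("◆ bullet ● text ■"))  # -> " bullet  text "
```

re.escape guards against characters that are special inside a character class (such as `]` or `^`).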

__init__(chars_to_remove: str | List[str] = 'โ—†โ—โ– โ–บโ–ผโ–ฒโ–ดโˆ†โ–ปโ–ทโ–โ™กโ–ก', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • chars_to_remove โ€“ a list or a string including all characters that need to be removed from text.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveTableTextMapper(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]#

Bases: Mapper

Mapper to remove table texts from text samples.

This operator uses regular expressions to identify and remove tables from the text. It targets tables with a specified range of columns, defined by the minimum and maximum number of columns. The operator iterates over each sample, applying the regex pattern to remove tables that match the column criteria. The processed text, with tables removed, is then stored back in the sample. This operation is batched for efficiency.

__init__(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • min_col โ€“ The min number of columns of table to remove.

  • max_col โ€“ The max number of columns of table to remove.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.RemoveWordsWithIncorrectSubstringsMapper(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]#

Bases: Mapper

Mapper to remove words containing specified incorrect substrings.

This operator processes text by removing words that contain any of the specified incorrect substrings. By default, it removes words with substrings like โ€œhttpโ€, โ€œwwwโ€, โ€œ.comโ€, โ€œhrefโ€, and โ€œ//โ€. The operator can operate in tokenized or non-tokenized mode. In tokenized mode, it uses a Hugging Face tokenizer to tokenize the text before processing. The key metric is not computed; this operator focuses on filtering out specific words.

  • If tokenization is True, the text is tokenized using a Hugging Face tokenizer, and words are filtered based on the specified substrings.

  • If tokenization is False, the text is split into sentences and words, and words are filtered based on the specified substrings.

  • The filtered text is then merged back into a single string.

The operator processes samples in batches and updates the text in place.
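The non-tokenized path reduces to a containment test per word; the default substrings below are the documented ones, while the function names are illustrative:

```python
DEFAULT_SUBSTRINGS = ["http", "www", ".com", "href", "//"]  # documented defaults

def should_keep_word(word, substrings=DEFAULT_SUBSTRINGS):
    return not any(s in word for s in substrings)

def filter_words(text, substrings=DEFAULT_SUBSTRINGS):
    # Drop any whitespace-delimited word containing a bad substring.
    return " ".join(w for w in text.split() if should_keep_word(w, substrings))

print(filter_words("see https://example.com for details"))  # -> "see for details"
```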

__init__(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • lang – the language of the samples.

  • tokenization – whether to use a model to tokenize documents.

  • substrings โ€“ The incorrect substrings in words.

  • args โ€“ extra args

  • kwargs โ€“ extra args

should_keep_word_with_incorrect_substrings(word, substrings)[source]#
process_batched(samples)[source]#
class data_juicer.ops.mapper.ReplaceContentMapper(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]#

Bases: Mapper

Replaces content in the text that matches a specific regular expression pattern with a designated replacement string.

This operator processes text by searching for patterns defined in pattern and replacing them with the corresponding repl string. If multiple patterns and replacements are provided, each pattern is replaced by its respective replacement. The operator supports both single and multiple patterns and replacements. The regular expressions are compiled with the re.DOTALL flag to match across multiple lines. If the numbers of patterns and replacements do not match, a ValueError is raised. This operation is batched, meaning it processes multiple samples at once.
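The pattern/replacement pairing can be sketched as follows; the broadcasting of a single replacement across multiple patterns is an assumption for illustration, while the re.DOTALL compilation and the ValueError on mismatched lengths follow the description above:

```python
import re

def replace_content(text, pattern, repl=""):
    patterns = [pattern] if isinstance(pattern, str) else pattern
    repls = [repl] if isinstance(repl, str) else repl
    if len(repls) == 1:
        repls = repls * len(patterns)  # broadcast one repl over all patterns
    if len(patterns) != len(repls):
        raise ValueError("pattern and repl lengths do not match")
    for p, r in zip(patterns, repls):
        text = re.compile(p, flags=re.DOTALL).sub(r, text)
    return text

print(replace_content("id=123 name=foo", [r"id=\d+", r"name=\w+"], "[MASKED]"))
# -> "[MASKED] [MASKED]"
```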

__init__(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • pattern โ€“ regular expression pattern(s) to search for within text

  • repl โ€“ replacement string(s), default is empty string

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.SDXLPrompt2PromptMapper(hf_diffusion: str = 'stabilityai/stable-diffusion-xl-base-1.0', trust_remote_code=False, torch_dtype: str = 'fp32', num_inference_steps: float = 50, guidance_scale: float = 7.5, text_key=None, text_key_second=None, output_dir='/home/runner/.cache/data_juicer/assets', *args, **kwargs)[source]#

Bases: Mapper

Generates pairs of similar images using the SDXL model.

This operator uses a Hugging Face diffusion model to generate image pairs based on two text prompts. The quality and similarity of the generated images are controlled by parameters such as num_inference_steps and guidance_scale. The first and second text prompts are specified using text_key and text_key_second, respectively. The generated images are saved in the specified output_dir with unique filenames. The operator requires both text keys to be set for processing.

__init__(hf_diffusion: str = 'stabilityai/stable-diffusion-xl-base-1.0', trust_remote_code=False, torch_dtype: str = 'fp32', num_inference_steps: float = 50, guidance_scale: float = 7.5, text_key=None, text_key_second=None, output_dir='/home/runner/.cache/data_juicer/assets', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_diffusion โ€“ diffusion model name on huggingface to generate the image.

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • torch_dtype โ€“ the floating point type used to load the diffusion model.

  • num_inference_steps – The larger the value, the better the image generation quality; however, this also increases the time required for generation.

  • guidance_scale – A higher guidance scale value encourages the model to generate images closely linked to the text prompt, at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • text_key โ€“ the key name used to store the first caption in the caption pair.

  • text_key_second โ€“ the key name used to store the second caption in the caption pair.

  • output_dir โ€“ the storage location of the generated images.

process_single(sample, rank=None, context=False)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.SentenceAugmentationMapper(hf_model: str = 'Qwen/Qwen2-7B-Instruct', system_prompt: str = None, task_sentence: str = None, max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, text_key=None, text_key_second=None, *args, **kwargs)[source]#

Bases: Mapper

Augments sentences by generating enhanced versions using a Hugging Face model. This operator enhances input sentences by generating new, augmented versions. It is designed to work best with individual sentences rather than full documents. For optimal results, ensure the input text is at the sentence level. The augmentation process uses a Hugging Face model, such as lmsys/vicuna-13b-v1.5 or Qwen/Qwen2-7B-Instruct. The operator requires specifying both the primary and secondary text keys, where the augmented sentence will be stored in the secondary key. The generation process can be customized with parameters like temperature, top-p sampling, and beam search size.

__init__(hf_model: str = 'Qwen/Qwen2-7B-Instruct', system_prompt: str = None, task_sentence: str = None, max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, text_key=None, text_key_second=None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model – Huggingface model id.

  • system_prompt – System prompt.

  • task_sentence – The instruction for the current task.

  • max_new_tokens – the maximum number of new tokens generated by the model.
  • temperature โ€“ used to control the randomness of generated text. The higher the temperature, the more random and creative the generated text will be.

  • top_p โ€“ randomly select the next word from the group of words whose cumulative probability reaches p.

  • num_beams โ€“ the larger the beam search size, the higher the quality of the generated text.

  • text_key – the key name used to store the first sentence in the text pair. (optional, default='text')

  • text_key_second โ€“ the key name used to store the second sentence in the text pair.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample=None, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)[source]#

Bases: Mapper

Splits text samples into individual sentences based on the specified language.

This operator uses an NLTK-based tokenizer to split the input text into sentences. The language for the tokenizer is specified during initialization. The original text in each sample is replaced with a list of sentences. This operator processes samples in batches for efficiency. Ensure that the lang parameter is set to the appropriate language code (e.g., โ€œenโ€ for English) to achieve accurate sentence splitting.
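The real operator relies on NLTK's language-aware punkt tokenizer; the sketch below approximates the behavior with a plain regex so it runs without any model download, and it handles abbreviations far less well:

```python
import re

def split_sentences(text: str):
    # Rough approximation: split after ., !, or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("First sentence. Second one! Third?"))
# -> ['First sentence.', 'Second one!', 'Third?']
```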

__init__(lang: str = 'en', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • lang – the language in which to split sentences.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples)[source]#
class data_juicer.ops.mapper.TextChunkMapper(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]#

Bases: Mapper

Split input text into chunks based on specified criteria.

  • Splits the input text into multiple chunks using a specified maximum length and a split pattern.

  • If max_len is provided, the text is split into chunks with a maximum length of max_len.

  • If split_pattern is provided, the text is split at occurrences of the pattern. If the length exceeds max_len, it will force a cut.

  • The overlap_len parameter specifies the overlap length between consecutive chunks if the split does not occur at the pattern.

  • Uses a Hugging Face tokenizer to calculate the text length in tokens if a tokenizer name is provided; otherwise, it uses the string length.

  • Caches the following stats: โ€˜chunk_countโ€™ (number of chunks generated for each sample).

  • Raises a ValueError if both max_len and split_pattern are None or if overlap_len is greater than or equal to max_len.
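The chunking rules above can be sketched as follows; lengths here are plain character counts (no tokenizer), and the sliding-window force-cut with overlap is a simplified illustration of the documented behavior:

```python
def chunk_text(text, max_len=None, split_pattern="\n\n", overlap_len=0):
    if max_len is None and split_pattern is None:
        raise ValueError("max_len and split_pattern cannot both be None")
    if max_len is not None and overlap_len >= max_len:
        raise ValueError("overlap_len must be smaller than max_len")
    pieces = text.split(split_pattern) if split_pattern else [text]
    chunks = []
    for piece in pieces:
        if max_len is None or len(piece) <= max_len:
            chunks.append(piece)
            continue
        start = 0
        while start < len(piece):  # force-cut pieces longer than max_len
            chunks.append(piece[start:start + max_len])
            start += max_len - overlap_len
    return chunks

print(chunk_text("abcdefgh", max_len=4, split_pattern=None, overlap_len=1))
# -> ['abcd', 'defg', 'gh']
```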

__init__(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • max_len โ€“ Split text into multi texts with this max len if not None.

  • split_pattern – Split at this pattern if it is not None; force a cut if the length exceeds max_len.

  • overlap_len – Overlap length between consecutive chunks when the split does not occur at the pattern.

  • tokenizer – The name of the tokenizer used to measure text length. If provided, the text length is calculated as the token count; otherwise, it equals the string length. Supports tiktoken tokenizers (such as gpt-4o), dashscope tokenizers (such as qwen2.5-72b-instruct), and Hugging Face tokenizers.

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • args โ€“ extra args

  • kwargs โ€“ extra args

recursively_chunk(text)[source]#
get_text_chunks(text, rank=None)[source]#
process_batched(samples, rank=None)[source]#
class data_juicer.ops.mapper.VggtMapper(vggt_model_path: str = 'facebook/VGGT-1B', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'vggt_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_output_camera_parameters: bool = True, if_output_depth_maps: bool = True, if_output_point_maps_from_projection: bool = True, if_output_point_maps_from_unprojection: bool = True, if_output_point_tracks: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks.

  • The operator processes a video and extracts frames based on the specified frame number and duration.

  • It uses the VGGT model to analyze the extracted frames and generate various outputs such as camera parameters, depth maps, point maps, and 3D point tracks.

  • If 3D point tracks are required, the user must provide query points in the format [x, y], relative to the top-left corner.

  • The results are stored in the sampleโ€™s metadata under the specified tag field name, which defaults to โ€˜vggt_tagsโ€™.

  • The operator can output camera parameters, depth maps, point maps from projection, point maps from unprojection, and 3D point tracks, depending on the configuration.

  • The VGGT model is loaded from the provided path, and the operator runs in CUDA mode if available.

__init__(vggt_model_path: str = 'facebook/VGGT-1B', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'vggt_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_output_camera_parameters: bool = True, if_output_depth_maps: bool = True, if_output_point_maps_from_projection: bool = True, if_output_point_maps_from_unprojection: bool = True, if_output_point_tracks: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • vggt_model_path โ€“ The path to the VGGT model.

  • frame_num โ€“ The number of frames to be extracted uniformly from the video. If itโ€™s 1, only the middle frame will be extracted. If itโ€™s 2, only the first and the last frames will be extracted. If itโ€™s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If โ€œdurationโ€ > 0, frame_num is the number of frames per segment.

  • duration โ€“ The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.

  • tag_field_name – The field name to store the tags. It is "vggt_tags" by default.

  • frame_dir โ€“ Output directory to save extracted frames.

  • if_output_camera_parameters โ€“ Determines whether to output camera parameters.

  • if_output_depth_maps โ€“ Determines whether to output depth maps.

  • if_output_point_maps_from_projection โ€“ Determines whether to output point maps directly inferred by VGGT.

  • if_output_point_maps_from_unprojection โ€“ Determines whether to output point maps constructed from depth maps and camera parameters.

  • if_output_point_tracks โ€“ Determines whether to output point tracks. If point tracks are required, the user should provide a list where each element consists of 2D point coordinates (list shape: (N, 2)). The point coordinates should be specified in the format [x, y], relative to the top-left corner, where x/y values are non-normalized.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample=None, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoCaptioningFromAudioMapper(keep_original_sample: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Mapper to caption a video according to its audio streams based on Qwen-Audio model.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it is set to False, only the captioned samples remain in the final dataset and the original samples are removed. It is True by default.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples, rank=None)[source]#
class data_juicer.ops.mapper.VideoCaptioningFromFramesMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]#

Bases: Mapper

Generates video captions from sampled frames using an image-to-text model. Captions from different frames are concatenated into a single string.

  • Uses a Hugging Face image-to-text model to generate captions for sampled video frames.

  • Supports different frame sampling methods: โ€˜all_keyframesโ€™ or โ€˜uniformโ€™.

  • Can apply horizontal and vertical flips to the frames before captioning.

  • Offers multiple strategies for retaining generated captions: 'random_any', 'similar_one_simhash', or 'all'.

  • Optionally keeps the original sample in the final dataset.

  • Allows setting a global prompt or per-sample prompts to guide caption generation.

  • Generates a specified number of candidate captions per video, which can be reduced based on the selected retention strategy.

  • The number of output samples depends on the retention strategy and whether original samples are kept.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_img2seq โ€“ model name on huggingface to generate caption

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • caption_num โ€“ how many candidate captions to generate for each video

  • keep_candidate_mode โ€“

    retain strategy for the generated $caption_num$ candidates.

    โ€™random_anyโ€™: Retain the random one from generated captions

    โ€™similar_one_simhashโ€™: Retain the generated one that is most

    similar to the original caption

    โ€™allโ€™: Retain all generated captions by concatenation

Note

This is a batched OP whose input and output types are both lists. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. For 'random_any' and 'similar_one_simhash' modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False. For 'all' mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it is set to False, only the generated captions remain in the final dataset and the original captions are removed. It is True by default.

  • prompt – a string prompt to guide the generation of the image-to-text model for all samples globally. It is None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores per-sample prompts. It is used to set different prompts for different samples. If it is None, the prompt parameter is used instead. It is None by default.

  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num โ€“ the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is โ€œuniformโ€. If itโ€™s 1, only the middle frame will be extracted. If itโ€™s 2, only the first and the last frames will be extracted. If itโ€™s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip โ€“ flip frame video horizontally (left to right).

  • vertical_flip โ€“ flip frame video vertically (top to bottom).

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples, rank=None, context=False)[source]#
Parameters:

samples

Returns:

Note

This is a batched OP whose input and output types are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ for 'all' mode.

class data_juicer.ops.mapper.VideoCaptioningFromSummarizerMapper(hf_summarizer: str = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: Annotated[int, Gt(gt=0)] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]#

Bases: Mapper

Mapper to generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, โ€ฆ)

__init__(hf_summarizer: str = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: Annotated[int, Gt(gt=0)] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_summarizer โ€“ the summarizer model used to summarize texts generated by other methods.

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • consider_video_caption_from_video โ€“ whether to consider the video caption generated from video directly in the summarization process. Default: True.

  • consider_video_caption_from_audio โ€“ whether to consider the video caption generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_caption_from_frames โ€“ whether to consider the video caption generated from sampled frames from the video in the summarization process. Default: True.

  • consider_video_tags_from_audio โ€“ whether to consider the video tags generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_tags_from_frames โ€“ whether to consider the video tags generated from sampled frames from the video in the summarization process. Default: True.

  • vid_cap_from_vid_args – the argument dict for captioning directly from the video, whose keys are argument names and values are argument values. Default: None.

  • vid_cap_from_frm_args – the argument dict for captioning from frames sampled from the video, whose keys are argument names and values are argument values. Default: None.

  • vid_tag_from_aud_args – the argument dict for tagging from audio streams in the video, whose keys are argument names and values are argument values. Default: None.

  • vid_tag_from_frm_args – the argument dict for tagging from frames sampled from the video, whose keys are argument names and values are argument values. Default: None.

  • keep_tag_num – the maximum number N of tags from sampled frames to keep. Too many tags might negatively influence the summarized text, so only the N most frequent tags are kept. Default: 5.

  • keep_original_sample – whether to keep the original sample. If set to False, only summarized captions remain in the final dataset and the original captions are removed. It's True by default.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples, rank=None)[source]#
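The keep_tag_num behavior described above can be sketched as a frequency count. This is an illustrative sketch only, not the operator's actual implementation; the helper name keep_top_tags is hypothetical:

```python
from collections import Counter

def keep_top_tags(tags, keep_tag_num=5):
    # Keep only the N most frequent tags, as described for keep_tag_num.
    # Tie-breaking follows Counter.most_common (first-seen order on ties).
    return [tag for tag, _ in Counter(tags).most_common(keep_tag_num)]
```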
class data_juicer.ops.mapper.VideoCaptioningFromVideoMapper(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]#

Bases: Mapper

Generates video captions using a Hugging Face video-to-text model and sampled video frames.

This operator processes video samples to generate captions based on the provided video frames. It uses a Hugging Face video-to-text model, such as 'kpyu/video-blip-opt-2.7b-ego4d', to generate multiple caption candidates for each video. The number of generated captions and the strategy to keep or filter these candidates can be configured. The operator supports different frame sampling methods, including extracting all keyframes or uniformly sampling a specified number of frames. Additionally, it allows for horizontal and vertical flipping of the frames. The final output can include both the original sample and the generated captions, depending on the configuration.

__init__(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_video_blip โ€“ video-blip model name on huggingface to generate caption

  • trust_remote_code โ€“ whether to trust the remote code of HF models.

  • caption_num โ€“ how many candidate captions to generate for each video

  • keep_candidate_mode –

    retention strategy for the generated caption_num candidates:

    'random_any': retain a random one of the generated captions

    'similar_one_simhash': retain the generated caption most similar to the original caption

    'all': retain all generated captions by concatenation

Note

This is a batched OP whose input and output are both lists. Suppose there are $N$ batches of input samples with batch size $b$, and denote caption_num as $M$. For 'random_any' and 'similar_one_simhash' modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False. For 'all' mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample – whether to keep the original sample. If set to False, only generated captions remain in the final dataset and the original captions are removed. It's True by default.

  • prompt – a string prompt to guide the video-blip model's generation for all samples globally. It's None by default, meaning no prompt is provided.

  • prompt_key – the key name of the field in samples that stores per-sample prompts. It's used to set different prompts for different samples. If it's None, the "prompt" parameter is used instead. It's None by default.

  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num โ€“ the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is โ€œuniformโ€. If itโ€™s 1, only the middle frame will be extracted. If itโ€™s 2, only the first and the last frames will be extracted. If itโ€™s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip โ€“ flip frame video horizontally (left to right).

  • vertical_flip โ€“ flip frame video vertically (top to bottom).

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_batched(samples, rank=None, context=False)[source]#
Parameters:

samples

Returns:

Note

This is a batched OP whose input and output are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ for 'all' mode.
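The sample-count arithmetic in the note above can be checked with a small helper. This is a sketch for illustration; the function name and signature are not part of the library:

```python
def expected_output_count(n_batches, batch_size, caption_num,
                          keep_candidate_mode='random_any',
                          keep_original_sample=True):
    # 'random_any' / 'similar_one_simhash' keep one generated caption per
    # input sample; 'all' keeps all caption_num (M) candidates.
    kept_per_sample = caption_num if keep_candidate_mode == 'all' else 1
    total = n_batches * batch_size * kept_per_sample
    if keep_original_sample:
        # originals are kept alongside the generated samples
        total += n_batches * batch_size
    return total
```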

class data_juicer.ops.mapper.VideoExtractFramesMapper(frame_sampling_method: str = 'all_keyframes', output_format: str = 'path', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, frame_dir: str = None, frame_key: str = 'video_frames', *args, **kwargs)[source]#

Bases: Mapper

Mapper to extract frames from video files according to specified methods.

Extracts frames from video files using either all keyframes or a uniform sampling method.

Supported output formats: ["path", "bytes"].

If the format is "path", the output is a list of lists, where each inner list contains the paths of the frames of a single video, e.g. [[video1_frame1_path, video1_frame2_path, …], [video2_frame1_path, video2_frame2_path, …], …] (in the order of the videos).

If the format is "bytes", the output is a list of lists, where each inner list contains the bytes of the frames of a single video, e.g. [[video1_byte1, video1_byte2, …], [video2_byte1, video2_byte2, …], …] (in the order of the videos).

Frame sampling methods:

  • "all_keyframes": extracts all keyframes from the video.

  • "uniform": extracts a specified number of frames uniformly from the video.

Additional behavior:

  • If duration is set, the video is segmented into multiple segments based on the duration, and frames are extracted from each segment.

  • The output directory for the frames can be specified when output_format is "path"; otherwise it is left as None.

  • The field name in the sample's metadata where the frame information is stored can be customized.

__init__(frame_sampling_method: str = 'all_keyframes', output_format: str = 'path', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, frame_dir: str = None, frame_key: str = 'video_frames', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. If duration > 0, frame_sampling_method acts on every segment. Default: "all_keyframes".
  • output_format โ€“

    The output format of the extracted frames. Supported formats: ["path", "bytes"]. If the format is "path", the output is a list of lists, where each inner list contains the paths of the frames of a single video, e.g. [[video1_frame1_path, video1_frame2_path, …], [video2_frame1_path, video2_frame2_path, …], …] (in the order of the videos).

    If the format is "bytes", the output is a list of lists, where each inner list contains the bytes of the frames of a single video, e.g. [[video1_byte1, video1_byte2, …], [video2_byte1, video2_byte2, …], …] (in the order of the videos).

  • frame_num โ€“ the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is โ€œuniformโ€. If itโ€™s 1, only the middle frame will be extracted. If itโ€™s 2, only the first and the last frames will be extracted. If itโ€™s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If โ€œdurationโ€ > 0, frame_num is the number of frames per segment.

  • duration โ€“ The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.

  • frame_dir โ€“ Output directory to save extracted frames. If output_format is โ€œpathโ€, must specify a directory.

  • frame_key โ€“ The name of field to save generated frames info.

  • args โ€“ extra args

  • kwargs โ€“ extra args

extract_frames(video)[source]#
process_single(sample, context=False)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
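The "uniform" rule for frame_num described above (middle frame for 1, endpoints for 2, endpoints plus evenly spaced frames otherwise) can be sketched as index selection. This is illustrative only, not the operator's actual code:

```python
def uniform_frame_indices(total_frames, frame_num):
    # frame_num == 1 -> the middle frame; frame_num == 2 -> first and last;
    # otherwise first, last, and evenly spaced frames in between.
    if frame_num == 1:
        return [total_frames // 2]
    step = (total_frames - 1) / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]
```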

class data_juicer.ops.mapper.VideoFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Wraps FFmpeg video filters for processing video files in a dataset.

This operator applies a specified FFmpeg video filter to each video file in the dataset. It supports passing keyword arguments to the filter and global arguments to the FFmpeg command line. The processed videos are saved in a specified directory or the same directory as the input files. If no filter name is provided, the videos remain unmodified. The operator updates the source file paths in the dataset to reflect any changes.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • filter_name โ€“ ffmpeg video filter name.

  • filter_kwargs โ€“ keyword-arguments passed to ffmpeg filter.

  • global_args โ€“ list-arguments passed to ffmpeg command-line.

  • capture_stderr โ€“ whether to capture stderr.

  • overwrite_output โ€“ whether to overwrite output file.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoHandReconstructionMapper(wilor_model_path: str = 'wilor_final.ckpt', wilor_model_config: str = 'model_config.yaml', detector_model_path: str = 'detector.pt', mano_right_path: str = 'path_to_mano_right_pkl', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, batch_size: int = 16, tag_field_name: str = 'hand_reconstruction_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_save_visualization: bool = True, save_visualization_dir: str = '/home/runner/.cache/data_juicer/assets', if_save_mesh: bool = True, save_mesh_dir: str = '/home/runner/.cache/data_juicer/assets', *args, **kwargs)[source]#

Bases: Mapper

Use the WiLoR model for hand localization and reconstruction.

__init__(wilor_model_path: str = 'wilor_final.ckpt', wilor_model_config: str = 'model_config.yaml', detector_model_path: str = 'detector.pt', mano_right_path: str = 'path_to_mano_right_pkl', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, batch_size: int = 16, tag_field_name: str = 'hand_reconstruction_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_save_visualization: bool = True, save_visualization_dir: str = '/home/runner/.cache/data_juicer/assets', if_save_mesh: bool = True, save_mesh_dir: str = '/home/runner/.cache/data_juicer/assets', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • wilor_model_path โ€“ The path to โ€˜wilor_final.ckptโ€™.

  • wilor_model_config โ€“ The path to โ€˜model_config.yamlโ€™ for the WiLOR model.

  • detector_model_path โ€“ The path to โ€˜detector.ptโ€™ for the WiLOR model.

  • mano_right_path โ€“ The path to โ€˜MANO_RIGHT.pklโ€™. Users need to download this file from https://mano.is.tue.mpg.de/ and comply with the MANO license.

  • frame_num โ€“ The number of frames to be extracted uniformly from the video. If itโ€™s 1, only the middle frame will be extracted. If itโ€™s 2, only the first and the last frames will be extracted. If itโ€™s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If โ€œdurationโ€ > 0, frame_num is the number of frames per segment.

  • duration โ€“ The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.

  • batch_size โ€“ Batch size for simultaneous hand inference.

  • tag_field_name – the field name to store the tags. It's "hand_reconstruction_tags" by default.

  • frame_dir โ€“ Output directory to save extracted frames.

  • if_save_visualization โ€“ Whether to save overlay images.

  • save_visualization_dir โ€“ The path for saving overlay images.

  • if_save_mesh โ€“ Whether to save images of the hand mesh.

  • save_mesh_dir โ€“ The path for saving images of the hand mesh.

  • args โ€“ extra args

  • kwargs โ€“ extra args

project_full_img(points, cam_trans, focal_length, img_res)[source]#
process_single(sample=None, rank=None)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Mapper to blur faces detected in videos.

This operator uses an OpenCV classifier for face detection and applies a specified blur type to the detected faces. The default classifier is 'haarcascade_frontalface_alt.xml'. Supported blur types include 'mean', 'box', and 'gaussian'. The radius of the blur kernel can be adjusted. If a save directory is not provided, the processed videos will be saved in the same directory as the input files. The DJ_PRODUCED_DATA_DIR environment variable can also be used to specify the save directory.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • cv_classifier โ€“ OpenCV classifier path for face detection. By default, we will use โ€˜haarcascade_frontalface_alt.xmlโ€™.

  • blur_type โ€“ Type of blur kernel, including [โ€˜meanโ€™, โ€˜boxโ€™, โ€˜gaussianโ€™].

  • radius โ€“ Radius of blur kernel.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample, context=False)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoRemoveWatermarkMapper(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: Annotated[int, Gt(gt=0)] = 10, min_frame_threshold: Annotated[int, Gt(gt=0)] = 7, detection_method: str = 'pixel_value', save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Remove watermarks from videos based on specified regions.

This operator removes watermarks from video frames by detecting and masking the watermark areas. It supports two detection methods: 'pixel_value' and 'pixel_diversity'. The regions of interest (ROIs) for watermark detection can be specified as either pixel coordinates or ratios of the frame dimensions. The operator extracts a set number of frames uniformly from the video to detect watermark pixels. A pixel is considered part of a watermark if it meets the detection criteria in a minimum number of frames. The cleaned video is saved in the specified directory, or in the same directory as the input file if no save directory is provided.

__init__(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: Annotated[int, Gt(gt=0)] = 10, min_frame_threshold: Annotated[int, Gt(gt=0)] = 7, detection_method: str = 'pixel_value', save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • roi_strings – a list of regions where the watermarks are located. Each region can be formatted as "x1, y1, x2, y2", "(x1, y1, x2, y2)", or "[x1, y1, x2, y2]".

  • roi_type โ€“ the roi string type. When the type is โ€˜pixelโ€™, (x1, y1), (x2, y2) are the locations of pixels in the top left corner and the bottom right corner respectively. If the roi_type is โ€˜ratioโ€™, the coordinates are normalized by widths and heights.

  • roi_key – the key name of the field in samples that stores roi_strings for each sample. It's used to set different ROIs for different samples. If it's None, the ROIs in the roi_strings parameter are used. It's None by default.

  • frame_num โ€“ the number of frames to be extracted uniformly from the video to detect the pixels of watermark.

  • min_frame_threshold – a coordinate is considered the location of a watermark pixel when it is detected as such in no fewer than min_frame_threshold frames.

  • detection_method – the method used to detect watermark pixels. If it is 'pixel_value', the distribution of pixel values in each frame is considered. If it is 'pixel_diversity', the pixel diversity across different frames is considered; in this mode, min_frame_threshold is ignored and frame_num must be greater than 1.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample, context=False)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
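The three accepted ROI string formats can be parsed with a small sketch like the following (a hypothetical helper, assuming only the formats listed for roi_strings):

```python
import re

# Accepts "x1,y1,x2,y2", "(x1, y1, x2, y2)", or "[x1, y1, x2, y2]".
ROI_PATTERN = re.compile(
    r'^\s*[\(\[]?\s*'
    r'([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)'
    r'\s*[\)\]]?\s*$'
)

def parse_roi_string(roi_string):
    # Return the four coordinates as floats, or None if the string
    # does not match any of the documented formats.
    match = ROI_PATTERN.match(roi_string)
    if match is None:
        return None
    return tuple(float(g) for g in match.groups())
```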

class data_juicer.ops.mapper.VideoResizeAspectRatioMapper(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Resizes videos to fit within a specified aspect ratio range. This operator adjusts the dimensions of videos to ensure their aspect ratios fall within a defined range. It can either increase or decrease the video dimensions based on the specified strategy. The aspect ratio is calculated as width divided by height. If a video's aspect ratio is outside the given range, it will be resized to match the closest boundary (either the minimum or maximum ratio). The min_ratio and max_ratio should be provided as strings in the format "9:21" or "9/21". The resizing process uses the ffmpeg library to handle the actual video scaling. Videos that do not need resizing are left unchanged. The operator supports saving the modified videos to a specified directory or the same directory as the input files.

STRATEGY = ['decrease', 'increase']#
__init__(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • min_ratio – the minimum aspect ratio to enforce; videos with an aspect ratio below min_ratio will be resized to match this minimum ratio. The ratio should be provided as a string in the format "9:21" or "9/21".

  • max_ratio – the maximum aspect ratio to enforce; videos with an aspect ratio above max_ratio will be resized to match this maximum ratio. The ratio should be provided as a string in the format "21:9" or "21/9".

  • strategy โ€“ The resizing strategy to apply when adjusting the video dimensions. It can be either โ€˜decreaseโ€™ to reduce the dimension or โ€˜increaseโ€™ to enlarge it. Accepted values are [โ€˜decreaseโ€™, โ€˜increaseโ€™].

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
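The clamping behavior can be sketched as follows. This illustrative helper only computes the adjusted dimensions (the operator itself performs the scaling with ffmpeg) and assumes the "9/21" fraction form of the ratio strings:

```python
from fractions import Fraction

def target_dimensions(width, height, min_ratio='9/21', max_ratio='21/9',
                      strategy='increase'):
    # Aspect ratio is width / height; ratios outside [min_ratio, max_ratio]
    # are clamped to the nearest boundary by resizing one dimension.
    ratio = Fraction(width, height)
    lo, hi = Fraction(min_ratio), Fraction(max_ratio)
    if lo <= ratio <= hi:
        return width, height  # already within range: unchanged
    bound = lo if ratio < lo else hi
    if strategy == 'increase':
        if ratio < bound:
            width = int(height * bound)   # too narrow: widen
        else:
            height = int(width / bound)   # too wide: make taller
    else:  # 'decrease'
        if ratio < bound:
            height = int(width / bound)   # too narrow: shorten
        else:
            width = int(height * bound)   # too wide: narrow
    return width, height
```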

class data_juicer.ops.mapper.VideoResizeResolutionMapper(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: Annotated[int, Gt(gt=0)] = 2, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Resizes video resolution based on specified width and height constraints.

This operator resizes videos to fit within the provided minimum and maximum width and height limits. It can optionally maintain the original aspect ratio by adjusting the dimensions accordingly. The resized videos are saved in the specified directory or the same directory as the input if no save directory is provided. The key metric for resizing is the videoโ€™s width and height, which are adjusted to meet the constraints while maintaining the aspect ratio if configured. The force_divisible_by parameter ensures that the output dimensions are divisible by a specified integer, which must be a positive even number when used with aspect ratio adjustments.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: Annotated[int, Gt(gt=0)] = 2, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • min_width โ€“ Videos with width less than โ€˜min_widthโ€™ will be mapped to videos with equal or bigger width.

  • max_width – videos with width greater than 'max_width' will be mapped to videos with equal or smaller width.

  • min_height โ€“ Videos with height less than โ€˜min_heightโ€™ will be mapped to videos with equal or bigger height.

  • max_height โ€“ Videos with height more than โ€˜max_heightโ€™ will be mapped to videos with equal or smaller height.

  • force_original_aspect_ratio โ€“ Enable decreasing or increasing output video width or height if necessary to keep the original aspect ratio, including [โ€˜disableโ€™, โ€˜decreaseโ€™, โ€˜increaseโ€™].

  • force_divisible_by โ€“ Ensures that both the output dimensions, width and height, are divisible by the given integer when used together with force_original_aspect_ratio, must be a positive even number.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample, context=False)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
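A sketch of the width/height clamping combined with force_divisible_by. This is illustrative only; the rounding direction is an assumption, and the real operator delegates the scaling to ffmpeg:

```python
def clamp_dimension(value, min_val, max_val, force_divisible_by=2):
    # Clamp into [min_val, max_val], then round down to a multiple of
    # force_divisible_by (never below the divisor itself).
    clamped = max(min_val, min(value, max_val))
    rounded = clamped // force_divisible_by * force_divisible_by
    return max(force_divisible_by, rounded)
```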

class data_juicer.ops.mapper.VideoSplitByDurationMapper(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, save_dir: str = None, video_backend: str = 'ffmpeg', *args, **kwargs)[source]#

Bases: Mapper

Splits videos into segments based on a specified duration.

This operator splits each video in the dataset into smaller segments, each with a fixed duration. The last segment is discarded if its duration is less than the specified minimum last split duration. The original sample can be kept or removed based on the keep_original_sample parameter. The generated video files are saved in the specified directory or, if not provided, in the same directory as the input files. The key metric for this operation is the duration of each segment, measured in seconds.

  • Splits videos into segments of a specified duration.

  • Discards the last segment if it is shorter than the minimum allowed duration.

  • Keeps or removes the original sample based on the keep_original_sample parameter.

  • Saves the generated video files in the specified directory or the input fileโ€™s directory.

  • Uses the duration in seconds to determine the segment boundaries.

__init__(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, save_dir: str = None, video_backend: str = 'ffmpeg', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • split_duration โ€“ duration of each video split in seconds.

  • min_last_split_duration โ€“ The minimum allowable duration in seconds for the last video split. If the duration of the last split is less than this value, it will be discarded.

  • keep_original_sample – whether to keep the original sample. If set to False, only the cut samples remain in the final dataset and the original sample is removed. It's True by default.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • video_backend – the video backend; can be 'ffmpeg' or 'av'.

  • args โ€“ extra args

  • kwargs โ€“ extra args

split_videos_by_duration(video_key, container)[source]#
process_batched(samples)[source]#
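The segment boundaries described above can be sketched as follows (a hypothetical helper; the real operator cuts the video with the chosen backend):

```python
def split_boundaries(total_duration, split_duration=10.0,
                     min_last_split_duration=0.0):
    # Fixed-length segments; the trailing partial segment is dropped when it
    # is shorter than min_last_split_duration.
    boundaries = []
    start = 0.0
    while start < total_duration:
        end = min(start + split_duration, total_duration)
        is_last_partial = (end == total_duration
                           and end - start < split_duration)
        if not is_last_partial or end - start >= min_last_split_duration:
            boundaries.append((start, end))
        start = end
    return boundaries
```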
class data_juicer.ops.mapper.VideoSplitByKeyFrameMapper(keep_original_sample: bool = True, save_dir: str = None, video_backend: str = 'ffmpeg', *args, **kwargs)[source]#

Bases: Mapper

Splits a video into segments based on key frames.

This operator processes video data by splitting it into multiple segments at key frame boundaries. It uses the key frames to determine where to make the splits. The original sample can be kept or discarded based on the keep_original_sample parameter. If save_dir is specified, the split video files will be saved in that directory; otherwise, they will be saved in the same directory as the input files. The operator processes each video in the sample and updates the sample with the new video keys and text placeholders. The Fields.source_file field is updated to reflect the new video segments. This operator works in batch mode, processing multiple samples at once.

__init__(keep_original_sample: bool = True, save_dir: str = None, video_backend: str = 'ffmpeg', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If set to False, only the split samples remain in the final dataset and the original sample is removed. It's True by default.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • video_backend – the video backend; can be 'ffmpeg' or 'av'.

  • args โ€“ extra args

  • kwargs โ€“ extra args

get_split_key_frame(video_key, container)[source]#
process_batched(samples)[source]#
class data_juicer.ops.mapper.VideoSplitBySceneMapper(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, save_dir: str = None, *args, **kwargs)[source]#

Bases: Mapper

Splits videos into scene clips based on detected scene changes.

This operator uses a specified scene detector to identify and split video scenes. It supports three types of detectors: ContentDetector, ThresholdDetector, and AdaptiveDetector. The operator processes each video in the sample, detects scenes, and splits the video into individual clips. The minimum length of a scene can be set, and progress can be shown during processing. The resulting clips are saved in the specified directory or the same directory as the input files if no save directory is provided. The operator also updates the text field in the sample to reflect the new video clips. If a video does not contain any scenes, it remains unchanged.

available_detectors = {'AdaptiveDetector': ['window_width', 'min_content_val', 'weights', 'luma_only', 'kernel_size', 'video_manager', 'min_delta_hsv'], 'ContentDetector': ['weights', 'luma_only', 'kernel_size'], 'ThresholdDetector': ['fade_bias', 'add_final_scene', 'method', 'block_size']}#
__init__(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, save_dir: str = None, *args, **kwargs)[source]#

Initialization method.

Parameters:
  • detector – algorithm from scenedetect.detectors. Should be one of ['ContentDetector', 'ThresholdDetector', 'AdaptiveDetector'].

  • threshold โ€“ Threshold passed to the detector.

  • min_scene_len โ€“ Minimum length of any scene.

  • show_progress โ€“ Whether to show progress from scenedetect.

  • save_dir โ€“ The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args โ€“ extra args

  • kwargs โ€“ extra args

process_single(sample, context=False)[source]#

For sample level, sample โ€“> sample

Parameters:

sample โ€“ sample to process

Returns:

processed sample
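The available_detectors mapping above suggests how detector-specific keyword arguments could be filtered before constructing the chosen detector. This is a sketch of that idea, an assumption about how the operator uses the mapping rather than its actual code:

```python
# Copied from the available_detectors attribute documented above.
available_detectors = {
    'AdaptiveDetector': ['window_width', 'min_content_val', 'weights',
                         'luma_only', 'kernel_size', 'video_manager',
                         'min_delta_hsv'],
    'ContentDetector': ['weights', 'luma_only', 'kernel_size'],
    'ThresholdDetector': ['fade_bias', 'add_final_scene', 'method',
                          'block_size'],
}

def filter_detector_kwargs(detector, kwargs):
    # Drop any keyword argument the chosen detector does not accept.
    allowed = available_detectors[detector]
    return {k: v for k, v in kwargs.items() if k in allowed}
```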

class data_juicer.ops.mapper.VideoTaggingFromAudioMapper(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = 'video_audio_tags', *args, **kwargs)[source]#

Bases: Mapper

Generates video tags from audio streams using the Audio Spectrogram Transformer.

This operator extracts audio streams from videos and uses a Hugging Face Audio Spectrogram Transformer (AST) model to generate tags. The tags are stored in the specified metadata field, defaulting to 'video_audio_tags'. If no valid audio stream is found, the tag is set to 'EMPTY'. The operator resamples audio to match the model's required sampling rate if necessary. The tags are inferred based on the highest logit value from the model's output. If the tags are already present in the sample, the operator skips processing for that sample.

__init__(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = 'video_audio_tags', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_ast – path to the Hugging Face AST model used to tag the audio streams.

  • trust_remote_code – whether to trust the remote code of HF models.

  • tag_field_name – the field name to store the tags. Defaults to "video_audio_tags".

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
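In practice, Data-Juicer operators are usually enabled through the process list of a YAML config rather than instantiated directly. A sketch of what that might look like for this mapper, assuming the conventional snake_case op name; verify the exact name and fields against your installed version:

```yaml
# Hypothetical config fragment; op name assumed from Data-Juicer's
# snake_case registration convention.
process:
  - video_tagging_from_audio_mapper:
      hf_ast: 'MIT/ast-finetuned-audioset-10-10-0.4593'
      trust_remote_code: false
      tag_field_name: 'video_audio_tags'
```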

class data_juicer.ops.mapper.VideoTaggingFromFramesMapper(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = 'video_frame_tags', *args, **kwargs)[source]#

Bases: Mapper

Generates video tags from frames extracted from videos.

This operator extracts frames from videos and generates tags based on the content of these frames. The frame extraction method can be either "all_keyframes" or "uniform". For "all_keyframes", all keyframes are extracted, while for "uniform", a specified number of frames are extracted uniformly across the video. The tags are generated using a pre-trained model and stored in the specified field name. If the tags are already present in the sample, the operator skips processing. Important notes:

- Uses a Hugging Face tokenizer and a pre-trained model for tag generation.

- If no video is present in the sample, an empty tag array is stored.

- Frame tensors are processed to generate tags, which are then sorted by frequency and stored.

__init__(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = 'video_frame_tags', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • tag_field_name – the field name to store the tags. Defaults to "video_frame_tags".

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
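The "uniform" sampling rule described above (middle frame for 1, first and last for 2, evenly spaced frames including both endpoints otherwise) can be sketched as follows; uniform_frame_indices is a hypothetical helper for illustration, not part of the operator's API:

```python
def uniform_frame_indices(total_frames: int, frame_num: int) -> list[int]:
    """Hypothetical sketch of the 'uniform' frame sampling rule."""
    if frame_num == 1:
        # Only the middle frame.
        return [total_frames // 2]
    # First and last frames, plus frames evenly spaced between them.
    step = (total_frames - 1) / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]
```

For a 101-frame video, frame_num=2 yields the first and last frames, and frame_num=3 adds the middle frame between them.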

class data_juicer.ops.mapper.VideoWholeBodyPoseEstimationMapper(onnx_det_model: str = 'yolox_l.onnx', onnx_pose_model: str = 'dw-ll_ucoco_384.onnx', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'pose_estimation_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_save_visualization: bool = False, save_visualization_dir: str = '/home/runner/.cache/data_juicer/assets', *args, **kwargs)[source]#

Bases: Mapper

Takes a video containing people as input and uses the DWPose model to extract the body, hand, foot, and face keypoints of the human subjects in the video, i.e., 2D whole-body pose estimation.

__init__(onnx_det_model: str = 'yolox_l.onnx', onnx_pose_model: str = 'dw-ll_ucoco_384.onnx', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, tag_field_name: str = 'pose_estimation_tags', frame_dir: str = '/home/runner/.cache/data_juicer/assets', if_save_visualization: bool = False, save_visualization_dir: str = '/home/runner/.cache/data_juicer/assets', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • onnx_det_model – The path to 'yolox_l.onnx'.

  • onnx_pose_model – The path to 'dw-ll_ucoco_384.onnx'.

  • frame_num – The number of frames to be extracted uniformly from the video. If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If "duration" > 0, frame_num is the number of frames per segment.

  • duration – The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.

  • tag_field_name – The field name to store the tags. Defaults to "pose_estimation_tags".

  • frame_dir – Output directory to save extracted frames.

  • if_save_visualization – Whether to save visualization results.

  • save_visualization_dir – The path for saving visualization results.

  • args – extra args

  • kwargs – extra args

process_single(sample=None, rank=None)[source]#

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
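The interaction between duration and frame_num described above can be sketched as a small planning helper; plan_total_frames is hypothetical, for illustration only:

```python
import math

def plan_total_frames(video_seconds: float, duration: float,
                      frame_num: int) -> int:
    """Hypothetical sketch: how many frames the rules above extract."""
    if duration <= 0:
        # Frames are sampled once over the entire video.
        return frame_num
    # The video is split into ceil(length / duration) segments,
    # and frame_num frames are extracted from each segment.
    n_segments = math.ceil(video_seconds / duration)
    return n_segments * frame_num
```

For a 10-second video with duration=3 and frame_num=3, the video yields 4 segments and 12 extracted frames in total.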

class data_juicer.ops.mapper.WhitespaceNormalizationMapper(*args, **kwargs)[source]#

Bases: Mapper

Normalizes various types of whitespace characters to standard spaces in text samples.

This mapper converts all non-standard whitespace characters, such as tabs and newlines, to the standard space character (' ', 0x20). It also trims leading and trailing whitespace from the text. This ensures consistent spacing across all text samples, improving readability and consistency. The normalization process is based on a comprehensive list of whitespace characters, which can be found at https://en.wikipedia.org/wiki/Whitespace_character.

__init__(*args, **kwargs)[source]#

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]#
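A minimal sketch of the normalization described above. The whitespace set here is an abbreviated, illustrative subset of the Wikipedia list the docstring cites, and normalize_whitespace is a hypothetical stand-in for the mapper's internal logic:

```python
# Illustrative subset of non-standard whitespace characters; the mapper's
# actual table is larger (see the Wikipedia list cited above).
VARIOUS_WHITESPACES = {
    '\t', '\n', '\r', '\v', '\f',
    '\u00a0',  # no-break space
    '\u2002', '\u2003', '\u2009',  # en space, em space, thin space
    '\u3000',  # ideographic space
}

def normalize_whitespace(text: str) -> str:
    # Replace each non-standard whitespace character with a plain space
    # (0x20), then trim leading and trailing whitespace.
    return ''.join(' ' if ch in VARIOUS_WHITESPACES else ch
                   for ch in text).strip()
```

For example, normalize_whitespace('hello\u00a0world\n') returns 'hello world'.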