data_juicer.ops.mapper.generate_qa_from_examples_mapper module#

class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: Mapper

Generates question and answer pairs from examples using a Hugging Face model.

This operator generates new QA pairs from seed examples using a Hugging Face model. The number of generated samples is determined by the length of the empty dataset configured in the YAML recipe. Each generated pair is compared against the seed examples using the ROUGE-L metric, and only samples whose similarity score falls below the specified threshold are kept, so near-duplicates of the seeds are discarded. The seed examples are provided by a seed file in chatml format. The generated QA pairs must follow the formatting rules stated in the system prompt, such as keeping the same format as the input examples and ensuring that questions and answers appear in matched pairs.

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出，按照你的理解，总结出相应规矩，然后写出一个新的【问题】和【回答】。注意，新生成的【问题】和【回答】需要满足如下要求：\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致，但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域，生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话，生成的【问题】和【回答】也可以是多轮，但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现，而且【问题】需要在【回答】之前。\n'#
DEFAULT_INPUT_TEMPLATE = '{}'#
DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据：\n{}'#
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'#
DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'#
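The default QA pair template and output pattern round-trip: text formatted with `DEFAULT_QA_PAIR_TEMPLATE` can be parsed back with `DEFAULT_OUTPUT_PATTERN`. A minimal sketch of this, with illustrative QA strings; note that `re.DOTALL` is needed so `.` can cross the newlines inside each field (the operator's own parsing may differ in detail):

```python
import re

# Default template and pattern, copied from the class constants above.
QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'

# Format two QA pairs the way a model response would be laid out.
raw_output = ''.join([
    QA_PAIR_TEMPLATE.format('What is 2 + 2?', '4.'),
    QA_PAIR_TEMPLATE.format('Name a prime number.', '7 is prime.'),
])

# re.DOTALL lets '.' match the newlines inside each question/answer.
pairs = [(q.strip(), a.strip())
         for q, a in re.findall(OUTPUT_PATTERN, raw_output, re.DOTALL)]
print(pairs)
```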
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • seed_file – Path to the seed file in chatml format.

  • example_num – The number of examples to select. example_num examples are randomly sampled from seed_file and inserted into the prompt as QA examples.

  • similarity_threshold – The similarity score threshold between the generated samples and the seed examples. Ranges from 0 to 1. Samples with a similarity score below this threshold are kept.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the input prompt. It must include one placeholder '{}', which will be replaced by example_num formatted examples defined by example_template.

  • example_template – Template for formatting one QA example. It must include one placeholder '{}', which will be replaced by one formatted qa_pair.

  • qa_pair_template – Template for formatting a single QA pair within each example. Must include two placeholders '{}' for the question and answer.

  • output_pattern – Regular expression pattern to extract questions and answers from the model response.

  • enable_vllm – Whether to use vLLM for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.
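Since the number of generated samples is driven by the length of an empty dataset in the YAML recipe, a typical setup wires this operator into a data-juicer process list. The following is an illustrative sketch only; the file paths are placeholders, and surrounding recipe keys may vary by data-juicer version:

```yaml
# Illustrative recipe fragment (paths are placeholders, not part of this API).
# The empty dataset's length determines how many samples are generated.
dataset_path: path/to/empty_dataset.jsonl
export_path: path/to/generated_qa.jsonl
process:
  - generate_qa_from_examples_mapper:
      hf_model: 'Qwen/Qwen2.5-7B-Instruct'
      seed_file: path/to/seed_chatml.jsonl
      example_num: 3
      similarity_threshold: 0.7
      enable_vllm: false
```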

build_input(qa_examples)[source]#

Builds the input prompt by formatting the selected QA examples with example_template and inserting them into input_template.

parse_output(raw_output)[source]#

Extracts (question, answer) pairs from the raw model response using output_pattern.
process_single(sample, rank=None)[source]#

For sample level, sample -> sample.

Parameters:

sample – the sample to process

Returns:

the processed sample
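The similarity filter described above (keep a generated sample only if its ROUGE-L score against every seed example stays below the threshold) can be sketched with a plain LCS-based ROUGE-L F-score. This is an illustrative re-implementation, not the operator's actual scorer, which may tokenize and weight differently:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    # ROUGE-L F-measure over whitespace tokens (beta = 1).
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def keep(generated, seeds, threshold=0.7):
    # Keep the generated sample only if it is sufficiently *dissimilar*
    # to every seed example (max similarity below the threshold).
    return max(rouge_l(generated, s) for s in seeds) < threshold

seeds = ['what is the capital of france paris']
print(keep('name a prime number seven', seeds))            # dissimilar -> kept (True)
print(keep('what is the capital of france paris', seeds))  # identical -> dropped (False)
```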