data_juicer.ops.mapper.generate_qa_from_examples_mapper module

class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)

Bases: Mapper

Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your YAML config file:

```
generated_dataset_config:
  type: 'EmptyFormatter'  # use RayEmptyFormatter when Ray is enabled
  length: ${The number of generated samples}
  feature_keys: ${text key}
```

The number of samples generated is determined by the length of the empty dataset.

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出，按照你的理解，总结出相应规矩，然后写出一个新的【问题】和【回答】。注意，新生成的【问题】和【回答】需要满足如下要求：\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致，但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域，生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话，生成的【问题】和【回答】也可以是多轮，但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现，而且【问题】需要在【回答】之前。\n'

English rendering: Carefully study the inputs and outputs of the example data, infer the underlying rules, and then write a new [Question] and [Answer]. The new pair must meet these requirements: 1. It must not duplicate the input [Question] and [Answer], but must keep the same format. 2. The generated [Question] need not be limited to the topic or domain of the input [Question], and the generated [Answer] must correctly answer the generated [Question]. 3. The provided [Question] and [Answer] may be a multi-turn dialogue; the generated ones may also be multi-turn, but must keep the same format. 4. The generated [Question] and [Answer] must appear in pairs, with the [Question] before the [Answer].

DEFAULT_INPUT_TEMPLATE = '{}'
DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据：\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'
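
To illustrate how a pattern like DEFAULT_OUTPUT_PATTERN can pull QA pairs out of a model response, here is a minimal sketch. The markers 【问题】 and 【回答】 mean "question" and "answer". The helper name `extract_qa_pairs` and the use of `re.DOTALL` are assumptions made for illustration; this is not necessarily how `parse_output` is implemented.

```python
import re

# Value copied from DEFAULT_OUTPUT_PATTERN above.
OUTPUT_PATTERN = r'【问题】(.*?)【回答】(.*?)(?=【问题】|$)'


def extract_qa_pairs(raw_output, pattern=OUTPUT_PATTERN):
    """Return (question, answer) tuples found in a model response.

    Hypothetical helper, not part of the documented API.
    """
    # Assumption: re.DOTALL lets '.' span the newlines that the QA pair
    # template places around each question and answer.
    matches = re.findall(pattern, raw_output, re.DOTALL)
    # Trim the surrounding newlines captured by the lazy groups.
    return [(q.strip(), a.strip()) for q, a in matches]


response = (
    '【问题】\nWhat is 1 + 1?\n【回答】\n2\n'
    '【问题】\nWhat color is the sky?\n【回答】\nBlue\n'
)
pairs = extract_qa_pairs(response)
# pairs == [('What is 1 + 1?', '2'), ('What color is the sky?', 'Blue')]
```

The lookahead `(?=【问题】|$)` ends each answer at the next question marker or at the end of the text, which is what allows multi-turn responses to yield several pairs.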
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • seed_file – Path to the seed file in ChatML format.

  • example_num – The number of selected examples. N examples are randomly selected from "seed_file" and put into the prompt as QA examples.

  • similarity_threshold – The similarity score threshold between the generated samples and the seed examples. Ranges from 0 to 1. Samples with a similarity score less than this threshold will be kept.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the input prompt. It must include one placeholder '{}', which will be replaced by example_num formatted examples defined by example_template.

  • example_template – Template for formatting one QA example. It must include one placeholder '{}', which will be replaced by one formatted qa_pair.

  • qa_pair_template – Template for formatting a single QA pair within each example. It must include two placeholders '{}' for the question and answer.

  • output_pattern – Regular expression pattern to extract questions and answers from the model response.

  • enable_vllm – Whether to use vLLM for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g., {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.
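
The nesting of the three templates described above (qa_pair_template inside example_template inside input_template) can be sketched as follows. The function `build_prompt` and the shape of `qa_examples` (a list of examples, each a list of (question, answer) tuples) are assumptions for illustration and may differ from what `build_input` actually does.

```python
# Values copied from the class defaults documented above.
QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
EXAMPLE_TEMPLATE = '\n如下是一条示例数据：\n{}'
INPUT_TEMPLATE = '{}'


def build_prompt(qa_examples):
    """Compose the input prompt from nested templates (hypothetical helper).

    qa_examples is assumed to be a list of examples, where each example
    is a list of (question, answer) tuples (one tuple per dialogue turn).
    """
    formatted_examples = []
    for qa_pairs in qa_examples:
        # Format every turn of this example with the QA pair template.
        formatted_pairs = ''.join(
            QA_PAIR_TEMPLATE.format(q, a) for q, a in qa_pairs
        )
        # Wrap the joined turns in the per-example template.
        formatted_examples.append(EXAMPLE_TEMPLATE.format(formatted_pairs))
    # Fill the single placeholder of the input template with all examples.
    return INPUT_TEMPLATE.format(''.join(formatted_examples))


prompt = build_prompt([[('Q1', 'A1')], [('Q2', 'A2'), ('Q3', 'A3')]])
```

With the defaults, each example is introduced by the line 如下是一条示例数据： ("Here is one example:"), followed by its question and answer blocks.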

build_input(qa_examples)
parse_output(raw_output)
process_single(sample, rank=None)

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
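
The similarity_threshold rule from the constructor parameters (keep a generated sample only if its similarity to the seed examples is below the threshold) can be pictured with the sketch below. The similarity metric used by this mapper is not stated here, so `difflib.SequenceMatcher` is a stand-in chosen purely for illustration, and the helper names `max_similarity` and `keep_sample` are hypothetical.

```python
from difflib import SequenceMatcher


def max_similarity(candidate, seed_texts):
    """Highest similarity between a candidate and any seed text.

    Stand-in metric: SequenceMatcher.ratio(), which ranges from 0 to 1.
    The mapper's real metric may be different.
    """
    if not seed_texts:
        return 0.0
    return max(
        SequenceMatcher(None, candidate, seed).ratio() for seed in seed_texts
    )


def keep_sample(candidate, seed_texts, similarity_threshold=0.7):
    """Keep the candidate only if it is less similar than the threshold."""
    return max_similarity(candidate, seed_texts) < similarity_threshold


seeds = ['What is the capital of France? Paris.']
duplicate_kept = keep_sample(seeds[0], seeds)  # identical text is rejected
```

Under this reading, a lower threshold makes the filter stricter, because fewer generated samples fall below it.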