data_juicer.ops.mapper.generate_qa_from_examples_mapper module#

class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Bases: Mapper

Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your YAML config file:

```
generated_dataset_config:
  type: 'EmptyFormatter'  # use RayEmptyFormatter when Ray is enabled
  length: ${The number of generated samples}
  feature_keys: ${text key}
```

The number of samples generated is determined by the length of the empty dataset.

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出，按照你的理解，总结出相应规矩，然后写出一个新的【问题】和【回答】。注意，新生成的【问题】和【回答】需要满足如下要求：\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致，但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域，生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话，生成的【问题】和【回答】也可以是多轮，但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现，而且【问题】需要在【回答】之前。\n'#

DEFAULT_INPUT_TEMPLATE = '{}'#

DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据：\n{}'#

DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'#

DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'#

__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Initialization method.

Parameters:

hf_model – Huggingface model ID.
seed_file – Path to the seed file in chatml format.
example_num – The number of examples to select. N examples are chosen at random from seed_file and placed in the prompt as QA examples.
similarity_threshold – Similarity score threshold between the generated samples and the seed examples, in the range 0 to 1. Generated samples whose similarity score is below this threshold are kept.
system_prompt – System prompt for guiding the generation task.
input_template – Template for building the input prompt. It must include one placeholder '{}', which will be replaced by example_num formatted examples defined by example_template.
example_template – Template for formatting one QA example. It must include one placeholder '{}', which will be replaced by one formatted qa_pair.
qa_pair_template – Template for formatting a single QA pair within each example. It must include two placeholders '{}', one for the question and one for the answer.
output_pattern – Regular expression pattern to extract questions and answers from model response.
enable_vllm – Whether to use vllm for inference acceleration.
model_params – Parameters for initializing the model.
sampling_params – Sampling parameters for text generation, e.g., {'temperature': 0.9, 'top_p': 0.95}.
kwargs – Extra keyword arguments.
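
The effect of similarity_threshold can be illustrated with a small sketch. The page does not say which similarity metric the mapper uses, so this example substitutes Python's `difflib.SequenceMatcher` as a stand-in, and it assumes a generated sample is compared against every seed example with the highest score deciding. The helper names `max_similarity` and `keep_sample` are hypothetical and are not part of data_juicer.

```python
from difflib import SequenceMatcher


def max_similarity(candidate, seeds):
    """Highest similarity between a candidate and any seed text.

    SequenceMatcher is a stand-in metric (an assumption); the mapper's
    actual metric is not stated in this reference page.
    """
    return max(
        (SequenceMatcher(None, candidate, seed).ratio() for seed in seeds),
        default=0.0,  # with no seeds there is nothing to be similar to
    )


def keep_sample(candidate, seeds, similarity_threshold=0.7):
    """Keep a generated sample only if its score is below the threshold."""
    return max_similarity(candidate, seeds) < similarity_threshold


seeds = ["What is the capital of France? Paris"]

# An exact copy of a seed scores 1.0 and is dropped.
print(keep_sample("What is the capital of France? Paris", seeds))

# A string sharing no characters with the seed scores 0.0 and is kept.
print(keep_sample("1234567890", seeds))
```

Under this reading, lowering the threshold makes the filter stricter, because fewer generated samples score below it.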

build_input(qa_examples)[source]#
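
The parameter descriptions imply a three-level nesting: qa_pair_template formats each question and answer, example_template wraps the pairs of one example, and input_template wraps all examples. The following is a minimal sketch of that composition using the default templates listed above. It is not the library's implementation; the function `build_prompt` and the shape of `seed_examples` (a list of examples, each a list of (question, answer) tuples) are assumptions made for illustration.

```python
import random

# Default templates copied from the class attributes documented above.
QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
EXAMPLE_TEMPLATE = '\n如下是一条示例数据：\n{}'
INPUT_TEMPLATE = '{}'


def build_prompt(qa_examples,
                 input_template=INPUT_TEMPLATE,
                 example_template=EXAMPLE_TEMPLATE,
                 qa_pair_template=QA_PAIR_TEMPLATE):
    """Compose the prompt: QA pairs -> examples -> full input."""
    formatted_examples = []
    for example in qa_examples:
        # One example may hold several QA pairs (a multi-turn dialogue).
        pairs = ''.join(qa_pair_template.format(q, a) for q, a in example)
        formatted_examples.append(example_template.format(pairs))
    return input_template.format(''.join(formatted_examples))


# Hypothetical seed data standing in for the parsed seed_file.
seed_examples = [
    [("What is 2 + 2?", "4")],
    [("Name a primary color.", "Red"), ("Name another.", "Blue")],
    [("What is the capital of France?", "Paris")],
    [("How many days are in a week?", "7")],
]

rng = random.Random(0)                 # fixed seed for reproducibility
chosen = rng.sample(seed_examples, 3)  # example_num = 3
prompt = build_prompt(chosen)
print(prompt)
```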

parse_output(raw_output)[source]#
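
As a rough illustration of how output_pattern can extract QA pairs from a model response, the sketch below applies DEFAULT_OUTPUT_PATTERN with `re.findall`. The use of `re.DOTALL` (so that `.` also matches newlines) and the stripping of whitespace are assumptions, since this page does not state them. The function name `parse_qa_pairs` is hypothetical.

```python
import re

# Default pattern copied from DEFAULT_OUTPUT_PATTERN above.
OUTPUT_PATTERN = r'【问题】(.*?)【回答】(.*?)(?=【问题】|$)'


def parse_qa_pairs(raw_output, pattern=OUTPUT_PATTERN):
    """Return (question, answer) tuples found in a model response."""
    # re.DOTALL is assumed so that questions and answers may span lines.
    matches = re.findall(pattern, raw_output, re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in matches]


response = (
    '【问题】\nWhat is 2 + 2?\n【回答】\n4\n'
    '【问题】\nAnd 3 + 3?\n【回答】\n6\n'
)
pairs = parse_qa_pairs(response)
print(pairs)  # [('What is 2 + 2?', '4'), ('And 3 + 3?', '6')]
```

The lookahead `(?=【问题】|$)` ends each answer at the next question marker or at the end of the response, which is what lets a multi-turn response yield several pairs.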

process_single(sample, rank=None)[source]#

Sample-level processing: sample -> sample.

Parameters:

sample – sample to process

Returns:

processed sample
