data_juicer.ops.mapper.generate_qa_from_examples_mapper module#
- class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#
Bases: Mapper

Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your yaml config file:
``` generated_dataset_config:
type: 'EmptyFormatter' # use RayEmptyFormatter when ray is enabled length: ${The number of generated samples} feature_keys: ${text key}
``` The number of samples generated is determined by the length of the empty dataset.
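As an illustration, a hedged sketch of a full yaml config wiring this op to an empty dataset. The `generated_dataset_config` keys come from the description above; the `process` list shape follows common data-juicer config conventions, and the seed file path is invented — verify the exact keys against your data-juicer version:

```yaml
# Hypothetical config sketch (assumed keys and path; verify against your version)
generated_dataset_config:
  type: 'EmptyFormatter'     # use RayEmptyFormatter when ray is enabled
  length: 100                # number of samples to generate
  feature_keys: 'text'

process:
  - generate_qa_from_examples_mapper:
      hf_model: 'Qwen/Qwen2.5-7B-Instruct'
      seed_file: 'demos/data/seed_qa.chatml.jsonl'   # hypothetical path
      example_num: 3
      similarity_threshold: 0.7
      enable_vllm: false
```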
- DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出，按照你的理解，总结出相应规矩，然后写出一个新的【问题】和【回答】。注意，新生成的【问题】和【回答】需要满足如下要求：\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致，但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域，生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话，生成的【问题】和【回答】也可以是多轮，但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现，而且【问题】需要在【回答】之前。\n'#
- DEFAULT_INPUT_TEMPLATE = '{}'#
- DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据：\n{}'#
- DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'#
- DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'#
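To show how the default output pattern behaves, here is a minimal sketch using plain `re` — not the library's own extraction code, and the model response below is invented for illustration:

```python
import re

# The default pattern captures the text after each 【问题】 (question) marker
# and after the following 【回答】 (answer) marker, stopping lazily at the
# next 【问题】 or the end of the string.
DEFAULT_OUTPUT_PATTERN = r'【问题】(.*?)【回答】(.*?)(?=【问题】|$)'

# A made-up two-turn model response in the default output format.
response = (
    '【问题】\n什么是机器学习？\n'
    '【回答】\n机器学习是让计算机从数据中学习规律的方法。\n'
    '【问题】\n它有哪些常见类型？\n'
    '【回答】\n常见类型包括监督学习、无监督学习和强化学习。\n'
)

# re.DOTALL lets '.' cross the newlines inside each question/answer.
pairs = [(q.strip(), a.strip())
         for q, a in re.findall(DEFAULT_OUTPUT_PATTERN, response, re.DOTALL)]
```

The lookahead `(?=【问题】|$)` keeps the answer capture from swallowing the next question, which is what allows multi-turn responses to be split into separate QA pairs.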
- __init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#
Initialization method.
- Parameters:
hf_model – Hugging Face model ID.
seed_file – Path to the seed file in chatml format.
example_num – The number of selected examples. Randomly select example_num examples from seed_file and put them into the prompt as QA examples.
similarity_threshold – The similarity score threshold between the generated samples and the seed examples, ranging from 0 to 1. Samples with a similarity score lower than this threshold are kept.
system_prompt – System prompt for guiding the generation task.
input_template – Template for building the input prompt. It must include one placeholder '{}', which will be replaced by example_num formatted examples defined by example_template.
example_template – Template for formatting one QA example. It must include one placeholder '{}', which will be replaced by one formatted qa_pair.
qa_pair_template – Template for formatting a single QA pair within each example. It must include two placeholders '{}' for the question and answer.
output_pattern – Regular expression pattern to extract questions and answers from the model response.
enable_vllm – Whether to use vllm for inference acceleration.
model_params – Parameters for initializing the model.
sampling_params – Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs – Extra keyword arguments.
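To make the relationship between the three template parameters concrete, here is a minimal sketch of how the defaults could compose into an input prompt. This is plain string formatting, not the mapper's internal code, and the seed examples are invented (the real mapper samples them from seed_file):

```python
# Default templates from the class attributes above.
input_template = '{}'
example_template = '\n如下是一条示例数据：\n{}'
qa_pair_template = '【问题】\n{}\n【回答】\n{}\n'

# Invented seed data: each example is a list of (question, answer) turns,
# so a single example may hold a multi-turn dialog.
seed_examples = [
    [('1+1等于几？', '1+1等于2。')],
    [('法国的首都是哪里？', '法国的首都是巴黎。')],
]

# Format each example's QA turns with qa_pair_template, wrap the result in
# example_template, then drop all examples into input_template's placeholder.
formatted_examples = []
for qa_turns in seed_examples:
    qa_text = ''.join(qa_pair_template.format(q, a) for q, a in qa_turns)
    formatted_examples.append(example_template.format(qa_text))

input_prompt = input_template.format(''.join(formatted_examples))
```

Nesting runs qa_pair_template → example_template → input_template, so each level owns one layer of formatting: the pair, the example wrapper, and the overall prompt.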