data_juicer.ops.mapper.llm_extract_mapper module#

LLM extract mapper: user-configurable structured extraction into sample meta.

Part of the llm_* ops family; distinguished by user-provided output_schema rather than fixed evaluation dimensions.

class data_juicer.ops.mapper.llm_extract_mapper.LLMExtractMapper(*args, **kwargs)[source]#

Bases: Mapper

Extract structured fields from text using an LLM; write results to meta.

Input: the values of sample[input_keys] are concatenated to form the input text. Output: the result is written either to meta[meta_output_key] as a single dict (when meta_output_key is set) or to meta[out_key] for each key in output_schema. Extraction is driven by the user-provided output_schema (mapping output key -> extraction instruction); optional knowledge grounding is supplied via a sample key (knowledge_grounding_key) or a fixed string (knowledge_grounding_fixed).

__init__(input_keys: List[str], output_schema: Dict[str, str], api_or_hf_model: str = 'gpt-4o', *, meta_output_key: str | None = 'llm_extract', knowledge_grounding_key: str | None = None, knowledge_grounding_fixed: str | None = None, is_hf_model: bool = False, enable_vllm: bool = False, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#

Parameters:

input_keys -- sample keys used to build the input text (e.g. ["text"] or ["query", "response"])

output_schema -- mapping of {output_key: "extraction instruction"}

api_or_hf_model -- model name, for an API or a HuggingFace model

meta_output_key -- if set, the full result dict is written to meta[meta_output_key]

knowledge_grounding_key -- optional sample key providing per-sample grounding text

knowledge_grounding_fixed -- optional fixed grounding string

strategy -- prompt strategy for extraction (direct / cot / few_shot / cot_shot)

examples -- optional examples text used by few-shot strategies

try_num -- number of retries on parse or API failure
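How these arguments combine into a prompt can be sketched in plain Python. This is a minimal illustration of the documented behavior (concatenate sample[input_keys], one instruction per output_schema key, optional grounding), not the actual implementation; the helper name and prompt layout are assumptions:

```python
def build_prompt(sample, input_keys, output_schema,
                 grounding_key=None, grounding_fixed=None):
    """Hypothetical sketch of prompt assembly for LLMExtractMapper."""
    # Concatenate the selected sample fields into the input text.
    input_text = "\n".join(str(sample[k]) for k in input_keys)
    # One extraction instruction per output key.
    schema_lines = [f"- {key}: {instruction}"
                    for key, instruction in output_schema.items()]
    # Grounding comes from a per-sample key or a fixed string, if given.
    grounding = sample.get(grounding_key, "") if grounding_key else (grounding_fixed or "")
    parts = ["Extract the following fields:", *schema_lines, "Text:", input_text]
    if grounding:
        parts += ["Background knowledge:", grounding]
    return "\n".join(parts)

sample = {"query": "Who wrote Dune?",
          "response": "Frank Herbert wrote Dune in 1965."}
prompt = build_prompt(sample, ["query", "response"],
                      {"author": "Extract the author's name."})
```

The assembled prompt then goes to the configured model; on parse or API failure the op retries up to try_num times.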

process_single(sample: Dict, rank: int | None = None) Dict[source]#

Sample-level op: maps a single sample to a single processed sample.

Parameters:

sample -- the sample to process

rank -- optional worker rank used to select the model instance

Returns:

the processed sample
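The sample -> sample contract can be illustrated with a stub in which the LLM call is replaced by a fixed result dict; the meta write-back follows the class docstring, and the field name is hypothetical:

```python
def process_single_stub(sample, meta_output_key="llm_extract"):
    """Sketch of LLMExtractMapper.process_single's write-back, with the
    extraction step stubbed out."""
    extracted = {"author": "Frank Herbert"}  # stand-in for the LLM result
    meta = sample.setdefault("meta", {})
    if meta_output_key:
        meta[meta_output_key] = extracted    # full result as one dict
    else:
        meta.update(extracted)               # one meta entry per schema key
    return sample

out = process_single_stub({"text": "Frank Herbert wrote Dune."})
out_flat = process_single_stub({"text": "Frank Herbert wrote Dune."},
                               meta_output_key=None)
```

Note that the input sample is returned with its meta enriched; the original text fields are left untouched.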