data_juicer.ops.mapper.llm_extract_mapper module#
LLM extract mapper: user-configurable structured extraction into sample meta.
Part of the llm_* ops family; distinguished by user-provided output_schema rather than fixed evaluation dimensions.
- class data_juicer.ops.mapper.llm_extract_mapper.LLMExtractMapper(*args, **kwargs)[source]#
Bases: Mapper

Extract structured fields from text using an LLM; write results to meta.
Input: the values at sample[input_keys] are concatenated to form the input text. Output: the full result dict is written to meta[meta_output_key], or each extracted field to meta[out_key], one per output_schema key. Extraction is driven by the user-provided output_schema (mapping output key to instruction); optional knowledge grounding can be supplied via a sample key or a fixed string.
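As a hedged illustration of this input/output contract (the field values and schema below are invented; the real in-memory layout is determined by the configured keys), a sample before and after mapping might look like:

```python
# Hypothetical sample shapes, assuming meta_output_key="llm_extract" and an
# output_schema with two keys, "person" and "location" (illustrative only).
before = {"text": "Alice moved to Paris in 2020."}
after = {
    "text": "Alice moved to Paris in 2020.",
    "meta": {
        "llm_extract": {            # full result dict under meta_output_key
            "person": "Alice",      # one entry per output_schema key
            "location": "Paris",
        }
    },
}
```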
- __init__(input_keys: List[str], output_schema: Dict[str, str], api_or_hf_model: str = 'gpt-4o', *, meta_output_key: str | None = 'llm_extract', knowledge_grounding_key: str | None = None, knowledge_grounding_fixed: str | None = None, is_hf_model: bool = False, enable_vllm: bool = False, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]#
Args:
- input_keys: Sample keys used to build the input text (e.g. ["text"] or ["query", "response"]).
- output_schema: {output_key: "extraction instruction"}.
- api_or_hf_model: Model name for the API or HuggingFace.
- meta_output_key: If set, write the full result dict to meta[meta_output_key].
- knowledge_grounding_key: Optional sample key providing per-sample grounding text.
- knowledge_grounding_fixed: Optional fixed grounding string.
- strategy: Prompt strategy for extraction (direct/cot/few_shot/cot_shot).
- examples: Optional examples text used by few-shot strategies.
- try_num: Number of retries on parse/API failure.
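The flow described above can be sketched in plain Python. This is a minimal, self-contained approximation, not the library's implementation: `extract_to_meta` and its stubbed `call_llm` are hypothetical stand-ins for the real API/HuggingFace call, prompt template, and retry logic.

```python
import json
from typing import Callable, Dict, List

def extract_to_meta(
    sample: Dict,
    input_keys: List[str],
    output_schema: Dict[str, str],
    call_llm: Callable[[str], str],   # stub standing in for the real model call
    meta_output_key: str = "llm_extract",
    try_num: int = 3,
) -> Dict:
    """Sketch of the mapper's flow: build the input text, prompt per
    output_schema, parse JSON, retry on failure, write into sample['meta']."""
    # Concatenate the configured sample fields into one input text.
    input_text = "\n".join(str(sample[k]) for k in input_keys)
    # One instruction line per output_schema key.
    instructions = "\n".join(f"- {k}: {v}" for k, v in output_schema.items())
    prompt = (
        "Extract the following fields as a JSON object:\n"
        f"{instructions}\n\nText:\n{input_text}"
    )
    result = None
    for _ in range(try_num):          # retry up to try_num times on parse failure
        try:
            result = json.loads(call_llm(prompt))
            break
        except json.JSONDecodeError:
            continue
    if result is not None:
        sample.setdefault("meta", {})[meta_output_key] = result
    return sample

# Usage with a stubbed LLM that returns fixed JSON:
sample = {"text": "Alice moved to Paris in 2020."}
out = extract_to_meta(
    sample,
    input_keys=["text"],
    output_schema={"person": "Who is mentioned?", "location": "Where?"},
    call_llm=lambda prompt: '{"person": "Alice", "location": "Paris"}',
)
```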