data_juicer.utils.llm_semantic_ops module#
LLM semantic ops: user-configurable extract/condition helpers.
Supports both structured output (JSON/schema) and unstructured input (e.g. plain text, JSONL). Shared by llm_extract_mapper and llm_condition_filter, and reusable for DataFrame/SQL/DB sources by adapting the input to a (text, schema/condition) pair. Naming follows the llm_* convention.
- class data_juicer.utils.llm_semantic_ops.LLMCallUsage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cost_estimate: float | None = None)[source]#
Bases: object
Token usage (and an optional cost estimate) for a single LLM call.
- prompt_tokens: int = 0#
- completion_tokens: int = 0#
- total_tokens: int = 0#
- cost_estimate: float | None = None#
- classmethod from_dict(d: Dict[str, Any]) → LLMCallUsage[source]#
- __init__(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cost_estimate: float | None = None) → None#
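A minimal usage sketch, constructing a usage record directly and via from_dict; the token counts are illustrative and the dict mimics an OpenAI-style "usage" payload:

```python
from data_juicer.utils.llm_semantic_ops import LLMCallUsage

# Build from keyword arguments or from a plain dict of token counts.
usage = LLMCallUsage(prompt_tokens=120, completion_tokens=35, total_tokens=155)
same = LLMCallUsage.from_dict(
    {"prompt_tokens": 120, "completion_tokens": 35, "total_tokens": 155}
)

print(usage.total_tokens)   # 155
print(usage.cost_estimate)  # None unless a cost estimate was supplied
```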
- class data_juicer.utils.llm_semantic_ops.RecordRow(**extra_data: Any)[source]#
Bases: BaseModel
A single row of extracted fields; the schema aligns with the output_schema keys.
Use model_validate(d) or RecordRow(**d) for dict -> RecordRow, and row.model_dump() for RecordRow -> dict. Extra keys from output_schema are allowed (extra='allow').
- model_config = {'extra': 'allow'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
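A short sketch of the dict <-> RecordRow round trip described above; the field names are hypothetical output_schema keys:

```python
from data_juicer.utils.llm_semantic_ops import RecordRow

# dict -> RecordRow; keys beyond the declared fields are kept (extra='allow')
row = RecordRow.model_validate({"title": "Data-Juicer", "year": "2024"})
# equivalently: row = RecordRow(title="Data-Juicer", year="2024")

# RecordRow -> dict
print(row.model_dump())  # {'title': 'Data-Juicer', 'year': '2024'}
```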
- data_juicer.utils.llm_semantic_ops.record_batch_from_dicts(items: List[Dict[str, Any]], schema_keys: List[str] | None = None) → List[RecordRow][source]#
Convert a list of dicts to a RecordBatch (a list of RecordRow).
- data_juicer.utils.llm_semantic_ops.record_batch_to_dicts(batch: List[RecordRow]) → List[Dict[str, Any]][source]#
Convert a RecordBatch back to a list of dicts.
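A usage sketch of the two converters, with hypothetical schema keys; how missing keys are filled is left to the implementation and not shown here:

```python
from data_juicer.utils.llm_semantic_ops import (
    record_batch_from_dicts,
    record_batch_to_dicts,
)

items = [
    {"person": "Alice", "company": "Acme"},
    {"person": "Bob", "company": "Globex"},
]
batch = record_batch_from_dicts(items, schema_keys=["person", "company"])
print(record_batch_to_dicts(batch))  # expected to round-trip back to plain dicts
```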
- class data_juicer.utils.llm_semantic_ops.InferenceStrategy(value)[source]#
Bases: Enum
- DIRECT = 'direct'#
- COT = 'cot'#
- FEW_SHOT = 'few_shot'#
- COT_SHOT = 'cot_shot'#
- data_juicer.utils.llm_semantic_ops.get_extract_prompt(input_text: str, output_schema: Dict[str, str], knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None) → str[source]#
Build the user prompt for extraction. output_schema maps each output key to its extraction instruction ({key: instruction}).
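A sketch of building an extraction prompt; the schema keys and instructions are only examples, and the exact prompt wording is determined by the module:

```python
from data_juicer.utils.llm_semantic_ops import InferenceStrategy, get_extract_prompt

prompt = get_extract_prompt(
    input_text="Alice joined Acme Corp in 2021 as a data engineer.",
    output_schema={
        "person": "Full name of the person mentioned",
        "company": "Employer organization",
        "year": "Year the person joined the company",
    },
    strategy=InferenceStrategy.COT,  # ask for step-by-step reasoning
)
print(prompt)
```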
- data_juicer.utils.llm_semantic_ops.get_condition_prompt(text: str, condition: str, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None) → str[source]#
Build the user prompt for an LLM condition filter (yes/no decision).
- data_juicer.utils.llm_semantic_ops.call_llm_sync(model: Any, messages: list, *, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None) → tuple[str, LLMCallUsage][source]#
Call the LLM synchronously and return (content, usage). Compatible with DJ model_utils.
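A sketch of a direct call, assuming an OpenAI-style chat message list and a model handle prepared elsewhere through Data-Juicer's model utilities (the handle itself is not constructed here):

```python
from data_juicer.utils.llm_semantic_ops import call_llm_sync

model = ...  # API / vLLM / HF model handle from data_juicer's model utilities

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Answer with one word: is 17 prime?"},
]
content, usage = call_llm_sync(
    model,
    messages,
    sampling_params={"temperature": 0.0},
)
print(content, usage.total_tokens)
```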
- data_juicer.utils.llm_semantic_ops.extract_one(input_text: str, output_schema: Dict[str, str], model: Any, *, system_prompt: str | None = None, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None, return_record_row: bool = False) → tuple[Dict[str, Any], LLMCallUsage] | tuple[RecordRow, LLMCallUsage][source]#
Extract structured fields from input_text using the model.
Returns (result, usage), where result is a dict by default, or a RecordRow if return_record_row=True. Works with both structured (JSON) and unstructured (e.g. plain text) input.
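A sketch of end-to-end extraction; the model handle is assumed to come from Data-Juicer's model utilities, and the schema is illustrative:

```python
from data_juicer.utils.llm_semantic_ops import extract_one

model = ...  # prepared elsewhere via data_juicer's model utilities

result, usage = extract_one(
    input_text="Alice joined Acme Corp in 2021 as a data engineer.",
    output_schema={
        "person": "Full name of the person mentioned",
        "company": "Employer organization",
        "year": "Year the person joined the company",
    },
    model=model,
)
print(result)              # e.g. {'person': 'Alice', 'company': 'Acme Corp', 'year': '2021'}
print(usage.total_tokens)  # tokens spent on this call
```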
- data_juicer.utils.llm_semantic_ops.condition_filter_one(text: str, condition: str, model: Any, *, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None) → tuple[bool, LLMCallUsage][source]#
Return (result, usage), where result is True iff the model judges that the text satisfies the condition (yes/no).
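A sketch of a yes/no condition check, with the same assumption about where the model handle comes from; the text and condition are made up:

```python
from data_juicer.utils.llm_semantic_ops import condition_filter_one

model = ...  # prepared elsewhere via data_juicer's model utilities

keep, usage = condition_filter_one(
    text="The package arrived broken and support never replied.",
    condition="The text describes a negative customer experience.",
    model=model,
)
print(keep)  # True if the model answers yes, otherwise False
```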