data_juicer.utils.llm_semantic_ops module#

LLM semantic ops: user-configurable extract/condition helpers.

Supports structured output (JSON matching a schema) and unstructured input (e.g. plain text, JSONL). The helpers are shared by llm_extract_mapper and llm_condition_filter, and can be reused for DataFrame, SQL, or database rows by adapting each input to a (text, schema) or (text, condition) pair. Names follow the llm_* convention.

class data_juicer.utils.llm_semantic_ops.LLMCallUsage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cost_estimate: float | None = None)[source]#

Bases: object

Token usage (and optional cost) for a single LLM call.

prompt_tokens: int = 0#

completion_tokens: int = 0#

total_tokens: int = 0#

cost_estimate: float | None = None#

to_dict() → Dict[str, Any][source]#

classmethod from_dict(d: Dict[str, Any]) → LLMCallUsage[source]#

__init__(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cost_estimate: float | None = None) → None#
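Because each call reports its own usage, a pipeline will usually want a running total. The following is a minimal sketch of that aggregation. `UsageSketch` is a hypothetical stand-in that only mirrors the documented fields of LLMCallUsage so the example runs without data_juicer installed; the fallback-to-default behaviour in `from_dict` and the helper `sum_usage` are assumptions, not part of the module.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class UsageSketch:
    """Hypothetical stand-in with the same fields as LLMCallUsage."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cost_estimate: Optional[float] = None

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "UsageSketch":
        # Assumption: absent keys fall back to the documented defaults.
        return cls(
            prompt_tokens=int(d.get("prompt_tokens", 0)),
            completion_tokens=int(d.get("completion_tokens", 0)),
            total_tokens=int(d.get("total_tokens", 0)),
            cost_estimate=d.get("cost_estimate"),
        )


def sum_usage(usages: Iterable[UsageSketch]) -> UsageSketch:
    """Add up per-call usage; cost stays None unless some call reported one."""
    total = UsageSketch()
    for u in usages:
        total.prompt_tokens += u.prompt_tokens
        total.completion_tokens += u.completion_tokens
        total.total_tokens += u.total_tokens
        if u.cost_estimate is not None:
            total.cost_estimate = (total.cost_estimate or 0.0) + u.cost_estimate
    return total


calls = [UsageSketch(10, 5, 15, 0.5), UsageSketch(20, 10, 30)]
total = sum_usage(calls)
print(total.to_dict())
```

Keeping cost_estimate as None when no call reported a cost avoids presenting a missing estimate as a real cost of zero.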

class data_juicer.utils.llm_semantic_ops.RecordRow(**extra_data: Any)[source]#

Bases: BaseModel

Single row of extracted fields; schema aligns with output_schema keys.

Use model_validate(d) or RecordRow(**d) for dict -> RecordRow. Use row.model_dump() for RecordRow -> dict. Extra keys from output_schema are allowed (extra='allow').

model_config = {'extra': 'allow'}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

classmethod from_schema_dict(d: Dict[str, Any], schema_keys: List[str] | None = None) → RecordRow[source]#

Build RecordRow from dict; optionally restrict to schema_keys.

to_dict() → Dict[str, Any][source]#

data_juicer.utils.llm_semantic_ops.record_batch_from_dicts(items: List[Dict[str, Any]], schema_keys: List[str] | None = None) → List[RecordRow][source]#

Convert list of dicts to RecordBatch (list of RecordRow).

data_juicer.utils.llm_semantic_ops.record_batch_to_dicts(batch: List[RecordRow]) → List[Dict[str, Any]][source]#

Convert RecordBatch to list of dicts.
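The effect of schema_keys is not spelled out beyond "optionally restrict". One plausible reading is sketched below using plain dicts in place of RecordRow, so that it runs without pydantic or data_juicer. The helper names `restrict_to_schema` and `batch_from_dicts` are hypothetical, and skipping keys that are absent from the input (rather than filling them with None) is an assumption.

```python
from typing import Any, Dict, List, Optional


def restrict_to_schema(
    d: Dict[str, Any], schema_keys: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Keep only schema_keys, in that order; keep everything if None."""
    if schema_keys is None:
        return dict(d)
    # Assumption: keys missing from d are skipped rather than set to None.
    return {k: d[k] for k in schema_keys if k in d}


def batch_from_dicts(
    items: List[Dict[str, Any]], schema_keys: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Apply the same restriction to every row of a batch."""
    return [restrict_to_schema(item, schema_keys) for item in items]


raw = [
    {"name": "Ada", "year": 1843, "note": "extra"},
    {"name": "Alan"},
]
batch = batch_from_dicts(raw, schema_keys=["name", "year"])
print(batch)
```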

class data_juicer.utils.llm_semantic_ops.InferenceStrategy(value)[source]#

Bases: Enum

DIRECT = 'direct'#

COT = 'cot'#

FEW_SHOT = 'few_shot'#

COT_SHOT = 'cot_shot'#
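Since the members carry string values, a strategy named in a config file can be resolved with the standard Enum lookup by value. The sketch below uses `StrategySketch`, a stand-in that copies the documented members so the example is self-contained; `parse_strategy` is a hypothetical helper, and the whitespace and case normalisation it performs is an assumption about what a caller might want.

```python
from enum import Enum
from typing import Optional


class StrategySketch(Enum):
    """Stand-in with the same members and values as InferenceStrategy."""

    DIRECT = "direct"
    COT = "cot"
    FEW_SHOT = "few_shot"
    COT_SHOT = "cot_shot"


def parse_strategy(name: Optional[str]) -> Optional[StrategySketch]:
    """Map a config string to a member; None means no strategy was set."""
    if name is None:
        return None
    # Enum lookup by value raises ValueError for unknown strings.
    return StrategySketch(name.strip().lower())


print(parse_strategy("cot"))
print(parse_strategy(" Few_Shot "))
```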

data_juicer.utils.llm_semantic_ops.get_extract_prompt(input_text: str, output_schema: Dict[str, str], knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None) → str[source]#

Build the user prompt for extraction. output_schema maps each output key to its extraction instruction: {key: instruction}.

data_juicer.utils.llm_semantic_ops.get_condition_prompt(text: str, condition: str, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None) → str[source]#

Build user prompt for LLM condition filter (yes/no).
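The actual prompt wording is not documented here. To illustrate how a {key: instruction} schema, optional grounding, and optional examples could be assembled into one user prompt, here is a rough sketch. The function `build_extract_prompt_sketch` and every phrase in the template are hypothetical and should not be read as the text the module produces; the strategy parameter is left out for brevity.

```python
from typing import Dict, Optional


def build_extract_prompt_sketch(
    input_text: str,
    output_schema: Dict[str, str],
    knowledge_grounding: Optional[str] = None,
    examples: Optional[str] = None,
) -> str:
    """Assemble a user prompt from a {key: instruction} schema (illustrative)."""
    lines = ["Extract the following fields and reply with a single JSON object."]
    # One bullet per output key, paired with its instruction.
    for key, instruction in output_schema.items():
        lines.append(f"- {key}: {instruction}")
    # Optional sections are added only when the caller supplies them.
    if knowledge_grounding:
        lines.append(f"Background knowledge:\n{knowledge_grounding}")
    if examples:
        lines.append(f"Examples:\n{examples}")
    lines.append(f"Input:\n{input_text}")
    return "\n".join(lines)


prompt = build_extract_prompt_sketch(
    "some text",
    {"title": "The paper title", "year": "Publication year as an integer"},
)
print(prompt)
```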

data_juicer.utils.llm_semantic_ops.call_llm_sync(model: Any, messages: list, *, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None) → tuple[str, LLMCallUsage][source]#

Call the LLM synchronously and return (content, usage). Compatible with Data-Juicer (DJ) model_utils.

data_juicer.utils.llm_semantic_ops.extract_one(input_text: str, output_schema: Dict[str, str], model: Any, *, system_prompt: str | None = None, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None, return_record_row: bool = False) → tuple[Dict[str, Any], LLMCallUsage] | tuple[RecordRow, LLMCallUsage][source]#

Extract structured fields from input_text using the model.

Returns (result, usage). result is dict by default, or RecordRow if return_record_row=True. Compatible with both structured (JSON) and unstructured (e.g. plain text) input.
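Model replies often wrap the JSON object in extra prose, so an extraction step needs some tolerant parsing before it can return a dict. How extract_one does this is not documented; the sketch below shows one common approach under stated assumptions. `parse_extraction_reply` is a hypothetical helper, and returning an empty dict on failure is an assumption rather than the module's behaviour.

```python
import json
from typing import Any, Dict


def parse_extraction_reply(reply: str, output_schema: Dict[str, str]) -> Dict[str, Any]:
    """Parse the span from the first '{' to the last '}' and keep schema keys."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return {}  # no JSON object found in the reply
    try:
        parsed = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return {}  # assumption: malformed JSON yields an empty result
    if not isinstance(parsed, dict):
        return {}
    # Drop any keys the schema did not ask for.
    return {k: parsed[k] for k in output_schema if k in parsed}


schema = {"title": "The paper title", "year": "Publication year"}
reply = 'Sure! {"title": "X", "year": 2020, "junk": 1} Done.'
result = parse_extraction_reply(reply, schema)
print(result)
```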

data_juicer.utils.llm_semantic_ops.condition_filter_one(text: str, condition: str, model: Any, *, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None) → tuple[bool, LLMCallUsage][source]#

Returns (result, usage), where result is True iff the model answers that the text satisfies the condition (yes/no).
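Turning a free-text yes/no reply into a bool requires a parsing rule, which is not specified here. A conservative sketch, assuming the decision is carried by the first word of the reply, could look like the following. `parse_yes_no` is a hypothetical helper; treating anything other than a leading "yes" as False is an assumption.

```python
import re


def parse_yes_no(reply: str) -> bool:
    """True only if the first word of the reply is 'yes' (case-insensitive)."""
    # Skip leading whitespace, then capture the first run of letters.
    match = re.match(r"\s*([A-Za-z]+)", reply)
    return bool(match) and match.group(1).lower() == "yes"


print(parse_yes_no("Yes, the text mentions a date."))
print(parse_yes_no("No."))
```

Matching the whole first word rather than a prefix keeps replies such as "yesterday" from being counted as a yes.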
