data_juicer.utils.llm_semantic_ops module

LLM semantic ops: user-configurable extract/condition helpers.

Supports both structured output (JSON matching a schema) and unstructured input (e.g. plain text, JSONL). Shared by llm_extract_mapper and llm_condition_filter; reusable for DataFrame, SQL, or database workloads by adapting each input to a (text, schema) or (text, condition) pair. Names follow the llm_* convention.

class data_juicer.utils.llm_semantic_ops.LLMCallUsage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cost_estimate: float | None = None)

Bases: object

Token usage (and optional cost) for a single LLM call.

prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
cost_estimate: float | None = None
to_dict() -> Dict[str, Any]
classmethod from_dict(d: Dict[str, Any]) -> LLMCallUsage
__init__(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cost_estimate: float | None = None) -> None
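As an illustration of how per-call usage records could be serialized and summed over several calls, here is a minimal sketch. `UsageSketch` is a hypothetical stand-in that only mirrors the documented fields and the `to_dict`/`from_dict` signatures; it is not the library's implementation. The `add_usage` helper and the handling of unknown or missing keys are assumptions made for this example.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class UsageSketch:
    """Hypothetical stand-in mirroring the documented LLMCallUsage fields."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cost_estimate: Optional[float] = None

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "UsageSketch":
        # Assumption: unknown keys are ignored, missing keys keep defaults.
        known = {k: d[k] for k in cls.__dataclass_fields__ if k in d}
        return cls(**known)


def add_usage(a: UsageSketch, b: UsageSketch) -> UsageSketch:
    """Hypothetical helper: sum two usage records field by field."""
    if a.cost_estimate is None and b.cost_estimate is None:
        cost = None  # no cost information on either side
    else:
        cost = (a.cost_estimate or 0.0) + (b.cost_estimate or 0.0)
    return UsageSketch(
        prompt_tokens=a.prompt_tokens + b.prompt_tokens,
        completion_tokens=a.completion_tokens + b.completion_tokens,
        total_tokens=a.total_tokens + b.total_tokens,
        cost_estimate=cost,
    )


first = UsageSketch(prompt_tokens=120, completion_tokens=30, total_tokens=150)
second = UsageSketch.from_dict(
    {"prompt_tokens": 80, "completion_tokens": 20, "total_tokens": 100}
)
combined = add_usage(first, second)
```

Keeping `cost_estimate` as `None` when neither call reports a cost avoids presenting a missing estimate as a zero cost.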
class data_juicer.utils.llm_semantic_ops.RecordRow(**extra_data: Any)

Bases: BaseModel

Single row of extracted fields; schema aligns with output_schema keys.

Use model_validate(d) or RecordRow(**d) for dict -> RecordRow. Use row.model_dump() for RecordRow -> dict. Extra keys from output_schema are allowed (extra='allow').

model_config = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

classmethod from_schema_dict(d: Dict[str, Any], schema_keys: List[str] | None = None) -> RecordRow

Build RecordRow from dict; optionally restrict to schema_keys.

to_dict() -> Dict[str, Any]
data_juicer.utils.llm_semantic_ops.record_batch_from_dicts(items: List[Dict[str, Any]], schema_keys: List[str] | None = None) -> List[RecordRow]

Convert list of dicts to RecordBatch (list of RecordRow).

data_juicer.utils.llm_semantic_ops.record_batch_to_dicts(batch: List[RecordRow]) -> List[Dict[str, Any]]

Convert RecordBatch to list of dicts.
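The idea of restricting each row to schema_keys while converting a batch can be sketched without the library. The code below uses plain dicts as stand-in rows. `restrict_to_schema` and `batch_from_dicts_sketch` are hypothetical names, and silently skipping keys that are absent from the input is an assumption, since the source does not say how missing keys are handled.

```python
from typing import Any, Dict, List, Optional


def restrict_to_schema(
    d: Dict[str, Any], schema_keys: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: keep only the keys named in schema_keys."""
    if schema_keys is None:
        return dict(d)  # no restriction: copy every key
    # Assumption: keys missing from the input are skipped, not filled in.
    return {k: d[k] for k in schema_keys if k in d}


def batch_from_dicts_sketch(
    items: List[Dict[str, Any]], schema_keys: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a dict-list to row-list conversion."""
    return [restrict_to_schema(item, schema_keys) for item in items]


items = [
    {"name": "Ada", "year": 1843, "note": "extra"},
    {"name": "Alan", "year": 1936},
]
rows = batch_from_dicts_sketch(items, schema_keys=["name", "year"])
```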

class data_juicer.utils.llm_semantic_ops.InferenceStrategy(value)

Bases: Enum

DIRECT = 'direct'
COT = 'cot'
FEW_SHOT = 'few_shot'
COT_SHOT = 'cot_shot'
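Because the members carry string values, a strategy named in a config file can be resolved with the standard Enum lookup by value. The sketch below uses `StrategySketch`, a stand-in enum with the same members as documented above, so that it runs without the library. `parse_strategy` is a hypothetical helper, and the case-insensitive matching is an assumption of this example.

```python
from enum import Enum


class StrategySketch(Enum):
    """Stand-in with the same members as the documented InferenceStrategy."""

    DIRECT = "direct"
    COT = "cot"
    FEW_SHOT = "few_shot"
    COT_SHOT = "cot_shot"


def parse_strategy(name: str) -> StrategySketch:
    """Hypothetical helper: map a config string to an enum member."""
    try:
        # Enum lookup by value; normalizing case is an assumption.
        return StrategySketch(name.strip().lower())
    except ValueError:
        valid = ", ".join(s.value for s in StrategySketch)
        raise ValueError(
            f"unknown strategy {name!r}; expected one of: {valid}"
        ) from None


strategy = parse_strategy("CoT")
```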
data_juicer.utils.llm_semantic_ops.get_extract_prompt(input_text: str, output_schema: Dict[str, str], knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None) -> str

Build the user prompt for extraction. output_schema maps each output key to the instruction for extracting it: {key: instruction}.
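To make the {key: instruction} shape concrete, here is a minimal sketch of how such a schema could be rendered into a prompt. `build_extract_prompt_sketch` is a hypothetical function; the wording and layout of the prompt are assumptions of this example and do not reflect the template the library actually uses.

```python
from typing import Dict, Optional


def build_extract_prompt_sketch(
    input_text: str,
    output_schema: Dict[str, str],
    knowledge_grounding: Optional[str] = None,
    examples: Optional[str] = None,
) -> str:
    """Hypothetical prompt builder; the template is an assumption."""
    # One bullet per output key, paired with its extraction instruction.
    field_lines = "\n".join(
        f'- "{key}": {instruction}' for key, instruction in output_schema.items()
    )
    parts = [
        "Extract the following fields and answer with one JSON object.",
        "Fields:\n" + field_lines,
    ]
    # Optional sections are added only when the caller supplies them.
    if knowledge_grounding:
        parts.append("Background knowledge:\n" + knowledge_grounding)
    if examples:
        parts.append("Examples:\n" + examples)
    parts.append("Input:\n" + input_text)
    return "\n\n".join(parts)


schema = {
    "city": "The city mentioned in the text.",
    "year": "The four-digit year of the event.",
}
prompt = build_extract_prompt_sketch(
    "The tower opened in Paris in 1889.", schema
)
```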

data_juicer.utils.llm_semantic_ops.get_condition_prompt(text: str, condition: str, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None) -> str

Build user prompt for LLM condition filter (yes/no).

data_juicer.utils.llm_semantic_ops.call_llm_sync(model: Any, messages: list, *, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None) -> tuple[str, LLMCallUsage]

Call LLM synchronously; return (content, usage). Compatible with DJ model_utils.

data_juicer.utils.llm_semantic_ops.extract_one(input_text: str, output_schema: Dict[str, str], model: Any, *, system_prompt: str | None = None, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None, return_record_row: bool = False) -> tuple[Dict[str, Any], LLMCallUsage] | tuple[RecordRow, LLMCallUsage]

Extract structured fields from input_text using the model.

Returns (result, usage). result is dict by default, or RecordRow if return_record_row=True. Compatible with both structured (JSON) and unstructured (e.g. plain text) input.
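Turning a free-form model reply into a dict keyed by output_schema can be sketched as follows. `parse_extraction_reply` is a hypothetical helper, not part of the module. Locating the JSON object between the first `{` and the last `}`, and filling missing or unparseable fields with None, are assumptions of this example.

```python
import json
from typing import Any, Dict


def parse_extraction_reply(
    reply: str, output_schema: Dict[str, str]
) -> Dict[str, Any]:
    """Hypothetical parser: pull one JSON object out of a model reply."""
    empty = {key: None for key in output_schema}
    # Tolerate surrounding prose or code fences by slicing the braces.
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return empty
    # Keep only schema keys; absent keys become None (an assumption).
    return {key: data.get(key) for key in output_schema}


schema = {"city": "The city mentioned.", "year": "The four-digit year."}
reply = 'Sure, here it is:\n```json\n{"city": "Paris", "year": 1889, "extra": 1}\n```'
result = parse_extraction_reply(reply, schema)
```

Returning a dict with every schema key present, even on failure, keeps downstream column handling uniform at the cost of hiding parse errors; a real pipeline may prefer to surface them.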

data_juicer.utils.llm_semantic_ops.condition_filter_one(text: str, condition: str, model: Any, *, knowledge_grounding: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, sampling_params: Dict | None = None) -> tuple[bool, LLMCallUsage]

True iff the model says the text satisfies the condition (yes/no). Returns (result, usage).
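Mapping a yes/no reply to a boolean can be sketched as below. `parse_yes_no` is a hypothetical helper. Treating only a reply whose first word is "yes" as True, and everything else (including an empty reply) as False, is an assumption of this example rather than the module's documented parsing rule.

```python
import re


def parse_yes_no(reply: str) -> bool:
    """Hypothetical parser: True only if the first word is 'yes'."""
    # Lowercase and split into alphabetic words so punctuation is ignored.
    words = re.findall(r"[a-z]+", reply.lower())
    return bool(words) and words[0] == "yes"


# Filter a small batch using canned replies in place of a real model.
replies = {
    "doc-1": "Yes.",
    "doc-2": "No, although yes was tempting.",
    "doc-3": "",
}
kept = [doc_id for doc_id, reply in replies.items() if parse_yes_no(reply)]
```

Defaulting to False on ambiguous replies means uncertain samples are filtered out rather than kept; the opposite default may suit recall-oriented pipelines better.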