data_juicer.ops.filter.llm_condition_filter module

LLM condition filter: keep samples satisfying a user-given condition.

Part of the llm_* ops family. It makes a yes/no decision based on a user-specified condition string, unlike llm_analysis_filter, which uses fixed dimensions.

class data_juicer.ops.filter.llm_condition_filter.LLMConditionFilter(*args, **kwargs)

Filters samples by a user-given natural-language condition, judged yes/no by an LLM.

Reads the text to evaluate from text_key, writes the LLM's verdict to stats.llm_condition_filter_result, and keeps the sample if that value is True.

__init__(text_key: str = 'text', condition: str = '', api_or_hf_model: str = 'gpt-4o', *, knowledge_grounding_key: str | None = None, knowledge_grounding_fixed: str | None = None, is_hf_model: bool = False, enable_vllm: bool = False, api_endpoint: str | None = None, response_path: str | None = None, strategy: InferenceStrategy | None = None, examples: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: dict | None = None, sampling_params: dict | None = None, **kwargs)

Parameters:

text_key – Sample key for the text to evaluate.
condition – Natural language condition (e.g. “contains X”).
api_or_hf_model – Model name.
knowledge_grounding_key – Optional sample key for per-sample grounding.
knowledge_grounding_fixed – Optional fixed grounding string.
strategy – Prompt strategy for condition inference (direct/cot/few_shot/cot_shot).
examples – Optional examples text used by few-shot strategies.
try_num – Retries on API/parse failure; the result is treated as False after all attempts fail.
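The try_num behaviour described above (retry on API or parse failure, then fall back to False) can be sketched in plain Python. This is an illustration only, not the operator's actual implementation: `parse_yes_no`, `judge_with_retries`, and `flaky_model` are hypothetical names, and the sketch assumes try_num counts total attempts.

```python
from typing import Callable, Optional


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it cannot be parsed."""
    word = reply.strip().lower().rstrip(".!")
    if word in ("yes", "true"):
        return True
    if word in ("no", "false"):
        return False
    return None


def judge_with_retries(
    call_model: Callable[[str], str], prompt: str, try_num: int = 3
) -> bool:
    """Ask for a yes/no verdict up to try_num times; fall back to False."""
    for _ in range(try_num):
        try:
            verdict = parse_yes_no(call_model(prompt))
        except Exception:
            continue  # API-style failure: try again
        if verdict is not None:
            return verdict  # parsed successfully
    return False  # every attempt raised or was unparseable


# Hypothetical stand-in for a model client: one error, one unparseable
# reply, then a clean "Yes." on the third attempt.
_replies = iter([RuntimeError("timeout"), "maybe?", "Yes."])


def flaky_model(prompt: str) -> str:
    item = next(_replies)
    if isinstance(item, Exception):
        raise item
    return item


result = judge_with_retries(flaky_model, "Does the text mention Python?", try_num=3)
print(result)
```

Falling back to False means a sample whose verdict could not be obtained is dropped rather than kept, which matches the "treat as False" note in the parameter description.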

compute_stats_single(sample: dict, rank: int | None = None, context: bool = False)

Compute stats for the sample; these stats are the metric used to decide whether to filter the sample.

Parameters:

sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample: dict, rank: int | None = None) → bool

Sample-level decision: takes a sample and returns a Boolean (True keeps the sample).
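The two methods form a two-step flow: one writes a verdict into the sample's stats, the other reads it back as the keep/drop decision. The following is a minimal stand-alone sketch of that flow, not the library's code. `STATS_KEY` is a hypothetical placeholder for the real stats field name, and the `judge` callable stands in for the LLM call.

```python
from typing import Callable

STATS_KEY = "stats"  # hypothetical placeholder for the sample's stats field
RESULT_KEY = "llm_condition_filter_result"


def compute_stats_single(sample: dict, judge: Callable[[str], bool]) -> dict:
    """Store the yes/no verdict for the sample's text under its stats."""
    stats = sample.setdefault(STATS_KEY, {})
    stats[RESULT_KEY] = bool(judge(sample["text"]))
    return sample


def process_single(sample: dict) -> bool:
    """Keep the sample only if the stored verdict is True."""
    return sample.get(STATS_KEY, {}).get(RESULT_KEY, False) is True


def mentions_python(text: str) -> bool:
    # Stand-in for the LLM verdict on the condition "mentions Python".
    return "python" in text.lower()


samples = [{"text": "Python is fun"}, {"text": "I like tea"}]
scored = [compute_stats_single(s, mentions_python) for s in samples]
kept = [s for s in scored if process_single(s)]
print([s["text"] for s in kept])
```

Keeping the verdict in the stats rather than deciding inline leaves it available for inspection after the run.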