data_juicer.ops.filter.llm_perplexity_filter module#
- class data_juicer.ops.filter.llm_perplexity_filter.LLMPerplexityFilter(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]#
Bases: Filter
Filter to keep samples whose perplexity score, computed with a specified LLM, is within a specific range.
- __init__(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]#
Initialization method.
- Parameters:
hf_model – Hugging Face model name used to compute perplexity.
model_params – Parameters for initializing the Hugging Face model.
min_score – Minimum perplexity score.
max_score – Maximum perplexity score.
query_template – Template for building the query string.
response_template – Template for building the response string.
args – extra args
kwargs – extra args
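The min/max filtering criterion can be sketched in plain Python. The helper below is a hypothetical illustration, not part of the Data-Juicer API: it assumes the model side has already produced a mean per-token negative log-likelihood, from which perplexity is `exp(mean NLL)`, and a sample is kept only if that perplexity falls within `[min_score, max_score]`.

```python
import math

def keep_by_perplexity(mean_nll, min_score=1.0, max_score=100.0):
    """Hypothetical helper: decide whether to keep a sample given the
    mean per-token negative log-likelihood under a language model.

    Perplexity is the exponential of the mean NLL; the sample is kept
    only if its perplexity lies within [min_score, max_score].
    """
    ppl = math.exp(mean_nll)
    return min_score <= ppl <= max_score
```

For example, a mean NLL of 0.0 gives perplexity 1.0 (kept at the default bounds, since the range is inclusive), while a mean NLL of 5.0 gives perplexity ≈ 148, which exceeds the default `max_score` of 100 and is filtered out.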
- compute_stats_single(sample, rank=None)[source]#
Compute stats for the sample; the stats are used as the metric for deciding whether to filter this sample.
- Parameters:
sample – input sample.
rank – optional process rank, used when loading the model onto a specific device.
- Returns:
sample with computed stats
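The two-phase compute-then-filter flow can be sketched as below. This is a simplified, hypothetical mock: the stats key `'llm_perplexity'` and the plain `'stats'` dict are illustrative assumptions, not the library's exact field names, and the perplexity function is stubbed out.

```python
def compute_stats_single(sample, ppl_fn):
    # Phase 1: compute the perplexity score and store it in the sample's
    # stats dict (key name is an assumption for illustration).
    stats = sample.setdefault('stats', {})
    stats['llm_perplexity'] = ppl_fn(sample['text'])
    return sample

def process_single(sample, min_score=1.0, max_score=100.0):
    # Phase 2: keep the sample only if its stored score is in range.
    score = sample['stats']['llm_perplexity']
    return min_score <= score <= max_score

# Stubbed scorer standing in for the LLM-based perplexity computation.
sample = compute_stats_single({'text': 'hello world'}, ppl_fn=lambda t: 42.0)
keep = process_single(sample)  # True: 42.0 lies within [1.0, 100.0]
```

Separating stats computation from the keep/drop decision lets the (expensive) model pass be cached and reused when only the score thresholds change.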