data_juicer.ops.filter.perplexity_filter module

class data_juicer.ops.filter.perplexity_filter.PerplexityFilter(lang: str = 'en', max_ppl: float = 1500, *args, **kwargs)

Bases: Filter

Filter that keeps samples whose perplexity score does not exceed a specified maximum value.

__init__(lang: str = 'en', max_ppl: float = 1500, *args, **kwargs)

Initialization method.

Parameters:
  • lang – Language of the samples for which perplexity is computed.

  • max_ppl – The maximum perplexity allowed by this op. Samples whose perplexity exceeds this value are filtered out.

  • args – extra positional arguments

  • kwargs – extra keyword arguments
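
The thresholding rule described by max_ppl can be illustrated with a minimal, library-independent sketch. It assumes the common definition of perplexity as the exponential of the negative mean token log-probability; the helper names perplexity and keep_sample are hypothetical and are not part of data_juicer, and the language model that PerplexityFilter uses internally is not shown here.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities (assumed definition)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def keep_sample(ppl: float, max_ppl: float = 1500) -> bool:
    """Hypothetical helper: keep a sample unless its perplexity exceeds max_ppl."""
    return ppl <= max_ppl


# Toy log-probabilities standing in for a real language model's output.
fluent = [math.log(0.2)] * 5      # each token fairly likely
noisy = [math.log(0.0001)] * 5    # each token very unlikely

for name, log_probs in [("fluent", fluent), ("noisy", noisy)]:
    ppl = perplexity(log_probs)
    print(name, round(ppl, 1), "keep" if keep_sample(ppl) else "drop")
```

With the default max_ppl of 1500, the first toy sample is kept and the second is dropped. Lowering max_ppl makes the filter stricter.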

compute_stats_batched(samples, context=False)
process_batched(samples)