data_juicer.ops.filter.stopwords_filter module#

class data_juicer.ops.filter.stopwords_filter.StopWordsFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]#

Bases: Filter

Filter to keep samples whose stopword ratio is larger than a specified minimum value.

__init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]#

Initialization method.

Parameters:
  • lang – The language whose stopword list to use. If lang == "all", a list merged from all available languages is adopted.

  • tokenization – Whether to use a model to tokenize documents.

  • min_ratio – The minimum stopword ratio required to keep a sample.

  • stopwords_dir – The directory storing the stopword file(s); file names must include "stopwords" and the files must be in JSON format.

  • use_words_aug – Whether to augment words, especially useful for Chinese and Vietnamese.

  • words_aug_group_sizes – The group sizes of words to augment.

  • words_aug_join_char – The join character between words to augment.

  • args – extra args

  • kwargs – extra args
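The keep/drop criterion these parameters configure can be sketched in plain Python. This is a minimal, self-contained illustration, not the op's actual implementation: the toy STOPWORDS set stands in for the per-language lists loaded from stopwords_dir, and whitespace splitting stands in for optional model tokenization.

```python
# Toy stopword list; the real op loads per-language JSON lists from
# `stopwords_dir` (hypothetical substitute for illustration only).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}


def stopword_ratio(text: str) -> float:
    """Fraction of whitespace-split tokens that are stopwords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def keep_sample(text: str, min_ratio: float = 0.3) -> bool:
    """Keep condition sketch: stopword ratio must clear `min_ratio`
    (assuming an inclusive threshold)."""
    return stopword_ratio(text) >= min_ratio


print(keep_sample("The cat is on the mat"))    # natural text: high ratio
print(keep_sample("x9f 2a7c qq zz38 r4nd0m"))  # gibberish: low ratio
```

Natural-language text tends to have a healthy share of stopwords, so a low ratio is a useful signal of gibberish or non-linguistic content.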

compute_stats_single(sample, context=False)[source]#

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]#

For sample level, sample --> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
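Together, compute_stats_single and process_single follow the Filter base class's two-step contract: first write a metric into the sample, then map that metric to a keep/drop boolean. The sketch below imitates that flow with hypothetical field names ("text", "stats", "stopwords_ratio") and a toy stopword set; the real op's sample schema, stopword lists, and tokenization differ.

```python
STOPWORDS = {"the", "is", "of", "a", "to", "and"}  # toy list, illustration only
MIN_RATIO = 0.3  # mirrors the default min_ratio


def compute_stats_single(sample: dict) -> dict:
    """Store the stopword ratio under a hypothetical 'stats' field."""
    words = sample["text"].lower().split()
    ratio = sum(w in STOPWORDS for w in words) / max(len(words), 1)
    sample.setdefault("stats", {})["stopwords_ratio"] = ratio
    return sample


def process_single(sample: dict) -> bool:
    """True for keeping, False for filtering out."""
    return sample["stats"]["stopwords_ratio"] >= MIN_RATIO


sample = compute_stats_single({"text": "The speed of light is constant"})
print(sample["stats"]["stopwords_ratio"], process_single(sample))  # 0.5 True
```

Splitting the work this way lets the computed stats be cached or inspected independently of the final keep/drop decision.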