data_juicer.ops.filter.text_action_filter module

class data_juicer.ops.filter.text_action_filter.TextActionFilter(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)

Bases: Filter

Filter to keep texts that contain actions.

__init__(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)

Initialization method.

Parameters:
  • lang – language of the text in the samples: 'en' to detect actions in English, 'zh' to detect actions in Chinese.

  • min_action_num – minimum number of actions required. Samples whose text contains fewer actions than this value are filtered out.

compute_stats_single(sample, context=False)

Compute stats for the sample. These stats serve as the metric for deciding whether to filter the sample.

Parameters:
  • sample – input sample.

  • context – whether to temporarily store context information of intermediate variables in the sample.

Returns:

sample with computed stats
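
As a rough illustration of the stats step, the sketch below counts actions using a tiny hard-coded word list and stores the count on the sample. The `ACTION_WORDS` set, the `"text"` field, and the `"stats"` / `"num_action"` keys are assumptions made for this example only. The operator's actual detection logic is not shown in this reference and will differ.

```python
import re

# Hypothetical stand-in for real action detection: a tiny verb list.
ACTION_WORDS = {"run", "runs", "jump", "jumps", "open", "opens", "write", "writes"}


def compute_stats_single(sample: dict) -> dict:
    """Attach an action count to the sample (illustrative only)."""
    stats = sample.setdefault("stats", {})
    # Skip the work if the stat was already computed for this sample.
    if "num_action" in stats:
        return sample
    tokens = re.findall(r"[a-z]+", sample["text"].lower())
    stats["num_action"] = sum(token in ACTION_WORDS for token in tokens)
    return sample


sample = compute_stats_single({"text": "She opens the door and runs outside."})
print(sample["stats"]["num_action"])  # prints 2 ("opens" and "runs")
```

The sample is returned with the new stat attached, so a later decision step can read the count without recomputing it.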

process_single(sample)

Sample-level decision: sample -> Boolean.

Parameters:

sample – the sample to evaluate for filtering

Returns:

True to keep the sample, False to filter it out
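
The keep-or-drop decision can be sketched as a threshold check on a precomputed action count. This is a minimal sketch, not the library's code: the `"stats"` and `"num_action"` keys and the standalone function signature are assumptions for illustration. The threshold rule follows the parameter description above, where samples with fewer actions than `min_action_num` are filtered.

```python
def process_single(sample: dict, min_action_num: int = 1) -> bool:
    """Return True to keep the sample, False to filter it out."""
    # Assumed layout: the action count was stored earlier under sample["stats"].
    return sample["stats"]["num_action"] >= min_action_num


samples = [
    {"text": "She opens the door and runs outside.", "stats": {"num_action": 2}},
    {"text": "A quiet blue sky.", "stats": {"num_action": 0}},
]

# Keep only the samples that meet the threshold.
kept = [s["text"] for s in samples if process_single(s, min_action_num=1)]
print(kept)  # prints ['She opens the door and runs outside.']
```

A sample whose count equals `min_action_num` is kept, since only counts strictly below the threshold are filtered.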