data_juicer.ops.filter.alphanumeric_filter module
- class data_juicer.ops.filter.alphanumeric_filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)
Bases: Filter
Filter to keep samples whose alphanumeric ratio falls within a specific range.
- __init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)
Initialization method.
- Parameters:
tokenization – Whether to compute the ratio of alphanumeric content to the total number of tokens. If tokenization=False, the ratio of alphanumeric characters to the total number of characters is computed instead.
min_ratio – The minimum ratio for this op. Samples are filtered out if their alphanumeric ratio is below this value.
max_ratio – The maximum ratio for this op. Samples are filtered out if their alphanumeric ratio exceeds this value. The default, 9223372036854775807 (2**63 - 1), effectively disables the upper bound.
args – extra positional args
kwargs – extra keyword args
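The character-level behavior described above (tokenization=False) can be illustrated with a minimal standalone sketch. It does not use or reproduce the Data-Juicer implementation: `alnum_char_ratio` and `keep_sample` are hypothetical helper names, `str.isalnum()` is an assumed way of deciding what counts as alphanumeric, and treating the bounds as inclusive is an assumption drawn from the wording "below" and "exceeds". The token-based mode is not covered.

```python
import sys


def alnum_char_ratio(text: str) -> float:
    """Fraction of characters in `text` that are alphanumeric.

    Hypothetical helper: uses str.isalnum(), which is Unicode-aware,
    so the library's own counting rule may differ.
    """
    if not text:
        # Assumption: an empty sample has ratio 0.0 (avoids division by zero).
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_sample(text: str,
                min_ratio: float = 0.25,
                max_ratio: float = sys.maxsize) -> bool:
    """Return True if the sample's ratio lies within [min_ratio, max_ratio].

    Assumption: both bounds are inclusive, i.e. a sample is dropped only
    when its ratio is strictly below min_ratio or strictly above max_ratio.
    """
    ratio = alnum_char_ratio(text)
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    for sample in ["abc123", "ab!!", "!!!! ???", ""]:
        print(repr(sample), alnum_char_ratio(sample), keep_sample(sample))
```

Under these assumptions, a sample made up only of punctuation and whitespace has a ratio of 0.0 and is dropped at the default min_ratio of 0.25, while ordinary text passes. Because a character-level ratio can never exceed 1.0, the default max_ratio only has an effect when it is lowered below 1.0.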