data_juicer.ops.filter.token_num_filter module
- class data_juicer.ops.filter.token_num_filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)
Bases: Filter
Filter to keep samples with total token number within a specific range.
- __init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)
Initialization method.
- Parameters:
hf_tokenizer – the name of the Hugging Face tokenizer to use.
min_num – the minimum token count; samples with fewer tokens than this are filtered out.
max_num – the maximum token count; samples with more tokens than this are filtered out.
args – extra positional arguments.
kwargs – extra keyword arguments.
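The keep rule implied by min_num and max_num can be illustrated with a minimal sketch. It does not call data_juicer or load a tokenizer; `keep_by_token_count` and `INT64_MAX` are hypothetical names introduced here, and the inclusive bounds are inferred from the parameter descriptions above (a sample is dropped only when its count is below min_num or exceeds max_num).

```python
# Hypothetical constant: equals the documented default for max_num.
INT64_MAX = 2**63 - 1  # 9223372036854775807


def keep_by_token_count(num_tokens: int,
                        min_num: int = 10,
                        max_num: int = INT64_MAX) -> bool:
    """Return True when the token count lies within [min_num, max_num].

    Illustrative helper only; it is not part of data_juicer.
    """
    return min_num <= num_tokens <= max_num


# Token counts for four imaginary samples.
counts = [3, 10, 512, 513]
kept = [n for n in counts if keep_by_token_count(n, min_num=10, max_num=512)]
print(kept)
```

With the default max_num, the upper bound is effectively disabled, so only min_num constrains which samples are kept.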
- compute_stats_single(sample)
Compute the stats for a sample; these serve as the metric for deciding whether to filter the sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
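As a rough illustration of the "compute stats, then return the sample" pattern, the sketch below attaches a token count to a dict-shaped sample. Everything in it is an assumption made for the example: the `"text"`, `"stats"`, and `"num_token"` keys, the `compute_stats_single_sketch` function, the skip-if-already-computed behavior, and the whitespace tokenizer standing in for the Hugging Face tokenizer named by hf_tokenizer. It is not the library's implementation.

```python
from typing import Callable, Dict, List

# Hypothetical key names chosen for this sketch.
STATS_KEY = "stats"
NUM_TOKEN_KEY = "num_token"


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace instead of using a real model."""
    return text.split()


def compute_stats_single_sketch(
    sample: Dict,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Dict:
    """Store the token count of sample["text"] under the stats dict and return the sample."""
    stats = sample.setdefault(STATS_KEY, {})
    # Assumption: a count that is already present is reused rather than recomputed.
    if NUM_TOKEN_KEY not in stats:
        stats[NUM_TOKEN_KEY] = len(tokenize(sample["text"]))
    return sample


sample = compute_stats_single_sketch({"text": "a short example sentence"})
print(sample[STATS_KEY][NUM_TOKEN_KEY])
```

Separating the stat computation from the keep decision lets the stored count be inspected or reused with different thresholds without tokenizing again.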