data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator module
- class data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator.MinhashCalculator(num_hash_aggregators_per_node, num_permutation, num_bands, num_rows_per_band, union_find_parallel_num, text_key, tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, tokenizer_model: str | None = None)
Bases:
object
- class data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator.MinhashFilter(num_nodes, union_find_parallel_num, max_pending_filter_tasks, num_filter_task_returns)
Bases:
object
- class data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator.RayBTSMinhashCppDeduplicator(*args, **kwargs)
Bases:
Deduplicator
A MinHash LSH deduplicator that operates in Ray distributed mode with C++ acceleration.
Same as ray_bts_minhash_deduplicator but with tokenization and MinHash signature computation implemented in C++ for improved performance.
This operator uses the MinHash LSH technique to identify and remove near-duplicate samples from a dataset. It supports several tokenization methods: space, punctuation, character, and sentencepiece. Two samples are considered duplicates when their Jaccard similarity, computed over the shingles of their text, is greater than or equal to the specified threshold; in that case, one of the samples is filtered out. The operator computes the MinHash values for each sample and uses a union-find algorithm to group similar samples.
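The ideas in the description above (shingling, Jaccard similarity, and union-find grouping) can be illustrated with a small pure-Python sketch. This is not the operator's implementation: it compares every pair of texts with the exact Jaccard similarity instead of MinHash LSH, it only covers space tokenization, and the names `shingles`, `jaccard`, `UnionFind`, and `deduplicate` are hypothetical helpers introduced for illustration. Treating texts shorter than the window as having no shingles, and giving empty shingle sets a similarity of 0.0, are assumptions of the sketch.

```python
def shingles(text, window_size=5, lowercase=True):
    """Return the set of word shingles (space tokenization) of a text.

    Assumption of this sketch: texts with fewer than `window_size`
    tokens yield an empty set.
    """
    if lowercase:
        text = text.lower()
    tokens = text.split()
    return {
        " ".join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    }


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: empty texts never match anything
    return len(a & b) / len(union)


class UnionFind:
    """Minimal union-find with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller index as the root, so the first sample
            # of each group is the one that survives.
            self.parent[max(ra, rb)] = min(ra, rb)


def deduplicate(texts, threshold=0.7, window_size=5):
    """Keep one text per group of near-duplicates (brute-force version)."""
    sets = [shingles(t, window_size) for t in texts]
    uf = UnionFind(len(texts))
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if jaccard(sets[i], sets[j]) >= threshold:
                uf.union(i, j)
    return [t for i, t in enumerate(texts) if uf.find(i) == i]
```

The pairwise loop is quadratic in the number of samples, which is the cost that MinHash signatures and LSH banding are meant to avoid on large datasets.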
- EMPTY_HASH_VALUE = 'EMPTY'
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, tokenizer_model: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, *args, **kwargs)
Initialization method.
- Parameters:
tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. We recommend ‘space’ for English-like languages, ‘character’ for Chinese-like languages, and ‘sentencepiece’ for multilingual data. If using ‘sentencepiece’, please provide the model path in the ‘tokenizer_model’ field.
window_size – window size of shingling
lowercase – whether to convert text to lower case first
ignore_pattern – pattern of sub-strings to ignore when computing MinHash. If None, no sub-strings are ignored
num_permutations – number of permutations used in the MinHash computation
jaccard_threshold – the minimum Jaccard similarity threshold for near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op keeps only one of them after deduplication
num_bands – number of bands in LSH. Defaults to None, in which case it is determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives
num_rows_per_band – number of rows in each band in LSH. Defaults to None, in which case it is determined by the same optimal-parameter computation
tokenizer_model – path for the sentencepiece model, used for sentencepiece tokenization.
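
The num_bands and num_rows_per_band descriptions above refer to choosing parameters that minimize a weighted sum of false-positive and false-negative probabilities. A minimal sketch of one common way to do this is shown below. It relies on the standard LSH result that two samples with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b, for b bands of r rows each. The function names, the equal default weights, and the midpoint-rule integration are assumptions of the sketch, not details confirmed by this module.

```python
def candidate_probability(s, num_bands, num_rows_per_band):
    """Probability that two samples with Jaccard similarity `s`
    collide in at least one LSH band."""
    return 1.0 - (1.0 - s ** num_rows_per_band) ** num_bands


def integrate(f, lo, hi, steps=200):
    """Midpoint-rule numerical integration of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def optimal_params(threshold, num_permutations,
                   fp_weight=0.5, fn_weight=0.5):
    """Search all (bands, rows) with bands * rows <= num_permutations
    and return the pair with the lowest weighted error.

    The weights are hypothetical defaults chosen for illustration.
    """
    best = None
    for b in range(1, num_permutations + 1):
        for r in range(1, num_permutations // b + 1):
            # False positives: pairs below the threshold that still collide.
            fp = integrate(
                lambda s: candidate_probability(s, b, r), 0.0, threshold)
            # False negatives: pairs above the threshold that never collide.
            fn = integrate(
                lambda s: 1.0 - candidate_probability(s, b, r),
                threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if best is None or error < best[0]:
                best = (error, b, r)
    return best[1], best[2]


if __name__ == "__main__":
    bands, rows = optimal_params(threshold=0.7, num_permutations=256)
    print(bands, rows)
```

Raising fp_weight relative to fn_weight favors settings that admit fewer spurious candidate pairs, at the cost of missing more true near-duplicates.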