data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator module

class data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator.MinhashCalculator(num_hash_aggregators_per_node, num_permutation, num_bands, num_rows_per_band, union_find_parallel_num, text_key, tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, tokenizer_model: str | None = None)[source]

Bases: object

__init__(num_hash_aggregators_per_node, num_permutation, num_bands, num_rows_per_band, union_find_parallel_num, text_key, tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, tokenizer_model: str | None = None)[source]
calc_minhash(text_list: Array, uid_begin: int, thread_num: int = 4) Table[source]
class data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator.MinhashFilter(num_nodes, union_find_parallel_num, max_pending_filter_tasks, num_filter_task_returns)[source]

Bases: object

__init__(num_nodes, union_find_parallel_num, max_pending_filter_tasks, num_filter_task_returns)[source]
class data_juicer.ops.deduplicator.ray_bts_minhash_cpp_deduplicator.RayBTSMinhashCppDeduplicator(*args, **kwargs)[source]

Bases: Deduplicator

A MinHash LSH-based near-duplicate deduplicator for Ray. It is implemented as a Deduplicator sub-class.

EMPTY_HASH_VALUE = 'EMPTY'
__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, tokenizer_model: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend 'space'; for Chinese-like languages, 'character'; and for multilingual data, 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.

  • window_size -- window size used for shingling

  • lowercase -- whether to convert text to lower case first

  • ignore_pattern -- pattern of sub-strings to ignore when computing minhash. Default is None.

  • num_permutations -- number of permutations used in the minhash computation

  • jaccard_threshold -- the minimum Jaccard similarity threshold for near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op keeps only one of them after deduplication

  • num_bands -- number of bands in LSH. Default is None, in which case it is determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band -- number of rows in each band in LSH. Default is None, in which case it is determined by the same optimal-parameter computation

  • tokenizer_model -- path for the sentencepiece model, used for sentencepiece tokenization.
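
To make the interplay of window_size, jaccard_threshold, num_permutations, num_bands and num_rows_per_band more concrete, here is a minimal, standard-library-only sketch of word shingling, Jaccard similarity, and a brute-force search for the band layout that minimizes a weighted sum of false-positive and false-negative probabilities. It is not this module's implementation: the function names (shingles, jaccard, candidate_probability, optimal_bands), the equal default weights, and the midpoint-rule integration are all assumptions chosen for illustration.

```python
def shingles(text, window_size=5, lowercase=True):
    """Space-tokenize text and return its set of word n-grams (shingles)."""
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if len(tokens) < window_size:
        # Texts shorter than the window yield one shingle (or none if empty).
        return {" ".join(tokens)} if tokens else set()
    return {
        " ".join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    }


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_probability(s, num_bands, num_rows_per_band):
    """Probability that two texts with Jaccard similarity s agree on
    every row of at least one band, i.e. become a candidate pair."""
    return 1.0 - (1.0 - s ** num_rows_per_band) ** num_bands


def _integrate(f, lo, hi, steps=200):
    """Midpoint-rule numerical integration (illustrative accuracy only)."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h


def optimal_bands(threshold, num_permutations, fp_weight=0.5, fn_weight=0.5):
    """Search all (bands, rows) with bands * rows <= num_permutations and
    return the pair minimizing the weighted false-positive/negative area."""
    best, best_err = (1, 1), float("inf")
    for b in range(1, num_permutations + 1):
        for r in range(1, num_permutations // b + 1):
            # False positives: pairs below the threshold that still collide.
            fp = _integrate(lambda s: candidate_probability(s, b, r),
                            0.0, threshold)
            # False negatives: pairs above the threshold that never collide.
            fn = _integrate(lambda s: 1.0 - candidate_probability(s, b, r),
                            threshold, 1.0)
            err = fp_weight * fp + fn_weight * fn
            if err < best_err:
                best, best_err = (b, r), err
    return best
```

Under these assumptions, calling optimal_bands(0.7, 256) would play the role that leaving num_bands and num_rows_per_band at None plays in the operator, and jaccard(shingles(a), shingles(b)) is the quantity that the minhash signatures estimate for comparison against jaccard_threshold.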

merge_op_batch(object_refs)[source]
merge()[source]
run(dataset)[source]