data_juicer.ops.deduplicator.document_deduplicator module

class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)

Bases: Deduplicator

Deduplicates samples at the document level using exact matching.

Samples are compared by their MD5 hash.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)

Initialization method.

Parameters:

lowercase -- Whether to convert the sample text to lowercase.
ignore_non_character -- Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation.
args -- extra positional arguments.
kwargs -- extra keyword arguments.
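
To illustrate how the two flags could affect which documents count as exact matches, here is a minimal sketch of a text normalization step. The helper name `normalize_text` and the regular expression are assumptions made for illustration, not the library's confirmed implementation; in particular, "punctuation" is taken to mean ASCII punctuation only.

```python
import re
import string

# Assumption: "non-alphabet characters" covers whitespace, digits,
# and ASCII punctuation (string.punctuation).
_NON_CHARACTER = re.compile(
    r"\s+|\d+|[" + re.escape(string.punctuation) + "]"
)


def normalize_text(text, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: normalize text before exact matching."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        # Drop whitespace, digits, and punctuation entirely.
        text = _NON_CHARACTER.sub("", text)
    return text


print(normalize_text("Hello, World 42!", lowercase=True, ignore_non_character=True))
```

With both flags enabled, "Hello, World 42!" and "hello world" normalize to the same string, so they would be treated as duplicates; with both flags left at False, the text is compared unchanged.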

compute_hash(sample)

Compute the MD5 hash value of the sample.

Parameters: sample -- input sample
Returns: sample with its MD5 hash value.
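
As a rough sketch of the idea, the following computes an MD5 digest with Python's standard `hashlib` and attaches it to the sample. The sample layout (a dict with a `"text"` field) and the `"hash"` key are assumptions for illustration; the actual field names used by the library may differ.

```python
import hashlib


def compute_hash_sketch(sample, text_key="text", hash_key="hash"):
    """Hypothetical stand-in: attach an MD5 hex digest to a sample dict."""
    text = sample[text_key]
    # MD5 operates on bytes, so the text is encoded first.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    # Return a copy so the input sample is left untouched.
    return {**sample, hash_key: digest}


a = compute_hash_sketch({"text": "same document"})
b = compute_hash_sketch({"text": "same document"})
c = compute_hash_sketch({"text": "another document"})
print(a["hash"] == b["hash"], a["hash"] == c["hash"])
```

Identical texts always produce identical digests, which is what makes the hash usable as an exact-match key.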

process(dataset, show_num=0)

Document-level deduplication: takes a dataset and returns a dataset.

Parameters:

dataset -- input dataset
show_num -- number of traced samples to collect when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
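
The behavior described above can be sketched on a plain list of dicts. This is a simplified illustration under stated assumptions, not the library's implementation: the function name `dedup_exact`, the `"text"` field, and the choice to keep the first occurrence of each hash are all hypothetical.

```python
import hashlib


def dedup_exact(samples, show_num=0):
    """Hypothetical sketch: keep one sample per MD5 hash.

    Returns the deduplicated list and up to `show_num` duplicate pairs.
    """
    first_seen = {}  # digest -> first sample seen with that digest
    kept = []
    dup_pairs = []
    for sample in samples:
        digest = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        if digest not in first_seen:
            first_seen[digest] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            # Record (original, duplicate) for inspection.
            dup_pairs.append((first_seen[digest], sample))
    return kept, dup_pairs


samples = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
kept, pairs = dedup_exact(samples, show_num=1)
print(kept)
print(pairs)
```

With `show_num=0`, no duplicate pairs are collected and only the deduplicated samples carry information.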
