data_juicer.ops.deduplicator.document_deduplicator module
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicator that removes duplicate samples at the document level using exact matching.
Duplicates are detected by comparing MD5 hashes of the samples.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Initialization method.
- Parameters:
lowercase -- Whether to convert the sample text to lowercase.
ignore_non_character -- Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation.
args -- Extra positional arguments.
kwargs -- Extra keyword arguments.
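The two options above change what counts as an "exact" match, which is easier to see in a small standalone sketch. The code below is not the Data-Juicer implementation: `normalize` and `deduplicate` are hypothetical helper names, the definition of "non-alphabetic" (whitespace, digits, ASCII punctuation) is an assumption based on the parameter description, and keeping the first occurrence of each duplicate is also an assumption.

```python
import hashlib
import string


def normalize(text, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: prepare text before hashing."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        # Assumption: drop whitespace, digits, and ASCII punctuation.
        text = "".join(
            ch for ch in text
            if not (ch.isspace() or ch.isdigit() or ch in string.punctuation)
        )
    return text


def deduplicate(docs, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: keep the first document for each MD5 digest."""
    seen = set()
    kept = []
    for doc in docs:
        prepared = normalize(doc, lowercase, ignore_non_character)
        digest = hashlib.md5(prepared.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Other 123"]

# With both options off, only byte-identical texts collapse.
strict = deduplicate(docs)

# With both options on, case and punctuation differences are ignored.
loose = deduplicate(docs, lowercase=True, ignore_non_character=True)
```

In this sketch, `strict` still contains both "Hello, World!" and "hello world", while `loose` treats them as the same document because both normalize to the same string before hashing.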