data_juicer.ops.deduplicator.document_deduplicator module
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Bases: Deduplicator
Deduplicator that removes duplicate samples at the document level using exact matching.
Each sample's text is hashed with MD5, and samples with identical hashes are treated as duplicates.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Initialization method.
- Parameters:
lowercase – Whether to convert the sample text to lower case before hashing
ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation
args – extra positional arguments
kwargs – extra keyword arguments
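
The following is a minimal standalone sketch of exact-match deduplication with MD5, intended only to illustrate how the two options could affect which samples count as duplicates. It does not import data_juicer and is not the library's implementation: the helper names `text_md5` and `dedup_exact` are hypothetical, and the regular expression used for "non-alphabet characters" (whitespace, digits, and ASCII punctuation) is an assumption based on the parameter description.

```python
import hashlib
import re
import string

# Assumption: "non-alphabet characters" means whitespace, digits, and
# ASCII punctuation, as the parameter description suggests.
_NON_CHARACTER = re.compile(r"\s+|\d+|[" + re.escape(string.punctuation) + "]")


def text_md5(text, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: normalize the text, then return its MD5 hex digest."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = _NON_CHARACTER.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(texts, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: keep the first sample seen for each distinct hash."""
    seen = set()
    kept = []
    for text in texts:
        digest = text_md5(text, lowercase, ignore_non_character)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Goodbye"]

# Default options: only byte-identical texts collapse.
strict = dedup_exact(docs)

# Both options enabled: case, spacing, and punctuation differences are ignored.
loose = dedup_exact(docs, lowercase=True, ignore_non_character=True)

print(strict)
print(loose)
```

With the defaults, only the verbatim repeat of "Hello, World!" is dropped. With both options enabled, "hello world" normalizes to the same string as "Hello, World!" and is dropped as well, so the options trade strictness for broader matching.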