data_juicer.ops.deduplicator.document_deduplicator module
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Bases: Deduplicator
Deduplicates samples at the document level using exact matching.
This operator computes an MD5 hash of each sample's text. Before hashing, it can optionally convert the text to lowercase and strip non-alphabetic characters (whitespace, digits, and punctuation). Samples with identical hashes are treated as duplicates. The compute_hash method adds a 'hash' key to each sample to store its MD5 hash. During processing, the first occurrence of each unique hash is kept and later duplicates are filtered out. If the show_num parameter is set, the operator also returns the specified number of duplicate pairs for inspection.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Initialization method.
- Parameters:
lowercase – Whether to convert the sample text to lowercase before hashing
ignore_non_character – Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation
args – extra positional arguments
kwargs – extra keyword arguments
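The behaviour described above (normalize, hash with MD5, keep the first occurrence of each hash) can be illustrated with a minimal standalone sketch. It does not import data_juicer and is not the operator's actual implementation: the helper names text_hash and deduplicate, the "text" key, and the exact character pattern are assumptions made for illustration only.

```python
import hashlib
import re
import string

# Assumed pattern for "non-alphabetic" characters: whitespace, digits,
# and ASCII punctuation. The real operator may use a different pattern.
_NON_CHARACTER = re.compile(r"\s+|\d+|[" + re.escape(string.punctuation) + "]")


def text_hash(text, lowercase=False, ignore_non_character=False):
    """Return the MD5 hex digest of the optionally normalized text."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = _NON_CHARACTER.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(samples, lowercase=False, ignore_non_character=False):
    """Keep the first sample for each distinct hash, in input order."""
    seen = set()
    kept = []
    for sample in samples:
        digest = text_hash(sample["text"], lowercase, ignore_non_character)
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


samples = [
    {"text": "Hello, World!"},
    {"text": "hello world"},
    {"text": "Hello, World!"},
    {"text": "Goodbye"},
]

# Exact matching: only the repeated "Hello, World!" is dropped.
exact = deduplicate(samples)

# With both options on, "hello world" normalizes to the same string
# as "Hello, World!" and is dropped as well.
loose = deduplicate(samples, lowercase=True, ignore_non_character=True)
```

With the default options only byte-identical texts collapse, so enabling lowercase and ignore_non_character trades precision for recall: more near-identical documents are removed, at the risk of merging texts that differ only in digits or punctuation.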