data_juicer.ops.deduplicator.document_line_deduplicator module#
- class data_juicer.ops.deduplicator.document_line_deduplicator.DocumentLineDeduplicator(*args, **kwargs)[source]#
Bases: Deduplicator
Deduplicates at the line level across documents.
This operator identifies lines that appear in many documents (boilerplate text, copyright notices, navigation bars, etc.) and removes them. It works in two phases:
compute_hash – splits each document into lines, applies configurable skip rules, and computes an MD5 hash for every non-skipped line.
process – counts in how many distinct documents each line hash appears. Lines whose document frequency exceeds frequency_threshold are removed from every document.
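The two phases can be sketched in plain Python. This is an illustrative re-implementation of the technique, not the operator's actual code; the hashing and counting details inside data_juicer may differ:

```python
import hashlib
from collections import Counter


def line_hashes(text):
    # Phase 1: one MD5 hash per line of a document.
    return [hashlib.md5(line.encode("utf-8")).hexdigest()
            for line in text.splitlines()]


def dedup_lines(docs, frequency_threshold=6):
    # Phase 2: count in how many *distinct* documents each hash occurs
    # (a set per document, so repeats within one document count once).
    doc_freq = Counter()
    hashed = [line_hashes(d) for d in docs]
    for hashes in hashed:
        doc_freq.update(set(hashes))
    # Drop lines whose document frequency exceeds the threshold.
    return ["\n".join(line for line, h in zip(d.splitlines(), hashes)
                      if doc_freq[h] <= frequency_threshold)
            for d, hashes in zip(docs, hashed)]
```

With `frequency_threshold=2`, a line shared by three documents is removed from all of them, while lines unique to a single document survive.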
- __init__(frequency_threshold: int = 6, lowercase: bool = False, ignore_special_character: bool = False, min_line_length: int = 2, skip_brackets: bool = True, skip_markdown_headers: bool = True, skip_latex_env: bool = True, skip_html_tags: bool = True, *args, **kwargs)[source]#
Initialization method.
- Parameters:
frequency_threshold – document-frequency threshold. Lines appearing in more than this many documents are removed.
lowercase – whether to lower-case a line before hashing.
ignore_special_character – whether to strip whitespace, digits, and punctuation before hashing.
min_line_length – lines whose stripped length is below this value are skipped (never considered for dedup).
skip_brackets – skip lines consisting solely of bracket / semicolon characters such as { } [ ] ( ) ;.
skip_markdown_headers – skip lines that start with # (Markdown headings).
skip_latex_env – skip LaTeX \begin{…} / \end{…} environment declarations.
skip_html_tags – skip lines that are pure HTML / XML tags.
args – extra args
kwargs – extra args
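A hypothetical sketch of these skip rules as a single predicate. The exact patterns the operator uses may differ; this only illustrates the intent of each flag:

```python
import re

BRACKETS = set("{}[]();")


def should_skip(line, min_line_length=2, skip_brackets=True,
                skip_markdown_headers=True, skip_latex_env=True,
                skip_html_tags=True):
    stripped = line.strip()
    # Too short to be meaningful content.
    if len(stripped) < min_line_length:
        return True
    # Lines made up only of brackets / semicolons, e.g. "};".
    if skip_brackets and all(c in BRACKETS or c.isspace() for c in stripped):
        return True
    # Markdown headings.
    if skip_markdown_headers and stripped.startswith("#"):
        return True
    # LaTeX environment declarations like \begin{document}.
    if skip_latex_env and re.match(r"\\(begin|end)\{[^}]*\}\s*$", stripped):
        return True
    # Lines that are a single HTML / XML tag, e.g. "<div>".
    if skip_html_tags and re.fullmatch(r"<[^<>]+>", stripped):
        return True
    return False
```

Skipped lines are never hashed, so they can neither be counted as boilerplate nor removed.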
- compute_hash(sample)[source]#
Compute per-line MD5 hashes for a single document.
Skipped lines receive an empty-string hash so that the list of hashes stays aligned with the original lines.
- Parameters:
sample – input sample
- Returns:
sample with HashKeys.line_hashes populated.
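For illustration, per-line hashing with optional normalization can look like the sketch below. The function names here are hypothetical, not the operator's internals; the key detail is the empty-string hash for skipped lines, which keeps the hash list index-aligned with the document's lines:

```python
import hashlib
import string


def normalize(line, lowercase=False, ignore_special_character=False):
    if lowercase:
        line = line.lower()
    if ignore_special_character:
        # Strip whitespace, digits, and punctuation before hashing.
        drop = string.whitespace + string.digits + string.punctuation
        line = line.translate(str.maketrans("", "", drop))
    return line


def compute_line_hashes(text, skip=lambda l: len(l.strip()) < 2):
    # Skipped lines get an empty-string hash so indices stay aligned
    # with text.splitlines().
    return ["" if skip(line)
            else hashlib.md5(
                normalize(line, lowercase=True).encode("utf-8")).hexdigest()
            for line in text.splitlines()]
```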
- process(dataset, show_num=0)[source]#
Remove high-frequency lines from the dataset.
- Parameters:
dataset – input dataset (already hash-annotated).
show_num – number of traced duplicate pairs for inspection.
- Returns:
(dataset, dup_pairs) where dup_pairs maps a line hash to sample texts that contained it.
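The filtering phase might be sketched as follows, including the dup_pairs trace. The sample shape and the structure of dup_pairs are assumptions for illustration only:

```python
from collections import Counter


def remove_frequent_lines(samples, frequency_threshold=6, show_num=1):
    # samples: list of dicts with "text" and "line_hashes"
    # (an empty hash marks a skipped line).
    doc_freq = Counter()
    for s in samples:
        doc_freq.update({h for h in s["line_hashes"] if h})
    dup_hashes = {h for h, c in doc_freq.items() if c > frequency_threshold}
    # Trace up to show_num duplicated line hashes for inspection.
    dup_pairs = {h: [] for h in list(dup_hashes)[:show_num]}
    for s in samples:
        lines = s["text"].splitlines()
        # Skipped lines (empty hash) are never in dup_hashes, so they survive.
        kept = [l for l, h in zip(lines, s["line_hashes"])
                if h not in dup_hashes]
        for h in dup_pairs:
            if h in s["line_hashes"]:
                dup_pairs[h].append(s["text"])
        s["text"] = "\n".join(kept)
    return samples, dup_pairs
```

With a threshold of 1, a line shared by two samples is stripped from both, and the trace records the original texts that contained it.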