data_juicer.ops.deduplicator.document_line_deduplicator module
- class data_juicer.ops.deduplicator.document_line_deduplicator.DocumentLineDeduplicator(*args, **kwargs)[source]
Bases: Deduplicator
Deduplicates at the line level across documents.
This operator identifies lines that appear in many documents (boilerplate text, copyright notices, navigation bars, etc.) and removes them. It works in two phases:
compute_hash – splits each document into lines, applies configurable skip rules, and computes an MD5 hash for every non-skipped line.
process – counts in how many distinct documents each line hash appears. Lines whose document frequency exceeds frequency_threshold are removed from every document.
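The two phases above can be illustrated with a minimal, self-contained sketch. It is not the operator's actual implementation: the helper names `line_hashes` and `dedup_lines` are hypothetical, the skip rules are reduced to "ignore blank lines", and it assumes skipped lines are kept in the output unchanged.

```python
import hashlib
from collections import Counter


def line_hashes(text):
    """Phase 1 (sketch): one MD5 hex digest per line, or None for blank lines.

    The real operator applies configurable skip rules here; this sketch
    only skips lines that are empty after stripping.
    """
    hashes = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped:
            hashes.append(hashlib.md5(stripped.encode("utf-8")).hexdigest())
        else:
            hashes.append(None)
    return hashes


def dedup_lines(docs, frequency_threshold=6):
    """Phase 2 (sketch): drop lines that appear in too many distinct documents."""
    per_doc = [line_hashes(doc) for doc in docs]

    # Document frequency: each hash counts at most once per document.
    doc_freq = Counter()
    for hashes in per_doc:
        doc_freq.update({h for h in hashes if h is not None})

    cleaned = []
    for doc, hashes in zip(docs, per_doc):
        kept = [
            line
            for line, h in zip(doc.split("\n"), hashes)
            # Skipped lines (h is None) are never removed in this sketch.
            if h is None or doc_freq[h] <= frequency_threshold
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Note that the comparison is strict: a line is removed only when its document frequency is greater than the threshold, so a line found in exactly `frequency_threshold` documents is kept.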
- __init__(frequency_threshold: int = 6, lowercase: bool = False, ignore_special_character: bool = False, min_line_length: int = 2, skip_brackets: bool = True, skip_markdown_headers: bool = True, skip_latex_env: bool = True, skip_html_tags: bool = True, *args, **kwargs)[source]
Initialization method.
- Parameters:
frequency_threshold -- document-frequency threshold. Lines appearing in more than this many documents are removed.
lowercase -- whether to lower-case a line before hashing.
ignore_special_character -- whether to strip whitespace, digits, and punctuation before hashing.
min_line_length -- lines whose stripped length is below this value are skipped (never considered for dedup).
skip_brackets -- skip lines consisting solely of bracket / semicolon characters such as { } [ ] ( ) ;.
skip_markdown_headers -- skip lines that start with # (Markdown headings).
skip_latex_env -- skip LaTeX \begin{…}/\end{…} environment declarations.
skip_html_tags -- skip lines that are pure HTML / XML tags.
args -- extra positional arguments
kwargs -- extra keyword arguments
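As a rough illustration of how the skip and normalization parameters might interact, the sketch below mirrors the parameter names from the signature above. Everything else is an assumption: the helper names `should_skip` and `normalize` and all of the regular expressions are hypothetical, and the operator's real patterns may differ.

```python
import re
import string

# Hypothetical patterns, written for this sketch only.
BRACKETS_ONLY = re.compile(r"^[\s{}\[\]();]+$")
MARKDOWN_HEADER = re.compile(r"^#")
LATEX_ENV = re.compile(r"^\\(begin|end)\{[^}]*\}$")
HTML_TAG_ONLY = re.compile(r"^(<[^<>]+>\s*)+$")


def should_skip(
    line,
    min_line_length=2,
    skip_brackets=True,
    skip_markdown_headers=True,
    skip_latex_env=True,
    skip_html_tags=True,
):
    """Return True if the line should be excluded from deduplication."""
    stripped = line.strip()
    if len(stripped) < min_line_length:
        return True
    if skip_brackets and BRACKETS_ONLY.match(stripped):
        return True
    if skip_markdown_headers and MARKDOWN_HEADER.match(stripped):
        return True
    if skip_latex_env and LATEX_ENV.match(stripped):
        return True
    if skip_html_tags and HTML_TAG_ONLY.match(stripped):
        return True
    return False


def normalize(line, lowercase=False, ignore_special_character=False):
    """Return the text that would be hashed for a non-skipped line."""
    text = line.strip()
    if lowercase:
        text = text.lower()
    if ignore_special_character:
        # Drop whitespace, digits, and ASCII punctuation before hashing.
        drop = set(string.whitespace + string.digits + string.punctuation)
        text = "".join(ch for ch in text if ch not in drop)
    return text
```

With both normalization flags enabled, lines that differ only in case, numbers, or punctuation (for example page counters such as "Page 3 of 10" and "page 7 of 10") would hash to the same value in this sketch and so count toward the same document frequency.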