data_juicer.ops.mapper.remove_repeat_sentences_mapper module
- class data_juicer.ops.mapper.remove_repeat_sentences_mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)
Bases: Mapper
Mapper to remove repeated sentences in text samples.
- __init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)
Initialization method.
- Parameters:
lowercase – Whether to convert sample text to lower case.
ignore_special_character – Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters and numbers.
min_repeat_sentence_length – Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, then special characters are not included in this length.
args – extra positional args
kwargs – extra keyword args
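
To make the interaction of the three parameters concrete, here is a minimal standalone sketch. It is not the data_juicer implementation: the function name remove_repeat_sentences_sketch, the sentence-splitting regex, the choice to lowercase only the comparison key, and the space-joined output are all assumptions made for illustration.

```python
import re


def remove_repeat_sentences_sketch(text,
                                   lowercase=False,
                                   ignore_special_character=True,
                                   min_repeat_sentence_length=2):
    """Hypothetical helper (not part of data_juicer) that mimics the
    documented parameter semantics on a single string."""
    # Assumption: a sentence ends at a common English or Chinese terminator.
    sentences = re.split(r'(?<=[.!?。！？])\s*', text)
    seen = set()
    kept = []
    for sentence in sentences:
        if not sentence:
            continue
        # Assumption: lowercasing is applied to the comparison key only.
        key = sentence.lower() if lowercase else sentence
        if ignore_special_character:
            # Compare on Chinese characters, letters and digits only.
            key = re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]', '', key)
        if len(key) < min_repeat_sentence_length:
            # Too short to be considered for deduplication.
            kept.append(sentence)
            continue
        if key in seen:
            continue  # repeated sentence: drop it
        seen.add(key)
        kept.append(sentence)
    return ' '.join(kept)


print(remove_repeat_sentences_sketch("Hello world. Hello world! Goodbye."))
print(remove_repeat_sentences_sketch("Ok. Ok.", min_repeat_sentence_length=3))
```

With the defaults, "Hello world." and "Hello world!" compare equal because punctuation is ignored, so only the first is kept. Raising min_repeat_sentence_length to 3 leaves both copies of "Ok." in place, since the comparison key "Ok" has only two characters.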