data_juicer.ops.mapper.sentence_split_mapper module

class data_juicer.ops.mapper.sentence_split_mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)

Bases: Mapper

Splits text samples into individual sentences based on the specified language.

This operator uses an NLTK-based tokenizer to split the input text into sentences. The tokenizer language is set at initialization. The original text in each sample is replaced with a list of sentences, and samples are processed in batches for efficiency. Set the lang parameter to the appropriate language code (e.g., “en” for English) to get accurate sentence splitting.
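The batched behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not the operator's actual code: it assumes a column-oriented batch (a dict mapping a "text" key to a list of strings), and it swaps the NLTK tokenizer for a naive regular-expression splitter so that it runs without downloading any model. The names split_sentences and text_key are hypothetical.

```python
import re

# Assumption: a naive stand-in for the NLTK tokenizer. It splits on
# whitespace that follows '.', '!' or '?', and will mis-split
# abbreviations such as "Dr." that a trained tokenizer handles.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in `text` (hypothetical helper)."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def process_batched(samples, text_key="text"):
    """Replace every text in a column-oriented batch with its sentences.

    Assumption: `samples` maps column names to equal-length lists, and
    the text column is named by `text_key`.
    """
    samples[text_key] = [split_sentences(t) for t in samples[text_key]]
    return samples


batch = {"text": ["Hello there. How are you?", "One sentence only."]}
result = process_batched(batch)
```

Processing the whole column in one call is what makes the batched form cheaper than per-sample calls: the tokenizer is looked up once and applied in a single pass over the list.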

__init__(lang: str = 'en', *args, **kwargs)

Initialization method.

Parameters:
  • lang – language code of the text to split into sentences.

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process_batched(samples)