data_juicer.ops.mapper.punctuation_normalization_mapper module

class data_juicer.ops.mapper.punctuation_normalization_mapper.PunctuationNormalizationMapper(*args, **kwargs)

Bases: Mapper

Normalizes Unicode punctuation marks to their English equivalents in text samples.

This operator processes a batch of text samples and replaces each Unicode punctuation mark with its corresponding English punctuation. The mapping includes common substitutions such as “，” to “,”, “。” to “.”, and ““” to “"”. It iterates over each character in the text, replacing it if it is found in the predefined punctuation map. The result is a set of text samples with consistent punctuation formatting.
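The per-character lookup described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual source: `PUNCTUATION_MAP` is a hypothetical, abbreviated table, and `normalize_text`, `normalize_batch`, and the `"text"` key are names assumed for the example.

```python
# Hypothetical, abbreviated map; the operator's predefined table is larger.
PUNCTUATION_MAP = {
    "，": ",",   # fullwidth comma
    "。": ".",   # ideographic full stop
    "“": '"',   # left double quotation mark
    "”": '"',   # right double quotation mark
}


def normalize_text(text: str) -> str:
    """Replace each character found in the map; leave all others unchanged."""
    return "".join(PUNCTUATION_MAP.get(ch, ch) for ch in text)


def normalize_batch(samples: dict, text_key: str = "text") -> dict:
    """Apply normalize_text to every entry of a column-oriented batch.

    The dict-of-lists layout and the default key are assumptions made
    for this sketch.
    """
    samples[text_key] = [normalize_text(t) for t in samples[text_key]]
    return samples


batch = {"text": ["你好，世界。", "“quoted”"]}
print(normalize_batch(batch)["text"])  # ['你好,世界.', '"quoted"']
```

Because the lookup falls back to the original character, text that contains no mapped punctuation passes through unchanged.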

__init__(*args, **kwargs)

Initialization method.

Parameters:
  • args – extra positional arguments

  • kwargs – extra keyword arguments

process_batched(samples)