Data Juicer

data_juicer.ops.mapper.clean_links_mapper module

class data_juicer.ops.mapper.clean_links_mapper.CleanLinksMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)

Bases: Mapper

Mapper to clean links like http/https/ftp in text samples.

This operator uses a regular expression pattern to identify URLs and other web links in the text and replaces each match with a replacement string. By default the replacement is an empty string, which removes the links. Both the pattern and the replacement string can be customized. The operator processes samples in batches and modifies the text in place; a sample that contains no links is left unchanged.
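The behaviour described above can be sketched with the standard re module alone. This is a minimal illustration, not the operator's implementation: LINK_PATTERN is a simplified, hypothetical pattern (the operator's actual default is not shown on this page), and clean_links is a helper name invented for the example.

```python
import re

# Hypothetical, simplified link pattern for illustration only;
# the operator's actual default pattern is not shown on this page.
LINK_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>]+", re.IGNORECASE)


def clean_links(text: str, repl: str = "") -> str:
    """Replace every match of LINK_PATTERN in text with repl."""
    return LINK_PATTERN.sub(repl, text)


# A text containing a link loses the link; surrounding characters are kept.
with_link = clean_links("Docs at https://example.com/guide are online.")

# A text without links comes back unchanged.
without_link = clean_links("No links here.")
```

Only the matched link is replaced, so whitespace on either side of it stays in place; with an empty replacement this can leave two adjacent spaces where the link used to be.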

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)

Initialization method.

Parameters:
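  • pattern – regular expression pattern used to find links within the text. If None, the operator's built-in link pattern is used.

  • repl – replacement string for each matched link; defaults to an empty string, which removes the link.

  • args – extra positional arguments.

  • kwargs – extra keyword arguments.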
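To make the effect of the two main parameters concrete, here is a small stand-in that mirrors the documented (pattern, repl) constructor. LinkCleaner, its clean method, and _FALLBACK_PATTERN are hypothetical names for this sketch and are not part of Data-Juicer; the fallback pattern is an assumption, not the library's default.

```python
import re
from typing import Optional

# Hypothetical fallback pattern, used only when no pattern is supplied.
_FALLBACK_PATTERN = r"(?:https?|ftp)://[^\s<>]+"


class LinkCleaner:
    """Stand-in that mirrors the documented (pattern, repl) constructor."""

    def __init__(self, pattern: Optional[str] = None, repl: str = ""):
        # None selects the fallback; any other string is compiled as given.
        chosen = _FALLBACK_PATTERN if pattern is None else pattern
        self.pattern = re.compile(chosen)
        self.repl = repl

    def clean(self, text: str) -> str:
        """Replace every match of the configured pattern with repl."""
        return self.pattern.sub(self.repl, text)


sample = "See http://a.example and ftp://b.example/file"

# Default pattern, custom replacement: every link becomes a placeholder.
masked = LinkCleaner(repl="<URL>").clean(sample)

# Custom pattern, default replacement: only ftp links are removed.
ftp_only = LinkCleaner(pattern=r"ftp://\S+").clean(sample)
```

Because the replacement is handed to re.sub, backslash sequences in repl such as \1 are interpreted as group references in this sketch; a plain placeholder like <URL> avoids that.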

process_batched(samples)
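This page gives no description for process_batched, so the following is only a sketch of what batched, in-place cleaning could look like. It assumes a column-oriented batch (a dict mapping field names to lists) with the text stored under a "text" key; that layout, the text_key parameter, and the pattern are all assumptions made for illustration.

```python
import re

# Hypothetical pattern; the operator's actual default is not shown here.
LINK_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>]+")


def process_batched(samples: dict, text_key: str = "text", repl: str = "") -> dict:
    """Clean every text in a column-oriented batch, modifying it in place."""
    for idx, text in enumerate(samples[text_key]):
        # Texts without a link are skipped and therefore left unchanged.
        if LINK_PATTERN.search(text) is None:
            continue
        samples[text_key][idx] = LINK_PATTERN.sub(repl, text)
    return samples


batch = {"text": ["Visit https://example.com now", "plain text"], "id": [1, 2]}
result = process_batched(batch)
```

The function returns the same dict object it received, so other columns such as "id" stay aligned with the cleaned texts.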
© Copyright 2024, Data-Juicer Team.
