Data Juicer

data_juicer.ops.mapper.clean_copyright_mapper module

class data_juicer.ops.mapper.clean_copyright_mapper.CleanCopyrightMapper(*args, **kwargs)

Bases: Mapper

Cleans copyright comments at the beginning of text samples.

This operator removes copyright comments from the start of text samples. It identifies and strips multiline comments that contain the word "copyright" using a regular expression. It also greedily removes lines starting with comment markers like //, #, or -- at the beginning of the text, as these are often part of copyright headers. The operator processes each sample individually but can handle batches for efficiency.
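The behavior described above can be approximated in plain Python. This is a minimal sketch, not the library's source: the names `strip_copyright_header` and `process_batch` are hypothetical, and several details are assumptions (C-style `/* ... */` delimiters for multiline comments, a case-insensitive match on "copyright", line stripping applied only when no copyright block comment was removed, and a batch represented as a dict with a `"text"` list).

```python
import re

# Assumption: multiline comments use C-style /* ... */ delimiters.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# Assumption: the word "copyright" is matched case-insensitively.
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
# Line comment markers named in the description above.
LINE_MARKERS = ("//", "#", "--")


def strip_copyright_header(text: str) -> str:
    """Hypothetical helper: remove a leading copyright header from text."""
    # Step 1: cut the first block comment if it mentions "copyright".
    match = BLOCK_COMMENT.search(text)
    if match and COPYRIGHT.search(match.group(0)):
        return text[:match.start()] + text[match.end():]

    # Step 2: otherwise greedily drop leading lines that start with a
    # comment marker, stopping at the first line that does not.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if not line.startswith(LINE_MARKERS):
            break
        skip += 1
    return "\n".join(lines[skip:])


def process_batch(samples: dict) -> dict:
    """Hypothetical batched wrapper: clean every text in a dict of lists."""
    # Assumption: texts live under the "text" key.
    samples["text"] = [strip_copyright_header(t) for t in samples["text"]]
    return samples


if __name__ == "__main__":
    batch = {
        "text": [
            "/* Copyright 2024 Foo */\nint main() {}",
            "// Copyright Foo\n// All rights reserved\nint x;\n// keep",
        ]
    }
    for cleaned in process_batch(batch)["text"]:
        print(repr(cleaned))
```

Note that the greedy line stripping does not check for the word "copyright", so any leading comment lines are removed, while comment lines later in the text are left alone.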

__init__(*args, **kwargs)

Initialization method.

Parameters:
  • args -- extra positional arguments

  • kwargs -- extra keyword arguments

process_batched(samples)

© Copyright 2024, Data-Juicer Team.
