Data Juicer

clean_links_mapper

Mapper to clean links like http/https/ftp in text samples.

This operator removes or replaces URLs and other web links in the text. It uses a regular expression pattern to identify and remove links. By default, it replaces the identified links with an empty string, effectively removing them. The operator can be customized with a different pattern and replacement string. It processes samples in batches and modifies the text in place. If no links are found in a sample, it is left unchanged.
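The behavior described above can be approximated with Python's standard `re` module. The following is a minimal sketch, not the operator's actual implementation: the pattern `SIMPLE_LINK_PATTERN`, the function `clean_links_batched`, and the `"text"` field name are assumptions made for illustration, and the operator's real default pattern is likely broader.

```python
import re

# Hypothetical, simplified pattern: a scheme (http, https, or ftp), "://",
# then everything up to the next whitespace character.
SIMPLE_LINK_PATTERN = re.compile(r"(?i)\b(?:https?|ftp)://\S+")


def clean_links_batched(samples, repl=""):
    """Replace links in every text of a batch and update the batch dict in place."""
    samples["text"] = [SIMPLE_LINK_PATTERN.sub(repl, t) for t in samples["text"]]
    return samples


# A batch is assumed here to be a dict mapping a field name to a list of values.
batch = {"text": ["Docs: https://example.com/guide.html", "No links here"]}
cleaned = clean_links_batched(batch)
print(cleaned["text"])
```

The second text contains no link, so it passes through unchanged, which mirrors the last sentence of the description.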

Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

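| name | type | default | desc |
|---|---|---|---|
| pattern | typing.Optional[str] | None | Regular expression pattern to search for within the text. If None, the operator's built-in link pattern is used. |
| repl | str | '' | Replacement string; the default is an empty string. |
| args | | '' | Extra positional args. |
| kwargs | | '' | Extra keyword args. |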
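To illustrate how `pattern` and `repl` interact, here is a small stand-in written with the standard `re` module. `LinkCleanerSketch` and `FALLBACK_PATTERN` are hypothetical names: the sketch mirrors the two parameters from the table but is not the `CleanLinksMapper` source, and the fallback pattern is a simplified assumption.

```python
import re
from typing import Optional

# Assumed fallback used when `pattern` is None (a simplification).
FALLBACK_PATTERN = r"(?i)\b(?:https?|ftp)://\S+"


class LinkCleanerSketch:
    """Hypothetical stand-in mirroring the `pattern` and `repl` parameters."""

    def __init__(self, pattern: Optional[str] = None, repl: str = ""):
        self.pattern = re.compile(FALLBACK_PATTERN if pattern is None else pattern)
        self.repl = repl

    def clean(self, text: str) -> str:
        # Substitute every match of the pattern with the replacement string.
        return self.pattern.sub(self.repl, text)


text = "mirror ftp://files.example.com/a.zip or https://example.com/a.zip"

# Defaults: every matched link is removed.
print(LinkCleanerSketch().clean(text))

# Custom pattern and replacement: only ftp links are touched.
ftp_only = LinkCleanerSketch(pattern=r"ftp://\S+", repl="[ftp link]")
print(ftp_only.clean(text))
```

Passing a narrower `pattern` is one way to keep some link types while masking others; the replacement string is inserted once per match.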

📊 Effect demonstration

test_mixed_https_links_text

CleanLinksMapper()

📥 input data

Sample 1: list
['This is a test,https://www.example.com/file.html?param1=value1&param2=value2', '这是个测试,https://example.com/my-page.html?param1=value1&param2=value2', '这是个测试,https://example.com']

📤 output data

Sample 1: list
['This is a test,', '这是个测试,', '这是个测试,']

✨ explanation

This example shows the operator removing HTTPS links from samples that mix plain text with a link. The operator identifies and removes each link and leaves the rest of the text intact. For example, 'This is a test,https://www.example.com/file.html?param1=value1&param2=value2' becomes 'This is a test,' after processing.

test_replace_links_text

CleanLinksMapper(repl='<LINKS>')

📥 input data

Sample 1: list
['ftp://user:password@ftp.example.com:21/', 'This is a sample for test', 'abcd://ef is a sample for test', 'HTTP://example.com/my-page.html?param1=value1&param2=value2']

📤 output data

Sample 1: list
['<LINKS>', 'This is a sample for test', '<LINKS> is a sample for test', '<LINKS>']

✨ explanation

This example demonstrates the operator replacing different types of links with the custom string '<LINKS>'. Each link found in a sample is replaced by '<LINKS>', while samples without links remain unchanged. For instance, 'ftp://user:password@ftp.example.com:21/' becomes '<LINKS>', whereas 'This is a sample for test' stays as it is because it contains no links.
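The output of this example implies that matching is case-insensitive (`HTTP://`) and not restricted to a fixed list of schemes (`abcd://`). A minimal sketch that reproduces the example with the standard `re` module follows. `GENERIC_LINK_PATTERN` is an assumed pattern chosen to fit this example, not the operator's actual default.

```python
import re

# Assumed pattern: any scheme-like prefix, "://", then non-whitespace characters.
# (?i) makes the match case-insensitive, so "HTTP://" is treated like "http://".
GENERIC_LINK_PATTERN = re.compile(r"(?i)\b[a-z][a-z0-9+.\-]*://\S+")

inputs = [
    "ftp://user:password@ftp.example.com:21/",
    "This is a sample for test",
    "abcd://ef is a sample for test",
    "HTTP://example.com/my-page.html?param1=value1&param2=value2",
]

# Replace each matched link with the placeholder used in the example.
outputs = [GENERIC_LINK_PATTERN.sub("<LINKS>", t) for t in inputs]
print(outputs)
```

Because the link ends at the first whitespace character, the trailing words in the third sample survive the substitution.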

🔗 related links

  • source code

  • unit test

  • Return to operator list

© Copyright 2024, Data-Juicer Team.
