Data Juicer

Filter

  • alphanumeric_filter
  • audio_duration_filter
  • audio_nmf_snr_filter
  • audio_size_filter
  • average_line_length_filter
  • character_repetition_filter
  • flagged_words_filter
  • general_field_filter
  • image_aesthetics_filter
  • image_aspect_ratio_filter
  • image_face_count_filter
  • image_face_ratio_filter
  • image_nsfw_filter
  • image_pair_similarity_filter
  • image_shape_filter
  • image_size_filter
  • image_text_matching_filter
  • image_text_similarity_filter
  • image_watermark_filter
  • in_context_influence_filter
  • instruction_following_difficulty_filter
  • language_id_score_filter
  • llm_analysis_filter
  • llm_difficulty_score_filter
  • llm_perplexity_filter
  • llm_quality_score_filter
  • llm_task_relevance_filter
  • maximum_line_length_filter
  • perplexity_filter
  • phrase_grounding_recall_filter
  • special_characters_filter
  • specified_field_filter
  • specified_numeric_field_filter
  • stopwords_filter
  • suffix_filter
  • text_action_filter
  • text_embd_similarity_filter
  • text_entity_dependency_filter
  • text_length_filter
  • text_pair_similarity_filter
  • token_num_filter
  • video_aesthetics_filter
  • video_aspect_ratio_filter
  • video_duration_filter
  • video_frames_text_similarity_filter
  • video_motion_score_filter
  • video_motion_score_raft_filter
  • video_nsfw_filter
  • video_ocr_area_ratio_filter
  • video_resolution_filter
  • video_tagging_from_frames_filter
  • video_watermark_filter
  • word_repetition_filter
  • words_num_filter

© Copyright 2024, Data-Juicer Team.
