API Reference¶ data_juicer.core data_juicer.ops data_juicer.ops.filter data_juicer.ops.mapper data_juicer.ops.deduplicator DocumentDeduplicator DocumentMinhashDeduplicator DocumentMinhashDeduplicatorWithUid DocumentSimhashDeduplicator ImageDeduplicator RayBasicDeduplicator RayDocumentDeduplicator RayImageDeduplicator RayVideoDeduplicator RayBTSMinhashDeduplicator RayBTSMinhashDeduplicatorWithUid VideoDeduplicator data_juicer.ops.selector data_juicer.ops.common get_sentences_from_document() get_words_from_document() merge_on_whitespace_tab_newline() split_on_newline_tab_whitespace() split_on_whitespace() strip() words_augmentation() words_refinement() split_text_by_punctuation() data_juicer.analysis ColumnWiseAnalysis CorrelationAnalysis DiversityAnalysis OverallAnalysis data_juicer.config init_configs() get_init_configs() export_config() merge_config() prepare_side_configs() get_default_cfg() prepare_cfgs_for_export() update_op_attr() data_juicer.format JsonFormatter LocalFormatter RemoteFormatter TextFormatter ParquetFormatter CsvFormatter TsvFormatter EmptyFormatter RayEmptyFormatter