data_juicer.ops.deduplicator.video_deduplicator module
- class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicator that removes duplicate samples at the document level by exact matching of the videos between documents.
- __init__(consider_text: bool = False, *args, **kwargs)[source]
Initialization.
- Parameters:
consider_text -- whether to consider text hash together with video hash when applying deduplication.
args -- extra positional args
kwargs -- extra keyword args
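The effect of consider_text can be illustrated with a minimal, library-independent sketch. It is not the data_juicer implementation: the `(text, video bytes)` sample layout, the helper names `sample_key` and `dedup_exact`, and the choice of SHA-256 as the hash are all assumptions made for illustration only.

```python
import hashlib
from typing import Iterable, List, Optional, Tuple

# Hypothetical sample layout: (text, raw video bytes).
Sample = Tuple[str, bytes]


def sample_key(text: str, video: bytes,
               consider_text: bool = False) -> Tuple[str, Optional[str]]:
    """Build the key used for exact matching (SHA-256 is an assumption)."""
    video_hash = hashlib.sha256(video).hexdigest()
    # The text hash only joins the key when consider_text is True.
    text_hash = (hashlib.sha256(text.encode("utf-8")).hexdigest()
                 if consider_text else None)
    return (video_hash, text_hash)


def dedup_exact(samples: Iterable[Sample],
                consider_text: bool = False) -> List[Sample]:
    """Keep the first sample seen for each key and drop later exact matches."""
    seen = set()
    kept = []
    for text, video in samples:
        key = sample_key(text, video, consider_text)
        if key not in seen:
            seen.add(key)
            kept.append((text, video))
    return kept


samples = [
    ("a cat", b"\x00\x01video-A"),
    ("a dog", b"\x00\x01video-A"),  # same video bytes, different text
    ("a cat", b"\x00\x02video-B"),
]

video_only = dedup_exact(samples)
video_and_text = dedup_exact(samples, consider_text=True)
```

With the video hash alone, the second sample is treated as a duplicate of the first because its video bytes are identical. With consider_text=True, the differing text makes the two keys distinct, so both samples are kept.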