data_juicer.ops
- data_juicer.ops.load_ops(process_list)
Load the op list according to the process list from the config file.
- Parameters:
process_list – A process list. Each item is an op name and its arguments.
- Returns:
The op instance list.
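As a rough illustration of the process-list shape described above, the sketch below mimics the loading step with stand-in classes. It does not import data_juicer: `OP_REGISTRY`, `register`, `LengthFilter`, `LowercaseMapper`, and `load_ops_sketch` are hypothetical names, and the one-key-dict layout of each item is an assumption about how "an op name and its arguments" is encoded.

```python
# Stand-in sketch; none of these names come from data_juicer itself.
OP_REGISTRY = {}


def register(name):
    """Register an op class under the name used in the process list."""
    def decorator(cls):
        OP_REGISTRY[name] = cls
        return cls
    return decorator


@register("length_filter")
class LengthFilter:
    def __init__(self, min_len=0, max_len=10_000):
        self.min_len = min_len
        self.max_len = max_len


@register("lowercase_mapper")
class LowercaseMapper:
    def __init__(self):
        pass


def load_ops_sketch(process_list):
    """Turn [{op_name: args_or_None}, ...] into a list of op instances."""
    ops = []
    for item in process_list:
        # Assumption: each item is a dict with exactly one key, the op name.
        op_name, args = next(iter(item.items()))
        ops.append(OP_REGISTRY[op_name](**(args or {})))
    return ops


process_list = [
    {"lowercase_mapper": None},
    {"length_filter": {"min_len": 5, "max_len": 200}},
]
ops = load_ops_sketch(process_list)
```

The returned list keeps the order of the process list, so ops run in the order they are configured.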
- class data_juicer.ops.Filter(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class for ops that remove samples based on computed stats.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_stats_single(sample, context=False)
Compute stats for the sample. The stats serve as the metric for deciding whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
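The two-phase idea behind `compute_stats_single` (first attach stats, then decide) can be sketched with a stand-in class. This is not the library's implementation: the class does not inherit from `OP`, and the `"stats"` and `"context"` field names, the `min_words` argument, and the `keep` method are all hypothetical.

```python
class WordCountFilterSketch:
    """Stand-in for a Filter subclass; it does not inherit from data_juicer's OP."""

    def __init__(self, text_key="text", min_words=3):
        self.text_key = text_key
        self.min_words = min_words

    def compute_stats_single(self, sample, context=False):
        # Assumption: stats live in a "stats" dict inside the sample.
        stats = sample.setdefault("stats", {})
        words = sample[self.text_key].split()
        stats["num_words"] = len(words)
        if context:
            # Keep the intermediate word list so a later op could reuse it.
            sample.setdefault("context", {})["words"] = words
        return sample

    def keep(self, sample):
        # Hypothetical decision step that only reads the computed stat.
        return sample["stats"]["num_words"] >= self.min_words


op = WordCountFilterSketch(min_words=3)
samples = [{"text": "too short"}, {"text": "this one is long enough"}]
kept = [s for s in map(op.compute_stats_single, samples) if op.keep(s)]
```

Separating the stat computation from the decision means the stats stay in the sample and can be inspected or reused after filtering.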
- class data_juicer.ops.Mapper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts data editing.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
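To make "data editing" concrete, here is a minimal stand-in mapper that rewrites the field named by `text_key`. It is a sketch under assumptions, not the library's API: the class does not inherit from `OP`, and the `apply` method name is hypothetical.

```python
import re


class WhitespaceMapperSketch:
    """Stand-in for a Mapper subclass; it does not inherit from data_juicer's OP."""

    def __init__(self, text_key="text"):
        # text_key selects which field of the sample holds the text to edit.
        self.text_key = text_key

    def apply(self, sample):
        # Hypothetical method name: collapse whitespace runs and trim the ends.
        text = sample[self.text_key]
        sample[self.text_key] = re.sub(r"\s+", " ", text).strip()
        return sample


mapper = WhitespaceMapperSketch(text_key="content")
result = mapper.apply({"content": "  hello   \n world  ", "id": 7})
```

Only the field selected by `text_key` is edited; other fields such as `id` pass through unchanged.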
- class data_juicer.ops.Deduplicator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts deduplication.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_hash(sample)
Compute hash values for the sample.
- Parameters:
sample – input sample
- Returns:
sample with computed hash value.
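The sketch below shows one way `compute_hash` could feed a deduplication pass: hash each sample, then keep the first sample seen for every hash. It is a stand-in that does not inherit from `OP`; the use of MD5 over the text, the `"hash"` field name, and the `drop_duplicates` helper are assumptions made for illustration.

```python
import hashlib


class ExactDedupSketch:
    """Stand-in for a Deduplicator subclass; not data_juicer's implementation."""

    def __init__(self, text_key="text"):
        self.text_key = text_key

    def compute_hash(self, sample):
        # Assumption: an MD5 digest of the text, stored under a "hash" field.
        text = sample[self.text_key]
        sample["hash"] = hashlib.md5(text.encode("utf-8")).hexdigest()
        return sample

    def drop_duplicates(self, samples):
        # Hypothetical helper: keep the first sample for each distinct hash.
        seen = set()
        unique = []
        for sample in samples:
            if sample["hash"] not in seen:
                seen.add(sample["hash"])
                unique.append(sample)
        return unique


op = ExactDedupSketch()
hashed = [op.compute_hash({"text": t}) for t in ["a", "b", "a"]]
unique = op.drop_duplicates(hashed)
```

Hashing per sample keeps the first step independent for each sample, while the duplicate check needs to see the whole dataset.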
- class data_juicer.ops.Selector(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts selection at the dataset level.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
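Dataset-level selection differs from filtering in that the decision for one sample depends on the others. A minimal stand-in, assuming a top-k rule over a numeric field, might look like this; `TopKSelectorSketch`, `field_key`, `top_k`, and `select` are hypothetical names and the class does not inherit from `OP`.

```python
class TopKSelectorSketch:
    """Stand-in for a Selector subclass; it works on the whole dataset at once."""

    def __init__(self, field_key="score", top_k=2):
        self.field_key = field_key
        self.top_k = top_k

    def select(self, dataset):
        # Rank all samples by the chosen field, highest first, and keep top_k.
        ranked = sorted(dataset, key=lambda s: s[self.field_key], reverse=True)
        return ranked[: self.top_k]


dataset = [
    {"text": "a", "score": 0.2},
    {"text": "b", "score": 0.9},
    {"text": "c", "score": 0.5},
]
selected = TopKSelectorSketch(top_k=2).select(dataset)
```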
- class data_juicer.ops.Grouper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that groups samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
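One plausible reading of "groups samples" is bucketing samples that share a field value. The stand-in below sketches that reading only; `KeyGrouperSketch`, `group_key`, and `group` are hypothetical names, and the class does not inherit from `OP`.

```python
from collections import defaultdict


class KeyGrouperSketch:
    """Stand-in for a Grouper subclass; buckets samples by a shared field value."""

    def __init__(self, group_key="source"):
        self.group_key = group_key

    def group(self, dataset):
        # Dicts preserve insertion order, so groups appear in first-seen order.
        groups = defaultdict(list)
        for sample in dataset:
            groups[sample[self.group_key]].append(sample)
        return list(groups.values())


dataset = [
    {"text": "a", "source": "web"},
    {"text": "b", "source": "book"},
    {"text": "c", "source": "web"},
]
batches = KeyGrouperSketch().group(dataset)
```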
- class data_juicer.ops.Aggregator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that aggregates samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
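Assuming an aggregator collapses a batch of samples into a single sample, the idea can be sketched as below. This is an assumption about the class's role rather than documented behavior; `ConcatAggregatorSketch`, `separator`, and `aggregate` are hypothetical names, and the class does not inherit from `OP`.

```python
class ConcatAggregatorSketch:
    """Stand-in for an Aggregator subclass; merges a batch into one sample."""

    def __init__(self, text_key="text", separator="\n"):
        self.text_key = text_key
        self.separator = separator

    def aggregate(self, batch):
        # Join the text of every sample in the batch into one output sample.
        merged = self.separator.join(s[self.text_key] for s in batch)
        return {self.text_key: merged}


batch = [{"text": "first"}, {"text": "second"}]
merged_sample = ConcatAggregatorSketch(separator=" | ").aggregate(batch)
```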