data_juicer.ops.base_op module
- data_juicer.ops.base_op.catch_map_batches_exception(method, skip_op_error=False, op_name=None)
For batched-map sample-level fault tolerance.
- data_juicer.ops.base_op.catch_map_single_exception(method, return_sample=True, skip_op_error=False, op_name=None)
For single-map sample-level fault tolerance. The input sample is expected to have batch_size = 1.
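The two helpers above wrap an operator's processing method so that a failure on one sample need not abort the whole run. The sketch below illustrates only the general idea. The name `catch_single_exception_sketch`, the toy `upper_text` function, and the fallback of returning the input sample unchanged are assumptions made for illustration, not Data-Juicer's actual implementation.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def catch_single_exception_sketch(method, skip_op_error=False, op_name=None):
    """Hypothetical wrapper: log and skip per-sample errors when requested."""
    name = op_name or getattr(method, "__name__", "unknown_op")

    @functools.wraps(method)
    def wrapper(sample, *args, **kwargs):
        try:
            return method(sample, *args, **kwargs)
        except Exception as exc:
            if not skip_op_error:
                raise  # default behavior: propagate the error to the caller
            # Assumed fallback: keep the sample as-is and record the failure.
            logger.error("Error in %s: %s", name, exc)
            return sample

    return wrapper


def upper_text(sample):
    # Toy processing step; raises AttributeError if sample["text"] is not a str.
    return {**sample, "text": sample["text"].upper()}


safe_upper = catch_single_exception_sketch(upper_text, skip_op_error=True)
strict_upper = catch_single_exception_sketch(upper_text)
```

With `skip_op_error=True` a bad sample passes through untouched, while the default wrapper still raises, which is usually preferable during development.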
- class data_juicer.ops.base_op.OP(*args, **kwargs)
Bases: object
- __init__(*args, **kwargs)
Base class of operators.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
index_key – the key name of field that stores the index; samples are indexed before processing if it is not None
system_key – the key name of field that stores system prompts
instruction_key – the key name of field that stores instruction
batch_size – the batch size for processing
work_dir – the working directory for this operator
skip_op_error – whether to skip the error when processing samples
- Ray-related parameters:
num_cpus – number of CPUs required for this operator, only used when running in Ray mode
num_gpus – number of GPUs required for this operator, only used when running in Ray mode
memory – memory size required for this operator, only used when running in Ray mode
runtime_env – runtime environment for this operator, only used when running in Ray mode. More details can be found in Ray documentation.
ray_execution_mode – execution mode in Ray; can be "actor", "task", or None. If None, the "actor" mode is used when the operator is a CUDA operator, and the "task" mode is used when the operator is a CPU operator.
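The default rule described for ray_execution_mode can be written out as a small function. This is a minimal sketch of the documented rule only: `resolve_ray_execution_mode` and its `is_cuda_operator` argument are hypothetical names, and the validation of unknown values is an assumption rather than documented behavior.

```python
def resolve_ray_execution_mode(ray_execution_mode, is_cuda_operator):
    """Hypothetical helper mirroring the documented default for None."""
    if ray_execution_mode is None:
        # Documented default: CUDA operators run as actors, CPU operators as tasks.
        return "actor" if is_cuda_operator else "task"
    if ray_execution_mode not in ("actor", "task"):
        # Assumption: reject anything other than the two documented modes.
        raise ValueError(f"unsupported ray_execution_mode: {ray_execution_mode!r}")
    return ray_execution_mode
```

An explicit value always wins over the default, so a CPU operator can still be forced into "actor" mode.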
- remove_extra_parameters(param_dict, keys=None)
At the beginning of a mapper op's __init__, call self.remove_extra_parameters(locals()) to conveniently get the init parameter dict of the op.
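The calling pattern for remove_extra_parameters can be sketched as follows. The standalone function `remove_extra_parameters_sketch`, the default set of excluded names, and the `ToyMapper` class are all assumptions for illustration; the real method may filter different keys.

```python
def remove_extra_parameters_sketch(param_dict, keys=None):
    """Hypothetical stand-in: drop names that are not real init parameters."""
    if keys is None:
        # Assumed default: names that locals() yields but that are not op settings.
        keys = {"self", "__class__", "args", "kwargs"}
    return {k: v for k, v in param_dict.items() if k not in keys}


class ToyMapper:
    """Toy class, not a Data-Juicer operator."""

    def __init__(self, lang="en", min_len=10, *args, **kwargs):
        # locals() is captured first thing, before any other local is defined.
        self.init_params = remove_extra_parameters_sketch(locals())
```

Calling `locals()` on the first line of `__init__` matters: any local variable created earlier would end up in the parameter dict too.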
- class data_juicer.ops.base_op.Mapper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts data editing.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Filter(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that removes specific info.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
min_closed_interval – whether the min_val of the specified filter range is a closed interval. It's True by default.
max_closed_interval – whether the max_val of the specified filter range is a closed interval. It's True by default.
reversed_range – whether to reverse the target range [min_val, max_val] to (-∞, min_val) or (max_val, +∞). It's False by default.
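The three range parameters above combine into a single keep-or-drop decision. The following is a minimal sketch of that logic, assuming that reversing the range means taking the exact complement of the configured interval; `keep_value` is a hypothetical helper, not part of the Filter API.

```python
def keep_value(value, min_val, max_val,
               min_closed_interval=True,
               max_closed_interval=True,
               reversed_range=False):
    """Hypothetical range check: True if the value falls in the target range."""
    # A closed bound includes the endpoint; an open bound excludes it.
    above_min = value >= min_val if min_closed_interval else value > min_val
    below_max = value <= max_val if max_closed_interval else value < max_val
    in_range = above_min and below_max
    # Assumption: the reversed range is the complement of the original range.
    return (not in_range) if reversed_range else in_range
```

With the defaults, [min_val, max_val] keeps both endpoints, and reversing it yields (-∞, min_val) or (max_val, +∞), matching the description of reversed_range.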
- compute_stats_single(sample, context=False)
Compute stats for the sample; the stats are used as metrics to decide whether to filter out this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
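A filter typically works in two steps: compute a statistic, then decide from it. The toy class below sketches that flow without importing Data-Juicer. The class `ToyWordCountFilter`, the `"stats"` and `"context"` keys, and the `keep` method are all hypothetical names chosen for illustration.

```python
class ToyWordCountFilter:
    """Hypothetical filter; not a Data-Juicer class."""

    def __init__(self, min_val=1, max_val=100, text_key="text"):
        self.min_val = min_val
        self.max_val = max_val
        self.text_key = text_key

    def compute_stats_single(self, sample, context=False):
        stats = sample.setdefault("stats", {})  # "stats" key is an assumption
        if "num_words" in stats:
            return sample  # stat already computed, nothing to do
        words = sample[self.text_key].split()
        if context:
            # Temporarily expose the intermediate value for reuse elsewhere.
            sample.setdefault("context", {})["words"] = words
        stats["num_words"] = len(words)
        return sample

    def keep(self, sample):
        # Decide from the previously computed stat (closed interval).
        return self.min_val <= sample["stats"]["num_words"] <= self.max_val
```

Separating the stat computation from the decision lets the stats be inspected or reused before any sample is dropped.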
- class data_juicer.ops.base_op.Deduplicator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts deduplication.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_hash(sample)
Compute hash values for the sample.
- Parameters:
sample – input sample
- Returns:
sample with computed hash value.
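Hash-based deduplication can be sketched as below: each sample gets a digest of its text, and only the first sample per digest is kept. This is an illustrative sketch, not the library's algorithm. The functions `compute_hash_sketch` and `drop_duplicates`, the `"hash"` key, and the choice of SHA-256 are assumptions.

```python
import hashlib


def compute_hash_sketch(sample, text_key="text", hash_key="hash"):
    """Hypothetical helper: return a copy of the sample with a text digest."""
    digest = hashlib.sha256(sample[text_key].encode("utf-8")).hexdigest()
    return {**sample, hash_key: digest}


def drop_duplicates(samples, hash_key="hash"):
    """Keep the first sample for each distinct hash value."""
    seen = set()
    kept = []
    for sample in samples:
        hashed = compute_hash_sketch(sample, hash_key=hash_key)
        if hashed[hash_key] not in seen:
            seen.add(hashed[hash_key])
            kept.append(hashed)
    return kept
```

An exact digest only catches byte-identical texts; near-duplicate detection needs a different kind of hash.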
- class data_juicer.ops.base_op.Selector(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts selection at the dataset level.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Grouper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that groups samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Aggregator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that aggregates samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Pipeline(*args, **kwargs)
Bases: OP
Base class for operators that represent a data processing pipeline.
- __init__(*args, **kwargs)
Base class of operators.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
index_key – index the samples before process if not None
batch_size – the batch size for processing