data_juicer.ops.base_op module#

data_juicer.ops.base_op.convert_list_dict_to_dict_list(samples)[source]#

data_juicer.ops.base_op.convert_dict_list_to_list_dict(samples)[source]#
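
The two conversion helpers carry no description on this page. Judging by their names, they switch between a row-oriented list of dicts and a column-oriented dict of lists. The sketch below is a standalone illustration of that idea, not the library's implementation; `rows_to_columns` and `columns_to_rows` are hypothetical names, and the code assumes every sample has the same keys.

```python
def rows_to_columns(rows):
    """List of dicts -> dict of lists (assumes all rows share the same keys)."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def columns_to_rows(columns):
    """Dict of lists -> list of dicts (assumes all lists have equal length)."""
    if not columns:
        return []
    keys = list(columns)
    length = len(columns[keys[0]])
    return [{key: columns[key][i] for key in keys} for i in range(length)]


rows = [{"text": "a", "id": 1}, {"text": "b", "id": 2}]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```

The column-oriented layout is the shape a batched operator typically receives, which is why a round trip between the two forms is useful.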

data_juicer.ops.base_op.convert_arrow_to_python(method)[source]#

data_juicer.ops.base_op.catch_map_batches_exception(method, skip_op_error=False, op_name=None)[source]#: Provides sample-level fault tolerance for batched map operations.

data_juicer.ops.base_op.sample_to_dict(sample)[source]#: Convert a sample to a dict.

data_juicer.ops.base_op.wrap_mapper_with_tracer(process_method, op_name, text_key, tracer, is_batched_op)[source]#

Wrap a mapper's process method to collect sample-level changes.

Parameters:

process_method -- the original process method (single or batched)
op_name -- the operator name
text_key -- the key of the text field to compare
tracer -- the tracer instance
is_batched_op -- whether this is a batched operator

Returns:

the wrapped process method
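
As a rough illustration of what such a wrapper can look like for a single-sample mapper, the sketch below compares the value under `text_key` before and after processing and reports differences to a tracer. The `ListTracer` class, its `record` method, and `wrap_with_tracer` are hypothetical stand-ins; the real tracer interface is not documented on this page.

```python
import functools


class ListTracer:
    """Hypothetical tracer that just stores change records in a list."""

    def __init__(self):
        self.records = []

    def record(self, op_name, before, after):
        self.records.append({"op": op_name, "before": before, "after": after})


def wrap_with_tracer(process_method, op_name, text_key, tracer):
    @functools.wraps(process_method)
    def wrapped(sample, *args, **kwargs):
        before = sample[text_key]  # read before the op may mutate the sample
        result = process_method(sample, *args, **kwargs)
        after = result[text_key]
        if before != after:
            tracer.record(op_name, before, after)
        return result

    return wrapped


def strip_text(sample):
    sample["text"] = sample["text"].strip()
    return sample


tracer = ListTracer()
traced = wrap_with_tracer(strip_text, "strip_text", "text", tracer)
traced({"text": "  hello "})  # changed, so one record is stored
traced({"text": "clean"})     # unchanged, so nothing is recorded
```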

data_juicer.ops.base_op.wrap_filter_with_tracer(process_method, op_name, tracer, is_batched_op)[source]#

Wrap a filter's process method to collect sample-level changes.

Parameters:

process_method -- the original process method (single or batched)
op_name -- the operator name
tracer -- the tracer instance
is_batched_op -- whether this is a batched operator

Returns:

the wrapped process method

data_juicer.ops.base_op.catch_map_single_exception(method, return_sample=True, skip_op_error=False, op_name=None)[source]#: Provides sample-level fault tolerance for single-sample map operations. The input sample is expected to have batch_size = 1.
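
The `catch_map_*_exception` helpers are described only as sample-level fault tolerance. One common way to get that behavior is a wrapper that catches exceptions from the wrapped method and, when skipping is enabled, returns a fallback instead of aborting the run. The following is a minimal sketch of that general pattern; the `tolerant` function and its fallback policy (returning the input unchanged) are assumptions for illustration, not the library's code.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def tolerant(method, skip_op_error=False, op_name=None):
    """Wrap `method` so a failing sample does not abort the whole run."""

    @functools.wraps(method)
    def wrapper(sample, *args, **kwargs):
        try:
            return method(sample, *args, **kwargs)
        except Exception as exc:
            if not skip_op_error:
                raise
            logger.warning("op %s failed on a sample: %s", op_name, exc)
            return sample  # assumed fallback: pass the sample through unchanged

    return wrapper


def to_upper(sample):
    return {"text": sample["text"].upper()}


safe_upper = tolerant(to_upper, skip_op_error=True, op_name="to_upper")
ok = safe_upper({"text": "hi"})
bad = safe_upper({"text": None})  # None has no .upper(); the error is skipped
```

With `skip_op_error=False` (the default here), the wrapper re-raises, so strict runs still fail fast.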

class data_juicer.ops.base_op.OPMetaClass(name, bases, namespace, /, **kwargs)[source]#: Bases: ABCMeta

class data_juicer.ops.base_op.OP(*args, **kwargs)[source]#

Bases: object

__init__(*args, **kwargs)[source]#

Base class of operators.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses
system_key -- the key name of the field that stores system prompts
instruction_key -- the key name of the field that stores instructions
index_key -- the key name of the field that stores the sample index; if not None, samples are indexed before processing
batch_size -- the batch size for processing
work_dir -- the working directory for this operator
skip_op_error -- whether to skip errors raised while processing samples

Ray-related parameters:

num_cpus -- number of CPUs required for this operator, only used when running in Ray mode
num_gpus -- number of GPUs required for this operator, only used when running in Ray mode
memory -- memory size required for this operator, only used when running in Ray mode
runtime_env -- runtime environment for this operator, only used when running in Ray mode. More details can be found in the Ray documentation.
ray_execution_mode -- execution mode in Ray; can be "actor", "task", or None. If None, "actor" mode is used when the operator is a CUDA operator and "task" mode is used when it is a CPU operator.
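
The default described for `ray_execution_mode` can be written as a small decision function. This is a sketch of the rule as stated above, not code from the library; `resolve_ray_mode` and its `uses_cuda` argument are hypothetical names, and rejecting unknown values with `ValueError` is an added assumption.

```python
def resolve_ray_mode(ray_execution_mode, uses_cuda):
    """Return "actor" or "task" following the documented default rule."""
    if ray_execution_mode is None:
        # No explicit choice: CUDA operators run as actors, CPU operators as tasks.
        return "actor" if uses_cuda else "task"
    if ray_execution_mode not in ("actor", "task"):
        raise ValueError(f"unsupported ray_execution_mode: {ray_execution_mode!r}")
    return ray_execution_mode


gpu_default = resolve_ray_mode(None, uses_cuda=True)
cpu_default = resolve_ray_mode(None, uses_cuda=False)
explicit = resolve_ray_mode("task", uses_cuda=True)
```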

get_env_spec() → OPEnvSpec[source]#

use_auto_proc()[source]#

is_batched_op()[source]#

use_ray_actor()[source]#

process(*args, **kwargs)[source]#

use_cuda()[source]#

runtime_np()[source]#

remove_extra_parameters(param_dict, keys=None)[source]#: For convenience, call self.remove_extra_parameters(locals()) at the beginning of a mapper op's __init__ to get the op's init parameter dict.

add_parameters(init_parameter_dict, **extra_param_dict)[source]#: Add parameters for each sample. extra_param_dict and init_parameter_dict must be left unchanged.

run(dataset)[source]#

empty_history()[source]#

class data_juicer.ops.base_op.Mapper(*args, **kwargs)[source]#

Bases: OP

__init__(*args, **kwargs)[source]#

Base class that conducts data editing.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses

process_batched(samples, *args, **kwargs)[source]#

process_single(sample)[source]#

For sample level, sample --> sample.

Parameters: sample -- sample to process
Returns: processed sample
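
The sample-to-sample contract of `process_single` can be illustrated without the library. The class below is a standalone stand-in, not a real `Mapper` subclass; `LowercaseMapperSketch` is a hypothetical name, and the batched variant assumes batches arrive as a dict of equal-length lists.

```python
class LowercaseMapperSketch:
    """Standalone stand-in for a Mapper subclass that lowercases text."""

    def __init__(self, text_key="text"):
        self.text_key = text_key

    def process_single(self, sample):
        # sample --> sample: edit the text field and return the sample.
        sample[self.text_key] = sample[self.text_key].lower()
        return sample

    def process_batched(self, samples):
        # Assumed batch layout: {"text": [...], ...}, one list entry per sample.
        samples[self.text_key] = [t.lower() for t in samples[self.text_key]]
        return samples


op = LowercaseMapperSketch()
single = op.process_single({"text": "Hello"})
batch = op.process_batched({"text": ["A", "Bc"]})
```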

run(dataset, *, exporter=None, tracer=None)[source]#

class data_juicer.ops.base_op.Filter(*args, **kwargs)[source]#

Bases: OP

__init__(*args, **kwargs)[source]#

Base class that removes specific info.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses
min_closed_interval -- whether the lower bound min_val of the specified filter range is inclusive (closed). True by default.
max_closed_interval -- whether the upper bound max_val of the specified filter range is inclusive (closed). True by default.
reversed_range -- whether to reverse the target range [min_val, max_val] to (-∞, min_val) or (max_val, +∞). False by default.

get_keep_boolean(val, min_val=None, max_val=None)[source]#
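
The three range options can be combined into one predicate. The sketch below is one plausible reading of the parameter descriptions above, not the library's `get_keep_boolean`; in particular, treating a missing bound as unbounded and implementing `reversed_range` as the logical negation of the in-range test are assumptions, and `keep_boolean` is a hypothetical name.

```python
def keep_boolean(val, min_val=None, max_val=None,
                 min_closed=True, max_closed=True, reversed_range=False):
    """Return True if `val` falls in the target range."""
    above_min = True
    if min_val is not None:
        above_min = val >= min_val if min_closed else val > min_val
    below_max = True
    if max_val is not None:
        below_max = val <= max_val if max_closed else val < max_val
    in_range = above_min and below_max
    # Negating a closed range [min, max] yields the open complement
    # (-inf, min) or (max, +inf), matching the reversed_range description.
    return not in_range if reversed_range else in_range


on_closed_edge = keep_boolean(5, min_val=5, max_val=10)
on_open_edge = keep_boolean(5, min_val=5, max_val=10, min_closed=False)
reversed_inside = keep_boolean(5, min_val=5, max_val=10, reversed_range=True)
reversed_outside = keep_boolean(4, min_val=5, max_val=10, reversed_range=True)
```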

compute_stats_batched(samples, *args, **kwargs)[source]#

process_batched(samples)[source]#

compute_stats_single(sample, context=False)[source]#

Compute stats for the sample; the stats serve as the metric for deciding whether to filter this sample.

Parameters:

sample -- input sample
context -- whether to temporarily store context information of intermediate variables in the sample

Returns:

sample with computed stats

process_single(sample)[source]#

For sample level, sample --> Boolean.

Parameters: sample -- sample to decide whether to filter
Returns: True to keep the sample, False to filter it out
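
Taken together, `compute_stats_single` and `process_single` describe a two-step flow: first attach a statistic to the sample, then decide from that statistic whether to keep it. The sketch below mimics that flow with a standalone class rather than subclassing `Filter`, so it runs without the library; the class name `TextLengthFilterSketch` and the `stats` and `text_len` field names are assumptions chosen for illustration.

```python
class TextLengthFilterSketch:
    """Standalone stand-in for a Filter subclass; keeps texts within a length range."""

    def __init__(self, min_len=1, max_len=10, text_key="text"):
        self.min_len = min_len
        self.max_len = max_len
        self.text_key = text_key

    def compute_stats_single(self, sample):
        # Store the metric on the sample so process_single can read it later.
        sample.setdefault("stats", {})["text_len"] = len(sample[self.text_key])
        return sample

    def process_single(self, sample):
        # True means keep, False means filter out.
        return self.min_len <= sample["stats"]["text_len"] <= self.max_len


op = TextLengthFilterSketch(min_len=3, max_len=5)
samples = [{"text": "hi"}, {"text": "hello"}, {"text": "too long text"}]
kept = [s for s in map(op.compute_stats_single, samples) if op.process_single(s)]
```

Separating the two steps means the computed statistic stays on the sample and can be inspected or reused after filtering.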

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]#

class data_juicer.ops.base_op.Deduplicator(*args, **kwargs)[source]#

Bases: OP

__init__(*args, **kwargs)[source]#

Base class that conducts deduplication.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses

compute_hash(sample)[source]#

Compute hash values for the sample.

Parameters: sample -- input sample
Returns: sample with computed hash value

process(dataset, show_num=0)[source]#

For doc-level, dataset --> dataset.

Parameters:

dataset -- input dataset
show_num -- number of traced samples to collect when the tracer is enabled

Returns:

the deduplicated dataset and the sampled duplicate pairs
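
The two-step contract (hash each sample, then process the whole dataset) can be sketched with a plain list of dicts. This is an illustrative exact-match deduplicator, not the library's algorithm; the `hash` field name, the use of MD5, and the shape of the returned duplicate pairs are assumptions.

```python
import hashlib


def compute_hash(sample, text_key="text"):
    """Attach an MD5 digest of the text to the sample."""
    sample["hash"] = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    return sample


def process(dataset, show_num=0):
    """Keep the first sample per hash; collect up to `show_num` duplicate pairs."""
    first_seen = {}
    deduplicated, dup_pairs = [], []
    for sample in dataset:
        original = first_seen.get(sample["hash"])
        if original is None:
            first_seen[sample["hash"]] = sample
            deduplicated.append(sample)
        elif len(dup_pairs) < show_num:
            dup_pairs.append((original, sample))
    return deduplicated, dup_pairs


dataset = [compute_hash({"text": t}) for t in ["a", "b", "a", "a"]]
deduplicated, dup_pairs = process(dataset, show_num=1)
```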

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]#

class data_juicer.ops.base_op.Selector(*args, **kwargs)[source]#

Bases: OP

__init__(*args, **kwargs)[source]#

Base class that conducts dataset-level selection.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses

process(dataset)[source]#

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset
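
Unlike a filter, a selector sees the whole dataset at once, so it can apply rules that depend on other samples. A minimal sketch of one such rule, top-k by a numeric field, is shown below; `select_top_k` and the `score` field are hypothetical and only illustrate the dataset-to-dataset contract.

```python
def select_top_k(dataset, key, k):
    """Dataset --> dataset: keep the k samples with the largest `key` value."""
    return sorted(dataset, key=lambda s: s[key], reverse=True)[:k]


dataset = [
    {"text": "a", "score": 0.2},
    {"text": "b", "score": 0.9},
    {"text": "c", "score": 0.5},
]
selected = select_top_k(dataset, key="score", k=2)
```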

run(dataset, *, exporter=None, tracer=None)[source]#

class data_juicer.ops.base_op.Grouper(*args, **kwargs)[source]#

Bases: OP

__init__(*args, **kwargs)[source]#

Base class that groups samples.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses

process(dataset)[source]#

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: dataset of batched samples

run(dataset, *, exporter=None, tracer=None)[source]#

class data_juicer.ops.base_op.Aggregator(*args, **kwargs)[source]#

Bases: OP

__init__(*args, **kwargs)[source]#

Base class that aggregates grouped samples.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses

process_single(sample)[source]#

For sample level, batched sample --> sample. The input must be the output of a Grouper OP.

Parameters: sample -- batched sample to aggregate
Returns: aggregated sample
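
A Grouper turns a dataset into batched samples, and an Aggregator collapses each batched sample back into one sample. The sketch below shows that pairing with plain Python functions; the fixed-size grouping and the join-by-newline aggregation are arbitrary choices for illustration, and the function names are hypothetical.

```python
def group_by_size(dataset, group_size):
    """Dataset (list of dicts) -> list of batched samples (dict of lists)."""
    batched = []
    for start in range(0, len(dataset), group_size):
        chunk = dataset[start:start + group_size]
        batched.append({"text": [s["text"] for s in chunk]})
    return batched


def aggregate_single(batched_sample):
    """Batched sample -> single sample, joining texts with newlines."""
    return {"text": "\n".join(batched_sample["text"])}


dataset = [{"text": t} for t in ["a", "b", "c"]]
batched = group_by_size(dataset, group_size=2)
aggregated = [aggregate_single(b) for b in batched]
```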

run(dataset, *, exporter=None, tracer=None)[source]#

class data_juicer.ops.base_op.Pipeline(*args, **kwargs)[source]#

Bases: OP

Base class for Operators that represent a data processing pipeline.

__init__(*args, **kwargs)[source]#

Base class of operators.

Parameters:

text_key -- the key name of the field that stores sample texts to be processed
image_key -- the key name of the field that stores the sample image list to be processed
audio_key -- the key name of the field that stores the sample audio list to be processed
video_key -- the key name of the field that stores the sample video list to be processed
image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed
query_key -- the key name of the field that stores sample queries
response_key -- the key name of the field that stores responses
history_key -- the key name of the field that stores the history of queries and responses
index_key -- if not None, samples are indexed before processing
batch_size -- the batch size for processing

run(dataset)[source]#
