data_juicer.core.tracer.tracer module
- class data_juicer.core.tracer.tracer.Tracer(work_dir, op_list_to_trace=None, show_num=10, trace_keys=None, lock=None)
Bases: object
Traces the changes to samples before and after an operator processes them.
The comparison results are stored in the work directory. Sample-level tracing is supported for better efficiency and accuracy.
- __init__(work_dir, op_list_to_trace=None, show_num=10, trace_keys=None, lock=None)
Initialization method.
- Parameters:
work_dir -- the work directory to store the comparison results
op_list_to_trace -- the OP list to be traced.
show_num -- the maximum number of samples to show in the comparison result files.
trace_keys -- list of field names to include in trace output. If set, the specified fields' values will be included in each trace entry.
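As a rough illustration of what trace_keys implies, the sketch below builds one trace entry that carries the requested field values alongside the before/after text. The helper name build_trace_entry and the entry layout (original_text, processed_text) are assumptions made for this example, not Data-Juicer's actual output format.

```python
from typing import Optional


def build_trace_entry(original: dict, processed: dict, text_key: str,
                      trace_keys: Optional[list] = None) -> dict:
    """Hypothetical helper: pair the before/after text and attach extra fields."""
    entry = {
        "original_text": original.get(text_key),
        "processed_text": processed.get(text_key),
    }
    # Copy each requested field from the original sample, if it is present.
    for key in trace_keys or []:
        if key in original:
            entry[key] = original[key]
    return entry


entry = build_trace_entry(
    {"text": "Hello  world", "id": 7, "source": "web"},
    {"text": "Hello world", "id": 7, "source": "web"},
    text_key="text",
    trace_keys=["id"],
)
```

Only the fields named in trace_keys are copied, so the "source" field is left out of the entry here.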
- should_trace_op(op_name: str) → bool
Check if an operator should be traced.
- Parameters:
op_name -- the operator name
- Returns:
True if the operator should be traced
- is_collection_complete(op_name: str) → bool
Check if enough samples have been collected for an operator.
- Parameters:
op_name -- the operator name
- Returns:
True if enough samples have been collected
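The two checks above are typically combined to decide whether a sample is worth handing to the tracer at all. The following is a minimal sketch using a stand-in class, MiniTracer, which is hypothetical and only mirrors the documented method names. It assumes that an unset op_list_to_trace means "trace every operator" and that collection is complete once show_num samples have been recorded.

```python
class MiniTracer:
    """Hypothetical stand-in that mirrors the two documented checks."""

    def __init__(self, op_list_to_trace=None, show_num=10):
        # Assumption: None (or an empty list) means "trace every operator".
        self.op_list_to_trace = set(op_list_to_trace) if op_list_to_trace else None
        self.show_num = show_num
        self.collected = {}  # op_name -> number of samples collected so far

    def should_trace_op(self, op_name: str) -> bool:
        return self.op_list_to_trace is None or op_name in self.op_list_to_trace

    def is_collection_complete(self, op_name: str) -> bool:
        return self.collected.get(op_name, 0) >= self.show_num


tracer = MiniTracer(op_list_to_trace=["clean_links_mapper"], show_num=2)
op = "clean_links_mapper"
calls = 0
for _ in range(5):
    # Skip the tracing work once the operator has enough samples.
    if tracer.should_trace_op(op) and not tracer.is_collection_complete(op):
        tracer.collected[op] = tracer.collected.get(op, 0) + 1
        calls += 1
```

Checking both conditions before collecting keeps the per-sample overhead low after the first show_num samples.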
- collect_mapper_sample(op_name: str, original_sample: dict, processed_sample: dict, text_key: str)
Collect a sample-level change for a Mapper operator. This method is thread-safe and collects at most show_num samples.
- Parameters:
op_name -- the operator name
original_sample -- the original sample before processing
processed_sample -- the processed sample after processing
text_key -- the key name of the text field to compare
- Returns:
True if the sample was collected, False if collection is complete
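One common way to get thread-safe, bounded collection is to guard the "check the count, then append" step with a lock. The sketch below shows that pattern under stated assumptions. BoundedCollector is a hypothetical class, and the decision to skip unchanged samples without counting them is an assumption, not confirmed library behavior.

```python
import threading


class BoundedCollector:
    """Hypothetical collector: keeps at most show_num changed samples per op."""

    def __init__(self, show_num: int = 10):
        self.show_num = show_num
        self._lock = threading.Lock()
        self._samples = {}  # op_name -> list of collected entries

    def collect_mapper_sample(self, op_name, original_sample, processed_sample, text_key):
        # Assumption: unchanged samples are skipped without being counted.
        if original_sample[text_key] == processed_sample[text_key]:
            return False
        with self._lock:  # the check and the append must be atomic across threads
            bucket = self._samples.setdefault(op_name, [])
            if len(bucket) >= self.show_num:
                return False
            bucket.append({
                "original_text": original_sample[text_key],
                "processed_text": processed_sample[text_key],
            })
            return True


collector = BoundedCollector(show_num=3)


def worker(start):
    for i in range(start, start + 50):
        collector.collect_mapper_sample(
            "demo_mapper", {"text": f"raw {i}"}, {"text": f"clean {i}"}, "text")


threads = [threading.Thread(target=worker, args=(n * 50,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two threads could both pass the length check and push the bucket past show_num.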
- collect_filter_sample(op_name: str, sample: dict, should_keep: bool)
Collect a sample-level change for a Filter operator. This method is thread-safe and collects at most show_num samples. Only samples that are filtered out (should_keep=False) are collected.
- Parameters:
op_name -- the operator name
sample -- the sample being filtered
should_keep -- True if the sample should be kept, False if it is filtered out
- Returns:
True if the sample was collected, False if collection is complete
- get_trace_file_path(op_name: str) → str
Get the file path for a trace file.
- Parameters:
op_name -- the operator name
- Returns:
the file path
- trace_mapper(op_name: str, previous_ds: Dataset, processed_ds: Dataset, text_key: str)
Compare datasets before and after a Mapper.
This mainly shows the sample pairs that differ because of the Mapper's modifications.
- Parameters:
op_name -- the name of the Mapper operator
previous_ds -- dataset before the Mapper process
processed_ds -- dataset processed by the Mapper
text_key -- the key of the text field to trace
- Returns:
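The dataset-level comparison for a Mapper can be pictured as a row-by-row diff. The sketch below uses plain lists of dicts as a stand-in for Dataset objects. The helper name diff_mapper_samples is hypothetical, and the sketch assumes the Mapper preserves sample count and order so that rows align by index.

```python
def diff_mapper_samples(previous, processed, text_key, show_num=10):
    """Hypothetical helper: return up to show_num before/after pairs that differ."""
    pairs = []
    # Assumption: a Mapper keeps sample count and order, so rows align by index.
    for before, after in zip(previous, processed):
        if before[text_key] != after[text_key]:
            pairs.append({
                "original_text": before[text_key],
                "processed_text": after[text_key],
            })
            if len(pairs) >= show_num:
                break
    return pairs


previous = [{"text": "a  b"}, {"text": "ok"}, {"text": "c\t d"}]
processed = [{"text": "a b"}, {"text": "ok"}, {"text": "c d"}]
pairs = diff_mapper_samples(previous, processed, "text", show_num=10)
```

Unchanged rows (the "ok" sample here) are skipped, so the result holds only the samples the Mapper actually modified.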
- trace_batch_mapper(op_name: str, previous_ds: Dataset, processed_ds: Dataset, text_key: str)
Compare datasets before and after a BatchMapper.
This mainly shows the new samples augmented by the BatchMapper.
- Parameters:
op_name -- the name of the BatchMapper operator
previous_ds -- dataset before the BatchMapper process
processed_ds -- dataset processed by the BatchMapper
text_key -- the key of the text field to trace
- Returns:
- trace_filter(op_name: str, previous_ds: Dataset, processed_ds: Dataset)
Compare datasets before and after a Filter.
This mainly shows the samples that the Filter removed.
- Parameters:
op_name -- the name of the Filter operator
previous_ds -- dataset before the Filter process
processed_ds -- dataset processed by the Filter
- Returns:
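Finding the samples a Filter removed amounts to locating rows that appear before the Filter but not after it. The sketch below does this with a single pass over plain lists of dicts standing in for Dataset objects. The helper name find_filtered_samples is hypothetical, and the approach assumes the Filter only drops rows and never reorders or edits them.

```python
def find_filtered_samples(previous, processed, show_num=10):
    """Hypothetical helper: samples present before the Filter but absent after."""
    filtered = []
    i = 0  # cursor into the processed (kept) samples
    # Assumption: the Filter only drops rows and never reorders or edits them.
    for sample in previous:
        if i < len(processed) and sample == processed[i]:
            i += 1  # this sample survived the Filter
        else:
            filtered.append(sample)
            if len(filtered) >= show_num:
                break
    return filtered


previous = [{"text": "a"}, {"text": ""}, {"text": "b"}, {"text": " "}]
processed = [{"text": "a"}, {"text": "b"}]
filtered = find_filtered_samples(previous, processed)
```

Because both lists share the same order, one cursor into the kept samples is enough and no set lookup or hashing of samples is needed.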
- trace_deduplicator(op_name: str, dup_pairs: dict)
Compare datasets before and after a Deduplicator.
This mainly shows the near-duplicate sample pairs extracted by the Deduplicator. Unlike the trace methods for Mapper and Filter operators, which run independently of those operators' process methods, the trace process for a Deduplicator is embedded in the Deduplicator's own process method.
- Parameters:
op_name -- the name of the Deduplicator operator
dup_pairs -- duplicate sample pairs obtained from the Deduplicator
- Returns:
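Since the duplicate pairs are produced inside the Deduplicator itself, the tracer's job reduces to persisting them. The sketch below writes such pairs as JSON Lines. It is illustrative only: the helper name write_dup_pairs, the file name pattern, the dup1/dup2 field names, and the assumed shape of dup_pairs (a dict mapping a hash key to a list of two duplicate texts) are all assumptions rather than documented behavior.

```python
import json
import os
import tempfile


def write_dup_pairs(work_dir, op_name, dup_pairs, show_num=10):
    """Hypothetical helper: dump up to show_num duplicate pairs as JSON Lines."""
    os.makedirs(work_dir, exist_ok=True)
    # Assumption: this file name pattern is illustrative only.
    path = os.path.join(work_dir, f"duplicate-{op_name}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for key in list(dup_pairs)[:show_num]:
            # Assumption: each value holds (at least) two duplicate samples.
            first, second = dup_pairs[key][:2]
            f.write(json.dumps({"dup1": first, "dup2": second}, ensure_ascii=False) + "\n")
    return path


work_dir = tempfile.mkdtemp()
dup_pairs = {"hash_a": ["same text", "same  text"], "hash_b": ["foo", "Foo"]}
path = write_dup_pairs(work_dir, "demo_deduplicator", dup_pairs, show_num=1)
```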