data_juicer.core.tracer.tracer module

class data_juicer.core.tracer.tracer.Tracer(work_dir, op_list_to_trace=None, show_num=10, trace_keys=None, lock=None)

Bases: object

Traces how samples change before and after an operator processes them.

The comparison results are stored in the work directory. Sample-level tracing is also supported, for better efficiency and accuracy.

__init__(work_dir, op_list_to_trace=None, show_num=10, trace_keys=None, lock=None)

Initialization method.

Parameters:
  • work_dir – the work directory to store the comparison results.

  • op_list_to_trace – the OP list to be traced.

  • show_num – the maximum number of samples to show in the comparison result files.

  • trace_keys – list of field names to include in trace output. If set, the specified fields' values will be included in each trace entry.

should_trace_op(op_name: str) → bool

Check if an operator should be traced.

Parameters:

op_name – the operator name

Returns:

True if the operator should be traced

is_collection_complete(op_name: str) → bool

Check if enough samples have been collected for an operator.

Parameters:

op_name – the operator name

Returns:

True if enough samples have been collected
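The two checks above can be pictured as a gate in front of per-sample collection. The following is a minimal sketch using a hypothetical stand-in class (`GateSketch`), not the data_juicer implementation. It assumes that `op_list_to_trace=None` means every operator is traced and that "enough samples" means `show_num` samples recorded for that operator; the `record` helper and the operator names are made up for illustration.

```python
import threading


class GateSketch:
    """Hypothetical stand-in for the gating logic; not the real Tracer."""

    def __init__(self, op_list_to_trace=None, show_num=10):
        # Assumption: None means "trace every operator".
        self.op_list_to_trace = (
            None if op_list_to_trace is None else set(op_list_to_trace)
        )
        self.show_num = show_num
        self._counts = {}  # op_name -> number of samples recorded so far
        self._lock = threading.Lock()

    def should_trace_op(self, op_name: str) -> bool:
        return self.op_list_to_trace is None or op_name in self.op_list_to_trace

    def is_collection_complete(self, op_name: str) -> bool:
        with self._lock:
            return self._counts.get(op_name, 0) >= self.show_num

    def record(self, op_name: str) -> None:
        # Hypothetical helper: count one collected sample for op_name.
        with self._lock:
            self._counts[op_name] = self._counts.get(op_name, 0) + 1


gate = GateSketch(op_list_to_trace=["my_mapper"], show_num=2)
before = gate.is_collection_complete("my_mapper")
gate.record("my_mapper")
gate.record("my_mapper")
after = gate.is_collection_complete("my_mapper")
```

A caller would typically test `should_trace_op` once per operator and `is_collection_complete` before each sample, so that tracing adds little overhead once the quota is filled.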

collect_mapper_sample(op_name: str, original_sample: dict, processed_sample: dict, text_key: str)

Collect a sample-level change for a Mapper operator. This method is thread-safe and will only collect up to show_num samples.

Parameters:
  • op_name – the operator name

  • original_sample – the original sample before processing

  • processed_sample – the processed sample after processing

  • text_key – the key name of the text field to compare

Returns:

True if the sample was collected, False if collection is complete
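To illustrate what "thread-safe and capped at show_num" can look like, here is a minimal sketch with a hypothetical `MapperCollectorSketch` class. It is not the library's code: the lock-guarded count check, the entry field names (`original_text`, `processed_text`), and the operator name `strip_mapper` are all assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MapperCollectorSketch:
    """Hypothetical stand-in for bounded, thread-safe collection."""

    def __init__(self, show_num=10):
        self.show_num = show_num
        self._lock = threading.Lock()
        self.samples = {}  # op_name -> list of collected entries

    def collect_mapper_sample(self, op_name, original_sample,
                              processed_sample, text_key):
        # The size check and the append happen under one lock, so
        # concurrent callers can never exceed show_num entries.
        with self._lock:
            entries = self.samples.setdefault(op_name, [])
            if len(entries) >= self.show_num:
                return False  # collection is already complete
            entries.append({
                "original_text": original_sample[text_key],
                "processed_text": processed_sample[text_key],
            })
            return True


collector = MapperCollectorSketch(show_num=3)


def run(i):
    # Simulate a mapper that strips surrounding whitespace.
    original = {"text": f"  sample {i}  "}
    processed = {"text": original["text"].strip()}
    return collector.collect_mapper_sample(
        "strip_mapper", original, processed, "text")


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run, range(10)))

collected = sum(results)
```

Which of the ten samples are kept depends on thread scheduling, but the number collected cannot exceed `show_num`.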

collect_filter_sample(op_name: str, sample: dict, should_keep: bool)

Collect a sample-level change for a Filter operator. This method is thread-safe and will only collect up to show_num samples. Only collects samples that are filtered out (should_keep=False).

Parameters:
  • op_name – the operator name

  • sample – the sample being filtered

  • should_keep – True if the sample should be kept, False if filtered

Returns:

True if the sample was collected, False if collection is complete
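The filter variant differs mainly in that kept samples are ignored. The sketch below uses a hypothetical `FilterCollectorSketch` class to show that behavior; it is an illustration under assumptions, not the library's code. In particular, returning False for a kept sample is an assumption (the documentation only specifies the return value for collected samples and for a completed collection), and the operator name `non_empty_filter` is made up.

```python
import threading


class FilterCollectorSketch:
    """Hypothetical stand-in that records only filtered-out samples."""

    def __init__(self, show_num=10):
        self.show_num = show_num
        self._lock = threading.Lock()
        self.samples = {}  # op_name -> list of filtered-out samples

    def collect_filter_sample(self, op_name, sample, should_keep):
        if should_keep:
            # Assumption: kept samples are never recorded and yield False.
            return False
        with self._lock:
            entries = self.samples.setdefault(op_name, [])
            if len(entries) >= self.show_num:
                return False  # collection is already complete
            entries.append(dict(sample))  # shallow copy of the sample
            return True


collector = FilterCollectorSketch(show_num=2)
data = [{"text": "ok"}, {"text": ""}, {"text": "fine"},
        {"text": ""}, {"text": ""}]

# Simulate a filter that keeps only samples with non-empty text.
results = [
    collector.collect_filter_sample("non_empty_filter", s, bool(s["text"]))
    for s in data
]
```

Here the first two empty-text samples are recorded and the third is skipped because the `show_num=2` quota is already full.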

get_trace_file_path(op_name: str) → str

Get the file path for a trace file.

Parameters:

op_name – the operator name

Returns:

the file path

trace_mapper(op_name: str, previous_ds: Dataset, processed_ds: Dataset, text_key: str)

Compare datasets before and after a Mapper.

This mainly shows the sample pairs that differ because of the Mapper's modifications.

Parameters:
  • op_name – the name of the mapper operator

  • previous_ds – dataset before the mapper process

  • processed_ds – dataset processed by the mapper

  • text_key – the text key to trace

Returns:

trace_batch_mapper(op_name: str, previous_ds: Dataset, processed_ds: Dataset, text_key: str)

Compare datasets before and after a BatchMapper.

This mainly shows the new samples added through augmentation by the BatchMapper.

Parameters:
  • op_name – the name of the mapper operator

  • previous_ds – dataset before the mapper process

  • processed_ds – dataset processed by the mapper

  • text_key – the text key to trace

Returns:

trace_filter(op_name: str, previous_ds: Dataset, processed_ds: Dataset)

Compare datasets before and after a Filter.

This mainly shows the samples removed by the Filter.

Parameters:
  • op_name – the name of the filter operator

  • previous_ds – dataset before the filter process

  • processed_ds – dataset processed by the filter

Returns:

trace_deduplicator(op_name: str, dup_pairs: dict)

Compare datasets before and after a Deduplicator.

This mainly shows the near-duplicate sample pairs extracted by the Deduplicator. Unlike the Mapper and Filter trace methods, which run independently of the operator's process method, tracing for a Deduplicator is embedded in the Deduplicator's own process method.

Parameters:
  • op_name – the name of the deduplicator operator

  • dup_pairs – duplicate sample pairs obtained from the deduplicator

Returns: