Data Tracing#
This document describes DataJuicer’s tracing system for tracking sample-level changes during data processing.
Overview#
The Tracer records how each operator modifies, filters, or deduplicates individual samples in the processing pipeline. This is useful for:
Debugging — Understanding why specific samples were modified or removed
Quality Assurance — Verifying operators are working as expected
Auditing — Maintaining records of data transformations
Configuration#
Basic Settings#
open_tracer: false # Enable/disable tracing
op_list_to_trace: [] # List of operators to trace (empty = all operators)
trace_num: 10 # Maximum number of samples to collect per operator
trace_keys: [] # Additional fields to include in trace output
Command Line#
# Enable tracing for all operators
dj-process --config config.yaml --open_tracer true
# Trace only specific operators
dj-process --config config.yaml --open_tracer true \
--op_list_to_trace clean_email_mapper,words_num_filter
# Collect more samples per operator
dj-process --config config.yaml --open_tracer true --trace_num 50
# Include additional fields in trace output
dj-process --config config.yaml --open_tracer true \
--trace_keys sample_id,source_file
Output Structure#
Trace results are stored in the trace/ subdirectory of the work directory:
{work_dir}/
└── trace/
    ├── sample_trace-clean_email_mapper.jsonl
    ├── sample_trace-words_num_filter.jsonl
    ├── duplicate-document_deduplicator.jsonl
    └── ...
Each trace file is in JSONL format (one JSON object per line), with content varying by operator type.
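As a minimal sketch of consuming this layout, the helper below reads every .jsonl file under {work_dir}/trace/ into memory. The name `load_traces` is hypothetical and not part of Data-Juicer; the code assumes only the directory layout and JSONL format described above.

```python
import json
from pathlib import Path


def load_traces(work_dir):
    """Return {file name: list of records} for every trace file found.

    Hypothetical helper, not part of Data-Juicer. It assumes only the
    {work_dir}/trace/*.jsonl layout described in this section.
    """
    trace_dir = Path(work_dir) / "trace"
    traces = {}
    if not trace_dir.is_dir():
        return traces  # tracing was disabled or nothing was written
    for path in sorted(trace_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            # One JSON object per non-empty line.
            traces[path.name] = [json.loads(line) for line in f if line.strip()]
    return traces
```

Calling `load_traces("./outputs")` and printing `len(records)` for each file gives a quick view of how many samples each operator recorded.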
Traced Operator Types#
Mapper Tracing#
For Mapper operators, the Tracer records samples where text content changes. Each record contains:
| Field | Description |
|---|---|
| original_text | Text before Mapper processing |
| processed_text | Text after Mapper processing |
| trace_keys fields | Values corresponding to the configured trace_keys |
Example output (sample_trace-clean_email_mapper.jsonl):
{"original_text": "Contact us at user@example.com for details.", "processed_text": "Contact us at for details."}
{"original_text": "Email: admin@test.org", "processed_text": "Email: "}
Only samples with actual text changes are collected; unchanged samples are skipped.
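The record shape and the skip-if-unchanged rule can be sketched as follows. This is an illustration of the documented behavior, not Data-Juicer's actual implementation; `build_mapper_record` is a hypothetical name, and the assumption that trace_keys values are copied from the original sample comes from the trace_keys description in this document.

```python
def build_mapper_record(original, processed, text_key="text", trace_keys=()):
    """Return a trace record, or None when the text did not change.

    Illustrative only: mirrors the documented record shape, not the
    library's real code.
    """
    before = original.get(text_key)
    after = processed.get(text_key)
    if before == after:
        return None  # unchanged samples are skipped
    # Copy the configured trace_keys from the original sample first.
    record = {key: original.get(key) for key in trace_keys}
    record["original_text"] = before
    record["processed_text"] = after
    return record
```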
Filter Tracing#
For Filter operators, the Tracer records samples that are filtered out (removed). Each record contains the complete sample data.
Example output (sample_trace-words_num_filter.jsonl):
{"text": "Too short.", "__dj__stats__": {"words_num": 2}}
{"text": "Also brief.", "__dj__stats__": {"words_num": 2}}
Only samples that fail the filter are collected; samples passing the filter are skipped.
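To see why samples were removed, it can help to summarize the stats carried by the filtered records. The sketch below assumes each record has a `__dj__stats__` dictionary like the example output above; `summarize_filter_stats` is a hypothetical helper, not a Data-Juicer function.

```python
import json
from collections import defaultdict


def summarize_filter_stats(lines, stats_key="__dj__stats__"):
    """Return {stat name: (min, max)} over numeric stats of filtered samples."""
    values = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        stats = record.get(stats_key) or {}
        for name, value in stats.items():
            # Keep plain numbers only; bool is a subclass of int in Python.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values[name].append(value)
    return {name: (min(v), max(v)) for name, v in values.items()}
```

Passing an open file handle for sample_trace-words_num_filter.jsonl as `lines` returns the range of `words_num` values among the removed samples, which can then be compared against the filter's configured thresholds.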
Deduplicator Tracing#
For Deduplicator operators, the Tracer records pairs of near-duplicate samples. Each record contains:
| Field | Description |
|---|---|
| dup1 | First sample in the duplicate pair |
| dup2 | Second sample in the duplicate pair |
Example output (duplicate-document_deduplicator.jsonl):
{"dup1": "This is a duplicate text.", "dup2": "This is a duplicate text."}
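When reviewing a deduplicator trace, separating exact from near duplicates is a useful first step. The sketch below assumes dup1 and dup2 hold text strings, as in the example output above. The function names are hypothetical, and the character-level similarity from Python's difflib is only a review aid; it is not the measure the deduplicator itself uses.

```python
import json
from difflib import SequenceMatcher


def pair_similarity(record):
    """Return a 0..1 character-level similarity for one dup1/dup2 record."""
    return SequenceMatcher(None, record["dup1"], record["dup2"]).ratio()


def split_pairs(lines):
    """Split JSONL trace lines into (exact duplicates, near duplicates)."""
    exact, near = [], []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        target = exact if record["dup1"] == record["dup2"] else near
        target.append(record)
    return exact, near
```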
Sample Collection Behavior#
The Tracer uses an efficient sample-level collection approach:
Each operator collects at most trace_num samples during processing
Collection stops early once enough samples are gathered
In default mode, collection is thread-safe using multiprocess locks
In Ray mode, each Worker has its own Tracer instance (no locking needed)
This design minimizes performance overhead: the Tracer does not compare the entire dataset, but captures changes in real time during processing.
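The cap-and-lock pattern described above can be sketched as follows. This is not the Tracer's implementation: `BoundedCollector` is a hypothetical class, and it uses `threading.Lock` as a stand-in for the multiprocess lock, so it demonstrates the pattern within a single process only. Sharing the collected records across processes would need shared state in addition to the lock.

```python
import threading


class BoundedCollector:
    """Collect at most `limit` records per operator (illustrative sketch)."""

    def __init__(self, limit=10):
        self.limit = limit
        self._records = {}
        self._lock = threading.Lock()  # stand-in for a multiprocess lock

    def is_complete(self, op_name):
        """True once the operator has reached its sample limit."""
        return len(self._records.get(op_name, [])) >= self.limit

    def collect(self, op_name, record):
        """Store the record; return False once the limit has been reached."""
        # Cheap unlocked check lets callers skip work after the cap is hit.
        if self.is_complete(op_name):
            return False
        with self._lock:
            bucket = self._records.setdefault(op_name, [])
            if len(bucket) >= self.limit:  # re-check under the lock
                return False
            bucket.append(record)
            return True

    def records(self, op_name):
        """Return a copy of the records collected for one operator."""
        return list(self._records.get(op_name, []))
```

The unlocked check before taking the lock is what keeps overhead low once the cap is reached: later samples return immediately without contending for the lock.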
trace_keys#
The trace_keys option allows including additional fields from original samples in the trace output. This is useful for identifying which samples were affected:
open_tracer: true
trace_keys:
- sample_id
- source_file
With this configuration, Mapper trace entries will include:
{
  "sample_id": "doc_00042",
  "source_file": "corpus_part1.jsonl",
  "original_text": "Original content...",
  "processed_text": "Processed content..."
}
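Once trace entries carry identifying fields, they can be grouped to see where changes concentrate. The sketch below counts trace records per value of one field; `count_by_key` is a hypothetical helper, and `source_file` is simply the field name used in the example configuration above.

```python
from collections import Counter


def count_by_key(records, key="source_file"):
    """Count trace records per value of one trace_keys field."""
    return Counter(record.get(key, "<missing>") for record in records)
```

For example, `count_by_key(records).most_common(5)` lists the five source files with the most traced changes.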
API Reference#
Tracer (Default Mode)#
from data_juicer.core.tracer import Tracer

tracer = Tracer(
    work_dir="./outputs",
    op_list_to_trace=["clean_email_mapper", "words_num_filter"],
    show_num=10,
    trace_keys=["sample_id"],
)

# Check if an operator should be traced
tracer.should_trace_op("clean_email_mapper")  # True

# Check if enough samples have been collected
tracer.is_collection_complete("clean_email_mapper")  # False

# Collect Mapper sample
tracer.collect_mapper_sample(
    op_name="clean_email_mapper",
    original_sample={"text": "Email: a@b.com"},
    processed_sample={"text": "Email: "},
    text_key="text",
)

# Collect Filter sample
tracer.collect_filter_sample(
    op_name="words_num_filter",
    sample={"text": "too short"},
    should_keep=False,
)
RayTracer (Distributed Mode)#
import ray

from data_juicer.core.tracer.ray_tracer import RayTracer

# RayTracer is a Ray Actor, so it is created via .remote()
tracer = RayTracer.remote(
    work_dir="./outputs",
    op_list_to_trace=None,  # Trace all operators
    show_num=10,
    trace_keys=["sample_id"],
)

# Remote method calls return object refs; ray.get() waits for the result
ray.get(tracer.collect_mapper_sample.remote(
    op_name="clean_email_mapper",
    original_sample={"text": "Email: a@b.com"},
    processed_sample={"text": "Email: "},
    text_key="text",
))

# Finalize and export all trace results
ray.get(tracer.finalize_traces.remote())
Helper Functions#
The data_juicer.core.tracer module provides mode-agnostic helper functions:
from data_juicer.core.tracer import (
    should_trace_op,
    check_tracer_collect_complete,
    collect_for_mapper,
    collect_for_filter,
)
# These functions automatically handle default mode and Ray mode
should_trace_op(tracer_instance, "clean_email_mapper")
check_tracer_collect_complete(tracer_instance, "clean_email_mapper")
collect_for_mapper(tracer_instance, "op_name", original, processed, "text")
collect_for_filter(tracer_instance, "op_name", sample, should_keep=False)
Performance Considerations#
Overhead#
When trace_num is small (default: 10), the additional overhead of tracing is minimal
Once an operator has collected trace_num samples, no further collection occurs
The main cost is comparing original and processed text in Mappers
Recommendations#
| Scenario | Recommended Settings |
|---|---|
| Development/Debugging | |
| Production Runs | |
| Auditing Specific Operators | |
| Large-scale Tracing | |
Troubleshooting#
No trace files generated:
# Verify tracer is enabled
grep "open_tracer" config.yaml
# Check if trace directory exists
ls -la {work_dir}/trace/
Trace files are empty:
For Mapper: The operator may not have modified any samples
For Filter: The operator may not have filtered out any samples
Check logs for warnings like “Datasets before and after op [X] are all the same”
Too few samples in trace files:
Increase trace_num to collect more samples
There may be fewer than trace_num changed/filtered samples in the dataset
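The checks above can be combined into one small diagnostic. The sketch below is a hypothetical helper, not part of Data-Juicer. It assumes the sample_trace-{op_name}.jsonl naming shown under Output Structure, so it covers Mapper and Filter traces only; Deduplicator traces use the duplicate- prefix instead.

```python
from pathlib import Path


def diagnose_traces(work_dir, op_names, trace_num=10):
    """Return {op name: status} for the sample_trace-* files of each operator."""
    trace_dir = Path(work_dir) / "trace"
    report = {}
    for op_name in op_names:
        path = trace_dir / f"sample_trace-{op_name}.jsonl"
        if not path.is_file():
            report[op_name] = "missing"  # tracer disabled or op not traced
            continue
        with path.open(encoding="utf-8") as f:
            count = sum(1 for line in f if line.strip())
        if count == 0:
            report[op_name] = "empty"  # op changed or removed no samples
        elif count < trace_num:
            report[op_name] = f"partial ({count}/{trace_num})"
        else:
            report[op_name] = "full"
    return report
```

A "partial" status is not necessarily a problem: it may simply mean the dataset contains fewer than trace_num changed or filtered samples for that operator.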