# Data Tracing This document describes DataJuicer's tracing system for tracking sample-level changes during data processing. ## Overview The Tracer records how each operator modifies, filters, or deduplicates individual samples in the processing pipeline. This is useful for: - **Debugging** — Understanding why specific samples were modified or removed - **Quality Assurance** — Verifying operators are working as expected - **Auditing** — Maintaining records of data transformations ## Configuration ### Basic Settings ```yaml open_tracer: false # Enable/disable tracing op_list_to_trace: [] # List of operators to trace (empty = all operators) trace_num: 10 # Maximum number of samples to collect per operator trace_keys: [] # Additional fields to include in trace output ``` ### Command Line ```bash # Enable tracing for all operators dj-process --config config.yaml --open_tracer true # Trace only specific operators dj-process --config config.yaml --open_tracer true \ --op_list_to_trace clean_email_mapper,words_num_filter # Collect more samples per operator dj-process --config config.yaml --open_tracer true --trace_num 50 # Include additional fields in trace output dj-process --config config.yaml --open_tracer true \ --trace_keys sample_id,source_file ``` ## Output Structure Trace results are stored in the `trace/` subdirectory of the work directory: ``` {work_dir}/ └── trace/ ├── sample_trace-clean_email_mapper.jsonl ├── sample_trace-words_num_filter.jsonl ├── duplicate-document_deduplicator.jsonl └── ... ``` Each trace file is in JSONL format (one JSON object per line), with content varying by operator type. ## Traced Operator Types ### Mapper Tracing For Mapper operators, the Tracer records samples where text content changes. Each record contains: | Field | Description | |-------|-------------| | `original_text` | Text before Mapper processing | | `processed_text` | Text after Mapper processing | | *trace_keys fields* | Values corresponding to configured `trace_keys` | Example output (`sample_trace-clean_email_mapper.jsonl`): ```json {"original_text":"Contact us at user@example.com for details.","processed_text":"Contact us at for details."} {"original_text": "Email: admin@test.org", "processed_text": "Email: "} ``` Only samples with actual text changes are collected; unchanged samples are skipped. ### Filter Tracing For Filter operators, the Tracer records samples that are **filtered out** (removed). Each record contains the complete sample data. Example output (`sample_trace-words_num_filter.jsonl`): ```json {"text": "Too short.", "__dj__stats__": {"words_num": 2}} {"text": "Also brief.", "__dj__stats__": {"words_num": 2}} ``` Only samples that fail the filter are collected; samples passing the filter are skipped. ### Deduplicator Tracing For Deduplicator operators, the Tracer records pairs of near-duplicate samples. Each record contains: | Field | Description | |-------|-------------| | `dup1` | First sample in the duplicate pair | | `dup2` | Second sample in the duplicate pair | Example output (`duplicate-document_deduplicator.jsonl`): ```json {"dup1": "This is a duplicate text.", "dup2": "This is a duplicate text."} ``` ## Sample Collection Behavior The Tracer uses an efficient **sample-level collection** approach: 1. Each operator collects at most `trace_num` samples during processing 2. Collection stops early once enough samples are gathered 3. In default mode, collection is **thread-safe** using multiprocess locks 4. In Ray mode, each Worker has its own Tracer instance (no locking needed) This design minimizes performance overhead — the Tracer does not compare the entire dataset, but captures changes in real-time during processing. ## trace_keys The `trace_keys` option allows including additional fields from original samples in the trace output. This is useful for identifying which samples were affected: ```yaml open_tracer: true trace_keys: - sample_id - source_file ``` With this configuration, Mapper trace entries will include: ```json { "sample_id": "doc_00042", "source_file": "corpus_part1.jsonl", "original_text": "Original content...", "processed_text": "Processed content..." } ``` ## API Reference ### Tracer (Default Mode) ```python from data_juicer.core.tracer import Tracer tracer = Tracer( work_dir="./outputs", op_list_to_trace=["clean_email_mapper", "words_num_filter"], show_num=10, trace_keys=["sample_id"] ) # Check if an operator should be traced tracer.should_trace_op("clean_email_mapper") # True # Check if enough samples have been collected tracer.is_collection_complete("clean_email_mapper") # False # Collect Mapper sample tracer.collect_mapper_sample( op_name="clean_email_mapper", original_sample={"text": "Email: a@b.com"}, processed_sample={"text": "Email: "}, text_key="text" ) # Collect Filter sample tracer.collect_filter_sample( op_name="words_num_filter", sample={"text": "too short"}, should_keep=False ) ``` ### RayTracer (Distributed Mode) ```python from data_juicer.core.tracer.ray_tracer import RayTracer # RayTracer is a Ray Actor — created via Ray tracer = RayTracer.remote( work_dir="./outputs", op_list_to_trace=None, # Trace all operators show_num=10, trace_keys=["sample_id"] ) # Remote method calls ray.get(tracer.collect_mapper_sample.remote( op_name="clean_email_mapper", original_sample={"text": "Email: a@b.com"}, processed_sample={"text": "Email: "}, text_key="text" )) # Finalize and export all trace results ray.get(tracer.finalize_traces.remote()) ``` ### Helper Functions The `data_juicer.core.tracer` module provides mode-agnostic helper functions: ```python from data_juicer.core.tracer import ( should_trace_op, check_tracer_collect_complete, collect_for_mapper, collect_for_filter, ) # These functions automatically handle default mode and Ray mode should_trace_op(tracer_instance, "clean_email_mapper") check_tracer_collect_complete(tracer_instance, "clean_email_mapper") collect_for_mapper(tracer_instance, "op_name", original, processed, "text") collect_for_filter(tracer_instance, "op_name", sample, should_keep=False) ``` ## Performance Considerations ### Overhead - When `trace_num` is small (default: 10), the additional overhead of tracing is minimal - Once an operator has collected `trace_num` samples, no further collection occurs - The main cost is comparing original and processed text in Mappers ### Recommendations | Scenario | Recommended Settings | |----------|----------------------| | Development/Debugging | `open_tracer: true`, `trace_num: 10-50` | | Production Runs | `open_tracer: false` | | Auditing Specific Operators | `open_tracer: true`, `op_list_to_trace: [specific operators]` | | Large-scale Tracing | `open_tracer: true`, `trace_num: 100`, specify `op_list_to_trace` | ## Troubleshooting **No trace files generated:** ```bash # Verify tracer is enabled grep "open_tracer" config.yaml # Check if trace directory exists ls -la ./outputs/{work_dir}/trace/ ``` **Trace files are empty:** - For Mapper: The operator may not have modified any samples - For Filter: The operator may not have filtered out any samples - Check logs for warnings like "Datasets before and after op [X] are all the same" **Too few samples in trace files:** - Increase `trace_num` to collect more samples - There may be fewer than `trace_num` changed/filtered samples in the dataset