# Cache Management

This document describes DataJuicer's cache management system, including HuggingFace dataset caching, cache directory configuration, cache compression, and temporary storage.

## Overview

DataJuicer provides a caching mechanism based on HuggingFace Datasets to avoid redundant computation. When enabled, each operator generates cache files with a unique **fingerprint** based on:

- The fingerprint of the input data
- The operator name and parameters
- The hash of the processing function

That is: same input + same operator configuration = same fingerprint = cache hit. Therefore, re-running the same pipeline on the same data will skip already-computed steps. For more details, please refer to our paper: [Data-Juicer: A One-Stop Data Processing System for Large Language Models](https://arxiv.org/abs/2309.02033).

The cache system also provides:

- **Configurable cache directories** via environment variables or config options
- **Cache compression** to reduce disk usage for large-scale datasets
- **Temporary storage** for intermediate files in non-cache mode
- **Fine-grained cache control** via context managers and decorators

## Configuration

### Basic Cache Settings

```yaml
use_cache: true          # Enable/disable HuggingFace dataset caching
ds_cache_dir: null       # Custom cache directory (overrides HF_DATASETS_CACHE)
cache_compress: null     # Compression method: 'gzip', 'zstd', 'lz4', or null
temp_dir: null           # Temp directory for intermediate files when cache is disabled
```

### Command Line

```bash
# Enable caching (default)
dj-process --config config.yaml --use_cache true

# Disable caching
dj-process --config config.yaml --use_cache false

# Enable cache compression
dj-process --config config.yaml --cache_compress zstd

# Custom cache directory
dj-process --config config.yaml --ds_cache_dir /fast-storage/dj-cache
```

## Cache Directory Structure

DataJuicer organizes cache files in a hierarchical directory structure controlled by environment variables:

```
~/.cache/                    # CACHE_HOME (default)
└── data_juicer/             # DATA_JUICER_CACHE_HOME
    ├── assets/              # DATA_JUICER_ASSETS_CACHE
    │   └── (extracted frames, stopwords, flagged words, etc.)
    └── models/              # DATA_JUICER_MODELS_CACHE
        └── (downloaded model files)
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `CACHE_HOME` | `~/.cache` | Root cache directory |
| `DATA_JUICER_CACHE_HOME` | `$CACHE_HOME/data_juicer` | DataJuicer cache root |
| `DATA_JUICER_ASSETS_CACHE` | `$DATA_JUICER_CACHE_HOME/assets` | Assets cache (frames, word lists, etc.) |
| `DATA_JUICER_MODELS_CACHE` | `$DATA_JUICER_CACHE_HOME/models` | Downloaded models cache |
| `DATA_JUICER_EXTERNAL_MODELS_HOME` | `None` | External models directory |

Override defaults by setting environment variables:

```bash
export DATA_JUICER_CACHE_HOME=/data/dj-cache
export DATA_JUICER_MODELS_CACHE=/models/dj-models
dj-process --config config.yaml
```

## Cache Compression

For large-scale datasets (tens of GB or more), cache files can consume significant disk space. Cache compression reduces storage requirements by compressing intermediate cache files after each operator completes.

### Supported Algorithms

| Algorithm | Library | Speed | Compression Ratio | Recommended For |
|-----------|---------|-------|-------------------|-----------------|
| `zstd` | zstandard | Fast | High | General use (default) |
| `lz4` | lz4 | Fastest | Moderate | Speed-critical workloads |
| `gzip` | gzip | Slow | High | Compatibility needs |

### Configuration

```yaml
use_cache: true
cache_compress: zstd    # Enable zstd compression
```

```bash
dj-process --config config.yaml --cache_compress zstd
```

### Multi-Process Compression

Cache compression supports parallel processing. The number of compression worker processes is controlled by the `np` parameter:

```yaml
np: 4                   # Number of parallel workers (also used for compression)
cache_compress: zstd
```

## Cache Control API

### DatasetCacheControl

A context manager to temporarily enable or disable HuggingFace dataset caching within a specific scope:

```python
from data_juicer.utils.cache_utils import DatasetCacheControl

# Temporarily disable caching
with DatasetCacheControl(on=False):
    # Operations here will not use cache
    result = dataset.map(my_function)

# Temporarily enable caching
with DatasetCacheControl(on=True):
    # Operations here will use cache
    result = dataset.map(my_function)
```

### dataset_cache_control Decorator

A decorator for functions that need to control cache state:

```python
from data_juicer.utils.cache_utils import dataset_cache_control

@dataset_cache_control(on=False)
def process_without_cache(dataset):
    return dataset.map(my_function)
```

### CompressionOff

A context manager to temporarily disable cache compression:

```python
from data_juicer.utils.compress import CompressionOff

with CompressionOff():
    # Cache compression is disabled in this scope
    result = dataset.map(my_function)
```

### CompressManager

Low-level API for manual compression/decompression:

```python
from data_juicer.utils.compress import CompressManager

manager = CompressManager(compressor_format="zstd")

# Compress a file
manager.compress("input.arrow", "input.arrow.zstd")

# Decompress a file
manager.decompress("input.arrow.zstd", "input.arrow")
```

### CacheCompressManager

High-level API for managing HuggingFace dataset cache compression:

```python
from data_juicer.utils.compress import CacheCompressManager

manager = CacheCompressManager(compressor_format="zstd")

# Compress previous dataset's cache files
manager.compress(prev_ds=previous_dataset, this_ds=current_dataset, num_proc=4)

# Decompress cache files for a dataset
manager.decompress(ds=dataset, num_proc=4)

# Clean up all compressed cache files
manager.cleanup_cache_files(ds=dataset)
```

## Cache vs Checkpoint

Cache and checkpoint are mutually exclusive; enabling checkpoint automatically disables cache:

| Feature | Cache | Checkpoint |
|---------|-------|------------|
| **Purpose** | Accelerate repeated runs with same configuration | Fault recovery and resumption |
| **Granularity** | Per-operator result | Full dataset snapshot |
| **Storage Location** | HuggingFace cache directory | Work directory |
| **Recovery Method** | Automatic (hash-based) | Manual (config-based) |
| **Compression** | Supported (`cache_compress`) | Not applicable |
| **Scenario** | Iterative development, parameter tuning | Long-running production tasks |

```yaml
# Cache mode (default)
use_cache: true
use_checkpoint: false

# Checkpoint mode (cache auto-disabled)
use_cache: true         # Will be overridden to false
use_checkpoint: true
```

## Disabling Cache and Temporary Directory

When `use_cache: false` or checkpoint mode is enabled (`use_checkpoint: true`), HuggingFace dataset caching is fully disabled. In this mode, DataJuicer writes intermediate files produced during operator processing to a temporary directory, and cleans them up automatically when processing completes. The `temp_dir` parameter controls where these intermediate files are stored.

### Behavior

- **Defaults to `null`**: When not set, the operating system determines the temporary directory location (typically `/tmp`), equivalent to Python's `tempfile.gettempdir()`.
- **Takes effect automatically when cache is disabled**: Once caching is disabled, `temp_dir` is applied as the global temporary directory for the entire process via Python's `tempfile.tempdir`, affecting all temporary files created through `tempfile` in the process.
- **Cache compression is automatically disabled**: When caching is disabled, `cache_compress` is automatically ignored and reset to `null`.

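The process-wide effect of `tempfile.tempdir` described in the behavior notes can be seen with the standard library alone. The following is a minimal sketch, not DataJuicer's own code: `apply_temp_dir` is a hypothetical helper that only mimics the documented behavior, and the scratch directory is created just for the demonstration.

```python
import os
import tempfile


def apply_temp_dir(temp_dir):
    """Hypothetical helper mirroring the documented `temp_dir` behavior.

    None keeps the system default; any other value is created if missing
    and installed as the process-wide default for the tempfile module.
    """
    if temp_dir is not None:
        os.makedirs(temp_dir, exist_ok=True)
        tempfile.tempdir = temp_dir
    return tempfile.gettempdir()


# Use a scratch location so the example is safe to run anywhere.
original_tempdir = tempfile.tempdir
scratch_root = tempfile.mkdtemp(prefix="dj-temp-demo-")
custom_dir = os.path.join(scratch_root, "dj-temp")

effective_dir = apply_temp_dir(custom_dir)

# Any tempfile call in the process now lands in the custom directory,
# including calls made by third-party libraries.
with tempfile.NamedTemporaryFile() as tmp:
    created_in = os.path.dirname(tmp.name)

# Restore the previous setting so the rest of the process is unaffected.
tempfile.tempdir = original_tempdir
```

Because the setting is a module-level global, the sketch restores the previous value at the end; a long-running process that sets it once would normally leave it in place.
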
### Configuration

```yaml
use_cache: false
temp_dir: /data/dj-temp   # Custom temp directory; null means system default
```

```bash
dj-process --config config.yaml --use_cache false --temp_dir /data/dj-temp
```

### Safety Notes

> **Set `temp_dir` with caution: an unsafe path can cause unexpected program behavior.**

- **Do not point to critical system directories** (e.g., `/`, `/usr`, `/etc`). Automatic cleanup of temporary files may accidentally delete important files.
- **Do not point to directories containing important data**. Temporary file writes and cleanup operations may conflict with existing files.
- **Ensure sufficient disk space**: When cache is disabled, intermediate files are written and deleted dynamically during processing. Peak disk usage is approximately equal to the output size of a single operator.
- **The directory is created automatically if it does not exist**: DataJuicer calls `os.makedirs` to create the specified path if it is missing.
- **`temp_dir` affects the entire process's `tempfile` behavior**: Because it sets the global `tempfile.tempdir` variable, this setting influences all components in the process that rely on `tempfile`, including third-party libraries.

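The safety notes above can be turned into a quick pre-flight check before a path is passed as `temp_dir`. This is a hedged sketch that assumes POSIX-style paths: `check_temp_dir` and `CRITICAL_DIRS` are hypothetical names, and DataJuicer is not known to perform these checks itself.

```python
import os
import shutil
import tempfile

# Illustrative deny-list taken from the examples in the safety notes;
# this is an assumption, not a list that DataJuicer enforces.
CRITICAL_DIRS = {"/", "/usr", "/etc"}


def check_temp_dir(path, min_free_bytes=0):
    """Return a list of problems found for `path`; empty means none found."""
    problems = []
    normalized = os.path.abspath(os.path.expanduser(path))

    if normalized in CRITICAL_DIRS:
        problems.append(f"{normalized} is a critical system directory")

    # A non-empty directory may hold data that cleanup could disturb.
    if os.path.isdir(normalized) and os.listdir(normalized):
        problems.append(f"{normalized} already contains files")

    # The target may not exist yet, so measure free space on the
    # nearest existing ancestor directory instead.
    probe = normalized
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    free = shutil.disk_usage(probe).free
    if free < min_free_bytes:
        problems.append(
            f"only {free} bytes free under {probe}, need {min_free_bytes}"
        )
    return problems


scratch = tempfile.mkdtemp(prefix="dj-temp-check-")
fresh_dir = os.path.join(scratch, "dj-temp")  # does not exist yet

issues_fresh = check_temp_dir(fresh_dir)
issues_root = check_temp_dir("/")
issues_space = check_temp_dir(fresh_dir, min_free_bytes=10**30)
```

A reasonable value for `min_free_bytes` is the expected output size of the largest single operator, since that approximates peak usage when caching is disabled.
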
## Performance Considerations

### When to Enable Cache

- **Enable**: For iterative development where you frequently re-run pipelines with minor changes
- **Enable**: When operators are computationally expensive and you want to skip already-computed steps
- **Disable**: For one-shot processing to avoid disk overhead

### When to Enable Compression

- **Enable**: When dataset size exceeds tens of GB and disk space is limited
- **Enable** `zstd`: For the best balance of speed and compression ratio
- **Enable** `lz4`: When compression speed is critical
- **Disable**: When disk space is abundant and you want maximum processing speed

## Troubleshooting

**Cache files consuming too much disk space:**

```bash
# Check cache directory size
du -sh ~/.cache/data_juicer/

# Enable compression
dj-process --config config.yaml --cache_compress zstd
```

**Stale cache causing unexpected results:**

```bash
# Clear HuggingFace dataset cache
rm -rf ~/.cache/huggingface/datasets/

# Or specify a fresh cache directory
dj-process --config config.yaml --ds_cache_dir /tmp/fresh-cache
```

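When the cache location has been overridden through environment variables, the hard-coded path in the `du` check may not be the right one. The sketch below resolves the DataJuicer cache root following the defaults in the environment variable table and sums file sizes under it. Both function names are hypothetical, the resolution logic is an assumption based on that table rather than DataJuicer's actual code, and the total is the apparent file size, which can differ from the block usage that `du` reports.

```python
import os
import tempfile


def resolve_dj_cache_home(env=None):
    """Resolve the cache root using the documented default chain."""
    env = os.environ if env is None else env
    cache_home = env.get("CACHE_HOME", os.path.expanduser("~/.cache"))
    return env.get(
        "DATA_JUICER_CACHE_HOME", os.path.join(cache_home, "data_juicer")
    )


def dir_size_bytes(path):
    """Sum the sizes of regular files under `path` (0 if it is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# Demo on a scratch directory with a known amount of data.
demo_dir = tempfile.mkdtemp(prefix="dj-cache-size-")
os.makedirs(os.path.join(demo_dir, "assets"))
with open(os.path.join(demo_dir, "a.arrow"), "wb") as f:
    f.write(b"\0" * 1000)
with open(os.path.join(demo_dir, "assets", "b.bin"), "wb") as f:
    f.write(b"\0" * 24)

demo_size = dir_size_bytes(demo_dir)

# On a real machine: dir_size_bytes(resolve_dj_cache_home())
```
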