Cache Management#
This document describes DataJuicer’s cache management system, including HuggingFace dataset caching, cache directory configuration, cache compression, and temporary storage.
Overview#
DataJuicer provides a caching mechanism based on HuggingFace Datasets to avoid redundant computation. When enabled, each operator generates cache files with a unique fingerprint based on:
The fingerprint of the input data
The operator name and parameters
The hash of the processing function
That is: same input + same operator configuration = same fingerprint = cache hit. Therefore, re-running the same pipeline on the same data will skip already-computed steps. For more details, please refer to our paper: Data-Juicer: A One-Stop Data Processing System for Large Language Models.
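The fingerprinting scheme can be sketched in a few lines. The operator name, parameter names, and hashing details below are illustrative stand-ins chosen for this sketch, not DataJuicer's actual implementation:

```python
import hashlib
import json

def op_fingerprint(input_fingerprint, op_name, op_params):
    """Illustrative sketch (not DataJuicer's real hashing code): combine
    the input data's fingerprint with the operator name and parameters
    into one deterministic cache key."""
    payload = json.dumps(
        {"in": input_fingerprint, "op": op_name, "params": op_params},
        sort_keys=True,  # deterministic ordering -> stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Same input + same operator configuration -> same fingerprint (cache hit)
fp_a = op_fingerprint("ab12cd34", "clean_email_mapper", {"repl": ""})
fp_b = op_fingerprint("ab12cd34", "clean_email_mapper", {"repl": ""})
assert fp_a == fp_b

# Changing any ingredient changes the fingerprint (cache miss)
assert op_fingerprint("ab12cd34", "clean_email_mapper", {"repl": "X"}) != fp_a
```

Any change to the input fingerprint, operator name, or parameters yields a new key, which is why re-running an identical pipeline hits the cache while edited steps recompute.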
The cache system also provides:
Configurable cache directories via environment variables or config options
Cache compression to reduce disk usage for large-scale datasets
Temporary storage for intermediate files in non-cache mode
Fine-grained cache control via context managers and decorators
Configuration#
Basic Cache Settings#
use_cache: true # Enable/disable HuggingFace dataset caching
ds_cache_dir: null # Custom cache directory (overrides HF_DATASETS_CACHE)
cache_compress: null # Compression method: 'gzip', 'zstd', 'lz4', or null
temp_dir: null # Temp directory for intermediate files when cache is disabled
Command Line#
# Enable caching (default)
dj-process --config config.yaml --use_cache true
# Disable caching
dj-process --config config.yaml --use_cache false
# Enable cache compression
dj-process --config config.yaml --cache_compress zstd
# Custom cache directory
dj-process --config config.yaml --ds_cache_dir /fast-storage/dj-cache
Cache Directory Structure#
DataJuicer organizes cache files in a hierarchical directory structure controlled by environment variables:
~/.cache/ # CACHE_HOME (default)
└── data_juicer/ # DATA_JUICER_CACHE_HOME
├── assets/ # DATA_JUICER_ASSETS_CACHE
│ └── (extracted frames, stopwords, flagged words, etc.)
└── models/ # DATA_JUICER_MODELS_CACHE
└── (downloaded model files)
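The defaults in this tree can be resolved with ordinary environment-variable lookups. The sketch below is an assumption based only on the layout shown above, not DataJuicer's exact resolution code:

```python
import os

# Each level falls back to a path under its parent unless the
# corresponding environment variable overrides it (assumed behavior,
# derived from the directory tree in this document).
cache_home = os.environ.get("CACHE_HOME", os.path.expanduser("~/.cache"))
dj_cache_home = os.environ.get(
    "DATA_JUICER_CACHE_HOME", os.path.join(cache_home, "data_juicer")
)
assets_cache = os.environ.get(
    "DATA_JUICER_ASSETS_CACHE", os.path.join(dj_cache_home, "assets")
)
models_cache = os.environ.get(
    "DATA_JUICER_MODELS_CACHE", os.path.join(dj_cache_home, "models")
)
```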
Environment Variables#
| Variable | Default | Description |
|---|---|---|
| CACHE_HOME | ~/.cache | Root cache directory |
| DATA_JUICER_CACHE_HOME | ~/.cache/data_juicer | DataJuicer cache root |
| DATA_JUICER_ASSETS_CACHE | ~/.cache/data_juicer/assets | Assets cache (frames, word lists, etc.) |
| DATA_JUICER_MODELS_CACHE | ~/.cache/data_juicer/models | Downloaded models cache |
|  |  | External models directory |
Override defaults by setting environment variables:
export DATA_JUICER_CACHE_HOME=/data/dj-cache
export DATA_JUICER_MODELS_CACHE=/models/dj-models
dj-process --config config.yaml
Cache Compression#
For large-scale datasets (tens of GB or more), cache files can consume significant disk space. Cache compression reduces storage requirements by compressing intermediate cache files after each operator completes.
Supported Algorithms#
| Algorithm | Library | Speed | Compression Ratio | Recommended For |
|---|---|---|---|---|
| zstd | zstandard | Fast | High | General use (default) |
| lz4 | lz4 | Fastest | Moderate | Speed-critical workloads |
| gzip | gzip | Slow | High | Compatibility needs |
Configuration#
use_cache: true
cache_compress: zstd # Enable zstd compression
dj-process --config config.yaml --cache_compress zstd
Multi-Process Compression#
Cache compression supports parallel processing. The number of compression worker processes is controlled by the np parameter:
np: 4 # Number of parallel workers (also used for compression)
cache_compress: zstd
Cache Control API#
DatasetCacheControl#
A context manager to temporarily enable or disable HuggingFace dataset caching within a specific scope:
from data_juicer.utils.cache_utils import DatasetCacheControl

# Temporarily disable caching
with DatasetCacheControl(on=False):
    # Operations here will not use cache
    result = dataset.map(my_function)

# Temporarily enable caching
with DatasetCacheControl(on=True):
    # Operations here will use cache
    result = dataset.map(my_function)
dataset_cache_control Decorator#
A decorator for functions that need to control cache state:
from data_juicer.utils.cache_utils import dataset_cache_control

@dataset_cache_control(on=False)
def process_without_cache(dataset):
    return dataset.map(my_function)
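The decorator form applies the same save/restore discipline around each call. This is a hypothetical, self-contained sketch in which CACHE_ON stands in for the real caching state:

```python
import functools

CACHE_ON = True  # stand-in global flag, for illustration only

def cache_control_sketch(on):
    """Hedged sketch of a cache-control decorator: set the flag for
    the duration of the call, then restore it, even on exceptions."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            global CACHE_ON
            prev = CACHE_ON
            CACHE_ON = on
            try:
                return fn(*args, **kwargs)
            finally:
                CACHE_ON = prev  # restore even if fn raises
        return wrapper
    return decorator

@cache_control_sketch(on=False)
def snapshot():
    return CACHE_ON

assert snapshot() is False  # flag is off inside the call
assert CACHE_ON is True     # and restored afterwards
```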
CompressionOff#
A context manager to temporarily disable cache compression:
from data_juicer.utils.compress import CompressionOff

with CompressionOff():
    # Cache compression is disabled in this scope
    result = dataset.map(my_function)
CompressManager#
Low-level API for manual compression/decompression:
from data_juicer.utils.compress import CompressManager
manager = CompressManager(compressor_format="zstd")
# Compress a file
manager.compress("input.arrow", "input.arrow.zstd")
# Decompress a file
manager.decompress("input.arrow.zstd", "input.arrow")
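To experiment with the same file-level roundtrip without DataJuicer installed, a minimal stand-in using Python's stdlib gzip looks like this (compress_file and decompress_file are hypothetical helpers, not DataJuicer's API):

```python
import gzip
import shutil

def compress_file(src, dst):
    """Stream-compress src into dst using stdlib gzip."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

def decompress_file(src, dst):
    """Stream-decompress src back into dst."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```

Streaming with copyfileobj keeps memory usage flat even for multi-GB Arrow cache files.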
CacheCompressManager#
High-level API for managing HuggingFace dataset cache compression:
from data_juicer.utils.compress import CacheCompressManager
manager = CacheCompressManager(compressor_format="zstd")
# Compress previous dataset's cache files
manager.compress(prev_ds=previous_dataset, this_ds=current_dataset, num_proc=4)
# Decompress cache files for a dataset
manager.decompress(ds=dataset, num_proc=4)
# Clean up all compressed cache files
manager.cleanup_cache_files(ds=dataset)
Cache vs Checkpoint#
Cache and checkpoint are mutually exclusive — enabling checkpoint automatically disables cache:
| Feature | Cache | Checkpoint |
|---|---|---|
| Purpose | Accelerate repeated runs with same configuration | Fault recovery and resumption |
| Granularity | Per-operator result | Full dataset snapshot |
| Storage Location | HuggingFace cache directory | Work directory |
| Recovery Method | Automatic (hash-based) | Manual (config-based) |
| Compression | Supported (cache_compress) | Not applicable |
| Scenario | Iterative development, parameter tuning | Long-running production tasks |
# Cache mode (default)
use_cache: true
use_checkpoint: false
# Checkpoint mode (cache auto-disabled)
use_cache: true # Will be overridden to false
use_checkpoint: true
Disabling Cache and Temporary Directory#
When use_cache: false is set, or when checkpoint mode is enabled (use_checkpoint: true), HuggingFace dataset caching is fully disabled. In this mode, DataJuicer writes intermediate files produced during operator processing to a temporary directory and cleans them up automatically when processing completes. The temp_dir parameter controls where these intermediate files are stored.
Behavior#
Defaults to null: When not set, the operating system determines the temporary directory location (typically /tmp), equivalent to Python's tempfile.gettempdir().
Takes effect automatically when cache is disabled: Once caching is disabled, temp_dir is applied as the global temporary directory for the entire process via Python's tempfile.tempdir, affecting all temporary files created through tempfile in the process.
Cache compression is automatically disabled: When caching is disabled, cache_compress is ignored and reset to null.
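The process-wide effect of tempfile.tempdir described above can be verified with a short stdlib-only snippet (the dj-temp-demo path is just an example location):

```python
import os
import tempfile

# Assigning tempfile.tempdir redirects every tempfile-based helper
# in the process, as the doc describes.
custom = os.path.join(tempfile.gettempdir(), "dj-temp-demo")
os.makedirs(custom, exist_ok=True)

prev = tempfile.tempdir
tempfile.tempdir = custom
try:
    with tempfile.NamedTemporaryFile() as f:
        assert f.name.startswith(custom)  # created under the custom dir
finally:
    tempfile.tempdir = prev  # restore the process-wide default
```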
Configuration#
use_cache: false
temp_dir: /data/dj-temp # Custom temp directory; null means system default
dj-process --config config.yaml --use_cache false --temp_dir /data/dj-temp
Safety Notes#
Set temp_dir with caution: an unsafe path can cause unexpected program behavior.
Do not point to critical system directories (e.g., /, /usr, /etc): automatic cleanup of temporary files may accidentally delete important files.
Do not point to directories containing important data: temporary file writes and cleanup operations may conflict with existing files.
Ensure sufficient disk space: when cache is disabled, intermediate files are written and deleted dynamically during processing; peak disk usage is approximately equal to the output size of a single operator.
The directory is created automatically if it does not exist: DataJuicer calls os.makedirs to create the specified path if it is missing.
temp_dir affects the entire process's tempfile behavior: because it sets the global tempfile.tempdir variable, this setting influences all components in the process that rely on tempfile, including third-party libraries.
Performance Considerations#
When to Enable Cache#
Enable: For iterative development where you frequently re-run pipelines with minor changes
Enable: When operators are computationally expensive and you want to skip already-computed steps
Disable: For one-shot processing to avoid disk overhead
When to Enable Compression#
Enable: When dataset size exceeds tens of GB and disk space is limited
Enable zstd: For the best balance of speed and compression ratio
Enable lz4: When compression speed is critical
Disable: When disk space is abundant and you want maximum processing speed
Troubleshooting#
Cache files consuming too much disk space:
# Check cache directory size
du -sh ~/.cache/data_juicer/
# Enable compression
dj-process --config config.yaml --cache_compress zstd
Stale cache causing unexpected results:
# Clear HuggingFace dataset cache
rm -rf ~/.cache/huggingface/datasets/
# Or specify a fresh cache directory
dj-process --config config.yaml --ds_cache_dir /tmp/fresh-cache