Cache Management#

This document describes DataJuicer’s cache management system, including HuggingFace dataset caching, cache directory configuration, cache compression, and temporary storage.

Overview#

DataJuicer provides a caching mechanism based on HuggingFace Datasets to avoid redundant computation. When enabled, each operator generates cache files with a unique fingerprint based on:

The fingerprint of the input data
The operator name and parameters
The hash of the processing function

In other words, the same input combined with the same operator configuration yields the same fingerprint, and therefore a cache hit. Re-running the same pipeline on the same data skips the steps that have already been computed. For more details, please refer to our paper: Data-Juicer: A One-Stop Data Processing System for Large Language Models.
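The fingerprint rule above can be illustrated with a minimal sketch. This is not the hashing code used by HuggingFace Datasets or DataJuicer: `op_fingerprint` and its arguments are hypothetical names, the operator name and parameter are only sample values, and SHA-256 over a JSON payload stands in for the real hasher.

```python
import hashlib
import json


def op_fingerprint(input_fingerprint: str, op_name: str, op_params: dict, func_source: str) -> str:
    """Hypothetical helper: derive a deterministic fingerprint for one operator step."""
    payload = json.dumps(
        {
            "input": input_fingerprint,
            "op": op_name,
            "params": op_params,
            "func": hashlib.sha256(func_source.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Same input and same configuration give the same fingerprint (cache hit).
first = op_fingerprint("abc123", "text_length_filter", {"min_len": 10}, "def f(x): ...")
rerun = op_fingerprint("abc123", "text_length_filter", {"min_len": 10}, "def f(x): ...")

# Changing a single parameter gives a different fingerprint (cache miss).
tuned = op_fingerprint("abc123", "text_length_filter", {"min_len": 20}, "def f(x): ...")
```

Because the input fingerprint is part of the payload, a change in an early operator also changes the fingerprints of every later step, so only the unchanged prefix of a pipeline is served from cache.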

The cache system also provides:

Configurable cache directories via environment variables or config options
Cache compression to reduce disk usage for large-scale datasets
Temporary storage for intermediate files in non-cache mode
Fine-grained cache control via context managers and decorators

Configuration#

Basic Cache Settings#

use_cache: true           # Enable/disable HuggingFace dataset caching
ds_cache_dir: null         # Custom cache directory (overrides HF_DATASETS_CACHE)
cache_compress: null       # Compression method: 'gzip', 'zstd', 'lz4', or null
temp_dir: null             # Temp directory for intermediate files when cache is disabled

Command Line#

# Enable caching (default)
dj-process --config config.yaml --use_cache true

# Disable caching
dj-process --config config.yaml --use_cache false

# Enable cache compression
dj-process --config config.yaml --cache_compress zstd

# Custom cache directory
dj-process --config config.yaml --ds_cache_dir /fast-storage/dj-cache

Cache Directory Structure#

DataJuicer organizes cache files in a hierarchical directory structure controlled by environment variables:

~/.cache/                              # CACHE_HOME (default)
└── data_juicer/                       # DATA_JUICER_CACHE_HOME
    ├── assets/                        # DATA_JUICER_ASSETS_CACHE
    │   └── (extracted frames, stopwords, flagged words, etc.)
    └── models/                        # DATA_JUICER_MODELS_CACHE
        └── (downloaded model files)

Environment Variables#

Variable	Default	Description
`CACHE_HOME`	`~/.cache`	Root cache directory
`DATA_JUICER_CACHE_HOME`	`$CACHE_HOME/data_juicer`	DataJuicer cache root
`DATA_JUICER_ASSETS_CACHE`	`$DATA_JUICER_CACHE_HOME/assets`	Assets cache (frames, word lists, etc.)
`DATA_JUICER_MODELS_CACHE`	`$DATA_JUICER_CACHE_HOME/models`	Downloaded models cache
`DATA_JUICER_EXTERNAL_MODELS_HOME`	`None`	External models directory

Override defaults by setting environment variables:

export DATA_JUICER_CACHE_HOME=/data/dj-cache
export DATA_JUICER_MODELS_CACHE=/models/dj-models
dj-process --config config.yaml
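As a rough illustration of how the defaults in the table cascade, the sketch below resolves each directory from an environment mapping. `resolve_cache_dirs` is a hypothetical helper written for this example, not a DataJuicer function; only the variable names and default paths come from the table.

```python
import os


def resolve_cache_dirs(env=None):
    """Hypothetical helper mirroring the documented defaults."""
    env = os.environ if env is None else env
    cache_home = env.get("CACHE_HOME", os.path.expanduser("~/.cache"))
    dj_home = env.get("DATA_JUICER_CACHE_HOME", os.path.join(cache_home, "data_juicer"))
    return {
        "CACHE_HOME": cache_home,
        "DATA_JUICER_CACHE_HOME": dj_home,
        # Child directories follow the DataJuicer cache root unless set explicitly.
        "DATA_JUICER_ASSETS_CACHE": env.get("DATA_JUICER_ASSETS_CACHE", os.path.join(dj_home, "assets")),
        "DATA_JUICER_MODELS_CACHE": env.get("DATA_JUICER_MODELS_CACHE", os.path.join(dj_home, "models")),
        "DATA_JUICER_EXTERNAL_MODELS_HOME": env.get("DATA_JUICER_EXTERNAL_MODELS_HOME"),
    }


# Equivalent to the two `export` lines shown above.
dirs = resolve_cache_dirs({
    "DATA_JUICER_CACHE_HOME": "/data/dj-cache",
    "DATA_JUICER_MODELS_CACHE": "/models/dj-models",
})
```

With these overrides the assets cache follows the new cache root, while the models cache uses its own explicit path.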

Cache Compression#

For large-scale datasets (tens of GB or more), cache files can consume significant disk space. Cache compression reduces storage requirements by compressing intermediate cache files after each operator completes.

Supported Algorithms#

Algorithm	Library	Speed	Compression Ratio	Recommended For
`zstd`	zstandard	Fast	High	General use (recommended)
`lz4`	lz4	Fastest	Moderate	Speed-critical workloads
`gzip`	gzip	Slow	High	Compatibility needs

Configuration#

use_cache: true
cache_compress: zstd    # Enable zstd compression

dj-process --config config.yaml --cache_compress zstd

Multi-Process Compression#

Cache compression supports parallel processing. The number of compression worker processes is controlled by the np parameter:

np: 4                   # Number of parallel workers (also used for compression)
cache_compress: zstd
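The idea of splitting compression work across workers can be sketched with the standard library alone. This is an illustration under stated assumptions, not DataJuicer's implementation: `gzip` stands in for `zstd` so that no extra package is needed, a thread pool stands in for worker processes, and `compress_file` and `compress_all` are hypothetical names.

```python
import gzip
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def compress_file(path: Path) -> Path:
    """Compress one file to `<name>.gz` and remove the original."""
    target = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    path.unlink()
    return target


def compress_all(paths, num_workers: int = 4):
    """Compress files concurrently; each worker handles whole files."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(compress_file, paths))


with tempfile.TemporaryDirectory() as tmp:
    # Create a few fake cache files with highly compressible content.
    files = []
    for i in range(8):
        f = Path(tmp) / f"cache-{i}.arrow"
        f.write_bytes(b"row data " * 10_000)
        files.append(f)

    compressed = compress_all(files, num_workers=4)
    sizes = [p.stat().st_size for p in compressed]
    restored = gzip.decompress(compressed[0].read_bytes())
```

Each file is an independent unit of work, which is why the number of workers can be raised without any coordination between them.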

Cache Control API#

DatasetCacheControl#

A context manager to temporarily enable or disable HuggingFace dataset caching within a specific scope:

from data_juicer.utils.cache_utils import DatasetCacheControl

# Temporarily disable caching
with DatasetCacheControl(on=False):
    # Operations here will not use cache
    result = dataset.map(my_function)

# Temporarily enable caching
with DatasetCacheControl(on=True):
    # Operations here will use cache
    result = dataset.map(my_function)

dataset_cache_control Decorator#

A decorator for functions that need to control cache state:

from data_juicer.utils.cache_utils import dataset_cache_control

@dataset_cache_control(on=False)
def process_without_cache(dataset):
    return dataset.map(my_function)
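The pattern shared by the context manager and the decorator can be sketched without DataJuicer installed. In this sketch a small dictionary stands in for HuggingFace Datasets' global caching switch, and `cache_control` is a hypothetical class built on `contextlib.ContextDecorator`; the actual classes in `data_juicer.utils.cache_utils` may be implemented differently.

```python
from contextlib import ContextDecorator

# Stand-in for the global caching switch (an assumption for this sketch).
_state = {"caching_enabled": True}


class cache_control(ContextDecorator):
    """Hypothetical toggle usable as a context manager or as a decorator."""

    def __init__(self, on: bool = False):
        self.on = on
        self._previous = None

    def __enter__(self):
        # Remember the current setting so it can be restored on exit.
        self._previous = _state["caching_enabled"]
        _state["caching_enabled"] = self.on
        return self

    def __exit__(self, exc_type, exc, tb):
        _state["caching_enabled"] = self._previous  # restored even if the body raised
        return False  # do not swallow exceptions


with cache_control(on=False):
    inside_block = _state["caching_enabled"]


@cache_control(on=False)
def run_without_cache():
    return _state["caching_enabled"]


inside_function = run_without_cache()
after = _state["caching_enabled"]
```

Restoring the previous value, rather than unconditionally re-enabling the cache, keeps the scope safe to use when caching was already off.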

CompressionOff#

A context manager to temporarily disable cache compression:

from data_juicer.utils.compress import CompressionOff

with CompressionOff():
    # Cache compression is disabled in this scope
    result = dataset.map(my_function)

CompressManager#

Low-level API for manual compression/decompression:

from data_juicer.utils.compress import CompressManager

manager = CompressManager(compressor_format="zstd")

# Compress a file
manager.compress("input.arrow", "input.arrow.zstd")

# Decompress a file
manager.decompress("input.arrow.zstd", "input.arrow")

CacheCompressManager#

High-level API for managing HuggingFace dataset cache compression:

from data_juicer.utils.compress import CacheCompressManager

manager = CacheCompressManager(compressor_format="zstd")

# Compress previous dataset's cache files
manager.compress(prev_ds=previous_dataset, this_ds=current_dataset, num_proc=4)

# Decompress cache files for a dataset
manager.decompress(ds=dataset, num_proc=4)

# Clean up all compressed cache files
manager.cleanup_cache_files(ds=dataset)

Cache vs Checkpoint#

Cache and checkpoint are mutually exclusive: enabling checkpoint automatically disables cache. The two features compare as follows:

Feature	Cache	Checkpoint
Purpose	Accelerate repeated runs with same configuration	Fault recovery and resumption
Granularity	Per-operator result	Full dataset snapshot
Storage Location	HuggingFace cache directory	Work directory
Recovery Method	Automatic (hash-based)	Manual (config-based)
Compression	Supported (`cache_compress`)	Not applicable
Scenario	Iterative development, parameter tuning	Long-running production tasks

# Cache mode (default)
use_cache: true
use_checkpoint: false

# Checkpoint mode (cache auto-disabled)
use_cache: true           # Will be overridden to false
use_checkpoint: true
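The precedence between these options can be written down as a small function. `normalize_cache_options` is a hypothetical helper for illustration only; it encodes the rules stated in this document (checkpoint disables cache, and a disabled cache resets `cache_compress`), not DataJuicer's actual config-handling code.

```python
def normalize_cache_options(cfg: dict) -> dict:
    """Hypothetical helper applying the documented precedence rules."""
    resolved = dict(cfg)  # leave the caller's config untouched
    if resolved.get("use_checkpoint", False):
        resolved["use_cache"] = False  # checkpoint mode overrides cache
    if not resolved.get("use_cache", True):
        resolved["cache_compress"] = None  # compression only applies to cache files
    return resolved


cache_mode = normalize_cache_options(
    {"use_cache": True, "use_checkpoint": False, "cache_compress": "zstd"}
)
checkpoint_mode = normalize_cache_options(
    {"use_cache": True, "use_checkpoint": True, "cache_compress": "zstd"}
)
```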

Disabling Cache and Temporary Directory#

When use_cache is set to false, or when checkpoint mode is enabled (use_checkpoint: true), HuggingFace dataset caching is fully disabled. In this mode, DataJuicer writes the intermediate files produced during operator processing to a temporary directory and cleans them up automatically when processing completes. The temp_dir parameter controls where these intermediate files are stored.

Behavior#

Defaults to null: When not set, the operating system determines the temporary directory location (typically /tmp), equivalent to Python’s tempfile.gettempdir().
Takes effect automatically when cache is disabled: Once caching is disabled, temp_dir is applied as the global temporary directory for the entire process via Python’s tempfile.tempdir, affecting all temporary files created through tempfile in the process.
Cache compression is automatically disabled: When caching is disabled, cache_compress is automatically ignored and reset to null.
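The effect of the global `tempfile.tempdir` setting can be demonstrated with the standard library. `apply_temp_dir` below is a hypothetical helper that mimics the described behavior; it is a sketch, not DataJuicer's code.

```python
import os
import shutil
import tempfile


def apply_temp_dir(use_cache: bool, temp_dir):
    """Hypothetical helper: point the process-wide tempfile location at temp_dir."""
    if use_cache or temp_dir is None:
        return tempfile.gettempdir()  # leave the system default untouched
    os.makedirs(temp_dir, exist_ok=True)
    tempfile.tempdir = temp_dir  # global: affects every tempfile user in this process
    return tempfile.gettempdir()


original = tempfile.tempdir
base = tempfile.mkdtemp()
try:
    custom = os.path.join(base, "dj-temp")
    active = apply_temp_dir(use_cache=False, temp_dir=custom)
    # Any later tempfile call now lands in the custom directory.
    with tempfile.NamedTemporaryFile() as handle:
        created_in = os.path.dirname(handle.name)
finally:
    tempfile.tempdir = original  # restore the previous setting
    shutil.rmtree(base, ignore_errors=True)
```

The temporary file is created in the custom directory even though the call itself never mentions it, which is exactly why the setting also affects third-party libraries running in the same process.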

Configuration#

use_cache: false
temp_dir: /data/dj-temp    # Custom temp directory; null means system default

dj-process --config config.yaml --use_cache false --temp_dir /data/dj-temp

Safety Notes#

Set temp_dir with caution — an unsafe path can cause unexpected program behavior.

Do not point to critical system directories (e.g., /, /usr, /etc). Automatic cleanup of temporary files may accidentally delete important files.
Do not point to directories containing important data. Temporary file writes and cleanup operations may conflict with existing files.
Ensure sufficient disk space: When cache is disabled, intermediate files are written and deleted dynamically during processing. Peak disk usage is approximately equal to the output size of a single operator.
The directory is created automatically if it does not exist: DataJuicer calls os.makedirs to create the specified path if it is missing.
temp_dir affects the entire process’s tempfile behavior: Because it sets the global tempfile.tempdir variable, this setting influences all components in the process that rely on tempfile, including third-party libraries.
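A defensive check along the lines of these notes could look like the sketch below. `check_temp_dir` and `CRITICAL_DIRS` are hypothetical names, the deny-list contains only the example paths mentioned above, and the check does not resolve symbolic links; DataJuicer itself is not stated to perform this validation.

```python
import os
import tempfile

# Illustrative deny-list taken from the examples above; extend it for your environment.
CRITICAL_DIRS = {"/", "/usr", "/etc"}


def check_temp_dir(temp_dir: str) -> str:
    """Hypothetical guard: reject critical system paths and non-empty directories."""
    resolved = os.path.abspath(temp_dir)  # normalizes trailing slashes; symlinks are not resolved
    if resolved in CRITICAL_DIRS:
        raise ValueError(f"refusing to use critical directory: {resolved}")
    if os.path.isdir(resolved) and os.listdir(resolved):
        raise ValueError(f"directory already contains files: {resolved}")
    return resolved


# An empty, dedicated directory passes the check.
with tempfile.TemporaryDirectory() as empty_dir:
    accepted = check_temp_dir(empty_dir)
    expected = os.path.abspath(empty_dir)
```

Running such a check before assigning temp_dir turns a silent misconfiguration into an early, explicit error.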

Performance Considerations#

When to Enable Cache#

Enable: For iterative development where you frequently re-run pipelines with minor changes
Enable: When operators are computationally expensive and you want to skip already-computed steps
Disable: For one-shot processing to avoid disk overhead

When to Enable Compression#

Enable: When dataset size exceeds tens of GB and disk space is limited
Enable zstd: For the best balance of speed and compression ratio
Enable lz4: When compression speed is critical
Disable: When disk space is abundant and you want maximum processing speed

Troubleshooting#

Cache files consuming too much disk space:

# Check cache directory sizes (default locations)
du -sh ~/.cache/data_juicer/
du -sh ~/.cache/huggingface/datasets/

# Enable compression
dj-process --config config.yaml --cache_compress zstd

Stale cache causing unexpected results:

# Clear the HuggingFace dataset cache (default location; use your
# ds_cache_dir or HF_DATASETS_CACHE path instead if one is set)
rm -rf ~/.cache/huggingface/datasets/

# Or specify a fresh cache directory
dj-process --config config.yaml --ds_cache_dir /tmp/fresh-cache
