Dataset Export#
This document describes how DataJuicer exports processed datasets, including supported formats, sharding, parallel export, S3 export, and stats/hash management.
Overview#
After processing, DataJuicer exports the result dataset to disk using the Exporter (default mode) or RayExporter (Ray mode). The export system supports:
Multiple output formats: JSONL, JSON, Parquet, and more in Ray mode
Shard export: split large datasets into multiple files by size
Parallel export: speed up single-file export with multiprocessing
S3 export: write results directly to Amazon S3 or S3-compatible storage
Stats and hash management: control which intermediate fields are kept in the output
Configuration#
Basic Settings#
export_path: ./outputs/result.jsonl # Output file path (required)
export_type: jsonl # Format type (auto-detected from path if omitted)
export_shard_size: 0 # Shard size in bytes (0 = single file)
export_in_parallel: false # Parallel export for single-file mode
keep_stats_in_res_ds: false # Keep computed stats in output
keep_hashes_in_res_ds: false # Keep computed hashes in output
export_extra_args: {} # Additional format-specific arguments
export_aws_credentials: null # For S3 export, see S3 section for details
Command Line#
# Basic export
dj-process --config config.yaml --export_path ./outputs/result.jsonl
# Export as Parquet
dj-process --config config.yaml --export_path ./outputs/result.parquet
# Export with sharding (256MB per shard)
dj-process --config config.yaml --export_shard_size 268435456
# Keep stats in output
dj-process --config config.yaml --keep_stats_in_res_ds true
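Shard sizes are specified in bytes; a small helper (hypothetical, not part of DataJuicer) makes the arithmetic behind values like 268435456 explicit:

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes for export_shard_size."""
    return mib * 1024 * 1024

# 256 MiB corresponds to the value used in the example above
print(mib_to_bytes(256))  # 268435456
```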
Supported Formats#
Default Mode (Exporter)#
| Format | Suffix | Description |
|---|---|---|
| JSONL | .jsonl | JSON Lines: one JSON object per line (default) |
| JSON | .json | Standard JSON array |
| Parquet | .parquet | Columnar format, efficient for large datasets |
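When export_type is omitted, the format is auto-detected from the path suffix. Conceptually this is a suffix lookup; the helper below is illustrative, not DataJuicer's actual implementation:

```python
import os

# Suffix-to-format mapping for default mode (illustrative)
SUFFIX_TO_TYPE = {".jsonl": "jsonl", ".json": "json", ".parquet": "parquet"}

def infer_export_type(export_path: str) -> str:
    """Guess the export format from the file suffix of export_path."""
    suffix = os.path.splitext(export_path)[1].lower()
    if suffix not in SUFFIX_TO_TYPE:
        raise ValueError(f"Unsupported export suffix: {suffix}")
    return SUFFIX_TO_TYPE[suffix]

print(infer_export_type("./outputs/result.parquet"))  # parquet
```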
Ray Mode (RayExporter)#
| Format | Suffix | Description |
|---|---|---|
| JSONL | .jsonl | JSON Lines |
| JSON | .json | Standard JSON |
| Parquet | .parquet | Columnar format |
| CSV | .csv | Comma-separated values |
| TFRecords | .tfrecords | TensorFlow record format |
| WebDataset | .tar | WebDataset tar-based format |
| Lance | .lance | Lance columnar format |
Parallel Export#
For single-file export (export_shard_size: 0), enable parallel writing to speed up the process:
export_path: ./outputs/result.jsonl
export_shard_size: 0
export_in_parallel: true
np: 4 # Number of parallel processes
Important: Parallel export can sometimes be slower than sequential export due to IO blocking, especially for very large datasets. If you observe this, set export_in_parallel: false.
When export_shard_size > 0, shards are always exported in parallel regardless of this setting.
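The number of shard files implied by a given export_shard_size is a ceiling division over the dataset's total byte size (a sketch; DataJuicer's exact partitioning may differ):

```python
import math

def estimated_shard_count(total_bytes: int, shard_size: int) -> int:
    """Estimate how many shard files a dataset of total_bytes produces."""
    if shard_size <= 0:
        return 1  # 0 means single-file export
    return max(1, math.ceil(total_bytes / shard_size))

# A 1 GiB dataset with 256 MiB shards -> 4 shard files
print(estimated_shard_count(1 << 30, 268435456))  # 4
```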
S3 Export#
Export results directly to Amazon S3 or S3-compatible storage.
Default Mode#
export_path: "s3://my-bucket/outputs/result.jsonl"
export_aws_credentials:
aws_access_key_id: "AKIA..."
aws_secret_access_key: "secret..."
aws_region: "us-east-1"
endpoint_url: "https://s3.example.com" # Optional: for S3-compatible storage
The default exporter uses HuggingFace's storage_options with fsspec/s3fs for S3 access.
Ray Mode#
export_path: "s3://my-bucket/outputs/result.jsonl"
export_extra_args:
aws_access_key_id: "AKIA..."
aws_secret_access_key: "secret..."
aws_region: "us-east-1"
The Ray exporter uses PyArrow's S3 filesystem for S3 access.
S3 with Sharding#
When using S3 with shard export, shard files are written directly to S3:
export_path: "s3://my-bucket/outputs/result.jsonl"
export_shard_size: 268435456
export_aws_credentials:
aws_access_key_id: "AKIA..."
aws_secret_access_key: "secret..."
This produces S3 objects like:
s3://my-bucket/outputs/result-00-of-04.jsonl
s3://my-bucket/outputs/result-01-of-04.jsonl
...
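The -NN-of-MM naming shown above can be reproduced with a small helper (hypothetical, for illustration only):

```python
import os

def shard_names(export_path: str, num_shards: int) -> list[str]:
    """Generate shard file names following the name-NN-of-MM.suffix pattern."""
    root, suffix = os.path.splitext(export_path)
    return [f"{root}-{i:02d}-of-{num_shards:02d}{suffix}" for i in range(num_shards)]

print(shard_names("s3://my-bucket/outputs/result.jsonl", 4)[0])
# s3://my-bucket/outputs/result-00-of-04.jsonl
```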
Credential Resolution#
AWS credentials are resolved in priority order:
1. export_aws_credentials config (default mode) or export_extra_args (Ray mode)
2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
3. Default credential chain (IAM role, ~/.aws/credentials)
Stats and Hash Management#
During processing, DataJuicer computes intermediate fields:
Stats (__dj__stats__, __dj__meta__): computed by Filter operators
Hashes (__dj__hash__, __dj__minhash__, __dj__simhash__, etc.): computed by Deduplicator operators
By default, these fields are removed from the exported dataset. To keep them:
keep_stats_in_res_ds: true # Keep stats and meta fields
keep_hashes_in_res_ds: true # Keep hash fields
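The effect of the two flags can be sketched per sample (illustrative; the real exporter drops whole dataset columns rather than filtering dicts):

```python
# Intermediate fields removed on export unless the keep_* flags are set
STATS_FIELDS = {"__dj__stats__", "__dj__meta__"}
HASH_FIELDS = {"__dj__hash__", "__dj__minhash__", "__dj__simhash__"}

def clean_sample(sample: dict, keep_stats: bool, keep_hashes: bool) -> dict:
    """Drop stats/hash fields from one sample, mirroring the export flags."""
    dropped = set()
    if not keep_stats:
        dropped |= STATS_FIELDS
    if not keep_hashes:
        dropped |= HASH_FIELDS
    return {k: v for k, v in sample.items() if k not in dropped}

sample = {"text": "hi", "__dj__stats__": {"len": 2}, "__dj__hash__": "abc"}
print(clean_sample(sample, keep_stats=False, keep_hashes=False))  # {'text': 'hi'}
```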
Stats Export#
Regardless of keep_stats_in_res_ds, DataJuicer always exports a separate stats file alongside the main dataset:
outputs/
├── result.jsonl        # Main dataset (stats removed by default)
└── result_stats.jsonl  # Stats-only file (always exported)
The stats file contains only the __dj__stats__ and __dj__meta__ columns.
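The stats file name is derived from the main export path by inserting a _stats suffix; a hypothetical helper showing the naming convention:

```python
import os

def stats_file_path(export_path: str) -> str:
    """Derive the companion stats file name from the main export path."""
    root, suffix = os.path.splitext(export_path)
    return f"{root}_stats{suffix}"

print(stats_file_path("./outputs/result.jsonl"))  # ./outputs/result_stats.jsonl
```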
WebDataset Export (Ray Mode)#
In Ray mode, you can export to WebDataset format with custom field mapping:
export_path: ./outputs/webdataset
export_type: webdataset
export_extra_args:
field_mapping:
txt: "text"
png: "images"
json: "metadata"
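field_mapping maps WebDataset extension keys (txt, png, json) to dataset column names. Conceptually, each record is repacked like this (a sketch, not the actual exporter code):

```python
def to_wds_sample(record: dict, field_mapping: dict) -> dict:
    """Map dataset columns to WebDataset extension keys, e.g. 'txt' -> record['text']."""
    return {ext: record[col] for ext, col in field_mapping.items() if col in record}

record = {"text": "a caption", "images": b"\x89PNG", "metadata": {"id": 1}}
mapping = {"txt": "text", "png": "images", "json": "metadata"}
print(to_wds_sample(record, mapping)["txt"])  # a caption
```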
API Reference#
Exporter (Default Mode)#
from data_juicer.core.exporter import Exporter
exporter = Exporter(
export_path="./outputs/result.jsonl",
export_type="jsonl",
export_shard_size=0,
export_in_parallel=True,
num_proc=4,
keep_stats_in_res_ds=False,
keep_hashes_in_res_ds=False,
)
exporter.export(dataset)
RayExporter (Ray Mode)#
from data_juicer.core.ray_exporter import RayExporter
exporter = RayExporter(
export_path="./outputs/result.jsonl",
export_type="jsonl",
export_shard_size=268435456,
keep_stats_in_res_ds=False,
keep_hashes_in_res_ds=False,
)
exporter.export(ray_dataset)
Troubleshooting#
Export format not supported:
# Check supported formats
# Default mode: jsonl, json, parquet
# Ray mode: jsonl, json, parquet, csv, tfrecords, webdataset, lance
Parallel export is slower than expected:
# Disable parallel export
export_in_parallel: false
S3 export fails with permission error:
# Verify credentials
aws s3 ls s3://your-bucket/
# Check that export_aws_credentials is configured
Too many shard files generated:
# Increase shard size
export_shard_size: 1073741824 # 1 GB
Stats missing from exported dataset:
# Keep stats in the result dataset
keep_stats_in_res_ds: true
# Or check the separate stats file: result_stats.jsonl