Dataset Export

This document describes how DataJuicer exports processed datasets, including supported formats, sharding, parallel export, S3 export, and stats/hash management.

Overview

After processing, DataJuicer exports the result dataset to disk using the Exporter (default mode) or RayExporter (Ray mode). The export system supports:

  • Multiple output formats: JSONL, JSON, Parquet, and more in Ray mode

  • Shard export: split large datasets into multiple files by size

  • Parallel export: speed up single-file export with multiprocessing

  • S3 export: write results directly to Amazon S3 or S3-compatible storage

  • Stats and hash management: control which intermediate fields are kept in the output

Configuration

Basic Settings

export_path: ./outputs/result.jsonl       # Output file path (required)
export_type: jsonl                         # Format type (auto-detected from path if omitted)
export_shard_size: 0                       # Shard size in bytes (0 = single file)
export_in_parallel: false                  # Parallel export for single-file mode
keep_stats_in_res_ds: false                # Keep computed stats in output
keep_hashes_in_res_ds: false               # Keep computed hashes in output
export_extra_args: {}                      # Additional format-specific arguments
export_aws_credentials: null               # For S3 export, see S3 section for details
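
When export_type is omitted, the format is inferred from the export_path suffix. A minimal sketch of that inference, using a hypothetical helper rather than DataJuicer's actual code:

```python
from pathlib import Path

# Formats the default-mode Exporter understands (see "Supported Formats").
KNOWN_TYPES = {"jsonl", "json", "parquet"}

def infer_export_type(export_path: str, default: str = "jsonl") -> str:
    """Guess the export format from the file suffix; fall back to jsonl."""
    suffix = Path(export_path).suffix.lstrip(".").lower()
    return suffix if suffix in KNOWN_TYPES else default

print(infer_export_type("./outputs/result.parquet"))  # parquet
print(infer_export_type("./outputs/result"))          # jsonl
```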

Command Line

# Basic export
dj-process --config config.yaml --export_path ./outputs/result.jsonl

# Export as Parquet
dj-process --config config.yaml --export_path ./outputs/result.parquet

# Export with sharding (256MB per shard)
dj-process --config config.yaml --export_shard_size 268435456

# Keep stats in output
dj-process --config config.yaml --keep_stats_in_res_ds true

Supported Formats

Default Mode (Exporter)

Format    Suffix     Description
JSONL     .jsonl     JSON Lines: one JSON object per line (default)
JSON      .json      Standard JSON array
Parquet   .parquet   Columnar format, efficient for large datasets

Ray Mode (RayExporter)

Format       Suffix        Description
JSONL        .jsonl        JSON Lines
JSON         .json         Standard JSON
Parquet      .parquet      Columnar format
CSV          .csv          Comma-separated values
TFRecords    .tfrecords    TensorFlow record format
WebDataset   webdataset    WebDataset tar-based format
Lance        .lance        Lance columnar format

Shard Export

For large datasets, split the output into multiple shard files based on size:

export_path: ./outputs/result.jsonl
export_shard_size: 268435456              # 256 MB per shard

This produces files like:

outputs/
├── result-00-of-04.jsonl
├── result-01-of-04.jsonl
├── result-02-of-04.jsonl
└── result-03-of-04.jsonl

How shard size is calculated:

  1. The total dataset size in bytes is estimated

  2. Number of shards = ceil(dataset_bytes / export_shard_size)

  3. The dataset is split into contiguous shards

  4. Each shard is exported in parallel using multiprocessing
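
The steps above can be sketched as follows; the shard-name pattern matches the example output shown earlier, but the helper itself is illustrative, not DataJuicer's implementation:

```python
import math

def plan_shards(dataset_bytes: int, export_shard_size: int,
                stem: str = "result", ext: str = "jsonl") -> list[str]:
    """Compute the shard count and the resulting shard file names."""
    num_shards = max(1, math.ceil(dataset_bytes / export_shard_size))
    return [f"{stem}-{i:02d}-of-{num_shards:02d}.{ext}" for i in range(num_shards)]

# A 1 GiB dataset with 256 MiB shards yields 4 files:
print(plan_shards(2**30, 268435456))
# ['result-00-of-04.jsonl', 'result-01-of-04.jsonl',
#  'result-02-of-04.jsonl', 'result-03-of-04.jsonl']
```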

Recommended shard sizes:

Dataset Size   Recommended Shard Size   Notes
< 1 GB         0 (single file)          No need to shard
1-10 GB        256 MB - 512 MB          Good balance
10-100 GB      512 MB - 1 GB            Fewer files
> 100 GB       1 GB - 10 GB             Avoid too many shards

Shard sizes below 1 MiB or above 1 TiB will trigger warnings.

Parallel Export

For single-file export (export_shard_size: 0), enable parallel writing to speed up the process:

export_path: ./outputs/result.jsonl
export_shard_size: 0
export_in_parallel: true
np: 4                                     # Number of parallel processes

Important: Parallel export can sometimes be slower than sequential export due to I/O blocking, especially for very large datasets. If you observe this, set export_in_parallel: false.

When export_shard_size > 0, shards are always exported in parallel regardless of this setting.

S3 Export

Export results directly to Amazon S3 or S3-compatible storage.

Default Mode

export_path: "s3://my-bucket/outputs/result.jsonl"
export_aws_credentials:
  aws_access_key_id: "AKIA..."
  aws_secret_access_key: "secret..."
  aws_region: "us-east-1"
  endpoint_url: "https://s3.example.com"   # Optional: for S3-compatible storage

The default exporter uses HuggingFace's storage_options with fsspec/s3fs for S3 access.
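
As an illustration, the credential block above maps onto s3fs-style storage_options roughly like this (the key names follow s3fs conventions; DataJuicer's internal mapping may differ):

```python
def to_storage_options(creds: dict) -> dict:
    """Translate an export_aws_credentials block into s3fs storage_options."""
    options = {
        "key": creds.get("aws_access_key_id"),
        "secret": creds.get("aws_secret_access_key"),
    }
    client_kwargs = {}
    if creds.get("aws_region"):
        client_kwargs["region_name"] = creds["aws_region"]
    if creds.get("endpoint_url"):
        client_kwargs["endpoint_url"] = creds["endpoint_url"]
    if client_kwargs:
        options["client_kwargs"] = client_kwargs
    return options
```

The resulting dict is the shape fsspec-aware writers accept through their storage_options argument.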

Ray Mode

export_path: "s3://my-bucket/outputs/result.jsonl"
export_extra_args:
  aws_access_key_id: "AKIA..."
  aws_secret_access_key: "secret..."
  aws_region: "us-east-1"

The Ray exporter uses PyArrow's S3 filesystem for S3 access.

S3 with Sharding

When using S3 with shard export, shard files are written directly to S3:

export_path: "s3://my-bucket/outputs/result.jsonl"
export_shard_size: 268435456
export_aws_credentials:
  aws_access_key_id: "AKIA..."
  aws_secret_access_key: "secret..."

This produces S3 objects like:

s3://my-bucket/outputs/result-00-of-04.jsonl
s3://my-bucket/outputs/result-01-of-04.jsonl
...

Credential Resolution

AWS credentials are resolved in priority order:

  1. export_aws_credentials config (default mode) or export_extra_args (Ray mode)

  2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

  3. Default credential chain (IAM role, ~/.aws/credentials)
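
A sketch of this resolution order (illustrative; the actual lookup lives inside the exporters):

```python
import os

def resolve_credentials(config_creds=None):
    """Resolve AWS credentials in the documented priority order."""
    # 1. An explicit config block always wins.
    if config_creds:
        return config_creds
    # 2. Fall back to environment variables.
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"aws_access_key_id": key, "aws_secret_access_key": secret}
    # 3. None: defer to the default chain (IAM role, ~/.aws/credentials).
    return None
```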

Stats and Hash Management

During processing, DataJuicer computes intermediate fields:

  • Stats (__dj__stats__, __dj__meta__): computed by Filter operators

  • Hashes (__dj__hash__, __dj__minhash__, __dj__simhash__, etc.): computed by Deduplicator operators

By default, these fields are removed from the exported dataset. To keep them:

keep_stats_in_res_ds: true                # Keep stats and meta fields
keep_hashes_in_res_ds: true               # Keep hash fields
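
What this means for a single record can be sketched as follows (field names as listed above; the filtering helper itself is illustrative):

```python
def strip_intermediate(sample: dict, keep_stats: bool = False,
                       keep_hashes: bool = False) -> dict:
    """Drop stats/meta and hash fields from a sample unless explicitly kept."""
    drop = set()
    if not keep_stats:
        drop |= {"__dj__stats__", "__dj__meta__"}
    if not keep_hashes:
        drop |= {"__dj__hash__", "__dj__minhash__", "__dj__simhash__"}
    return {k: v for k, v in sample.items() if k not in drop}

sample = {"text": "hello", "__dj__stats__": {"lang_score": 0.99}, "__dj__hash__": "abc"}
print(strip_intermediate(sample))                   # {'text': 'hello'}
print(strip_intermediate(sample, keep_stats=True))  # keeps __dj__stats__
```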

Stats Export

Regardless of keep_stats_in_res_ds, DataJuicer always exports a separate stats file alongside the main dataset:

outputs/
├── result.jsonl                          # Main dataset (stats removed by default)
└── result_stats.jsonl                    # Stats-only file (always exported)

The stats file contains only the __dj__stats__ and __dj__meta__ columns.

WebDataset Export (Ray Mode)

In Ray mode, you can export to WebDataset format with custom field mapping:

export_path: ./outputs/webdataset
export_type: webdataset
export_extra_args:
  field_mapping:
    txt: "text"
    png: "images"
    json: "metadata"

API Reference

Exporter (Default Mode)

from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path="./outputs/result.jsonl",
    export_type="jsonl",
    export_shard_size=0,
    export_in_parallel=True,
    num_proc=4,
    keep_stats_in_res_ds=False,
    keep_hashes_in_res_ds=False,
)

exporter.export(dataset)

RayExporter (Ray Mode)

from data_juicer.core.ray_exporter import RayExporter

exporter = RayExporter(
    export_path="./outputs/result.jsonl",
    export_type="jsonl",
    export_shard_size=268435456,
    keep_stats_in_res_ds=False,
    keep_hashes_in_res_ds=False,
)

exporter.export(ray_dataset)

Troubleshooting

Export format not supported:

# Check supported formats
# Default mode: jsonl, json, parquet
# Ray mode: jsonl, json, parquet, csv, tfrecords, webdataset, lance

Parallel export is slower than expected:

# Disable parallel export
export_in_parallel: false

S3 export fails with permission error:

# Verify credentials
aws s3 ls s3://your-bucket/

# Check that export_aws_credentials is configured

Too many shard files generated:

# Increase shard size
export_shard_size: 1073741824             # 1 GB

Stats missing from exported dataset:

# Keep stats in the result dataset
keep_stats_in_res_ds: true
# Or check the separate stats file: result_stats.jsonl