# Dataset Export

This document describes how DataJuicer exports processed datasets, including supported formats, sharding, parallel export, S3 export, and stats/hash management.

## Overview

After processing, DataJuicer exports the result dataset to disk using the `Exporter` (default mode) or `RayExporter` (Ray mode). The export system supports:

- **Multiple output formats** — JSONL, JSON, Parquet, and more in Ray mode
- **Shard export** — split large datasets into multiple files by size
- **Parallel export** — speed up single-file export with multiprocessing
- **S3 export** — write results directly to Amazon S3 or S3-compatible storage
- **Stats and hash management** — control which intermediate fields are kept in the output

## Configuration

### Basic Settings

```yaml
export_path: ./outputs/result.jsonl       # Output file path (required)
export_type: jsonl                         # Format type (auto-detected from path if omitted)
export_shard_size: 0                       # Shard size in bytes (0 = single file)
export_in_parallel: false                  # Parallel export for single-file mode
keep_stats_in_res_ds: false                # Keep computed stats in output
keep_hashes_in_res_ds: false               # Keep computed hashes in output
export_extra_args: {}                      # Additional format-specific arguments
export_aws_credentials: null               # For S3 export, see S3 section for details
```

### Command Line

```bash
# Basic export
dj-process --config config.yaml --export_path ./outputs/result.jsonl

# Export as Parquet
dj-process --config config.yaml --export_path ./outputs/result.parquet

# Export with sharding (256MB per shard)
dj-process --config config.yaml --export_shard_size 268435456

# Keep stats in output
dj-process --config config.yaml --keep_stats_in_res_ds true
```

## Supported Formats

### Default Mode (Exporter)

| Format | Suffix | Description |
|--------|--------|-------------|
| JSONL | `.jsonl` | JSON Lines — one JSON object per line (default) |
| JSON | `.json` | Standard JSON array |
| Parquet | `.parquet` | Columnar format, efficient for large datasets |

### Ray Mode (RayExporter)

| Format | Suffix | Description |
|--------|--------|-------------|
| JSONL | `.jsonl` | JSON Lines |
| JSON | `.json` | Standard JSON |
| Parquet | `.parquet` | Columnar format |
| CSV | `.csv` | Comma-separated values |
| TFRecords | `.tfrecords` | TensorFlow record format |
| WebDataset | `webdataset` | WebDataset tar-based format |
| Lance | `.lance` | Lance columnar format |

## Shard Export

For large datasets, split the output into multiple shard files based on size:

```yaml
export_path: ./outputs/result.jsonl
export_shard_size: 268435456              # 256 MB per shard
```

This produces files like:
```
outputs/
├── result-00-of-04.jsonl
├── result-01-of-04.jsonl
├── result-02-of-04.jsonl
└── result-03-of-04.jsonl
```

**How shard size is calculated:**
1. The total dataset size in bytes is estimated
2. Number of shards = `ceil(dataset_bytes / export_shard_size)`
3. The dataset is split into contiguous shards
4. Each shard is exported in parallel using multiprocessing

**Recommended shard sizes:**

| Dataset Size | Recommended Shard Size | Notes |
|-------------|----------------------|-------|
| < 1 GB | 0 (single file) | No need to shard |
| 1-10 GB | 256 MB - 512 MB | Good balance |
| 10-100 GB | 512 MB - 1 GB | Fewer files |
| > 100 GB | 1 GB - 10 GB | Avoid too many shards |

Shard sizes below 1 MiB or above 1 TiB will trigger warnings.

## Parallel Export

For single-file export (`export_shard_size: 0`), enable parallel writing to speed up the process:

```yaml
export_path: ./outputs/result.jsonl
export_shard_size: 0
export_in_parallel: true
np: 4                                     # Number of parallel processes
```

**Important**: Parallel export can sometimes be **slower** than sequential export due to IO blocking, especially for very large datasets. If you observe this, set `export_in_parallel: false`.

When `export_shard_size > 0`, shards are always exported in parallel regardless of this setting.

## S3 Export

Export results directly to Amazon S3 or S3-compatible storage.

### Default Mode

```yaml
export_path: "s3://my-bucket/outputs/result.jsonl"
export_aws_credentials:
  aws_access_key_id: "AKIA..."
  aws_secret_access_key: "secret..."
  aws_region: "us-east-1"
  endpoint_url: "https://s3.example.com"   # Optional: for S3-compatible storage
```

The default exporter uses HuggingFace's `storage_options` with `fsspec`/`s3fs` for S3 access.

### Ray Mode

```yaml
export_path: "s3://my-bucket/outputs/result.jsonl"
export_extra_args:
  aws_access_key_id: "AKIA..."
  aws_secret_access_key: "secret..."
  aws_region: "us-east-1"
```

The Ray exporter uses PyArrow's S3 filesystem for S3 access.

### S3 with Sharding

When using S3 with shard export, shard files are written directly to S3:

```yaml
export_path: "s3://my-bucket/outputs/result.jsonl"
export_shard_size: 268435456
export_aws_credentials:
  aws_access_key_id: "AKIA..."
  aws_secret_access_key: "secret..."
```

This produces S3 objects like:
```
s3://my-bucket/outputs/result-00-of-04.jsonl
s3://my-bucket/outputs/result-01-of-04.jsonl
...
```

### Credential Resolution

AWS credentials are resolved in priority order:
1. `export_aws_credentials` config (default mode) or `export_extra_args` (Ray mode)
2. Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
3. Default credential chain (IAM role, `~/.aws/credentials`)

## Stats and Hash Management

During processing, DataJuicer computes intermediate fields:
- **Stats** (`__dj__stats__`, `__dj__meta__`): computed by Filter operators
- **Hashes** (`__dj__hash__`, `__dj__minhash__`, `__dj__simhash__`, etc.): computed by Deduplicator operators

By default, these fields are **removed** from the exported dataset. To keep them:

```yaml
keep_stats_in_res_ds: true                # Keep stats and meta fields
keep_hashes_in_res_ds: true               # Keep hash fields
```

### Stats Export

Regardless of `keep_stats_in_res_ds`, DataJuicer always exports a separate stats file alongside the main dataset:

```
outputs/
├── result.jsonl                          # Main dataset (stats removed by default)
└── result_stats.jsonl                    # Stats-only file (always exported)
```

The stats file contains only the `__dj__stats__` and `__dj__meta__` columns.

## WebDataset Export (Ray Mode)

In Ray mode, you can export to WebDataset format with custom field mapping:

```yaml
export_path: ./outputs/webdataset
export_type: webdataset
export_extra_args:
  field_mapping:
    txt: "text"
    png: "images"
    json: "metadata"
```

## API Reference

### Exporter (Default Mode)

```python
from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path="./outputs/result.jsonl",
    export_type="jsonl",
    export_shard_size=0,
    export_in_parallel=True,
    num_proc=4,
    keep_stats_in_res_ds=False,
    keep_hashes_in_res_ds=False,
)

exporter.export(dataset)
```

### RayExporter (Ray Mode)

```python
from data_juicer.core.ray_exporter import RayExporter

exporter = RayExporter(
    export_path="./outputs/result.jsonl",
    export_type="jsonl",
    export_shard_size=268435456,
    keep_stats_in_res_ds=False,
    keep_hashes_in_res_ds=False,
)

exporter.export(ray_dataset)
```

## Troubleshooting

**Export format not supported:**
```bash
# Check supported formats
# Default mode: jsonl, json, parquet
# Ray mode: jsonl, json, parquet, csv, tfrecords, webdataset, lance
```

**Parallel export is slower than expected:**
```yaml
# Disable parallel export
export_in_parallel: false
```

**S3 export fails with permission error:**
```bash
# Verify credentials
aws s3 ls s3://your-bucket/

# Check that export_aws_credentials is configured
```

**Too many shard files generated:**
```yaml
# Increase shard size
export_shard_size: 1073741824             # 1 GB
```

**Stats missing from exported dataset:**
```yaml
# Keep stats in the result dataset
keep_stats_in_res_ds: true
# Or check the separate stats file: result_stats.jsonl
```