data_juicer.utils.s3_utils module

S3 utilities for Data-Juicer.

Provides unified S3 authentication and filesystem creation for both s3fs (default executor) and PyArrow (Ray executor) backends.

data_juicer.utils.s3_utils.get_aws_credentials(ds_config: Dict = {}) → Tuple[str, str, str, str]

Get AWS credentials, resolved in the following priority order:

1. Environment variables (e.g., AWS_ACCESS_KEY_ID)
2. Explicit config parameters (e.g., in a dataset config dict)

Parameters:

ds_config – Dataset configuration dictionary containing optional AWS credentials. If not provided, an empty dict is used.

Returns:

Tuple of (access_key_id, secret_access_key, session_token, region)
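The priority order described above can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `resolve_credentials_sketch`, the lowercase config keys (`aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `aws_region`), and the per-field fallback behavior are assumptions made for illustration. The environment variable names are the standard AWS ones.

```python
from typing import Dict, Mapping, Optional, Tuple

Credentials = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def resolve_credentials_sketch(
    env: Mapping[str, str], ds_config: Optional[Dict] = None
) -> Credentials:
    """Hypothetical helper: environment values win, config values are the fallback."""
    ds_config = ds_config or {}

    def pick(env_name: str, config_key: str) -> Optional[str]:
        # Prefer the environment variable; fall back to the (assumed) config key.
        return env.get(env_name) or ds_config.get(config_key)

    return (
        pick("AWS_ACCESS_KEY_ID", "aws_access_key_id"),
        pick("AWS_SECRET_ACCESS_KEY", "aws_secret_access_key"),
        pick("AWS_SESSION_TOKEN", "aws_session_token"),
        pick("AWS_REGION", "aws_region"),
    )


# The environment supplies the access key; the config fills in the secret.
example_env = {"AWS_ACCESS_KEY_ID": "env-key"}
example_config = {
    "aws_access_key_id": "cfg-key",
    "aws_secret_access_key": "cfg-secret",
}
resolved = resolve_credentials_sketch(example_env, example_config)
```

In real use, `os.environ` would be passed as `env`. Taking the environment as a parameter keeps the sketch easy to test without mutating process state.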

data_juicer.utils.s3_utils.create_pyarrow_s3_filesystem(ds_config: Dict = {}) → S3FileSystem

Create a PyArrow S3FileSystem with proper authentication.

Authentication priority:

1. Environment variables (most secure, recommended for production)
2. Explicit config parameters (for development/testing)
3. Default AWS credential chain (boto3-style: env vars, ~/.aws/credentials, IAM roles)

Parameters:

ds_config – Dataset configuration dictionary containing optional AWS credentials

Returns:

pyarrow.fs.S3FileSystem instance configured with credentials
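One way such a filesystem could be constructed from already-resolved credentials is sketched below. The helper names `build_s3_kwargs_sketch` and `make_filesystem_sketch` are hypothetical and not part of Data-Juicer. The keyword arguments `access_key`, `secret_key`, `session_token`, and `region` are real parameters of `pyarrow.fs.S3FileSystem`. When no keys are passed, PyArrow falls back to the default AWS credential chain, which corresponds to the third priority level above.

```python
from typing import Dict, Optional


def build_s3_kwargs_sketch(
    access_key: Optional[str],
    secret_key: Optional[str],
    session_token: Optional[str],
    region: Optional[str],
) -> Dict[str, str]:
    """Hypothetical helper: map resolved credentials to S3FileSystem kwargs."""
    kwargs: Dict[str, str] = {}
    # Only pass explicit keys when both halves of the key pair are present;
    # otherwise leave them out so PyArrow uses the default credential chain.
    if access_key and secret_key:
        kwargs["access_key"] = access_key
        kwargs["secret_key"] = secret_key
        if session_token:
            kwargs["session_token"] = session_token
    if region:
        kwargs["region"] = region
    return kwargs


def make_filesystem_sketch(**kwargs):
    """Create the filesystem; pyarrow is imported lazily so it is only needed here."""
    import pyarrow.fs as pafs

    return pafs.S3FileSystem(**kwargs)


explicit = build_s3_kwargs_sketch("key", "secret", None, "us-east-1")
fallback = build_s3_kwargs_sketch(None, None, None, None)
```

Calling `make_filesystem_sketch(**explicit)` would then return a configured `pyarrow.fs.S3FileSystem`, while `make_filesystem_sketch(**fallback)` would defer to the default chain.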

data_juicer.utils.s3_utils.validate_s3_path(path: str) → None

Validate that a path is a valid S3 path.

Parameters:

path – Path to validate

Raises:

ValueError – If path doesn’t start with ‘s3://’
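The documented behavior amounts to a prefix check. A minimal sketch follows, assuming nothing beyond what is stated above. The function name `validate_s3_path_sketch` and the error message wording are illustrative, not the library's own.

```python
def validate_s3_path_sketch(path: str) -> None:
    """Hypothetical stand-in: raise ValueError unless the path starts with 's3://'."""
    if not path.startswith("s3://"):
        raise ValueError(f"S3 path must start with 's3://', got: {path!r}")


def is_valid_s3_path(path: str) -> bool:
    """Convenience wrapper turning the exception into a boolean."""
    try:
        validate_s3_path_sketch(path)
    except ValueError:
        return False
    return True


accepted = is_valid_s3_path("s3://my-bucket/data.jsonl")
rejected = is_valid_s3_path("/local/data.jsonl")
```

Validating the path before building a filesystem lets a misconfigured dataset path fail early with a clear message, rather than surfacing later as an opaque I/O error.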