data_juicer.utils.s3_utils module

S3 utilities for Data-Juicer.

Provides unified S3 authentication and filesystem creation for both s3fs (default executor) and PyArrow (Ray executor) backends.

data_juicer.utils.s3_utils.get_aws_credentials(ds_config: Dict = {}) → Tuple[str, str, str, str]

Get AWS credentials, resolved in the following priority order:

1. Environment variables (e.g., AWS_ACCESS_KEY_ID)
2. Explicit config parameters (e.g., in a dataset config dict)

Parameters: ds_config – Dataset configuration dictionary containing optional AWS credentials. If not provided, an empty dict is used.
Returns: Tuple of (access_key_id, secret_access_key, session_token, region)
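
The priority order above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the function name resolve_credentials_sketch, the env parameter, the config key names (aws_access_key_id and so on) and the AWS_REGION variable are assumptions made for illustration, and missing values are returned as None.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

Credentials = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]

# Assumed (environment variable, config key) pairs.
# The config key names are hypothetical and may differ in Data-Juicer.
_SOURCES = (
    ("AWS_ACCESS_KEY_ID", "aws_access_key_id"),
    ("AWS_SECRET_ACCESS_KEY", "aws_secret_access_key"),
    ("AWS_SESSION_TOKEN", "aws_session_token"),
    ("AWS_REGION", "aws_region"),
)


def resolve_credentials_sketch(
    ds_config: Optional[Dict[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Credentials:
    """Return (access_key_id, secret_access_key, session_token, region)."""
    ds_config = ds_config or {}
    env = os.environ if env is None else env
    # An environment variable wins; the config value is only a fallback.
    return tuple(env.get(var) or ds_config.get(key) for var, key in _SOURCES)


if __name__ == "__main__":
    creds = resolve_credentials_sketch(
        {"aws_access_key_id": "cfg-key", "aws_region": "us-west-2"},
        env={"AWS_ACCESS_KEY_ID": "env-key"},
    )
    print(creds)  # ('env-key', None, None, 'us-west-2')
```

Passing env explicitly keeps the sketch testable without touching the real process environment; the access key comes from the environment mapping, while the region falls back to the config dict.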

data_juicer.utils.s3_utils.create_pyarrow_s3_filesystem(ds_config: Dict = {}) → S3FileSystem

Create a PyArrow S3FileSystem with proper authentication.

Authentication priority:

1. Environment variables (most secure, recommended for production)
2. Explicit config parameters (for development/testing)
3. Default AWS credential chain (boto3-style: env vars, ~/.aws/credentials, IAM roles)

Parameters: ds_config – Dataset configuration dictionary containing optional AWS credentials
Returns: pyarrow.fs.S3FileSystem instance configured with credentials
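
As a rough sketch of how the three-level priority could map onto pyarrow.fs.S3FileSystem, the example below builds the keyword arguments first and constructs the filesystem in a separate step. The helper names, the env parameter, the config key names and the AWS_REGION variable are assumptions for illustration only. The access_key, secret_key, session_token and region arguments are real S3FileSystem parameters; when no explicit keys are passed, PyArrow falls back to the default AWS credential chain.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_s3_kwargs_sketch(
    ds_config: Optional[Dict[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for pyarrow.fs.S3FileSystem (illustrative)."""
    ds_config = ds_config or {}
    env = os.environ if env is None else env

    def pick(var: str, key: str) -> Optional[str]:
        # 1. environment variable, 2. explicit config value (hypothetical key).
        return env.get(var) or ds_config.get(key)

    kwargs: Dict[str, Any] = {}
    access_key = pick("AWS_ACCESS_KEY_ID", "aws_access_key_id")
    secret_key = pick("AWS_SECRET_ACCESS_KEY", "aws_secret_access_key")
    if access_key and secret_key:
        kwargs["access_key"] = access_key
        kwargs["secret_key"] = secret_key
        token = pick("AWS_SESSION_TOKEN", "aws_session_token")
        if token:
            kwargs["session_token"] = token
    # 3. With no explicit keys, S3FileSystem uses the default credential chain.
    region = pick("AWS_REGION", "aws_region")
    if region:
        kwargs["region"] = region
    return kwargs


def make_filesystem_sketch(ds_config: Optional[Dict[str, str]] = None):
    """Create the filesystem; pyarrow is imported lazily, only when called."""
    import pyarrow.fs as pafs

    return pafs.S3FileSystem(**build_s3_kwargs_sketch(ds_config))
```

Separating argument construction from filesystem creation makes the priority logic easy to check without network access or a PyArrow installation.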

data_juicer.utils.s3_utils.validate_s3_path(path: str) → None

Validate that a path is a valid S3 path.

Parameters: path – Path to validate
Raises: ValueError – If path doesn’t start with ‘s3://’
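
The documented behaviour amounts to a prefix check. The stand-in below mirrors that contract so it can be run without Data-Juicer installed; the name validate_s3_path_sketch and the error message wording are assumptions, not the library's code.

```python
def validate_s3_path_sketch(path: str) -> None:
    """Raise ValueError unless the path starts with 's3://'."""
    if not path.startswith("s3://"):
        raise ValueError(f"S3 path must start with 's3://', got: {path!r}")


if __name__ == "__main__":
    validate_s3_path_sketch("s3://my-bucket/data.jsonl")  # passes silently
    try:
        validate_s3_path_sketch("/local/data.jsonl")
    except ValueError as err:
        print(f"rejected: {err}")
```

Because the function returns None on success, callers typically invoke it for its side effect and let the ValueError propagate, or catch it to report a configuration problem.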
