data_juicer.utils.s3_utils module

S3 utilities for Data-Juicer.

Provides unified S3 authentication and filesystem creation for both s3fs (default executor) and PyArrow (Ray executor) backends.

data_juicer.utils.s3_utils.get_aws_credentials(ds_config: Dict = {}) -> Tuple[str, str, str, str]

Get AWS credentials, checking sources in this priority order:

1. Environment variables (e.g., AWS_ACCESS_KEY_ID)
2. Explicit config parameters (e.g., in a dataset config dict)

Parameters:

ds_config -- Dataset configuration dictionary containing optional AWS credentials. If not provided, an empty dict is used.

Returns:

Tuple of (access_key_id, secret_access_key, session_token, region)
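The priority order above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name resolve_credentials, its env parameter, the config key names (aws_access_key_id and so on), and the use of AWS_REGION for the region are all assumptions made for the example.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical config key names, assumed for illustration only.
_CONFIG_KEYS = (
    "aws_access_key_id",
    "aws_secret_access_key",
    "aws_session_token",
    "aws_region",
)
# Environment variable names assumed to pair with the config keys above.
_ENV_KEYS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_REGION",
)


def resolve_credentials(
    ds_config: Optional[Dict[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]:
    """Return (access_key_id, secret_access_key, session_token, region).

    For each field, an environment variable wins over the config entry;
    a field found in neither place is returned as None.
    """
    ds_config = ds_config or {}
    env = os.environ if env is None else env
    return tuple(
        env.get(env_key) or ds_config.get(cfg_key)
        for env_key, cfg_key in zip(_ENV_KEYS, _CONFIG_KEYS)
    )


# The environment supplies the access key, the config supplies the region.
creds = resolve_credentials(
    {"aws_access_key_id": "config-key", "aws_region": "us-west-2"},
    env={"AWS_ACCESS_KEY_ID": "env-key"},
)
print(creds)  # ('env-key', None, None, 'us-west-2')
```

Passing env explicitly keeps the sketch easy to test; leaving it out falls back to os.environ.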

data_juicer.utils.s3_utils.create_pyarrow_s3_filesystem(ds_config: Dict = {}) -> S3FileSystem

Create a PyArrow S3FileSystem with proper authentication.

Authentication priority:

1. Environment variables (most secure, recommended for production)
2. Explicit config parameters (for development/testing)
3. Default AWS credential chain (boto3-style: env vars, ~/.aws/credentials, IAM roles)

Parameters:

ds_config -- Dataset configuration dictionary containing optional AWS credentials

Returns:

pyarrow.fs.S3FileSystem instance configured with credentials
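One way the fallback to the default credential chain might work is to pass explicit keys to pyarrow.fs.S3FileSystem only when they were actually resolved. The sketch below assumes that approach; build_s3_kwargs and make_filesystem are hypothetical helper names, not part of Data-Juicer.

```python
from typing import Any, Dict, Optional


def build_s3_kwargs(
    access_key: Optional[str] = None,
    secret_key: Optional[str] = None,
    session_token: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for pyarrow.fs.S3FileSystem.

    Keys are included only as a complete pair. With no keys passed,
    PyArrow is left to resolve credentials through the AWS default chain.
    """
    kwargs: Dict[str, Any] = {}
    if access_key and secret_key:
        kwargs["access_key"] = access_key
        kwargs["secret_key"] = secret_key
        # A session token is only meaningful alongside an explicit key pair.
        if session_token:
            kwargs["session_token"] = session_token
    if region:
        kwargs["region"] = region
    return kwargs


def make_filesystem(kwargs: Dict[str, Any]):
    """Build the filesystem; pyarrow is imported lazily, only when called."""
    from pyarrow import fs

    return fs.S3FileSystem(**kwargs)


explicit = build_s3_kwargs("my-key", "my-secret", region="us-east-1")
fallback = build_s3_kwargs(region="us-east-1")
print(explicit)  # {'access_key': 'my-key', 'secret_key': 'my-secret', 'region': 'us-east-1'}
print(fallback)  # {'region': 'us-east-1'}
```

Calling make_filesystem(explicit) would then construct the filesystem with those credentials, while make_filesystem(fallback) leaves credential discovery to PyArrow.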

data_juicer.utils.s3_utils.validate_s3_path(path: str) -> None

Validate that a path is a valid S3 path.

Parameters:

path -- Path to validate

Raises:

ValueError -- If path doesn't start with 's3://'
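The documented check can be reproduced in a few lines. The function below, check_s3_path, is a hypothetical stand-in written for illustration; it mirrors only the behavior stated above (a ValueError when the 's3://' prefix is missing), and its error message wording is an assumption.

```python
def check_s3_path(path: str) -> None:
    """Raise ValueError unless the path starts with the 's3://' scheme."""
    if not path.startswith("s3://"):
        raise ValueError(f"S3 path must start with 's3://', got: {path!r}")


# A valid path passes silently; an invalid one raises ValueError.
results = {}
for candidate in ("s3://my-bucket/data.jsonl", "/local/data.jsonl"):
    try:
        check_s3_path(candidate)
        results[candidate] = "ok"
    except ValueError:
        results[candidate] = "rejected"

print(results)
# {'s3://my-bucket/data.jsonl': 'ok', '/local/data.jsonl': 'rejected'}
```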