data_juicer.utils.s3_utils module
S3 utilities for Data-Juicer.
Provides unified S3 authentication and filesystem creation for both s3fs (default executor) and PyArrow (Ray executor) backends.
- data_juicer.utils.s3_utils.get_aws_credentials(ds_config: Dict = {}) → Tuple[str, str, str, str]
Get AWS credentials, resolved in the following priority order:
1. Environment variables (e.g., AWS_ACCESS_KEY_ID)
2. Explicit config parameters (e.g., in a dataset config dict)
- Parameters:
ds_config – Dataset configuration dictionary containing optional AWS credentials. If not provided, an empty dict is used.
- Returns:
Tuple of (access_key_id, secret_access_key, session_token, region)
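The priority order above can be illustrated with a minimal, standalone sketch. It does not reproduce the library's implementation: the helper name `resolve_aws_credentials`, the `env` parameter, the config keys (`aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `aws_region`), and every environment variable name other than `AWS_ACCESS_KEY_ID` are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional, Tuple

Credentials = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def resolve_aws_credentials(
    ds_config: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Credentials:
    """Hypothetical helper: environment variables win over config values."""
    ds_config = ds_config or {}
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env

    def pick(env_name: str, config_key: str) -> Optional[str]:
        # An unset or empty environment variable falls through to the config.
        return env.get(env_name) or ds_config.get(config_key)

    return (
        pick("AWS_ACCESS_KEY_ID", "aws_access_key_id"),          # assumed config key
        pick("AWS_SECRET_ACCESS_KEY", "aws_secret_access_key"),  # assumed names
        pick("AWS_SESSION_TOKEN", "aws_session_token"),          # assumed names
        pick("AWS_REGION", "aws_region"),                        # assumed names
    )
```

In this sketch a value that is missing from both sources comes back as `None`, which leaves the caller free to fall back to another credential source.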
- data_juicer.utils.s3_utils.create_pyarrow_s3_filesystem(ds_config: Dict = {}) → S3FileSystem
Create a PyArrow S3FileSystem with proper authentication.
Authentication priority:
1. Environment variables (most secure, recommended for production)
2. Explicit config parameters (for development/testing)
3. Default AWS credential chain (boto3-style: env vars, ~/.aws/credentials, IAM roles)
- Parameters:
ds_config – Dataset configuration dictionary containing optional AWS credentials
- Returns:
pyarrow.fs.S3FileSystem instance configured with credentials
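As a rough sketch of how resolved credentials might be handed to PyArrow, the snippet below builds keyword arguments for `pyarrow.fs.S3FileSystem` and omits the explicit keys when none were found, so that PyArrow falls back to the default AWS credential chain. The helper names `build_s3_kwargs` and `make_s3_filesystem` are hypothetical and are not part of Data-Juicer; only the `access_key`, `secret_key`, `session_token`, and `region` arguments of `S3FileSystem` are taken from PyArrow itself.

```python
from typing import Any, Dict, Optional


def build_s3_kwargs(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    session_token: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: map resolved credentials to S3FileSystem kwargs."""
    kwargs: Dict[str, Any] = {}
    # Pass explicit keys only when both halves of the key pair are present;
    # otherwise leave them out so the default credential chain is used.
    if access_key_id and secret_access_key:
        kwargs["access_key"] = access_key_id
        kwargs["secret_key"] = secret_access_key
        if session_token:
            kwargs["session_token"] = session_token
    if region:
        kwargs["region"] = region
    return kwargs


def make_s3_filesystem(**credentials: Optional[str]):
    """Hypothetical helper: construct the filesystem from the kwargs above."""
    # Imported lazily so build_s3_kwargs stays usable without PyArrow installed.
    import pyarrow.fs as pafs

    return pafs.S3FileSystem(**build_s3_kwargs(**credentials))
```

Keeping the keyword mapping separate from the constructor call makes the fallback behaviour easy to check without network access or real credentials.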