data_juicer.ops.mapper.s3_download_file_mapper module
- class data_juicer.ops.mapper.s3_download_file_mapper.S3DownloadFileMapper(*args, **kwargs)
Bases: Mapper
Mapper to download files from S3 to local files or load them into memory.
This operator downloads files from S3 URLs (s3://…) or handles local files. It supports:
- Downloading multiple files concurrently
- Saving files to a specified directory or loading content into memory
- Resume download functionality
- S3 authentication with access keys
- Custom S3 endpoints (for S3-compatible services like MinIO)
The operator processes nested lists of URLs/paths, maintaining the original structure in the output.
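The structure-preserving behaviour described above can be illustrated with a small recursive helper. This is a minimal sketch, not the operator's actual code: `map_nested` and the basename-extracting lambda are hypothetical names used only to show how a nested list of URLs/paths keeps its shape after each leaf is transformed.

```python
def map_nested(fn, item):
    """Apply fn to every leaf of a (possibly nested) list, keeping the shape."""
    if isinstance(item, list):
        return [map_nested(fn, sub) for sub in item]
    return fn(item)


# Hypothetical input: one inner list per sample, mixing S3 URLs and local paths.
urls = [["s3://bucket/a.jpg", "s3://bucket/b.jpg"], ["/tmp/c.jpg"]]

# Stand-in for "download and return the local file name": keep only the basename.
local_names = map_nested(lambda u: u.rsplit("/", 1)[-1], urls)
```

The output has the same nesting as the input, so each result can be matched back to the URL it came from by position.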
- __init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, aws_access_key_id: str = None, aws_secret_access_key: str = None, aws_session_token: str = None, aws_region: str = None, endpoint_url: str = None, *args, **kwargs)
Initialization method.
- Parameters:
download_field – The field name to get the URL/path to download.
save_dir – The directory to save downloaded files.
save_field – The field name to save the downloaded file content.
resume_download – Whether to resume download. If True, skip the sample if it exists.
timeout – (Deprecated) Kept for backward compatibility, not used for S3 downloads.
max_concurrent – Maximum concurrent downloads.
aws_access_key_id – AWS access key ID for S3.
aws_secret_access_key – AWS secret access key for S3.
aws_session_token – AWS session token for S3 (optional).
aws_region – AWS region for S3.
endpoint_url – Custom S3 endpoint URL (for S3-compatible services).
args – extra args
kwargs – extra args
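As a rough illustration of how these parameters might be combined, the sketch below builds a keyword-argument dictionary for the constructor. Only the parameter names come from the signature above. Every value (field name, directory, region, endpoint) is a placeholder assumption, and reading credentials from environment variables is just one possible choice.

```python
import os

# Keyword arguments named after the documented __init__ parameters.
# All values are illustrative placeholders, not recommended defaults.
s3_mapper_kwargs = {
    "download_field": "image_urls",            # hypothetical dataset field
    "save_dir": "/tmp/s3_downloads",           # hypothetical local directory
    "resume_download": True,
    "max_concurrent": 5,
    "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
    "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    "aws_region": "us-east-1",                 # placeholder region
    "endpoint_url": "http://localhost:9000",   # e.g. a local MinIO instance
}

# With data_juicer installed, the operator could then be constructed as:
# from data_juicer.ops.mapper.s3_download_file_mapper import S3DownloadFileMapper
# op = S3DownloadFileMapper(**s3_mapper_kwargs)
```

The deprecated `timeout` parameter is left out because it is not used for S3 downloads.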
- property s3_client
Lazy initialization of S3 client to avoid serialization issues with Ray.
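The lazy-initialization idea can be sketched in plain Python. This is an illustrative pattern under stated assumptions, not the operator's implementation: `LazyClientHolder` and `_make_client` are hypothetical names, and a dictionary holding a `threading.Lock` stands in for a real S3 client, which typically cannot be pickled.

```python
import pickle
import threading


class LazyClientHolder:
    """Holds only configuration at construction; builds the client on first use."""

    def __init__(self, endpoint_url=None):
        self.endpoint_url = endpoint_url
        self._client = None  # nothing unpicklable is stored yet

    def _make_client(self):
        # Stand-in for a real client factory; the lock makes it unpicklable.
        return {"endpoint_url": self.endpoint_url, "lock": threading.Lock()}

    @property
    def client(self):
        if self._client is None:
            self._client = self._make_client()
        return self._client


holder = LazyClientHolder(endpoint_url="http://localhost:9000")

# Serializing before first access works because no client exists yet.
clone = pickle.loads(pickle.dumps(holder))

# The restored copy creates its own client on first access and then reuses it.
first = clone.client
second = clone.client
```

Because the client is created only after the object has been shipped to a worker, the object itself stays serializable up to that point. Once `client` has been accessed, this simple sketch would no longer pickle.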
- async download_files_async(urls, return_contents, save_dir=None, **kwargs)
Download files asynchronously from S3.
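One common way to run several blocking downloads concurrently with an upper bound is an `asyncio.Semaphore` combined with worker threads. The sketch below shows that general pattern only. It does not reproduce this method's internals, and `fetch_blocking` and `download_all` are hypothetical helpers that return fake bytes instead of contacting S3 (requires Python 3.9+ for `asyncio.to_thread`).

```python
import asyncio


def fetch_blocking(url):
    """Hypothetical stand-in for a blocking S3 download; returns fake bytes."""
    return f"content of {url}".encode()


async def download_all(urls, max_concurrent=10):
    """Fetch all urls, running at most max_concurrent fetches at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop stays free.
            return await asyncio.to_thread(fetch_blocking, url)

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


results = asyncio.run(
    download_all(["s3://bucket/a.txt", "s3://bucket/b.txt"], max_concurrent=2)
)
```

The semaphore plays the role that `max_concurrent` plays in the constructor: it caps how many downloads are in flight at once.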