data_juicer.ops.mapper.s3_download_file_mapper module
- class data_juicer.ops.mapper.s3_download_file_mapper.S3DownloadFileMapper(*args, **kwargs)[source]
Bases: Mapper
Mapper to download files from S3 to local files or load them into memory.
This operator downloads files from S3 URLs (s3://...) or handles local files. It supports:
- Downloading multiple files concurrently
- Saving files to a specified directory or loading content into memory
- Resume download functionality
- S3 authentication with access keys
- Custom S3 endpoints (for S3-compatible services like MinIO)
The operator processes nested lists of URLs/paths, maintaining the original structure in the output.
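The structure-preserving behaviour described above can be illustrated with a small recursive helper. This is only a sketch of the idea, not the operator's actual code: `map_nested` and `fake_download` are hypothetical names, and the `/tmp/` target path is an assumption made for the example.

```python
from typing import Any, Callable


def map_nested(fn: Callable[[str], Any], item: Any) -> Any:
    """Apply fn to every leaf, keeping the list nesting intact."""
    if isinstance(item, list):
        return [map_nested(fn, sub) for sub in item]
    return fn(item)


def fake_download(url: str) -> str:
    # Hypothetical stand-in for a real download: returns a local path
    # built from the last path component of the URL.
    return "/tmp/" + url.rsplit("/", 1)[-1]


urls = ["s3://bucket/a.jpg", ["s3://bucket/b.jpg", "local/c.jpg"]]
result = map_nested(fake_download, urls)
```

The output has the same shape as the input: one top-level entry followed by a nested list of two entries.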
- __init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, aws_access_key_id: str = None, aws_secret_access_key: str = None, aws_session_token: str = None, aws_region: str = None, endpoint_url: str = None, *args, **kwargs)[source]
Initialization method.
- Parameters:
download_field -- The field name to get the URL/path to download.
save_dir -- The directory to save downloaded files.
save_field -- The field name to save the downloaded file content.
resume_download -- Whether to resume download. If True, skip the sample if it exists.
timeout -- (Deprecated) Kept for backward compatibility, not used for S3 downloads.
max_concurrent -- Maximum concurrent downloads.
aws_access_key_id -- AWS access key ID for S3.
aws_secret_access_key -- AWS secret access key for S3.
aws_session_token -- AWS session token for S3 (optional).
aws_region -- AWS region for S3.
endpoint_url -- Custom S3 endpoint URL (for S3-compatible services).
args -- Extra positional arguments.
kwargs -- Extra keyword arguments.
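To make `save_dir` and `resume_download` more concrete, the following sketch splits an `s3://` URL into bucket and key and decides whether a download can be skipped. It rests on assumptions not confirmed by this page: that the local file is named after the last component of the key, and that "exists" means the target file is already present in `save_dir`. `split_s3_url` and `should_skip` are hypothetical helpers, not part of the operator.

```python
import os
import tempfile
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split 's3://bucket/path/to/key' into ('bucket', 'path/to/key')."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")


def should_skip(url: str, save_dir: str, resume_download: bool) -> bool:
    """Assumed rule: skip only when resuming and the target file already exists."""
    _, key = split_s3_url(url)
    target = os.path.join(save_dir, os.path.basename(key))
    return resume_download and os.path.exists(target)


with tempfile.TemporaryDirectory() as save_dir:
    url = "s3://my-bucket/images/cat.jpg"
    first = should_skip(url, save_dir, resume_download=True)   # nothing saved yet
    open(os.path.join(save_dir, "cat.jpg"), "wb").close()       # simulate an earlier download
    second = should_skip(url, save_dir, resume_download=True)
    third = should_skip(url, save_dir, resume_download=False)
```

Under these assumptions the file is skipped only in the second call, where it already exists and `resume_download` is enabled.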
- property s3_client
Lazy initialization of S3 client to avoid serialization issues with Ray.
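The lazy-initialization pattern can be sketched as follows. This is a generic illustration, not the operator's source: `FakeClient` and `LazyClientHolder` are hypothetical classes, and the `__getstate__` hook is an extra safeguard added for the example. Whether the operator does the same is not stated on this page.

```python
import threading


class FakeClient:
    """Hypothetical stand-in for a real S3 client; it holds a lock, so it cannot be pickled."""

    def __init__(self, endpoint_url=None):
        self.endpoint_url = endpoint_url
        self._lock = threading.Lock()


class LazyClientHolder:
    def __init__(self, endpoint_url=None):
        self.endpoint_url = endpoint_url
        self._client = None  # nothing unpicklable is created at construction time

    @property
    def s3_client(self):
        # Build the client on first access, then reuse it.
        if self._client is None:
            self._client = FakeClient(self.endpoint_url)
        return self._client

    def __getstate__(self):
        # Drop the client from the serialized state; it is rebuilt on demand
        # in whichever process deserializes the object.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


holder = LazyClientHolder(endpoint_url="http://localhost:9000")
created_at_init = holder._client is not None
client = holder.s3_client
same_client = holder.s3_client is client
state = holder.__getstate__()
```

Because the client is created only on first use, the object can be shipped to worker processes before any connection state exists, and each worker builds its own client.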
- async download_files_async(urls, return_contents, save_dir=None, **kwargs)[source]
Download files asynchronously from S3.
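One common way to cap the number of simultaneous downloads, in the spirit of `max_concurrent`, is an `asyncio.Semaphore`. The sketch below is an assumption about how such a limit can be enforced, not a description of this method's internals. `fetch_one` and `download_all` are hypothetical names, and the `asyncio.sleep` call stands in for the real transfer.

```python
import asyncio


async def fetch_one(url, semaphore, active):
    # At most `max_concurrent` coroutines are inside this block at once.
    async with semaphore:
        active["now"] += 1
        active["peak"] = max(active["peak"], active["now"])
        await asyncio.sleep(0.01)  # placeholder for the actual transfer
        active["now"] -= 1
        return f"content-of-{url}"


async def download_all(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    active = {"now": 0, "peak": 0}
    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(
        *(fetch_one(u, semaphore, active) for u in urls)
    )
    return list(results), active["peak"]


urls = [f"s3://bucket/file-{i}.bin" for i in range(8)]
contents, peak = asyncio.run(download_all(urls, max_concurrent=3))
```

Here `peak` records the largest number of transfers that were in flight at the same time, which never exceeds the limit, and `contents` keeps the order of `urls`.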