data_juicer.ops.mapper.s3_download_file_mapper module

class data_juicer.ops.mapper.s3_download_file_mapper.S3DownloadFileMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to download files from S3 to local files or load them into memory.

This operator downloads files from S3 URLs (s3://...) or handles local files. It supports:

  • Downloading multiple files concurrently

  • Saving files to a specified directory or loading content into memory

  • Resume download functionality

  • S3 authentication with access keys

  • Custom S3 endpoints (for S3-compatible services like MinIO)

The operator processes nested lists of URLs/paths, maintaining the original structure in the output.
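The structure-preserving behaviour described above can be illustrated with a minimal sketch. It is not the operator's own code: map_nested and fake_download are hypothetical helpers, and the /tmp/downloads target directory is an assumption made for the example.

```python
from typing import Callable, List, Union

# One level of nesting, matching List[str | List[str]] in the method signature.
Nested = Union[str, List[str]]


def map_nested(items: List[Nested], fn: Callable[[str], str]) -> List[Nested]:
    """Apply fn to every URL/path while keeping the list nesting intact."""
    result: List[Nested] = []
    for item in items:
        if isinstance(item, list):
            # Inner list: map each element and keep it grouped.
            result.append([fn(url) for url in item])
        else:
            result.append(fn(item))
    return result


def fake_download(url: str) -> str:
    # Hypothetical stand-in: pretend the file lands under /tmp/downloads.
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


urls = ["s3://bucket/a.jpg", ["s3://bucket/b.jpg", "s3://bucket/c.jpg"]]
local_paths = map_nested(urls, fake_download)
```

Here local_paths has the same shape as urls: one string followed by a two-element list.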

__init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, aws_access_key_id: str = None, aws_secret_access_key: str = None, aws_session_token: str = None, aws_region: str = None, endpoint_url: str = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • download_field -- The field name to get the URL/path to download.

  • save_dir -- The directory to save downloaded files.

  • save_field -- The field name to save the downloaded file content.

  • resume_download -- Whether to resume download. If True, skip the sample if it exists.

  • timeout -- (Deprecated) Kept for backward compatibility, not used for S3 downloads.

  • max_concurrent -- Maximum concurrent downloads.

  • aws_access_key_id -- AWS access key ID for S3.

  • aws_secret_access_key -- AWS secret access key for S3.

  • aws_session_token -- AWS session token for S3 (optional).

  • aws_region -- AWS region for S3.

  • endpoint_url -- Custom S3 endpoint URL (for S3-compatible services).

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.
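As a hedged illustration of how these parameters might be assembled, the sketch below builds a keyword-argument dictionary using only the parameter names listed above. The field name, directory, region, and endpoint values are assumptions chosen for the example, and the constructor call is left as a comment because it requires data_juicer to be installed.

```python
import os

# All values below are illustrative assumptions, not defaults of the operator.
op_kwargs = {
    "download_field": "image_urls",          # hypothetical sample field holding s3:// URLs
    "save_dir": "/tmp/s3_downloads",         # hypothetical local target directory
    "resume_download": True,
    "max_concurrent": 10,
    # Read credentials from the environment rather than hard-coding them.
    "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
    "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    "aws_region": "us-east-1",
    "endpoint_url": "http://localhost:9000",  # e.g. a local MinIO server
}

# With data_juicer installed, the operator would presumably be constructed as:
# from data_juicer.ops.mapper.s3_download_file_mapper import S3DownloadFileMapper
# op = S3DownloadFileMapper(**op_kwargs)
```

The timeout parameter is omitted because it is documented as deprecated and unused for S3 downloads.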

property s3_client

Lazy initialization of S3 client to avoid serialization issues with Ray.
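The lazy-initialization pattern can be sketched as follows. This is an illustration of the general technique, not the operator's actual implementation: FakeClient and LazyClientHolder are hypothetical names, and copy.deepcopy is used as a stand-in for serialization because it goes through the same __getstate__ hook that pickle uses.

```python
import copy
import threading


class FakeClient:
    """Hypothetical stand-in for an S3 client; it holds a lock, so it cannot be pickled."""

    def __init__(self, endpoint_url=None):
        self.endpoint_url = endpoint_url
        self._lock = threading.Lock()


class LazyClientHolder:
    def __init__(self, endpoint_url=None):
        self.endpoint_url = endpoint_url
        self._client = None  # nothing unserializable is stored at construction time

    @property
    def client(self):
        # Build the client on first access, then reuse it.
        if self._client is None:
            self._client = FakeClient(self.endpoint_url)
        return self._client

    def __getstate__(self):
        # Drop any client already created so the object stays serializable;
        # the copy rebuilds its own client on first access.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


holder = LazyClientHolder(endpoint_url="http://localhost:9000")
created_before_access = holder._client is not None
_ = holder.client              # first access creates the client
clone = copy.deepcopy(holder)  # succeeds because __getstate__ drops the client
```

Only plain configuration values travel with the object; each copy creates its own client when it first needs one.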

async download_files_async(urls, return_contents, save_dir=None, **kwargs)[source]

Download files asynchronously from S3.
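One common way to cap concurrent downloads is an asyncio.Semaphore. The sketch below assumes that approach, since the source does not state how max_concurrent is enforced. fetch_all and fetch_one are hypothetical helpers, and asyncio.sleep stands in for the real S3 transfer.

```python
import asyncio


async def fetch_all(urls, max_concurrent=10):
    """Run one task per URL, with at most max_concurrent inside the guarded section."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fetch_one(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # hypothetical stand-in for the real S3 transfer
            in_flight -= 1
            return url, len(url)

    # gather returns results in the same order as the input URLs.
    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return results, peak


urls = [f"s3://bucket/file_{i}.bin" for i in range(8)]
results, peak = asyncio.run(fetch_all(urls, max_concurrent=3))
```

The peak counter is there only to make the limit observable; it never exceeds max_concurrent.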

async download_nested_urls(nested_urls: List[str | List[str]], save_dir=None, save_field_contents=None)[source]

Download nested URLs with structure preservation.

process_batched(samples)[source]

Process a batch of samples.