data_juicer.ops.mapper.s3_download_file_mapper module

class data_juicer.ops.mapper.s3_download_file_mapper.S3DownloadFileMapper(*args, **kwargs)

Bases: Mapper

Mapper to download files from S3 to local files or load them into memory.

This operator downloads files from S3 URLs (s3://…) or handles local files. It supports:

  • Downloading multiple files concurrently

  • Saving files to a specified directory or loading content into memory

  • Resume download functionality

  • S3 authentication with access keys

  • Custom S3 endpoints (for S3-compatible services like MinIO)

The operator processes nested lists of URLs/paths, maintaining the original structure in the output.
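The distinction between S3 URLs and local paths can be illustrated with a minimal sketch. The helper name `split_s3_url` is hypothetical and is not part of this module; it only shows how an `s3://` URL breaks down into a bucket and an object key using the standard library.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Return (bucket, key) for an s3:// URL, or None for anything else.

    Hypothetical helper for illustration; not part of S3DownloadFileMapper.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        return None  # treated as a local path in this sketch
    # netloc is the bucket name; the path (minus its leading slash) is the key
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_url("s3://my-bucket/images/cat.jpg"))  # ('my-bucket', 'images/cat.jpg')
print(split_s3_url("/data/images/cat.jpg"))           # None
```

Object keys that contain `?` or `#` would be split into the query or fragment by `urlparse`, so a production version would need to handle those characters explicitly.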

__init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, aws_access_key_id: str = None, aws_secret_access_key: str = None, aws_session_token: str = None, aws_region: str = None, endpoint_url: str = None, *args, **kwargs)

Initialization method.

Parameters:
  • download_field – Name of the sample field that holds the URL(s) or path(s) to download.

  • save_dir – The directory to save downloaded files.

  • save_field – Name of the sample field in which to store the downloaded file content.

  • resume_download – Whether to resume downloads. If True, skip a sample whose download already exists.

  • timeout – (Deprecated) Kept for backward compatibility; not used for S3 downloads.

  • max_concurrent – Maximum number of concurrent downloads.

  • aws_access_key_id – AWS access key ID for S3.

  • aws_secret_access_key – AWS secret access key for S3.

  • aws_session_token – AWS session token for S3 (optional).

  • aws_region – AWS region for S3.

  • endpoint_url – Custom S3 endpoint URL (for S3-compatible services).

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.
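As a rough illustration of how these parameters fit together, the sketch below collects constructor arguments for an S3-compatible service. All values are placeholders (the field name `file_url`, the directory, the credentials, and the MinIO-style endpoint are assumptions, not library defaults), and `build_op` is a hypothetical helper that needs `data_juicer` installed when it is called.

```python
# Placeholder values for illustration only; replace with real settings.
op_kwargs = dict(
    download_field="file_url",         # assumed name of the field holding URLs
    save_dir="./downloads",            # where downloaded files would be written
    resume_download=True,
    max_concurrent=4,
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    endpoint_url="http://localhost:9000",  # e.g. a local MinIO instance
)


def build_op():
    """Construct the operator; imported lazily so this file loads without data_juicer."""
    from data_juicer.ops.mapper.s3_download_file_mapper import S3DownloadFileMapper

    return S3DownloadFileMapper(**op_kwargs)
```

Only keyword names that appear in the signature above are used; avoid hard-coding real credentials in source files.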

property s3_client

Lazily initialized S3 client. The client is created on first access rather than in the constructor, which avoids serialization issues with Ray.
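The lazy-initialization pattern can be sketched as follows. This is not the module's actual code: `LazyClientHolder` is a hypothetical class, and a `threading.Lock` stands in for a real client object, since both are typically not picklable.

```python
import pickle
import threading


class LazyClientHolder:
    """Hypothetical stand-in for an operator that owns an unpicklable client."""

    def __init__(self, endpoint_url=None):
        self.endpoint_url = endpoint_url  # plain configuration, safe to pickle
        self._client = None               # nothing unpicklable stored yet

    @property
    def client(self):
        # Create the client only on first access, then reuse it.
        if self._client is None:
            self._client = {"endpoint": self.endpoint_url, "lock": threading.Lock()}
        return self._client


holder = LazyClientHolder("http://localhost:9000")
# Pickling succeeds because the client has not been created yet.
restored = pickle.loads(pickle.dumps(holder))
print(restored.client["endpoint"])  # http://localhost:9000
```

Because only plain configuration lives on the instance until the first access, the object can be shipped to a worker process and the client is then built there.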

async download_files_async(urls, return_contents, save_dir=None, **kwargs)

Download files asynchronously from S3.
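One common way to cap the number of simultaneous downloads, in the spirit of `max_concurrent`, is an `asyncio.Semaphore`. The sketch below is an assumption about the general technique, not the operator's implementation; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates I/O.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrent=10):
    """Run fetch_one(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url):
    # Placeholder for a real download; just simulates latency.
    await asyncio.sleep(0.01)
    return f"content of {url}"


results = asyncio.run(
    fetch_all(["s3://b/1", "s3://b/2", "s3://b/3"], fake_fetch, max_concurrent=2)
)
print(results)
```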

async download_nested_urls(nested_urls: List[str | List[str]], save_dir=None, save_field_contents=None)

Download nested URLs with structure preservation.
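Structure preservation for an input of type `List[str | List[str]]` can be sketched like this: a plain string yields a single result, and an inner list yields a list of the same length. The names `download_nested` and `fake_fetch` are hypothetical, and the fetch function is a placeholder rather than a real S3 call.

```python
import asyncio


async def download_nested(nested_urls, fetch_one):
    """Fetch every URL; inner lists come back as lists of the same length."""

    async def handle(item):
        if isinstance(item, list):
            return list(await asyncio.gather(*(fetch_one(u) for u in item)))
        return await fetch_one(item)

    return list(await asyncio.gather(*(handle(item) for item in nested_urls)))


async def fake_fetch(url):
    # Placeholder download: returns the file name part of the URL.
    await asyncio.sleep(0)
    return url.rsplit("/", 1)[-1]


nested = ["s3://bucket/a.txt", ["s3://bucket/b.txt", "s3://bucket/c.txt"]]
output = asyncio.run(download_nested(nested, fake_fetch))
print(output)  # ['a.txt', ['b.txt', 'c.txt']]
```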

process_batched(samples)

Process a batch of samples.