data_juicer.core.ray_exporter module

class data_juicer.core.ray_exporter.RayExporter(export_path, export_type=None, export_shard_size=0, keep_stats_in_res_ds=True, keep_hashes_in_res_ds=False, encrypt_before_export=False, encryption_key_path=None, **kwargs)

Bases: object

The RayExporter class exports a Ray dataset to files in a specified format.

__init__(export_path, export_type=None, export_shard_size=0, keep_stats_in_res_ds=True, keep_hashes_in_res_ds=False, encrypt_before_export=False, encryption_key_path=None, **kwargs)

Initialization method.

Parameters:
  • export_path – the path to export datasets.

  • export_type – the format type of the exported datasets.

  • export_shard_size – the approximate size of each shard of the exported dataset. The default is 0, which means the dataset is exported using Ray's default settings.

  • keep_stats_in_res_ds – whether to keep stats in the result dataset.

  • keep_hashes_in_res_ds – whether to keep hashes in the result dataset.

  • encrypt_before_export – whether to encrypt each exported file in place after Ray has finished writing. All files inside the export directory will be encrypted. S3 paths are skipped. Default: False.

  • encryption_key_path – path to a file containing the Fernet key. Falls back to the DJ_ENCRYPTION_KEY environment variable when None. Only used when encrypt_before_export is True.
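The key lookup order described for encryption_key_path can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names `resolve_encryption_key` and `should_encrypt` are hypothetical, and the error raised when no key is found is an assumption.

```python
import os
from pathlib import Path
from typing import Optional

ENV_VAR = "DJ_ENCRYPTION_KEY"  # environment variable named in the parameter docs


def resolve_encryption_key(encryption_key_path: Optional[str] = None) -> bytes:
    """Hypothetical helper: return the key bytes from a file or the environment."""
    if encryption_key_path is not None:
        # An explicit key file takes precedence; strip a trailing newline.
        return Path(encryption_key_path).read_bytes().strip()
    key = os.environ.get(ENV_VAR)
    if not key:
        # Assumed behaviour: fail loudly when neither source provides a key.
        raise ValueError(f"no key file given and {ENV_VAR} is not set")
    return key.encode("utf-8")


def should_encrypt(export_path: str, encrypt_before_export: bool) -> bool:
    """Hypothetical helper: S3 paths are skipped, as the docs state."""
    return encrypt_before_export and not export_path.startswith("s3://")
```

The returned bytes would then be handed to a Fernet cipher; that step is omitted here so the sketch stays dependency-free.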

export(dataset, columns=None)

Export method for a dataset.

Parameters:
  • dataset – the dataset to export.

  • columns – the columns to export.

Returns:
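How the columns argument might interact with keep_stats_in_res_ds and keep_hashes_in_res_ds can be sketched on plain column names. Everything here is an assumption for illustration: the helper `select_export_columns` is hypothetical, the field names are placeholders, and the actual precedence rules in RayExporter may differ.

```python
from typing import Iterable, List, Optional

# Placeholder field names, assumed for illustration only.
STATS_FIELD = "__dj__stats__"
HASH_FIELDS = ("__dj__hash", "__dj__minhash", "__dj__simhash")


def select_export_columns(
    available: Iterable[str],
    columns: Optional[Iterable[str]] = None,
    keep_stats: bool = True,
    keep_hashes: bool = False,
) -> List[str]:
    """Hypothetical helper: decide which columns end up in the exported files."""
    wanted = None if columns is None else set(columns)
    # Start from the requested columns, or from everything when none are given.
    selected = [c for c in available if wanted is None or c in wanted]
    if not keep_stats:
        selected = [c for c in selected if c != STATS_FIELD]
    if not keep_hashes:
        selected = [c for c in selected if c not in HASH_FIELDS]
    return selected
```

With the defaults shown in the constructor signature (stats kept, hashes dropped), a dataset with a text column, a stats column, and a hash column would export the first two only.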

static write_json(dataset, export_path, **kwargs)

Export method for json/jsonl target files.

Parameters:
  • dataset – the dataset to export.

  • export_path – the path to store the exported dataset.

  • kwargs – extra arguments.

Returns:

static write_webdataset(dataset, export_path, **kwargs)

Export method for webdataset target files.

Parameters:
  • dataset – the dataset to export.

  • export_path – the path to store the exported dataset.

  • kwargs – extra arguments.

Returns:

static write_others(dataset, export_path, **kwargs)

Export method for other target files.

Parameters:
  • dataset – the dataset to export.

  • export_path – the path to store the exported dataset.

  • kwargs – extra arguments.

Returns:
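The split into three static writers suggests a dispatch on export_type. The sketch below shows one way such a dispatch could look; it uses stand-in functions instead of the real writers, and both the `pick_writer` helper and the exact set of type strings are assumptions rather than the library's documented behaviour.

```python
from typing import Any, Callable, Dict, Tuple


# Stand-ins for the three static methods; they only report which one ran.
def write_json(dataset: Any, export_path: str, **kwargs: Any) -> Tuple[str, str]:
    return ("json", export_path)


def write_webdataset(dataset: Any, export_path: str, **kwargs: Any) -> Tuple[str, str]:
    return ("webdataset", export_path)


def write_others(dataset: Any, export_path: str, **kwargs: Any) -> Tuple[str, str]:
    return ("others", export_path)


def pick_writer(export_type: str) -> Callable[..., Tuple[str, str]]:
    """Hypothetical helper: map an export type to a writer, defaulting to write_others."""
    writers: Dict[str, Callable[..., Tuple[str, str]]] = {
        "json": write_json,
        "jsonl": write_json,  # json and jsonl share one writer, per the docs
        "webdataset": write_webdataset,
    }
    return writers.get(export_type, write_others)
```

A dictionary lookup with a fallback keeps the known formats explicit while routing everything else to the generic writer.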