data_juicer.core.exporter module

class data_juicer.core.exporter.Exporter(export_path, export_type=None, export_shard_size=0, export_in_parallel=True, num_proc=1, export_ds=True, keep_stats_in_res_ds=False, keep_hashes_in_res_ds=False, export_stats=True, **kwargs)

Bases: object

The Exporter class exports a dataset to files in a specified format.

__init__(export_path, export_type=None, export_shard_size=0, export_in_parallel=True, num_proc=1, export_ds=True, keep_stats_in_res_ds=False, keep_hashes_in_res_ds=False, export_stats=True, **kwargs)

Initialization method.

Parameters:
  • export_path – the path to export datasets.

  • export_type – the format type of the exported datasets.

  • export_shard_size – the approximate size of each shard of the exported dataset. Defaults to 0, which exports the dataset to a single file.

  • export_in_parallel – whether to export the datasets in parallel.

  • num_proc – the number of processes used to export the dataset.

  • export_ds – whether to export the dataset contents.

  • keep_stats_in_res_ds – whether to keep stats in the result dataset.

  • keep_hashes_in_res_ds – whether to keep hashes in the result dataset.

  • export_stats – whether to export the stats of the dataset.
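
The effect of export_shard_size can be illustrated with a small standard-library sketch. It is not Data-Juicer's implementation: the helper name plan_shards is hypothetical, and measuring shard size in bytes of serialized JSON is an assumption made for illustration. It only shows the documented idea that 0 means a single output and a positive value yields several shards of approximately that size.

```python
import json


def plan_shards(records, shard_size=0):
    """Group records into shards of roughly `shard_size` bytes each.

    Hypothetical helper: a shard_size of 0 (or less) keeps all records
    in a single shard, mirroring the documented default.
    """
    if shard_size <= 0:
        return [list(records)]

    shards, current, current_bytes = [], [], 0
    for record in records:
        # Assumption: estimate size from the serialized JSON line plus newline.
        n = len(json.dumps(record, ensure_ascii=False).encode("utf-8")) + 1
        # Close the current shard once this record would exceed the budget.
        if current and current_bytes + n > shard_size:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(record)
        current_bytes += n
    if current:
        shards.append(current)
    return shards


records = [{"text": "a" * 10}, {"text": "b" * 10}, {"text": "c" * 10}]
single = plan_shards(records)                  # default: one shard
several = plan_shards(records, shard_size=40)  # small budget: several shards
```

Because a record is never split, a shard can exceed the budget when a single record is larger than it, which is one reason the size is only approximate in this sketch.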

export(dataset)

Export method for a dataset.

Parameters:

dataset – the dataset to export.

Returns:

export_compute_stats(dataset, export_path)

Export method for saving the stats computed in filters.

static to_jsonl(dataset, export_path, num_proc=1, **kwargs)

Export method for jsonl target files.

Parameters:
  • dataset – the dataset to export.

  • export_path – the path to store the exported dataset.

  • num_proc – the number of processes used to export the dataset.

  • kwargs – extra arguments.

Returns:
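
For readers unfamiliar with the jsonl target format, the following standard-library sketch shows its layout: one JSON object per line. The helpers write_jsonl and read_jsonl are hypothetical and operate on a plain list of dicts; they do not call Exporter.to_jsonl and make no claim about how it is implemented.

```python
import json
import os
import tempfile


def write_jsonl(records, export_path):
    """Write each record as one JSON object on its own line."""
    with open(export_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"text": "hello", "id": 1}, {"text": "wörld", "id": 2}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    write_jsonl(records, path)
    # Each record occupies exactly one line in the file.
    with open(path, encoding="utf-8") as f:
        line_count = sum(1 for _ in f)
    round_trip = read_jsonl(path)
```

Since every line is an independent document, such a file can be read or split line by line without parsing it as a whole.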

static to_json(dataset, export_path, num_proc=1, **kwargs)

Export method for json target files.

Parameters:
  • dataset – the dataset to export.

  • export_path – the path to store the exported dataset.

  • num_proc – the number of processes used to export the dataset.

  • kwargs – extra arguments.

Returns:
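
As a point of comparison with line-delimited output, the sketch below writes records as a single JSON document using only the standard library. It assumes a json target file holds all records in one top-level array; that layout is an assumption for illustration, and the code does not call Exporter.to_json.

```python
import json
import os
import tempfile

records = [{"text": "hello", "id": 1}, {"text": "world", "id": 2}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.json")
    # Assumption: all records are serialized as one top-level JSON array.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    # The file is a single JSON document, so it is parsed in one call.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)
```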

static to_parquet(dataset, export_path, **kwargs)

Export method for parquet target files.

Parameters:
  • dataset – the dataset to export.

  • export_path – the path to store the exported dataset.

  • kwargs – extra arguments.

Returns: