data_juicer.core.exporter module
- class data_juicer.core.exporter.Exporter(export_path, export_type=None, export_shard_size=0, export_in_parallel=True, num_proc=1, export_ds=True, keep_stats_in_res_ds=False, keep_hashes_in_res_ds=False, export_stats=True, **kwargs)
Bases: object
The Exporter class exports a dataset to files of a specific format.
- __init__(export_path, export_type=None, export_shard_size=0, export_in_parallel=True, num_proc=1, export_ds=True, keep_stats_in_res_ds=False, keep_hashes_in_res_ds=False, export_stats=True, **kwargs)
Initialization method.
- Parameters:
export_path – the path to export the dataset to.
export_type – the format type of the exported dataset.
export_shard_size – the approximate size of each shard of the exported dataset. The default is 0, which means the dataset is exported to a single file.
export_in_parallel – whether to export the dataset in parallel.
num_proc – the number of processes used to export the dataset.
export_ds – whether to export the dataset contents.
keep_stats_in_res_ds – whether to keep stats in the result dataset.
keep_hashes_in_res_ds – whether to keep hashes in the result dataset.
export_stats – whether to export the stats of the dataset.
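The effect of export_shard_size can be pictured with a small standard-library sketch. It is not Data-Juicer's implementation: the helper name split_into_shards is hypothetical, and it assumes the size is measured in bytes of the JSON-serialized records, which the parameter description above does not state.

```python
import json


def split_into_shards(records, shard_size_bytes):
    """Group records into shards of roughly shard_size_bytes each.

    Hypothetical helper for illustration only. A size of 0 (the default
    described above) keeps every record in a single shard.
    """
    if shard_size_bytes <= 0:
        return [list(records)]

    shards, current, current_size = [], [], 0
    for record in records:
        # Assumed size measure: UTF-8 bytes of the JSON line plus a newline.
        line = json.dumps(record, ensure_ascii=False)
        line_size = len(line.encode("utf-8")) + 1
        # Start a new shard once adding this record would exceed the target.
        if current and current_size + line_size > shard_size_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += line_size
    if current:
        shards.append(current)
    return shards
```

Because a record is never split, shards only approximate the target size, which matches the "approximate size" wording of the parameter.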
- export(dataset)
Export method for a dataset.
- Parameters:
dataset – the dataset to export.
- Returns:
- export_compute_stats(dataset, export_path)
Export method for saving the compute status of filters.
- static to_jsonl(dataset, export_path, num_proc=1, **kwargs)
Export method for jsonl target files.
- Parameters:
dataset – the dataset to export.
export_path – the path to store the exported dataset.
num_proc – the number of processes used to export the dataset.
kwargs – extra arguments.
- Returns:
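For readers unfamiliar with the jsonl target format, the sketch below writes one JSON object per line using only the standard library. It illustrates the file layout, not the body of to_jsonl; the helper names write_jsonl and read_jsonl are hypothetical, and records are assumed to be plain dictionaries.

```python
import json


def write_jsonl(records, export_path):
    """Write each record as one JSON object per line (JSON Lines layout)."""
    with open(export_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Since every line is an independent JSON document, a jsonl file can be read or appended to record by record without parsing the whole file.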
- static to_json(dataset, export_path, num_proc=1, **kwargs)
Export method for json target files.
- Parameters:
dataset – the dataset to export.
export_path – the path to store the exported dataset.
num_proc – the number of processes used to export the dataset.
kwargs – extra arguments.
- Returns:
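A json target file, by contrast with jsonl, can be pictured as a single JSON document holding all records. The sketch below uses only the standard library and assumes the records are stored as one top-level array; it is not the body of to_json, and the helper name write_json is hypothetical.

```python
import json


def write_json(records, export_path):
    """Write all records as a single JSON array (assumed layout)."""
    with open(export_path, "w", encoding="utf-8") as f:
        json.dump(list(records), f, ensure_ascii=False)
```

Under this assumption a reader has to parse the whole file at once with json.load, so the layout suits smaller exports better than record-by-record streaming.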