data_juicer.tools.quality_classifier.qc_utils module

data_juicer.tools.quality_classifier.qc_utils.init_spark(spark_executor_memory=None, spark_driver_memory=None, spark_executor_memoryOverhead=None)[source]

Initialize a spark session. You can set parameters such as memory, number of partitions, timeout and so on here.

Returns:

A spark session instance.
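
A minimal usage sketch follows; the memory values are illustrative, and the ‘4g’/‘1g’ string format is an assumption based on standard Spark configuration:

    from data_juicer.tools.quality_classifier.qc_utils import init_spark

    # All values below are illustrative, not recommended settings.
    spark = init_spark(spark_executor_memory='4g',
                       spark_driver_memory='4g',
                       spark_executor_memoryOverhead='1g')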

data_juicer.tools.quality_classifier.qc_utils.prepare_model(model_name, model_path='/home/runner/.cache/data_juicer/models')[source]

Prepare the specified model from the model cache path or remote OSS.

Parameters:

model_name – name of the quality classifier model

model_path – the path to store the model to be loaded

Returns:

a loaded PipelineModel
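
A hedged sketch; ‘gpt3’ is assumed here as an available model name:

    from data_juicer.tools.quality_classifier.qc_utils import prepare_model

    # Fetches the model from the local cache path, or downloads it from
    # remote OSS on first use. The model name 'gpt3' is an assumption.
    model = prepare_model('gpt3')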

data_juicer.tools.quality_classifier.qc_utils.load_dataset(spark, ds_path, text_key='text', only_text=False)[source]

Load a single dataset using PySpark. Only ‘json’, ‘jsonl’, and ‘parquet’ files are supported for now.

Parameters:

spark – spark session

ds_path – dataset path

text_key – the name of the column that stores the contents of texts

only_text – whether to load texts only and drop other columns

Returns:

a data frame
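
For example (the dataset path is a placeholder):

    from data_juicer.tools.quality_classifier.qc_utils import init_spark, load_dataset

    spark = init_spark()
    # Keep only the text column; 'demo.jsonl' is an illustrative path.
    df = load_dataset(spark, 'demo.jsonl', text_key='text', only_text=True)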

data_juicer.tools.quality_classifier.qc_utils.load_datasets(spark, ds_paths, text_key='text', label=None, only_text=True)[source]

Load a list of datasets. Only ‘json’, ‘jsonl’, and ‘parquet’ files are supported for now.

Parameters:

spark – spark session

ds_paths – a list of paths of the datasets to be loaded

text_key – the name of the column that stores the contents of texts

label – the label to set for these datasets; used in the training pipeline

only_text – whether to load texts only and drop other columns

Returns:

a data frame
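
A sketch of loading labeled training data; the file names and the 1/0 label convention for positive/negative samples are assumptions for illustration:

    from data_juicer.tools.quality_classifier.qc_utils import init_spark, load_datasets

    spark = init_spark()
    pos = load_datasets(spark, ['pos.jsonl'], label=1)  # positive samples
    neg = load_datasets(spark, ['neg.jsonl'], label=0)  # negative samples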

data_juicer.tools.quality_classifier.qc_utils.shuffle(df)[source]

Shuffle a data frame.

Parameters:

df – input data frame

Returns:

shuffled data frame
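
Continuing the load_datasets sketch above, the two labeled frames can be combined and shuffled before training:

    # union() is the standard PySpark DataFrame method; it assumes both
    # frames share the same schema.
    train_ds = shuffle(pos.union(neg))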

data_juicer.tools.quality_classifier.qc_utils.export_result(ds, res_path)[source]

Export a dataset to the specified path. Only ‘json’, ‘jsonl’, and ‘parquet’ export formats are supported for now.

Parameters:

ds – the dataset to be exported

res_path – the path to store the exported dataset
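
For example, given a scored data frame pred (e.g. from predict(), documented below); the output path is a placeholder, and the export format is presumably chosen from the file suffix:

    export_result(pred, 'result.jsonl')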

data_juicer.tools.quality_classifier.qc_utils.get_keep_method_udf(keep_method)[source]

Given the name of a keep method, return a PySpark user-defined function implementing that keep method. Only ‘gpt3’ and ‘label’ are supported for now.

Parameters:

keep_method – name of the keep method

Returns:

a PySpark UDF for the specified keep method
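
A hedged sketch of applying the UDF by hand; the column names 'doc_score' and 'should_keep' follow the predict() entry below, and applying the UDF manually like this is illustrative:

    from data_juicer.tools.quality_classifier.qc_utils import get_keep_method_udf

    keep_udf = get_keep_method_udf('label')
    ds = ds.withColumn('should_keep', keep_udf(ds['doc_score']))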

data_juicer.tools.quality_classifier.qc_utils.tokenize_dataset(ds, tokenizer)[source]

Tokenize the texts in the input dataset using the specified tokenizer.

Parameters:

ds – dataset to be tokenized

tokenizer – tokenizer used to tokenize texts

Returns:

a dataset with an extra column “words” that stores the tokenized texts
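
For example (sp_tokenizer is assumed to be a sentencepiece tokenizer prepared elsewhere; its exact type is not specified here):

    # Adds a 'words' column holding the tokenized texts.
    tokenized = tokenize_dataset(df, sp_tokenizer)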

data_juicer.tools.quality_classifier.qc_utils.train(output_model_path, ds, tokenizer=None)[source]

Train a quality classifier on a training dataset and export the trained model to the specified path.

Parameters:

output_model_path – the path to store the trained model

ds – training dataset

tokenizer – specified sentencepiece tokenizer. It’s None by default, which means the standard Tokenizer in PySpark is used
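
Putting the pieces together, a hedged end-to-end training sketch; all paths, file names, and labels are illustrative:

    from data_juicer.tools.quality_classifier.qc_utils import (
        init_spark, load_datasets, shuffle, train)

    spark = init_spark()
    pos = load_datasets(spark, ['pos.jsonl'], label=1)
    neg = load_datasets(spark, ['neg.jsonl'], label=0)
    # Default tokenizer=None uses PySpark's standard Tokenizer.
    train('my_quality_model', shuffle(pos.union(neg)))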

data_juicer.tools.quality_classifier.qc_utils.eval(model_path, ds, tokenizer=None)[source]

Evaluate a quality classifier model on a specified dataset.

Parameters:

model_path – the path to the model to be evaluated

ds – evaluation dataset

tokenizer – specified sentencepiece tokenizer. It’s None by default, which means the standard Tokenizer in PySpark is used
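
A sketch reusing the model trained above; calling through the module avoids shadowing Python's built-in eval, and the path and label are illustrative:

    from data_juicer.tools.quality_classifier import qc_utils

    spark = qc_utils.init_spark()
    eval_ds = qc_utils.load_datasets(spark, ['eval.jsonl'], label=1)
    qc_utils.eval('my_quality_model', eval_ds)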

data_juicer.tools.quality_classifier.qc_utils.predict(model, ds, tokenizer=None, keep_method='label')[source]

Predict document scores for a dataset using a trained quality classifier model.

Parameters:

model – the model used to predict

ds – the dataset to be predicted

tokenizer – specified sentencepiece tokenizer. It’s None by default, which means the standard Tokenizer in PySpark is used

keep_method – name of the keep method used to label the “should_keep” column

Returns:

a data frame with predicted document scores and a “should_keep” column
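
A hedged end-to-end scoring sketch; the model name and all paths are illustrative:

    from data_juicer.tools.quality_classifier.qc_utils import (
        init_spark, load_dataset, prepare_model, predict, export_result)

    spark = init_spark()
    model = prepare_model('gpt3')             # assumed model name
    ds = load_dataset(spark, 'demo.jsonl')    # placeholder dataset path
    pred = predict(model, ds, keep_method='label')
    export_result(pred, 'scored.jsonl')       # placeholder output path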