data_juicer.tools.quality_classifier.qc_utils module
- data_juicer.tools.quality_classifier.qc_utils.init_spark(spark_executor_memory=None, spark_driver_memory=None, spark_executor_memoryOverhead=None)[source]
Initialize a Spark session. Parameters such as memory, the number of partitions, and timeouts can be set here.
- Returns:
a Spark session instance
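The memory-related arguments map onto standard Spark configuration properties (`spark.executor.memory`, `spark.driver.memory`, `spark.executor.memoryOverhead`). A minimal sketch of that mapping, assuming those standard property names; the real function builds a full SparkSession on top of such a configuration:

```python
def spark_conf_sketch(spark_executor_memory=None,
                      spark_driver_memory=None,
                      spark_executor_memoryOverhead=None):
    """Map init_spark-style arguments to Spark property names (sketch)."""
    conf = {}
    if spark_executor_memory is not None:
        conf["spark.executor.memory"] = spark_executor_memory
    if spark_driver_memory is not None:
        conf["spark.driver.memory"] = spark_driver_memory
    if spark_executor_memoryOverhead is not None:
        conf["spark.executor.memoryOverhead"] = spark_executor_memoryOverhead
    return conf

# e.g. spark_conf_sketch(spark_executor_memory="4g")
# -> {"spark.executor.memory": "4g"}
```

`spark_conf_sketch` is illustrative, not part of the module.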
- data_juicer.tools.quality_classifier.qc_utils.prepare_model(model_name, model_path='/home/runner/.cache/data_juicer/models')[source]
Prepare the specified model, loading it from the local model cache path or from remote OSS.
- Parameters:
model_name -- name of the quality classifier model
model_path -- the path where the model to be loaded is stored
- Returns:
a loaded PipelineModel
- data_juicer.tools.quality_classifier.qc_utils.load_dataset(spark, ds_path, text_key='text', only_text=False)[source]
Load a single dataset using PySpark. Only 'json', 'jsonl', and 'parquet' files are supported for now.
- Parameters:
spark -- spark session
ds_path -- dataset path
text_key -- the name of the column that stores the text contents
only_text -- whether to load only the texts and drop the other columns
- Returns:
a data frame
- data_juicer.tools.quality_classifier.qc_utils.load_datasets(spark, ds_paths, text_key='text', label=None, only_text=True)[source]
Load a list of datasets. Only 'json', 'jsonl', and 'parquet' files are supported for now.
- Parameters:
spark -- spark session
ds_paths -- a list of paths of the datasets to be loaded
text_key -- the name of the column that stores the text contents
label -- the label assigned to these datasets; used in the training pipeline
only_text -- whether to load only the texts and drop the other columns
- Returns:
a data frame
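To illustrate the merge-and-label behaviour of `load_datasets`, here is a plain-Python sketch over lists of row dicts standing in for Spark DataFrames (`load_datasets_sketch` and its in-memory sources are illustrative, not part of the module):

```python
def load_datasets_sketch(sources, text_key="text", label=None, only_text=True):
    """Concatenate several row-dict sources, optionally keeping only the
    text column and attaching the same label to every row (sketch)."""
    rows = []
    for source in sources:           # each source: a list of row dicts
        for row in source:
            new_row = {"text": row[text_key]} if only_text else dict(row)
            if label is not None:
                new_row["label"] = label
            rows.append(new_row)
    return rows

pos = load_datasets_sketch(
    [[{"text": "good doc", "meta": 1}], [{"text": "another good doc"}]],
    label=1,
)
# -> [{"text": "good doc", "label": 1}, {"text": "another good doc", "label": 1}]
```

In the training pipeline, one such call with `label=1` and another with `label=0` would yield the positive and negative halves of the training set.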
- data_juicer.tools.quality_classifier.qc_utils.shuffle(df)[source]
Shuffle a data frame.
- Parameters:
df -- input data frame
- Returns:
the shuffled data frame
- data_juicer.tools.quality_classifier.qc_utils.export_result(ds, res_path)[source]
Export a dataset to the specified path. Only 'json', 'jsonl', and 'parquet' export formats are supported for now.
- Parameters:
ds -- the dataset to be exported
res_path -- the path to store the exported dataset
- data_juicer.tools.quality_classifier.qc_utils.get_keep_method_udf(keep_method)[source]
Given the name of a keep method, return a PySpark user-defined function implementing it. Only 'gpt3' and 'label' are supported for now.
- Parameters:
keep_method -- name of the keep method
- Returns:
a PySpark UDF for the specified keep method
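The two keep rules can be sketched in plain Python. The 0.5 threshold for 'label' and the Pareto resampling rule for 'gpt3' (alpha = 9, taken from the GPT-3 data-filtering recipe) are assumptions about what the UDFs wrap; check the source for the exact constants:

```python
import random

def keep_label(doc_score):
    # 'label' rule (assumed): keep documents whose score passes 0.5
    return int(doc_score > 0.5)

def keep_gpt3(doc_score, rng=random):
    # 'gpt3' rule (assumed, after the GPT-3 paper): keep when a
    # Pareto(alpha=9) draw exceeds 1 - score, so higher-scoring
    # documents are kept with higher probability
    pareto_draw = rng.paretovariate(9) - 1  # shift support to [0, inf)
    return int(pareto_draw > 1 - doc_score)
```

Under this rule a document scoring 0.95 survives about 64% of the time ((1.05)^-9), while one scoring 0.05 almost never does, which keeps some low-scoring documents for diversity instead of hard-thresholding.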
- data_juicer.tools.quality_classifier.qc_utils.tokenize_dataset(ds, tokenizer)[source]
Tokenize the texts in the input dataset using the specified tokenizer.
- Parameters:
ds -- dataset to be tokenized
tokenizer -- tokenizer used to tokenize the texts
- Returns:
a dataset with an extra column "words" that stores the tokenized texts
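As a rough illustration of the "words" column this produces, here is a plain-Python sketch in which a whitespace split stands in for a sentencepiece tokenizer and dict rows stand in for a Spark DataFrame:

```python
def tokenize_rows_sketch(rows, tokenize=str.split, text_key="text"):
    # Add a "words" column holding the tokenized text of each row.
    return [{**row, "words": tokenize(row[text_key])} for row in rows]

# tokenize_rows_sketch([{"text": "hello big world"}])
# -> [{"text": "hello big world", "words": ["hello", "big", "world"]}]
```

All original columns are kept; only the extra "words" column is added.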
- data_juicer.tools.quality_classifier.qc_utils.train(output_model_path, ds, tokenizer=None)[source]
Train a quality classifier on the training dataset and export the trained model to the specified path.
- Parameters:
output_model_path -- the path to store the trained model
ds -- training dataset
tokenizer -- the sentencepiece tokenizer to use; None by default, which means the standard PySpark Tokenizer is used
- data_juicer.tools.quality_classifier.qc_utils.eval(model_path, ds, tokenizer=None)[source]
Evaluate a quality classifier model on the specified dataset.
- Parameters:
model_path -- the path to the model to be evaluated
ds -- evaluation dataset
tokenizer -- the sentencepiece tokenizer to use; None by default, which means the standard PySpark Tokenizer is used
- data_juicer.tools.quality_classifier.qc_utils.predict(model, ds, tokenizer=None, keep_method='label')[source]
Predict document scores for a dataset using a trained quality classifier model.
- Parameters:
model -- the model used to predict
ds -- the dataset to be predicted
tokenizer -- the sentencepiece tokenizer to use; None by default, which means the standard PySpark Tokenizer is used
keep_method -- name of the keep method used to compute the "should_keep" column