data_juicer.tools.quality_classifier.predict module

data_juicer.tools.quality_classifier.predict.predict_score(dataset_path, result_path, model='gpt3', tokenizer=None, keep_method='gpt3', text_key='text', overall_stats=False)[source]

Use a specific quality classifier to predict document scores on your dataset.

Parameters:

dataset_path -- the path to the dataset you want to predict scores for
result_path -- the path where the predicted result dataset is stored
model -- the name of the quality classifier to apply. Defaults to "gpt3". You can use one of the provided models ["gpt3", "chinese", "code"], or set it to the path of your own model trained with the train.py tool
tokenizer -- the tokenizer used to tokenize texts. Defaults to None, which means the standard Tokenizer of PySpark is used. You can use one of the provided tokenizers ["zh.sp.model", "code.sp.model"], or set it to the path of your own sentencepiece model
keep_method -- the method used to label the should_keep field of each sample. Defaults to "gpt3". Must be one of ["gpt3", "label"]
text_key -- the name of the field that holds the texts to be classified. Defaults to "text"
overall_stats -- whether to output an overall stats report on the predicted document scores. Defaults to False

Returns:

None if overall_stats is False; the average quality score of the documents if overall_stats is True
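
As a minimal sketch of how the arguments above fit together, the following builds a keyword-argument set for predict_score and checks keep_method against the two documented values before calling. The helper names build_predict_kwargs and run_prediction are hypothetical, the file paths are placeholders, and pairing the "chinese" model with the "zh.sp.model" tokenizer is an assumption rather than something this page states.

```python
from typing import Any, Dict, Optional

# Values documented for keep_method in the parameter list above.
KEEP_METHODS = ("gpt3", "label")


def build_predict_kwargs(
    dataset_path: str,
    result_path: str,
    model: str = "gpt3",
    tokenizer: Optional[str] = None,
    keep_method: str = "gpt3",
    text_key: str = "text",
    overall_stats: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and collect arguments for predict_score."""
    if keep_method not in KEEP_METHODS:
        raise ValueError(
            f"keep_method must be one of {KEEP_METHODS}, got {keep_method!r}"
        )
    return {
        "dataset_path": dataset_path,
        "result_path": result_path,
        "model": model,
        "tokenizer": tokenizer,
        "keep_method": keep_method,
        "text_key": text_key,
        "overall_stats": overall_stats,
    }


def run_prediction(kwargs: Dict[str, Any]) -> Any:
    """Hypothetical wrapper: call predict_score with the collected arguments."""
    # Imported lazily so the helper above can be used without data_juicer
    # (and its PySpark dependency) being installed.
    from data_juicer.tools.quality_classifier.predict import predict_score

    return predict_score(**kwargs)


# Placeholder paths; replace them with real locations.
kwargs = build_predict_kwargs(
    dataset_path="data/input.jsonl",
    result_path="data/scored",
    model="chinese",
    tokenizer="zh.sp.model",  # assumed pairing with the "chinese" model
    overall_stats=True,
)
print(kwargs)

# Uncomment once data_juicer is installed and the paths exist:
# result = run_prediction(kwargs)
```

With overall_stats=True the call is documented to return a value rather than None, so the commented-out line assigns the result; with the default overall_stats=False there is nothing to capture.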
