data_juicer.tools.quality_classifier.train module
- data_juicer.tools.quality_classifier.train.main(positive_datasets, negative_datasets, output_model_path='my_quality_model', num_training_samples=0, train_test_split_ratio=0.8, tokenizer=None, evaluation=True, text_key='text')
Train a quality classifier using your own positive and negative datasets.
- Parameters:
positive_datasets -- the paths to the positive datasets. It could be a string for a single dataset, e.g. 'pos.parquet', or a list of strings for several datasets, e.g. '["pos1.parquet", "pos2.parquet"]'
negative_datasets -- the paths to the negative datasets. It could be a string for a single dataset, e.g. 'neg.parquet', or a list of strings for several datasets, e.g. '["neg1.parquet", "neg2.parquet"]'
output_model_path -- the path to store the trained quality classifier. It's "my_quality_model" by default
num_training_samples -- number of samples used to train the model. It's 0 by default, which means using all samples in the datasets for training
train_test_split_ratio -- ratio used to split the data into train and test sets. It's 0.8 by default
tokenizer -- the tokenizer used to tokenize texts. It's None by default, which means using the standard Tokenizer of PySpark. You can use one of the provided models, ["zh.sp.model", "code.sp.model"], or set it to the path of your own sentencepiece model
evaluation -- whether to evaluate the model on the test set after training. It's True by default
text_key -- the field key name that holds the texts to be classified. It's "text" by default
- Returns:
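Example: the parameters documented above can be collected into keyword arguments before calling `main`. The following is a minimal sketch, not part of the module itself. The helpers `build_train_kwargs` and `train_quality_classifier` are hypothetical names introduced here for illustration; only the import path, the parameter names, and the defaults come from the signature above. The import of `main` is deferred so the sketch can be loaded without data_juicer or PySpark installed.

```python
from typing import Any, Dict, List, Union

# A dataset argument is either one path or a list of paths, as documented above.
PathOrPaths = Union[str, List[str]]


def build_train_kwargs(
    positive_datasets: PathOrPaths,
    negative_datasets: PathOrPaths,
    **overrides: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: start from the documented defaults, then apply overrides."""
    kwargs: Dict[str, Any] = {
        "positive_datasets": positive_datasets,
        "negative_datasets": negative_datasets,
        "output_model_path": "my_quality_model",
        "num_training_samples": 0,      # 0 means: use all samples
        "train_test_split_ratio": 0.8,
        "tokenizer": None,              # None means: standard PySpark Tokenizer
        "evaluation": True,
        "text_key": "text",
    }
    # Reject names that are not in the documented signature.
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


def train_quality_classifier(kwargs: Dict[str, Any]) -> Any:
    """Hypothetical wrapper: forward the collected arguments to train.main."""
    # Deferred import: requires data_juicer (and PySpark) to be installed.
    from data_juicer.tools.quality_classifier.train import main

    return main(**kwargs)


# Several positive datasets, one negative dataset, and a provided tokenizer model.
example_kwargs = build_train_kwargs(
    ["pos1.parquet", "pos2.parquet"],
    "neg.parquet",
    tokenizer="zh.sp.model",
)
```

With data_juicer installed and the parquet files in place, `train_quality_classifier(example_kwargs)` would start training. The file names here are the placeholder paths from the parameter descriptions, not real datasets.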