data_juicer.tools.quality_classifier.train module

data_juicer.tools.quality_classifier.train.main(positive_datasets, negative_datasets, output_model_path='my_quality_model', num_training_samples=0, train_test_split_ratio=0.8, tokenizer=None, evaluation=True, text_key='text')

Train a quality classifier using your own positive and negative datasets.

Parameters:
  • positive_datasets – the paths to the positive datasets. This can be a string for a single dataset, e.g. ‘pos.parquet’, or a list of strings for several datasets, e.g. ‘[“pos1.parquet”, “pos2.parquet”]’

  • negative_datasets – the paths to the negative datasets. This can be a string for a single dataset, e.g. ‘neg.parquet’, or a list of strings for several datasets, e.g. ‘[“neg1.parquet”, “neg2.parquet”]’

  • output_model_path – the path where the trained quality classifier is stored. Defaults to “my_quality_model”

  • num_training_samples – the number of samples used to train the model. Defaults to 0, which means all samples in the datasets are used for training

  • train_test_split_ratio – the ratio used to split the samples into a training set and a test set. Defaults to 0.8

  • tokenizer – the tokenizer used to tokenize texts. Defaults to None, which means the standard Tokenizer of PySpark is used. You can use one of the provided models, [“zh.sp.model”, “code.sp.model”], or set it to the path of your own sentencepiece model

  • evaluation – whether to evaluate the model on the test set after training. Defaults to True

  • text_key – the name of the field that holds the texts to be classified. Defaults to “text”
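
The parameters above could be assembled and passed to main along the following lines. This is only a sketch: build_train_kwargs and run_training are hypothetical helpers that are not part of data_juicer, the file names are placeholders, and the range checks (a non-negative sample count, a split ratio strictly between 0 and 1) are assumptions made for illustration rather than documented behaviour.

```python
from typing import List, Optional, Union

DatasetPaths = Union[str, List[str]]


def build_train_kwargs(
    positive_datasets: DatasetPaths,
    negative_datasets: DatasetPaths,
    output_model_path: str = "my_quality_model",
    num_training_samples: int = 0,
    train_test_split_ratio: float = 0.8,
    tokenizer: Optional[str] = None,
    evaluation: bool = True,
    text_key: str = "text",
) -> dict:
    """Hypothetical helper (not part of data_juicer): check and collect arguments."""
    for name, value in (
        ("positive_datasets", positive_datasets),
        ("negative_datasets", negative_datasets),
    ):
        # Each dataset argument is either one path or a list of paths.
        if not isinstance(value, (str, list)) or not value:
            raise ValueError(f"{name} must be a non-empty path or list of paths")
    # Assumed sanity checks; the library itself may accept other values.
    if num_training_samples < 0:
        raise ValueError("num_training_samples must be >= 0 (0 means all samples)")
    if not 0.0 < train_test_split_ratio < 1.0:
        raise ValueError("train_test_split_ratio must be between 0 and 1")
    return {
        "positive_datasets": positive_datasets,
        "negative_datasets": negative_datasets,
        "output_model_path": output_model_path,
        "num_training_samples": num_training_samples,
        "train_test_split_ratio": train_test_split_ratio,
        "tokenizer": tokenizer,
        "evaluation": evaluation,
        "text_key": text_key,
    }


def run_training(**kwargs):
    """Hypothetical wrapper that forwards the collected arguments to main."""
    # Imported lazily so this sketch can be loaded without data_juicer installed.
    from data_juicer.tools.quality_classifier.train import main

    return main(**kwargs)


# Two positive datasets, one negative dataset, and a provided tokenizer model.
kwargs = build_train_kwargs(
    positive_datasets=["pos1.parquet", "pos2.parquet"],
    negative_datasets="neg.parquet",
    tokenizer="zh.sp.model",
)
# run_training(**kwargs)  # uncomment once data_juicer and the datasets are available
```

Collecting the arguments in one place keeps the defaults visible next to any overrides, and the call to main stays a single line that can be enabled once the datasets exist.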

Returns: