data_juicer.tools.quality_classifier.eval module

data_juicer.tools.quality_classifier.eval.main(positive_datasets=None, negative_datasets=None, model='my_quality_model', tokenizer=None, text_key='text')[source]

Evaluate a trained quality classifier on the given positive and negative datasets.

Parameters:
  • positive_datasets – the paths to the positive datasets. This can be a string for a single dataset, e.g. ‘pos.parquet’, or a list of strings for multiple datasets, e.g. ‘[“pos1.parquet”, “pos2.parquet”]’

  • negative_datasets – the paths to the negative datasets. This can be a string for a single dataset, e.g. ‘neg.parquet’, or a list of strings for multiple datasets, e.g. ‘[“neg1.parquet”, “neg2.parquet”]’

  • model – the name of the quality classifier to apply. Defaults to “my_quality_model”. You can use one of the provided models [“gpt3”, “chinese”, “code”], or set it to the path of your own model trained with the train.py tool

  • tokenizer – the tokenizer used to tokenize texts. Defaults to None, which means the standard Tokenizer of PySpark is used. You can use one of the provided tokenizers [“zh.sp.model”, “code.sp.model”], or set it to the path of your own sentencepiece model

  • text_key – the name of the field that holds the texts to be classified. Defaults to “text”
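
The parameters above can be assembled into a call as in the following minimal sketch. The helpers as_path_list, build_eval_kwargs and run_eval are hypothetical names introduced only for illustration and are not part of data_juicer; the file names are the placeholder paths from the parameter descriptions. The sketch assumes data_juicer and PySpark are installed before run_eval is called, and it makes no claim about what main returns.

```python
from typing import Any, Dict, List, Optional, Union

PathArg = Optional[Union[str, List[str]]]


def as_path_list(datasets: PathArg) -> List[str]:
    """Hypothetical helper: turn a single path or a list of paths into a list."""
    if datasets is None:
        return []
    if isinstance(datasets, str):
        return [datasets]
    return list(datasets)


def build_eval_kwargs(
    positive_datasets: PathArg,
    negative_datasets: PathArg,
    model: str = "my_quality_model",
    tokenizer: Optional[str] = None,
    text_key: str = "text",
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments matching main's signature."""
    return {
        "positive_datasets": as_path_list(positive_datasets),
        "negative_datasets": as_path_list(negative_datasets),
        "model": model,
        "tokenizer": tokenizer,
        "text_key": text_key,
    }


def run_eval(**kwargs: Any) -> Any:
    """Call the documented entry point; requires data_juicer to be installed."""
    # Imported lazily so the helpers above work without data_juicer present.
    from data_juicer.tools.quality_classifier.eval import main

    return main(**kwargs)


# Placeholder paths taken from the parameter descriptions.
kwargs = build_eval_kwargs(
    positive_datasets="pos.parquet",
    negative_datasets=["neg1.parquet", "neg2.parquet"],
)
print(kwargs)
# To run the evaluation itself: run_eval(**kwargs)
```

Passing lists for both dataset arguments keeps the call uniform; since the documentation states that a single string is also accepted, the normalization step is a convenience rather than a requirement.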

Returns: