data_juicer.analysis.diversity_analysis module

data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj(tree_root)

Find the verb and its object closest to the root.

Parameters:

tree_root – the root of the lexical tree

Returns:

the valid verb and its object.
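
The idea of "closest to the root" can be illustrated with a small, self-contained sketch. This is not the module's implementation: the `Node` class and the `closest_verb_and_dobj` function are hypothetical stand-ins, and the tag names `"VERB"` and `"dobj"` are assumed to follow common dependency-parser conventions. The sketch walks the tree breadth-first, stops at the first verb it meets, and reports that verb's direct object if one exists.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token (not a data_juicer type)."""
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "VERB" (assumed convention)
    dep: str   # dependency label to the parent, e.g. "dobj" (assumed convention)
    children: List["Node"] = field(default_factory=list)


def closest_verb_and_dobj(root: Node) -> Tuple[Optional[str], Optional[str]]:
    """Return (verb, object) for the verb nearest the root, or (None, None)."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.pos == "VERB":
            # The first verb reached in breadth-first order is the closest one.
            for child in node.children:
                if child.dep == "dobj":
                    return node.text, child.text
            return node.text, None  # verb found, but it has no direct object
        queue.extend(node.children)
    return None, None  # no verb anywhere in the tree


# Tree for "Write a poem": Write -> poem -> a
poem = Node("poem", "NOUN", "dobj", [Node("a", "DET", "det")])
root = Node("Write", "VERB", "ROOT", [poem])
print(closest_verb_and_dobj(root))
```

Breadth-first order is one reasonable way to define "closest"; the library function may traverse the tree differently.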

data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj_in_string(nlp, s, first_sent=True)

Find the verb and its object closest to the root of the lexical tree of the input string.

Parameters:
  • nlp – the diversity model used to analyze the strings

  • s – the string to be analyzed

  • first_sent – whether to analyze only the first sentence of the input string. If true, return the analysis result of the first sentence whether or not it is valid. If false, return the first valid result across all sentences

Returns:

the valid verb and its object for this string
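
The effect of `first_sent` can be sketched without a language model. In the example below, `pick_result` is a hypothetical helper, the per-sentence results are supplied by hand, and "valid" is assumed to mean that both a verb and an object were found; the library may define validity differently.

```python
from typing import Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def pick_result(per_sentence: Iterable[Pair], first_sent: bool = True) -> Pair:
    """Choose one (verb, object) pair from per-sentence analysis results."""
    for verb, obj in per_sentence:
        # With first_sent=True the first sentence wins unconditionally;
        # otherwise keep scanning until a pair with both parts appears.
        if first_sent or (verb is not None and obj is not None):
            return verb, obj
    return None, None  # no sentences, or no valid pair found


# Hand-written results: sentence 1 yields nothing, sentence 2 yields a pair.
results = [(None, None), ("write", "poem")]
print(pick_result(results, first_sent=True))
print(pick_result(results, first_sent=False))
```

With these inputs, the first call returns the empty result from sentence 1 and the second call skips ahead to the pair from sentence 2.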

data_juicer.analysis.diversity_analysis.get_diversity(dataset, top_k_verbs=20, top_k_nouns=4, **kwargs)

Given the lexical tree analysis result, return the diversity results.

Parameters:
  • dataset – the lexical tree analysis result

  • top_k_verbs – keep only the top_k_verbs largest verb groups

  • top_k_nouns – keep only the top_k_nouns largest noun groups for each verb group

  • kwargs – extra arguments

Returns:

the diversity results
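
The two top-k filters can be illustrated with a standard-library sketch. It is an approximation under stated assumptions, not the library's code: `top_verb_noun_groups` is a hypothetical name, the input is assumed to be an iterable of (verb, noun) pairs, pairs with a missing part are assumed to be dropped, and the return type (a dict of counted nouns per verb) is chosen for readability only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def top_verb_noun_groups(
    pairs: Iterable[Tuple[Optional[str], Optional[str]]],
    top_k_verbs: int = 20,
    top_k_nouns: int = 4,
) -> Dict[str, List[Tuple[str, int]]]:
    """Keep the largest verb groups and, within each, the largest noun groups."""
    # Assumption: incomplete pairs carry no diversity signal and are skipped.
    complete = [(v, n) for v, n in pairs if v is not None and n is not None]
    verb_counts = Counter(v for v, _ in complete)

    result: Dict[str, List[Tuple[str, int]]] = {}
    for verb, _ in verb_counts.most_common(top_k_verbs):
        noun_counts = Counter(n for v, n in complete if v == verb)
        result[verb] = noun_counts.most_common(top_k_nouns)
    return result


pairs = (
    [("write", "poem")] * 3
    + [("write", "essay")] * 2
    + [("write", "story")]
    + [("give", "example")] * 2
    + [("explain", "concept"), (None, None)]
)
print(top_verb_noun_groups(pairs, top_k_verbs=2, top_k_nouns=2))
```

Here "write" and "give" are the two largest verb groups, so "explain" is dropped, and "story" is dropped from the "write" group because only the two most frequent nouns are kept.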

class data_juicer.analysis.diversity_analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')

Bases: object

Apply diversity analysis for each sample and get an overall analysis result.

__init__(dataset, output_path, lang_or_model='en')

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • lang_or_model – the diversity model or a specific language used to load the diversity model

compute(lang_or_model=None, column_name='text')

Apply lexical tree analysis on each sample.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of the column to be analyzed

Returns:

the analysis result.

analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)

Apply diversity analysis on the whole dataset.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of the column to be analyzed

  • postproc_func – the function used to analyze diversity. Defaults to get_diversity

  • postproc_kwarg – keyword arguments passed to postproc_func

Returns: