data_juicer.analysis#

class data_juicer.analysis.ColumnWiseAnalysis(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]#

Bases: object

Apply analysis on each column of stats respectively.

__init__(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]#

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • overall_result – optional precomputed overall stats result

  • save_stats_in_one_file – whether to save the analysis figures of all stats into one image file

analyze(show_percentiles=False, show=False, skip_export=False)[source]#

Apply analysis and draw the analysis figure for stats.

Parameters:
  • show_percentiles – whether to show the percentile lines in each sub-figure. If true, several red lines mark the quantiles of the stats distributions

  • show – whether to show in a single window after drawing

  • skip_export – whether to skip saving the results to disk

Returns:

draw_box(ax, data, save_path, percentiles=None, show=False)[source]#

Draw the box plot for the data.

Parameters:
  • ax – the axes to draw

  • data – data to draw

  • save_path – the path to save the box figure

  • percentiles – the overall analysis result of the data including percentile information

  • show – whether to show in a single window after drawing

Returns:
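
As a rough illustration of what a box plot summarizes, the sketch below computes the quartiles, whisker ends, and outliers for one list of values using only the standard library. It is not the code used by draw_box; the helper name box_plot_summary is hypothetical, and the 1.5 × IQR whisker rule is the common plotting default, assumed here rather than taken from this page.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a conventional box plot is built from."""
    data = sorted(values)
    # 'inclusive' interpolates linearly between the closest ranks.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points inside the fences.
    whisker_low = min(v for v in data if v >= low_fence)
    whisker_high = max(v for v in data if v <= high_fence)
    # Anything outside the fences is drawn as an individual outlier point.
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
        "outliers": outliers,
    }


# Hypothetical stat values with one extreme sample.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

A single extreme value such as 100 ends up in the outlier list instead of stretching the whisker, which is why box plots are useful for spotting anomalous samples in a stat column.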

draw_hist(ax, data, save_path, percentiles=None, show=False)[source]#

Draw the histogram for the data.

Parameters:
  • ax – the axes to draw

  • data – data to draw

  • save_path – the path to save the histogram figure

  • percentiles – the overall analysis result of the data including percentile information

  • show – whether to show in a single window after drawing

Returns:

draw_wordcloud(ax, data, save_path, show=False)[source]#

class data_juicer.analysis.CorrelationAnalysis(dataset, output_path)[source]#

Bases: object

Analyze the correlations among different stats. Only for numerical stats.

__init__(dataset, output_path)[source]#

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

analyze(method='pearson', show=False, skip_export=False)[source]#
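
To make the default method='pearson' concrete, here is a minimal standard-library sketch of a pairwise Pearson correlation matrix over numerical stat columns. It is an illustration under stated assumptions, not the library's implementation: the helper names pearson and correlation_matrix and the column names in stats are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every pair of columns; `columns` maps name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical numerical stats for four samples.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
print(matrix["text_len"]["num_words"])
print(matrix["text_len"]["special_char_ratio"])
```

Values near 1 or -1 indicate stats that move together or in opposition, which can suggest that two stats carry largely redundant information.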

class data_juicer.analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')[source]#

Bases: object

Apply diversity analysis for each sample and get an overall analysis result.

__init__(dataset, output_path, lang_or_model='en')[source]#

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • lang_or_model – the diversity model or a specific language used to load the diversity model

analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)[source]#

Apply diversity analysis on the whole dataset.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of the column to be analyzed

  • postproc_func – function to analyze diversity. By default, it is the function get_diversity

  • postproc_kwarg – arguments of the postproc_func

Returns:

compute(lang_or_model=None, column_name='text')[source]#

Apply lexical tree analysis on each sample.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of the column to be analyzed

Returns:

the analysis result.
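
The split between compute (per-sample analysis) and analyze (aggregation through postproc_func) can be pictured with the sketch below. It assumes, purely for illustration, that each sample has already been reduced to a (verb, noun) pair; the function summarize_diversity and its parameters are hypothetical, and the library's get_diversity may aggregate differently.

```python
from collections import Counter


def summarize_diversity(pairs, top_k_verbs=2, top_k_nouns=2):
    """Count the most frequent verbs and, per verb, the most frequent nouns.

    `pairs` is a list of (verb, noun) tuples, one per sample. This input
    shape is an assumption made for the example.
    """
    verb_counts = Counter(verb for verb, _ in pairs)
    rows = []
    for verb, _ in verb_counts.most_common(top_k_verbs):
        noun_counts = Counter(noun for v, noun in pairs if v == verb)
        for noun, count in noun_counts.most_common(top_k_nouns):
            rows.append((verb, noun, count))
    return rows


# Hypothetical per-sample results, standing in for the output of a parser.
pairs = [
    ("write", "essay"),
    ("write", "poem"),
    ("write", "essay"),
    ("explain", "concept"),
    ("translate", "sentence"),
    ("explain", "concept"),
]
rows = summarize_diversity(pairs)
for row in rows:
    print(row)
```

A dataset dominated by a few rows with high counts is less diverse than one where the counts are spread across many verb and noun combinations.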

class data_juicer.analysis.OverallAnalysis(dataset, output_path)[source]#

Bases: object

Apply analysis on the overall stats, including mean, std, quantiles, etc.

__init__(dataset, output_path)[source]#

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

analyze(percentiles=[], num_proc=1, skip_export=False)[source]#

Apply overall analysis on the whole dataset based on the describe method of pandas.

Parameters:
  • percentiles – percentiles to analyze

  • num_proc – number of processes used to analyze the dataset

  • skip_export – whether to skip exporting the results to disk

Returns:

the overall analysis result.
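
For readers unfamiliar with the pandas describe method, the following standard-library sketch reproduces the kind of summary it produces for one numerical column: count, mean, sample standard deviation, min, selected percentiles, and max. The helper names percentile and describe are hypothetical and this is not the library's code; the percentile labels and the linear interpolation mirror pandas' defaults as an assumption.

```python
import statistics


def percentile(sorted_values, p):
    """Linearly interpolated percentile of sorted data, with p in [0, 1]."""
    if not sorted_values:
        raise ValueError("no data")
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def describe(values, percentiles=(0.25, 0.5, 0.75)):
    """Summarize one numerical column in the style of pandas describe."""
    data = sorted(values)
    summary = {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
    }
    for p in percentiles:
        summary[f"{p:.0%}"] = percentile(data, p)
    summary["max"] = data[-1]
    return summary


# Hypothetical values of a single stat across five samples.
result = describe([1, 2, 3, 4, 5])
print(result)
```

Passing extra values in percentiles, for example 0.05 and 0.95, adds the corresponding rows to the summary, which is the role the percentiles parameter plays above.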

refine_single_column(col)[source]#