data_juicer.analysis
- class data_juicer.analysis.ColumnWiseAnalysis(dataset, output_path, overall_result=None, save_stats_in_one_file=True)
Bases: object
Apply analysis to each column of stats separately.
- __init__(dataset, output_path, overall_result=None, save_stats_in_one_file=True)
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
overall_result – optional precomputed overall stats result
save_stats_in_one_file – whether to save the analysis figures of all stats into one image file
- analyze(show_percentiles=False, show=False, skip_export=False)
Apply analysis and draw the analysis figure for stats.
- Parameters:
show_percentiles – whether to show the percentile lines in each sub-figure. If true, several red lines mark the quantiles of the stats distributions
show – whether to show the figure in a single window after drawing
skip_export – whether to skip saving the results to disk
- draw_box(ax, data, save_path, percentiles=None, show=False)
Draw the box plot for the data.
- Parameters:
ax – the axes to draw on
data – the data to draw
save_path – the path to save the box figure
percentiles – the overall analysis result of the data, including percentile information
show – whether to show the figure in a single window after drawing
- draw_hist(ax, data, save_path, percentiles=None, show=False)
Draw the histogram for the data.
- Parameters:
ax – the axes to draw on
data – the data to draw
save_path – the path to save the histogram figure
percentiles – the overall analysis result of the data, including percentile information
show – whether to show the figure in a single window after drawing
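As an illustration of what the percentile lines in a histogram sub-figure convey, here is a minimal standard-library sketch. It is not the library's implementation: the helper names (`percentile`, `histogram`, `render`), the equal-width binning, the linear interpolation rule, and the sample values are all assumptions made for this example, and the output is plain text rather than a drawn figure.

```python
import math


def percentile(sorted_values, q):
    """Linearly interpolated percentile of pre-sorted data, with q in [0, 1]."""
    if not sorted_values:
        raise ValueError("no data")
    pos = (len(sorted_values) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def histogram(values, num_bins):
    """Equal-width bin counts and bin edges over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid zero width for constant data
    counts = [0] * num_bins
    for v in values:
        # The maximum value is folded into the last bin.
        idx = min(int((v - low) / width), num_bins - 1)
        counts[idx] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return counts, edges


def render(values, num_bins=5, quantiles=(0.25, 0.5, 0.75)):
    """Return bin counts, percentile values, and a text histogram with markers."""
    data = sorted(values)
    counts, edges = histogram(data, num_bins)
    marks = {q: percentile(data, q) for q in quantiles}
    lines = []
    for i, count in enumerate(counts):
        left, right = edges[i], edges[i + 1]
        is_last = i == num_bins - 1
        here = [
            f"p{round(q * 100)}"
            for q, v in marks.items()
            if left <= v < right or (is_last and v == right)
        ]
        lines.append(f"[{left:6.1f}, {right:6.1f}) {'#' * count} {' '.join(here)}".rstrip())
    return counts, marks, lines


# Hypothetical stat values, e.g. one numeric column of per-sample stats.
sample = [10, 12, 15, 20, 22, 25, 30, 41, 50, 60]
for line in render(sample)[2]:
    print(line)
```

Each marker (`p25`, `p50`, `p75`) appears next to the bin that contains the corresponding percentile, which is the same information the red lines add to a drawn histogram.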
- class data_juicer.analysis.CorrelationAnalysis(dataset, output_path)
Bases: object
Analyze the correlations among different stats. Only numerical stats are supported.
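To make "correlations among numerical stats" concrete, the following standard-library sketch computes a pairwise Pearson correlation matrix and skips non-numeric columns. The source does not say which correlation method the class uses, so Pearson is an assumption here, and the function names (`pearson`, `correlation_matrix`) and column names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def is_numeric(column):
    """True if every value is an int or float (bools are excluded)."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column)


def correlation_matrix(stats):
    """Pairwise correlations between the numeric columns of a dict of columns."""
    numeric = {name: col for name, col in stats.items() if is_numeric(col)}
    names = list(numeric)
    return {a: {b: pearson(numeric[a], numeric[b]) for b in names} for a in names}


# Hypothetical per-sample stats; "lang" is non-numeric and is left out.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "lang": ["en", "en", "zh", "en"],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Filtering out string-valued columns before computing the matrix mirrors the "only numerical stats" restriction stated above.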
- class data_juicer.analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')
Bases: object
Apply diversity analysis to each sample and produce an overall analysis result.
- __init__(dataset, output_path, lang_or_model='en')
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
lang_or_model – the diversity model or a specific language used to load the diversity model
- analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)
Apply diversity analysis on the whole dataset.
- Parameters:
lang_or_model – the diversity model or a specific language used to load the diversity model
column_name – the name of the column to be analyzed
postproc_func – function used to analyze diversity. Defaults to the function get_diversity
postproc_kwarg – keyword arguments passed to postproc_func
- class data_juicer.analysis.OverallAnalysis(dataset, output_path)
Bases: object
Apply analysis on the overall stats, including mean, std, quantiles, etc.
- __init__(dataset, output_path)
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
- analyze(percentiles=[], num_proc=1, skip_export=False)
Apply overall analysis on the whole dataset based on the describe method of pandas.
- Parameters:
percentiles – percentiles to analyze
num_proc – number of processes used to analyze the dataset
skip_export – whether to skip exporting the results to disk
- Returns:
the overall analysis result.
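For readers unfamiliar with a describe-style summary, the sketch below computes similar fields (count, mean, sample standard deviation, min, selected percentiles, max) for one numeric column using only the standard library. It is an approximation for illustration, not the library's code: the function name `describe`, its default percentiles, and the linear interpolation rule are assumptions.

```python
import math
import statistics


def describe(values, percentiles=(0.25, 0.5, 0.75)):
    """Summary statistics for one numeric column, in a describe-like layout."""
    data = sorted(values)
    if not data:
        raise ValueError("no data")
    summary = {
        "count": len(data),
        "mean": statistics.fmean(data),
        # Sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(data) if len(data) > 1 else float("nan"),
        "min": data[0],
    }
    for q in sorted(percentiles):
        # Linear interpolation between the two nearest ranks.
        pos = (len(data) - 1) * q
        lo, hi = math.floor(pos), math.ceil(pos)
        summary[f"{q:.0%}"] = data[lo] + (data[hi] - data[lo]) * (pos - lo)
    summary["max"] = data[-1]
    return summary


# Hypothetical values of a single stat across five samples.
result = describe([1, 2, 3, 4, 5])
for key, value in result.items():
    print(f"{key:>5}: {value}")
```

Passing a different `percentiles` tuple, for example `(0.1, 0.9)`, adds the corresponding rows between `min` and `max`.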