data_juicer_sandbox.data_pool_manipulators module
- data_juicer_sandbox.data_pool_manipulators.get_longest_common_prefix(list_of_strings)[source]
Get the longest common prefix of the given list of strings.
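A minimal usage sketch (the pool names are hypothetical):

```python
from data_juicer_sandbox.data_pool_manipulators import get_longest_common_prefix

# hypothetical pool names; the shared prefix "pool_part_" is returned
prefix = get_longest_common_prefix(['pool_part_0.jsonl', 'pool_part_1.jsonl'])
print(prefix)  # pool_part_
```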
- data_juicer_sandbox.data_pool_manipulators.load_data_pool(ds_path) → NestedDataset[source]
Load a dataset from the given path. Always returns a NestedDataset.
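A minimal usage sketch (the dataset path is hypothetical):

```python
from data_juicer_sandbox.data_pool_manipulators import load_data_pool

# hypothetical path; the result is a NestedDataset regardless of the input
ds = load_data_pool('analyzed_dataset.jsonl')
```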
- class data_juicer_sandbox.data_pool_manipulators.BaseDataPoolManipulator(data_pool_cfg: dict)[source]
Bases: object
- class data_juicer_sandbox.data_pool_manipulators.DataPoolConstruction(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Construct data pools from the specified analyzed data source.
- Input:
an analyzed dataset.
an output path.
(optional) split_ratios. Defaults to [1/3, 2/3]. Fractions in string format are supported.
(optional) split_num. Deactivated by default. Specifies the number of samples in one data pool. split_num takes precedence if both split_ratios and split_num are specified.
(optional) ignore_stats. False by default. Whether to ignore the stats ranking when splitting the data pool.
- Output: MxN data pools, where N is the number of types of analyzed stats and M is the number of split parts. If ignore_stats is True, N is 1 and the resulting data pools are stored in the export_path directly. They are named following the rule "<stats_key_name>/<original_name>_<part_idx>.jsonl"
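A hedged configuration sketch; the key names below are assumptions inferred from the input description above, not taken from the implementation:

```python
from data_juicer_sandbox.data_pool_manipulators import DataPoolConstruction

# All key names are assumptions; check the implementation for the exact schema.
cfg = {
    'dataset_path': 'analyzed_dataset.jsonl',  # assumed key: the analyzed dataset
    'export_path': './data_pools/',            # assumed key: the output path
    'split_ratios': ['1/3', '2/3'],            # default; fractions as strings
    'ignore_stats': False,                     # split according to stats ranking
}
DataPoolConstruction(cfg).run()
```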
- class data_juicer_sandbox.data_pool_manipulators.DataPoolCombination(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Combine data pools from the specified data pools.
- Input:
N split data pools, already ordered by their ranks.
- Output: 2^N - 1 combined data pools, including the original N data pools: N + C(N, 2) + ... + C(N, N) = 2^N - 1. They are named following the rule "<longest_common_prefix>_top_<combined_ranks>_num_<num_samples>.jsonl"
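The output count follows from enumerating every non-empty subset of the N ranked pools. A minimal sketch of that enumeration (pool names are hypothetical; this illustrates the counting, not the actual implementation):

```python
from itertools import combinations

pools = ['pool_rank1.jsonl', 'pool_rank2.jsonl', 'pool_rank3.jsonl']  # N = 3

# Every non-empty subset of the ranked pools is one combined data pool:
# C(3, 1) + C(3, 2) + C(3, 3) = 7 = 2**3 - 1
combos = [c for k in range(1, len(pools) + 1)
          for c in combinations(pools, k)]
print(len(combos))  # 7
```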
- class data_juicer_sandbox.data_pool_manipulators.DataPoolDuplication(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Duplicate data pools a specified number of times.
- Input:
N specified data pools.
a list of duplication factors, e.g. [2, 4, 8].
whether to shuffle the duplicated dataset.
- Output: NxM new duplicated data pools, where M is the length of the times list. They are named following the rule "<original_name>_x<times>.jsonl"
- class data_juicer_sandbox.data_pool_manipulators.DataPoolRanking(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Rank data pools according to specified evaluation metrics.
- Input:
N specified data pools.
the evaluation metrics of these N data pools, as a dict keyed by data paths.
(optional) keys in the metrics used to rank the data pools. The '.' operator is supported for accessing nested keys. The whole metric object is used by default.
(optional) whether to sort in descending order. True by default.
(optional) a number N so that only the top-N data pool paths are returned.
- Output: an ordered list of data pool paths according to their evaluated metrics.
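A minimal sketch of the ranking semantics described above (nested-key lookup via the '.' operator, descending by default); this mirrors the documented behavior, not the implementation:

```python
metrics = {  # hypothetical evaluated metrics keyed by data path
    'pool_a.jsonl': {'eval': {'score': 0.82}},
    'pool_b.jsonl': {'eval': {'score': 0.91}},
}

def get_nested(obj, dotted_key):
    # resolve a key like 'eval.score' level by level
    for part in dotted_key.split('.'):
        obj = obj[part]
    return obj

ranked = sorted(metrics, key=lambda path: get_nested(metrics[path], 'eval.score'),
                reverse=True)  # descending by default
print(ranked)  # ['pool_b.jsonl', 'pool_a.jsonl']
```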
- class data_juicer_sandbox.data_pool_manipulators.DataPoolDownsampling(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Randomly downsample data pools to a specified scale.
- Input:
N specified data pools.
(optional) the target number of samples. Defaults to the size of the smallest data pool.
(optional) seed for randomness.
- Output: N downsampled data pools. They are named following the rule "<original_name>_<num_sample>.jsonl"
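A minimal sketch of the default behavior (every pool is randomly downsampled to the smallest pool's size, with a seeded RNG); the data and seed are illustrative only:

```python
import random

pools = {  # hypothetical in-memory pools
    'pool_a.jsonl': list(range(100)),
    'pool_b.jsonl': list(range(40)),
}
target = min(len(samples) for samples in pools.values())  # 40 by default
rng = random.Random(42)  # assumed seed for randomness
downsampled = {path: rng.sample(samples, target)
               for path, samples in pools.items()}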
- class data_juicer_sandbox.data_pool_manipulators.DataPoolCartesianJoin(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Join two sets of data pools with a Cartesian join.
- Example: Given two sets of data pools M and N, where M = {DP(A, B, C), DP(E, F), DP(G, H, I, J)} and N = {DP(1), DP(2, 3)}, after this hook they are Cartesian joined into: {
DP(A1, B1, C1), DP(A2, A3, B2, B3, C2, C3), DP(E1, F1), DP(E2, E3, F2, F3), DP(G1, H1, I1, J1), DP(G2, G3, H2, H3, I2, I3, J2, J3),
}
- Input:
M data pools.
N data pools.
- Output: M x N joined data pools MN, where MN(i, j) = M(i) x N(j). They are named following the rule "<longest_common_prefix>_cartesian_join_{i}_{j}.jsonl"
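A minimal sketch of the pool-level pairing (each pool in M is joined with each pool in N); the names are hypothetical and the sample-level join is elided:

```python
from itertools import product

M = ['dp_abc.jsonl', 'dp_ef.jsonl', 'dp_ghij.jsonl']  # 3 pools
N = ['dp_1.jsonl', 'dp_23.jsonl']                     # 2 pools

# |M| * |N| = 6 joined pools MN(i, j), matching the naming rule above
for i, j in product(range(len(M)), range(len(N))):
    print(f'cartesian_join_{i}_{j}: {M[i]} x {N[j]}')
```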