data_juicer_sandbox.data_pool_manipulators module
- data_juicer_sandbox.data_pool_manipulators.get_longest_common_prefix(list_of_strings)[source]
Get the longest common prefix of the given list of strings.
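A minimal usage sketch (the pool names are hypothetical):

```python
from data_juicer_sandbox.data_pool_manipulators import get_longest_common_prefix

# hypothetical pool names; the shared prefix "pool_part_" is returned
prefix = get_longest_common_prefix(['pool_part_0.jsonl', 'pool_part_1.jsonl'])
print(prefix)  # pool_part_
```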
- data_juicer_sandbox.data_pool_manipulators.load_data_pool(ds_path) → NestedDataset[source]
Load a dataset from the given path. Always returns a NestedDataset.
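A minimal usage sketch (the dataset path is hypothetical):

```python
from data_juicer_sandbox.data_pool_manipulators import load_data_pool

# hypothetical path; the result is a NestedDataset regardless of the input
ds = load_data_pool('analyzed_dataset.jsonl')
```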
- class data_juicer_sandbox.data_pool_manipulators.BaseDataPoolManipulator(data_pool_cfg: dict)[source]
Bases: object
- class data_juicer_sandbox.data_pool_manipulators.DataPoolConstruction(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Construct data pools from the specified analyzed data source.
- Input:
an analyzed dataset.
an output path.
(optional) split_ratios. Defaults to [1/3, 2/3]. Fractions in string format are supported.
(optional) split_num. Deactivated by default. Specifies the number of samples in one data pool. split_num takes precedence if both split_ratios and split_num are specified.
(optional) ignore_stats. False by default. Whether to ignore the stats ranking when splitting the data pool.
- Output: MxN data pools, where N is the number of types of analyzed stats and M is the number of split parts. If ignore_stats is True, N is 1 and the resulting data pools are stored in the export_path directly. They are named following the rule "<stats_key_name>/<original_name>_<part_idx>.jsonl"
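A hedged configuration sketch; the key names below are assumptions inferred from the input description above, not taken from the implementation:

```python
from data_juicer_sandbox.data_pool_manipulators import DataPoolConstruction

# All key names are assumptions; check the implementation for the exact schema.
cfg = {
    'dataset_path': 'analyzed_dataset.jsonl',  # assumed key: the analyzed dataset
    'export_path': './data_pools/',            # assumed key: the output path
    'split_ratios': ['1/3', '2/3'],            # default; fractions as strings
    'ignore_stats': False,                     # split according to stats ranking
}
DataPoolConstruction(cfg).run()
```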
- class data_juicer_sandbox.data_pool_manipulators.DataPoolCombination(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Combine data pools from the specified data pools.
- Input:
N split data pools, already ordered by their ranks.
- Output: 2^N - 1 combined data pools, including the original N data pools: N + C(N, 2) + ... + C(N, N) = 2^N - 1. They are named following the rule "<longest_common_prefix>_top_<combined_ranks>_num_<num_samples>.jsonl"
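The output count follows from enumerating every non-empty subset of the N ranked pools. A minimal sketch of that enumeration (pool names are hypothetical; this illustrates the counting, not the actual implementation):

```python
from itertools import combinations

pools = ['pool_rank1.jsonl', 'pool_rank2.jsonl', 'pool_rank3.jsonl']  # N = 3

# Every non-empty subset of the ranked pools is one combined data pool:
# C(3, 1) + C(3, 2) + C(3, 3) = 7 = 2**3 - 1
combos = [c for k in range(1, len(pools) + 1)
          for c in combinations(pools, k)]
print(len(combos))  # 7
```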
- class data_juicer_sandbox.data_pool_manipulators.DataPoolDuplication(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Duplicate data pools a specified number of times.
- Input:
N specified data pools.
a list of duplication factors, e.g. [2, 4, 8].
whether to shuffle the duplicated dataset.
- Output: NxM new duplicated data pools, where M is the length of the times list. They are named following the rule "<original_name>_x<times>.jsonl"
- class data_juicer_sandbox.data_pool_manipulators.DataPoolRanking(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Rank data pools according to specified evaluation metrics.
- Input:
N specified data pools.
the evaluation metrics of these N data pools, as a dict keyed by data paths.
(optional) keys in the metrics used to rank the data pools. The '.' operator is supported for accessing nested keys. The whole metric object is used by default.
(optional) whether to sort in descending order. True by default.
(optional) a number N so that only the top-N data pool paths are returned.
- Output: an ordered list of data pool paths according to their evaluated metrics.
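A minimal sketch of the ranking semantics described above (nested-key lookup via the '.' operator, descending by default); this mirrors the documented behavior, not the implementation:

```python
metrics = {  # hypothetical evaluated metrics keyed by data path
    'pool_a.jsonl': {'eval': {'score': 0.82}},
    'pool_b.jsonl': {'eval': {'score': 0.91}},
}

def get_nested(obj, dotted_key):
    # resolve a key like 'eval.score' level by level
    for part in dotted_key.split('.'):
        obj = obj[part]
    return obj

ranked = sorted(metrics, key=lambda path: get_nested(metrics[path], 'eval.score'),
                reverse=True)  # descending by default
print(ranked)  # ['pool_b.jsonl', 'pool_a.jsonl']
```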
- class data_juicer_sandbox.data_pool_manipulators.DataPoolDownsampling(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Randomly downsample data pools to a specified scale.
- Input:
N specified data pools.
(optional) the target number of samples. Defaults to the size of the smallest data pool.
(optional) seed for randomness.
- Output: N downsampled data pools. They are named following the rule "<original_name>_<num_sample>.jsonl"
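A minimal sketch of the default behavior (every pool is randomly downsampled to the smallest pool's size, with a seeded RNG); the data and seed are illustrative only:

```python
import random

pools = {  # hypothetical in-memory pools
    'pool_a.jsonl': list(range(100)),
    'pool_b.jsonl': list(range(40)),
}
target = min(len(samples) for samples in pools.values())  # 40 by default
rng = random.Random(42)  # assumed seed for randomness
downsampled = {path: rng.sample(samples, target)
               for path, samples in pools.items()}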
- class data_juicer_sandbox.data_pool_manipulators.DataPoolCartesianJoin(data_pool_cfg: dict)[source]
Bases: BaseDataPoolManipulator
- run()[source]
Join two sets of data pools with a Cartesian join.
- Example: Given two sets of data pools M and N, where M = {DP(A, B, C), DP(E, F), DP(G, H, I, J)} and N = {DP(1), DP(2, 3)}, after this hook they are Cartesian joined into: {
DP(A1, B1, C1), DP(A2, A3, B2, B3, C2, C3), DP(E1, F1), DP(E2, E3, F2, F3), DP(G1, H1, I1, J1), DP(G2, G3, H2, H3, I2, I3, J2, J3),
}
- Input:
M data pools.
N data pools.
- Output: M x N joined data pools MN, where MN(i, j) = M(i) x N(j). They are named following the rule "<longest_common_prefix>_cartesian_join_{i}_{j}.jsonl"
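A minimal sketch of the pool-level pairing (each pool in M is joined with each pool in N); the names are hypothetical and the sample-level join is elided:

```python
from itertools import product

M = ['dp_abc.jsonl', 'dp_ef.jsonl', 'dp_ghij.jsonl']  # 3 pools
N = ['dp_1.jsonl', 'dp_23.jsonl']                     # 2 pools

# |M| * |N| = 6 joined pools MN(i, j), matching the naming rule above
for i, j in product(range(len(M)), range(len(N))):
    print(f'cartesian_join_{i}_{j}: {M[i]} x {N[j]}')
```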