data_juicer_sandbox.data_pool_manipulators module#

data_juicer_sandbox.data_pool_manipulators.make_hashable(obj)[source]#
data_juicer_sandbox.data_pool_manipulators.get_longest_common_prefix(list_of_strings)[source]#

Get the longest common prefix of the given list of strings.
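A minimal sketch of this behavior, using the standard library's `os.path.commonprefix` as a stand-in (`common_prefix` is a hypothetical helper, not the library's actual implementation):

```python
import os.path

def common_prefix(strings):
    # Character-wise longest common prefix of all strings;
    # an empty list yields an empty prefix.
    return os.path.commonprefix(strings)

print(common_prefix(["pool_top_1.jsonl", "pool_top_2.jsonl"]))
```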

data_juicer_sandbox.data_pool_manipulators.check_io_paths(input_paths, export_path)[source]#
data_juicer_sandbox.data_pool_manipulators.load_data_pool(ds_path) → NestedDataset[source]#

Load a dataset from the given path. Only a NestedDataset is returned.

class data_juicer_sandbox.data_pool_manipulators.BaseDataPoolManipulator(data_pool_cfg: dict)[source]#

Bases: object

__init__(data_pool_cfg: dict)[source]#
run()[source]#

Manipulations for data pools.

class data_juicer_sandbox.data_pool_manipulators.DataPoolConstruction(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Construct data pools from the specified analyzed data source.

Input:
  • an analyzed dataset.

  • an output path.

  • (optional) split_ratios. Defaults to [1/3, 2/3]; fractions in string format are supported.

  • (optional) split_num. Not activated by default. Specifies the number of samples in each data pool. If both split_ratios and split_num are specified, split_num takes precedence.

  • (optional) ignore_stats. Defaults to False. Whether to ignore the stats and skip splitting according to the stats ranking.

Output: MxN data pools, where N is the number of types of analyzed stats and M is the number of split parts. If ignore_stats is True, N is 1 and the resulting data pools are stored in the export_path directly. They are named following the rule "<stats_key_name>/<original_name>_<part_idx>.jsonl"
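The splitting step can be sketched as follows; `split_by_ratios` is a hypothetical helper illustrating how the default split_ratios of [1/3, 2/3] cut a ranked dataset into three parts, not the hook's actual implementation:

```python
def split_by_ratios(samples, split_ratios=(1 / 3, 2 / 3)):
    # The ratios are cut points over the dataset size, so the default
    # [1/3, 2/3] splits a ranked dataset into 3 (roughly) equal parts.
    n = len(samples)
    cuts = [0] + [round(r * n) for r in split_ratios] + [n]
    return [samples[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]

parts = split_by_ratios(list(range(9)))  # three parts of 3 samples each
```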

class data_juicer_sandbox.data_pool_manipulators.DataPoolCombination(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Combine the specified data pools into new combined pools.

Input:
  • N split data pools, which are already ordered by their ranks.

Output: 2^N - 1 combined data pools, including the original N data pools, since C(N, 1) + C(N, 2) + ... + C(N, N) = 2^N - 1. They are named following the rule "<longest_common_prefix>_top_<combined_ranks>_num_<num_samples>.jsonl"
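A sketch of the combination count, assuming in-memory pools; `combine_pools` is a hypothetical helper, not the hook's real implementation:

```python
from itertools import combinations

def combine_pools(pools):
    # Enumerate every non-empty combination of the ranked pools:
    # C(N,1) + C(N,2) + ... + C(N,N) = 2^N - 1 combined pools,
    # including the original N single-pool "combinations".
    combined = {}
    for k in range(1, len(pools) + 1):
        for ranks in combinations(range(len(pools)), k):
            name = "top_" + "_".join(str(r + 1) for r in ranks)
            combined[name] = [s for r in ranks for s in pools[r]]
    return combined

combined = combine_pools([["a"], ["b"], ["c"]])  # 2^3 - 1 = 7 pools
```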

class data_juicer_sandbox.data_pool_manipulators.DataPoolDuplication(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Duplicate data pools a specified number of times.

Input:
  • N specified data pools.

  • a list of duplication factors, e.g. [2, 4, 8].

  • whether to shuffle the duplicated dataset.

Output: NxM new duplicated data pools, where M is the length of the times list. They are named following the rule "<original_name>_x<times>.jsonl"
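The duplication step can be sketched as below; `duplicate_pool` and its `seed` parameter are hypothetical, shown only to illustrate the times list and the optional shuffle:

```python
import random

def duplicate_pool(samples, times_list=(2, 4, 8), shuffle=False, seed=42):
    # Produce one duplicated pool per entry in times_list
    # (M = len(times_list)), optionally shuffling each result.
    rng = random.Random(seed)
    results = {}
    for t in times_list:
        dup = samples * t
        if shuffle:
            rng.shuffle(dup)
        results[f"x{t}"] = dup
    return results
```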

class data_juicer_sandbox.data_pool_manipulators.DataPoolRanking(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Rank data pools according to specified evaluation metrics.

Input:
  • N specified data pools

  • The evaluation metrics of these N data pools, as a dict keyed by data paths.

  • (optional) Keys in the metrics used to rank the data pools. The '.' operator is supported for accessing nested keys. The whole metric object is used by default.

  • (optional) whether to sort in descending order. Defaults to True.

  • (optional) a number N; only the top-N data pool paths are returned.

Output: An ordered list of data pool paths according to their evaluated metrics.

class data_juicer_sandbox.data_pool_manipulators.DataPoolDownsampling(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Randomly downsample data pools to a specified scale.

Input:
  • N specified data pools.

  • (optional) the target number of samples. Defaults to the size of the smallest data pool.

  • (optional) seed for randomness.

Output: N downsampled data pools. They are named following the rule "<original_name>_<num_sample>.jsonl"
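A sketch of the default behavior, assuming in-memory pools; `downsample_pools` is a hypothetical helper, not the hook's implementation:

```python
import random

def downsample_pools(pools, target_num=None, seed=0):
    # By default the target is the size of the smallest pool, so all
    # downsampled pools end up with the same number of samples.
    if target_num is None:
        target_num = min(len(p) for p in pools)
    rng = random.Random(seed)
    return [rng.sample(p, target_num) for p in pools]
```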

class data_juicer_sandbox.data_pool_manipulators.DataPoolMerging(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Merge data pools into one dataset or data pool.

Input:
  • N split data pools.

Output: 1 merged dataset/data pool, which is named following the rule "<longest_common_prefix>_merged.jsonl"
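A minimal sketch of the merge and its naming rule; `merge_pools` is hypothetical, and the real hook operates on files rather than in-memory lists:

```python
import os.path

def merge_pools(pools_by_path):
    # Concatenate all pools; the merged output name combines the
    # longest common prefix of the input paths with "_merged.jsonl".
    paths = list(pools_by_path)
    prefix = os.path.commonprefix(paths).rstrip("_")
    merged = [s for p in paths for s in pools_by_path[p]]
    return f"{prefix}_merged.jsonl", merged
```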

class data_juicer_sandbox.data_pool_manipulators.DataPoolCartesianJoin(data_pool_cfg: dict)[source]#

Bases: BaseDataPoolManipulator

run()[source]#

Join two sets of data pools with a Cartesian join.

Example: Given two sets of data pools M and N, where M = {DP(A, B, C), DP(E, F), DP(G, H, I, J)} and N = {DP(1), DP(2, 3)}, this hook Cartesian-joins them into: {DP(A1, B1, C1), DP(A2, A3, B2, B3, C2, C3), DP(E1, F1), DP(E2, E3, F2, F3), DP(G1, H1, I1, J1), DP(G2, G3, H2, H3, I2, I3, J2, J3)}

Input:
  • M data pools.

  • N data pools.

Output: M x N joined data pools MN, where MN(i, j) = M(i) x N(j).

They are named following the rule "<longest_common_prefix>_cartesian_join_{i}_{j}.jsonl"
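The join above can be sketched with `itertools.product`; `cartesian_join` is a hypothetical helper, and string concatenation stands in for the real sample-join logic:

```python
from itertools import product

def cartesian_join(pools_m, pools_n):
    # MN(i, j) pairs every sample of M(i) with every sample of N(j),
    # yielding M x N joined pools keyed by their (i, j) indices.
    return {(i, j): [a + b for a, b in product(m, n)]
            for (i, m), (j, n) in product(enumerate(pools_m),
                                          enumerate(pools_n))}

joined = cartesian_join([["A", "B", "C"], ["E", "F"], ["G", "H", "I", "J"]],
                        [["1"], ["2", "3"]])  # 3 x 2 = 6 joined pools
```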