data_juicer_sandbox.utils module

data_juicer_sandbox.utils.validate_hook_output(pipelines, output_key)

Check whether a specified hook output is valid.

This function parses the output_key and searches for the corresponding hook within the given pipeline list, then validates whether the hook contains the specified output key.

Parameters:
  • pipelines -- A list of pipeline objects, each containing a name attribute and job lists including probe_jobs, refine_recipe_jobs, execution_jobs, and evaluation_jobs

  • output_key -- The output key string with format "pipeline_name.hook_meta_name.output_name"

Returns:

True if the corresponding pipeline, hook and output key are found and valid, otherwise False
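The lookup described above can be sketched roughly as follows. This is an illustration, not the library's implementation: the function name `check_hook_output`, the hook attributes `meta_name` and `output`, and the `SimpleNamespace` stand-ins for pipelines and hooks are all assumptions made for the example. Only the key format and the four job-list names come from the description above.

```python
from types import SimpleNamespace

# Job-list attribute names taken from the parameter description above.
JOB_LIST_NAMES = ("probe_jobs", "refine_recipe_jobs",
                  "execution_jobs", "evaluation_jobs")


def check_hook_output(pipelines, output_key):
    """Return True if output_key resolves to an output of a known hook."""
    parts = output_key.split(".")
    if len(parts) != 3:
        # The key must look like "pipeline_name.hook_meta_name.output_name".
        return False
    pipeline_name, hook_meta_name, output_name = parts

    for pipeline in pipelines:
        if pipeline.name != pipeline_name:
            continue
        for list_name in JOB_LIST_NAMES:
            for hook in getattr(pipeline, list_name, []):
                # Assumed hook layout: `meta_name` plus a collection of
                # output names in `output` (both hypothetical).
                if hook.meta_name == hook_meta_name and output_name in hook.output:
                    return True
    return False


# Stand-in objects for demonstration only.
hook = SimpleNamespace(meta_name="analyze", output=["stats_path"])
pipeline = SimpleNamespace(name="p1", probe_jobs=[hook], refine_recipe_jobs=[],
                           execution_jobs=[], evaluation_jobs=[])

print(check_hook_output([pipeline], "p1.analyze.stats_path"))  # True
print(check_hook_output([pipeline], "p1.analyze.missing"))     # False
```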

data_juicer_sandbox.utils.guess_file_or_dir(path: str) -> str

Guess whether a path is a file or a directory.

If there is a "." in the basename of the path and the "." is not the first char, guess it's a file. Otherwise, guess it's a directory.
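A minimal sketch of this heuristic is shown below. The function name `guess_kind` and the return values `"file"` and `"dir"` are assumptions for illustration; the documentation above only states that a string is returned. The sketch also reads "the "." is not the first char" as "the basename does not start with a dot".

```python
import os


def guess_kind(path: str) -> str:
    """Guess "file" or "dir" from the basename alone (no filesystem access)."""
    basename = os.path.basename(path)
    # A dot somewhere after the first character suggests a file extension.
    if "." in basename and not basename.startswith("."):
        return "file"
    # No dot, or a leading dot (e.g. a hidden directory): treat as a directory.
    return "dir"


print(guess_kind("/a/b/c/d.jsonl"))  # file
print(guess_kind("/a/b/c"))          # dir
print(guess_kind("/a/b/.cache"))     # dir
```

Because the guess depends only on the name, a directory such as "/data/v1.2" would be classified as a file under this rule.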

data_juicer_sandbox.utils.add_iter_subdir_to_paths(paths: List[str], iter_num: int) -> List[str]

Add iteration number as a subdir to the specified paths.

Examples:

  1. files: "/a/b/c/d.jsonl" --> "/a/b/c/{iter_num}/d.jsonl"

  2. dirs: "/a/b/c" --> "/a/b/c/{iter_num}"

Parameters:
  • paths -- the input original paths

  • iter_num -- the iteration number to be added to the paths

Returns:

The resulting paths, one per original path, with the iteration number inserted as shown in the examples.
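The two example transformations can be reproduced with a short sketch like the one below. It is not the library's code: `add_iter_subdir` and `_looks_like_file` are hypothetical names, the file-or-directory test is the same basename heuristic used by guess_file_or_dir, and `posixpath` is used so the output matches the POSIX-style examples on any platform.

```python
import posixpath
from typing import List


def _looks_like_file(path: str) -> bool:
    # Same basename heuristic as described for guess_file_or_dir (assumed reading).
    basename = posixpath.basename(path)
    return "." in basename and not basename.startswith(".")


def add_iter_subdir(paths: List[str], iter_num: int) -> List[str]:
    """Insert str(iter_num) as a subdirectory into each path."""
    result = []
    for path in paths:
        if _looks_like_file(path):
            # File: place the iteration directory between parent and file name.
            parent, name = posixpath.split(path)
            result.append(posixpath.join(parent, str(iter_num), name))
        else:
            # Directory: append the iteration directory at the end.
            result.append(posixpath.join(path, str(iter_num)))
    return result


print(add_iter_subdir(["/a/b/c/d.jsonl", "/a/b/c"], 3))
# ['/a/b/c/3/d.jsonl', '/a/b/c/3']
```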