data_juicer_sandbox.utils module
- data_juicer_sandbox.utils.validate_hook_output(pipelines, output_key)
Check whether a specified hook output is valid.
This function parses the output_key and searches for the corresponding hook within the given pipeline list, then validates whether the hook contains the specified output key.
- Parameters:
pipelines -- A list of pipeline objects, each containing a name attribute and job lists including probe_jobs, refine_recipe_jobs, execution_jobs, and evaluation_jobs
output_key -- The output key string with format "pipeline_name.hook_meta_name.output_name"
- Returns:
True if the corresponding pipeline, hook and output key are found and valid, otherwise False
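The lookup described above can be sketched roughly as follows. This is an illustrative re-implementation, not the library's source: the function name `validate_hook_output_sketch` and the hook attributes `meta_name` and `output_keys` are hypothetical stand-ins, since the page does not say how a hook stores its meta name or its output keys. Only the pipeline `name` attribute and the four job list names come from the documentation.

```python
from types import SimpleNamespace

# Job list names taken from the parameter description above.
JOB_LIST_NAMES = (
    "probe_jobs",
    "refine_recipe_jobs",
    "execution_jobs",
    "evaluation_jobs",
)


def validate_hook_output_sketch(pipelines, output_key: str) -> bool:
    """Return True if output_key names an existing pipeline, hook, and output."""
    # Expected format: "pipeline_name.hook_meta_name.output_name"
    parts = output_key.split(".")
    if len(parts) != 3:
        return False
    pipeline_name, hook_meta_name, output_name = parts

    for pipeline in pipelines:
        if pipeline.name != pipeline_name:
            continue
        # Search every job list of the matching pipeline for the hook.
        for list_name in JOB_LIST_NAMES:
            for hook in getattr(pipeline, list_name, []):
                # `meta_name` and `output_keys` are assumed attribute names.
                if hook.meta_name == hook_meta_name and output_name in hook.output_keys:
                    return True
    return False


# Hypothetical pipeline and hook objects, for demonstration only.
hook = SimpleNamespace(meta_name="analyze", output_keys=["stats"])
pipeline = SimpleNamespace(
    name="pipe1",
    probe_jobs=[hook],
    refine_recipe_jobs=[],
    execution_jobs=[],
    evaluation_jobs=[],
)

ok = validate_hook_output_sketch([pipeline], "pipe1.analyze.stats")
missing = validate_hook_output_sketch([pipeline], "pipe1.analyze.other")
```

With these stand-in objects, `ok` is `True` and `missing` is `False`. The real function may apply further validity checks that this sketch does not model.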
- data_juicer_sandbox.utils.guess_file_or_dir(path: str) -> str
Guess whether a path is a file or a directory.
If the basename of the path contains a "." that is not its first character, the path is guessed to be a file. Otherwise, it is guessed to be a directory.
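A minimal sketch of this heuristic, assuming the function returns the labels "file" and "dir" (the page does not state the exact return values, and `guess_file_or_dir_sketch` is a hypothetical name rather than the library function):

```python
import os


def guess_file_or_dir_sketch(path: str) -> str:
    """Guess "file" or "dir" (assumed labels) from the basename of the path."""
    basename = os.path.basename(path)
    # A "." anywhere after the first character suggests a file extension;
    # a leading "." alone (e.g. ".cache") does not count.
    if "." in basename[1:]:
        return "file"
    return "dir"


file_guess = guess_file_or_dir_sketch("/a/b/c/d.jsonl")
dir_guess = guess_file_or_dir_sketch("/a/b/c")
hidden_guess = guess_file_or_dir_sketch("/a/b/.cache")
```

Here `file_guess` is "file", while `dir_guess` and `hidden_guess` are both "dir". The heuristic never touches the filesystem, so a directory named `v1.0` would be guessed as a file.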
- data_juicer_sandbox.utils.add_iter_subdir_to_paths(paths: List[str], iter_num: int) -> List[str]
Add iteration number as a subdir to the specified paths.
Examples
files: "/a/b/c/d.jsonl" --> "/a/b/c/{iter_num}/d.jsonl"
dirs: "/a/b/c" --> "/a/b/c/{iter_num}"
- Parameters:
paths -- the original input paths
iter_num -- the iteration number to be added to the paths
- Returns:
the resulting paths, one for each original path, with the iteration number added as a subdirectory as shown in the examples.
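The two example transformations can be reproduced with a short sketch. It assumes paths are classified with the same "." heuristic described for guess_file_or_dir. The names `_looks_like_file` and `add_iter_subdir_sketch` are hypothetical, and the code is not the library's implementation.

```python
import os
from typing import List


def _looks_like_file(path: str) -> bool:
    # Same heuristic as above: a "." in the basename that is not the first char.
    return "." in os.path.basename(path)[1:]


def add_iter_subdir_sketch(paths: List[str], iter_num: int) -> List[str]:
    """Insert iter_num as a subdirectory into each path."""
    result = []
    for path in paths:
        if _looks_like_file(path):
            # File: place the iteration directory between parent dir and file name.
            parent, name = os.path.split(path)
            result.append(os.path.join(parent, str(iter_num), name))
        else:
            # Directory: append the iteration directory at the end.
            result.append(os.path.join(path, str(iter_num)))
    return result


new_paths = add_iter_subdir_sketch(["/a/b/c/d.jsonl", "/a/b/c"], 3)
```

On a POSIX system, `new_paths` is `["/a/b/c/3/d.jsonl", "/a/b/c/3"]`, matching the two patterns in the examples.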