data_juicer_sandbox.utils module
- data_juicer_sandbox.utils.validate_hook_output(pipelines, output_key)
Check whether a specified hook output exists.
This function parses the output_key and searches for the corresponding hook within the given pipeline list, then validates whether the hook contains the specified output key.
- Parameters:
pipelines – A list of pipeline objects, each containing a name attribute and job lists including probe_jobs, refine_recipe_jobs, execution_jobs, and evaluation_jobs
output_key – The output key string, in the format “pipeline_name.hook_meta_name.output_name”
- Returns:
True if the corresponding pipeline, hook and output key are found and valid, otherwise False
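The lookup described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `validate_hook_output_sketch`, the hook attributes `meta_name` and `output`, and the `SimpleNamespace` stand-ins for pipeline and hook objects are all assumptions made for illustration. Only the three-part key format and the four job list names come from the description above.

```python
from types import SimpleNamespace

# Job list names taken from the parameter description above.
JOB_LISTS = ("probe_jobs", "refine_recipe_jobs", "execution_jobs", "evaluation_jobs")


def validate_hook_output_sketch(pipelines, output_key):
    """Return True if output_key names an existing pipeline, hook, and output."""
    parts = output_key.split(".")
    if len(parts) != 3:
        # The key must be "pipeline_name.hook_meta_name.output_name".
        return False
    pipeline_name, hook_meta_name, output_name = parts

    for pipeline in pipelines:
        if pipeline.name != pipeline_name:
            continue
        for list_name in JOB_LISTS:
            for hook in getattr(pipeline, list_name, None) or []:
                # Assumption: each hook exposes `meta_name` and a collection
                # of output names in `output`. The real objects may differ.
                if getattr(hook, "meta_name", None) != hook_meta_name:
                    continue
                if output_name in (getattr(hook, "output", None) or []):
                    return True
    return False


# Hypothetical stand-in objects for demonstration only.
hook = SimpleNamespace(meta_name="analyze", output=["stats_path"])
pipeline = SimpleNamespace(
    name="pipe1",
    probe_jobs=[hook],
    refine_recipe_jobs=[],
    execution_jobs=[],
    evaluation_jobs=[],
)
found = validate_hook_output_sketch([pipeline], "pipe1.analyze.stats_path")
missing = validate_hook_output_sketch([pipeline], "pipe1.analyze.other")
```

In this sketch a key with fewer or more than three dot-separated parts is rejected outright, so pipeline or hook names that themselves contain a dot would not be matched.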
- data_juicer_sandbox.utils.guess_file_or_dir(path: str) -> str
Guess whether a path is a file or a directory.
If there is a “.” in the basename of the path and the “.” is not the first char, guess it’s a file. Otherwise, guess it’s a directory.
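The rule above is simple enough to sketch in a few lines. The version below is one reading of it, not the library's code: the name `guess_file_or_dir_sketch` and the return values `"file"` and `"dir"` are assumptions, since the documentation only states that a string is returned.

```python
import os


def guess_file_or_dir_sketch(path: str) -> str:
    """Apply the basename heuristic: a dot that is not the first char means file."""
    basename = os.path.basename(path)
    # Assumption: a leading dot (e.g. ".cache") marks a hidden directory,
    # so such names are treated as directories.
    if "." in basename and not basename.startswith("."):
        return "file"  # assumed label; the real return value is not documented here
    return "dir"  # assumed label


kind_file = guess_file_or_dir_sketch("/a/b/c/d.jsonl")
kind_dir = guess_file_or_dir_sketch("/a/b/c")
kind_hidden = guess_file_or_dir_sketch("/a/b/.cache")
```

Because the check is purely lexical, a directory named `v1.0` would be guessed as a file and an extensionless file such as `Makefile` as a directory.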
- data_juicer_sandbox.utils.add_iter_subdir_to_paths(paths: List[str], iter_num: int) -> List[str]
Add iteration number as a subdir to the specified paths.
Example
files: “/a/b/c/d.jsonl” –> “/a/b/c/{iter_num}/d.jsonl”
dirs: “/a/b/c” –> “/a/b/c/{iter_num}”
- Parameters:
paths – the input original paths
iter_num – the iteration number to be added to the paths
- Returns:
the resulting paths, one per input path, with the iteration number inserted as shown in the examples.
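The two example mappings can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the library's implementation: the names `add_iter_subdir_sketch` and `_looks_like_file` are hypothetical, the file/directory decision reuses the basename-dot heuristic documented for guess_file_or_dir, and `posixpath` is used so that results keep forward slashes as in the examples.

```python
import posixpath
from typing import List


def _looks_like_file(path: str) -> bool:
    # Hypothetical helper: a dot in the basename, not as the first char,
    # is taken to mean the path is a file.
    basename = posixpath.basename(path)
    return "." in basename and not basename.startswith(".")


def add_iter_subdir_sketch(paths: List[str], iter_num: int) -> List[str]:
    """Insert iter_num as a subdirectory, before the file name or after the dir."""
    result = []
    for path in paths:
        if _looks_like_file(path):
            # "/a/b/c/d.jsonl" -> "/a/b/c/{iter_num}/d.jsonl"
            parent, basename = posixpath.split(path)
            result.append(posixpath.join(parent, str(iter_num), basename))
        else:
            # "/a/b/c" -> "/a/b/c/{iter_num}"
            result.append(posixpath.join(path, str(iter_num)))
    return result


new_paths = add_iter_subdir_sketch(["/a/b/c/d.jsonl", "/a/b/c"], 3)
```

The output list keeps the order and length of the input, which matches the documented return value.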