data_juicer_sandbox.utils module

data_juicer_sandbox.utils.validate_hook_output(pipelines, output_key)

Check whether a specified hook output key is valid.

This function parses the output_key and searches for the corresponding hook within the given pipeline list, then validates whether the hook contains the specified output key.

Parameters:

pipelines – A list of pipeline objects, each containing a name attribute and job lists including probe_jobs, refine_recipe_jobs, execution_jobs, and evaluation_jobs
output_key – The output key string, in the format “pipeline_name.hook_meta_name.output_name”

Returns:

True if the corresponding pipeline, hook and output key are found and valid, otherwise False
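
The lookup described above can be sketched roughly as follows. This is not the library's implementation: the function name validate_hook_output_sketch is hypothetical, and the hook attributes meta_name and output are assumptions made for illustration, since this page does not describe how hooks store their meta name or output keys. Only the pipeline attribute names (name and the four job lists) come from the parameter description.

```python
from types import SimpleNamespace

# Job list names taken from the parameter description above.
JOB_LIST_NAMES = (
    "probe_jobs",
    "refine_recipe_jobs",
    "execution_jobs",
    "evaluation_jobs",
)


def validate_hook_output_sketch(pipelines, output_key):
    """Illustrative re-implementation of the documented behavior."""
    # The key must have exactly three dot-separated parts.
    parts = output_key.split(".")
    if len(parts) != 3:
        return False
    pipeline_name, hook_meta_name, output_name = parts

    for pipeline in pipelines:
        if pipeline.name != pipeline_name:
            continue
        for list_name in JOB_LIST_NAMES:
            for hook in getattr(pipeline, list_name, []):
                # ASSUMPTION: hooks expose `meta_name` and `output`.
                if hook.meta_name == hook_meta_name and output_name in hook.output:
                    return True
    return False


# Stand-in objects; real pipeline and hook objects may look different.
demo_hook = SimpleNamespace(meta_name="analysis", output=["stats_path"])
demo_pipeline = SimpleNamespace(
    name="pipeline_1",
    probe_jobs=[demo_hook],
    refine_recipe_jobs=[],
    execution_jobs=[],
    evaluation_jobs=[],
)
print(validate_hook_output_sketch([demo_pipeline], "pipeline_1.analysis.stats_path"))
```

In this sketch, a key with the wrong number of parts, an unknown pipeline name, an unknown hook, or an unknown output name all yield False.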

data_juicer_sandbox.utils.guess_file_or_dir(path: str) → str

Guess whether a path is a file or a directory.

If there is a “.” in the basename of the path and the “.” is not the first char, guess it’s a file. Otherwise, guess it’s a directory.
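
The heuristic above might look like the following minimal sketch. The function name guess_file_or_dir_sketch and the return values "file" and "dir" are assumptions, since this page only states that a str is returned. The sketch also reads "the “.” is not the first char" as "the basename does not start with a dot", which is one possible interpretation.

```python
import posixpath


def guess_file_or_dir_sketch(path: str) -> str:
    """Illustrative version of the documented heuristic."""
    basename = posixpath.basename(path)
    # A dot somewhere in the basename, but not as its first character,
    # is taken as a hint that the basename has a file extension.
    if "." in basename and not basename.startswith("."):
        return "file"  # ASSUMPTION: the actual return value may differ
    return "dir"  # ASSUMPTION: the actual return value may differ


print(guess_file_or_dir_sketch("/a/b/c/d.jsonl"))
print(guess_file_or_dir_sketch("/a/b/c"))
print(guess_file_or_dir_sketch("/a/b/.cache"))
```

Because the guess relies only on the path string, it never touches the filesystem, so a directory named like "v1.2" would be guessed to be a file under this rule.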

data_juicer_sandbox.utils.add_iter_subdir_to_paths(paths: List[str], iter_num: int) → List[str]

Add iteration number as a subdir to the specified paths.

Example

files: “/a/b/c/d.jsonl” –> “/a/b/c/{iter_num}/d.jsonl”
dirs: “/a/b/c” –> “/a/b/c/{iter_num}”

Parameters:

paths – the input original paths
iter_num – the iteration number to be added to the paths

Returns:

A list with the same number of paths as the input, with the iteration number added to each path as shown in the examples.
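
The two examples above can be reproduced with a short sketch like the one below. It is an illustration under assumptions, not the library's code: the name add_iter_subdir_sketch is hypothetical, and the file-or-directory decision reuses the dot-in-basename heuristic documented for guess_file_or_dir, inlined here so the block is self-contained.

```python
import posixpath
from typing import List


def add_iter_subdir_sketch(paths: List[str], iter_num: int) -> List[str]:
    """Illustrative version: insert iter_num as a subdirectory."""
    result = []
    for path in paths:
        basename = posixpath.basename(path)
        looks_like_file = "." in basename and not basename.startswith(".")
        if looks_like_file:
            # File: put the iteration directory between parent and file name.
            parent = posixpath.dirname(path)
            result.append(posixpath.join(parent, str(iter_num), basename))
        else:
            # Directory: append the iteration directory at the end.
            result.append(posixpath.join(path, str(iter_num)))
    return result


print(add_iter_subdir_sketch(["/a/b/c/d.jsonl", "/a/b/c"], 3))
# ['/a/b/c/3/d.jsonl', '/a/b/c/3']
```

The sketch uses posixpath so that the output uses forward slashes on every platform, matching the examples above.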
