data_juicer_sandbox.pipelines module#

class data_juicer_sandbox.pipelines.SandBoxWatcher(sandbox_cfg)[source]#

Bases: object

Basic watcher class that tracks results of interest and manages the experiment within the sandbox, based on the WandB UI and its utilities.

__init__(sandbox_cfg)[source]#

Initialize the watcher from the sandbox configuration (sandbox_cfg).

query(meta_name: str)[source]#

Query a previously logged result from logged_res by its meta_name.

watch(res, meta_name: str = '')[source]#

Flatten the result into dot-separated keys and log it to WandB.
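The flattening step can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `flatten_result` and the choice to use `meta_name` as a key prefix are assumptions made for illustration, and the WandB logging call is left out so the example runs offline.

```python
from typing import Any, Dict


def flatten_result(res: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level with dot-separated keys.

    Hypothetical helper: the real watcher may treat prefixes and
    non-dict values differently.
    """
    if not isinstance(res, dict):
        # A bare value is stored under the prefix itself (assumed behavior).
        return {prefix: res}
    flat: Dict[str, Any] = {}
    for key, value in res.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, extending the dotted prefix.
            flat.update(flatten_result(value, full_key))
        else:
            flat[full_key] = value
    return flat


nested = {"eval": {"accuracy": 0.91, "loss": {"train": 0.3, "val": 0.4}}}
flat = flatten_result(nested, prefix="trial_1")
for name, value in flat.items():
    print(name, value)
```

A flat dict of this shape is what a metrics logger typically expects, since each dotted key can be shown as a separate chart or column.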

setup_sweep(hpo_config: dict = None, project_name: str = None)[source]#

Set up and start a new WandB sweep.
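As a rough sketch of what a sweep setup might look like, the example below builds a configuration dict in the general format that WandB sweeps accept (`method`, `metric`, `parameters`). The helper names, the metric name `eval.accuracy`, and the parameter ranges are all hypothetical; whether `hpo_config` in the sandbox uses exactly this layout is an assumption. The call to `wandb.sweep` is wrapped in a function so that the block runs without a WandB account.

```python
from typing import Any, Dict


def build_sweep_config(
    metric_name: str,
    parameters: Dict[str, Dict[str, Any]],
    method: str = "bayes",
    goal: str = "maximize",
) -> Dict[str, Any]:
    """Build a dict in the WandB sweep configuration format (hypothetical helper)."""
    return {
        "method": method,  # "grid", "random", or "bayes"
        "metric": {"name": metric_name, "goal": goal},
        "parameters": parameters,
    }


def start_sweep(config: Dict[str, Any], project_name: str) -> str:
    """Register the sweep with WandB and return its sweep ID.

    Requires the wandb package and a logged-in session, so it is
    not called in this sketch.
    """
    import wandb  # imported lazily to keep the sketch runnable offline

    return wandb.sweep(sweep=config, project=project_name)


sweep_config = build_sweep_config(
    metric_name="eval.accuracy",  # hypothetical metric name
    parameters={
        "lr": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32]},
    },
)
print(sweep_config)
```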

watch_cfgs(cfgs: List[tuple] = None)[source]#

Watch the configuration of the experiment.

class data_juicer_sandbox.pipelines.Target(iter_target_str: str = None, key: str = None, op: str = None, tgt_val: float = None)[source]#

Bases: object

SUPPORT_OPS = ['==', '>=', '<=', '>', '<']#

__init__(iter_target_str: str = None, key: str = None, op: str = None, tgt_val: float = None)[source]#

key: str#

op: str#

tgt_val: float#

parse_iter_targets(iter_target_str)[source]#

check_target(context_infos: ContextInfos)[source]#

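The page does not document the format of `iter_target_str` or the structure of `ContextInfos`, so the following is only a sketch under stated assumptions: the target string is taken to look like `"<key><op><value>"` (for example `"accuracy>=0.8"`), and the context is replaced by a plain dict of metric values. The function names `parse_target` and `check_target` are illustrative stand-ins, not the class's actual methods.

```python
import operator
from typing import Callable, Dict, Tuple

# Same operator list as Target.SUPPORT_OPS. Two-character operators come
# first, so ">=" is matched before ">" when scanning the string.
SUPPORT_OPS = ["==", ">=", "<=", ">", "<"]
OP_FUNCS: Dict[str, Callable[[float, float], bool]] = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_target(target_str: str) -> Tuple[str, str, float]:
    """Split an assumed '<key><op><value>' string into its three parts."""
    for op in SUPPORT_OPS:
        if op in target_str:
            key, _, value = target_str.partition(op)
            return key.strip(), op, float(value)
    raise ValueError(f"No supported operator found in {target_str!r}")


def check_target(target_str: str, results: Dict[str, float]) -> bool:
    """Return True if the metric stored under the key satisfies the target."""
    key, op, tgt_val = parse_target(target_str)
    return OP_FUNCS[op](results[key], tgt_val)


# Hypothetical metric values standing in for ContextInfos.
results = {"accuracy": 0.85, "loss": 0.4}
print(parse_target("accuracy >= 0.8"))
print(check_target("accuracy >= 0.8", results))
print(check_target("loss<0.3", results))
```

Keeping the operators in a lookup table avoids `eval` on user-provided strings and makes unsupported operators fail early with a clear error.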
class data_juicer_sandbox.pipelines.SandboxPipeline(pipeline_name='anonymous', pipeline_cfg=None, watcher=None)[source]#

Bases: object

__init__(pipeline_name='anonymous', pipeline_cfg=None, watcher=None)[source]#

Initialization method.

register_jobs()[source]#

run(context_infos: ContextInfos)[source]#

Run the sandbox pipeline either once or in HPO style.

one_trial(context_infos: ContextInfos)[source]#

Run the sandbox pipeline once. Users can flexibly run only some steps of the whole sandbox pipeline according to their own needs and configuration. The watcher automatically tracks the results in terms of data, model, and specified evaluation metrics.
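To make the single-trial flow concrete, here is a toy sketch of running a configurable list of steps in order while a watcher records each result. Everything in it is hypothetical: `ToyWatcher`, `run_one_trial`, the job names, and the plain dict used in place of `ContextInfos` are illustrative stand-ins, and nothing is sent to WandB.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Job = Callable[[Dict[str, Any]], Any]


class ToyWatcher:
    """In-memory stand-in for a watcher with watch() and query()."""

    def __init__(self) -> None:
        self.logged_res: Dict[str, Any] = {}

    def watch(self, res: Any, meta_name: str = "") -> None:
        # Store the result under its meta name instead of logging remotely.
        self.logged_res[meta_name] = res

    def query(self, meta_name: str) -> Any:
        return self.logged_res.get(meta_name)


def run_one_trial(
    jobs: List[Tuple[str, Job]],
    watcher: ToyWatcher,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Run the configured jobs in order, passing earlier results to later jobs."""
    context = dict(context or {})
    for name, job in jobs:
        result = job(context)
        context[name] = result  # later jobs can read earlier results
        watcher.watch(result, meta_name=name)
    return context


# Only the steps a user configures are run; here, two hypothetical ones.
jobs: List[Tuple[str, Job]] = [
    ("probe_data", lambda ctx: {"num_samples": 100}),
    ("evaluate", lambda ctx: {"score": ctx["probe_data"]["num_samples"] / 200}),
]
watcher = ToyWatcher()
final_context = run_one_trial(jobs, watcher)
print(watcher.query("evaluate"))
```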

execute_hpo_wandb(context_infos)[source]#

Run the sandbox pipeline in HPO style. Users can flexibly run only some steps of the whole sandbox pipeline according to their own needs and configuration. The watcher automatically tracks the results in terms of data, model, and specified evaluation metrics.

class data_juicer_sandbox.pipelines.SandBoxExecutor(cfg=None)[source]#

Bases: object

The SandBoxExecutor class provides a sandbox environment for exploring data-model co-designs in a one-stop manner, with fast feedback, tiny model size, small data size, and high efficiency.

It acts as middleware that maintains Data-Juicer's data executor, a model processor (training and inference), and an auto-evaluator; the latter two usually come from third-party libraries.

__init__(cfg=None)[source]#

Initialization method.

Parameters:

cfg – configuration of sandbox.

parse_pipelines(cfg)[source]#

Parse the pipeline configs.

Parameters:

cfg – the original config

Returns:

a list of SandboxPipeline objects.

iterative_update_pipelines(current_pipelines: List[SandboxPipeline], last_context_infos: ContextInfos)[source]#

specify_job_configs(ori_config)[source]#

specify_jobs_configs(cfg)[source]#

Specify job configs by their dict objects or config file path strings.

Parameters:

cfg – the original config

Returns:

a dict of different configs.
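The idea of accepting a job config either as a dict or as a path to a config file can be sketched as follows. This is an illustration rather than the method's actual logic: `resolve_job_config` and `resolve_jobs_configs` are hypothetical names, and JSON is used only to keep the example within the standard library (the real project may read other formats such as YAML).

```python
import json
import os
import tempfile
from typing import Any, Dict, Union

ConfigLike = Union[Dict[str, Any], str]


def resolve_job_config(cfg: ConfigLike) -> Dict[str, Any]:
    """Return the config dict, loading it from a file if a path is given."""
    if isinstance(cfg, dict):
        return cfg
    if isinstance(cfg, str):
        if not os.path.isfile(cfg):
            raise FileNotFoundError(f"Config file not found: {cfg}")
        with open(cfg, encoding="utf-8") as f:
            return json.load(f)
    raise TypeError(f"Unsupported config type: {type(cfg).__name__}")


def resolve_jobs_configs(jobs: Dict[str, ConfigLike]) -> Dict[str, Dict[str, Any]]:
    """Resolve every job's config so that all values end up as dicts."""
    return {name: resolve_job_config(cfg) for name, cfg in jobs.items()}


# Demo: one job is configured inline, the other through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"epochs": 1, "lr": 0.001}, f)

    resolved = resolve_jobs_configs(
        {"probe": {"num_samples": 100}, "train": path}
    )

print(resolved)
```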

run()[source]#