data_juicer_sandbox.pipelines module#

class data_juicer_sandbox.pipelines.SandBoxWatcher(sandbox_cfg)[source]#

Bases: object

Basic watcher class that manages results of interest and manages the experiment within the sandbox, based on the WandB UI and its utilities.

__init__(sandbox_cfg)[source]#

Initialize the watcher with a reference to an executor instance.

query(meta_name: str)[source]#

Query the result from the logged_res.

watch(res, meta_name: str = '')[source]#

Flatten the result in dot structure and log it into WandB.
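
As a rough illustration of what "flatten the result in dot structure" can mean, here is a minimal sketch, not the watcher's actual implementation. The helper name `flatten_dot` and the use of a `prefix` argument (standing in for `meta_name`) are assumptions made for this example; the resulting single-level dict is the kind of mapping one would then pass to a logging call such as `wandb.log`.

```python
from typing import Any, Dict


def flatten_dot(res: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots.

    Hypothetical helper: the name and the `prefix` parameter are
    assumptions for illustration, not part of data_juicer_sandbox.
    """
    if not isinstance(res, dict):
        # A non-dict result is stored under the prefix itself.
        return {prefix: res}
    flat: Dict[str, Any] = {}
    for key, value in res.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted path along.
            flat.update(flatten_dot(value, full_key))
        else:
            flat[full_key] = value
    return flat


result = {"eval": {"loss": 0.42, "acc": {"top1": 0.9}}, "n_samples": 1000}
flat = flatten_dot(result, prefix="trial_0")
# `flat` now has the keys "trial_0.eval.loss", "trial_0.eval.acc.top1",
# and "trial_0.n_samples".
```

Dotted keys keep each metric addressable by a single string, which is convenient when results are later looked up by name.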

setup_sweep(hpo_config: dict = None, project_name: str = None)[source]#

Set up and start a new WandB sweep.

watch_cfgs(cfgs: List[tuple] = None)[source]#

Watch the configuration of the experiment.

class data_juicer_sandbox.pipelines.Target(iter_target_str: str = None, key: str = None, op: str = None, tgt_val: float = None)[source]#

Bases: object

SUPPORT_OPS = ['==', '>=', '<=', '>', '<']#

__init__(iter_target_str: str = None, key: str = None, op: str = None, tgt_val: float = None)[source]#

key: str#

op: str#

tgt_val: float#

parse_iter_targets(iter_target_str)[source]#

check_target(context_infos: ContextInfos)[source]#
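
The members above suggest a target made of a key, a comparison operator from `SUPPORT_OPS`, and a target value. The following is a minimal sketch of that idea under stated assumptions: the `"key op value"` string format, the helper names `parse_target` and `check`, and the use of a plain dict of metrics in place of `ContextInfos` are all hypothetical and are not taken from the library.

```python
import operator
import re
from typing import Callable, Dict, Tuple

# Same operator symbols as Target.SUPPORT_OPS, mapped to comparisons.
OPS: Dict[str, Callable[[float, float], bool]] = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators are listed first so ">=" is not read as ">".
# The "key op value" layout is an assumption made for this sketch.
_PATTERN = re.compile(r"^\s*(.+?)\s*(==|>=|<=|>|<)\s*(\S+)\s*$")


def parse_target(target_str: str) -> Tuple[str, str, float]:
    """Split a string such as 'eval.acc >= 0.8' into (key, op, value)."""
    match = _PATTERN.match(target_str)
    if match is None:
        raise ValueError(f"Cannot parse target: {target_str!r}")
    key, op, value = match.groups()
    return key, op, float(value)


def check(metrics: Dict[str, float], target_str: str) -> bool:
    """Return True if the metric named in the target satisfies it."""
    key, op, tgt_val = parse_target(target_str)
    return OPS[op](metrics[key], tgt_val)


reached = check({"eval.acc": 0.85}, "eval.acc >= 0.8")
not_reached = check({"loss": 0.7}, "loss<0.5")
```

Mapping operator symbols to functions from the standard `operator` module avoids evaluating user-supplied strings as code.
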
class data_juicer_sandbox.pipelines.SandboxPipeline(pipeline_name='anonymous', pipeline_cfg=None, watcher=None)[source]#

Bases: object

__init__(pipeline_name='anonymous', pipeline_cfg=None, watcher=None)[source]#

Initialization method.

register_jobs()[source]#

run(context_infos: ContextInfos)[source]#

Run the sandbox pipeline either as a single trial or in HPO (hyperparameter optimization) style.

one_trial(context_infos: ContextInfos)[source]#

Run the sandbox pipeline as a single trial.

Users can flexibly run selected steps of the whole sandbox pipeline according to their own needs and configuration. The watcher automatically tracks the results in terms of data, model, and specified evaluation metrics.

execute_hpo_wandb(context_infos)[source]#

Run the sandbox pipeline in HPO style.

Users can flexibly run selected steps of the whole sandbox pipeline according to their own needs and configuration. The watcher automatically tracks the results in terms of data, model, and specified evaluation metrics.

class data_juicer_sandbox.pipelines.SandBoxExecutor(cfg=None)[source]#

Bases: object

The SandBoxExecutor class provides a sandbox environment for exploring data-model co-designs in a one-stop manner, with fast feedback, tiny model size, small data size, and high efficiency.

It acts as middleware that maintains Data-Juicer's data executor, a model processor (training and inference), and an auto-evaluator, where the latter two usually come from third-party libraries.

__init__(cfg=None)[source]#

Initialization method.

Parameters:

cfg -- configuration of the sandbox.

parse_pipelines(cfg)[source]#

Parse the pipeline configs.

Parameters:

cfg -- the original config

Returns:

a list of SandboxPipeline objects.

iterative_update_pipelines(current_pipelines: List[SandboxPipeline], last_context_infos: ContextInfos)[source]#

specify_job_configs(ori_config)[source]#

specify_jobs_configs(cfg)[source]#

Specify job configs by their dict objects or config file path strings.

Parameters:

cfg -- the original config

Returns:

a dict of different configs.
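
To illustrate the "dict objects or config file path strings" idea, here is a minimal sketch of a loader that accepts either form. It is not the library's code: the function name `load_job_config` is hypothetical, and the file format the sandbox actually expects is not stated here, so JSON is used only to keep the example within the standard library.

```python
import json
import os
import tempfile
from typing import Any, Dict, Union


def load_job_config(cfg: Union[Dict[str, Any], str]) -> Dict[str, Any]:
    """Return a config dict from either a dict or a path to a JSON file.

    Hypothetical helper for illustration; not part of data_juicer_sandbox.
    """
    if isinstance(cfg, dict):
        # Copy so later edits do not mutate the caller's dict.
        return dict(cfg)
    if isinstance(cfg, str):
        with open(cfg, "r", encoding="utf-8") as f:
            return json.load(f)
    raise TypeError(f"Unsupported config type: {type(cfg).__name__}")


# Usage: the same settings given inline and through a file on disk.
inline_cfg = load_job_config({"type": "train", "epochs": 1})

with tempfile.TemporaryDirectory() as tmp_dir:
    cfg_path = os.path.join(tmp_dir, "job.json")
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump({"type": "train", "epochs": 1}, f)
    file_cfg = load_job_config(cfg_path)
```

Normalizing both input forms to a dict early means the rest of the code only has to handle one representation.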

run()[source]#