data_juicer_sandbox.model_executors module#

class data_juicer_sandbox.model_executors.BaseModelExecutor(model_config: dict, watcher=None)[source]#

Bases: object

Base abstraction for model executors within Data-Juicer's sandbox

__init__(model_config: dict, watcher=None)[source]#
async run(run_type, run_obj=None, **kwargs)[source]#
conduct some model-related execution tasks given the specified run_type and run_obj

run_subprocess(script_path, run_args, working_dir, cmd='bash')[source]#
async watch_run(run_type, run_obj=None, **kwargs)[source]#
watch the running process in an online manner, and return the summarized results

data_connector(input_data, **kwargs)[source]#
convert input_data (usually in Data-Juicer's Dataset format) into the appropriate format for the specific model executor

class data_juicer_sandbox.model_executors.ModelScopeExecutor(model_config: dict, watcher=None)[source]#

Bases: BaseModelExecutor

data_connector(input_data, split='train', key_remapping=None, **kwargs)[source]#
convert input_data (usually in Data-Juicer's Dataset format) into the appropriate format for the specific model executor

class data_juicer_sandbox.model_executors.ModelscopeInferProbeExecutor(model_config: dict)[source]#

Bases: ModelScopeExecutor

__init__(model_config: dict)[source]#
class data_juicer_sandbox.model_executors.ModelscopeTrainExecutor(model_config, watcher=None)[source]#

Bases: ModelScopeExecutor

__init__(model_config, watcher=None)[source]#
cfg_modify_fn(cfg)[source]#
build_executor(model_name, trainer_name, work_dir, train_dataset=None, eval_dataset=None)[source]#
class data_juicer_sandbox.model_executors.LLMInferExecutor(model_config: dict, watcher=None)[source]#

Bases: BaseModelExecutor

An inference executor for LLM inference. The model preparation method should be implemented by subclasses for the specific type of model.

The config file for this type of executor should at least include the following items:

  1. type: model type.

  2. build_messages_func: the helper func to build the messages.

  3. parse_output_func: the helper func to parse the model outputs.

  4. dataset_path: the input datasets or data pools used to construct the input messages for LLM inference. Only jsonl files are supported for now.

  5. export_path: the output dir to store the inference results.

  6. infer_res_key: the key name to store the inference results. It's "response" by default.
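The items above can be collected into a plain model_config dict. A minimal sketch, assuming hypothetical helper names, file paths, and a model type — none of these values are library defaults:

```python
# Hypothetical model_config for an LLMInferExecutor subclass; the helper
# names and paths below are illustrative assumptions, not library defaults.
model_config = {
    "type": "api",                            # model type, e.g. "api", "vllm", "huggingface"
    "build_messages_func": "build_messages",  # helper func that builds the messages
    "parse_output_func": "parse_output",      # helper func that parses the model outputs
    "dataset_path": "./data/pool.jsonl",      # only jsonl files are supported for now
    "export_path": "./outputs/infer_res",     # output dir to store the inference results
    "infer_res_key": "response",              # key name for results ("response" by default)
}
```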

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.HFTransformersInferExecutor(model_config: dict, watcher=None)[source]#

Bases: LLMInferExecutor

An inference executor for model inference with Hugging Face Transformers.

The config file for this executor should at least include the following items:

  1. type: must be "huggingface".

  2. model_path: the path to the HF model.

  3. model_params: extra parameters for the model.

  4. sampling_params: extra sampling parameters for the model.
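A sketch of such a config as a dict; the model id and parameter values are assumptions for illustration, not defaults of the library:

```python
# Hypothetical HFTransformersInferExecutor config; model_path and the
# parameter values are illustrative assumptions.
model_config = {
    "type": "huggingface",                       # must be "huggingface"
    "model_path": "Qwen/Qwen2-7B-Instruct",      # path or hub id of the HF model (assumed)
    "model_params": {"torch_dtype": "auto"},     # extra parameters for the model
    "sampling_params": {"max_new_tokens": 512},  # extra sampling parameters
}
```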

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.VLLMInferExecutor(model_config: dict, watcher=None)[source]#

Bases: LLMInferExecutor

An inference executor for model inference with vLLM.

The config file for this executor should at least include the following items:

  1. type: must be "vllm".

  2. model_path: the path to the vLLM model.

  3. model_params: extra parameters for the model.

  4. sampling_params: extra sampling parameters for the model.

  5. For other parameters, refer to the class LLMInferExecutor.
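A sketch of such a config as a dict; the model id and all parameter values are illustrative assumptions:

```python
# Hypothetical VLLMInferExecutor config; the values shown are assumptions,
# not library defaults. Items inherited from LLMInferExecutor
# (e.g. dataset_path, export_path) would be added alongside these.
model_config = {
    "type": "vllm",                               # must be "vllm"
    "model_path": "Qwen/Qwen2-7B-Instruct",       # path or hub id of the model (assumed)
    "model_params": {"tensor_parallel_size": 1},  # extra parameters for the model
    "sampling_params": {"temperature": 0.7, "max_tokens": 512},  # extra sampling parameters
}
```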

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.APIModelInferExecutor(model_config: dict, watcher=None)[source]#

Bases: LLMInferExecutor

An inference executor for model inference with an OpenAI-compatible API.

The config file for this executor should at least include the following items:

  1. type: must be "api".

  2. model: the API model used for inference.

  3. model_params: extra parameters for the model.

  4. sampling_params: extra sampling parameters for the model.

  5. api_endpoint: URL endpoint for the API.

  6. response_path: path to extract content from the API response. Defaults to "choices.0.message.content".

  7. max_retry_num: the max number of retries when an API request fails.

  8. For other parameters, refer to the class LLMInferExecutor.
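A sketch of such a config as a dict; the model name and endpoint URL are illustrative assumptions, while response_path shows the documented default:

```python
# Hypothetical APIModelInferExecutor config; the model name and endpoint
# are illustrative assumptions, not defaults of the library.
model_config = {
    "type": "api",                                # must be "api"
    "model": "gpt-4o",                            # API model used for inference (assumed)
    "model_params": {},                           # extra parameters for the model
    "sampling_params": {"temperature": 0.0},      # extra sampling parameters
    "api_endpoint": "https://api.example.com/v1/chat/completions",  # assumed URL
    "response_path": "choices.0.message.content", # documented default path
    "max_retry_num": 3,                           # max retries on request failure
}
```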

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.LLaVAExecutor(model_config: dict)[source]#

Bases: BaseModelExecutor

__init__(model_config: dict)[source]#
class data_juicer_sandbox.model_executors.LLaMAFactoryExecutor(model_config: dict)[source]#

Bases: BaseModelExecutor

__init__(model_config: dict)[source]#
class data_juicer_sandbox.model_executors.MegatronExecutor(model_config: dict)[source]#

Bases: BaseModelExecutor

__init__(model_config: dict)[source]#