data_juicer_sandbox.model_executors module#

class data_juicer_sandbox.model_executors.BaseModelExecutor(model_config: dict, watcher=None)[source]#

Bases: object

Base abstraction for model executors within Data-Juicer's sandbox (a toy usage sketch follows the method listing below).

__init__(model_config: dict, watcher=None)[source]#
async run(run_type, run_obj=None, **kwargs)[source]#
Conduct model-related execution tasks for the given run_type and run_obj.

run_subprocess(script_path, run_args, working_dir, cmd='bash')[source]#
async watch_run(run_type, run_obj=None, **kwargs)[source]#
Watch the running process in an online manner and return the summarized results.

data_connector(input_data, **kwargs)[source]#
Convert input_data (usually in Data-Juicer's Dataset format) into the appropriate format for the specific model executor.
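
A toy sketch of how a subclass might plug into this abstraction; the EchoExecutor class, its config, and the "infer" run type are all hypothetical, with only the signatures above taken from this listing:

```python
# Hypothetical sketch: a toy subclass of BaseModelExecutor. Only the
# signatures documented above are assumed; every name and value below
# is illustrative.
import asyncio

from data_juicer_sandbox.model_executors import BaseModelExecutor

class EchoExecutor(BaseModelExecutor):
    """Toy executor that echoes its run object instead of running a model."""

    async def run(self, run_type, run_obj=None, **kwargs):
        # A real executor would launch training or inference here.
        return {"run_type": run_type, "echo": run_obj}

executor = EchoExecutor(model_config={"type": "echo"})  # hypothetical config
print(asyncio.run(executor.run("infer", run_obj="hello")))
```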

class data_juicer_sandbox.model_executors.ModelScopeExecutor(model_config: dict, watcher=None)[source]#

Bases: BaseModelExecutor

data_connector(input_data, split='train', key_remapping=None, **kwargs)[source]#
Convert input_data (usually in Data-Juicer's Dataset format) into the appropriate format for the specific model executor.
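
A hedged illustration of the call; the remapping semantics (renaming Data-Juicer field names to the ones the ModelScope model expects) are inferred from the signature, and the key names are hypothetical:

```python
# dj_dataset is assumed to be a Data-Juicer Dataset; the key names are
# hypothetical and the remapping direction is inferred from the signature.
ms_data = executor.data_connector(
    dj_dataset,
    split="train",                      # which split to extract
    key_remapping={"text": "src_txt"},  # map DJ keys to model input keys
)
```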

class data_juicer_sandbox.model_executors.ModelscopeInferProbeExecutor(model_config: dict)[source]#

Bases: ModelScopeExecutor

__init__(model_config: dict)[source]#
class data_juicer_sandbox.model_executors.ModelscopeTrainExecutor(model_config, watcher=None)[source]#

Bases: ModelScopeExecutor

__init__(model_config, watcher=None)[source]#
cfg_modify_fn(cfg)[source]#
build_executor(model_name, trainer_name, work_dir, train_dataset=None, eval_dataset=None)[source]#
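
A hedged sketch of wiring up a training run; every identifier below is a placeholder rather than a verified ModelScope model or trainer name:

```python
from data_juicer_sandbox.model_executors import ModelscopeTrainExecutor

# All identifiers below are placeholders, not verified ModelScope names.
trainer = ModelscopeTrainExecutor(model_config={"type": "modelscope"})
trainer.build_executor(
    model_name="my-org/my-model",   # hypothetical ModelScope model id
    trainer_name="my-trainer",      # hypothetical ModelScope trainer name
    work_dir="./work_dir",
    train_dataset=None,             # pass datasets prepared via data_connector
    eval_dataset=None,
)
```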
class data_juicer_sandbox.model_executors.LLMInferExecutor(model_config: dict, watcher=None)[source]#

Bases: BaseModelExecutor

An inference executor for LLM inference. The model preparation method should be implemented by the subclass for the specific type of model.

The config file for this type of executor should at least include the following items (a config sketch follows the list):

  1. type: model type.

  2. build_messages_func: the helper func to build the messages.

  3. parse_output_func: the helper func to parse the model outputs.

  4. dataset_path: the input datasets or data pools used to construct the input messages for LLM inference. Only jsonl files are supported for now.

  5. export_path: the output dir to store the inference results.

  6. infer_res_key: the key name to store the inference results. It's "response" by default.
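
A minimal config sketch matching the items above; all values are illustrative, and referencing the helper funcs by name (rather than as callables) is an assumption:

```python
# Hypothetical config; field names follow the list above, values are made up.
llm_infer_config = {
    "type": "vllm",                           # model type, see the subclasses below
    "build_messages_func": "build_messages",  # helper that builds the messages
    "parse_output_func": "parse_output",      # helper that parses the outputs
    "dataset_path": "./data/pool.jsonl",      # jsonl only for now
    "export_path": "./outputs/infer_res",     # where results are stored
    "infer_res_key": "response",              # the documented default
}
```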

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.HFTransformersInferExecutor(model_config: dict, watcher=None)[source]#

Bases: LLMInferExecutor

An inference executor for model inference with Hugging Face Transformers.

The config file for this executor should at least include the following items (a config sketch follows the list):

  1. type: must be "huggingface".

  2. model_path: the path to the HF model.

  3. model_params: extra parameters for the model.

  4. sampling_params: extra sampling parameters for the model.
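
A hypothetical config sketch for this executor; the model path and the parameter names inside model_params and sampling_params are illustrative, not a documented schema:

```python
hf_infer_config = {
    "type": "huggingface",                        # required literal per the docs
    "model_path": "org/hf-model-id",              # hypothetical HF path or hub id
    "model_params": {"torch_dtype": "bfloat16"},  # illustrative
    "sampling_params": {"max_new_tokens": 256},   # illustrative
    # the common LLMInferExecutor items (dataset_path, export_path, ...)
    # are assumed to apply as well, since this class inherits from it
    "dataset_path": "./data/pool.jsonl",
    "export_path": "./outputs/hf_infer_res",
}
```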

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.VLLMInferExecutor(model_config: dict, watcher=None)[source]#

Bases: LLMInferExecutor

An inference executor for model inference with vLLM.

The config file for this executor should at least include the following items (a config sketch follows the list):

  1. type: must be "vllm".

  2. model_path: the path to the vLLM model.

  3. model_params: extra parameters for the model.

  4. sampling_params: extra sampling parameters for the model.

  5. For other parameters, refer to the class LLMInferExecutor.
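
A hypothetical config sketch for this executor; the model path is a placeholder, and the concrete entries in model_params and sampling_params are illustrative:

```python
vllm_infer_config = {
    "type": "vllm",                                # required literal per the docs
    "model_path": "org/model-id",                  # hypothetical path or hub id
    "model_params": {"tensor_parallel_size": 1},   # illustrative
    "sampling_params": {"max_tokens": 256, "temperature": 0.7},  # illustrative
    # common LLMInferExecutor items, per item 5 above
    "dataset_path": "./data/pool.jsonl",
    "export_path": "./outputs/vllm_infer_res",
}
```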

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.APIModelInferExecutor(model_config: dict, watcher=None)[source]#

Bases: LLMInferExecutor

An inference executor for model inference with the OpenAI API.

The config file for this executor should at least include the following items (a config sketch follows the list):

  1. type: must be "api".

  2. model: the API model used for inference.

  3. model_params: extra parameters for the model.

  4. sampling_params: extra sampling parameters for the model.

  5. api_endpoint: URL endpoint for the API.

  6. response_path: the path to extract content from the API response. Defaults to 'choices.0.message.content'.

  7. max_retry_num: the max number of retries when the API request fails.

  8. For other parameters, refer to the class LLMInferExecutor.
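
A hypothetical config sketch for this executor; the endpoint and model name are placeholders:

```python
api_infer_config = {
    "type": "api",                             # required literal per the docs
    "model": "my-api-model",                   # hypothetical API model name
    "model_params": {},                        # illustrative
    "sampling_params": {"temperature": 0.7},   # illustrative
    "api_endpoint": "https://api.example.com/v1/chat/completions",  # placeholder
    "response_path": "choices.0.message.content",  # the documented default
    "max_retry_num": 3,                        # illustrative retry budget
    # common LLMInferExecutor items, per item 8 above
    "dataset_path": "./data/pool.jsonl",
    "export_path": "./outputs/api_infer_res",
}
```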

__init__(model_config: dict, watcher=None)[source]#
prepare_executor()[source]#
executor_infer(messages)[source]#
class data_juicer_sandbox.model_executors.LLaVAExecutor(model_config: dict)[source]#

Bases: BaseModelExecutor

__init__(model_config: dict)[source]#
class data_juicer_sandbox.model_executors.LLaMAFactoryExecutor(model_config: dict)[source]#

Bases: BaseModelExecutor

__init__(model_config: dict)[source]#
class data_juicer_sandbox.model_executors.MegatronExecutor(model_config: dict)[source]#

Bases: BaseModelExecutor

__init__(model_config: dict)[source]#