data_juicer.ops.pipeline

class data_juicer.ops.pipeline.LLMRayVLLMEnginePipeline(*args, **kwargs)[source]

Bases: RayVLLMEnginePipeline

Pipeline to generate responses using the vLLM engine on Ray. This pipeline leverages the vLLM engine for efficient large language model inference. More details about the Ray vLLM engine can be found at: https://docs.ray.io/en/latest/data/working-with-llms.html

__init__(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, system_prompt: str | None = None, accelerator_type: str | None = None, sampling_params: Dict | None = None, engine_kwargs: Dict | None = None, api_url: str | None = None, api_key: str | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • api_or_hf_model -- API or Hugging Face model name.

  • is_hf_model -- Whether api_or_hf_model refers to a Hugging Face model rather than an API-served model.

  • system_prompt -- System prompt for guiding the generation task.

  • accelerator_type -- The type of accelerator to use (e.g., "V100", "A100"). Defaults to None, meaning that only the CPU will be used.

  • sampling_params -- Sampling parameters for text generation (e.g., {'temperature': 0.9, 'top_p': 0.95}).

  • engine_kwargs -- The kwargs to pass to the vLLM engine. See documentation for details: https://docs.vllm.ai/en/latest/api/vllm/engine/arg_utils/#vllm.engine.arg_utils.AsyncEngineArgs.

  • api_url -- Base URL of the OpenAI-compatible API.

  • api_key -- API key for authentication.

  • kwargs -- Extra keyword arguments.
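
A minimal usage sketch, assuming rows of a Ray Dataset carry the query under a "text" column (the column name, accelerator choice, and engine settings below are illustrative assumptions, not documented defaults):

    import ray

    from data_juicer.ops.pipeline import LLMRayVLLMEnginePipeline

    # Instantiate the pipeline; parameter names follow the signature above.
    pipe = LLMRayVLLMEnginePipeline(
        api_or_hf_model="Qwen/Qwen2.5-7B-Instruct",
        is_hf_model=True,
        system_prompt="You are a helpful assistant.",
        accelerator_type="A100",  # None falls back to CPU-only execution
        sampling_params={"temperature": 0.9, "top_p": 0.95},
        engine_kwargs={"max_model_len": 4096},  # forwarded to vLLM's AsyncEngineArgs
    )

    # Hypothetical input: a single-row dataset with a "text" query column.
    dataset = ray.data.from_items([{"text": "What is data processing?"}])
    result = pipe.run(dataset)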

static preprocess_fn(row: Dict, query_key: str, system_prompt: str | None, sampling_params: Dict) → Dict[source]

static postprocess_fn(row: Dict, response_key: str, ori_columns: list) → Dict[source]

static preprocess_fn_api(row: Dict, model: str, query_key: str, system_prompt: str | None, sampling_params: Dict | None = None) → Dict[source]

static postprocess_fn_api(row: Dict, response_key: str, ori_columns: list) → Dict[source]

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]
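
The static hooks above implement the preprocess/postprocess contract of Ray Data's LLM processor (see the Ray docs linked above): the preprocessor turns each row into a chat-style request, and the postprocessor merges the generated text back with the row's original columns. A hedged sketch of that contract; the "messages", "sampling_params", and "generated_text" field names come from Ray's documented vLLM processor, not from this class:

    def preprocess(row):
        # Build a chat-style request from the row's query column.
        return {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": row["text"]},
            ],
            "sampling_params": {"temperature": 0.9, "top_p": 0.95},
        }

    def postprocess(row):
        # Keep the original columns and attach the model output.
        return {**row, "response": row["generated_text"]}
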
class data_juicer.ops.pipeline.VLMRayVLLMEnginePipeline(*args, **kwargs)[source]

Bases: RayVLLMEnginePipeline

Pipeline to generate responses using the vLLM engine on Ray. This pipeline leverages the vLLM engine for efficient large vision-language model inference. More details about the Ray vLLM engine can be found at: https://docs.ray.io/en/latest/data/working-with-llms.html

__init__(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, system_prompt: str | None = None, accelerator_type: str | None = None, sampling_params: Dict | None = None, engine_kwargs: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • api_or_hf_model -- API or Hugging Face model name.

  • is_hf_model -- Whether api_or_hf_model refers to a Hugging Face model rather than an API-served model.

  • system_prompt -- System prompt for guiding the generation task.

  • accelerator_type -- The type of accelerator to use (e.g., "V100", "A100"). Defaults to None, meaning that only the CPU will be used.

  • sampling_params -- Sampling parameters for text generation (e.g., {'temperature': 0.9, 'top_p': 0.95}).

  • engine_kwargs -- The kwargs to pass to the vLLM engine. See documentation for details: https://docs.vllm.ai/en/latest/api/vllm/engine/arg_utils/#vllm.engine.arg_utils.AsyncEngineArgs.

  • kwargs -- Extra keyword arguments.
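
A minimal usage sketch for the vision-language variant, assuming each row carries both a text query and an image reference (the checkpoint and column names below are illustrative assumptions):

    import ray

    from data_juicer.ops.pipeline import VLMRayVLLMEnginePipeline

    pipe = VLMRayVLLMEnginePipeline(
        api_or_hf_model="Qwen/Qwen2.5-VL-7B-Instruct",  # assumed VLM checkpoint
        accelerator_type="A100",
        sampling_params={"temperature": 0.2},
        engine_kwargs={"max_model_len": 8192},
    )

    # Hypothetical input: a query plus an image path under an "images" column.
    dataset = ray.data.from_items(
        [{"text": "Describe this image.", "images": ["path/to/cat.jpg"]}]
    )
    result = pipe.run(dataset)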

static vision_preprocess(row: dict, query_key: str, image_key: str, system_prompt: str | None, sampling_params: Dict) → dict[source]

Preprocessing function for vision-language model inputs.
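
A sketch of the multimodal message such a preprocessor typically builds, using the OpenAI-style content-list format that vLLM chat endpoints accept (the exact keys this method emits are an assumption):

    def vision_message(query: str, image_url: str) -> dict:
        # OpenAI-style multimodal content: one text part, one image part.
        return {
            "role": "user",
            "content": [
                {"type": "text", "text": query},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }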

static postprocess_fn(row: Dict, response_key: str, ori_columns: list) → Dict[source]

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]