llm_extract_mapper#

Extract structured fields from text using an LLM; write results to meta. Part of the llm_* semantic ops family.

This operator uses an LLM to extract user-defined fields from each sample's text (or multiple input keys). You provide an output_schema (key → extraction instruction). Results are written to meta[meta_output_key] or to individual meta keys. Supports structured (JSON) and unstructured (e.g. plain text, jsonl) input. Token/cost usage is recorded in meta[llm_semantic_usage] (prompt_tokens, completion_tokens, total_tokens, optional cost_estimate).


Type: mapper

Tags: gpu, vllm, hf, api

🔧 Parameter Configuration#

| name | type | default | description |
| --- | --- | --- | --- |
| input_keys | list | required | Sample keys used to build the input text (e.g. ["text"] or ["query", "response"]). |
| output_schema | dict | required | Mapping of {output_key: "extraction instruction"}. |
| api_or_hf_model | str | 'gpt-4o' | Model name for the API or HuggingFace. |
| meta_output_key | str, optional | 'llm_extract' | If set, the full result dict is written to meta[meta_output_key]. |
| knowledge_grounding_key | str, optional | None | Sample key providing per-sample grounding text. |
| knowledge_grounding_fixed | str, optional | None | Fixed grounding string shared by all samples. |
| is_hf_model | bool | False | If true, use a HuggingFace/Transformers model. |
| enable_vllm | bool | False | If true, use the vLLM backend. |
| api_endpoint | str, optional | None | URL endpoint of the API. |
| response_path | str, optional | None | Path used to extract content from the API response. |
| system_prompt | str, optional | None | Overrides the default extraction system prompt. |
| try_num | int | 3 | Number of retries on parse/API failure. |
| model_params | dict | {} | Parameters for model initialization. |
| sampling_params | dict | {} | Sampling parameters (e.g. temperature, top_p). |
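The try_num parameter retries when the model's reply cannot be parsed into the requested schema. A minimal sketch of that retry loop, assuming a hypothetical call_llm callable standing in for the configured backend (this is an illustration, not the operator's actual implementation):

```python
import json

def extract_with_retry(call_llm, prompt, schema_keys, try_num=3):
    """Call the LLM up to try_num times until its reply parses as JSON
    containing all requested keys; fall back to nulls otherwise."""
    for _ in range(try_num):
        try:
            result = json.loads(call_llm(prompt))
            if all(k in result for k in schema_keys):
                return {k: result[k] for k in schema_keys}
        except (json.JSONDecodeError, TypeError):
            continue  # malformed reply: try again
    return {k: None for k in schema_keys}

# A flaky backend that fails once, then answers correctly.
replies = iter(['not json', '{"topic": "finance", "sentiment": "positive"}'])
print(extract_with_retry(lambda p: next(replies), "...", ["topic", "sentiment"]))
# {'topic': 'finance', 'sentiment': 'positive'}
```

If all attempts fail, every schema key is filled with null rather than raising, so a single bad sample does not abort the whole mapping run.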

📊 Effect demonstration#

The examples below mirror the unit-test scenarios. Concrete field values depend on the model and API; only the shape and keys of the result are guaranteed.

test_extract_default#

LLMExtractMapper(
    input_keys=["text"],
    output_schema={
        "topic": "One short phrase: main topic.",
        "sentiment": "One word: positive, negative, or neutral.",
    },
    api_or_hf_model="gpt-4o",
    meta_output_key="llm_extract",
    try_num=2,
)

📥 Input data#

Sample 1
  text: The stock market rose today. Investors are optimistic.

Sample 2
  text: Bad weather caused delays. Many people were upset.

📤 Output data#

Sample 1
  text: The stock market rose today. Investors are optimistic.
  meta:
    llm_extract: {"topic": "stock market / finance", "sentiment": "positive"}
    llm_semantic_usage: {"prompt_tokens": …, "completion_tokens": …, "total_tokens": …}

Sample 2
  text: Bad weather caused delays. Many people were upset.
  meta:
    llm_extract: {"topic": "weather / travel disruption", "sentiment": "negative"}
    llm_semantic_usage: {"prompt_tokens": …, "completion_tokens": …, "total_tokens": …}

✨ Explanation#

dataset.map(op.process, batch_size=1) fills meta["llm_extract"] with the keys from output_schema. When the API returns usage information, meta["llm_semantic_usage"] additionally holds the token counts (and an optional cost estimate).
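Reading the results back out is plain dict access. The sketch below hard-codes the two output samples shown above, with made-up token counts standing in for real usage numbers:

```python
# Two processed samples as they would look after mapping; the token
# counts are illustrative placeholders, not real API usage.
samples = [
    {"text": "The stock market rose today. Investors are optimistic.",
     "meta": {"llm_extract": {"topic": "stock market / finance", "sentiment": "positive"},
              "llm_semantic_usage": {"prompt_tokens": 120, "completion_tokens": 12, "total_tokens": 132}}},
    {"text": "Bad weather caused delays. Many people were upset.",
     "meta": {"llm_extract": {"topic": "weather / travel disruption", "sentiment": "negative"},
              "llm_semantic_usage": {"prompt_tokens": 118, "completion_tokens": 11, "total_tokens": 129}}},
]

# Pull one extracted field across the dataset.
sentiments = [s["meta"]["llm_extract"]["sentiment"] for s in samples]
print(sentiments)  # ['positive', 'negative']
```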

test_extract_empty_input#

LLMExtractMapper(
    input_keys=["text"],
    output_schema={"topic": "Main topic.", "sentiment": "Sentiment."},
    api_or_hf_model="gpt-4o",
    meta_output_key="llm_extract",
)

📥 Input data#

Sample 1
  text: (empty)

📤 Output data#

Sample 1
  text: (empty)
  meta:
    llm_extract: {"topic": null, "sentiment": null}

✨ Explanation#

An empty concatenated input skips the LLM call and sets every schema field to null, matching the unit test.
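That short-circuit can be sketched as follows (a simplified illustration: the real operator concatenates the configured input_keys before checking, and call_llm here is a hypothetical stand-in for the backend):

```python
def build_input_text(sample, input_keys):
    """Concatenate the configured keys into one input string."""
    return "\n".join(str(sample.get(k) or "") for k in input_keys).strip()

def extract(sample, input_keys, schema_keys, call_llm):
    text = build_input_text(sample, input_keys)
    if not text:  # nothing to extract from: skip the LLM call entirely
        return {k: None for k in schema_keys}
    return call_llm(text)  # normal path

print(extract({"text": ""}, ["text"], ["topic", "sentiment"], call_llm=None))
# {'topic': None, 'sentiment': None}
```

Skipping the call for empty inputs avoids wasting tokens and guarantees the output still carries every schema key.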

🌐 DashScope / OpenAI-compatible environment variables#

When using Alibaba Cloud DashScope's OpenAI-compatible mode (REST), you can rely on environment variables alone, with no key in the recipe:

export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<your DashScope API Key>
# or: export DASHSCOPE_API_KEY=<same key as above>

The operator's default model name is gpt-4o, which is not available on DashScope. When the compatible Base URL above is detected, the operator automatically falls back to qwen-plus (override with DASHSCOPE_DEFAULT_MODEL or OPENAI_DEFAULT_MODEL), or set api_or_hf_model: qwen-plus explicitly in the config.

OPENAI_API_URL and OPENAI_BASE_URL are equivalent (use either one).
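The model-name fallback described above can be sketched like this (an illustration of the documented behavior, not the operator's actual resolution code):

```python
import os

DASHSCOPE_BASE = "https://dashscope.aliyuncs.com/compatible-mode/v1"

def resolve_model(configured="gpt-4o"):
    """Swap the gpt-4o default for a DashScope-compatible model when the
    OpenAI-compatible base URL points at DashScope; explicit settings win."""
    base = os.environ.get("OPENAI_BASE_URL") or os.environ.get("OPENAI_API_URL", "")
    if configured == "gpt-4o" and base.rstrip("/") == DASHSCOPE_BASE:
        return (os.environ.get("DASHSCOPE_DEFAULT_MODEL")
                or os.environ.get("OPENAI_DEFAULT_MODEL")
                or "qwen-plus")
    return configured

os.environ["OPENAI_BASE_URL"] = DASHSCOPE_BASE
os.environ.pop("DASHSCOPE_DEFAULT_MODEL", None)
os.environ.pop("OPENAI_DEFAULT_MODEL", None)
print(resolve_model())            # qwen-plus
print(resolve_model("qwen-max"))  # qwen-max: explicit settings are kept
```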

📊 Cost / usage#

Each sample gets meta[llm_semantic_usage] with:

  • prompt_tokens, completion_tokens, total_tokens

  • cost_estimate (optional, when pricing is available)
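To total spend across a processed dataset, sum the per-sample usage dicts. A small sketch with made-up token counts:

```python
def total_usage(samples):
    """Sum the per-sample llm_semantic_usage dicts into one total."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for s in samples:
        usage = s.get("meta", {}).get("llm_semantic_usage", {})
        for key in totals:
            totals[key] += usage.get(key, 0)
    return totals

# Illustrative placeholder values, not real API usage.
samples = [
    {"meta": {"llm_semantic_usage": {"prompt_tokens": 120, "completion_tokens": 12, "total_tokens": 132}}},
    {"meta": {"llm_semantic_usage": {"prompt_tokens": 118, "completion_tokens": 11, "total_tokens": 129}}},
]
print(total_usage(samples))
# {'prompt_tokens': 238, 'completion_tokens': 23, 'total_tokens': 261}
```

The same loop extends naturally to cost_estimate when the backend provides pricing, by adding it to the totals dict.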