llm_extract_mapper#
Extract structured fields from text using an LLM; write results to meta. Part of the llm_* semantic ops family.
This operator uses an LLM to extract user-defined fields from each sample’s text (or multiple input keys). You provide an output_schema (key → extraction instruction). Results are written to meta[meta_output_key] or to individual meta keys. Supports structured (JSON) and unstructured (e.g. plain text, jsonl) input. Token/cost usage is recorded in meta[llm_semantic_usage] (prompt_tokens, completion_tokens, total_tokens, optional cost_estimate).
Type: mapper
Tags: gpu, vllm, hf, api
🔧 Parameter Configuration#
| name | type | default | desc |
|---|---|---|---|
| input_keys | list | required | Sample keys used to build the input text (e.g. ["text"]). |
| output_schema | dict | required | Mapping from output field name to its extraction instruction. |
| api_or_hf_model | str | | Model name for API or HuggingFace. |
| meta_output_key | str, optional | | If set, write the full result to meta[meta_output_key]; otherwise write individual meta keys. |
| | str, optional | | Optional sample key for per-sample grounding. |
| | str, optional | | Optional fixed grounding string. |
| is_hf_model | bool | | If true, use the HuggingFace/Transformers backend. |
| enable_vllm | bool | | If true, use the vLLM backend. |
| api_endpoint | str, optional | | URL endpoint for the API. |
| response_path | str, optional | | Path to extract content from the API response. |
| system_prompt | str, optional | | Override the default extraction system prompt. |
| try_num | int | | Number of retries on parse/API failure. |
| model_params | dict | | Parameters for model initialization. |
| sampling_params | dict | | Sampling parameters (e.g. temperature, top_p). |
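The try_num parameter covers both API failures and unparseable responses. A minimal sketch of that retry loop, assuming a JSON reply (call_with_retries and call_llm are illustrative names, not the operator's internals):

```python
import json

def call_with_retries(call_llm, prompt, try_num):
    """Retry the LLM call until the response parses as JSON, up to try_num attempts."""
    for attempt in range(try_num):
        try:
            raw = call_llm(prompt)   # may raise on API failure
            return json.loads(raw)   # may raise on parse failure
        except (ValueError, RuntimeError):
            if attempt == try_num - 1:
                raise
```

The last attempt re-raises, so callers can distinguish a persistent failure from a successful (possibly retried) extraction.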
📊 Effect Demonstration#
The examples below match the unit-test scenarios. Concrete extracted values depend on the model and API, so the documentation only illustrates typical results; only the shape and keys are guaranteed.
test_extract_default#
LLMExtractMapper(
input_keys=["text"],
output_schema={
"topic": "One short phrase: main topic.",
"sentiment": "One word: positive, negative, or neutral.",
},
api_or_hf_model="gpt-4o",
meta_output_key="llm_extract",
try_num=2,
)
📥 Input Data#
The stock market rose today. Investors are optimistic.
Bad weather caused delays. Many people were upset.
📤 Output Data#
The stock market rose today. Investors are optimistic.
Bad weather caused delays. Many people were upset.
✨ Explanation#
Calling dataset.map(op.process, batch_size=1) fills meta["llm_extract"] with one entry per key in output_schema. When the API returns usage information, meta["llm_semantic_usage"] additionally records the token counts (and an optional cost estimate).
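The per-sample prompt can be thought of as the concatenated input_keys text plus one instruction line per output_schema entry. A hypothetical sketch (build_prompt is not the operator's actual template):

```python
def build_prompt(sample, input_keys, output_schema):
    """Concatenate the input keys and list each requested field with its instruction."""
    text = "\n".join(str(sample[k]) for k in input_keys)
    fields = "\n".join(f"- {name}: {instruction}"
                       for name, instruction in output_schema.items())
    return ("Extract the following fields from the text "
            "and answer with a JSON object:\n"
            f"{fields}\n\nText:\n{text}")
```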
test_extract_empty_input#
LLMExtractMapper(
input_keys=["text"],
output_schema={"topic": "Main topic.", "sentiment": "Sentiment."},
api_or_hf_model="gpt-4o",
meta_output_key="llm_extract",
)
📥 Input Data#
📤 Output Data#
✨ Explanation#
An empty concatenated input skips the LLM call and sets every schema field to null, matching the unit test.
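The empty-input guard described above can be sketched as follows (extract_fields is a hypothetical helper, not the operator's code):

```python
def extract_fields(sample, input_keys, output_schema, call_llm):
    """Skip the LLM entirely when the concatenated input is empty."""
    text = "\n".join(str(sample.get(k) or "") for k in input_keys).strip()
    if not text:
        # Documented behavior: every schema field becomes null (None).
        return {field: None for field in output_schema}
    return call_llm(text, output_schema)
```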
🌐 DashScope / OpenAI-compatible environment variables#
When using Alibaba Cloud DashScope's OpenAI-compatible mode (REST), it is enough to set environment variables; no key needs to be written into the recipe:
export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<your DashScope API Key>
# or: export DASHSCOPE_API_KEY=<your DashScope API Key>
The operator's default model name is gpt-4o, which is not available on DashScope. When the compatible Base URL above is detected, the operator automatically falls back to qwen-plus (override via DASHSCOPE_DEFAULT_MODEL or OPENAI_DEFAULT_MODEL), or set api_or_hf_model: qwen-plus explicitly in the config.
OPENAI_API_URL is equivalent to OPENAI_BASE_URL (use either one).
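The model-name fallback described above can be sketched in a few lines (pick_default_model is illustrative; the operator's real detection logic may differ):

```python
import os

def pick_default_model():
    """Fall back to qwen-plus when a DashScope-compatible base URL is configured."""
    base = os.getenv("OPENAI_BASE_URL") or os.getenv("OPENAI_API_URL") or ""
    if "dashscope" in base and "compatible-mode" in base:
        return (os.getenv("DASHSCOPE_DEFAULT_MODEL")
                or os.getenv("OPENAI_DEFAULT_MODEL")
                or "qwen-plus")
    return "gpt-4o"
```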
📊 Cost / Usage#
Each sample gets meta[llm_semantic_usage] with:
- prompt_tokens
- completion_tokens
- total_tokens
- cost_estimate (optional, when pricing is available)
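The shape of that usage record can be reproduced with a small helper (build_usage and its price_per_1k_tokens argument are assumptions for illustration; the operator only adds cost_estimate when pricing is known):

```python
def build_usage(prompt_tokens, completion_tokens, price_per_1k_tokens=None):
    """Assemble a dict shaped like meta['llm_semantic_usage']."""
    usage = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    if price_per_1k_tokens is not None:
        # cost_estimate is only present when pricing information is available
        usage["cost_estimate"] = usage["total_tokens"] / 1000 * price_per_1k_tokens
    return usage
```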