llm_extract_mapper#

Extract structured fields from text using an LLM; write results to meta. Part of the llm_* semantic ops family.

This operator uses an LLM to extract user-defined fields from each sample's text (or multiple input keys). You provide an output_schema (key → extraction instruction). Results are written to meta[meta_output_key] or to individual meta keys. Supports structured (JSON) and unstructured (e.g. plain text, jsonl) input. Token/cost usage is recorded in meta[llm_semantic_usage] (prompt_tokens, completion_tokens, total_tokens, optional cost_estimate).


Type: mapper

Tags: gpu, vllm, hf, api

🔧 Parameter Configuration#

| name | type | default | description |
| --- | --- | --- | --- |
| input_keys | list | required | Sample keys used to build the input text (e.g. ["text"] or ["query", "response"]). |
| output_schema | dict | required | Mapping of {output_key: "extraction instruction"}. |
| api_or_hf_model | str | 'gpt-4o' | Model name for the API or HuggingFace. |
| meta_output_key | str, optional | 'llm_extract' | If set, the full result dict is written to meta[meta_output_key]. |
| knowledge_grounding_key | str, optional | None | Sample key providing per-sample grounding text. |
| knowledge_grounding_fixed | str, optional | None | Fixed grounding string shared by all samples. |
| is_hf_model | bool | False | If true, use a HuggingFace/Transformers model. |
| enable_vllm | bool | False | If true, use the vLLM backend. |
| api_endpoint | str, optional | None | URL endpoint of the API. |
| response_path | str, optional | None | Path used to extract content from the API response. |
| system_prompt | str, optional | None | Overrides the default extraction system prompt. |
| try_num | int | 3 | Number of retries on parse/API failure. |
| model_params | dict | {} | Parameters for model initialization. |
| sampling_params | dict | {} | Sampling parameters (e.g. temperature, top_p). |
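The try_num parameter retries when the model's reply cannot be parsed into the requested schema. A minimal sketch of that retry loop, assuming a hypothetical call_llm callable standing in for the configured backend (this is an illustration, not the operator's actual implementation):

```python
import json

def extract_with_retry(call_llm, prompt, schema_keys, try_num=3):
    """Call the LLM up to try_num times until its reply parses as JSON
    containing all requested keys; fall back to nulls otherwise."""
    for _ in range(try_num):
        try:
            result = json.loads(call_llm(prompt))
            if all(k in result for k in schema_keys):
                return {k: result[k] for k in schema_keys}
        except (json.JSONDecodeError, TypeError):
            continue  # malformed reply: try again
    return {k: None for k in schema_keys}

# A flaky backend that fails once, then answers correctly.
replies = iter(['not json', '{"topic": "finance", "sentiment": "positive"}'])
print(extract_with_retry(lambda p: next(replies), "...", ["topic", "sentiment"]))
# {'topic': 'finance', 'sentiment': 'positive'}
```

If all attempts fail, every schema key is filled with null rather than raising, so a single bad sample does not abort the whole mapping run.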

📊 Effect demonstration#

The examples below mirror the unit-test scenarios. Concrete field values depend on the model and API; only the shape and keys of the result are guaranteed.

test_extract_default#

LLMExtractMapper(
    input_keys=["text"],
    output_schema={
        "topic": "One short phrase: main topic.",
        "sentiment": "One word: positive, negative, or neutral.",
    },
    api_or_hf_model="gpt-4o",
    meta_output_key="llm_extract",
    try_num=2,
)

📥 Input data#

Sample 1
  text: The stock market rose today. Investors are optimistic.

Sample 2
  text: Bad weather caused delays. Many people were upset.

📤 Output data#

Sample 1
  text: The stock market rose today. Investors are optimistic.
  meta:
    llm_extract: {"topic": "stock market / finance", "sentiment": "positive"}
    llm_semantic_usage: {"prompt_tokens": …, "completion_tokens": …, "total_tokens": …}

Sample 2
  text: Bad weather caused delays. Many people were upset.
  meta:
    llm_extract: {"topic": "weather / travel disruption", "sentiment": "negative"}
    llm_semantic_usage: {"prompt_tokens": …, "completion_tokens": …, "total_tokens": …}

✨ Explanation#

dataset.map(op.process, batch_size=1) fills meta["llm_extract"] with the keys from output_schema. When the API returns usage information, meta["llm_semantic_usage"] additionally holds the token counts (and an optional cost estimate).
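Reading the results back out is plain dict access. The sketch below hard-codes the two output samples shown above, with made-up token counts standing in for real usage numbers:

```python
# Two processed samples as they would look after mapping; the token
# counts are illustrative placeholders, not real API usage.
samples = [
    {"text": "The stock market rose today. Investors are optimistic.",
     "meta": {"llm_extract": {"topic": "stock market / finance", "sentiment": "positive"},
              "llm_semantic_usage": {"prompt_tokens": 120, "completion_tokens": 12, "total_tokens": 132}}},
    {"text": "Bad weather caused delays. Many people were upset.",
     "meta": {"llm_extract": {"topic": "weather / travel disruption", "sentiment": "negative"},
              "llm_semantic_usage": {"prompt_tokens": 118, "completion_tokens": 11, "total_tokens": 129}}},
]

# Pull one extracted field across the dataset.
sentiments = [s["meta"]["llm_extract"]["sentiment"] for s in samples]
print(sentiments)  # ['positive', 'negative']
```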

test_extract_empty_input#

LLMExtractMapper(
    input_keys=["text"],
    output_schema={"topic": "Main topic.", "sentiment": "Sentiment."},
    api_or_hf_model="gpt-4o",
    meta_output_key="llm_extract",
)

📥 Input data#

Sample 1
  text: (empty)

📤 Output data#

Sample 1
  text: (empty)
  meta:
    llm_extract: {"topic": null, "sentiment": null}

✨ Explanation#

An empty concatenated input skips the LLM call and sets every schema field to null, matching the unit test.
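That short-circuit can be sketched as follows (a simplified illustration: the real operator concatenates the configured input_keys before checking, and call_llm here is a hypothetical stand-in for the backend):

```python
def build_input_text(sample, input_keys):
    """Concatenate the configured keys into one input string."""
    return "\n".join(str(sample.get(k) or "") for k in input_keys).strip()

def extract(sample, input_keys, schema_keys, call_llm):
    text = build_input_text(sample, input_keys)
    if not text:  # nothing to extract from: skip the LLM call entirely
        return {k: None for k in schema_keys}
    return call_llm(text)  # normal path

print(extract({"text": ""}, ["text"], ["topic", "sentiment"], call_llm=None))
# {'topic': None, 'sentiment': None}
```

Skipping the call for empty inputs avoids wasting tokens and guarantees the output still carries every schema key.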

🌐 DashScope / OpenAI-compatible environment variables#

When using Alibaba Cloud DashScope's OpenAI-compatible mode (REST), you can rely on environment variables alone, with no key in the recipe:

export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<your DashScope API Key>
# or: export DASHSCOPE_API_KEY=<same key as above>

The operator's default model name is gpt-4o, which is not available on DashScope. When the compatible Base URL above is detected, the operator automatically falls back to qwen-plus (override with DASHSCOPE_DEFAULT_MODEL or OPENAI_DEFAULT_MODEL), or set api_or_hf_model: qwen-plus explicitly in the config.

OPENAI_API_URL and OPENAI_BASE_URL are equivalent (use either one).
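The model-name fallback described above can be sketched like this (an illustration of the documented behavior, not the operator's actual resolution code):

```python
import os

DASHSCOPE_BASE = "https://dashscope.aliyuncs.com/compatible-mode/v1"

def resolve_model(configured="gpt-4o"):
    """Swap the gpt-4o default for a DashScope-compatible model when the
    OpenAI-compatible base URL points at DashScope; explicit settings win."""
    base = os.environ.get("OPENAI_BASE_URL") or os.environ.get("OPENAI_API_URL", "")
    if configured == "gpt-4o" and base.rstrip("/") == DASHSCOPE_BASE:
        return (os.environ.get("DASHSCOPE_DEFAULT_MODEL")
                or os.environ.get("OPENAI_DEFAULT_MODEL")
                or "qwen-plus")
    return configured

os.environ["OPENAI_BASE_URL"] = DASHSCOPE_BASE
os.environ.pop("DASHSCOPE_DEFAULT_MODEL", None)
os.environ.pop("OPENAI_DEFAULT_MODEL", None)
print(resolve_model())            # qwen-plus
print(resolve_model("qwen-max"))  # qwen-max: explicit settings are kept
```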

📊 Cost / usage#

Each sample gets meta[llm_semantic_usage] with:

  • prompt_tokens, completion_tokens, total_tokens

  • cost_estimate (optional, when pricing is available)
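To total spend across a processed dataset, sum the per-sample usage dicts. A small sketch with made-up token counts:

```python
def total_usage(samples):
    """Sum the per-sample llm_semantic_usage dicts into one total."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for s in samples:
        usage = s.get("meta", {}).get("llm_semantic_usage", {})
        for key in totals:
            totals[key] += usage.get(key, 0)
    return totals

# Illustrative placeholder values, not real API usage.
samples = [
    {"meta": {"llm_semantic_usage": {"prompt_tokens": 120, "completion_tokens": 12, "total_tokens": 132}}},
    {"meta": {"llm_semantic_usage": {"prompt_tokens": 118, "completion_tokens": 11, "total_tokens": 129}}},
]
print(total_usage(samples))
# {'prompt_tokens': 238, 'completion_tokens': 23, 'total_tokens': 261}
```

The same loop extends naturally to cost_estimate when the backend provides pricing, by adding it to the totals dict.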