latex_figure_context_extractor_mapper#
Extracts figures and their citing context from LaTeX source.
This operator parses figure environments from a paper's LaTeX source, extracts each figure's caption, label, and image path(s), and finds the prose paragraphs that cite each figure. It fans out one paper row into N figure rows (one per figure or subfigure). Samples that contain no figures with images are dropped from the output.
Supported figure environments: figure, figure*, wrapfigure, subfigure (environment), \subfigure (command), \subfloat (command, subfig package). Supported caption commands: \caption, \caption*, \subcaption, \captionof{figure}.
Figures without \includegraphics are skipped. Subfigures inherit citing paragraphs from their parent figure's label. When building citing paragraphs, float/display environments (figures, tables, tabulars, equations, algorithms, etc.) are stripped so only prose text is searched.
Note: This operator expects the full LaTeX source as a single string. It does not resolve \input or \include directives. If your documents span multiple .tex files, concatenate them into a single text field before applying this mapper.
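One way to prepare such a single text field is to inline the directives yourself before running the mapper. The following is a minimal sketch, not part of this operator: the helper name `inline_inputs`, its `max_depth` guard, and the assumption that all paths resolve relative to one base directory are illustrative choices. It does not skip commented-out directives.

```python
import re
from pathlib import Path

# Matches \input{name} or \include{name}. The literal "{" right after the
# command name keeps \includegraphics{...} from matching.
_INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def inline_inputs(text: str, base_dir: Path, depth: int = 0, max_depth: int = 10) -> str:
    """Recursively replace input/include directives with the referenced file contents."""
    if depth >= max_depth:
        # Hypothetical safety guard against circular includes.
        return text

    def _replace(match: re.Match) -> str:
        name = match.group(1).strip()
        path = base_dir / name
        if path.suffix != ".tex":
            # LaTeX allows omitting the .tex extension.
            path = path.with_name(path.name + ".tex")
        if not path.is_file():
            # Leave directives that cannot be resolved untouched.
            return match.group(0)
        child = path.read_text(encoding="utf-8", errors="replace")
        return inline_inputs(child, base_dir, depth + 1, max_depth)

    return _INPUT_RE.sub(_replace, text)
```

The returned string can then be stored in the text field that the mapper reads.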
Type: mapper
Tags: cpu, text
🔧 Parameter Configuration#
| name | type | default | desc |
|---|---|---|---|
|  |  |  | LaTeX reference commands to search for when finding citing paragraphs. |
|  |  |  | Pattern for splitting LaTeX text into paragraphs. |
|  |  |  | Output field name for the figure caption. |
|  |  |  | Output field name for the LaTeX label. |
|  |  |  | Output field name for citing paragraphs. |
|  |  |  | Output field name for the parent figure's caption. For subfigures this carries the parent figure environment's caption; empty for standalone figures. |
|  |  |  | Output field name for the parent figure's label. Useful for grouping subfigures that belong to the same figure environment; empty for standalone figures. |
|  |  |  | extra args |
|  |  |  | extra args |
📤 Output Fields#
In addition to all input fields, each output row contains:
| field | type | desc |
|---|---|---|
|  |  | Image paths from |
|  |  | Figure caption text. |
|  |  | LaTeX label string. |
|  |  | Paragraphs that cite this figure. |
|  |  | Parent figure caption (subfigures only; empty for standalone). |
|  |  | Parent figure label (subfigures only; empty for standalone). |
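Because subfigure rows carry their parent's label, the fanned-out rows can be regrouped per figure environment downstream. The sketch below assumes the output field names shown in the demonstration on this page (`label`, `parent_label`); the helper `group_by_parent` and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_parent(rows):
    """Group figure rows by their parent figure environment."""
    groups = defaultdict(list)
    for row in rows:
        # Standalone figures have an empty parent_label, so fall back to
        # the figure's own label as the group key.
        key = row.get("parent_label") or row.get("label", "")
        groups[key].append(row)
    return dict(groups)


# Hypothetical rows: two subfigures of one figure plus a standalone figure.
rows = [
    {"label": "fig:a", "parent_label": "fig:main"},
    {"label": "fig:b", "parent_label": "fig:main"},
    {"label": "fig:arch", "parent_label": ""},
]
groups = group_by_parent(rows)
```

Note that standalone figures without any label all end up under the empty-string key in this sketch.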
📊 Effect demonstration#
test_single_figure#
LatexFigureContextExtractorMapper()
📥 input data#
A LaTeX document with a single figure:
\begin{document}
Some intro text.
As shown in \ref{fig:arch}, the architecture is novel.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{img/arch.pdf}
\caption{Overall architecture}
\label{fig:arch}
\end{figure}
\end{document}
📤 output data#
One row is produced:
- caption: "Overall architecture"
- label: "fig:arch"
- images: ["img/arch.pdf"]
- citing_paragraphs: ["As shown in \\ref{fig:arch}, the architecture is novel."]
- parent_caption: "" (standalone figure, no parent)
- parent_label: "" (standalone figure, no parent)
✨ explanation#
The operator extracts the figure environment, parses its caption, label, and image path, then searches the document paragraphs (with float/display/tabular environments stripped) for any paragraph containing \ref{fig:arch}. The matching paragraph is returned as the citing context.
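The search step described above can be approximated in a few lines. This is a rough sketch under stated assumptions, not the operator's actual implementation: the environment list, the reference commands, and the blank-line paragraph split are illustrative defaults, and the function names are hypothetical.

```python
import re

# Illustrative subset of float/display environments to remove before searching.
_STRIP_ENVS = (
    "figure", "figure*", "wrapfigure", "table", "table*", "tabular",
    "equation", "equation*", "align", "align*", "algorithm",
)


def strip_environments(text, envs=_STRIP_ENVS):
    """Remove whole \\begin{env}...\\end{env} blocks so only prose remains."""
    for env in envs:
        name = re.escape(env)
        pattern = r"\\begin\{" + name + r"\}.*?\\end\{" + name + r"\}"
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text


def find_citing_paragraphs(text, label, commands=("ref", "autoref", "cref", "Cref")):
    """Return prose paragraphs that reference `label` via one of `commands`."""
    prose = strip_environments(text)
    cmd_alt = "|".join(re.escape(c) for c in commands)
    # Accept the label alone or inside a comma-separated list of labels.
    cite_re = re.compile(
        r"\\(?:" + cmd_alt + r")\{(?:[^}]*,)?\s*"
        + re.escape(label)
        + r"\s*(?:,[^}]*)?\}"
    )
    # Assumed paragraph separator: one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", prose)
    return [p.strip() for p in paragraphs if cite_re.search(p)]
```

Stripping the figure environment first matters: a \ref inside the figure's own caption would otherwise be picked up as a citing paragraph.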