data_juicer.ops.mapper.latex_figure_context_extractor_mapper module
- class data_juicer.ops.mapper.latex_figure_context_extractor_mapper.SubFigure(caption: str = '', label: str = '', image_paths: List[str] = <factory>)
Bases: object
A subfigure within a figure environment.
- caption: str = ''
- label: str = ''
- image_paths: List[str]
- __init__(caption: str = '', label: str = '', image_paths: List[str] = <factory>) -> None
- class data_juicer.ops.mapper.latex_figure_context_extractor_mapper.Figure(caption: str = '', label: str = '', image_paths: List[str] = <factory>, sub_figures: List[SubFigure] = <factory>)
Bases: object
A top-level figure/figure* environment.
- caption: str = ''
- label: str = ''
- image_paths: List[str]
- class data_juicer.ops.mapper.latex_figure_context_extractor_mapper.LatexFigureContextExtractorMapper(*args, **kwargs)
Bases: Mapper
Extracts figures and their citing context from LaTeX source.
This operator parses figure environments from a paper’s LaTeX source, extracts each figure’s caption, label, and image path(s), and finds the prose paragraphs that cite each figure. It fans out one paper row into N figure rows (one per figure or subfigure). Samples that contain no figures with images are dropped from the output.
- Supported figure environments: figure, figure*, wrapfigure, subfigure (environment), subfigure (command), subfloat (command, subfig package).
- Supported caption commands: caption, caption*, subcaption, captionof{figure}.
Figures without includegraphics are skipped. Subfigures inherit citing paragraphs from their parent figure’s label.
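The parsing step described above can be illustrated with a minimal sketch. It is not the operator's implementation: the regular expressions and the `extract_figures` helper are hypothetical, they cover only `figure` and `figure*`, and they assume captions without nested braces.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only; the operator's parser may differ.
FIGURE_RE = re.compile(r"\\begin\{(figure\*?)\}(.*?)\\end\{\1\}", re.DOTALL)
CAPTION_RE = re.compile(r"\\caption\*?\{([^{}]*)\}")
LABEL_RE = re.compile(r"\\label\{([^{}]*)\}")
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^{}]*)\}")


def extract_figures(latex: str) -> List[Dict[str, object]]:
    """Return caption, label, and image paths for each figure with images."""
    figures = []
    for match in FIGURE_RE.finditer(latex):
        body = match.group(2)
        image_paths = GRAPHICS_RE.findall(body)
        if not image_paths:
            # Mirrors the documented rule: figures without images are skipped.
            continue
        caption = CAPTION_RE.search(body)
        label = LABEL_RE.search(body)
        figures.append({
            "caption": caption.group(1) if caption else "",
            "label": label.group(1) if label else "",
            "image_paths": image_paths,
        })
    return figures


sample = r"""
\begin{figure}
\includegraphics[width=\linewidth]{plots/loss.pdf}
\caption{Training loss.}
\label{fig:loss}
\end{figure}
\begin{figure*}
\caption{No image here.}
\end{figure*}
"""
figures = extract_figures(sample)
```

In this sketch the second environment is dropped because it contains no `\includegraphics`, so `figures` holds a single entry for `fig:loss`.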
Output fields (in addition to all input fields):
- <image_key> (default images, inherited from base class): list of image paths from \includegraphics.
- <caption_key> (default caption): figure caption text.
- <label_key> (default label): LaTeX label string.
- <context_key> (default citing_paragraphs): list of paragraphs that cite this figure.
- <parent_caption_key> (default parent_caption): parent figure caption (subfigures only; empty for standalone figures).
- <parent_label_key> (default parent_label): parent figure label (subfigures only; empty for standalone figures).
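To make the fan-out and the output fields concrete, the following sketch builds output rows from one paper row using the default field names. The `fan_out` helper and the dictionary layout of the parsed figures are hypothetical. It also assumes that a figure with subfigures yields one row per subfigure and no separate parent row, which the text above does not confirm.

```python
from typing import Any, Dict, List


def fan_out(paper: Dict[str, Any], figures: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn one paper row into one row per figure or subfigure (illustrative)."""
    rows = []
    for fig in figures:
        sub_figures = fig.get("sub_figures", [])
        if sub_figures:
            # Assumption: subfigures carry the parent's caption and label.
            for sub in sub_figures:
                rows.append({
                    **paper,
                    "images": sub["image_paths"],
                    "caption": sub["caption"],
                    "label": sub["label"],
                    "citing_paragraphs": fig.get("citing_paragraphs", []),
                    "parent_caption": fig["caption"],
                    "parent_label": fig["label"],
                })
        else:
            # Standalone figure: parent fields are empty strings.
            rows.append({
                **paper,
                "images": fig["image_paths"],
                "caption": fig["caption"],
                "label": fig["label"],
                "citing_paragraphs": fig.get("citing_paragraphs", []),
                "parent_caption": "",
                "parent_label": "",
            })
    return rows


paper_row = {"text": "...full LaTeX source...", "paper_id": "example-001"}
parsed = [
    {"caption": "Loss curve.", "label": "fig:loss", "image_paths": ["loss.pdf"]},
    {
        "caption": "Ablations.", "label": "fig:abl", "image_paths": [],
        "sub_figures": [
            {"caption": "Depth.", "label": "fig:abl-a", "image_paths": ["a.pdf"]},
            {"caption": "Width.", "label": "fig:abl-b", "image_paths": ["b.pdf"]},
        ],
    },
]
rows = fan_out(paper_row, parsed)
```

Every row keeps the input fields (`text` and the made-up `paper_id` here), and rows that share a `parent_label` can be grouped back into their figure environment.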
Note: this operator expects the full LaTeX source as a single string. It does not resolve \input or \include directives. If your documents span multiple .tex files, concatenate them into a single text field before applying this mapper.
- __init__(citation_commands: List[str] | None = None, paragraph_separator: str = '\n\n', caption_key: str = 'caption', label_key: str = 'label', context_key: str = 'citing_paragraphs', parent_caption_key: str = 'parent_caption', parent_label_key: str = 'parent_label', *args, **kwargs)
Initialization method.
- Parameters:
citation_commands – LaTeX reference commands to search for when finding citing paragraphs. Defaults to ['ref', 'cref', 'Cref', 'autoref']. Comma-separated label lists (e.g. \cref{fig:a,fig:b}) are handled automatically.
paragraph_separator – Pattern for splitting LaTeX text into paragraphs. Defaults to '\n\n'.
caption_key – Output field name for the figure caption.
label_key – Output field name for the LaTeX label.
context_key – Output field name for citing paragraphs.
parent_caption_key – Output field name for the parent figure’s caption. For subfigures this carries the parent figure environment’s caption; for standalone figures it is an empty string.
parent_label_key – Output field name for the parent figure’s label. Useful for grouping subfigures that belong to the same figure environment. Empty string for standalone figures.
args – extra args
kwargs – extra args. Notably text_key (default 'text') controls which input field contains the LaTeX source, and image_key (default 'images') controls the output field name for extracted image paths. Both are inherited from the base OP class.
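As a rough illustration of how citation_commands and paragraph_separator interact, the sketch below splits text into paragraphs and keeps those that reference a given label, including labels inside comma-separated lists. The `find_citing_paragraphs` helper is hypothetical, and treating the separator as a regular expression is an assumption based on the word "Pattern" above; the operator's actual matching logic may differ.

```python
import re
from typing import List, Sequence


def find_citing_paragraphs(
    text: str,
    label: str,
    citation_commands: Sequence[str] = ("ref", "cref", "Cref", "autoref"),
    paragraph_separator: str = "\n\n",
) -> List[str]:
    """Return paragraphs that cite `label` via one of the given commands."""
    commands = "|".join(re.escape(c) for c in citation_commands)
    # Matches e.g. \ref{fig:a} or \cref{fig:a,fig:b}; group 1 is the label list.
    cite_re = re.compile(r"\\(?:" + commands + r")\{([^{}]*)\}")
    citing = []
    # Assumption: the separator is used as a regex pattern.
    for paragraph in re.split(paragraph_separator, text):
        for match in cite_re.finditer(paragraph):
            labels = [part.strip() for part in match.group(1).split(",")]
            if label in labels:
                citing.append(paragraph.strip())
                break  # one hit is enough for this paragraph
    return citing


doc = "\n\n".join([
    r"As Figure~\ref{fig:a} shows, the loss falls.",
    r"This paragraph cites nothing.",
    r"See \cref{fig:a,fig:b} for details.",
])
cites_a = find_citing_paragraphs(doc, "fig:a")
cites_b = find_citing_paragraphs(doc, "fig:b")
```

Here `cites_a` contains the first and third paragraphs, while `cites_b` contains only the third, because `fig:b` appears only inside the comma-separated `\cref` list.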