data_juicer.ops.mapper.latex_figure_context_extractor_mapper module

class data_juicer.ops.mapper.latex_figure_context_extractor_mapper.SubFigure(caption: str = '', label: str = '', image_paths: List[str] = <factory>)

Bases: object

A subfigure within a figure environment.

caption: str = ''
label: str = ''
image_paths: List[str]
__init__(caption: str = '', label: str = '', image_paths: List[str] = <factory>) -> None

class data_juicer.ops.mapper.latex_figure_context_extractor_mapper.Figure(caption: str = '', label: str = '', image_paths: List[str] = <factory>, sub_figures: List[SubFigure] = <factory>)

Bases: object

A top-level figure/figure* environment.

caption: str = ''
label: str = ''
image_paths: List[str]
sub_figures: List[SubFigure]
__init__(caption: str = '', label: str = '', image_paths: List[str] = <factory>, sub_figures: List[SubFigure] = <factory>) -> None
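
The `<factory>` defaults in the signatures above suggest that both containers are dataclasses whose list fields use a default factory. The following is a minimal sketch under that assumption, not the module's actual source; it only illustrates why each instance gets its own empty list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubFigure:
    """Sketch of a subfigure record (assumed to be a dataclass)."""
    caption: str = ""
    label: str = ""
    # default_factory gives every instance a fresh list instead of a shared one.
    image_paths: List[str] = field(default_factory=list)


@dataclass
class Figure:
    """Sketch of a top-level figure record (assumed to be a dataclass)."""
    caption: str = ""
    label: str = ""
    image_paths: List[str] = field(default_factory=list)
    sub_figures: List[SubFigure] = field(default_factory=list)
```

With this layout, `Figure()` and `SubFigure()` can be built with no arguments and filled in as the figure body is parsed.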

class data_juicer.ops.mapper.latex_figure_context_extractor_mapper.LatexFigureContextExtractorMapper(*args, **kwargs)

Bases: Mapper

Extracts figures and their citing context from LaTeX source.

This operator parses figure environments from a paper’s LaTeX source, extracts each figure’s caption, label, and image path(s), and finds the prose paragraphs that cite each figure. It fans out one paper row into N figure rows (one per figure or subfigure). Samples that contain no figures with images are dropped from the output.

Supported figure environments: figure, figure*, wrapfigure, subfigure (environment), subfigure (command), subfloat (command, subfig package).

Supported caption commands: \caption, \caption*, \subcaption, \captionof{figure}.

Figures without \includegraphics are skipped. Subfigures also inherit the paragraphs that cite their parent figure’s label.
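
To make the extraction rules above concrete, here is a simplified sketch that handles only plain figure and figure* environments. It is not the operator's implementation: the function names `read_braced` and `extract_figures` are hypothetical, and the sketch ignores wrapfigure, subfigures, optional short captions, escaped braces, and commented-out lines.

```python
import re

# \begin{figure} ... \end{figure} or the starred variant; \1 forces a matching \end.
FIGURE_RE = re.compile(r"\\begin\{(figure\*?)\}(.*?)\\end\{\1\}", re.DOTALL)
# \includegraphics with an optional [options] group before the path.
GRAPHICS_RE = re.compile(r"\\includegraphics\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}")
LABEL_RE = re.compile(r"\\label\{([^}]+)\}")
CAPTION_RE = re.compile(r"\\caption\*?\s*\{")


def read_braced(text, start):
    """Return the content of the balanced {...} group that opens at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    return text[start + 1:]  # unbalanced input: return whatever is left


def extract_figures(latex):
    """Collect caption, label, and image paths for each figure that has an image."""
    figures = []
    for match in FIGURE_RE.finditer(latex):
        body = match.group(2)
        paths = GRAPHICS_RE.findall(body)
        if not paths:
            continue  # figures without \includegraphics are skipped
        cap = CAPTION_RE.search(body)
        caption = read_braced(body, cap.end() - 1) if cap else ""
        label = LABEL_RE.search(body)
        figures.append({
            "caption": caption.strip(),
            "label": label.group(1) if label else "",
            "image_paths": paths,
        })
    return figures
```

The caption is read with a brace counter rather than a regular expression because captions often contain nested groups such as `\textbf{...}`.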

Output fields (in addition to all input fields):

  • <image_key> (default images, inherited from base class): list of image paths from \includegraphics.

  • <caption_key> (default caption): figure caption text.

  • <label_key> (default label): LaTeX label string.

  • <context_key> (default citing_paragraphs): list of paragraphs that cite this figure.

  • <parent_caption_key> (default parent_caption): parent figure caption (subfigures only; empty for standalone figures).

  • <parent_label_key> (default parent_label): parent figure label (subfigures only; empty for standalone figures).

Note: this operator expects the full LaTeX source as a single string. It does not resolve \input or \include directives. If your documents span multiple .tex files, concatenate them into a single text field before applying this mapper.
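
One way to prepare multi-file projects is to expand \input and \include directives before the mapper runs. The helper below is a hypothetical preprocessing sketch, not part of the operator: `inline_inputs` and the in-memory `files` mapping (file name to LaTeX source) are assumptions made for illustration, and commented-out directives are not detected.

```python
import re

# Matches \input{name} or \include{name}. \includegraphics is not matched,
# because the opening brace must follow the command name directly.
_INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def inline_inputs(main_name, files, _seen=None):
    """Return files[main_name] with \\input and \\include directives expanded.

    files is a hypothetical mapping {file name: LaTeX source}.
    """
    seen = set() if _seen is None else _seen
    if main_name in seen:
        return ""  # guard against circular includes
    seen = seen | {main_name}

    def expand(match):
        name = match.group(1).strip()
        # LaTeX allows the .tex extension to be omitted.
        for candidate in (name, name + ".tex"):
            if candidate in files:
                return inline_inputs(candidate, files, seen)
        return match.group(0)  # leave unresolved directives untouched

    return _INPUT_RE.sub(expand, files[main_name])
```

The returned string can then be stored in the field that holds the LaTeX source, so the mapper sees the whole paper at once.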

__init__(citation_commands: List[str] | None = None, paragraph_separator: str = '\n\n', caption_key: str = 'caption', label_key: str = 'label', context_key: str = 'citing_paragraphs', parent_caption_key: str = 'parent_caption', parent_label_key: str = 'parent_label', *args, **kwargs)

Initialization method.

Parameters:
  • citation_commands – LaTeX reference commands to search for when finding citing paragraphs. Defaults to ['ref', 'cref', 'Cref', 'autoref']. Comma-separated label lists (e.g. \cref{fig:a,fig:b}) are handled automatically.

  • paragraph_separator – Pattern for splitting LaTeX text into paragraphs. Defaults to '\n\n' (a blank line).

  • caption_key – Output field name for the figure caption.

  • label_key – Output field name for the LaTeX label.

  • context_key – Output field name for citing paragraphs.

  • parent_caption_key – Output field name for the parent figure’s caption. For subfigures this carries the parent figure environment’s caption; for standalone figures it is an empty string.

  • parent_label_key – Output field name for the parent figure’s label. Useful for grouping subfigures that belong to the same figure environment. Empty string for standalone figures.

  • args – extra positional arguments.

  • kwargs – extra keyword arguments. Notably text_key (default 'text') controls which input field contains the LaTeX source, and image_key (default 'images') controls the output field name for extracted image paths. Both are inherited from the base OP class.
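
The interaction between citation_commands and paragraph_separator can be sketched as follows. This is an illustrative approximation, not the operator's code: the function name `find_citing_paragraphs` is hypothetical, and the sketch assumes the separator is applied as a regular expression, as the word "Pattern" above suggests.

```python
import re


def find_citing_paragraphs(
    text,
    label,
    citation_commands=("ref", "cref", "Cref", "autoref"),
    paragraph_separator="\n\n",
):
    """Return the paragraphs of text that reference label with a listed command."""
    commands = "|".join(re.escape(c) for c in citation_commands)
    # The backslash must directly precede the command name, so \eqref
    # does not match the 'ref' entry.
    pattern = re.compile(r"\\(?:" + commands + r")\{([^}]*)\}")
    citing = []
    for paragraph in re.split(paragraph_separator, text):
        for match in pattern.finditer(paragraph):
            # Split comma-separated label lists such as \cref{fig:a,fig:b}.
            labels = [part.strip() for part in match.group(1).split(",")]
            if label in labels:
                citing.append(paragraph.strip())
                break  # record each paragraph once
    return citing
```

Passing a custom `citation_commands` list in this sketch widens or narrows which references count as citations, which mirrors the purpose of the constructor parameter.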

process_batched(samples)