data_juicer.ops.mapper.latex_merge_tex_mapper module

class data_juicer.ops.mapper.latex_merge_tex_mapper.LatexMergeTexMapper(*args, **kwargs)

Bases: Mapper

Extracts and concatenates all .tex files from a compressed LaTeX project archive into a single text field.

Supported archive formats: .tar, .tar.gz / .tgz, and .zip. Plain .gz (single-file gzip) is not supported: a gzip stream wraps a single file and its header is not required to record the original filename, so the operator cannot verify that the content is actually a .tex file.
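The format rules above can be illustrated with a small extension check. This is a sketch only: `archive_kind` is a hypothetical helper, not part of data_juicer, and the operator may detect formats differently (for example, by inspecting file contents).

```python
from pathlib import PurePath
from typing import Optional


def archive_kind(path: str) -> Optional[str]:
    """Return "tar", "zip", or None for an unsupported path (hypothetical helper)."""
    name = PurePath(path).name.lower()
    # .tar, .tar.gz and .tgz are all handled as tar archives.
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return "tar"
    if name.endswith(".zip"):
        return "zip"
    # Plain .gz (single-file gzip) and everything else is unsupported.
    return None
```

Under this check, `paper.tar.gz` is treated as a tar archive while `main.tex.gz` is rejected, matching the distinction the text draws between .tar.gz and plain .gz.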

All .tex files found inside the archive are read into memory and joined with a configurable separator. The files are neither sorted nor deduplicated.
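The read-and-join behavior can be sketched with the standard library alone. The function below is an illustration under stated assumptions, not the operator's actual implementation: `merge_tex` is a hypothetical name, entries are visited in archive order, and the UTF-8 decoding with replacement of invalid bytes is an assumption about encoding handling.

```python
import tarfile
import zipfile
from typing import List


def merge_tex(archive_path: str, separator: str = "\n\n") -> str:
    """Join the contents of all .tex entries in a tar or zip archive (sketch)."""
    parts: List[str] = []
    if tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile pick the compression (none, gzip, ...) itself.
        with tarfile.open(archive_path, "r:*") as tf:
            for member in tf:
                if member.isfile() and member.name.lower().endswith(".tex"):
                    fh = tf.extractfile(member)
                    if fh is not None:
                        # Assumption: decode as UTF-8, replacing invalid bytes.
                        parts.append(fh.read().decode("utf-8", errors="replace"))
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir() and info.filename.lower().endswith(".tex"):
                    parts.append(zf.read(info).decode("utf-8", errors="replace"))
    else:
        raise ValueError(f"unsupported archive: {archive_path}")
    # No sorting or deduplication: entries stay in the order they were read.
    return separator.join(parts)
```

Nothing is extracted to disk here; each entry is read as bytes and decoded in memory, which mirrors the in-memory approach described above.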

This operator is typically placed before LaTeX-processing operators such as remove_comments_mapper, expand_macro_mapper, or latex_figure_context_extractor_mapper.

__init__(compressed_file_key: str = 'compressed_file', separator: str = '\n\n', max_file_size: int = 52428800, max_total_size: int = 104857600, *args, **kwargs)

Initialization method.

Parameters:
  • compressed_file_key – Field name that stores the archive file path.

  • separator – String used to join the contents of multiple .tex files.

  • max_file_size – Maximum allowed uncompressed size in bytes for a single .tex entry inside the archive. Defaults to 52428800 (50 MiB). Entries exceeding this limit are skipped with a warning. Set to None or 0 to disable the check.

  • max_total_size – Maximum allowed cumulative size in bytes for all extracted .tex content combined. Defaults to 104857600 (100 MiB). Once this limit is reached, remaining files in the archive are skipped with a warning. Set to None or 0 to disable the check.

  • args – extra positional arguments

  • kwargs – extra keyword arguments
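How the two size limits interact can be sketched as follows. This is a minimal illustration, not the operator's code: `select_within_limits` is a hypothetical helper that works on (name, size) pairs, and it assumes that the entry which would push the running total past max_total_size is the first one to be dropped.

```python
from typing import Iterable, List, Optional, Tuple


def select_within_limits(
    entries: Iterable[Tuple[str, int]],
    max_file_size: Optional[int] = 52428800,    # 50 MiB
    max_total_size: Optional[int] = 104857600,  # 100 MiB
) -> List[str]:
    """Return the names of entries that pass both size checks (sketch)."""
    kept: List[str] = []
    total = 0
    for name, size in entries:
        # None or 0 disables the per-file check.
        if max_file_size and size > max_file_size:
            continue  # oversized entry: skipped (the operator also warns)
        # None or 0 disables the cumulative check.
        # Assumption: an entry that would exceed the budget ends processing.
        if max_total_size and total + size > max_total_size:
            break  # remaining entries are skipped
        kept.append(name)
        total += size
    return kept
```

With limits of 100 bytes each, the entries (40, 200, 50, 30) keep only the first and third: the 200-byte entry fails the per-file check, and the 30-byte entry would push the total past 100.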

process_single(sample)

Processes a single sample at the sample level (sample -> sample).

Parameters:
  • sample – sample to process

Returns:
  processed sample