data_juicer.ops.mapper.latex_merge_tex_mapper module
- class data_juicer.ops.mapper.latex_merge_tex_mapper.LatexMergeTexMapper(*args, **kwargs)
Bases: Mapper
Extracts and concatenates all .tex files from a compressed LaTeX project archive into a single text field.
Supported archive formats: .tar, .tar.gz/.tgz, and .zip. Plain .gz (single-file gzip) is not supported because a gzip stream does not reliably carry filename metadata (the name field in its header is optional), so there is no way to verify that the content is actually a .tex file.
All .tex files found inside the archive are read in memory and joined with a configurable separator. No ordering or deduplication is applied.
This operator is typically placed before LaTeX-processing operators such as remove_comments_mapper, expand_macro_mapper, or latex_figure_context_extractor_mapper.
- __init__(compressed_file_key: str = 'compressed_file', separator: str = '\n\n', max_file_size: int = 52428800, max_total_size: int = 104857600, *args, **kwargs)
Initialization method.
- Parameters:
compressed_file_key – Field name that stores the archive file path.
separator – String used to join the contents of multiple .tex files.
max_file_size – Maximum allowed uncompressed size in bytes for a single .tex entry inside the archive (default 52428800, i.e. 50 MiB). Entries exceeding this limit are skipped with a warning. Set to None or 0 to disable the check.
max_total_size – Maximum allowed cumulative size in bytes for all extracted .tex content combined (default 104857600, i.e. 100 MiB). Once this limit is reached, remaining files in the archive are skipped with a warning. Set to None or 0 to disable the check.
args – extra positional arguments
kwargs – extra keyword arguments
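The behaviour described above can be approximated with the Python standard library alone. The following is a minimal sketch, not Data-Juicer's implementation: the function name merge_tex, the extension-based format dispatch, the case-insensitive .tex match, UTF-8 decoding with replacement characters, and the exact point at which the total-size check fires are all assumptions made for illustration.

```python
import logging
import tarfile
import zipfile

logger = logging.getLogger(__name__)


def merge_tex(path, separator="\n\n",
              max_file_size=50 * 1024 * 1024,
              max_total_size=100 * 1024 * 1024):
    """Join every .tex entry of a .tar/.tar.gz/.tgz/.zip archive into one string.

    Hypothetical helper: entries are taken in the order the archive lists
    them, with no sorting or deduplication.
    """
    parts = []
    total = 0

    def take(name, size, read):
        """Apply both size checks; return False once the total budget is spent."""
        nonlocal total
        if max_file_size and size > max_file_size:
            logger.warning("skipping %s: %d bytes exceeds max_file_size", name, size)
            return True  # skip this entry, keep scanning
        if max_total_size and total + size > max_total_size:
            logger.warning("max_total_size reached at %s; skipping the rest", name)
            return False  # stop scanning the archive
        # Assumption: sources are decoded as UTF-8, replacing invalid bytes.
        parts.append(read().decode("utf-8", errors="replace"))
        total += size
        return True

    lower = str(path).lower()
    if lower.endswith((".tar", ".tar.gz", ".tgz")):
        with tarfile.open(path, "r:*") as tf:
            for member in tf:
                if member.isfile() and member.name.lower().endswith(".tex"):
                    if not take(member.name, member.size,
                                lambda: tf.extractfile(member).read()):
                        break
    elif lower.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            for info in zf.infolist():
                if not info.is_dir() and info.filename.lower().endswith(".tex"):
                    if not take(info.filename, info.file_size,
                                lambda: zf.read(info)):
                        break
    else:
        # Plain .gz and anything else is rejected, mirroring the note above.
        raise ValueError(f"unsupported archive format: {path}")

    return separator.join(parts)
```

Both limits are tested against the uncompressed size recorded in the archive metadata before any bytes are read, so an oversized entry is never loaded into memory. Passing None or 0 for either limit disables that check, matching the parameter descriptions.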