# text_chunk_mapper

Split input text into chunks based on specified criteria.

- Splits the input text into multiple chunks using a specified maximum length and a split pattern.
- If `max_len` is provided, the text is split into chunks with a maximum length of `max_len`.
- If `split_pattern` is provided, the text is split at occurrences of the pattern. If the length still exceeds `max_len`, a cut is forced.
- The `overlap_len` parameter specifies the overlap length between consecutive chunks when the split does not occur at the pattern.
- Uses a Hugging Face tokenizer to calculate the text length in tokens if a tokenizer name is provided; otherwise, the string length is used.
- Caches the following stats: 'chunk_count' (number of chunks generated for each sample).
- Raises a `ValueError` if both `max_len` and `split_pattern` are `None`, or if `overlap_len` is greater than or equal to `max_len`.

Type: **mapper**

Tags: cpu, api, text

## 🔧 Parameter Configuration

| name | type | default | desc |
|--------|------|--------|------|
| `max_len` | typing.Optional[typing.Annotated[int, Gt(gt=0)]] | `None` | Split the text into multiple texts with this maximum length if it is not None. |
| `split_pattern` | typing.Optional[str] | `'\n\n'` | Make sure to split at this pattern if it is not None, and force a cut if the length exceeds `max_len`. |
| `overlap_len` | typing.Annotated[int, Ge(ge=0)] | `0` | Overlap length between the split texts if the split does not occur at the pattern. |
| `tokenizer` | typing.Optional[str] | `None` | The tokenizer name of Hugging Face tokenizers. The text length will be calculated as the number of tokens if a tokenizer is provided; otherwise, the text length equals the string length. Supports tiktoken tokenizers (such as gpt-4o), dashscope tokenizers (such as qwen2.5-72b-instruct), and Hugging Face tokenizers. |
| `trust_remote_code` | | `False` | Whether to trust the remote code of HF models. |
| `args` | | `''` | Extra args. |
| `kwargs` | | `''` | Extra args. |

A configuration sketch that combines several of these parameters is provided at the end of this page.

## 📊 Effect demonstration

### test_naive_text_chunk

```python
TextChunkMapper(split_pattern='\n')
```

#### 📥 input data
Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Sur la plateforme MT4, plusieurs manières d'accéder à 
ces fonctionnalités sont conçues simultanément.
Sample 3: text
欢迎来到阿里巴巴!
#### 📤 output data
Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Sur la plateforme MT4, plusieurs manières d'accéder à 
Sample 3: text
ces fonctionnalités sont conçues simultanément.
Sample 4: text
欢迎来到阿里巴巴!
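The pattern-based behaviour above can be approximated with plain Python string handling. The sketch below is only an illustration: the helper name `split_on_pattern` is invented here, and it ignores `max_len`, `overlap_len`, and tokenizer-based length counting, all of which the real operator handles.

```python
def split_on_pattern(text: str, split_pattern: str = '\n') -> list[str]:
    """Rough stand-in for pattern-based chunking: split at every occurrence
    of the pattern and drop empty pieces; texts without the pattern pass
    through unchanged."""
    pieces = [piece for piece in text.split(split_pattern) if piece]
    return pieces or [text]


samples = [
    "Today is Sunday and it's a happy day!",
    "Sur la plateforme MT4, plusieurs manières d'accéder à \n"
    "ces fonctionnalités sont conçues simultanément.",
    "欢迎来到阿里巴巴!",
]

for text in samples:
    for chunk in split_on_pattern(text, '\n'):
        print(repr(chunk))  # prints the four output samples listed above
```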
#### ✨ explanation

This example shows how the operator splits the input text into chunks based on a specified split pattern. Here, the split pattern is '\n', which means the text is split at each newline character. In this case, only the second sample contains a newline, so it is split into two parts. The other samples do not contain newlines and remain unchanged.

### test_max_len_text_chunk

```python
TextChunkMapper(max_len=20, split_pattern=None)
```

#### 📥 input data
Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Sur la plateforme MT4, plusieurs manières d'accéder à ces fonctionnalités sont conçues simultanément.
Sample 3: text
欢迎来到阿里巴巴!
#### 📤 output data
Sample 1: text
Today is Sunday and 
Sample 2: text
it's a happy day!
Sample 3: text
Sur la plateforme MT
Sample 4: text
4, plusieurs manière
Sample 5: text
s d'accéder à ces fo
Sample 6: text
nctionnalités sont c
Sample 7: text
onçues simultanément
Sample 8: text
.
Sample 9: text
欢迎来到阿里巴巴!
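When `split_pattern` is None, as in this test, the cut is a plain fixed-width slice over the text length. Below is a minimal sketch of that behaviour, including the optional `overlap_len` window; the function name `hard_cut` is made up for illustration and this is not the operator's actual implementation.

```python
def hard_cut(text: str, max_len: int, overlap_len: int = 0) -> list[str]:
    """Slice text into windows of at most max_len characters; with
    overlap_len > 0 each window starts overlap_len characters before the
    previous one ended. (Degenerate trailing windows are not trimmed here.)"""
    if overlap_len >= max_len:
        raise ValueError('overlap_len must be smaller than max_len')
    step = max_len - overlap_len
    return [text[i:i + max_len] for i in range(0, len(text), step)]


text = ("Sur la plateforme MT4, plusieurs manières d'accéder à "
        "ces fonctionnalités sont conçues simultanément.")
for chunk in hard_cut(text, max_len=20):
    print(repr(chunk))  # 20-character slices that cut straight through words
```

With a tokenizer configured, the same windowing would presumably be applied to token counts rather than characters.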
#### ✨ explanation

This example demonstrates how the operator splits the input text into chunks with a maximum length of 20. Since no tokenizer is configured, the length is measured in characters, and the text is cut every 20 characters regardless of word boundaries; the remaining characters continue in the next chunk, which is why the French sample is broken mid-word. Samples shorter than 20 characters, such as the Chinese greeting, are returned unchanged.

## 🔗 related links

- [source code](../../../data_juicer/ops/mapper/text_chunk_mapper.py)
- [unit test](../../../tests/ops/mapper/test_text_chunk_mapper.py)
- [Return operator list](../../Operators.md)
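For reference, here is the configuration sketch mentioned after the parameter table, combining several of the parameters on this page. The import path is inferred from the source-code link above, and all values are placeholders rather than recommendations.

```python
from data_juicer.ops.mapper.text_chunk_mapper import TextChunkMapper

# Hypothetical configuration: parameter names come from the table above,
# but the concrete values are only examples.
op = TextChunkMapper(
    max_len=512,           # at most 512 tokens per chunk
    split_pattern='\n\n',  # prefer to cut at blank lines (the default pattern)
    overlap_len=32,        # must stay smaller than max_len
    tokenizer='gpt-4o',    # tiktoken tokenizer named in the parameter table
)
```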