# random_selector

Randomly selects a subset of samples from the dataset.

This operator selects samples based on either a specified ratio or a fixed number. If both `select_ratio` and `select_num` are provided, the one that results in fewer samples is used. The selection is skipped if the dataset has one sample or none. The actual sampling is performed by the `random_sample` function.

- `select_ratio`: The ratio of samples to select (0 to 1).
- `select_num`: The exact number of samples to select.
- If neither `select_ratio` nor `select_num` is set, the dataset remains unchanged.

Type: **selector**

Tags: cpu

## 🔧 Parameter Configuration

| name | type | default | desc |
|--------|------|--------|------|
| `select_ratio` | typing.Optional[typing.Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])]] | `None` | The ratio to select. When both select_ratio and select_num are set, the value corresponding to the smaller number of samples will be applied. |
| `select_num` | typing.Optional[typing.Annotated[int, Gt(gt=0)]] | `None` | The number of samples to select. When both select_ratio and select_num are set, the value corresponding to the smaller number of samples will be applied. |
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra args |

## 📊 Effect demonstration

### test_ratio_select

```python
RandomSelector(select_ratio=0.2, select_num=5)
```

#### 📥 input data

| sample | text | count | meta.suffix | meta.key1.count | meta.key1.key2.count |
|--------|------|--------|------|------|------|
| Sample 1 | `Today is Sun` | 101 | `.pdf` | 5 | 34 |
| Sample 2 | `a v s e c s f e f g a a a  ` | 16 | `.docx` | 63 | 243 |
| Sample 3 | `中文也是一个字算一个长度` | 162 | `.txt` | 23 | None |
| Sample 4 | `,。、„”“«»1」「《》´∶:?!` | None | `.html` | 48 | 18 |
| Sample 5 | `他的英文名字叫Harry Potter` | 88 | `.pdf` | 78 | 551 |
| Sample 6 | `这是一个测试` | None | `.py` | 3 | 89 |
| Sample 7 | `我出生于2023年12月15日` | None | `.java` | 67 | 354.32 |
| Sample 8 | `emoji表情测试下😊,😸31231` | 2 | `.html` | 32 | 354.32 |
| Sample 9 | `a=1\nb\nc=1+2+3+5\nd=6` | 178 | `.pdf` | 33 | 33 |
| Sample 10 | `使用片段分词器对每个页面进行分词,使用语言` | 666 | `.xml` | 48 | 18 |

In Sample 9, `\n` marks a line break in the text.

#### 📤 output data

| sample | text | count | meta.suffix | meta.key1.count | meta.key1.key2.count |
|--------|------|--------|------|------|------|
| Sample 1 | `这是一个测试` | None | `.py` | 3 | 89.0 |
| Sample 2 | `我出生于2023年12月15日` | None | `.java` | 67 | 354.32 |

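
The output has two samples because `select_ratio=0.2` on 10 inputs asks for fewer samples than `select_num=5`. The rule can be sketched as below. This is a minimal illustration, not the operator's source: `effective_select_count` is a hypothetical helper, and both the truncation of the ratio-based count and the capping of `select_num` at the dataset size are assumptions.

```python
from typing import Optional


def effective_select_count(
    n_samples: int,
    select_ratio: Optional[float] = None,
    select_num: Optional[int] = None,
) -> int:
    """Hypothetical helper: number of samples kept for a given configuration."""
    # Datasets with at most one sample are left untouched.
    if n_samples <= 1:
        return n_samples
    # With neither option set, the dataset stays unchanged.
    if select_ratio is None and select_num is None:
        return n_samples

    candidates = []
    if select_ratio is not None:
        # Assumption: the ratio-based count is truncated toward zero.
        candidates.append(int(select_ratio * n_samples))
    if select_num is not None:
        # Assumption: a request larger than the dataset is capped at its size.
        candidates.append(min(select_num, n_samples))

    # When both options are set, the smaller resulting count wins.
    return min(candidates)


print(effective_select_count(10, select_ratio=0.2, select_num=5))  # 2
print(effective_select_count(10, select_ratio=0.5, select_num=4))  # 4
```

The second call uses the `test_num_select` configuration, where `select_num=4` is smaller than the five samples the ratio would give.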
### test_num_select

```python
RandomSelector(select_ratio=0.5, select_num=4)
```

#### 📥 input data

| sample | text | count | meta.suffix | meta.key1.count | meta.key1.key2.count |
|--------|------|--------|------|------|------|
| Sample 1 | `Today is Sun` | 101 | `.pdf` | 5 | 34 |
| Sample 2 | `a v s e c s f e f g a a a  ` | 16 | `.docx` | 63 | 243 |
| Sample 3 | `中文也是一个字算一个长度` | 162 | `.txt` | 23 | None |
| Sample 4 | `,。、„”“«»1」「《》´∶:?!` | None | `.html` | 48 | 18 |
| Sample 5 | `他的英文名字叫Harry Potter` | 88 | `.pdf` | 78 | 551 |
| Sample 6 | `这是一个测试` | None | `.py` | 3 | 89 |
| Sample 7 | `我出生于2023年12月15日` | None | `.java` | 67 | 354.32 |
| Sample 8 | `emoji表情测试下😊,😸31231` | 2 | `.html` | 32 | 354.32 |
| Sample 9 | `a=1\nb\nc=1+2+3+5\nd=6` | 178 | `.pdf` | 33 | 33 |
| Sample 10 | `使用片段分词器对每个页面进行分词,使用语言` | 666 | `.xml` | 48 | 18 |

In Sample 9, `\n` marks a line break in the text.

#### 📤 output data

| sample | text | count | meta.suffix | meta.key1.count | meta.key1.key2.count |
|--------|------|--------|------|------|------|
| Sample 1 | `这是一个测试` | None | `.py` | 3 | 89.0 |
| Sample 2 | `我出生于2023年12月15日` | None | `.java` | 67 | 354.32 |
| Sample 3 | `Today is Sun` | 101 | `.pdf` | 5 | 34.0 |
| Sample 4 | `emoji表情测试下😊,😸31231` | 2 | `.html` | 32 | 354.32 |

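
The selected rows are drawn from the input without repetition and in no particular order. A rough stand-in for that sampling step, using only the Python standard library, might look like the sketch below. It is an illustration under assumptions: `pick_random_subset` and its `seed` argument are hypothetical names, and the operator itself delegates to `random_sample`, whose internals are not reproduced here.

```python
import random
from typing import Any, Dict, List, Optional


def pick_random_subset(
    samples: List[Dict[str, Any]],
    k: int,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the sampling step: draw k samples without replacement."""
    # Mirror the documented behaviour: nothing to select from 0 or 1 samples.
    if len(samples) <= 1:
        return list(samples)
    # Never request more samples than the dataset holds.
    k = min(k, len(samples))
    # A dedicated generator keeps the draw reproducible for a given seed.
    rng = random.Random(seed)
    return rng.sample(samples, k)


# Ten placeholder samples standing in for the demo dataset.
dataset = [{"text": f"sample {i}"} for i in range(10)]

subset = pick_random_subset(dataset, k=4, seed=42)
print(len(subset))  # 4
```

Passing the same `seed` yields the same subset, which is convenient when a selection needs to be repeated exactly.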
## 🔗 related links

- [source code](../../../data_juicer/ops/selector/random_selector.py)
- [unit test](../../../tests/ops/selector/test_random_selector.py)
- [Return operator list](../../Operators.md)