random_selector

Randomly selects a subset of samples from the dataset.

This operator randomly selects a subset of samples based on either a specified ratio or a fixed number. If both select_ratio and select_num are provided, the one that results in fewer samples is used. The selection is skipped if the dataset has only one or no samples. The random_sample function is used to perform the actual sampling.

  • select_ratio: The ratio of samples to select (0 to 1).

  • select_num: The exact number of samples to select.

  • If neither select_ratio nor select_num is set, the dataset remains unchanged.
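The selection rule described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: `resolve_select_count` and `random_select` are hypothetical helpers, the standard-library `random.sample` stands in for the `random_sample` function, and both the truncation of the ratio-based count and the clamping to the dataset size are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def resolve_select_count(total: int,
                         select_ratio: Optional[float] = None,
                         select_num: Optional[int] = None) -> int:
    """Hypothetical helper: how many of `total` samples to keep."""
    # Selection is skipped for tiny datasets or when nothing is configured.
    if total <= 1 or (select_ratio is None and select_num is None):
        return total
    candidates = []
    if select_ratio is not None:
        # Assumption: the ratio-based count is truncated toward zero.
        candidates.append(int(total * select_ratio))
    if select_num is not None:
        candidates.append(select_num)
    # The setting that yields fewer samples wins.
    # Assumption: the result never exceeds the dataset size.
    return min(min(candidates), total)


def random_select(samples: Sequence[T],
                  select_ratio: Optional[float] = None,
                  select_num: Optional[int] = None,
                  seed: Optional[int] = None) -> List[T]:
    """Hypothetical stand-in for the operator: sample without replacement."""
    count = resolve_select_count(len(samples), select_ratio, select_num)
    if count == len(samples):
        return list(samples)  # dataset left unchanged
    rng = random.Random(seed)
    return rng.sample(list(samples), count)
```

Under these assumptions, a 10-sample dataset with select_ratio=0.2 and select_num=5 keeps 2 samples, and with select_ratio=0.5 and select_num=4 it keeps 4, which matches the sizes of the demonstration outputs on this page.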

Type: selector

Tags: cpu

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| select_ratio | typing.Optional[typing.Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])]] | None | The ratio to select. When both select_ratio and select_num are set, the value corresponding to the smaller number of samples will be applied. |
| select_num | typing.Optional[typing.Annotated[int, Gt(gt=0)]] | None | The number of samples to select. When both select_ratio and select_num are set, the value corresponding to the smaller number of samples will be applied. |
| args | | '' | extra args |
| kwargs | | '' | extra args |
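The type annotations in the parameter listing above encode simple range constraints: select_ratio must lie in [0, 1] and select_num must be greater than 0, with None allowed for both. The sketch below restates those checks with a standard-library dataclass. `RandomSelectorParams` is a hypothetical stand-in for illustration only; it is not the operator's class, and the error messages are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RandomSelectorParams:
    """Hypothetical container mirroring the documented constraints."""
    select_ratio: Optional[float] = None  # None, or 0 <= ratio <= 1
    select_num: Optional[int] = None      # None, or num > 0

    def __post_init__(self) -> None:
        # Corresponds to Ge(ge=0) and Le(le=1) on select_ratio.
        if self.select_ratio is not None and not 0 <= self.select_ratio <= 1:
            raise ValueError("select_ratio must be within [0, 1]")
        # Corresponds to Gt(gt=0) on select_num.
        if self.select_num is not None and self.select_num <= 0:
            raise ValueError("select_num must be greater than 0")
```

For example, `RandomSelectorParams(select_ratio=0.2, select_num=5)` is accepted, while `RandomSelectorParams(select_ratio=1.5)` and `RandomSelectorParams(select_num=0)` raise `ValueError` in this sketch.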

📊 Effect demonstration

test_ratio_select

RandomSelector(select_ratio=0.2, select_num=5)

📥 input data

In the text column, \n denotes a line break.

| Sample | text | count | meta.suffix | meta.key1.key2 | meta.key1.count |
|---|---|---|---|---|---|
| 1 | Today is Sun | 101 | .pdf | 34 | 5 |
| 2 | a v s e c s f e f g a a a   | 16 | .docx | 243 | 63 |
| 3 | 中文也是一个字算一个长度 | 162 | .txt | None | 23 |
| 4 | ,。、„”“«»1」「《》´∶:?! | None | .html | 18 | 48 |
| 5 | 他的英文名字叫Harry Potter | 88 | .pdf | 551 | 78 |
| 6 | 这是一个测试 | None | .py | 89 | 3 |
| 7 | 我出生于2023年12月15日 | None | .java | 354.32 | 67 |
| 8 | emoji表情测试下😊,😸31231 | 2 | .html | 354.32 | 32 |
| 9 | a=1\nb\nc=1+2+3+5\nd=6 | 178 | .pdf | 33 | 33 |
| 10 | 使用片段分词器对每个页面进行分词,使用语言 | 666 | .xml | 18 | 48 |

📤 output data

| Sample | text | count | meta.suffix | meta.key1.key2 | meta.key1.count |
|---|---|---|---|---|---|
| 1 | 这是一个测试 | None | .py | 89.0 | 3 |
| 2 | 我出生于2023年12月15日 | None | .java | 354.32 | 67 |

test_num_select

RandomSelector(select_ratio=0.5, select_num=4)

📥 input data

In the text column, \n denotes a line break.

| Sample | text | count | meta.suffix | meta.key1.key2 | meta.key1.count |
|---|---|---|---|---|---|
| 1 | Today is Sun | 101 | .pdf | 34 | 5 |
| 2 | a v s e c s f e f g a a a   | 16 | .docx | 243 | 63 |
| 3 | 中文也是一个字算一个长度 | 162 | .txt | None | 23 |
| 4 | ,。、„”“«»1」「《》´∶:?! | None | .html | 18 | 48 |
| 5 | 他的英文名字叫Harry Potter | 88 | .pdf | 551 | 78 |
| 6 | 这是一个测试 | None | .py | 89 | 3 |
| 7 | 我出生于2023年12月15日 | None | .java | 354.32 | 67 |
| 8 | emoji表情测试下😊,😸31231 | 2 | .html | 354.32 | 32 |
| 9 | a=1\nb\nc=1+2+3+5\nd=6 | 178 | .pdf | 33 | 33 |
| 10 | 使用片段分词器对每个页面进行分词,使用语言 | 666 | .xml | 18 | 48 |

📤 output data

| Sample | text | count | meta.suffix | meta.key1.key2 | meta.key1.count |
|---|---|---|---|---|---|
| 1 | 这是一个测试 | None | .py | 89.0 | 3 |
| 2 | 我出生于2023年12月15日 | None | .java | 354.32 | 67 |
| 3 | Today is Sun | 101 | .pdf | 34.0 | 5 |
| 4 | emoji表情测试下😊,😸31231 | 2 | .html | 354.32 | 32 |