data_juicer.ops.pipeline.ray_repartition_pipeline module

class data_juicer.ops.pipeline.ray_repartition_pipeline.RayRepartitionPipeline(*args, **kwargs)[source]

Bases: Pipeline

Repartition a Ray Dataset into a target number of blocks.

This operator repartitions blocks at the dataset level through Ray Dataset's repartition API. It is intended only for Ray executor pipelines, because local datasets do not expose Ray Dataset blocks.

__init__(num_blocks: int = 1, shuffle: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • num_blocks -- target number of Ray Dataset blocks. Defaults to 1.

  • shuffle -- whether to shuffle records during repartition. Defaults to False.
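
As a rough mental model of what the two parameters control, the sketch below splits a list of records into `num_blocks` nearly equal blocks, optionally shuffling first. It is a plain-Python illustration only: it does not call Ray or Data-Juicer, and Ray Dataset's actual block sizing and shuffle behavior may differ. The function name `repartition_sketch` and its `seed` argument are hypothetical and exist only to keep the example reproducible.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repartition_sketch(
    records: Sequence[T],
    num_blocks: int = 1,
    shuffle: bool = False,
    seed: int = 0,  # hypothetical: fixes the shuffle order for this illustration
) -> List[List[T]]:
    """Split records into num_blocks nearly equal blocks (conceptual model only)."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be a positive integer")

    items = list(records)
    if shuffle:
        # Shuffle the records before they are cut into blocks.
        random.Random(seed).shuffle(items)

    # Spread the remainder over the first `extra` blocks so sizes differ by at most 1.
    base, extra = divmod(len(items), num_blocks)
    blocks: List[List[T]] = []
    start = 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


blocks = repartition_sketch(range(10), num_blocks=4)
print([len(b) for b in blocks])  # [3, 3, 2, 2]
```

With `shuffle=False` the record order is preserved across blocks; with `shuffle=True` the same records are redistributed in a different order, which is the distinction the `shuffle` parameter describes.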

run(dataset, *, exporter=None, tracer=None)[source]