data_juicer.ops.pipeline.ray_repartition_pipeline module

class data_juicer.ops.pipeline.ray_repartition_pipeline.RayRepartitionPipeline(*args, **kwargs)

Bases: Pipeline

Repartition a Ray Dataset into a target number of blocks.

This operator repartitions at the dataset level by calling Ray Dataset’s repartition API. It is intended for Ray executor pipelines only, because local datasets do not expose Ray Dataset blocks.
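The delegation described above can be sketched as follows. This is a minimal illustration, not the operator's actual implementation: `repartition_blocks` is a hypothetical helper name, and the argument checks are assumptions added for clarity. The only real API it relies on is Ray Dataset's `repartition(num_blocks, shuffle=...)` method.

```python
from typing import Any


def repartition_blocks(ray_dataset: Any, num_blocks: int = 1, shuffle: bool = False) -> Any:
    """Hypothetical helper: delegate block repartitioning to a Ray Dataset.

    The checks below are illustrative assumptions; the real operator may
    validate its inputs differently or not at all.
    """
    # Local (non-Ray) datasets have no blocks, so there is nothing to repartition.
    if not hasattr(ray_dataset, "repartition"):
        raise TypeError("expected a Ray Dataset exposing a repartition() method")
    if num_blocks < 1:
        raise ValueError("num_blocks must be a positive integer")
    # Ray Dataset.repartition takes the block count positionally and
    # shuffle as a keyword-only argument.
    return ray_dataset.repartition(num_blocks, shuffle=shuffle)
```

With Ray installed, passing something like `ray.data.range(1000)` and `num_blocks=8` should return a dataset that reports 8 blocks once it has been materialized.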

__init__(num_blocks: int = 1, shuffle: bool = False, *args, **kwargs)

Initialization method.

Parameters:
  • num_blocks – target number of Ray Dataset blocks. Defaults to 1.

  • shuffle – whether to shuffle records during repartition. Defaults to False.

run(dataset, *, exporter=None, tracer=None)