data_juicer.utils.process_utils module

data_juicer.utils.process_utils.setup_mp(method=None)
data_juicer.utils.process_utils.get_min_cuda_memory()
data_juicer.utils.process_utils.calculate_np(name, mem_required, cpu_required, use_cuda=False, gpu_required=0)

Calculate the optimum number of processes for the given OP automatically.

data_juicer.utils.process_utils.calculate_ray_np(operators)

Automatically calculates the optimal concurrency for Ray Data operators. This function handles both task-based and actor-based operators, considering resource requirements and user specifications. The computation follows Ray Data’s concurrency semantics while optimizing resource utilization.

Key Concepts:

  • Resource Ratio: Individual operator’s resource requirement (GPU/CPU/memory) compared to total cluster resources, using max(cpu_ratio, gpu_ratio, adjusted_mem_ratio)

  • Fixed Allocation: Portion of resources reserved by operators with user-specified num_proc

  • Dynamic Allocation: Remaining resources distributed among auto-scaling operators
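
The resource ratio can be illustrated with a minimal sketch. The helper name `resource_ratio`, its parameters, and the cluster totals below are hypothetical and are not taken from data_juicer. How the memory ratio is "adjusted" is not stated here, so the sketch assumes a plain memory ratio.

```python
def resource_ratio(cpu_required, gpu_required, mem_required,
                   total_cpu, total_gpu, total_mem):
    """Fraction of the cluster that one instance of an operator occupies.

    Hypothetical helper for illustration only; the memory ratio is used
    unadjusted because the adjustment is not specified in the text.
    """
    cpu_ratio = cpu_required / total_cpu if total_cpu else 0.0
    gpu_ratio = gpu_required / total_gpu if total_gpu else 0.0
    mem_ratio = mem_required / total_mem if total_mem else 0.0
    # The scarcest resource determines the operator's footprint.
    return max(cpu_ratio, gpu_ratio, mem_ratio)


# Made-up operator: 2 CPUs, 1 GPU, 8 GB on a 32-CPU, 4-GPU, 128 GB cluster.
ratio = resource_ratio(cpu_required=2, gpu_required=1, mem_required=8,
                       total_cpu=32, total_gpu=4, total_mem=128)
```

In this made-up case the GPU is the binding resource, so the ratio equals the GPU share (1 of 4 GPUs).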

Design Logic:

  1. User Specification Priority:
    • If user provides concurrency setting, directly return it
    • Applies to both task and actor based operators

  2. Task Operators (equivalent to a cpu operator in dj):
    a. When unspecified: Return None to let Ray determine implicitly
    b. Auto-calculation: Returns maximum concurrency based on available resources and operator requirements

  3. Actor Operators (equivalent to a gpu operator in dj):
    a. Mandatory concurrency - set required gpus to 1 if unspecified, and then refer to the following b to calculate automatically based on this setting
    b. Auto-calculation returns tuple (min_concurrency, max_concurrency):
      i. Minimum: Ensures baseline resource allocation in remaining resources when all operators are active simultaneously (proportionally)
      ii. Maximum: Allows full utilization of remaining resources by single operator when others are idle
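
The (min_concurrency, max_concurrency) bounds described for actor operators can be sketched as follows. This is one possible reading of the description, not data_juicer's implementation. The function name, its parameters, and the exact proportional-share formula are assumptions made for illustration.

```python
import math


def actor_concurrency_bounds(ratio, remaining, total_dynamic_ratio,
                             user_concurrency=None):
    """Hypothetical sketch of the min/max concurrency for an actor operator.

    ratio: this operator's resource ratio (cluster fraction per instance)
    remaining: cluster fraction left after fixed allocations
    total_dynamic_ratio: sum of the ratios of all auto-scaling operators
    """
    # A user-specified concurrency always takes priority.
    if user_concurrency is not None:
        return user_concurrency
    # Maximum: this operator alone consumes all remaining resources.
    max_conc = max(1, math.floor(remaining / ratio))
    # Minimum (assumed formula): remaining resources are split in
    # proportion to each operator's ratio when all operators are active.
    share = remaining * ratio / total_dynamic_ratio
    min_conc = max(1, math.floor(share / ratio))
    return (min_conc, max_conc)


# 75% of the cluster remains; this operator needs 25% per instance and
# all auto-scaling operators together need 50% per round of instances.
bounds = actor_concurrency_bounds(ratio=0.25, remaining=0.75,
                                  total_dynamic_ratio=0.5)
```

With these made-up numbers, the operator is guaranteed one instance when every operator is busy and may grow to three instances when the others are idle.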