data_juicer.utils.process_utils module

data_juicer.utils.process_utils.setup_worker_threads(num_threads=1)

Configure thread limits for worker processes to prevent thread over-subscription.

When running with multiple worker processes (e.g., num_proc > 1), letting each worker use multiple threads can severely degrade performance because of thread contention. This function caps the number of threads per worker to prevent that.

Parameters: num_threads – Number of threads per worker process (default: 1)
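
The thread-limiting idea described above can be illustrated with a minimal sketch. It assumes the common approach of setting the thread-count environment variables that numerical backends read (OMP_NUM_THREADS and similar). The helper name limit_worker_threads is hypothetical, and this is not necessarily how setup_worker_threads is implemented.

```python
import os

# Environment variables that common numerical backends consult
# when sizing their thread pools.
_THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_worker_threads(num_threads=1):
    """Hypothetical helper: cap per-process thread pools via env vars."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    for var in _THREAD_ENV_VARS:
        os.environ[var] = str(num_threads)
    # Return the applied settings so callers can log or inspect them.
    return {var: os.environ[var] for var in _THREAD_ENV_VARS}
```

Variables of this kind generally need to be set before the corresponding library initializes its thread pool, so a helper like this would typically run at the very start of each worker process.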

data_juicer.utils.process_utils.setup_mp(method=None)

data_juicer.utils.process_utils.get_min_cuda_memory()

data_juicer.utils.process_utils.calculate_np(name, memory, num_cpus, use_cuda=False, num_gpus=0)

Automatically calculate the optimal number of processes for the given OP.

data_juicer.utils.process_utils.calculate_ray_np(operators)

Automatically calculates the optimal concurrency for Ray Data operators. This function handles both task-based and actor-based operators, considering resource requirements and user specifications. The computation follows Ray Data’s concurrency semantics while optimizing resource utilization.

Key Concepts:

- Resource Ratio: Individual operator’s resource requirement (GPU/CPU/memory) compared to total cluster resources, using max(cpu_ratio, gpu_ratio, adjusted_mem_ratio)
- Fixed Allocation: Portion of resources reserved by operators with user-specified num_proc
- Dynamic Allocation: Remaining resources distributed among auto-scaling operators
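
The resource ratio from the concepts above can be sketched as follows. This is an illustration under stated assumptions: the function name resource_ratio and its parameters are hypothetical, and because the adjustment behind adjusted_mem_ratio is not specified here, the sketch uses the plain memory ratio in its place.

```python
def resource_ratio(op_cpu, op_gpu, op_mem, total_cpu, total_gpu, total_mem):
    """Hypothetical helper: share of the cluster one operator instance needs."""
    cpu_ratio = op_cpu / total_cpu if total_cpu > 0 else 0.0
    gpu_ratio = op_gpu / total_gpu if total_gpu > 0 else 0.0
    # Assumption: plain memory ratio stands in for "adjusted_mem_ratio",
    # whose exact adjustment is not described in this page.
    mem_ratio = op_mem / total_mem if total_mem > 0 else 0.0
    # The scarcest resource determines the operator's effective footprint.
    return max(cpu_ratio, gpu_ratio, mem_ratio)


# An operator needing 2 CPUs, 1 GPU and 8 GB on a 32-CPU, 4-GPU, 256 GB cluster.
ratio = resource_ratio(op_cpu=2, op_gpu=1, op_mem=8,
                       total_cpu=32, total_gpu=4, total_mem=256)
print(ratio)  # 0.25, because the GPU is the limiting resource
```

Taking the maximum means that a single instance is charged for whichever resource it consumes the largest fraction of.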

Design Logic:

1. User Specification Priority:
  - If the user provides a concurrency setting, return it directly.
  - This applies to both task-based and actor-based operators.
2. Task Operators (equivalent to a CPU operator in dj):
  a. When unspecified: return None to let Ray determine the concurrency implicitly.
  b. Auto-calculation: return the maximum concurrency based on available resources and operator requirements.
3. Actor Operators (equivalent to a GPU operator in dj):
  a. Mandatory concurrency: if the required GPUs are unspecified, set them to 1, then follow b below to calculate the concurrency automatically based on this setting.
  b. Auto-calculation returns a tuple (min_concurrency, max_concurrency):
    - Minimum: Ensures a baseline allocation from the remaining resources when all operators are active simultaneously in streaming mode (shared proportionally). If resources are insufficient, fall back to batch mode and guarantee resources only for the actor operators.
    - Maximum: Allows a single operator to fully utilize the remaining resources when the others are idle.
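
The decision flow above can be approximated with a small sketch. It is a simplified model, not Data-Juicer's actual code: the function sketch_concurrency, its parameters, and the exact formulas for the minimum and maximum are assumptions made for illustration, and the batch-mode fallback is not modelled.

```python
import math


def sketch_concurrency(user_concurrency, kind, ratio, remaining,
                       auto_ratio_sum, auto=True):
    """Hypothetical model of the concurrency decision for one operator.

    ratio:          resource ratio of one instance of this operator
    remaining:      cluster fraction left after fixed allocations
    auto_ratio_sum: sum of ratios over all auto-scaling operators
    """
    # 1. A user-specified setting always takes priority.
    if user_concurrency is not None:
        return user_concurrency
    if ratio <= 0 or auto_ratio_sum <= 0:
        raise ValueError("ratios must be positive")

    # Upper bound: this operator alone consumes all remaining resources.
    max_conc = max(1, math.floor(remaining / ratio))

    # 2. Task operators: None lets Ray decide; otherwise return the maximum.
    if kind == "task":
        return max_conc if auto else None

    # 3. Actor operators: (min, max). Assumed minimum: every auto-scaling
    # operator runs the same number of instances at the same time.
    min_conc = max(1, math.floor(remaining / auto_ratio_sum))
    return (min_conc, max_conc)


print(sketch_concurrency(None, "actor", ratio=0.25, remaining=1.0,
                         auto_ratio_sum=0.5))  # (2, 4)
```

In this model the gap between the minimum and the maximum is what gives Ray room to scale an actor pool up when other operators are idle and back down when they become active.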
