data_juicer.utils.job.common module

DataJuicer Job Utilities - Common Functions

Shared utilities for stopping and monitoring DataJuicer jobs.

class data_juicer.utils.job.common.JobUtils(job_id: str, work_dir: str = None, base_dir: str = None)[source]

Bases: object

Common utilities for DataJuicer job operations.

__init__(job_id: str, work_dir: str = None, base_dir: str = None)[source]

Initialize job utilities.

Parameters:
  • job_id -- The job ID to work with

  • work_dir -- Work directory that already includes job_id (preferred)

  • base_dir -- Base directory containing job outputs (deprecated, use work_dir instead)
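
A minimal construction sketch; the job ID and directories below are illustrative placeholders, not real outputs:

    from data_juicer.utils.job.common import JobUtils

    # work_dir is preferred and should already include the job_id.
    # Both values here are hypothetical examples.
    utils = JobUtils(
        job_id="20250101_120000_demo",
        work_dir="outputs/partition-checkpoint-eventlog/20250101_120000_demo",
    )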

load_job_summary() → Dict[str, Any] | None[source]

Load job summary from the work directory.

load_dataset_mapping() → Dict[str, Any][source]

Load dataset mapping information.

load_event_logs() → List[Dict[str, Any]][source]

Load and parse event logs.
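
An illustrative sketch tying the three loaders above together, reusing the utils instance from the constructor example; the keys inside the returned dictionaries depend on the job's outputs and are not assumed here:

    summary = utils.load_job_summary()      # Dict[str, Any], or None if no summary exists
    if summary is None:
        print("no job summary found in the work directory")

    mapping = utils.load_dataset_mapping()  # Dict[str, Any]
    events = utils.load_event_logs()        # List[Dict[str, Any]], one dict per event
    print(f"loaded {len(events)} events; mapping keys: {sorted(mapping)}")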

extract_process_thread_ids() → Dict[str, Set[int]][source]

Extract process and thread IDs from event logs. Returns a dict with 'process_ids' and 'thread_ids' sets.
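
The returned dictionary uses the two keys named above, each mapping to a set of integers; for example:

    ids = utils.extract_process_thread_ids()
    process_ids = ids["process_ids"]   # Set[int] of PIDs seen in the event logs
    thread_ids = ids["thread_ids"]     # Set[int] of thread IDs seen in the event logs
    print(f"{len(process_ids)} process IDs and {len(thread_ids)} thread IDs recorded")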

find_processes_by_ids(process_ids: Set[int]) → List[Process][source]

Find running processes by their PIDs.

find_threads_by_ids(thread_ids: Set[int]) → List[Thread][source]

Find running threads by their IDs (if possible).
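
A sketch feeding the extracted IDs into the two lookup methods above. Only processes and threads that are still running are returned, so the result lists can be shorter than the ID sets:

    ids = utils.extract_process_thread_ids()
    processes = utils.find_processes_by_ids(ids["process_ids"])
    threads = utils.find_threads_by_ids(ids["thread_ids"])
    print(f"{len(processes)} processes and {len(threads)} threads still alive")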

get_partition_status() → Dict[int, Dict[str, Any]][source]

Get current status of all partitions.
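
The outer key is the partition index; the fields of each inner status dict are not specified here, so this sketch only iterates over them generically:

    for partition_id, status in sorted(utils.get_partition_status().items()):
        print(f"partition {partition_id}: {status}")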

calculate_overall_progress() → Dict[str, Any][source]

Calculate overall job progress.
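
The progress report is a free-form Dict[str, Any]; this sketch simply dumps it without assuming particular keys:

    import json

    progress = utils.calculate_overall_progress()
    print(json.dumps(progress, indent=2, default=str))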

get_operation_pipeline() → List[Dict[str, Any]][source]

Get the operation pipeline from config.
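
Each list entry describes one operator from the job's config; assuming nothing about its contents beyond it being a dict, the pipeline can be inspected like this:

    for step, op in enumerate(utils.get_operation_pipeline()):
        print(f"step {step}: {op}")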

data_juicer.utils.job.common.list_running_jobs(base_dir: str = 'outputs/partition-checkpoint-eventlog') → List[Dict[str, Any]][source]

List all DataJuicer jobs found under base_dir, along with their status.
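
A sketch of the module-level helper; the fields of each returned job dict are not documented here, so they are printed verbatim. The default base_dir matches the signature above:

    from data_juicer.utils.job.common import list_running_jobs

    for job in list_running_jobs():  # scans 'outputs/partition-checkpoint-eventlog' by default
        print(job)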