data_juicer.utils.job.common module#

DataJuicer Job Utilities - Common Functions

Shared utilities for job stopping and monitoring operations.

class data_juicer.utils.job.common.JobUtils(job_id: str, work_dir: str = None, base_dir: str = None)[source]#

Bases: object

Common utilities for DataJuicer job operations.

__init__(job_id: str, work_dir: str = None, base_dir: str = None)[source]#

Initialize job utilities.

Parameters:
  • job_id – The job ID to work with

  • work_dir – Work directory that already includes job_id (preferred)

  • base_dir – Base directory containing job outputs (deprecated, use work_dir instead)

load_job_summary() Dict[str, Any] | None[source]#

Load job summary from the work directory.

load_dataset_mapping() Dict[str, Any][source]#

Load dataset mapping information.

load_event_logs() List[Dict[str, Any]][source]#

Load and parse event logs.

extract_process_thread_ids() Dict[str, Set[int]][source]#

Extract process and thread IDs from event logs. Returns a dict with ‘process_ids’ and ‘thread_ids’ sets.

find_processes_by_ids(process_ids: Set[int]) List[Process][source]#

Find running processes by their PIDs.

find_threads_by_ids(thread_ids: Set[int]) List[Thread][source]#

Find running threads by their IDs (if possible).

get_partition_status() Dict[int, Dict[str, Any]][source]#

Get current status of all partitions.

calculate_overall_progress() Dict[str, Any][source]#

Calculate overall job progress.

get_operation_pipeline() List[Dict[str, Any]][source]#

Get the operation pipeline from config.

data_juicer.utils.job.common.list_running_jobs(base_dir: str = 'outputs/partition-checkpoint-eventlog') List[Dict[str, Any]][source]#

List all DataJuicer jobs and their status.