data_juicer.core.data.load_strategy module

class data_juicer.core.data.load_strategy.StrategyKey(executor_type: str, data_type: str, data_source: str)

Bases: object

Immutable key for strategy registration with wildcard support

executor_type: str
data_type: str
data_source: str
matches(other: StrategyKey) → bool

Check if this key matches another key with wildcard support

Supports Unix-style wildcards:
  • ‘*’ matches any string

  • ‘?’ matches any single character

  • ‘[seq]’ matches any character in seq

  • ‘[!seq]’ matches any character not in seq
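
The four wildcard forms listed above are the ones Python's standard fnmatch module implements, so the matching rule can be illustrated with a minimal sketch. KeySketch is a hypothetical stand-in, not the library's StrategyKey. The sketch assumes that the fields of the key on which matches() is called act as the patterns and the fields of the other key as the concrete values; the real direction may differ. fnmatchcase is used so the result does not depend on the operating system's case rules.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class KeySketch:
    """Hypothetical stand-in for StrategyKey; not the library's code."""

    executor_type: str
    data_type: str
    data_source: str

    def matches(self, other: "KeySketch") -> bool:
        # Assumption: self holds the patterns, other holds concrete values.
        return (
            fnmatchcase(other.executor_type, self.executor_type)
            and fnmatchcase(other.data_type, self.data_type)
            and fnmatchcase(other.data_source, self.data_source)
        )


pattern = KeySketch("default", "local", "*")
concrete = KeySketch("default", "local", "arxiv")

print(pattern.matches(concrete))  # True: '*' matches any data_source
# '[!3]' rejects the character '3', so 's3' does not match 's[!3]'.
print(KeySketch("ray", "*", "s[!3]").matches(KeySketch("ray", "remote", "s3")))  # False
```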

__init__(executor_type: str, data_type: str, data_source: str) → None
class data_juicer.core.data.load_strategy.DataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: ABC, ConfigValidator

Abstract base class for data load strategies.

__init__(ds_config: Dict, cfg: Namespace)
abstractmethod load_data(**kwargs) → DJDataset

Need to be implemented in the
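
As an illustration of the abstract-method contract, here is a minimal sketch built only on Python's abc module. StrategySketch and InMemoryStrategy are hypothetical names, the "rows" config field is invented for the example, and a plain list stands in for DJDataset.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class StrategySketch(ABC):
    """Hypothetical stand-in for an abstract load strategy."""

    def __init__(self, ds_config: Dict[str, Any], cfg: Any) -> None:
        self.ds_config = ds_config
        self.cfg = cfg

    @abstractmethod
    def load_data(self, **kwargs) -> List[dict]:
        """Concrete subclasses must override this method."""


class InMemoryStrategy(StrategySketch):
    """Hypothetical concrete strategy that reads rows from its own config."""

    def load_data(self, **kwargs) -> List[dict]:
        # "rows" is an invented field used only for this illustration.
        return list(self.ds_config["rows"])


strategy = InMemoryStrategy({"rows": [{"text": "hello"}]}, cfg=None)
print(strategy.load_data())

try:
    StrategySketch({}, cfg=None)  # abstract classes cannot be instantiated
except TypeError as exc:
    print("TypeError:", exc)
```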

class data_juicer.core.data.load_strategy.DataLoadStrategyRegistry

Bases: object

Flexible strategy registry with wildcard matching

classmethod get_strategy_class(executor_type: str, data_type: str, data_source: str) → Type[DataLoadStrategy] | None

Retrieve the most specific matching strategy

Matching priority:
  1. Exact match

  2. Wildcard matches, from most specific to most general
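
The priority order can be sketched as follows. This is an illustration under stated assumptions, not the registry's actual code: REGISTRY, specificity and lookup are hypothetical names, strategies are represented by plain strings, and "most specific" is assumed to mean "the key with the most wildcard-free fields".

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]

# Hypothetical registry contents; strings stand in for strategy classes.
REGISTRY: Dict[Key, str] = {
    ("default", "local", "*"): "LocalStrategy",
    ("default", "remote", "s3"): "S3Strategy",
    ("default", "*", "*"): "FallbackStrategy",
}


def specificity(key: Key) -> int:
    """Count the fields that contain no wildcard characters (assumed rule)."""
    return sum(1 for part in key if not any(ch in part for ch in "*?["))


def lookup(executor_type: str, data_type: str, data_source: str) -> Optional[str]:
    wanted = (executor_type, data_type, data_source)
    # 1. An exact match wins outright.
    if wanted in REGISTRY:
        return REGISTRY[wanted]
    # 2. Otherwise gather wildcard matches and pick the most specific one.
    candidates = [
        key
        for key in REGISTRY
        if all(fnmatchcase(value, pattern) for value, pattern in zip(wanted, key))
    ]
    if not candidates:
        return None
    return REGISTRY[max(candidates, key=specificity)]


print(lookup("default", "remote", "s3"))    # S3Strategy (exact match)
print(lookup("default", "local", "json"))   # LocalStrategy (two literal fields)
print(lookup("default", "remote", "http"))  # FallbackStrategy (one literal field)
print(lookup("ray", "local", "json"))       # None (nothing registered for 'ray')
```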

classmethod register(executor_type: str, data_type: str, data_source: str)

Decorator for registering data load strategies with wildcard support

Parameters:
  • executor_type – Type of executor (e.g., ‘default’, ‘ray’)

  • data_type – Type of data (e.g., ‘local’, ‘remote’)

  • data_source – Specific data source (e.g., ‘arxiv’, ‘s3’)

Returns:

Decorator function
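
A decorator-based registry of this shape can be sketched in a few lines. RegistrySketch and MyLocalStrategy are hypothetical names used for illustration; the sketch stores classes in a plain dictionary keyed by the three strings, which is an assumption about the storage, not a description of the library's internals.

```python
from typing import Callable, Dict, Tuple, Type

Key = Tuple[str, str, str]


class RegistrySketch:
    """Hypothetical miniature of a decorator-based strategy registry."""

    _strategies: Dict[Key, Type] = {}

    @classmethod
    def register(
        cls, executor_type: str, data_type: str, data_source: str
    ) -> Callable[[Type], Type]:
        def decorator(strategy_class: Type) -> Type:
            # Record the class under its key, then return it unchanged
            # so the decorated name still refers to the original class.
            cls._strategies[(executor_type, data_type, data_source)] = strategy_class
            return strategy_class

        return decorator


@RegistrySketch.register("default", "local", "*")
class MyLocalStrategy:
    """Hypothetical strategy registered for any local data source."""


print(RegistrySketch._strategies)
```

Returning the class unchanged from the decorator keeps it importable and subclassable as usual; registration is only a side effect of the class definition being executed.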

class data_juicer.core.data.load_strategy.RayDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DataLoadStrategy

Abstract class for data load strategies used with RayExecutor.

abstractmethod load_data(**kwargs) → DJDataset

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DataLoadStrategy

Abstract class for data load strategies used with LocalExecutor.

abstractmethod load_data(**kwargs) → DJDataset

Need to be implemented in the

class data_juicer.core.data.load_strategy.RayLocalJsonDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the
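
The CONFIG_VALIDATION_RULES dictionaries on the concrete strategies share the keys required_fields, field_types and custom_validators. A minimal sketch of how such rules could be applied to a dataset config follows. The validate function is hypothetical and returns a list of error messages; the behavior of the real ConfigValidator (for example, whether it raises an exception instead) is not documented here and is not assumed.

```python
from typing import Any, Dict, List

# Same shape as the rules shown for the strategies on this page.
RULES: Dict[str, Any] = {
    "required_fields": ["path"],
    "field_types": {"path": str},
    "custom_validators": {},
}


def validate(ds_config: Dict[str, Any], rules: Dict[str, Any]) -> List[str]:
    """Hypothetical checker: returns a list of problems, empty if none."""
    errors: List[str] = []
    # Every required field must be present.
    for name in rules.get("required_fields", []):
        if name not in ds_config:
            errors.append(f"missing required field: {name}")
    # Fields that are present must have the declared type.
    for name, expected in rules.get("field_types", {}).items():
        if name in ds_config and not isinstance(ds_config[name], expected):
            errors.append(f"field {name} must be of type {expected.__name__}")
    # Custom validators are callables that return a truthy value on success.
    for name, check in rules.get("custom_validators", {}).items():
        if name in ds_config and not check(ds_config[name]):
            errors.append(f"field {name} failed custom validation")
    return errors


print(validate({"path": "data.jsonl"}, RULES))  # []
print(validate({}, RULES))                      # ['missing required field: path']
```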

class data_juicer.core.data.load_strategy.RayHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for on-disk data with LocalExecutor. Relies on AutoFormatter for the actual data loading.

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for Huggingface datasets with LocalExecutor.

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['split', 'limit', 'name', 'data_files', 'data_dir'], 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultModelScopeDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for ModelScope datasets with LocalExecutor.

load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultArxivDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for arXiv datasets with LocalExecutor.

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultWikiDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for wiki datasets with LocalExecutor.

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultCommonCrawlDataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for Common Crawl datasets with LocalExecutor.

CONFIG_VALIDATION_RULES = {'custom_validators': {'end_snapshot': <function validate_snapshot_format>, 'start_snashot': <function validate_snapshot_format>, 'url_limit': <function DefaultCommonCrawlDataLoadStrategy.<lambda>>}, 'field_types': {'end_snapshot': <class 'str'>, 'start_snapshot': <class 'str'>}, 'optional_fields': ['aws', 'url_limit'], 'required_fields': ['start_snapshot', 'end_snapshot']}
load_data(**kwargs)

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultS3DataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: DefaultDataLoadStrategy

Data load strategy for S3 datasets with LocalExecutor. Uses fsspec/s3fs to access S3 files.

CONFIG_VALIDATION_RULES = {'custom_validators': {'path': <function DefaultS3DataLoadStrategy.<lambda>>}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'aws_region', 'endpoint_url'], 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the
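
To show how the optional S3 fields might be used, the sketch below turns a dataset config into keyword arguments of the kind s3fs.S3FileSystem accepts (key, secret, token, endpoint_url, client_kwargs). The helper s3fs_options is hypothetical, the mapping is an assumption about one reasonable translation rather than the strategy's actual code, and the "s3://" prefix check is a guess at what the custom path validator enforces. The function only builds a dictionary, so it runs without s3fs installed.

```python
from typing import Any, Dict


def s3fs_options(ds_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map config fields to s3fs-style keyword arguments."""
    path = ds_config.get("path", "")
    # Assumption: a valid path is a string that starts with the s3:// scheme.
    if not isinstance(path, str) or not path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {path!r}")

    # Config field name -> s3fs keyword argument name.
    mapping = {
        "aws_access_key_id": "key",
        "aws_secret_access_key": "secret",
        "aws_session_token": "token",
        "endpoint_url": "endpoint_url",
    }
    options: Dict[str, Any] = {}
    for cfg_name, fs_name in mapping.items():
        if ds_config.get(cfg_name):
            options[fs_name] = ds_config[cfg_name]
    # The region is passed through to the underlying client.
    if ds_config.get("aws_region"):
        options["client_kwargs"] = {"region_name": ds_config["aws_region"]}
    return options


example = {
    "path": "s3://example-bucket/data.jsonl",  # placeholder bucket name
    "aws_access_key_id": "EXAMPLE_KEY",        # placeholder credential
    "aws_region": "us-east-1",
}
print(s3fs_options(example))
```

Fields that are absent from the config are simply left out of the result, so the filesystem library can fall back to its own default credential lookup.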

class data_juicer.core.data.load_strategy.RayS3DataLoadStrategy(ds_config: Dict, cfg: Namespace)

Bases: RayDataLoadStrategy

Data load strategy for S3 datasets with RayExecutor. Uses PyArrow’s filesystem to read from S3.

CONFIG_VALIDATION_RULES = {'custom_validators': {'path': <function RayS3DataLoadStrategy.<lambda>>}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'aws_region', 'endpoint_url'], 'required_fields': ['path']}
load_data(**kwargs)

Need to be implemented in the