data_juicer_agents.tools.context
Context-oriented tools.
- class data_juicer_agents.tools.context.InspectDatasetInput(*, dataset_source: DatasetSource, sample_size: Annotated[int, Ge(ge=1)] = 20)
Bases: BaseModel
- dataset_source: DatasetSource
- sample_size: int
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class data_juicer_agents.tools.context.ListDatasetFieldsInput(*, filter_prefix: str | None = None, include_descriptions: bool = True)
Bases: BaseModel
Input for list_dataset_fields.
This tool lists all dataset-related configuration fields recognized by Data-Juicer, including their types, default values, and descriptions. Use it before build_dataset_spec to discover advanced dataset options such as export_type, export_shard_size, load_dataset_kwargs, suffixes, or modality special tokens.
- filter_prefix: str | None
- include_descriptions: bool
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class data_juicer_agents.tools.context.ListDatasetFormattersInput(*, include_ray: bool = True)
Bases: BaseModel
Input for list_dataset_formatters.
Discovers which dataset formatters (dynamic data generators) are available in the current Data-Juicer installation. Use this BEFORE build_dataset_spec when you need to configure the dataset_source.generated field for dynamic dataset generation (e.g., EmptyFormatter for creating empty datasets).
- include_ray: bool
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class data_juicer_agents.tools.context.ListDatasetLoadStrategiesInput(*, executor_type: str = 'default')
Bases: BaseModel
Input for list_dataset_load_strategies.
Discovers which dataset loading strategies are actually implemented in the current Data-Juicer installation. Use this BEFORE build_dataset_spec when you need to configure non-trivial dataset sources via dataset_source.config (e.g., remote S3, mixed weights). For a simple single local file, use dataset_source.path directly.
- executor_type: str
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class data_juicer_agents.tools.context.ListSystemConfigInput(*, filter_prefix: str | None = None, include_descriptions: bool = True)
Bases: BaseModel
Input for list_system_config.
This tool lists the complete system configuration from Data-Juicer, including all available parameters, their types, default values, and descriptions. Use it before build_system_spec to discover available configuration options.
- filter_prefix: str | None
- include_descriptions: bool
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- data_juicer_agents.tools.context.inspect_dataset_schema(dataset_source=None, sample_size: int = 20) -> Dict[str, Any]
Inspect a small sample of a dataset and infer its keys and modality for planning.
Accepts a DatasetSource object that encapsulates the dataset path and config. When dataset_source is None, returns a friendly error dict instead of raising.
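The sample-then-infer idea can be illustrated with a small standard-library sketch. It does not call inspect_dataset_schema; the function name sketch_inspect, the in-memory records, and the "ok", "error", "sampled", and "keys" dict keys are all assumptions made for illustration and may differ from what the real tool returns.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional


def sketch_inspect(
    records: Optional[Iterable[Dict[str, Any]]], sample_size: int = 20
) -> Dict[str, Any]:
    """Hypothetical stand-in: sample a few records and collect the keys seen."""
    if records is None:
        # Mirror the documented behavior: return an error dict instead of raising.
        # The key names used here are assumptions, not the library's schema.
        return {"ok": False, "error": "dataset_source is required"}
    if sample_size < 1:
        # Mirrors the ge=1 constraint declared on InspectDatasetInput.sample_size.
        return {"ok": False, "error": "sample_size must be >= 1"}

    # Read at most sample_size records, then take the union of their keys.
    sample = list(islice(records, sample_size))
    keys = sorted({key for record in sample for key in record})
    return {"ok": True, "sampled": len(sample), "keys": keys}


# Hypothetical records standing in for a real dataset.
rows = [
    {"text": "hello"},
    {"text": "world", "meta": {"lang": "en"}},
    {"image": "a.png"},
]

print(sketch_inspect(rows, sample_size=2))
print(sketch_inspect(None))
```

Because only the first sample_size records are read, keys that appear later in the dataset (here, "image") are not reported when the sample is smaller than the dataset.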
- data_juicer_agents.tools.context.list_dataset_fields(*, filter_prefix: str | None = None, include_descriptions: bool = True) -> Dict[str, Any]
List dataset-related configuration fields from Data-Juicer.
This function lists all available dataset configuration parameters from Data-Juicer, including their types, default values, and descriptions.
- Parameters:
filter_prefix -- Optional filter to show only parameters matching this prefix
include_descriptions -- Whether to include parameter descriptions
- Returns:
Dict containing configuration information and available parameters
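As a rough illustration of how filter_prefix and include_descriptions might shape the result, the sketch below filters a hand-written parameter table. It assumes the prefix is matched against the start of each parameter name; the function filter_params and the placeholder metadata are hypothetical, and only the field names are taken from the documentation above.

```python
from typing import Any, Dict, Optional

# Placeholder metadata: the real types, defaults, and descriptions are not shown here.
PLACEHOLDER = {"type": "<type>", "default": "<default>", "description": "<description>"}

# Field names mentioned in the documentation; the values are illustrative only.
PARAMS: Dict[str, Dict[str, Any]] = {
    name: dict(PLACEHOLDER)
    for name in ("export_type", "export_shard_size", "load_dataset_kwargs", "suffixes")
}


def filter_params(
    params: Dict[str, Dict[str, Any]],
    filter_prefix: Optional[str] = None,
    include_descriptions: bool = True,
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: keep parameters whose name starts with the prefix."""
    selected: Dict[str, Dict[str, Any]] = {}
    for name, info in params.items():
        if filter_prefix is not None and not name.startswith(filter_prefix):
            continue
        entry = dict(info)  # copy so the source table is left untouched
        if not include_descriptions:
            entry.pop("description", None)
        selected[name] = entry
    return selected


print(list(filter_params(PARAMS, filter_prefix="export_")))
print(filter_params(PARAMS, filter_prefix="suffixes", include_descriptions=False))
```

Dropping descriptions keeps the listing compact when only names, types, and defaults are needed.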
- data_juicer_agents.tools.context.list_dataset_formatters(*, include_ray: bool = True) -> Dict[str, Any]
List available dataset formatters from Data-Juicer.
Discovers which dataset formatters (dynamic data generators) are registered in the current Data-Juicer installation by comparing OPSearcher results with and without formatter inclusion.
- Parameters:
include_ray -- Whether to include Ray-specific formatters.
- Returns:
Dict with 'formatters' list and metadata.
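The "with and without" comparison described above amounts to a set difference. The sketch below shows only that step on plain lists; it does not call OPSearcher, and the operator names other than EmptyFormatter are made up for illustration.

```python
from typing import Iterable, List


def formatters_by_difference(
    with_formatters: Iterable[str], without_formatters: Iterable[str]
) -> List[str]:
    """Return names present only when formatters are included, keeping their order."""
    excluded = set(without_formatters)
    return [name for name in with_formatters if name not in excluded]


# Hypothetical search results; only EmptyFormatter is named in the documentation.
all_ops = ["some_mapper", "some_filter", "EmptyFormatter"]
ops_without_formatters = ["some_mapper", "some_filter"]

print(formatters_by_difference(all_ops, ops_without_formatters))
```

Deriving the list by difference avoids hard-coding formatter names, so the result follows whatever the installed version registers.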
- data_juicer_agents.tools.context.list_dataset_load_strategies(*, executor_type: str = 'default') -> Dict[str, Any]
List the dataset load strategies that are actually implemented in Data-Juicer.
Uses dynamic source-code inspection to filter out placeholder strategies that raise NotImplementedError, ensuring the returned list reflects what actually works at runtime.
- Parameters:
executor_type -- Filter by executor type ('default', 'ray', or '*' for all).
- Returns:
Dict with 'strategies' list and metadata.
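One way to detect placeholder strategies without calling them is sketched below. Note the substitution: instead of reading source text as the documentation describes, this sketch checks the names referenced by a method's compiled bytecode, which also works when source files are unavailable. The strategy classes and the load_data method name are hypothetical and not taken from Data-Juicer.

```python
from typing import Dict, List


class LocalStrategy:
    """Hypothetical strategy with a real implementation."""

    def load_data(self) -> List[str]:
        return ["record"]


class RemotePlaceholderStrategy:
    """Hypothetical strategy that is declared but not implemented."""

    def load_data(self) -> List[str]:
        raise NotImplementedError("remote loading is not available yet")


def is_placeholder(method) -> bool:
    # co_names lists the global and attribute names the function body refers to,
    # so a body that raises NotImplementedError will include that name.
    return "NotImplementedError" in method.__code__.co_names


def implemented_strategies(
    strategies: Dict[str, type], method_name: str = "load_data"
) -> List[str]:
    """Keep only the strategies whose method does not reference NotImplementedError."""
    return [
        name
        for name, cls in strategies.items()
        if not is_placeholder(getattr(cls, method_name))
    ]


STRATEGIES = {"local": LocalStrategy, "remote": RemotePlaceholderStrategy}
print(implemented_strategies(STRATEGIES))
```

This heuristic is conservative: a method that merely mentions NotImplementedError, for example in an except clause, would also be treated as a placeholder.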
- data_juicer_agents.tools.context.list_system_config(*, filter_prefix: str | None = None, include_descriptions: bool = True) -> Dict[str, Any]
List system configuration from Data-Juicer.
This function lists all available system configuration parameters from Data-Juicer, including their types, default values, and descriptions.
- Parameters:
filter_prefix -- Optional filter to show only parameters matching this prefix
include_descriptions -- Whether to include parameter descriptions
- Returns:
Dict containing configuration information and available parameters