data_juicer_agents.tools.context

Context-oriented tools.

class data_juicer_agents.tools.context.InspectDatasetInput(*, dataset_source: DatasetSource, sample_size: Annotated[int, Ge(ge=1)] = 20)[source]

Bases: BaseModel

dataset_source: DatasetSource
sample_size: int
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
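The `Ge(ge=1)` annotation in the signature above constrains `sample_size` to integers of at least 1, with a default of 20. The following is a minimal stdlib sketch of that constraint, not the actual pydantic model: the class name `InspectDatasetInputSketch` is hypothetical, and `dataset_source` is typed as `Any` with a placeholder value because the `DatasetSource` type is not described here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class InspectDatasetInputSketch:
    """Hypothetical stdlib stand-in mirroring the documented fields."""

    dataset_source: Any      # the real model expects a DatasetSource
    sample_size: int = 20    # documented default

    def __post_init__(self) -> None:
        # Mimic the documented Ge(ge=1) constraint on sample_size.
        if not isinstance(self.sample_size, int) or self.sample_size < 1:
            raise ValueError("sample_size must be an integer >= 1")


# Placeholder source value; the real field layout of DatasetSource is assumed.
ok = InspectDatasetInputSketch(dataset_source={"path": "data.jsonl"})
print(ok.sample_size)

try:
    InspectDatasetInputSketch(dataset_source={"path": "data.jsonl"}, sample_size=0)
except ValueError as exc:
    print(f"rejected: {exc}")
```

In the real model, pydantic performs this check during validation, so no hand-written `__post_init__` is needed.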

class data_juicer_agents.tools.context.ListDatasetFieldsInput(*, filter_prefix: str | None = None, include_descriptions: bool = True)[source]

Bases: BaseModel

Input for list_dataset_fields.

This tool lists all dataset-related configuration fields recognized by Data-Juicer, including their types, default values, and descriptions. Use this before build_dataset_spec to discover advanced dataset options such as export_type, export_shard_size, load_dataset_kwargs, suffixes, or modality special tokens.

filter_prefix: str | None
include_descriptions: bool
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class data_juicer_agents.tools.context.ListDatasetFormattersInput(*, include_ray: bool = True)[source]

Bases: BaseModel

Input for list_dataset_formatters.

Discovers which dataset formatters (dynamic data generators) are available in the current Data-Juicer installation. Use this BEFORE build_dataset_spec when you need to configure the dataset_source.generated field for dynamic dataset generation (e.g., EmptyFormatter for creating empty datasets).

include_ray: bool
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class data_juicer_agents.tools.context.ListDatasetLoadStrategiesInput(*, executor_type: str = 'default')[source]

Bases: BaseModel

Input for list_dataset_load_strategies.

Discovers which dataset loading strategies are truly implemented in the current Data-Juicer installation. Use this BEFORE build_dataset_spec when you need to configure non-trivial dataset sources via dataset_source.config (e.g., remote S3, mixed weights). For simple single local files, use dataset_source.path directly.

executor_type: str
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class data_juicer_agents.tools.context.ListSystemConfigInput(*, filter_prefix: str | None = None, include_descriptions: bool = True)[source]

Bases: BaseModel

Input for list_system_config.

This tool lists the complete system configuration from Data-Juicer, including all available parameters, their types, default values, and descriptions. Use this before build_system_spec to discover available configuration options.

filter_prefix: str | None
include_descriptions: bool
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

data_juicer_agents.tools.context.inspect_dataset_schema(dataset_source=None, sample_size: int = 20) → Dict[str, Any][source]

Inspect a small sample of a dataset and infer keys/modality for planning.

Accepts a DatasetSource object that encapsulates the dataset path and config. When dataset_source is None, returns a friendly error dict instead of raising.
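The sampling behavior described above can be illustrated with a small stdlib sketch. It is not the library's implementation: the function name `inspect_schema_sketch`, the JSONL input, and the result keys (`ok`, `error`, `sampled`, `keys`) are all assumptions, and modality inference is left out because its rules are not described here. The sketch reads at most `sample_size` lines, collects the keys it sees, and returns an error dict rather than raising when no source is given.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Optional, Set


def inspect_schema_sketch(
    lines: Optional[Iterable[str]], sample_size: int = 20
) -> Dict[str, Any]:
    """Hypothetical stand-in: sample JSONL lines and collect the keys seen."""
    if lines is None:
        # Mirror the documented behavior: report the problem, do not raise.
        return {"ok": False, "error": "dataset_source is required"}

    keys: Set[str] = set()
    sampled = 0
    # Look at no more than sample_size lines from the input.
    for line in islice(lines, sample_size):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        sampled += 1
        keys.update(record.keys())
    return {"ok": True, "sampled": sampled, "keys": sorted(keys)}


rows = ['{"text": "a", "meta": {}}', '{"text": "b", "images": []}']
print(inspect_schema_sketch(rows, sample_size=2))
print(inspect_schema_sketch(None))
```

Returning a dict in the error case lets a calling agent read the message and retry with a corrected argument instead of handling an exception.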

data_juicer_agents.tools.context.list_dataset_fields(*, filter_prefix: str | None = None, include_descriptions: bool = True) → Dict[str, Any][source]

List dataset-related configuration fields from Data-Juicer.

This function lists all available dataset configuration parameters from Data-Juicer, including their types, default values, and descriptions.

Parameters:
  • filter_prefix -- Optional filter to show only parameters matching this prefix

  • include_descriptions -- Whether to include parameter descriptions

Returns:

Dict containing configuration information and available parameters
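To make the two parameters concrete, here is a hedged sketch of how such filtering could work. The field names come from this page, but the catalog contents (defaults and descriptions), the function name `list_fields_sketch`, the result keys, and the assumption that "matching this prefix" means `str.startswith` are all illustrative, not the library's actual behavior.

```python
from typing import Any, Dict, Optional

# Hypothetical catalog: names are taken from this page, while defaults and
# descriptions are placeholders rather than real Data-Juicer values.
FIELDS: Dict[str, Dict[str, Any]] = {
    "export_type": {"default": None, "description": "placeholder"},
    "export_shard_size": {"default": None, "description": "placeholder"},
    "load_dataset_kwargs": {"default": None, "description": "placeholder"},
    "suffixes": {"default": None, "description": "placeholder"},
}


def list_fields_sketch(
    *, filter_prefix: Optional[str] = None, include_descriptions: bool = True
) -> Dict[str, Any]:
    """Select catalog entries by name prefix, optionally dropping descriptions."""
    selected: Dict[str, Dict[str, Any]] = {}
    for name, info in FIELDS.items():
        # Assumption: a field matches when its name starts with the prefix.
        if filter_prefix and not name.startswith(filter_prefix):
            continue
        entry = dict(info)  # copy so the catalog itself is not modified
        if not include_descriptions:
            entry.pop("description", None)
        selected[name] = entry
    return {"count": len(selected), "fields": selected}


result = list_fields_sketch(filter_prefix="export", include_descriptions=False)
print(sorted(result["fields"]))
```

Dropping descriptions keeps the response short when only parameter names and defaults are needed.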

data_juicer_agents.tools.context.list_dataset_formatters(*, include_ray: bool = True) → Dict[str, Any][source]

List available dataset formatters from Data-Juicer.

Discovers which dataset formatters (dynamic data generators) are registered in the current Data-Juicer installation by comparing OPSearcher results with and without formatter inclusion.

Parameters:

include_ray -- Whether to include Ray-specific formatters.

Returns:

Dict with 'formatters' list and metadata.
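The "with and without formatter inclusion" comparison described above amounts to a set difference. The sketch below shows only that idea: `search_ops_sketch` is a hypothetical stand-in, not the real OPSearcher API, and the operator names other than `EmptyFormatter` are placeholders.

```python
from typing import List, Set


def search_ops_sketch(include_formatter: bool) -> Set[str]:
    """Hypothetical registry search; stands in for the real lookup."""
    ops = {"some_mapper", "some_filter"}  # placeholder operator names
    if include_formatter:
        ops |= {"EmptyFormatter"}
    return ops


def formatters_by_difference() -> List[str]:
    """Names that appear only when formatters are included in the search."""
    with_formatters = search_ops_sketch(include_formatter=True)
    without_formatters = search_ops_sketch(include_formatter=False)
    return sorted(with_formatters - without_formatters)


print(formatters_by_difference())  # ['EmptyFormatter']
```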

data_juicer_agents.tools.context.list_dataset_load_strategies(*, executor_type: str = 'default') → Dict[str, Any][source]

List truly implemented dataset load strategies from Data-Juicer.

Uses dynamic source-code inspection to filter out placeholder strategies that raise NotImplementedError, ensuring the returned list reflects what actually works at runtime.

Parameters:

executor_type -- Filter by executor type ('default', 'ray', or '*' for all).

Returns:

Dict with 'strategies' list and metadata.
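One way to detect placeholder strategies by source inspection is to parse the source with the standard `ast` module and flag methods whose body is only `raise NotImplementedError`. The sketch below illustrates that general technique under stated assumptions; it is not the library's actual implementation. The class names and the `load_data` method name are hypothetical, and the source is supplied as a string, whereas live classes would typically be read with `inspect.getsource`.

```python
import ast
from typing import List

# Hypothetical strategy classes, given as source text for the example.
SOURCE = '''
class LocalStrategy:
    def load_data(self):
        return ["row"]


class PlaceholderStrategy:
    def load_data(self):
        """Not available yet."""
        raise NotImplementedError("not supported yet")
'''


def _is_placeholder(func: ast.FunctionDef) -> bool:
    """True if the body is only `raise NotImplementedError`, after any docstring."""
    body = list(func.body)
    first = body[0] if body else None
    # Skip a leading docstring expression, if present.
    if (
        isinstance(first, ast.Expr)
        and isinstance(first.value, ast.Constant)
        and isinstance(first.value.value, str)
    ):
        body = body[1:]
    if len(body) != 1 or not isinstance(body[0], ast.Raise):
        return False
    exc = body[0].exc
    if isinstance(exc, ast.Call):  # raise NotImplementedError("...")
        exc = exc.func
    return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"


def implemented_classes(source: str, method: str = "load_data") -> List[str]:
    """Top-level classes that define `method` with a non-placeholder body."""
    names: List[str] = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        methods = [
            n for n in node.body
            if isinstance(n, ast.FunctionDef) and n.name == method
        ]
        if methods and not any(_is_placeholder(m) for m in methods):
            names.append(node.name)
    return names


print(implemented_classes(SOURCE))  # ['LocalStrategy']
```

A static check like this avoids calling each strategy just to see whether it fails, at the cost of missing placeholders that raise indirectly.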

data_juicer_agents.tools.context.list_system_config(*, filter_prefix: str | None = None, include_descriptions: bool = True) → Dict[str, Any][source]

List system configuration from Data-Juicer.

This function lists all available system configuration parameters from Data-Juicer, including their types, default values, and descriptions.

Parameters:
  • filter_prefix -- Optional filter to show only parameters matching this prefix

  • include_descriptions -- Whether to include parameter descriptions

Returns:

Dict containing configuration information and available parameters