
Data Juicer


data_juicer.tools.op_search module

Operator Searcher - A tool for filtering and searching Data-Juicer operators

data_juicer.tools.op_search.get_source_path(cls)

data_juicer.tools.op_search.find_test_by_searching_content(tests_dir, test_class_name)

Fallback: brute-force search for test files containing the test class name.

data_juicer.tools.op_search.analyze_modality_tag(code, op_prefix)

Analyze the modality tag for the given code content string. The result should be one of the "Modality Tags" in tagging_mappings.json. The choice is made by finding usages of the {modality}_key attributes and by checking the prefix of the OP name. If multiple modality keys are used, the 'multimodal' tag is returned instead.
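
The detection rule described above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the function name `guess_modality_tag`, the `MODALITIES` tuple, the `self.<modality>_key` access pattern, and the use of the prefix only as a fallback are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical modality list; the real set is defined in tagging_mappings.json.
MODALITIES = ("text", "image", "audio", "video")


def guess_modality_tag(code: str, op_prefix: str) -> Optional[str]:
    """Guess a modality tag from `self.<modality>_key` usages, then the OP prefix."""
    used = {m for m in MODALITIES if re.search(rf"\bself\.{m}_key\b", code)}
    if len(used) > 1:
        # Several modality keys are referenced, so the OP is multimodal.
        return "multimodal"
    if len(used) == 1:
        return next(iter(used))
    # Assumed fallback: rely on the OP name prefix when no key is referenced.
    return op_prefix if op_prefix in MODALITIES else None
```

For instance, `guess_modality_tag("a = self.image_key; b = self.text_key", "image_text")` yields `"multimodal"` under these assumptions.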

data_juicer.tools.op_search.analyze_resource_tag(cls)

Analyze resource tags by reading the class attribute _accelerator. The result should be one of the "Resource Tags" in tagging_mappings.json. The choice is made according to the statement that assigns the _accelerator attribute.

data_juicer.tools.op_search.analyze_model_tags(cls)

Analyze the model tags for the given class. Each result should be one of the "Model Tags" in tagging_mappings.json. The choice is made by finding the model_type argument in invocations of the prepare_model method.

data_juicer.tools.op_search.analyze_tag_with_inheritance(op_cls, analyze_func, default_tags=None, other_parm=None)

General-purpose function for analyzing tags along an operator's inheritance chain.
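
One plausible way to analyze tags along an inheritance chain is to walk the class's method resolution order and take the first non-empty result. The sketch below assumes exactly that; the "first non-empty wins" rule, the helper `own_accelerator`, and the demo classes are hypothetical and may differ from what the module actually does (for example, it might merge results instead).

```python
def analyze_with_inheritance(op_cls, analyze_func, default_tags=None):
    """Return the first non-empty analyze_func result along op_cls's MRO."""
    for cls in op_cls.__mro__:
        if cls is object:
            break  # nothing useful to analyze on the root type
        tags = analyze_func(cls)
        if tags:
            return tags
    return default_tags if default_tags is not None else []


def own_accelerator(cls):
    """Hypothetical analyzer: look only at attributes defined directly on cls."""
    value = vars(cls).get("_accelerator")
    return [value] if value else []


class DemoBase:
    _accelerator = "cuda"


class DemoChild(DemoBase):
    pass  # defines nothing itself, so the analysis falls through to DemoBase


class DemoPlain:
    pass  # no _accelerator anywhere in the chain


inherited = analyze_with_inheritance(DemoChild, own_accelerator, default_tags=["cpu"])
fallback = analyze_with_inheritance(DemoPlain, own_accelerator, default_tags=["cpu"])
```

Here `inherited` comes from `DemoBase`, while `fallback` uses the default because no class in the chain assigns the attribute.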

data_juicer.tools.op_search.analyze_tag_from_cls(op_cls, op_name)

Analyze the tags for the OP from the given cls.

data_juicer.tools.op_search.extract_param_docstring(docstring)

Extract parameter descriptions from __init__ method docstring.
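
As an illustration of this kind of extraction, the sketch below parses Sphinx-style `:param name:` fields into a dictionary. The function name `extract_params`, the dict return type, and the handling of wrapped lines are assumptions for the example; the actual return format of `extract_param_docstring` is not stated here.

```python
import re
from typing import Dict, Optional


def extract_params(docstring: Optional[str]) -> Dict[str, str]:
    """Map each `:param name:` entry to its description, joining wrapped lines."""
    params: Dict[str, str] = {}
    current = None
    for raw in (docstring or "").splitlines():
        line = raw.strip()
        match = re.match(r":param\s+(\w+):\s*(.*)", line)
        if match:
            current = match.group(1)
            params[current] = match.group(2)
        elif line.startswith(":") or not line:
            # Another field (e.g. :return:) or a blank line ends the entry.
            current = None
        elif current is not None:
            # Continuation line of the current parameter description.
            params[current] = (params[current] + " " + line).strip()
    return params


DEMO_DOC = """
Initialization method.

:param min_len: The minimum text length
    to keep a sample.
:param max_len: The maximum text length.
:return: nothing
"""
demo_params = extract_params(DEMO_DOC)
```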

class data_juicer.tools.op_search.OPRecord(name: str, op_cls: type, op_type: str | None = None)

Bases: object

A record class for storing operator metadata

__init__(name: str, op_cls: type, op_type: str | None = None)
to_dict()

class data_juicer.tools.op_search.OPSearcher(specified_op_list: List[str] | None = None, include_formatter: bool = False)

Bases: object

Operator search engine

__init__(specified_op_list: List[str] | None = None, include_formatter: bool = False)

search(tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]

Search operators by tag and type criteria.

Parameters:
  • tags -- List of tags to match

  • op_type -- Operator type (mapper, filter, etc.)

  • match_all -- True requires matching all tags, False matches any

Returns:

List of matched operator record dicts
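
The difference between `match_all=True` and `match_all=False` can be illustrated with a small stand-alone filter over plain dicts. This is a sketch only: the record layout (`name`, `type`, `tags` keys), the demo operator names, and the case-insensitive comparison are assumptions, not the behavior of `OPSearcher.search` itself.

```python
from typing import Dict, List, Optional

# Hypothetical records; real ones would come from OPRecord.to_dict().
DEMO_RECORDS = [
    {"name": "demo_text_filter", "type": "filter", "tags": ["text", "cpu"]},
    {"name": "demo_image_mapper", "type": "mapper", "tags": ["image", "cpu"]},
    {"name": "demo_multimodal_filter", "type": "filter", "tags": ["multimodal", "gpu"]},
]


def search_records(
    records: List[Dict],
    tags: Optional[List[str]] = None,
    op_type: Optional[str] = None,
    match_all: bool = True,
) -> List[Dict]:
    """Filter records by operator type and by all or any of the given tags."""
    results = []
    for rec in records:
        if op_type and rec["type"] != op_type:
            continue
        if tags:
            wanted = {t.lower() for t in tags}
            have = {t.lower() for t in rec["tags"]}
            # match_all: every wanted tag must be present; otherwise one is enough.
            matched = wanted <= have if match_all else bool(wanted & have)
            if not matched:
                continue
        results.append(rec)
    return results


all_match = search_records(DEMO_RECORDS, tags=["text", "cpu"])
any_match = search_records(DEMO_RECORDS, tags=["text", "gpu"], match_all=False)
```

With these demo records, `all_match` contains only the record tagged with both `text` and `cpu`, while `any_match` contains every record carrying at least one of `text` or `gpu`.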

search_by_regex(query: str, fields: List[str] | None = None, tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]

Search operators using a Python regex pattern.

The pattern is matched against the specified fields of each operator. If the query is not a valid regex, an empty list is returned.

Parameters:
  • query -- Regex pattern to search for

  • fields -- List of OPRecord fields to search in. Defaults to ["name", "desc", "param_desc"]

  • tags -- Optional tag filter applied before regex search

  • op_type -- Optional type filter applied before regex search

  • match_all -- Tag matching mode (all vs any)

Returns:

List of matched operator record dicts
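
The regex behavior described above, including the empty result for an invalid pattern, can be sketched over plain dicts as follows. The helper name `regex_search`, the demo records, and the choice of `pattern.search` without flags are assumptions for illustration; the tag and type pre-filters are omitted for brevity.

```python
import re
from typing import Dict, List, Optional

# Hypothetical records with the default searchable fields.
DEMO_OPS = [
    {"name": "demo_text_filter", "desc": "Keep samples within a length range", "param_desc": "min_len max_len"},
    {"name": "demo_image_mapper", "desc": "Blur images in each sample", "param_desc": "radius"},
]


def regex_search(records: List[Dict], query: str, fields: Optional[List[str]] = None) -> List[Dict]:
    """Return records where the pattern matches any of the selected fields."""
    fields = fields or ["name", "desc", "param_desc"]
    try:
        pattern = re.compile(query)
    except re.error:
        # An invalid regex yields an empty result instead of raising.
        return []
    return [
        rec
        for rec in records
        if any(pattern.search(str(rec.get(field, ""))) for field in fields)
    ]


by_name = regex_search(DEMO_OPS, r"^demo_text")
by_desc = regex_search(DEMO_OPS, r"[Bb]lur", fields=["desc"])
invalid = regex_search(DEMO_OPS, "(")
```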

search_by_bm25(query: str, fields: List[str] | None = None, top_k: int = 10, score_threshold: float = 0.0, tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]

Search operators using BM25 keyword matching via rank_bm25.

Uses the BM25Okapi algorithm from the rank_bm25 library to rank operators by relevance to a natural language query. The index is built lazily on first call and cached for subsequent queries.

Parameters:
  • query -- Natural language query string

  • fields -- List of OPRecord fields to index. Defaults to ["name", "desc", "param_desc"]

  • top_k -- Maximum number of results to return

  • score_threshold -- Minimum BM25 score to include a result. Results with scores at or below this threshold are excluded. Defaults to 0.0.

  • tags -- Optional tag filter applied before BM25 ranking

  • op_type -- Optional type filter applied before BM25 ranking

  • match_all -- Tag matching mode (all vs any)

Returns:

List of matched operator record dicts, sorted by BM25 score descending
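
To make the roles of `top_k` and `score_threshold` concrete, here is a self-contained sketch of BM25-style ranking. It deliberately does not use rank_bm25: the scoring function is a simplified pure-Python stand-in with a non-negative IDF variant, so its scores will differ from BM25Okapi. The helper names, whitespace tokenization, and demo records are assumptions, and unlike the lazily built cached index described above, this sketch rebuilds its statistics on every call.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def bm25_scores(corpus: List[List[str]], query: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query."""
    n_docs = len(corpus)
    avgdl = (sum(len(doc) for doc in corpus) / n_docs) or 1.0
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, not rank_bm25's formula).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def bm25_search(
    records: List[Dict],
    query: str,
    fields: Optional[List[str]] = None,
    top_k: int = 10,
    score_threshold: float = 0.0,
) -> List[Dict]:
    """Rank records by score, drop scores at or below the threshold, keep top_k."""
    if not records:
        return []
    fields = fields or ["name", "desc", "param_desc"]
    corpus = [" ".join(str(r.get(f, "")) for f in fields).lower().split() for r in records]
    scores = bm25_scores(corpus, query.lower().split())
    ranked = sorted(zip(scores, records), key=lambda pair: pair[0], reverse=True)
    return [rec for score, rec in ranked if score > score_threshold][:top_k]


# Hypothetical records for the demo.
DEMO_INDEX = [
    {"name": "demo_text_length_filter", "desc": "keep samples whose text length is within a range"},
    {"name": "demo_image_blur_mapper", "desc": "blur images in each sample"},
    {"name": "demo_language_filter", "desc": "keep samples written in a given language"},
]
blur_hits = bm25_search(DEMO_INDEX, "blur images")
keep_hits = bm25_search(DEMO_INDEX, "keep samples")
```

In this demo, both filter records contain the query terms "keep" and "samples", and the shorter document ranks first because of BM25's length normalization.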

property records_map

data_juicer.tools.op_search.main(query, tags, op_type)

© Copyright 2024, Data-Juicer Team.