data_juicer.tools.op_search module
Operator Searcher - A tool for filtering and searching Data-Juicer operators
- data_juicer.tools.op_search.find_test_by_searching_content(tests_dir, test_class_name)
Fallback: brute-force search for test files containing the test class name.
- data_juicer.tools.op_search.analyze_modality_tag(code, op_prefix)
Analyze the modality tag for the given code content string. The result should be one of the "Modality Tags" in tagging_mappings.json. The choice is made by finding usages of the {modality}_key attributes and by checking the prefix of the OP name. If multiple modality keys are used, the 'multimodal' tag is returned instead.
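The rule above can be sketched roughly as follows. This is an illustration, not the module's implementation: the function name `guess_modality_tag`, the modality list, the precedence of keys over the name prefix, and the fallback value are all assumptions made for the example.

```python
import re

# Hypothetical modality list; the real tags are defined in tagging_mappings.json.
MODALITIES = ("text", "image", "audio", "video")


def guess_modality_tag(code: str, op_prefix: str) -> str:
    """Guess a modality tag from `{modality}_key` usages and the OP name prefix."""
    # Collect every modality whose `<modality>_key` attribute appears in the code.
    used = {m for m in MODALITIES if re.search(rf"\b{m}_key\b", code)}
    if len(used) > 1:
        return "multimodal"
    if len(used) == 1:
        return used.pop()
    # Assumption: with no key usage, fall back to the OP name prefix,
    # and to "text" when the prefix is not a known modality.
    return op_prefix if op_prefix in MODALITIES else "text"
```

For instance, code that reads both `self.image_key` and `self.text_key` would be tagged 'multimodal' under this sketch, whatever the prefix.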
- data_juicer.tools.op_search.analyze_resource_tag(cls)
Analyze resource tags by reading the class attribute _accelerator. The result should be one of the "Resource Tags" in tagging_mappings.json. The choice is made according to the value assigned to the _accelerator attribute.
- data_juicer.tools.op_search.analyze_model_tags(cls)
Analyze the model tags for the given class. The result should be one of the "Model Tags" in tagging_mappings.json. The choice is made by finding the model_type argument in the prepare_model method invocation.
- data_juicer.tools.op_search.analyze_tag_with_inheritance(op_cls, analyze_func, default_tags=None, other_parm=None)
Generic tag analysis function that takes the operator's inheritance chain into account.
- data_juicer.tools.op_search.analyze_tag_from_cls(op_cls, op_name)
Analyze the tags for the OP from the given cls.
- data_juicer.tools.op_search.extract_param_docstring(docstring)
Extract parameter descriptions from __init__ method docstring.
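A minimal sketch of this kind of extraction is shown below. It assumes Sphinx-style `:param name: description` fields; the helper name `extract_params` and the returned dict shape are hypothetical and may differ from what extract_param_docstring actually returns.

```python
import re


def extract_params(docstring: str) -> dict:
    """Collect `:param name: description` entries from a docstring."""
    params = {}
    current = None  # name of the parameter whose description is being read
    for raw in (docstring or "").splitlines():
        line = raw.strip()
        match = re.match(r":param\s+(\w+):\s*(.*)", line)
        if match:
            current = match.group(1)
            params[current] = match.group(2)
        elif line.startswith(":"):
            # Another field such as :return: ends the current parameter.
            current = None
        elif current and line:
            # Continuation line of a multi-line description.
            params[current] = (params[current] + " " + line).strip()
    return params
```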
- class data_juicer.tools.op_search.OPRecord(name: str, op_cls: type, op_type: str | None = None)
Bases: object
A record class for storing operator metadata
- class data_juicer.tools.op_search.OPSearcher(specified_op_list: List[str] | None = None, include_formatter: bool = False)
Bases: object
Operator search engine
- search(tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]
Search operators by tag and type criteria.
- Parameters:
tags – List of tags to match
op_type – Operator type (mapper/filter/etc)
match_all – True requires matching all tags, False matches any
- Returns:
List of matched operator record dicts
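To make the match_all semantics concrete, here is a small stand-alone sketch of tag and type filtering over plain dicts. It does not call OPSearcher; the function `filter_records` and the record keys `type` and `tags` are assumptions chosen for the example.

```python
def filter_records(records, tags=None, op_type=None, match_all=True):
    """Filter record dicts by operator type and tags (illustrative only)."""
    results = []
    for rec in records:
        if op_type and rec["type"] != op_type:
            continue
        if tags:
            wanted, present = set(tags), set(rec["tags"])
            # match_all=True: every requested tag must be present.
            # match_all=False: one shared tag is enough.
            matched = wanted <= present if match_all else bool(wanted & present)
            if not matched:
                continue
        results.append(rec)
    return results
```

With match_all=True the tag list acts as an AND filter, and with match_all=False it acts as an OR filter.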
- search_by_regex(query: str, fields: List[str] | None = None, tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]
Search operators using a Python regex pattern.
The pattern is matched against the specified fields of each operator. If the query is not a valid regex, an empty list is returned.
- Parameters:
query – Regex pattern to search for
fields – List of OPRecord fields to search in. Defaults to ["name", "desc", "param_desc"]
tags – Optional tag filter applied before regex search
op_type – Optional type filter applied before regex search
match_all – Tag matching mode (all vs any)
- Returns:
List of matched operator record dicts
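The behavior described for search_by_regex, including the empty result for an invalid pattern, can be approximated with the standard `re` module. The sketch below works on plain dicts and is not the library's code; the name `regex_search` is hypothetical, and matching is case-sensitive here because the documentation does not say whether the real method ignores case.

```python
import re


def regex_search(records, query, fields=None):
    """Return records where any selected field matches the regex (illustrative)."""
    fields = fields or ["name", "desc", "param_desc"]
    try:
        pattern = re.compile(query)
    except re.error:
        # Mirror the documented behavior: an invalid regex yields no results.
        return []
    return [
        rec
        for rec in records
        if any(pattern.search(str(rec.get(field, ""))) for field in fields)
    ]
```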
- search_by_bm25(query: str, fields: List[str] | None = None, top_k: int = 10, score_threshold: float = 0.0, tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]
Search operators using BM25 keyword matching via rank_bm25.
Uses the BM25Okapi algorithm from the rank_bm25 library to rank operators by relevance to a natural language query. The index is built lazily on first call and cached for subsequent queries.
- Parameters:
query – Natural language query string
fields – List of OPRecord fields to index. Defaults to ["name", "desc", "param_desc"]
top_k – Maximum number of results to return
score_threshold – Minimum BM25 score to include a result. Results with scores at or below this threshold are excluded. Defaults to 0.0.
tags – Optional tag filter applied before BM25 ranking
op_type – Optional type filter applied before BM25 ranking
match_all – Tag matching mode (all vs any)
- Returns:
List of matched operator record dicts, sorted by BM25 score descending
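The lazy indexing, top_k cut-off, and score_threshold behavior can be illustrated with a small pure-Python BM25 scorer. This sketch deliberately avoids rank_bm25 so that it runs without extra dependencies; the class `TinyBM25Index`, its whitespace tokenizer, the k1 and b defaults, and the non-negative IDF formula are assumptions and will not reproduce the library's exact scores.

```python
import math
from collections import Counter


class TinyBM25Index:
    """Minimal BM25 index over short text documents (illustrative only)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self._stats = None  # built lazily on the first search, then reused

    def _build(self):
        tokenized = [doc.lower().split() for doc in self.docs]
        doc_freq = Counter(term for toks in tokenized for term in set(toks))
        avg_len = sum(len(toks) for toks in tokenized) / max(len(tokenized), 1)
        self._stats = (tokenized, doc_freq, avg_len)

    def search(self, query, top_k=10, score_threshold=0.0):
        if self._stats is None:
            self._build()
        tokenized, doc_freq, avg_len = self._stats
        n_docs = len(tokenized)
        scored = []
        for idx, toks in enumerate(tokenized):
            tf = Counter(toks)
            score = 0.0
            for term in query.lower().split():
                if term not in tf:
                    continue
                # Non-negative IDF variant; rank_bm25 computes IDF differently.
                df = doc_freq[term]
                idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
                norm = tf[term] + self.k1 * (1 - self.b + self.b * len(toks) / avg_len)
                score += idf * tf[term] * (self.k1 + 1) / norm
            # Scores at or below the threshold are dropped.
            if score > score_threshold:
                scored.append((score, idx))
        scored.sort(key=lambda pair: -pair[0])
        return [self.docs[idx] for _, idx in scored[:top_k]]
```

Because the default threshold is 0.0 and the comparison is strict, documents that share no term with the query never appear in the results.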
- property records_map