data_juicer.ops.mapper.pii_llm_suspect_mapper module
- data_juicer.ops.mapper.pii_llm_suspect_mapper.ensure_spacy_pipeline_installed(model_name: str, *, auto_download: bool) -> None
If auto_download is set and the pipeline is missing, run spacy.cli.download (needs network).
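The check-then-download behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the helper names pipeline_is_installed and ensure_pipeline, the injectable downloader argument, and the boolean return value are all hypothetical, and the sketch assumes the pipeline is installed as an importable Python package. It also assumes that PII_SPACY_AUTO_DOWNLOAD=0 (the environment variable named on this page) acts as a hard opt-out.

```python
import importlib.util
import os
from typing import Callable, Optional


def pipeline_is_installed(model_name: str) -> bool:
    # Assumption: spaCy pipelines installed via pip are importable packages,
    # so an import-spec lookup is enough to detect them.
    return importlib.util.find_spec(model_name) is not None


def ensure_pipeline(
    model_name: str,
    *,
    auto_download: bool,
    downloader: Optional[Callable[[str], None]] = None,  # hypothetical hook
) -> bool:
    """Return True if the pipeline is present or a download was attempted."""
    if pipeline_is_installed(model_name):
        return True
    # Assumption: the env var overrides the keyword argument when set to "0".
    env_disabled = os.environ.get("PII_SPACY_AUTO_DOWNLOAD") == "0"
    if not auto_download or env_disabled:
        return False
    if downloader is None:
        import spacy.cli  # imported lazily; only needed on the download path

        downloader = spacy.cli.download  # needs network access
    downloader(model_name)
    return True
```

Injecting the downloader keeps the sketch testable offline; in an air-gapped job the function simply reports that the pipeline is unavailable instead of attempting a network call.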
- class data_juicer.ops.mapper.pii_llm_suspect_mapper.PiiLlmSuspectMapper(*args, **kwargs)
Bases: Mapper
LLM audit (and optional redaction) for possibly missed PII.
Writes JSON to meta[result_key] (default MetaKeys.pii_llm_suspect). Set redaction_mode to evidence or whole_field to also modify the inspect_keys string fields (and messages when listed). Place after pii_redaction_mapper.

Use gate_mode="heuristic" to call the API only when cheap patterns suggest residual risk (long digit runs, @, secret-like keywords, etc.).

Pre-LLM extensions (still no API cost unless you enable spaCy):

  - heuristic_name_rules (default True): contextual CJK / English name cues so person-heavy text is not skipped when the base heuristic fires only on digits and secrets.
  - spacy_ner_models: optional list of spaCy pipeline names (e.g. ["zh_core_web_sm", "en_core_web_sm"]) so one job loads both and runs NER on the same text prefix until a PERSON / PER hit.
  - spacy_ner_model: legacy single name; merged after spacy_ner_models (deduped). Install with python -m spacy download <name>.
  - spacy_auto_download (default True): if the pipeline is missing, run spaCy's downloader before spacy.load (needs network, uses pip). Disable in air-gapped jobs or set env PII_SPACY_AUTO_DOWNLOAD=0.
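To make the idea of a heuristic gate concrete, here is a minimal sketch of a cheap pre-LLM check covering the three cues named above (long digit runs, @, secret-like keywords). The function name looks_risky, the six-digit threshold, and the keyword list are assumptions chosen for illustration; the operator's real patterns may differ and also include the name rules and spaCy NER described above.

```python
import re

# Assumption: six or more consecutive digits counts as a "long digit run".
_LONG_DIGITS = re.compile(r"\d{6,}")
# Assumption: a small illustrative set of secret-like keywords.
_SECRET_WORDS = re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b")


def looks_risky(text: str) -> bool:
    """Return True when cheap patterns suggest residual PII or secrets."""
    if "@" in text:  # possible e-mail address or handle
        return True
    if _LONG_DIGITS.search(text):  # possible phone, ID, or card number
        return True
    return _SECRET_WORDS.search(text) is not None


samples = ["Order id 20240315 shipped", "contact: a@b.com", "room 42 is free"]
# Only samples that pass the gate would be sent to the LLM audit.
to_audit = [s for s in samples if looks_risky(s)]
```

A gate like this trades recall for cost: text that trips none of the patterns is never sent to the API, which is why the additional name cues exist for person-heavy text.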
- __init__(api_model: str = 'qwen-turbo', *, inspect_keys: List[str] | None = None, messages_key: str | None = 'messages', max_messages_for_prompt: Annotated[int, Gt(gt=0)] = 4, max_chars_per_field: Annotated[int, Gt(gt=0)] = 6000, max_chars_messages_excerpt: Annotated[int, Gt(gt=0)] = 8000, gate_mode: str = 'heuristic', result_key: str = 'pii_llm_suspect', raw_key: str = 'pii_llm_suspect_raw', overwrite: bool = False, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, preferred_output_lang: str = 'zh', try_num: Annotated[int, Gt(gt=0)] = 2, model_params: Dict | None = None, sampling_params: Dict | None = None, text_key: str = 'text', heuristic_name_rules: bool = True, spacy_ner_model: str | None = None, spacy_ner_models: List[str] | None = None, spacy_ner_max_chars: Annotated[int, Gt(gt=0)] = 4000, spacy_auto_download: bool = True, redaction_mode: str = 'none', redaction_placeholder: str = '[LLM_PII_SUSPECT_REDACTED]', **kwargs)
Base class that conducts data editing.
- Parameters:
text_key – the key name of field that stores sample texts to be processed.
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
image_bytes_key – the key name of field that stores sample image bytes list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
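As a rough illustration of how the constructor arguments above might be combined, the sketch below builds a process list as plain Python data, with the LLM audit placed after pii_redaction_mapper as recommended. The helper build_process_list is hypothetical, and both the registered operator name pii_llm_suspect_mapper and the list-of-single-key-dicts layout are assumptions; the argument names and the three redaction_mode values come from the signature and description on this page.

```python
from typing import Any, Dict, List

# Values for redaction_mode named in the class description.
REDACTION_MODES = {"none", "evidence", "whole_field"}


def build_process_list(redaction_mode: str = "none") -> List[Dict[str, Any]]:
    """Hypothetical helper: operator entries in execution order."""
    if redaction_mode not in REDACTION_MODES:
        raise ValueError(f"unsupported redaction_mode: {redaction_mode!r}")
    return [
        # Rule-based redaction runs first.
        {"pii_redaction_mapper": {}},
        # Assumed registry name; the LLM audit then looks for missed PII.
        {
            "pii_llm_suspect_mapper": {
                "api_model": "qwen-turbo",
                "gate_mode": "heuristic",
                "redaction_mode": redaction_mode,
                "spacy_ner_models": ["zh_core_web_sm", "en_core_web_sm"],
                "spacy_auto_download": False,  # e.g. for air-gapped jobs
            }
        },
    ]


process = build_process_list("evidence")
```

With redaction_mode left at "none" the operator only writes its audit result to the sample metadata; "evidence" or "whole_field" additionally rewrites the inspected fields.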