data_juicer.ops.mapper.pii_llm_suspect_mapper module#

data_juicer.ops.mapper.pii_llm_suspect_mapper.ensure_spacy_pipeline_installed(model_name: str, *, auto_download: bool) → None[source]#

If auto_download is enabled and the pipeline is missing, download it via spacy.cli.download before loading (requires network access).

class data_juicer.ops.mapper.pii_llm_suspect_mapper.PiiLlmSuspectMapper(*args, **kwargs)[source]#

Bases: Mapper

LLM audit (and optional redaction) for possibly missed PII.

Writes a JSON result to meta[result_key] (default MetaKeys.pii_llm_suspect). Set redaction_mode to "evidence" or "whole_field" to also modify the string fields listed in inspect_keys (and messages, when listed). Place this op after pii_redaction_mapper.
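The two redaction modes can be illustrated with a small, self-contained sketch. The helper names below are hypothetical, not the operator's internal API; only the placeholder string matches the documented default:

```python
# Illustrative sketch of the two redaction modes; function names are
# hypothetical, not the operator's internal API.
PLACEHOLDER = "[LLM_PII_SUSPECT_REDACTED]"


def redact_evidence(text: str, evidence_spans: list) -> str:
    """Replace only the flagged evidence substrings."""
    for span in evidence_spans:
        text = text.replace(span, PLACEHOLDER)
    return text


def redact_whole_field(text: str, evidence_spans: list) -> str:
    """If anything was flagged, blank the entire field."""
    return PLACEHOLDER if evidence_spans else text


sample_text = "Contact Alice at alice@example.com for access."
print(redact_evidence(sample_text, ["alice@example.com"]))
print(redact_whole_field(sample_text, ["alice@example.com"]))
```

"evidence" keeps the surrounding text readable; "whole_field" is the conservative choice when partial redaction might leave re-identifiable context behind.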

Use gate_mode="heuristic" to call the API only when cheap patterns suggest residual risk (long digit runs, @, secret-like keywords, etc.).

Pre-LLM extensions (still no API cost unless you enable spaCy):

  • heuristic_name_rules (default True): contextual CJK / English name cues so person-heavy text is not skipped when the base heuristic fires only on digits and secrets.

  • spacy_ner_models: optional list of spaCy pipeline names (e.g. ["zh_core_web_sm", "en_core_web_sm"]) so one job loads both and runs NER on the same text prefix until a PERSON / PER hit.

  • spacy_ner_model: legacy single name; merged after spacy_ner_models (deduped). Install with python -m spacy download <name>.

  • spacy_auto_download (default True): if the pipeline is missing, run spaCy’s downloader before spacy.load (needs network, uses pip). Disable in air-gapped jobs or set env PII_SPACY_AUTO_DOWNLOAD=0.
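The contextual name cues behind heuristic_name_rules can be sketched as simple patterns keyed on courtesy titles. The rules below are hypothetical and much simpler than the operator's actual logic; they only illustrate the idea of firing on person-name context rather than on the name itself:

```python
import re

# Hypothetical sketch of contextual name cues; the operator's actual
# rules are more elaborate.
_CJK_NAME_CUE = re.compile(r"[\u4e00-\u9fff]{1,3}(先生|女士|老师)")
_EN_NAME_CUE = re.compile(r"\b(Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+")


def has_name_cue(text: str) -> bool:
    """Fire when a courtesy title suggests a person name is present."""
    return bool(_CJK_NAME_CUE.search(text) or _EN_NAME_CUE.search(text))
```

Because these cues fire on context, person-heavy text is gated in even when the base heuristic (digits, secrets) finds nothing; spaCy NER then serves as the heavier, optional fallback.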

__init__(api_model: str = 'qwen-turbo', *, inspect_keys: List[str] | None = None, messages_key: str | None = 'messages', max_messages_for_prompt: Annotated[int, Gt(gt=0)] = 4, max_chars_per_field: Annotated[int, Gt(gt=0)] = 6000, max_chars_messages_excerpt: Annotated[int, Gt(gt=0)] = 8000, gate_mode: str = 'heuristic', result_key: str = 'pii_llm_suspect', raw_key: str = 'pii_llm_suspect_raw', overwrite: bool = False, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, preferred_output_lang: str = 'zh', try_num: Annotated[int, Gt(gt=0)] = 2, model_params: Dict | None = None, sampling_params: Dict | None = None, text_key: str = 'text', heuristic_name_rules: bool = True, spacy_ner_model: str | None = None, spacy_ner_models: List[str] | None = None, spacy_ner_max_chars: Annotated[int, Gt(gt=0)] = 4000, spacy_auto_download: bool = True, redaction_mode: str = 'none', redaction_placeholder: str = '[LLM_PII_SUSPECT_REDACTED]', **kwargs)[source]#
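In a data-juicer YAML config this operator might be wired up as follows. The values shown are illustrative, assuming the usual `process` list convention; check your deployment for the correct API credentials and model names:

```yaml
process:
  - pii_redaction_mapper:                 # run deterministic redaction first
  - pii_llm_suspect_mapper:
      api_model: qwen-turbo
      gate_mode: heuristic                # call the API only on risky samples
      spacy_ner_models: ["zh_core_web_sm", "en_core_web_sm"]
      spacy_auto_download: false          # disable in air-gapped jobs
      redaction_mode: evidence
```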

Initialization method. The parameters documented below are inherited from the base Mapper class; see the class docstring above for this operator's specific arguments.

Parameters:
  • text_key – the key name of the field that stores sample texts to be processed.

  • image_key – the key name of the field that stores the sample image list to be processed.

  • audio_key – the key name of the field that stores the sample audio list to be processed.

  • video_key – the key name of the field that stores the sample video list to be processed.

  • image_bytes_key – the key name of the field that stores the sample image bytes list to be processed.

  • query_key – the key name of the field that stores sample queries.

  • response_key – the key name of the field that stores responses.

  • history_key – the key name of the field that stores the history of queries and responses.

process_single(sample, rank=None)[source]#

Sample-level processing: one sample in, one sample out (sample –> sample).

Parameters:

sample – sample to process

Returns:

processed sample
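Conceptually, process_single attaches the audit JSON to the sample's meta dict under result_key. The toy sketch below shows that flow with plain dicts; the key names match the documented defaults, but the audit schema and helper name are illustrative only:

```python
import json

RESULT_KEY = "pii_llm_suspect"  # documented default for result_key


def attach_audit(sample: dict, audit: dict, overwrite: bool = False) -> dict:
    """Store an (already parsed) LLM audit under sample['meta'][RESULT_KEY].

    Illustrative sketch only; the real operator builds the audit via an
    LLM call and may use a different schema.
    """
    meta = sample.setdefault("meta", {})
    if RESULT_KEY in meta and not overwrite:
        return sample  # keep an existing result unless overwrite is set
    meta[RESULT_KEY] = json.dumps(audit, ensure_ascii=False)
    return sample


s = attach_audit({"text": "hi"}, {"suspect": False, "evidence": []})
print(s["meta"][RESULT_KEY])
```

The overwrite flag mirrors the constructor's overwrite parameter: by default an existing result is preserved, so re-running the op is idempotent.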