data_juicer.ops.mapper.pii_redaction_mapper module#

class data_juicer.ops.mapper.pii_redaction_mapper.PiiRedactionMapper(*args, **kwargs)[source]#

Bases: Mapper

Redact PII in text and optionally in messages/query/response.

Covers paths (Unix/Windows), emails, secrets, IDs, phones, agent channel identifiers (飞书/钉钉/企业微信 open_id, channel: feishu|dingtalk|email). Optional: PEM blocks, JWT-shaped tokens, http(s) URLs, IPv4, bracketed IPv6, MAC addresses (see mask_extended_pii or individual flags). Use redact_keys to apply to text, query, response, and/or messages (recursive).

__init__(mask_paths: bool = True, mask_emails: bool = True, mask_secrets: bool = True, mask_ids: bool = True, mask_phones: bool = True, mask_id_cards: bool = True, mask_channel_ids: bool = True, mask_platform_open_ids: bool = True, mask_pem: bool = True, mask_jwt: bool = True, mask_urls: bool = False, mask_ips: bool = True, mask_macs: bool = True, path_replacement: str = '[PATH_REDACTED]', email_replacement: str = '[EMAIL_REDACTED]', secret_replacement: str = '[REDACTED]', id_replacement: str = '[ID_REDACTED]', phone_replacement: str = '[PHONE_REDACTED]', id_card_replacement: str = '[ID_CARD_REDACTED]', channel_id_replacement: str = '[CHANNEL_ID_REDACTED]', pem_replacement: str = '[PEM_REDACTED]', jwt_replacement: str = '[JWT_REDACTED]', url_replacement: str = '[URL_REDACTED]', ip_replacement: str = '[IP_REDACTED]', mac_replacement: str = '[MAC_REDACTED]', extra_patterns: List[Tuple[str, str]] | None = None, text_key: str = 'text', redact_keys: List[str] | None = None, messages_key: str | None = 'messages', **kwargs)[source]#

Base class that conducts data editing.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed.

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • image_bytes_key – the key name of field that stores sample image bytes list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process_single(sample: dict) dict[source]#

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample