data_juicer.ops.mapper.pii_redaction_mapper module#
- class data_juicer.ops.mapper.pii_redaction_mapper.PiiRedactionMapper(*args, **kwargs)[源代码]#
基类:
MapperRedact PII in text and optionally in messages/query/response.
Covers paths (Unix/Windows), emails, secrets, IDs, phones, agent channel identifiers (飞书/钉钉/企业微信 open_id, channel: feishu|dingtalk|email). Optional: PEM blocks, JWT-shaped tokens, http(s) URLs, IPv4, bracketed IPv6, MAC addresses (see
mask_extended_piior individual flags). Use redact_keys to apply to text, query, response, and/or messages (recursive).- __init__(mask_paths: bool = True, mask_emails: bool = True, mask_secrets: bool = True, mask_ids: bool = True, mask_phones: bool = True, mask_id_cards: bool = True, mask_channel_ids: bool = True, mask_platform_open_ids: bool = True, mask_pem: bool = True, mask_jwt: bool = True, mask_urls: bool = False, mask_ips: bool = True, mask_macs: bool = True, path_replacement: str = '[PATH_REDACTED]', email_replacement: str = '[EMAIL_REDACTED]', secret_replacement: str = '[REDACTED]', id_replacement: str = '[ID_REDACTED]', phone_replacement: str = '[PHONE_REDACTED]', id_card_replacement: str = '[ID_CARD_REDACTED]', channel_id_replacement: str = '[CHANNEL_ID_REDACTED]', pem_replacement: str = '[PEM_REDACTED]', jwt_replacement: str = '[JWT_REDACTED]', url_replacement: str = '[URL_REDACTED]', ip_replacement: str = '[IP_REDACTED]', mac_replacement: str = '[MAC_REDACTED]', extra_patterns: List[Tuple[str, str]] | None = None, text_key: str = 'text', redact_keys: List[str] | None = None, messages_key: str | None = 'messages', **kwargs)[源代码]#
Base class that conducts data editing.
- 参数:
text_key -- the key name of field that stores sample texts to be processed.
image_key -- the key name of field that stores sample image list to be processed
audio_key -- the key name of field that stores sample audio list to be processed
video_key -- the key name of field that stores sample video list to be processed
image_bytes_key -- the key name of field that stores sample image bytes list to be processed
query_key -- the key name of field that stores sample queries
response_key -- the key name of field that stores responses
history_key -- the key name of field that stores history of queries and responses