data_juicer.utils.encryption_utils module

data_juicer.utils.encryption_utils.load_fernet_key(key_path=None)[source]

Load a Fernet key from a file or environment variable.

Priority order:

1. key_path file (if provided and exists)
2. Environment variable DJ_ENCRYPTION_KEY

Parameters:

key_path -- path to a file containing the Fernet key as a URL-safe base64 string. Pass None to fall back to the environment variable.

Returns:

a cryptography.fernet.Fernet instance ready for encryption / decryption.

Raises:

ValueError -- if no key can be found or the key is invalid.
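The lookup order above can be sketched with the standard library alone. This is a minimal illustration, not the module's source: the helper name resolve_fernet_key_bytes and its environ parameter are hypothetical, it returns the raw key bytes rather than a Fernet instance, and the rule that an existing key file takes precedence even when the environment variable is also set is an assumption drawn from the stated priority.

```python
import base64
import binascii
import os


def resolve_fernet_key_bytes(key_path=None, environ=None):
    """Hypothetical helper mirroring the documented lookup order."""
    environ = os.environ if environ is None else environ

    key = None
    # 1. A key file wins when a path is given and the file exists.
    if key_path is not None and os.path.isfile(key_path):
        with open(key_path, "rb") as f:
            key = f.read().strip()
    # 2. Otherwise fall back to the DJ_ENCRYPTION_KEY environment variable.
    elif environ.get("DJ_ENCRYPTION_KEY"):
        key = environ["DJ_ENCRYPTION_KEY"].strip().encode("ascii")

    if not key:
        raise ValueError("no Fernet key found in key_path or DJ_ENCRYPTION_KEY")

    # A Fernet key is 32 random bytes encoded as URL-safe base64.
    try:
        raw = base64.urlsafe_b64decode(key)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("Fernet key is not valid URL-safe base64") from exc
    if len(raw) != 32:
        raise ValueError("Fernet key must decode to exactly 32 bytes")
    return key
```

The returned bytes are in the form that the cryptography.fernet.Fernet constructor accepts as its key argument.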

data_juicer.utils.encryption_utils.encrypt_file(src_path, dst_path, fernet)[source]

Encrypt a file with Fernet and write the ciphertext to dst_path.

When src_path == dst_path the file is encrypted in-place: the plaintext is read into memory, the file is overwritten with ciphertext, and the original plaintext is never written back to disk.

Parameters:
  • src_path -- path to the plaintext source file.

  • dst_path -- path where the encrypted file will be written. May be the same as src_path for in-place encryption.

  • fernet -- a cryptography.fernet.Fernet instance.
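The in-place behaviour described above comes down to reading everything before writing anything. The following is a rough re-creation of that flow under stated assumptions, not the library's code: encrypt_file_sketch is a hypothetical name, and its cipher argument stands for any object with an encrypt(bytes) method returning bytes, such as a cryptography.fernet.Fernet instance.

```python
def encrypt_file_sketch(src_path, dst_path, cipher):
    """Hypothetical re-creation of the documented flow, not the library source.

    `cipher` is any object with encrypt(bytes) -> bytes, for example a
    cryptography.fernet.Fernet instance.
    """
    # Read the whole plaintext into memory before opening dst_path for
    # writing; this ordering is what makes src_path == dst_path safe.
    with open(src_path, "rb") as src:
        plaintext = src.read()

    token = cipher.encrypt(plaintext)

    # Opening with "wb" truncates the file, so in the in-place case the
    # file content becomes the ciphertext.
    with open(dst_path, "wb") as dst:
        dst.write(token)
```

Because the sketch holds the entire file in memory, it suits files that fit comfortably in RAM.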

data_juicer.utils.encryption_utils.decrypt_file_to_bytes(src_path, fernet)[source]

Decrypt an encrypted file and return the plaintext as bytes.

The plaintext is never written to disk — only returned in memory.

Parameters:
  • src_path -- path to the Fernet-encrypted file.

  • fernet -- a cryptography.fernet.Fernet instance.

Returns:

decrypted plaintext as bytes.

Raises:

cryptography.fernet.InvalidToken -- if the file cannot be decrypted with the provided key.

data_juicer.utils.encryption_utils.get_secure_tmpdir()[source]

Return the best available temporary directory for plaintext data.

Priority:

1. /dev/shm: Linux in-memory tmpfs, plaintext never touches disk.
2. System default (/tmp or TMPDIR): plaintext exists briefly on disk until the caller removes the file.

Returns:

path string to use as the dir argument of tempfile.NamedTemporaryFile().
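A minimal sketch of this fallback, assuming the choice depends only on whether /dev/shm exists and is writable; the actual check in the module may differ. The name pick_plaintext_tmpdir and its shm_dir parameter are hypothetical.

```python
import os
import tempfile


def pick_plaintext_tmpdir(shm_dir="/dev/shm"):
    """Hypothetical stand-in for the documented priority order."""
    # Prefer the in-memory tmpfs when it exists and is writable.
    if os.path.isdir(shm_dir) and os.access(shm_dir, os.W_OK):
        return shm_dir
    # Otherwise use the system default, which honours TMPDIR.
    return tempfile.gettempdir()


# Usage: the temporary file is deleted when the with-block exits, which
# limits how long plaintext stays on disk in the fallback case.
with tempfile.NamedTemporaryFile(dir=pick_plaintext_tmpdir(), suffix=".bin") as tmp:
    tmp.write(b"decrypted payload")
    tmp.flush()
```

Passing the result as dir keeps the caller in control of the file's lifetime, which matters most when the fallback directory is disk-backed.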

data_juicer.utils.encryption_utils.decrypt_file_to_bytesio(src_path, fernet)[source]

Decrypt an encrypted file and return an io.BytesIO buffer.

Convenience wrapper around decrypt_file_to_bytes() that wraps the result in a seekable in-memory buffer, ready to be passed directly to HuggingFace load_dataset or PDF/DOCX parsers.

Parameters:
  • src_path -- path to the Fernet-encrypted file.

  • fernet -- a cryptography.fernet.Fernet instance.

Returns:

io.BytesIO positioned at offset 0.
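The wrapper relationship can be illustrated as follows. This is a sketch under assumptions rather than the module's implementation: the two *_sketch functions and iter_jsonl_records are hypothetical names, the cipher argument stands for any object with a decrypt(bytes) method returning bytes (such as a cryptography.fernet.Fernet instance), and the JSON Lines consumer is only one example of a parser that accepts a file-like object.

```python
import io
import json


def decrypt_file_to_bytes_sketch(src_path, cipher):
    """Hypothetical: read the token from disk and decrypt it in memory."""
    with open(src_path, "rb") as f:
        token = f.read()
    return cipher.decrypt(token)


def decrypt_file_to_bytesio_sketch(src_path, cipher):
    """Hypothetical: wrap the plaintext in a seekable buffer at offset 0."""
    return io.BytesIO(decrypt_file_to_bytes_sketch(src_path, cipher))


def iter_jsonl_records(buffer):
    """Illustrative consumer: parse one JSON object per non-empty line."""
    for line in buffer:
        if line.strip():
            yield json.loads(line)
```

Since the buffer is seekable, a caller can call seek(0) after one pass and hand the same plaintext to a second consumer without decrypting again.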