data_juicer.utils.encryption_utils module

data_juicer.utils.encryption_utils.load_fernet_key(key_path=None)

Load a Fernet key from a file or environment variable.

Priority order:

  1. key_path file (if provided and exists)

  2. Environment variable DJ_ENCRYPTION_KEY

Parameters:

key_path – path to a file containing the Fernet key as a URL-safe base64 string. Pass None to fall back to the environment variable.

Returns:

a cryptography.fernet.Fernet instance ready for encryption / decryption.

Raises:

ValueError – if no key can be found or the key is invalid.
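The lookup order described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: resolve_key_bytes is a hypothetical helper that returns the raw key bytes instead of a Fernet instance, and the validation shown (32 bytes after URL-safe base64 decoding) is the Fernet key format rather than anything this page specifies.

```python
import base64
import os
from pathlib import Path

ENV_VAR = "DJ_ENCRYPTION_KEY"  # environment variable named in the docs above


def resolve_key_bytes(key_path=None):
    """Hypothetical helper: find the key using the documented priority order."""
    if key_path is not None and Path(key_path).is_file():
        # 1. A key file that was provided and exists wins.
        key = Path(key_path).read_bytes().strip()
    elif os.environ.get(ENV_VAR):
        # 2. Otherwise fall back to the environment variable.
        key = os.environ[ENV_VAR].strip().encode("ascii")
    else:
        raise ValueError(f"no key found: pass key_path or set {ENV_VAR}")

    # A Fernet key is 32 random bytes encoded as URL-safe base64.
    try:
        decoded = base64.urlsafe_b64decode(key)
    except ValueError as exc:  # binascii.Error is a ValueError subclass
        raise ValueError("key is not valid URL-safe base64") from exc
    if len(decoded) != 32:
        raise ValueError("a Fernet key must decode to exactly 32 bytes")
    return key
```

The returned bytes could then be passed to cryptography.fernet.Fernet(key), which performs its own validation as well.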

data_juicer.utils.encryption_utils.encrypt_file(src_path, dst_path, fernet)

Encrypt a file with Fernet and write the ciphertext to dst_path.

When src_path == dst_path the file is encrypted in-place: the plaintext is read into memory, the file is overwritten with ciphertext, and the original plaintext is never written back to disk.

Parameters:
  • src_path – path to the plaintext source file.

  • dst_path – path where the encrypted file will be written. May be the same as src_path for in-place encryption.

  • fernet – a cryptography.fernet.Fernet instance.
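The read-then-overwrite behaviour described for in-place encryption might look roughly like the following. encrypt_file_sketch is a hypothetical stand-in written for illustration; the real encrypt_file() may differ in details such as error handling or how the write is performed.

```python
from pathlib import Path

from cryptography.fernet import Fernet


def encrypt_file_sketch(src_path: str, dst_path: str, fernet: Fernet) -> None:
    """Hypothetical stand-in for encrypt_file(); assumes the file fits in memory."""
    plaintext = Path(src_path).read_bytes()  # whole file is read into memory
    token = fernet.encrypt(plaintext)        # Fernet token (URL-safe base64 bytes)
    # When src_path == dst_path this overwrites the plaintext with ciphertext.
    Path(dst_path).write_bytes(token)
```

Because the whole file is held in memory before writing, this approach suits files that fit comfortably in RAM.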

data_juicer.utils.encryption_utils.decrypt_file_to_bytes(src_path, fernet)

Decrypt an encrypted file and return the plaintext as bytes.

The plaintext is never written to disk; it is only returned in memory.

Parameters:
  • src_path – path to the Fernet-encrypted file.

  • fernet – a cryptography.fernet.Fernet instance.

Returns:

decrypted plaintext as bytes.

Raises:

cryptography.fernet.InvalidToken – if the file cannot be decrypted with the provided key.
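A caller will usually want to handle InvalidToken explicitly, since it is raised both for a wrong key and for a corrupted file. The sketch below assumes a simplified stand-in, decrypt_file_to_bytes_sketch, and a hypothetical wrapper try_decrypt; neither is part of the module.

```python
from pathlib import Path
from typing import Optional

from cryptography.fernet import Fernet, InvalidToken


def decrypt_file_to_bytes_sketch(src_path: str, fernet: Fernet) -> bytes:
    """Hypothetical stand-in: read the token from disk and decrypt it in memory."""
    return fernet.decrypt(Path(src_path).read_bytes())


def try_decrypt(src_path: str, fernet: Fernet) -> Optional[bytes]:
    """Hypothetical wrapper: return the plaintext, or None if decryption fails."""
    try:
        return decrypt_file_to_bytes_sketch(src_path, fernet)
    except InvalidToken:
        # Wrong key, truncated file, or tampered ciphertext.
        return None
```

Returning None is only one possible policy; re-raising with a clearer message naming the file is an equally reasonable choice.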

data_juicer.utils.encryption_utils.get_secure_tmpdir()

Return the best available temporary directory for plaintext data.

Priority:

  1. /dev/shm: Linux in-memory tmpfs, so plaintext never touches disk.

  2. System default (/tmp or TMPDIR): plaintext exists briefly on disk until the caller removes the file.

Returns:

path string to use as the dir argument of tempfile.NamedTemporaryFile().
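The selection logic and its intended use with tempfile.NamedTemporaryFile() can be sketched as follows. pick_tmpdir and write_plaintext_tempfile are hypothetical names, and the writability check is an assumption about what "available" means here, not a statement about the real get_secure_tmpdir().

```python
import os
import tempfile


def pick_tmpdir(shm_path: str = "/dev/shm") -> str:
    """Hypothetical stand-in: prefer an in-memory tmpfs, else the system default."""
    if os.path.isdir(shm_path) and os.access(shm_path, os.W_OK):
        return shm_path
    return tempfile.gettempdir()  # honours TMPDIR, typically /tmp on Linux


def write_plaintext_tempfile(data: bytes, suffix: str = "") -> str:
    """Write plaintext to a temp file in the chosen directory and return its path.

    The caller is responsible for removing the file when done.
    """
    with tempfile.NamedTemporaryFile(
        dir=pick_tmpdir(), suffix=suffix, delete=False
    ) as tmp:
        tmp.write(data)
        return tmp.name
```

A caller would typically wrap the use of the returned path in try/finally and call os.remove(path) in the finally clause, so the plaintext does not outlive the operation that needed it.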

data_juicer.utils.encryption_utils.decrypt_file_to_bytesio(src_path, fernet)

Decrypt an encrypted file and return an io.BytesIO buffer.

Convenience wrapper around decrypt_file_to_bytes() that wraps the result in a seekable in-memory buffer, ready to be passed directly to HuggingFace load_dataset or PDF/DOCX parsers.

Parameters:
  • src_path – path to the Fernet-encrypted file.

  • fernet – a cryptography.fernet.Fernet instance.

Returns:

io.BytesIO positioned at offset 0.
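To illustrate why a seekable buffer is useful, the sketch below decrypts an archive and hands the buffer to the standard-library zipfile module, which needs random access (a DOCX file is itself a ZIP container). decrypt_to_bytesio_sketch and list_archive_members are hypothetical names used only for this example.

```python
import io
import zipfile

from cryptography.fernet import Fernet


def decrypt_to_bytesio_sketch(src_path: str, fernet: Fernet) -> io.BytesIO:
    """Hypothetical stand-in: decrypt a file into a seekable in-memory buffer."""
    with open(src_path, "rb") as fh:
        return io.BytesIO(fernet.decrypt(fh.read()))  # new buffer starts at offset 0


def list_archive_members(src_path: str, fernet: Fernet) -> list:
    """Open an encrypted ZIP-based file without writing plaintext to disk."""
    buffer = decrypt_to_bytesio_sketch(src_path, fernet)
    with zipfile.ZipFile(buffer) as archive:
        return archive.namelist()
```

Any parser that accepts a binary file-like object should be able to consume the buffer in the same way.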