data_juicer.ops.deduplicator.ray_basic_deduplicator module#

class data_juicer.ops.deduplicator.ray_basic_deduplicator.DedupSet[source]#

Bases: object

__init__()[source]#
is_unique(key)[source]#
data_juicer.ops.deduplicator.ray_basic_deduplicator.get_remote_dedup_set()[source]#

Get the remote version of DedupSet with Ray decorator applied at runtime.

class data_juicer.ops.deduplicator.ray_basic_deduplicator.Backend(*args, **kwargs)[source]#

Bases: ABC

Backend for deduplicator.

abstractmethod __init__(*args, **kwargs)[source]#
abstractmethod is_unique(md5_value: str)[source]#
class data_juicer.ops.deduplicator.ray_basic_deduplicator.ActorBackend(dedup_set_num: int | str, RemoteDedupSet=None)[source]#

Bases: Backend

Ray actor backend for deduplicator. Uses lazy initialization to defer actor creation until first use, allowing the cluster to autoscale before actors consume resources.

__init__(dedup_set_num: int | str, RemoteDedupSet=None)[source]#
property dedup_set_num#

Get actual dedup_set_num, calculating from cluster resources if ‘auto’.

is_unique(md5_value: str)[source]#
class data_juicer.ops.deduplicator.ray_basic_deduplicator.RedisBackend(redis_address: str)[source]#

Bases: Backend

Redis backend for deduplicator.

__init__(redis_address: str)[source]#
is_unique(md5_value: str)[source]#
class data_juicer.ops.deduplicator.ray_basic_deduplicator.RayBasicDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', dedup_set_num: int | str = 'auto', *args, **kwargs)[source]#

Bases: Filter

A basic exact matching deduplicator for RAY. Although its functionality is deduplication, it is implemented as Filter sub-class.

EMPTY_HASH_VALUE = 'EMPTY'#
__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', dedup_set_num: int | str = 'auto', *args, **kwargs)[source]#

Initialization. :param backend: the backend for dedup, either ‘ray_actor’ or ‘redis’ :param redis_address: the address of redis server :param dedup_set_num: number of dedup set actors, or ‘auto’ to use CPU/2 :param args: extra args :param kwargs: extra args

calculate_hash(sample, context=False)[source]#

Calculate hash value for the sample.

compute_stats_single(sample, context=False)[source]#

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]#

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering