data_juicer.ops.filter.image_nsfw_filter module

class data_juicer.ops.filter.image_nsfw_filter.ImageNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, max_score: float = 0.5, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter that keeps samples whose images have low NSFW scores.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, max_score: float = 0.5, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • hf_nsfw_model – name of the NSFW detection model on Hugging Face.

  • trust_remote_code – whether to trust remote code when loading the Hugging Face model.

  • max_score – the NSFW score threshold for samples, in the range 0 to 1. Samples with an NSFW score below this threshold are kept.

  • any_or_all – whether to keep a sample using the ‘any’ or ‘all’ strategy across its images. ‘any’: keep the sample if any image meets the condition. ‘all’: keep the sample only if all images meet the condition.

  • args – extra positional args

  • kwargs – extra keyword args
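
The interaction between max_score and any_or_all can be illustrated with a minimal, library-independent sketch. The helper name `keep_sample` is hypothetical and is not part of data_juicer, and keeping a sample that has no images is an assumption made only for this illustration.

```python
from typing import Sequence


def keep_sample(nsfw_scores: Sequence[float],
                max_score: float = 0.5,
                any_or_all: str = "any") -> bool:
    """Hypothetical helper: decide whether to keep a sample from per-image scores."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    # An image meets the condition when its score is below the threshold.
    meets = [score < max_score for score in nsfw_scores]
    if not meets:
        # Assumption: a sample without images has nothing to reject, so keep it.
        return True
    return any(meets) if any_or_all == "any" else all(meets)


scores = [0.02, 0.91]  # one clean image, one flagged image
print(keep_sample(scores, any_or_all="any"))  # True
print(keep_sample(scores, any_or_all="all"))  # False
```

With the ‘any’ strategy a single low-scoring image is enough to keep a multi-image sample, so ‘all’ is the stricter choice when every image must pass.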

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as the metric to decide whether to filter it.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats
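
As a rough illustration of what a stats step of this kind might do, the sketch below attaches one NSFW score per image to a sample. It is not the library's implementation: the dictionary keys `images`, `stats`, and `image_nsfw_score`, the `classify` callable, and the assumption that the NSFW class sits at logit index 1 are all placeholders chosen for the example.

```python
import math
from typing import Callable, Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compute_nsfw_stats(sample: Dict,
                       classify: Callable[[str], Sequence[float]],
                       nsfw_index: int = 1) -> Dict:
    """Hypothetical stand-in for a stats step: store one NSFW score per image."""
    stats = sample.setdefault("stats", {})
    # Skip the work if scores were already computed for this sample.
    if "image_nsfw_score" in stats:
        return sample
    stats["image_nsfw_score"] = [
        softmax(classify(path))[nsfw_index] for path in sample.get("images", [])
    ]
    return sample


def fake_classify(path: str) -> Sequence[float]:
    """Stub classifier returning fixed logits; a real one would run a model."""
    return [0.0, 2.0] if "bad" in path else [2.0, 0.0]


sample = compute_nsfw_stats({"images": ["ok.jpg", "bad.jpg"]}, fake_classify)
print(sample["stats"]["image_nsfw_score"])
```

Keeping the scores on the sample separates the expensive model call from the cheap keep-or-drop decision, so the threshold can be changed without scoring the images again.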

process_single(sample, rank=None)

Sample-level decision: sample -> Boolean.

Parameters:

sample – the sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out