data_juicer.ops.filter.image_subplot_filter module
- class data_juicer.ops.filter.image_subplot_filter.ImageSubplotFilter(*args, **kwargs)
Bases: Filter
Filter to detect and remove samples with images containing subplots.
This filter uses the Hough Line Transform to detect straight lines in images, which is particularly effective for detecting grid-like subplot layouts with perfectly straight edges.
The algorithm works by:
1. Converting images to grayscale and applying edge detection
2. Using the Hough Line Transform to detect straight lines
3. Classifying lines as horizontal or vertical based on angle
4. Counting lines that meet length and angle requirements
5. Calculating confidence based on line counts and distribution
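Steps 3 and 4 above can be sketched in plain Python. This is a minimal illustration, not the filter's actual implementation: it assumes line segments arrive as (x1, y1, x2, y2) tuples (the shape produced by a probabilistic Hough transform such as OpenCV's cv2.HoughLinesP), and the helper name classify_segments is hypothetical. Applying min_line_length after detection is also an assumption made for this sketch.

```python
import math


def classify_segments(segments, min_line_length=110, angle_tolerance=4.0):
    """Split (x1, y1, x2, y2) segments into horizontal and vertical groups.

    Hypothetical helper for illustration; defaults mirror the documented
    min_line_length and angle_tolerance parameters.
    """
    horizontal, vertical = [], []
    for x1, y1, x2, y2 in segments:
        # Drop segments that are too short to count.
        if math.hypot(x2 - x1, y2 - y1) < min_line_length:
            continue
        # Fold the direction into [0, 180) so segment orientation is ignored.
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        if angle <= angle_tolerance or angle >= 180.0 - angle_tolerance:
            horizontal.append((x1, y1, x2, y2))
        elif abs(angle - 90.0) <= angle_tolerance:
            vertical.append((x1, y1, x2, y2))
        # Anything else is treated as diagonal and ignored.
    return horizontal, vertical


segments = [
    (0, 0, 200, 0),    # horizontal
    (200, 5, 0, 5),    # horizontal, drawn right to left
    (50, 0, 50, 300),  # vertical
    (50, 300, 50, 0),  # vertical, drawn bottom to top
    (0, 0, 200, 200),  # diagonal, ignored
    (0, 0, 50, 0),     # too short, ignored
]
horizontal, vertical = classify_segments(segments)
```

The resulting counts, len(horizontal) and len(vertical), are what would then be compared against min_horizontal_lines and min_vertical_lines.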
- __init__(min_horizontal_lines: int = 3, min_vertical_lines: int = 3, min_confidence: float = 0.5, any_or_all: str = 'any', canny_threshold1: int = 70, canny_threshold2: int = 190, hough_threshold: int = 110, min_line_length: int = 110, max_line_gap: int = 18, angle_tolerance: float = 4.0, *args, **kwargs)
Initialization method.
- Parameters:
min_horizontal_lines – Minimum number of horizontal lines to consider an image as containing subplots.
min_vertical_lines – Minimum number of vertical lines to consider an image as containing subplots.
min_confidence – Minimum confidence score for filtering. Images with subplot confidence above this threshold will be considered as containing subplots.
any_or_all – Strategy for multi-image samples. 'any' filters the sample if any image contains subplots. 'all' filters the sample only if all images contain subplots.
canny_threshold1 – First threshold for Canny edge detector.
canny_threshold2 – Second threshold for Canny edge detector.
hough_threshold – Accumulator threshold for Hough transform.
min_line_length – Minimum line length to be detected.
max_line_gap – Maximum gap between line segments to be treated as a single line.
angle_tolerance – Tolerance in degrees for classifying lines as horizontal/vertical.
args – Extra args.
kwargs – Extra args.
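How min_confidence and any_or_all interact for a multi-image sample can be sketched as follows. This is an illustrative helper only: the name sample_has_subplots, the strict "greater than" comparison, and the handling of samples with no images are all assumptions, and the per-image confidence values are made up for the example.

```python
def sample_has_subplots(confidences, min_confidence=0.5, any_or_all="any"):
    """Decide whether a sample should be filtered out.

    confidences: per-image subplot confidence scores (hypothetical inputs).
    Returns True when the sample is considered to contain subplots.
    """
    # Assumption: "above this threshold" is read as a strict comparison.
    flags = [score > min_confidence for score in confidences]
    if not flags:
        # Assumption: a sample without images is never flagged.
        return False
    if any_or_all == "any":
        return any(flags)
    if any_or_all == "all":
        return all(flags)
    raise ValueError("any_or_all must be 'any' or 'all'")


# One of two images looks like a subplot grid.
flag_any = sample_has_subplots([0.9, 0.1], any_or_all="any")
flag_all = sample_has_subplots([0.9, 0.1], any_or_all="all")
```

With 'any', a single high-confidence image is enough to filter the sample; with 'all', every image in the sample must exceed the threshold.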