Identifying Value Clusters Within 5% Tolerance in SQLite
Value Grouping by Percentage Proximity: Challenges and Resolutions
Determining Similarity Through Relative Percentage Thresholds
Issue Overview
The core challenge revolves around identifying groups of numerical values in a SQLite table where each member of a group is within ±5% of at least one other member. The original dataset contains a Sampled
table with ID
(auto-incrementing integer) and Value
(real number) columns. The goal is to aggregate values into clusters where each value is within 5% of another value in the same cluster.
For example, given values like 20.1, 20.4, and 20.2, these would form a cluster because each value is within 5% of the others. However, larger values like 100.2 or outliers like 25.1 would form singleton clusters. The desired output includes:
- The number of values in each cluster (
MostSimilar
count). - The average of the clustered values.
- Optional: The minimum and maximum values in each cluster.
The complexity arises from the non-transitive nature of percentage-based similarity. If Value A is within 5% of Value B, and Value B is within 5% of Value C, it does not guarantee that Value A is within 5% of Value C. This creates overlapping clusters and ambiguous groupings, making it impossible to define a unique "correct" grouping without additional constraints.
Mathematical and Logical Constraints
Non-Transitive Relationships:
Percentage-based similarity violates the transitive property. For example:- 20.0 and 21.0 are similar (21.0 is 5% greater than 20.0).
- 21.0 and 22.05 are similar (22.05 is 5% greater than 21.0).
- However, 20.0 and 22.05 are not similar (22.05 is 10.25% greater than 20.0).
This leads to chains of similarity that do not form a single coherent cluster.
Zero Handling:
Comparing values near zero requires special handling. A 5% range around zero (e.g., -5% to +5%) collapses to a tiny interval (e.g., -0.05 to +0.05), which is often impractical. Worse, dividing by zero (when a value is zero) is undefined.Self-Joins and Duplicate Groups:
Naive self-joins between the table and itself can produce duplicate clusters. For instance, grouping byA.ID
and counting matches inB
might yield multiple clusters that are permutations of the same group (e.g., cluster centered at ID 2 vs. ID 7 in the extended dataset).
Troubleshooting Steps, Solutions & Fixes
Step 1: Define the Similarity Metric Rigorously
The first step is to formalize the similarity condition. Two values A
and B
are considered similar if:
ABS(B - A) / NULLIF(A, 0) < 0.05
This avoids division by zero when A
is zero. However, this definition still allows chains of similarity that defy intuitive clustering.
Example Fix:
SELECT
A.ID AS AnchorID,
B.ID AS SimilarID,
A.Value AS AnchorValue,
B.Value AS SimilarValue
FROM Sampled A
JOIN Sampled B
ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05
AND A.ID != B.ID; -- Exclude self-comparisons
This query identifies pairs of similar values but does not resolve them into distinct clusters.
Step 2: Aggregate Similar Values into Candidate Clusters
To form clusters, group values by an "anchor" value and count how many values are similar to it:
WITH CandidateClusters AS (
SELECT
A.ID AS AnchorID,
A.Value AS AnchorValue,
COUNT(*) AS ClusterSize,
AVG(B.Value) AS ClusterAverage,
MIN(B.Value) AS ClusterMin,
MAX(B.Value) AS ClusterMax
FROM Sampled A
JOIN Sampled B
ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05
GROUP BY A.ID
)
SELECT
ClusterSize,
ClusterAverage,
ClusterMin,
ClusterMax
FROM CandidateClusters
ORDER BY ClusterSize DESC;
Limitation: This approach produces overlapping clusters. For example, if IDs 1, 2, and 3 are mutually similar, the query will list three clusters (anchored at each ID), all including the same three values.
Step 3: Deduplicate Overlapping Clusters
To avoid redundant clusters, use DISTINCT
or rank clusters by size and uniqueness:
Using DISTINCT:
WITH Clusters AS (
SELECT
A.ID,
COUNT(*) AS ClusterSize,
AVG(B.Value) AS ClusterAverage,
MIN(B.Value) AS ClusterMin,
MAX(B.Value) AS ClusterMax
FROM Sampled A
JOIN Sampled B
ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05
GROUP BY A.ID
)
SELECT DISTINCT
ClusterSize,
ClusterAverage,
ClusterMin,
ClusterMax
FROM Clusters
ORDER BY ClusterSize DESC;
Drawback: This only merges clusters with identical size, average, min, and max. Clusters that are permutations of the same group (e.g., anchored at different IDs) will still appear as separate entries.
Using Window Functions for Ranking:
WITH RankedClusters AS (
SELECT
A.ID,
COUNT(*) AS ClusterSize,
AVG(B.Value) AS ClusterAverage,
MIN(B.Value) AS ClusterMin,
MAX(B.Value) AS ClusterMax,
ROW_NUMBER() OVER (PARTITION BY COUNT(*) ORDER BY A.ID) AS Rank
FROM Sampled A
JOIN Sampled B
ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05
GROUP BY A.ID
)
SELECT
ClusterSize,
ClusterAverage,
ClusterMin,
ClusterMax
FROM RankedClusters
WHERE Rank = 1
ORDER BY ClusterSize DESC;
This ranks clusters by size and selects only one representative per size, reducing duplication.
Step 4: Handle Zero Values Appropriately
To avoid division by zero and nonsensical ranges around zero:
- Exclude zero values from percentage comparisons.
- Treat zero as a special case with an absolute tolerance (e.g., ±0.05).
Example Fix:
JOIN Sampled B ON
(
A.Value = 0 AND ABS(B.Value) <= 0.05
) OR (
A.Value != 0 AND ABS(B.Value - A.Value) / A.Value < 0.05
)
Step 5: Resolve Non-Transitive Chains with Iterative Clustering
For datasets where chains of similarity exist (A~B, B~C, but A≁C), use a recursive CTE to merge overlapping clusters:
WITH RECURSIVE ClusterGroups AS (
SELECT
ID AS RootID,
ID AS MemberID,
Value AS MemberValue
FROM Sampled
UNION ALL
SELECT
cg.RootID,
s.ID AS MemberID,
s.Value AS MemberValue
FROM ClusterGroups cg
JOIN Sampled s
ON ABS(s.Value - cg.MemberValue) / NULLIF(cg.MemberValue, 0) < 0.05
AND s.ID NOT IN (SELECT MemberID FROM ClusterGroups WHERE RootID = cg.RootID)
)
SELECT
RootID,
COUNT(*) AS ClusterSize,
AVG(MemberValue) AS ClusterAverage,
MIN(MemberValue) AS ClusterMin,
MAX(MemberValue) AS ClusterMax
FROM ClusterGroups
GROUP BY RootID
ORDER BY ClusterSize DESC;
This recursively expands each cluster to include all transitively similar values.
Step 6: Predefined Categories for Deterministic Grouping
If ad hoc clustering is insufficient, predefine category boundaries and assign values to the nearest category:
WITH Categories AS (
SELECT 20.0 AS CategoryLower, 21.0 AS CategoryUpper
UNION ALL
SELECT 21.0, 22.05
UNION ALL
SELECT 22.05, 23.1525
-- Continue for additional categories
)
SELECT
c.CategoryLower,
c.CategoryUpper,
COUNT(s.Value) AS ValuesInCategory,
AVG(s.Value) AS CategoryAverage,
MIN(s.Value) AS CategoryMin,
MAX(s.Value) AS CategoryMax
FROM Categories c
LEFT JOIN Sampled s
ON s.Value >= c.CategoryLower
AND s.Value < c.CategoryUpper
GROUP BY c.CategoryLower, c.CategoryUpper;
This ensures non-overlapping clusters but requires prior knowledge of value ranges.
Step 7: Validation and Edge Case Testing
Test the solution against edge cases:
Zero and Near-Zero Values:
INSERT INTO Sampled (Value) VALUES (0.0), (0.04), (-0.05);
Ensure these are grouped correctly without arithmetic errors.
Chains of Similarity:
INSERT INTO Sampled (Value) VALUES (20.0), (21.0), (22.05), (23.1525);
Verify whether these form a single cluster or multiple overlapping ones.
Large Value Spreads:
INSERT INTO Sampled (Value) VALUES (1000.0), (1050.0), (1102.5);
Confirm that the 5% tolerance is applied relative to each value.
Final Solution Code
Combining deduplication, zero handling, and ranking:
WITH Clusters AS (
SELECT
A.ID,
COUNT(*) AS ClusterSize,
AVG(B.Value) AS ClusterAverage,
MIN(B.Value) AS ClusterMin,
MAX(B.Value) AS ClusterMax,
ROW_NUMBER() OVER (
PARTITION BY COUNT(*)
ORDER BY A.ID
) AS Rank
FROM Sampled A
JOIN Sampled B ON
(
A.Value = 0 AND ABS(B.Value) <= 0.05
) OR (
A.Value != 0 AND ABS(B.Value - A.Value) / A.Value < 0.05
)
GROUP BY A.ID
)
SELECT
ClusterSize,
ClusterAverage,
ClusterMin,
ClusterMax
FROM Clusters
WHERE Rank = 1
ORDER BY ClusterSize DESC;
This approach balances performance with accuracy, providing a pragmatic solution for identifying value clusters within a 5% tolerance in SQLite.