Identifying Value Clusters Within 5% Tolerance in SQLite

Value Grouping by Percentage Proximity: Challenges and Resolutions

Determining Similarity Through Relative Percentage Thresholds

Issue Overview

The core challenge revolves around identifying groups of numerical values in a SQLite table where each member of a group is within ±5% of at least one other member. The original dataset contains a Sampled table with ID (auto-incrementing integer) and Value (real number) columns. The goal is to aggregate values into clusters where each value is within 5% of another value in the same cluster.

For example, given values like 20.1, 20.4, and 20.2, these would form a cluster because each value is within 5% of the others. However, larger values like 100.2 or outliers like 25.1 would form singleton clusters. The desired output includes:

  • The number of values in each cluster (MostSimilar count).
  • The average of the clustered values.
  • Optional: The minimum and maximum values in each cluster.

The complexity arises from the non-transitive nature of percentage-based similarity. If Value A is within 5% of Value B, and Value B is within 5% of Value C, it does not guarantee that Value A is within 5% of Value C. This creates overlapping clusters and ambiguous groupings, making it impossible to define a unique "correct" grouping without additional constraints.

Mathematical and Logical Constraints

  1. Non-Transitive Relationships:
    Percentage-based similarity violates the transitive property. For example:

    • 20.0 and 21.0 are similar (21.0 is 5% greater than 20.0).
    • 21.0 and 22.05 are similar (22.05 is 5% greater than 21.0).
    • However, 20.0 and 22.05 are not similar (22.05 is 10.25% greater than 20.0).
      This leads to chains of similarity that do not form a single coherent cluster.
  2. Zero Handling:
    Comparing values near zero requires special handling. A 5% range around zero (e.g., -5% to +5%) collapses to a tiny interval (e.g., -0.05 to +0.05), which is often impractical. Worse, dividing by zero (when a value is zero) is undefined.

  3. Self-Joins and Duplicate Groups:
    Naive self-joins between the table and itself can produce duplicate clusters. For instance, grouping by A.ID and counting matches in B might yield multiple clusters that are permutations of the same group (e.g., cluster centered at ID 2 vs. ID 7 in the extended dataset).

Troubleshooting Steps, Solutions & Fixes

Step 1: Define the Similarity Metric Rigorously

The first step is to formalize the similarity condition. Two values A and B are considered similar if:

ABS(B - A) / NULLIF(A, 0) < 0.05  

This avoids division by zero when A is zero. However, this definition still allows chains of similarity that defy intuitive clustering.

Example Fix:

SELECT 
  A.ID AS AnchorID,
  B.ID AS SimilarID,
  A.Value AS AnchorValue,
  B.Value AS SimilarValue
FROM Sampled A  
JOIN Sampled B  
  ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05  
  AND A.ID != B.ID;  -- Exclude self-comparisons  

This query identifies pairs of similar values but does not resolve them into distinct clusters.

Step 2: Aggregate Similar Values into Candidate Clusters

To form clusters, group values by an "anchor" value and count how many values are similar to it:

WITH CandidateClusters AS (  
  SELECT  
    A.ID AS AnchorID,  
    A.Value AS AnchorValue,  
    COUNT(*) AS ClusterSize,  
    AVG(B.Value) AS ClusterAverage,  
    MIN(B.Value) AS ClusterMin,  
    MAX(B.Value) AS ClusterMax  
  FROM Sampled A  
  JOIN Sampled B  
    ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05  
  GROUP BY A.ID  
)  
SELECT  
  ClusterSize,  
  ClusterAverage,  
  ClusterMin,  
  ClusterMax  
FROM CandidateClusters  
ORDER BY ClusterSize DESC;  

Limitation: This approach produces overlapping clusters. For example, if IDs 1, 2, and 3 are mutually similar, the query will list three clusters (anchored at each ID), all including the same three values.

Step 3: Deduplicate Overlapping Clusters

To avoid redundant clusters, use DISTINCT or rank clusters by size and uniqueness:

Using DISTINCT:

WITH Clusters AS (  
  SELECT  
    A.ID,  
    COUNT(*) AS ClusterSize,  
    AVG(B.Value) AS ClusterAverage,  
    MIN(B.Value) AS ClusterMin,  
    MAX(B.Value) AS ClusterMax  
  FROM Sampled A  
  JOIN Sampled B  
    ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05  
  GROUP BY A.ID  
)  
SELECT DISTINCT  
  ClusterSize,  
  ClusterAverage,  
  ClusterMin,  
  ClusterMax  
FROM Clusters  
ORDER BY ClusterSize DESC;  

Drawback: This only merges clusters with identical size, average, min, and max. Clusters that are permutations of the same group (e.g., anchored at different IDs) will still appear as separate entries.

Using Window Functions for Ranking:

WITH RankedClusters AS (  
  SELECT  
    A.ID,  
    COUNT(*) AS ClusterSize,  
    AVG(B.Value) AS ClusterAverage,  
    MIN(B.Value) AS ClusterMin,  
    MAX(B.Value) AS ClusterMax,  
    ROW_NUMBER() OVER (PARTITION BY COUNT(*) ORDER BY A.ID) AS Rank  
  FROM Sampled A  
  JOIN Sampled B  
    ON ABS(B.Value - A.Value) / NULLIF(A.Value, 0) < 0.05  
  GROUP BY A.ID  
)  
SELECT  
  ClusterSize,  
  ClusterAverage,  
  ClusterMin,  
  ClusterMax  
FROM RankedClusters  
WHERE Rank = 1  
ORDER BY ClusterSize DESC;  

This ranks clusters by size and selects only one representative per size, reducing duplication.

Step 4: Handle Zero Values Appropriately

To avoid division by zero and nonsensical ranges around zero:

  1. Exclude zero values from percentage comparisons.
  2. Treat zero as a special case with an absolute tolerance (e.g., ±0.05).

Example Fix:

JOIN Sampled B ON  
  (  
    A.Value = 0 AND ABS(B.Value) <= 0.05  
  ) OR (  
    A.Value != 0 AND ABS(B.Value - A.Value) / A.Value < 0.05  
  )  

Step 5: Resolve Non-Transitive Chains with Iterative Clustering

For datasets where chains of similarity exist (A~B, B~C, but A≁C), use a recursive CTE to merge overlapping clusters:

WITH RECURSIVE ClusterGroups AS (  
  SELECT  
    ID AS RootID,  
    ID AS MemberID,  
    Value AS MemberValue  
  FROM Sampled  
  UNION ALL  
  SELECT  
    cg.RootID,  
    s.ID AS MemberID,  
    s.Value AS MemberValue  
  FROM ClusterGroups cg  
  JOIN Sampled s  
    ON ABS(s.Value - cg.MemberValue) / NULLIF(cg.MemberValue, 0) < 0.05  
    AND s.ID NOT IN (SELECT MemberID FROM ClusterGroups WHERE RootID = cg.RootID)  
)  
SELECT  
  RootID,  
  COUNT(*) AS ClusterSize,  
  AVG(MemberValue) AS ClusterAverage,  
  MIN(MemberValue) AS ClusterMin,  
  MAX(MemberValue) AS ClusterMax  
FROM ClusterGroups  
GROUP BY RootID  
ORDER BY ClusterSize DESC;  

This recursively expands each cluster to include all transitively similar values.

Step 6: Predefined Categories for Deterministic Grouping

If ad hoc clustering is insufficient, predefine category boundaries and assign values to the nearest category:

WITH Categories AS (  
  SELECT 20.0 AS CategoryLower, 21.0 AS CategoryUpper  
  UNION ALL  
  SELECT 21.0, 22.05  
  UNION ALL  
  SELECT 22.05, 23.1525  
  -- Continue for additional categories  
)  
SELECT  
  c.CategoryLower,  
  c.CategoryUpper,  
  COUNT(s.Value) AS ValuesInCategory,  
  AVG(s.Value) AS CategoryAverage,  
  MIN(s.Value) AS CategoryMin,  
  MAX(s.Value) AS CategoryMax  
FROM Categories c  
LEFT JOIN Sampled s  
  ON s.Value >= c.CategoryLower  
  AND s.Value < c.CategoryUpper  
GROUP BY c.CategoryLower, c.CategoryUpper;  

This ensures non-overlapping clusters but requires prior knowledge of value ranges.

Step 7: Validation and Edge Case Testing

Test the solution against edge cases:

  1. Zero and Near-Zero Values:

    INSERT INTO Sampled (Value) VALUES (0.0), (0.04), (-0.05);  
    

    Ensure these are grouped correctly without arithmetic errors.

  2. Chains of Similarity:

    INSERT INTO Sampled (Value) VALUES (20.0), (21.0), (22.05), (23.1525);  
    

    Verify whether these form a single cluster or multiple overlapping ones.

  3. Large Value Spreads:

    INSERT INTO Sampled (Value) VALUES (1000.0), (1050.0), (1102.5);  
    

    Confirm that the 5% tolerance is applied relative to each value.

Final Solution Code

Combining deduplication, zero handling, and ranking:

WITH Clusters AS (  
  SELECT  
    A.ID,  
    COUNT(*) AS ClusterSize,  
    AVG(B.Value) AS ClusterAverage,  
    MIN(B.Value) AS ClusterMin,  
    MAX(B.Value) AS ClusterMax,  
    ROW_NUMBER() OVER (  
      PARTITION BY COUNT(*)  
      ORDER BY A.ID  
    ) AS Rank  
  FROM Sampled A  
  JOIN Sampled B ON  
    (  
      A.Value = 0 AND ABS(B.Value) <= 0.05  
    ) OR (  
      A.Value != 0 AND ABS(B.Value - A.Value) / A.Value < 0.05  
    )  
  GROUP BY A.ID  
)  
SELECT  
  ClusterSize,  
  ClusterAverage,  
  ClusterMin,  
  ClusterMax  
FROM Clusters  
WHERE Rank = 1  
ORDER BY ClusterSize DESC;  

This approach balances performance with accuracy, providing a pragmatic solution for identifying value clusters within a 5% tolerance in SQLite.

Related Guides

Leave a Reply

Your email address will not be published. Required fields are marked *