Validating Numeric Latitude/Longitude Stored as Text in SQLite
Understanding the Core Challenge of Filtering Valid Geographic Coordinates in Text Columns
The task of filtering valid latitude and longitude values stored as TEXT columns in SQLite involves addressing two critical issues:
- Ensuring that the stored values are actually numeric (integers or real numbers) despite being stored as text
- Validating that these numeric values fall within the geographic bounds of -90 to 90 for latitude and -180 to 180 for longitude
This becomes complex due to SQLite’s type affinity system, which automatically converts values to the column’s declared type (TEXT in this case) during storage and performs implicit casting during query operations. The discussion reveals multiple failed attempts to solve this problem through basic casting, typeof checks, and GLOB patterns – failures rooted in fundamental misunderstandings of SQLite’s type handling mechanics. We will dissect these issues through three lenses: the structural limitations of SQLite’s type system, the pitfalls of implicit casting, and robust solutions using advanced SQLite features.
Structural Limitations of Text-Based Coordinate Storage and Implicit Casting
1. Type Affinity vs. Storage Class Mismatch
SQLite uses dynamic typing where any column (except INTEGER PRIMARY KEY) can store any data type. However, type affinity influences how values are stored and retrieved. When a column has TEXT affinity:
- Numeric values inserted without quotes are converted to TEXT (e.g., INSERT INTO t VALUES (-3, 45.04) stores '-3' and '45.04' as text)
- Values with non-numeric characters (e.g., '6°') remain as text but may be partially converted during queries
This creates a critical problem: numeric operations like ABS(lat) implicitly cast text values to REAL/INTEGER, silently converting invalid text like '6°' to 6.0 without warning. The database engine doesn't distinguish between "clean" numeric text ('-3') and "dirty" numeric text ('6°') during casting: both become numbers, but only one is actually valid.
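This silent conversion is easy to reproduce. A minimal sketch using Python's built-in sqlite3 module (the table name t follows the earlier examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (lat TEXT, lng TEXT)")
conn.execute("INSERT INTO t VALUES ('6°', '45.04')")

# CAST (and numeric functions like ABS) parse the longest numeric
# prefix of the text and silently discard the rest:
dirty = conn.execute("SELECT CAST(lat AS REAL) FROM t").fetchone()[0]
clean = conn.execute("SELECT CAST(lng AS REAL) FROM t").fetchone()[0]
```

Both dirty and clean come back as plain floats; nothing signals that the degree sign was dropped.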
2. Regex Pattern Limitations in String Validation
The initial attempt to use GLOB '*[0-9]*' fails because it only checks for the presence of digits, not the exclusion of non-digit characters. For example:
- ‘6°’ contains digits but also invalid characters
- ‘-34.5-6.23’ has multiple minus signs and decimal points, making it non-numeric despite containing digits
This highlights the need for exact pattern matching rather than partial digit detection. Basic string matching functions like GLOB and LIKE lack the precision required for strict numeric validation.
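The failure mode is easy to demonstrate with the built-in sqlite3 module, using the sample values above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# GLOB '*[0-9]*' only asks "does the string contain a digit anywhere?",
# so every value but 'abc' matches:
matches = [
    conn.execute("SELECT ? GLOB '*[0-9]*'", (v,)).fetchone()[0]
    for v in ("-3", "6°", "-34.5-6.23", "abc")
]
```

Three of the four values pass the filter even though only '-3' is a valid number.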
3. typeof() Function Shortcomings
Using typeof(lat) = 'integer' fails because:
- All values inserted into TEXT columns are stored as TEXT storage class
- Even numeric-looking values like "-3" are stored as TEXT
- typeof() returns "text" for these values regardless of their content
Thus, typeof() cannot distinguish between "123" (textual number) and "abc" (non-numeric text) when stored in a TEXT column – both return "text".
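A short sqlite3 session confirms this: every row of a TEXT column reports the same storage class, even when the value was inserted as a numeric literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (lat TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("-3",), ("abc",)])
conn.execute("INSERT INTO t VALUES (-3)")  # numeric literal, TEXT affinity applies

# typeof() reflects the storage class, not the content
types = [row[0] for row in conn.execute("SELECT typeof(lat) FROM t")]
```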
Strategies for Robust Coordinate Validation in Text Columns
1. Regular Expression-Based Content Validation
The most reliable method involves using SQLite’s REGEXP operator with carefully crafted patterns to identify properly formatted numbers. The regex pattern must account for all valid numeric formats:
- Optional leading/trailing whitespace
- Optional sign (+/-)
- Integer or decimal components (e.g., 123, 123.45, .45)
- Scientific notation (e.g., 1.2e-3)
Regex Pattern Breakdown:
^ * # Leading whitespace
(\+|-)? # Optional sign
(
\d+\.?\d* # Integer with optional decimal (e.g., 123 or 123.45)
| # OR
\d*\.?\d+ # Decimal starting with . (e.g., .45)
)
([eE](\+|-)?\d+)? # Optional exponent part
*$ # Trailing whitespace
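The breakdown above can be sanity-checked against sample values with Python's re module, whose fullmatch plays the role of the anchored match:

```python
import re

# Same pattern as in the breakdown above
NUMERIC = re.compile(r'^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$')

samples = ["-3", "45.04", ".45", "1.2e-3", "6°", "-34.5-6.23", ""]
results = {s: bool(NUMERIC.fullmatch(s)) for s in samples}
```

The four well-formed numbers match; '6°', the double-signed string, and the empty string are all rejected.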
Implementation Example:
SELECT lat, lng
FROM (
SELECT
CAST(lat AS REAL) AS lat,
CAST(lng AS REAL) AS lng
FROM t
WHERE
lat REGEXP '^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$' AND
lng REGEXP '^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$'
)
WHERE
ABS(lat) <= 90 AND
ABS(lng) <= 180;
Key Requirements:
- The regexp extension must be available (e.g., loaded via the load_extension() SQL function, the CLI's .load command, or compiled in)
- Casting occurs after regex validation to avoid errors from invalid text
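Since most SQLite builds ship without a regexp() implementation, the query above can also be run from application code by registering one first. A sketch using Python's sqlite3 module, with a toy table matching this section's examples:

```python
import re
import sqlite3

NUMERIC = r'^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$'

def regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" as regexp(Y, X), i.e. (pattern, value)
    return value is not None and re.fullmatch(pattern, str(value)) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp, deterministic=True)
conn.execute("CREATE TABLE t (lat TEXT, lng TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("-3", "45.04"), ("6°", "10"), ("91", "0")])

valid = conn.execute("""
    SELECT lat, lng FROM (
        SELECT CAST(lat AS REAL) AS lat, CAST(lng AS REAL) AS lng
        FROM t
        WHERE lat REGEXP ? AND lng REGEXP ?
    )
    WHERE ABS(lat) <= 90 AND ABS(lng) <= 180
""", (NUMERIC, NUMERIC)).fetchall()
```

Only the first row survives: '6°' fails the regex, and latitude 91 fails the bounds check.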
2. Generated Columns for Persistent Validation
For frequently queried data, using stored generated columns automates validation and improves query performance. This approach creates auxiliary REAL columns that store the cast values only when the text is valid:
CREATE TABLE t (
lat TEXT,
lng TEXT,
rlat REAL GENERATED ALWAYS AS (
CASE WHEN lat REGEXP '^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$'
THEN CAST(lat AS REAL)
ELSE NULL
END
) STORED,
rlng REAL GENERATED ALWAYS AS (
CASE WHEN lng REGEXP '^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$'
THEN CAST(lng AS REAL)
ELSE NULL
END
) STORED,
valid INTEGER GENERATED ALWAYS AS (
(ABS(rlat) <= 90 AND ABS(rlng) <= 180) IS TRUE
) STORED
);
Advantages:
- Automatic maintenance: rlat/rlng update when lat/lng change
- Efficient querying:
SELECT * FROM t WHERE valid = 1;
- Clear data profiling:
SELECT COUNT(*) FROM t WHERE valid = 0;
Implementation Notes:
- The STORED keyword ensures values are physically stored (generated columns require SQLite 3.31+)
- IS TRUE converts the boolean result to 1 (true) or 0 (false), avoiding NULLs from invalid rlat/rlng
- The regexp function must be registered as deterministic on every connection before the schema is used
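The schema can be exercised end to end from Python. This sketch assumes a default build (trusted schema enabled), since application-defined functions inside generated columns are only permitted when registered as deterministic on a trusting connection:

```python
import re
import sqlite3

NUMERIC = r'^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$'

def regexp(pattern, value):
    return value is not None and re.fullmatch(pattern, str(value)) is not None

conn = sqlite3.connect(":memory:")
# Must exist (and be deterministic) before the schema is used
conn.create_function("regexp", 2, regexp, deterministic=True)
conn.execute(f"""
    CREATE TABLE t (
        lat TEXT, lng TEXT,
        rlat REAL GENERATED ALWAYS AS (
            CASE WHEN lat REGEXP '{NUMERIC}' THEN CAST(lat AS REAL) END
        ) STORED,
        rlng REAL GENERATED ALWAYS AS (
            CASE WHEN lng REGEXP '{NUMERIC}' THEN CAST(lng AS REAL) END
        ) STORED,
        valid INTEGER GENERATED ALWAYS AS (
            (ABS(rlat) <= 90 AND ABS(rlng) <= 180) IS TRUE
        ) STORED
    )
""")
conn.executemany("INSERT INTO t (lat, lng) VALUES (?, ?)",
                 [("-3", "45.04"), ("6°", "10"), ("91", "0")])
flags = [row[0] for row in conn.execute("SELECT valid FROM t")]
```

The malformed latitude and the out-of-bounds latitude both yield valid = 0, as the IS TRUE note predicts.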
3. Application-Layer Input Sanitization
While SQL-based solutions work, input validation should ideally occur before data reaches the database:
Python Example:
import re

def is_valid_coord(value, is_latitude=True):
    pattern = r'^ *[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)? *$'
    if not re.fullmatch(pattern, str(value)):
        return False
    num = float(value)
    if is_latitude:
        return -90 <= num <= 90
    else:
        return -180 <= num <= 180

# Usage:
if is_valid_coord(lat_input, is_latitude=True) and is_valid_coord(lng_input, is_latitude=False):
    insert_into_db(lat_input, lng_input)
else:
    handle_invalid_input()
Benefits:
- Prevents invalid data from entering the database
- Allows custom error handling/logging for bad inputs
- Reduces database-side computation overhead
Comprehensive Implementation Guide with Edge Case Handling
Step 1: Enable REGEXP Support
SQLite doesn’t include REGEXP by default. Enable it using one of these methods:
Method A: Load Extension at Runtime (SQLite CLI)
.load /path/to/regexp
Method B: Compile with REGEXP Support
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <pcre.h> /* PCRE library headers */
/* ... (Implement regexp_func using PCRE) ... */
int sqlite3_extension_init(
sqlite3 *db,
char **pzErrMsg,
const sqlite3_api_routines *pApi
){
SQLITE_EXTENSION_INIT2(pApi);
sqlite3_create_function(db, "regexp", 2, SQLITE_UTF8, 0, regexp_func, 0, 0);
return SQLITE_OK;
}
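Method C: Register from Application Code
If loading or compiling a C extension is impractical, most language drivers can register the function directly on the connection. A Python sketch (the example pattern is illustrative; this variant uses re.search, so patterns should be explicitly anchored as in this article):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function(
    "regexp", 2,
    # SQLite calls regexp(pattern, value) for "value REGEXP pattern"
    lambda pattern, value: value is not None
        and re.search(pattern, str(value)) is not None,
    deterministic=True,
)
hit = conn.execute("SELECT '45.04' REGEXP '^[0-9.]+$'").fetchone()[0]
miss = conn.execute("SELECT '6°' REGEXP '^[0-9.]+$'").fetchone()[0]
```

This enables the REGEXP operator for the lifetime of the connection; it must be re-registered on every new connection.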
Step 2: Create Optimized Table Schema
Combine generated columns with indexes for optimal performance:
CREATE TABLE geographic_data (
raw_lat TEXT CHECK (length(raw_lat) <= 32), -- Optional length constraint
raw_lng TEXT CHECK (length(raw_lng) <= 32),
clean_lat REAL GENERATED ALWAYS AS (
CASE WHEN raw_lat REGEXP '^ *[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)? *$'
THEN CAST(raw_lat AS REAL)
ELSE NULL END
) STORED,
clean_lng REAL GENERATED ALWAYS AS (
CASE WHEN raw_lng REGEXP '^ *[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)? *$'
THEN CAST(raw_lng AS REAL)
ELSE NULL END
) STORED,
is_valid INTEGER GENERATED ALWAYS AS (
(ABS(clean_lat) <= 90 AND ABS(clean_lng) <= 180) IS TRUE
) STORED,
CHECK (typeof(raw_lat) = 'text' AND typeof(raw_lng) = 'text')
);
CREATE INDEX idx_valid_coords ON geographic_data(is_valid) WHERE is_valid = 1;
Schema Features:
- Explicit text type checks on raw inputs
- Index filtered to valid entries only
- Length constraints to prevent excessively long values
Step 3: Query Patterns for Different Use Cases
Case 1: Retrieve All Valid Coordinates
SELECT raw_lat, raw_lng, clean_lat, clean_lng
FROM geographic_data
WHERE is_valid = 1;
Case 2: Find Invalid Entries with Diagnostic Info
SELECT
raw_lat,
raw_lng,
CASE
WHEN clean_lat IS NULL THEN 'Invalid latitude format'
WHEN ABS(clean_lat) > 90 THEN 'Latitude out of bounds'
ELSE ''
END || '; ' ||
CASE
WHEN clean_lng IS NULL THEN 'Invalid longitude format'
WHEN ABS(clean_lng) > 180 THEN 'Longitude out of bounds'
ELSE ''
END AS diagnostic
FROM geographic_data
WHERE is_valid = 0;
Case 3: Statistical Analysis of Data Quality
SELECT
COUNT(*) AS total,
SUM(is_valid) AS valid_count,
SUM(CASE WHEN clean_lat IS NULL THEN 1 ELSE 0 END) AS invalid_lat_format,
SUM(CASE WHEN clean_lng IS NULL THEN 1 ELSE 0 END) AS invalid_lng_format,
SUM(CASE WHEN clean_lat IS NOT NULL AND ABS(clean_lat) > 90 THEN 1 ELSE 0 END) AS out_of_bounds_lat,
SUM(CASE WHEN clean_lng IS NOT NULL AND ABS(clean_lng) > 180 THEN 1 ELSE 0 END) AS out_of_bounds_lng
FROM geographic_data;
Step 4: Handling Scientific Notation and Localization
The regex pattern already accounts for scientific notation (e.g., 1.2E-3), but localization issues like comma decimal separators require preprocessing:
Preprocessing Comma Decimals:
UPDATE geographic_data
SET raw_lat = REPLACE(raw_lat, ',', '.'),
raw_lng = REPLACE(raw_lng, ',', '.')
WHERE raw_lat LIKE '%,%' OR raw_lng LIKE '%,%';
Handling Thousand Separators:
UPDATE geographic_data
SET raw_lat = REPLACE(raw_lat, '''', ''), -- Swiss-style ' as thousand sep
raw_lng = REPLACE(raw_lng, '''', '')
WHERE raw_lat LIKE '%''%' OR raw_lng LIKE '%''%';
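The two normalizations nest straightforwardly: strip the thousand separator first, then fix the decimal comma (the sample value is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In a SQL string literal, '''' denotes a single apostrophe;
# binding the value avoids further quoting headaches.
normalized = conn.execute(
    "SELECT REPLACE(REPLACE(?, '''', ''), ',', '.')",
    ("1'234,56",),
).fetchone()[0]
```

After normalization the value passes the numeric regex and casts cleanly to REAL.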
Step 5: Performance Optimization Techniques
1. Precomputed Validation Columns
The generated is_valid column enables fast filtering without runtime computation.
2. Partial Indexes
CREATE INDEX idx_invalid_lat ON geographic_data(raw_lat)
WHERE clean_lat IS NULL OR ABS(clean_lat) > 90;
CREATE INDEX idx_invalid_lng ON geographic_data(raw_lng)
WHERE clean_lng IS NULL OR ABS(clean_lng) > 180;
3. Materialized Views for Frequent Access
For read-heavy applications:
CREATE TABLE valid_coords_mv AS
SELECT raw_lat, raw_lng, clean_lat, clean_lng
FROM geographic_data
WHERE is_valid = 1;
CREATE TRIGGER trg_geo_data_insert
AFTER INSERT ON geographic_data
WHEN NEW.is_valid = 1
BEGIN
INSERT INTO valid_coords_mv VALUES (NEW.raw_lat, NEW.raw_lng, NEW.clean_lat, NEW.clean_lng);
END;
Advanced Edge Cases and Mitigation Strategies
1. Hexadecimal and Other Base Representations
Some systems may store coordinates in hexadecimal (e.g., 0x1A). Note that SQLite's CAST does not understand the 0x prefix in text (CAST('0x1A' AS INTEGER) yields 0, parsing only the leading '0'), so these rows cannot be fixed with a simple CAST; flag them for application-side conversion instead:
SELECT rowid, raw_lat, raw_lng
FROM geographic_data
WHERE raw_lat GLOB '0x*' OR raw_lng GLOB '0x*';
2. Degree-Minute-Second (DMS) Formats
While ideally handled during input sanitization, DMS values like '40°26′46″N' can be converted with an application-defined function. SQLite has no CREATE FUNCTION statement; the function must be registered through the API (sqlite3_create_function in C, or create_function in most drivers) before it can be used in SQL:
UPDATE geographic_data
SET raw_lat = dms_to_decimal(raw_lat),
    raw_lng = dms_to_decimal(raw_lng)
WHERE raw_lat GLOB '*°*' OR raw_lng GLOB '*°*';
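A possible Python implementation of such a function (the name dms_to_decimal and the accepted format are illustrative; real inputs may use other quote characters or orderings):

```python
import re

def dms_to_decimal(dms):
    """Parse '40°26′46″N'-style strings into signed decimal degrees."""
    m = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)°"           # degrees
        r"(?:(\d+(?:\.\d+)?)[′'])?"      # optional minutes
        r"(?:(\d+(?:\.\d+)?)[″\"])?"     # optional seconds
        r"\s*([NSEW])?\s*",              # optional hemisphere letter
        str(dms),
    )
    if m is None:
        return None  # leaves unparseable values visible as NULL
    deg = float(m.group(1))
    deg += float(m.group(2) or 0) / 60
    deg += float(m.group(3) or 0) / 3600
    return -deg if m.group(4) in ("S", "W") else deg
```

Registered with conn.create_function("dms_to_decimal", 1, dms_to_decimal), it can back the UPDATE statement above.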
3. Handling Overflow Values
Extremely large values lose integer precision when converted to REAL (an 8-byte IEEE float, exact only up to 2^53), so they need special handling. A helper column can flag values that round-trip exactly as integers (note that ALTER TABLE can only add VIRTUAL generated columns, not STORED ones):
ALTER TABLE geographic_data ADD COLUMN lat_is_integer INTEGER
GENERATED ALWAYS AS (CAST(raw_lat AS INTEGER) == CAST(raw_lat AS REAL)) VIRTUAL;
SELECT *
FROM geographic_data
WHERE lat_is_integer = 1 AND ABS(CAST(raw_lat AS INTEGER)) <= 90;
Final Recommendations for Production Systems
- Input Validation Layer: Implement strict validation before data enters the database to minimize cleanup needs.
- Two-Table Approach: Store raw inputs in one table and sanitized coordinates in another, linked by foreign key.
- Regular Expression Caching: Precompile regex patterns in application code for faster validation.
- Versioning: Include a data_quality_version column to track changes in validation logic over time.
- Monitoring: Set up alerts for sudden increases in invalid entries indicating upstream data issues.
By combining rigorous regex validation, generated columns for cleaned data, and application-layer sanitization, developers can maintain both raw and sanitized coordinate data efficiently in SQLite while ensuring geographic validity.