Optimizing SQLite Database Design for Large-Scale Stock Historical Data

Understanding the Data Structure and Initial Design Challenges

The core issue revolves around efficiently storing and querying a massive amount of stock historical data, specifically bid/ask quote tick data, in an SQLite database. The data is currently stored in a directory structure, with each directory named by date (e.g., 20180102) and containing over 4,000 CSV files, one per stock code. Each CSV file contains between 50,000 and 100,000 rows of tick data, each row having 7 columns. The user initially proposed three potential database designs, each with significant drawbacks, and sought advice on the most efficient and scalable approach.

The first proposed solution involved creating a database with 4,000 tables, each corresponding to a stock and containing all its historical data. The second and third solutions suggested creating either 4,000 databases (one per stock) or 750 databases (one per date), each containing multiple tables corresponding to the CSV files. These designs were quickly dismissed due to their impracticality and inefficiency, leading to a more streamlined approach: a single database with a single table, augmented with additional columns for date and stock code.

The primary challenge lies in handling the sheer volume of data: roughly 750 date directories × 4,000 stocks × ~70,000 ticks comes to approximately 210 billion rows if all tick data is imported. While SQLite is capable of handling datasets of this size, query efficiency, database size, and index management become critical considerations. Additionally, the user's occasional need to reprocess raw data for new strategies adds complexity, since it requires balancing the storage of raw tick data against summarized data for daily analysis.

Evaluating the Feasibility of a Single-Table Design and Its Implications

The single-table design, as suggested by the forum participants, is the most straightforward and efficient approach for this use case. By consolidating all data into one table with additional columns for date and stock code, the user can leverage SQLite’s indexing capabilities to perform fast and flexible queries. However, this design raises several important considerations:

  1. Database Size and Performance: With 210 billion rows, the database could grow to several terabytes, depending on the size of each row (a rough estimate follows this list). SQLite's default page size of 4 KB may not be optimal for such a large dataset, and raising it to the 64 KB maximum can improve I/O efficiency and allow for a larger database. However, this also increases the memory footprint and requires careful tuning.

  2. Indexing Strategy: Proper indexing is crucial for query performance. Indexes on the date and stock code columns will enable efficient filtering and sorting. However, creating indexes on such a large dataset requires significant storage space and can slow down data insertion. The user must balance the need for fast queries with the overhead of maintaining indexes.

  3. Data Import and Transactions: Importing such a large dataset requires careful management of transactions to ensure efficiency and data integrity. Importing each directory (date) in a single transaction, as suggested, minimizes the overhead of transaction management and ensures that the database remains consistent.

  4. Summarized Data vs. Raw Data: While storing raw tick data provides maximum flexibility, it may not be necessary for most analyses. Summarizing the data at the end of each day and storing only the summary results can drastically reduce the database size and improve query performance. The raw data can be archived or processed on-demand when needed for specific strategies.

  5. Long-Term Maintenance: Over time, the relevance of intra-day tick data diminishes, and it may be practical to archive or delete older data to keep the database size manageable. This requires a clear data retention policy and automated processes for data archiving.
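To put the size concern in item 1 in perspective, a back-of-envelope calculation is enough. The figure of roughly 45 bytes per encoded row is an assumption (actual row size depends on the values SQLite stores), so treat the result as an order-of-magnitude estimate:

# Rough size of the raw tick table, excluding indexes.
# Assumptions: ~750 trading days, ~4,000 stocks, ~70,000 ticks per stock per day,
# and an assumed average of ~45 bytes per encoded row.
rows = 750 * 4_000 * 70_000              # ~210 billion rows
bytes_per_row = 45                       # assumption; varies with the actual data
raw_size_tib = rows * bytes_per_row / 1024**4
print(f"{rows:,} rows -> roughly {raw_size_tib:.1f} TiB before indexes")
# Each index on date or stock_code adds a substantial fraction on top of this.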

Step-by-Step Guide to Implementing and Optimizing the Database

Step 1: Define the Table Schema

Start by defining a table schema that accommodates the raw tick data, along with additional columns for date and stock code. For example:

CREATE TABLE stock_ticks (
    date TEXT NOT NULL,
    stock_code TEXT NOT NULL,
    timestamp INTEGER NOT NULL,
    bid_price REAL,
    ask_price REAL,
    bid_size INTEGER,
    ask_size INTEGER,
    trade_price REAL,
    trade_size INTEGER
);

This schema includes columns for the date, stock code, and the seven fields from the CSV files. The date and stock_code columns will be used for filtering and grouping data.

Step 2: Optimize the Database Configuration

Before importing data, configure the database for optimal performance:

  1. Set the Page Size: Adjust the page size to 64 KB to support a larger database and improve I/O efficiency. This pragma only takes effect before the schema is created (changing the page size later requires a VACUUM and cannot be done while the database is in WAL mode), so issue it first; a Python sketch applying all three settings in the right order follows this list:

    PRAGMA page_size = 65536;
    
  2. Enable WAL Mode: Write-Ahead Logging (WAL) mode improves concurrency and performance for large datasets:

    PRAGMA journal_mode = WAL;
    
  3. Disable Synchronous Writes: For faster data import, disable synchronous writes (use with caution, as this risks data loss in case of a crash):

    PRAGMA synchronous = OFF;
    
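These settings interact: page_size must precede creation of the schema, journal_mode = WAL persists in the database file once set, and synchronous is a per-connection setting that the import script has to issue itself. A minimal Python sketch for initializing a fresh database, reusing the Step 1 schema (the file name is a placeholder):

import sqlite3

def init_db(db_path):
    # Intended for a brand-new (empty) database file.
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA page_size = 65536")    # must come before the schema is created
    conn.execute("PRAGMA journal_mode = WAL")   # persists in the database file
    conn.execute("PRAGMA synchronous = OFF")    # per-connection; repeat in the import script
    conn.execute("""
        CREATE TABLE IF NOT EXISTS stock_ticks (
            date        TEXT NOT NULL,
            stock_code  TEXT NOT NULL,
            timestamp   INTEGER NOT NULL,
            bid_price   REAL,
            ask_price   REAL,
            bid_size    INTEGER,
            ask_size    INTEGER,
            trade_price REAL,
            trade_size  INTEGER
        )
    """)
    conn.commit()
    conn.close()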

Step 3: Import Data Efficiently

Import the data in batches, with each directory (date) processed in a single transaction. Use Python to read the CSV files and insert the data into the database:

import sqlite3
import csv
import os

def import_csv_to_db(db_path, csv_dir):
    """Import every <date>/<stock_code>.csv under csv_dir into stock_ticks,
    committing once per date directory so each date is a single transaction."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("PRAGMA synchronous = OFF")  # per-connection setting from Step 2
    for root, dirs, files in os.walk(csv_dir):
        for file in files:
            if file.endswith('.csv'):
                stock_code = os.path.splitext(file)[0]   # strip .csv to get the stock code
                date = os.path.basename(root)            # directory name, e.g. "20180102"
                with open(os.path.join(root, file), newline='') as f:
                    reader = csv.reader(f)
                    # Prepend the date and stock code to the seven CSV fields.
                    cursor.executemany(
                        "INSERT INTO stock_ticks VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
                        [(date, stock_code) + tuple(row) for row in reader]
                    )
        conn.commit()  # one commit per date directory
    conn.close()

This script reads each CSV file, extracts the date and stock code, and inserts the data into the database in a single transaction per directory.
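Tying Steps 1 through 3 together, an invocation could look like the following; the database file name and data directory are placeholders rather than paths from the original setup:

if __name__ == "__main__":
    DB_PATH = "stock_ticks.db"     # placeholder database file
    DATA_ROOT = "/data/tick_csv"   # placeholder: parent of the 20180102-style date directories
    init_db(DB_PATH)               # from the Step 2 sketch above
    import_csv_to_db(DB_PATH, DATA_ROOT)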

Step 4: Create Indexes

After importing the data, create indexes on the date and stock_code columns to optimize query performance:

CREATE INDEX idx_date ON stock_ticks (date);
CREATE INDEX idx_stock_code ON stock_ticks (stock_code);

These indexes will speed up queries that filter by date or stock code.
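If most queries filter on both columns at once, as in the Step 7 example, a single composite index may serve better than two separate ones. This is an additional suggestion rather than part of the original advice:

import sqlite3

conn = sqlite3.connect("stock_ticks.db")  # placeholder path
# One composite index answers "WHERE date = ? AND stock_code = ?" lookups directly,
# and its leading date column still serves date-only filters.
conn.execute("CREATE INDEX IF NOT EXISTS idx_date_code ON stock_ticks (date, stock_code)")
conn.commit()
conn.close()

Queries that filter only on stock_code would still rely on idx_stock_code.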

Step 5: Summarize Data for Daily Analysis

To reduce the database size and improve query performance, create a summary table that stores end-of-day (EOD) data:

CREATE TABLE stock_summary (
    date TEXT NOT NULL,
    stock_code TEXT NOT NULL,
    open_price REAL,
    close_price REAL,
    high_price REAL,
    low_price REAL,
    volume INTEGER,
    PRIMARY KEY (date, stock_code)
);

Populate this table by aggregating the raw tick data. Open and close must come from the first and last trade of each day (not from MIN and MAX, which give the low and high), so the query pulls them out by timestamp:

INSERT INTO stock_summary
SELECT
    t.date,
    t.stock_code,
    (SELECT trade_price FROM stock_ticks s
       WHERE s.date = t.date AND s.stock_code = t.stock_code
       ORDER BY s.timestamp ASC LIMIT 1)  AS open_price,
    (SELECT trade_price FROM stock_ticks s
       WHERE s.date = t.date AND s.stock_code = t.stock_code
       ORDER BY s.timestamp DESC LIMIT 1) AS close_price,
    MAX(t.trade_price) AS high_price,
    MIN(t.trade_price) AS low_price,
    SUM(t.trade_size)  AS volume
FROM stock_ticks t
GROUP BY t.date, t.stock_code;

This summary table will be significantly smaller and faster to query for daily analysis.
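Because item 4 of the earlier considerations calls for summarizing at the end of each day, it may be more practical to populate stock_summary one date at a time, right after that date's import, instead of re-aggregating the whole table. A sketch of that approach, reusing the aggregation above with a date parameter:

import sqlite3

SUMMARY_SQL = """
INSERT OR REPLACE INTO stock_summary
SELECT
    t.date, t.stock_code,
    (SELECT trade_price FROM stock_ticks s
       WHERE s.date = t.date AND s.stock_code = t.stock_code
       ORDER BY s.timestamp ASC LIMIT 1),
    (SELECT trade_price FROM stock_ticks s
       WHERE s.date = t.date AND s.stock_code = t.stock_code
       ORDER BY s.timestamp DESC LIMIT 1),
    MAX(t.trade_price), MIN(t.trade_price), SUM(t.trade_size)
FROM stock_ticks t
WHERE t.date = ?
GROUP BY t.date, t.stock_code
"""

def summarize_date(db_path, date):
    # Summarize a single trading day, e.g. summarize_date("stock_ticks.db", "20180102").
    conn = sqlite3.connect(db_path)
    conn.execute(SUMMARY_SQL, (date,))
    conn.commit()
    conn.close()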

Step 6: Implement Data Retention and Archiving

To manage the database size over time, implement a data retention policy that archives or deletes old tick data. For example, you can archive data older than one year:

-- Archive old data
ATTACH DATABASE 'archive.db' AS archive;
CREATE TABLE archive.stock_ticks AS SELECT * FROM stock_ticks WHERE date < '20220101';
DETACH DATABASE archive;

-- Delete archived data
DELETE FROM stock_ticks WHERE date < '20220101';

This process can be automated using a script or scheduled task.
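As a sketch of that automation in Python, assuming the cutoff uses the same YYYYMMDD text format as the date column and a one-year retention window (the archive file name is arbitrary):

import sqlite3
from datetime import date, timedelta

def archive_old_ticks(db_path, archive_path, retention_days=365):
    # Cutoff in the same YYYYMMDD text format used by the date column.
    cutoff = (date.today() - timedelta(days=retention_days)).strftime("%Y%m%d")
    conn = sqlite3.connect(db_path)
    conn.execute("ATTACH DATABASE ? AS archive", (archive_path,))
    # Create an empty archive table with the same columns, then move old rows across.
    conn.execute("CREATE TABLE IF NOT EXISTS archive.stock_ticks AS "
                 "SELECT * FROM stock_ticks WHERE 0")
    conn.execute("INSERT INTO archive.stock_ticks SELECT * FROM stock_ticks WHERE date < ?", (cutoff,))
    conn.execute("DELETE FROM stock_ticks WHERE date < ?", (cutoff,))
    conn.commit()
    conn.execute("DETACH DATABASE archive")
    conn.close()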

Step 7: Monitor and Optimize Performance

Regularly monitor the database performance and optimize as needed:

  1. Vacuum the Database: Reclaim unused space and optimize storage:

    VACUUM;
    
  2. Analyze Query Performance: Use the EXPLAIN QUERY PLAN statement to analyze and optimize slow queries (a small helper that prints the plan follows this list):

    EXPLAIN QUERY PLAN SELECT * FROM stock_ticks WHERE date = '20210101' AND stock_code = 'AAPL';
    
  3. Adjust Indexes: Add or remove indexes based on query patterns to balance performance and storage overhead.
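For item 2, a small helper can print the plan so you can confirm that a query hits an index (a SEARCH ... USING INDEX step) rather than scanning the whole table. The database path is a placeholder and the query mirrors the example above:

import sqlite3

def show_plan(db_path, sql, params=()):
    # Print each step of SQLite's query plan for the given statement.
    conn = sqlite3.connect(db_path)
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params):
        print(row)
    conn.close()

show_plan(
    "stock_ticks.db",  # placeholder path
    "SELECT * FROM stock_ticks WHERE date = ? AND stock_code = ?",
    ("20210101", "AAPL"),
)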

By following these steps, you can create a highly efficient and scalable SQLite database for storing and querying large-scale stock historical data. The single-table design, combined with proper indexing and summarization, ensures optimal performance while maintaining flexibility for advanced analysis.
