vec0 virtual table corrupts on first write after cross-platform rsync (macOS arm64 → Linux x86_64, both 0.1.9) #297

@tstephx

Description

Summary

A vec0 virtual table created and populated on macOS (Darwin arm64) corrupts on the first write attempt from a Linux (x86_64) host after the underlying SQLite file is byte-for-byte rsync'd. Read-only queries against the rsync'd DB work fine. Both hosts run sqlite-vec 0.1.9.

Environment

                  macOS (writer)             Linux (reader/writer)
OS                macOS 15.x (Darwin arm64)  Debian (x86_64)
Python            3.12                       3.12
sqlite-vec (pip)  0.1.9                      0.1.9
SQLite            3.49.x                     3.40.x

Reproduction

  1. On macOS: create a DB with a vec0 virtual table and populate ~50k rows.
    import sqlite3
    import sqlite_vec
    conn = sqlite3.connect("rss.db")
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.execute("CREATE VIRTUAL TABLE vec_articles USING vec0(article_id INTEGER PRIMARY KEY, embedding float[768])")
    # ... insert 50,593 rows
  2. Rsync the file to Linux: rsync -avz rss.db host:rss.db. Verify sha256sum matches and sqlite3 rss.db "PRAGMA integrity_check" returns ok on the Linux side.
  3. On Linux, perform any write that touches vec_articles via the extension — even one that should be a no-op when there's nothing new to insert. Example via a higher-level wrapper:
    conn = sqlite3.connect("rss.db")
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    # any CREATE TABLE IF NOT EXISTS / INSERT path that re-creates or touches vec_articles
  4. Result: the DB file shrinks (~620 MB → ~470 MB), and a subsequent PRAGMA integrity_check reports "database disk image is malformed". Both the article rows and the vec_articles_rowids shadow table lose data.
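
The verification in step 2 can also be scripted, which makes it easy to run the same check on both hosts before and after the write. This is a minimal sketch using only the Python standard library; the helper names `file_sha256` and `integrity_check` are made up for illustration. It opens the file read-only and does not load sqlite-vec, so it checks the SQLite file itself and may not exercise anything vec0-specific.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of the file, read in chunks so a large DB is not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_check(path: str) -> list[str]:
    """Run PRAGMA integrity_check on a read-only connection.

    A healthy file yields the single row "ok"; otherwise each row describes a problem.
    """
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

Comparing `file_sha256("rss.db")` on the two hosts corresponds to the sha256sum comparison, and `integrity_check("rss.db") == ["ok"]` corresponds to the PRAGMA check from the sqlite3 shell.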

Expected

The DB should remain valid across platforms when bytes are unchanged. Either:

  • vec0 should detect platform-incompatible shadow table state and refuse to write (loud failure), or
  • vec0 shadow tables should be platform-neutral so cross-host transfer + write works.

Workaround

Open the DB read-only on the non-creator host:

conn = sqlite3.connect("file:rss.db?mode=ro", uri=True)
conn.execute("PRAGMA query_only = 1")
# load sqlite-vec, run SELECTs only — works correctly

Hybrid FTS5 + vec0 SELECT queries return correct ranked results. Only writes corrupt.
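
To make the workaround harder to bypass by accident, the read-only open can be wrapped in a helper so that a stray write fails loudly instead of reaching the file. This is a small sketch with a hypothetical helper name, `open_read_only`; loading sqlite-vec is left as a comment so the snippet runs with the standard library alone.

```python
import sqlite3
from pathlib import Path


def open_read_only(path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt raises an error."""
    # mode=ro makes SQLite open the file itself read-only.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    # query_only additionally rejects statements that would change the database.
    conn.execute("PRAGMA query_only = 1")
    # For vec0 queries, load the extension here:
    #   conn.enable_load_extension(True); sqlite_vec.load(conn)
    return conn
```

With a connection from `open_read_only("rss.db")`, SELECTs behave as usual, while an INSERT, UPDATE, or CREATE raises `sqlite3.OperationalError` rather than modifying the file.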

Why this matters

Common deployment pattern: do expensive embedding on a beefy laptop, rsync the DB to a small server for read-only querying. Today the DB is silently corrupted on the server's first write, even if that write would have been a logical no-op.

Happy to provide a minimized self-contained repro script if helpful.
