(WIP) proof of concept: feat(server): heap-safe instance cache + opt-in blueprint pooling (one instance per schema-shape) #1325
Open
yyyyaaa wants to merge 6 commits into
…of 50

Root cause of the schema-builder (public cnc server) heap OOM: each PostGraphile v5 instance that has served a GraphQL request retains ~0.5 GB of heap (fully-materialised schema + grafast plan machinery; a build-only instance is far smaller). graphileCache capped entries at a fixed 50, so the steady-state resident set was ~50 x 0.5 GB ~= 24 GB -- far beyond the heap -- and the process OOM'd as distinct app hosts filled the cache over days. Eviction was empirically confirmed to free instances correctly; the count cap was simply far too large for the per-instance footprint.

- getCacheConfig: heap-aware default for GRAPHILE_CACHE_MAX -- budget ~50% of the V8 heap limit at ~0.5 GB/instance, clamped to [3, 50], instead of a fixed 50. Override with GRAPHILE_CACHE_MAX; tune the per-instance estimate with GRAPHILE_CACHE_INSTANCE_HEAP_BYTES. The resolved cap is logged at startup.
- disposeEntry: guard double-disposal by ENTRY IDENTITY (WeakSet) instead of by cache key. The key-scoped guard skipped pgl.release() for a rebuilt entry that shared a key with an entry still mid-release (same-key disposal race), proven via a repro harness (1/12 -> 12/12 disposals run). Also close the http.Server unconditionally -- it is never .listen()ed, so the old `.listening` guard was dead -- and drop the now-needless key bookkeeping.
- pgCache cleanup callback: match entries by entry.dbname === poolKey (pools are keyed by database name) instead of `cacheKey.includes(poolKey)`; cacheKey is the request host and never contained the db name, so that safety valve was dead. dbname is threaded onto the entry via createGraphileInstance.
- Add regression tests for the disposal guard and the heap-aware cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01H3mDDgX8z6dE7kyaMERhin
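The heap-aware default described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual getCacheConfig: the function names `heapAwareCacheMax` and `resolveCacheMax` and the constant `DEFAULT_INSTANCE_HEAP_BYTES` are hypothetical, and the env-var parsing rules are assumptions. Only the formula (about 50% of the V8 heap limit at about 0.5 GB per instance, clamped to [3, 50]) and the two environment variable names come from the commit message.

```typescript
import { getHeapStatistics } from "node:v8";

// Assumed per-instance estimate: the commit cites ~0.5 GB per query-serving instance.
const DEFAULT_INSTANCE_HEAP_BYTES = 512 * 1024 * 1024;

// clamp(floor(heapLimit * 0.5 / perInstance), 3, 50)
export function heapAwareCacheMax(
  heapLimitBytes: number,
  perInstanceBytes: number = DEFAULT_INSTANCE_HEAP_BYTES,
): number {
  const budgetBytes = heapLimitBytes * 0.5; // spend at most half the heap on cached instances
  const fit = Math.floor(budgetBytes / perInstanceBytes);
  return Math.min(50, Math.max(3, fit));
}

// Hypothetical resolver: an explicit GRAPHILE_CACHE_MAX wins, otherwise derive from the heap limit.
export function resolveCacheMax(
  env: Record<string, string | undefined> = process.env,
): number {
  const explicit = Number(env.GRAPHILE_CACHE_MAX);
  if (Number.isInteger(explicit) && explicit > 0) return explicit;

  const perInstance = Number(env.GRAPHILE_CACHE_INSTANCE_HEAP_BYTES);
  return heapAwareCacheMax(
    getHeapStatistics().heap_size_limit,
    perInstance > 0 ? perInstance : DEFAULT_INSTANCE_HEAP_BYTES,
  );
}
```

With a 2 GiB heap limit the budget is 1 GiB, which fits two 512 MiB instances, so the lower clamp of 3 applies; an 8 GiB limit yields 8, and anything above 50 GiB is capped at 50.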
… GraphiQL gating

- disposeEntry drains in-flight requests (refcounted via invokeEntryHandler, bounded by GRAPHILE_CACHE_DRAIN_TIMEOUT_MS, default 30s) before pgl.release(), so eviction can no longer tear down a schema mid-request.
- All handler invocations in the graphile middleware go through invokeEntryHandler; disposing entries are treated as cache misses.
- Global BuildSemaphore (GRAPHILE_BUILD_CONCURRENCY, default 1) serializes cross-key schema builds; ensureCacheHeadroom evicts the LRU instance BEFORE each build so the build's transient peak lands on freed headroom.
- Prod idle TTL drops from 1 year to 6h (GRAPHILE_CACHE_TTL_MS still overrides).
- GraphiQL (ruru) only in development or with GRAPHILE_GRAPHIQL=true.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0122xqM2VkNbuAmZshK1YNSb
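The drain-before-release behaviour can be illustrated with a small sketch. It assumes a simplified `Entry` shape (the fields `inFlight`, `disposing`, `waiters` and `release` are hypothetical stand-ins, with `release` playing the role of pgl.release()), and it folds in the identity-based double-disposal guard via a WeakSet. The real functions in graphile-cache may differ in signature and detail.

```typescript
// Hypothetical, simplified cache entry; `release` stands in for pgl.release().
interface Entry {
  inFlight: number;
  disposing: boolean;
  waiters: Array<() => void>;
  release: () => Promise<void>;
}

type Served<T> = { served: true; value: T } | { served: false };

// Guard by entry identity, so a rebuilt entry under the same key is tracked separately.
const disposed = new WeakSet<Entry>();

export async function invokeEntryHandler<T>(
  entry: Entry,
  handler: () => Promise<T>,
): Promise<Served<T>> {
  if (entry.disposing) return { served: false }; // caller treats this as a cache miss
  entry.inFlight += 1;
  try {
    return { served: true, value: await handler() };
  } finally {
    entry.inFlight -= 1;
    // Wake any disposal waiting for the last in-flight request to finish.
    if (entry.inFlight === 0) entry.waiters.splice(0).forEach((wake) => wake());
  }
}

export async function disposeEntry(
  entry: Entry,
  drainTimeoutMs = 30_000,
): Promise<boolean> {
  if (disposed.has(entry)) return false; // this exact entry was already disposed
  disposed.add(entry);
  entry.disposing = true;

  if (entry.inFlight > 0) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const drained = new Promise<void>((resolve) => entry.waiters.push(resolve));
    const timedOut = new Promise<void>((resolve) => {
      timer = setTimeout(resolve, drainTimeoutMs);
    });
    await Promise.race([drained, timedOut]); // bounded wait, then release regardless
    clearTimeout(timer);
  }

  await entry.release();
  return true;
}
```

The bounded wait is the design trade-off here: a stuck request cannot pin an evicted instance forever, at the cost of possibly interrupting a request that outlives the timeout.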
…LS), reset type cache per build

withPgClient(null, ...) executed the localeStrings query with no role and no jwt.claims, bypassing RLS on every translation read. Thread pgSettings from the grafast context into the runtime query (same pattern as graphile-llm's rag-plugin). Also reset localeTypeCache in init alongside i18nRegistry so the module-singleton I18nPlugin export cannot leak GraphQLObjectTypes across schema rebuilds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0122xqM2VkNbuAmZshK1YNSb
…not module global

Concurrent PostGraphile builds in one process interleave init/fields hooks, so the module-global cachedTablesMeta could bake build B's tables into build A's _meta resolver. Store per-build via WeakMap (build objects are frozen by graphile-build, so no own-property). Flat global retained solely for single-build codegen consumers (graphile-schema buildIntrospectionJSON, codegen DatabaseSchemaSource), documented as such. Also export invokeEntryHandler/ensureCacheHeadroom from graphile-cache barrel.

Verified: graphile-settings 158/158 tests, 3 snapshots identical, against live PG on :5433.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0122xqM2VkNbuAmZshK1YNSb
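A per-build store of this kind might look like the sketch below. The `TablesMeta` type and the `setTablesMeta` / `getTablesMeta` helpers are hypothetical names chosen for illustration; only the idea of keying the metadata by the build object through a WeakMap, because frozen objects cannot take new own-properties, comes from the commit message.

```typescript
// Hypothetical shape of the per-build table metadata.
type TablesMeta = ReadonlyArray<{ schema: string; table: string }>;

// Keyed by build object identity; entries vanish when the build is garbage-collected.
const metaByBuild = new WeakMap<object, TablesMeta>();

export function setTablesMeta(build: object, meta: TablesMeta): void {
  // Works even when `build` is frozen, because nothing is written onto the object itself.
  metaByBuild.set(build, meta);
}

export function getTablesMeta(build: object): TablesMeta {
  return metaByBuild.get(build) ?? [];
}
```

Because each lookup is scoped to one build object, two builds whose hooks interleave can no longer observe each other's tables.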
… per schema-shape

GRAPHILE_BLUEPRINT_POOLING=1 keys the instance cache by a blueprint hash (sorted logical schema names + shape fingerprint over the catalog's [schema,table] pairs + database settings flags) instead of per-tenant svc_key, builds shared instances with stock gather.pgIdentifiers='unqualified', and routes each request via pgSettings search_path (requesting tenant's physical schemas, double-quoted).

Safety fallbacks to today's per-tenant instances: realtime-enabled APIs, empty schema lists, unqualified relation-name collisions within the schema set (e.g. identity_providers table/view shadow), or failed catalog probes. Decisions memoized per svc_key; schema:update flushes all pooled instances + decisions (v1 semantics).

Plugins honor schema.constructiveUnqualified for tenant-data SQL (search chunk refs + BM25 index name, llm RAG chunk query, i18n localeStrings) while control-plane metaschema references stay fully qualified. presigned-url resolves storage modules by logical schema name; llm agent-discovery is tenant-filtered and keyed by database_id (fixes a LIMIT 1 cross-tenant bleed). grafast.context now reads role/anonRole from req.api in all modes (de-closured).

Flag off = behavior-identical (verified: 61 pre-existing middleware tests unchanged; plugin suites byte-identical emissions). Gate evidence: SDL qualified-vs-unqualified byte-identical (sha256 match); zero-bleed proven on live hashed-schema tenants via per-request search_path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0122xqM2VkNbuAmZshK1YNSb
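To make the blueprint hash concrete, here is a simplified sketch of how such a key could be derived. The `BlueprintInput` shape, the `shapeFingerprint` helper and the canonical JSON layout are assumptions made for illustration; the PR's computeBlueprintKey takes more inputs (the PR description also lists api name and mode), and its exact serialization is not shown in this conversation. The `dbname` field reflects the follow-up commit that adds it to the key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, reduced input; the real key covers more fields.
interface BlueprintInput {
  logicalSchemas: string[];
  relations: Array<[schema: string, relname: string]>;
  settingsFlags: Record<string, boolean>;
  dbname: string;
}

const sha256 = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

// Hash of the sorted [logical schema, relname] pairs, so catalog scan order does not matter.
export function shapeFingerprint(relations: BlueprintInput["relations"]): string {
  const pairs = relations.map(([schema, rel]) => `${schema}\u0000${rel}`).sort();
  return sha256(pairs.join("\n"));
}

export function computeBlueprintKey(input: BlueprintInput): string {
  const canonical = JSON.stringify({
    schemas: [...input.logicalSchemas].sort(),
    shape: shapeFingerprint(input.relations),
    flags: Object.entries(input.settingsFlags).sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0,
    ),
    dbname: input.dbname,
  });
  return `bp:${sha256(canonical)}`;
}
```

Sorting every collection before hashing is what lets two tenants with the same shape converge on one key, while any drift in the relation set or a different database produces a different key and therefore a separate instance.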
…ixes for pooling

- tenantSearchPath: keep shared 'public' LAST on the pooled search_path. Replacing the path with only tenant schemas broke SECURITY DEFINER auth functions without their own SET search_path (sign_in's ::email cast → 'type email does not exist', HTTP 500 on every pooled login). Verified fixed live: pooled signIn returns a token; authenticated data reads flow through the shared instance.
- computeBlueprintKey now includes dbname: same-shape tenants in DIFFERENT physical databases must never share an instance (its pool targets one DB).
- Transient catalog-probe failures are no longer memoized as permanent per-tenant fallbacks; the next request re-probes.
- Manual /flush route now also clears bp: entries + pooling decisions.
- Single catalog scan feeds both shape fingerprint and collision check (was two identical pg_class scans per decision).
- _meta reports LOGICAL schema names on pooled instances (stops leaking the representative tenant's hashed schema identifier to other tenants).

W3 rig evidence: 7 same-shape tenants → 1 shared instance (6 builds → 1), tenant2 auto-split by shape fingerprint, zero-bleed via HTTP-authenticated canaries (10 interleaved rounds + cross-token control), RSS 1475MB → 711MB, collision fallback fires with warn + per-tenant instance.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0122xqM2VkNbuAmZshK1YNSb
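The search_path rule from the first bullet (tenant schemas first, shared public last) can be sketched like this. The signature of `tenantSearchPath` and the hand-rolled `quoteIdent` helper are assumptions; in particular, `quoteIdent` is only a stand-in that doubles embedded double quotes, and a vetted quoting library such as the one the reviewer links further down would be the safer choice in real code.

```typescript
// Stand-in identifier quoting: wrap in double quotes and double any embedded quote.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Builds the per-request search_path value: tenant schemas in order, then "public" last.
export function tenantSearchPath(physicalSchemas: string[]): string {
  const tenantSchemas = physicalSchemas.filter((schema) => schema !== "public");
  return [...tenantSchemas, "public"].map(quoteIdent).join(", ");
}

// Example with made-up hashed schema names:
const path = tenantSearchPath(["t_ab12_app", "public", "t_ab12_auth"]);
console.log(path); // "t_ab12_app", "t_ab12_auth", "public"
```

Keeping public at the end means tenant relations shadow anything of the same name in public, while shared domains and extensions that live there still resolve.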
pyramation
reviewed
Jul 3, 2026
  : `"${schemaName}"."${baseTable}"`;
const translationTableRef = constructiveUnqualified
  ? `"${translationTable}"`
  : `"${schemaName}"."${translationTable}"`;
Contributor
let's aim to use this where we can instead of just quoting stuff w/o checking:
https://www.npmjs.com/package/@pgsql/quotes
QuoteUtils.quoteQualifiedIdentifier('public', 'my_table');
pyramation
reviewed
Jul 3, 2026
  acm.task_table_name
FROM metaschema_modules_public.agent_chat_module acm
JOIN metaschema_public.schema s ON s.id = acm.schema_id
WHERE s.database_id = $1
Contributor
this is great, not sure if we can separate a few smaller PRs, would be great :)
Contributor
if not, I also get it, we can take a look together then.
pyramation
reviewed
Jul 3, 2026
// (schema.constructiveUnqualified), emit search_path-relative references for
// tenant-data tables/indexes so the per-request search_path resolves the
// tenant schema. Default (flag absent): fully schema-qualified, byte-identical.
const constructiveUnqualified = !!((build as any)?.options?.constructiveUnqualified);
Contributor
wait, this means using unqualified schemas?
Summary
Makes the multi-tenant PostGraphile v5 server survive and scale on small heaps. Two layers:
- Heap-safe instance cache: cache and eviction hardening in graphile-cache and graphql/server.
- Opt-in blueprint pooling (GRAPHILE_BLUEPRINT_POOLING=1): one shared PostGraphile instance per schema-shape, routed per request via search_path, collapsing N same-shape tenants from N×~0.5GB to one instance.

Also fixes three latent multi-tenant bugs found during the audit (one live RLS bypass).
Problem
Each query-serving PostGraphile instance retains ~0.5GB of heap (measured; ~51% strings, ~30% plan closures). The cache was keyed per svc_key (per tenant×API) with max: 50, TTL 1 year: a steady state of ~24GB for a fleet that runs in 2GB containers. Memory grew linearly with tenants and the process OOM'd long before "thousands of tenants."
What's included
1. Cache + eviction hardening (graphile-cache, graphql/server)
- Heap-aware cache cap: clamp(⌊heap×0.5 / 512MB⌋, 3, 50) instead of a fixed 50 (GRAPHILE_CACHE_MAX, GRAPHILE_CACHE_INSTANCE_HEAP_BYTES override/tune); resolved cap logged at startup.
- The pgCache cleanup callback matches entries by entry.dbname (the old substring match never fired).
- In-flight requests are refcounted (invokeEntryHandler); disposeEntry waits for them (bounded by GRAPHILE_CACHE_DRAIN_TIMEOUT_MS, default 30s) before pgl.release(), so eviction can no longer tear a schema down mid-request.
- Schema builds are serialized (GRAPHILE_BUILD_CONCURRENCY, default 1) and evict the LRU instance before building, so the build's transient peak lands on freed headroom.
- GraphiQL (ruru) is served only in development (GRAPHILE_GRAPHIQL=true to force).
2. Multi-tenant bug fixes
- graphile-i18n: live RLS bypass. The localeStrings query ran withPgClient(null, …) (no role, no claims). Now threads the request's pgSettings. Also resets the type cache per build (module-singleton leak).
- cachedTablesMeta was a module global, so concurrent builds could serve each other's _meta. Now keyed per build via WeakMap (build objects are frozen; the flat global remains only for single-build codegen consumers, documented).
- graphile-llm agent-discovery: config cache was keyed by dbname with a LIMIT 1 and no tenant filter, causing cross-tenant bleed in shared-DB topologies. Now filtered and keyed by database_id.
- graphile-presigned-url: storage-module resolution matched build-time physical schema names; now matches logical names (hash prefix stripped) so it works under pooling and across re-hashed schemas.
3. Blueprint pooling (opt-in)
- Cache key: bp:sha256({sorted logical schemas, shape fingerprint, database settings flags, api name, mode, dbname}). The shape fingerprint hashes the catalog's [logical schema, relname] pairs, so tenants that drifted (e.g. a half-provisioned tenant) automatically get their own instance. dbname is included so same-shape tenants in different physical databases never share a pool.
- Shared instances are built with gather: { pgIdentifiers: 'unqualified' } (search_path-relative SQL; GraphQL SDL is byte-identical to qualified builds, sha256-verified) plus schema: { constructiveUnqualified: true } so Constructive plugins (search chunk refs + BM25 index name, llm RAG chunk query, i18n localeStrings) emit search_path-relative SQL for tenant data. Control-plane (metaschema_*, services_public) references stay fully qualified by design.
- grafast.context reads roles from req.api (de-closured in all modes) and, on pooled instances only, sets pgSettings.search_path = requesting tenant's physical schemas + public last (shared domains/extensions; SECURITY DEFINER functions like sign_in depend on it).
- Fallbacks to per-tenant instances: realtime-enabled APIs; empty schema lists; unqualified relation-name collisions (e.g. the identity_providers table/view shadow: detected by catalog probe, logged, per-tenant instance used); failed probes (not memoized, re-probed next request).
- schema:update flushes all pooled instances + cached decisions (rebuilds are ~1–2s for tenant APIs); the manual /flush route does the same.
Verification (isolated rig: full constructive-db schema + 8 seeded marketplace tenants as hashed schemas, server at 2GB heap)
public drop, dbname in key, transient-probe memoization, /flush bp: gap, double catalog scan, _meta physical-name leak), 3 documented below.
Env vars
- GRAPHILE_BLUEPRINT_POOLING
- GRAPHILE_CACHE_MAX
- GRAPHILE_CACHE_INSTANCE_HEAP_BYTES
- GRAPHILE_CACHE_TTL_MS
- GRAPHILE_CACHE_DRAIN_TIMEOUT_MS
- GRAPHILE_BUILD_CONCURRENCY
- GRAPHILE_GRAPHIQL
Known limitations / follow-ups
schema:update. Follow-up: extend the fingerprint to attributes/procs.
- /flush route's missing auth is pre-existing (TODO in code).
Rollout
GRAPHILE_BLUEPRINT_POOLING=1, watch [pooling] logs (attach vs build), instance counts, RSS.
🤖 Generated with Claude Code
https://claude.ai/code/session_0122xqM2VkNbuAmZshK1YNSb