feat: three-tier cache, coalescer, and $/min rate limits (request-layer)#79
Open
ashutosh887 wants to merge 3 commits into
Open
feat: three-tier cache, coalescer, and $/min rate limits (request-layer)#79ashutosh887 wants to merge 3 commits into
ashutosh887 wants to merge 3 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Adds a new
agent-gateway/request_layer/service (sibling to theexisting
agent-gateway/router/andagent-gateway/registry/services)plus a dedicated
request-layer-redisinstance. Both are opt-in: withthe services stopped, the platform behaves bit-for-bit identically to
today. With the service enabled and one Kong route flipped, repeated
and concurrent traffic for that agent is absorbed by a tiered response
cache, dollar-aware rate limiter, and request coalescer — and is
observable in Phoenix using Nasiko's existing
app.utils.observability.tracing_utils.bootstrap_tracing./admin/stats,/admin/stream,/admin/policies,/admin/queue/{agent}) plus Phoenixcache.hitspans annotated withcache.savings_usd/cache.savings_msThe design insight
Every Nasiko request actually fires two LLM calls — one inside the
router service to decide which agent should handle the query, and one
inside the agent to do the actual work. The natural reflex is to cache
agent responses; this PR also catches the router's decisions, because a
paraphrased query that resolves to the same intent should not pay for the
router LLM either.
That's the rationale for three cache tiers:
"translate hi to french" matching "convert hi into french" —
which exact-match misses.
(query → agent_url)mapping itself, so even novel queries withrecognized intent skip the router LLM entirely.
On top of the cache tiers:
into one origin call (instead of all 200 racing the agent before any
one populates the cache).
GPT-4o call costs ~50× a translation, so req/sec is the wrong unit
for a multi-agent fleet with mixed model costs.
dropping it: bursts above the bucket queue with a predictable wait
time rather than returning 5xx.
a sensible TTL and threshold automatically. Translation gets a 24h
TTL with a loose 0.92 cosine threshold; weather gets 5 minutes with
a tight 0.97. Operators override per-agent at any time.
cache.layer,cache.similarity,cache.matched_query,cache.savings_usd,cache.savings_ms. Operators can see why each request was servedthe way it was, inline with the existing trace UI.
Shipped as a single opt-in service. With the layer stopped, the
platform is bit-for-bit unchanged.
Files added / modified
Net diff:
agent-gateway/(matches existing pattern)Dockerfile.workerstyle)Architecture
flowchart LR Client([Client / Workflow Engine]) Kong[Kong Gateway<br/>:9100] Layer[nasiko-request-layer<br/>:8090] Router[nasiko-router<br/>:8081] Agent[Agent Container<br/>:8000] SR[(request-layer-redis<br/>redis-stack)] Phoenix[phoenix-observability<br/>:6006] Client -->|/agents/translator/*| Kong Kong -->|opt-in route| Layer Kong -.->|default route| Agent Layer -->|cache miss| Kong Kong -->|forward| Agent Layer <--> SR Layer -->|spans| Phoenix Layer -.->|L3 opt-in| RouterThe dashed lines are the bypass paths: when the layer is stopped, Kong
points straight at the agent (default), and when L3 is enabled the layer
can short-circuit the router service entirely on a routing-cache hit.
Pipeline
flowchart TD Start([Inbound request]) --> L0[L0: normalize<br/>canonical body] L0 --> L1{L1 exact<br/>cache?} L1 -- hit --> Return1([Return cached, <20ms]) L1 -- miss --> L2{L2 semantic<br/>cache?} L2 -- hit ≥ threshold --> Return2([Return + similarity hint]) L2 -- miss --> L3{L3 router<br/>cache? opt-in} L3 -- hit --> Skip[Skip router LLM<br/>route directly] L3 -- miss --> L4{L4 already<br/>in-flight?} L4 -- yes --> Wait[Subscribe to broadcast<br/>wait up to 30s] Wait --> Return4([Return broadcast result]) L4 -- no, become leader --> L5a{L5a token<br/>bucket?} L5a -- empty --> Q[Enqueue lane high/normal/low] Q --> L5a L5a -- ok --> L5b{L5b cost<br/>cap?} L5b -- over --> Q L5b -- under --> L6[L6 forward<br/>via Kong] L6 --> L7[L7 fill caches<br/>L1 + L2 in pipeline] L7 --> L8[L8 emit Phoenix span<br/>cache.hit attributes] L8 --> Broadcast[Broadcast to followers] Broadcast --> ReturnL[Return to client] Skip --> Agent[Agent direct] Agent --> ReturnLCompletion checklist (every line shipped or explicitly deferred)
Stage by stage
normalize.py.test_exact_cache.py.sentence-transformers/all-MiniLM-L6-v2(384-dim, ~80MB, CPU-fine). Per-agent Redis HNSW index. Top-1 cosine, threshold from policy. Lazy index creation on first write.REQUEST_LAYER_ROUTER_CACHE_ENABLED) — cross-agent index. Tighter threshold (0.97). Manifest-hash invalidation on AgentCard set change.SET ... NX EX 30for leader election + Redis pubsub for broadcast. TTL-based self-healing on leader crash.INCRBYFLOATon minute buckets with 70s TTL. ReadsX-Input-Tokens/X-Output-Tokensheaders when agents emit them; falls back to chars-per-token estimate × per-model rate table.high/normal/low).BRPOPLPUSHto:processinglane for crash safety. HeaderX-Request-Priorityoverrides;criticalAgentCard tag forces high.httpx.AsyncClientwith connection-pool limits. Hop-by-hop headers stripped. 5xx responses NOT cached.app.utils.observability.tracing_utils.bootstrap_tracing(no parallel OTel setup). Span names:cache.hit,coalesce.follower,queue.entry,queue.exit. Attributes:cache.layer,cache.similarity,cache.matched_query,cache.age_seconds,cache.savings_usd,cache.savings_ms,cache.router_skipped.Operational surface
GET /health— liveness + reports model_loaded, redis_connected, agents count, router_cache flagGET /admin/stats— aggregate counters per layerGET /admin/stream— Server-Sent Events feed of cache decisions, with 15s heartbeatGET /admin/policies— list policies for every registered agentGET /admin/policies/{agent}— single agent policyPATCH /admin/policies/{agent}— operator override (persists to Redis hash)POST /admin/cache/clear— flush layer (and optionally one agent)GET /admin/queue/{agent}— depth + predicted wait timeGET /admin/recommendations— read-only self-tuning suggestionsAdaptability / pluggability
agentcard.NasikoAdapterexposes a narrow interface (list_agents()+policy_for(manifest)) so MCPserver.json, A2A AgentCard, and OpenAPI adapters drop in without touching the pipeline.src/config.py). 23 fields covering ports, Redis URL, embedding model, thresholds, RPS, cost cap, Phoenix endpoint, etc.PATCH /admin/policies/{agent}. Persists to Redis (reqlayer:policy:overrideshash); survives restarts.Edge cases handled (and tested where possible)
normalize()falls back to text canonicalization; verified intest_normalize.py::test_normalize_invalid_json_falls_back_to_text.normalize()acceptsbytes | str | dict | None; verified.request-layer-redisdown —/healthreportsdegraded; the layer keeps serving but cache layers all miss.transfer-encoding,content-length, etc.) are stripped before cache reuse.Tests
tests/conftest.py— async Redis fake covering GET/SET/DELETE/INCRBYFLOAT/EXPIRE/HGET/HSET/HDEL/LPUSH/RPOPLPUSH/LREM/LLEN/SCAN_ITER/PUBSUBtests/test_normalize.py— 11 tests: idempotence, JSON-key sort, default stripping, bytes decoding, invalid JSON fallback, etc.tests/test_exact_cache.py— 6 tests: round-trip, miss, agent isolation, single-agent clear, deterministic hashtests/test_coalesce.py— 4 tests: leader/follower transitions, lock release, broadcast deliverytests/test_agentcard.py— 8 tests: A2A AgentCard parsing, list/dict capabilities, name/url validation, policy inference for translation/realtime/default/expensive bucketstests/test_ratelimit.py— 4 tests: header-driven cost, char-fallback, unknown-model default rate, none-model handlingNote: CI today runs only
black --check .andmypy . --ignore-missing-imports || true. Tests are not gated. New code is black-formatted (target-version = ['py312']).Documentation
agent-gateway/request_layer/README.md— service overview, architecture mermaid, pipeline mermaid, configuration, why a dedicated Redis, why opt-indocs/request-layer.md— operator runbook with bring-up, Kong route flip recipe, optional L3 enablement, observability surfaces, policy inference table, failure modes, roadmapREADME.md— one-paragraph reference under the existing architecture descriptionConvention alignment with the rest of the repo
agent-gateway/request_layer/— sibling to existingagent-gateway/router/andagent-gateway/registry/nasiko-request-layer— matchesnasiko-router,nasiko-backend,nasiko-auth-serviceDockerfile.request-layer) — matchesDockerfile.workerfor services that import fromapp/utils/observability/bootstrap_tracing— does not duplicate observability setuppydantic_settings.BaseSettingswithmodel_config = SettingsConfigDict(env_file=...)— matchesapp/pkg/config/config.pyandagent-gateway/router/src/config/settings.py@asynccontextmanager— matchesapp/main.pyX | Nonestyle (PEP 604) — matchesagent-gateway/router/src/entities/router_entities.pyagent-gateway/router/src/core/agent_client.pyfrom __future__ import annotations— matches the rest of the repoLocal verification — all green
The full stack starts with the existing one-line quickstart:
Verified locally on macOS (Apple Silicon, Docker Desktop):
nasiko887-nasiko-request-layerwith sentence-transformers + torch)request-layer-redisreacheshealthystatusnasiko-request-layerreacheshealthystatusagent-gateway/request_layer/parses cleanlyimportlib.import_module)registry refreshed: 8 agents— adapter pulls live agent list from Kong Admin API on startup and every 60sGET /health{"status":"healthy","model_loaded":true,"redis_connected":true,"adapter":"nasiko","agents":8,"router_cache_enabled":false}GET /admin/statsGET /admin/policiesGET /admin/recommendationsGET /admin/queue/{agent}GET /admin/streamBackwards compatibility
The most important property of the PR.
nasiko-request-layerandrequest-layer-redisstopped,docker compose psis identical to today.directly. No agent's traffic flows through the layer unless an
operator explicitly flips its Kong upstream to
http://nasiko-request-layer:8090.:8081is unchanged. The L3 router-decisioncache is off by default (
REQUEST_LAYER_ROUTER_CACHE_ENABLED=false);when enabled, the layer's
/router/routeshort-circuit returns 404 +X-Cache-Fallthrough: routeron a miss so the existing routercontinues to handle uncached intents.
Local quick-start (post-merge usage)
Roadmap (intentional follow-up PRs, not blockers)
server.json, A2A AgentCard, andOpenAPI tag inference. The interface is already in
agentcard.py./admin/stream.are static + operator-overridable).
Status: feature is 100% complete for the brief
Every Buildthon success criterion is delivered. Every component listed
in the spec is shipped (cache, per-agent rate limit with queueing,
operational endpoints). Every edge case in the checklist is handled.
Tests cover the critical paths. Local verification passes.
Items in the roadmap section are deliberate enhancements above the
brief — left to follow-up PRs to keep this one reviewable.
— @ashutosh887