Skip to content

Retry transient network errors + fix CloudFetch prefetch-rejection leak#424

Merged
msrathore-db merged 2 commits into
mainfrom
fix/cloudfetch-retry-network-errors
Jun 8, 2026
Merged

Retry transient network errors + fix CloudFetch prefetch-rejection leak#424
msrathore-db merged 2 commits into
mainfrom
fix/cloudfetch-retry-network-errors

Conversation

@msrathore-db

@msrathore-db msrathore-db commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Two complementary fixes for the Thrift + CloudFetch ECONNRESET / "socket hang up" failures users hit while draining large (~100MB) results (issue #260). Combined into one branch per request.

1. Retry transient network errors in HttpRetryPolicy (folds in #398)

invokeWithRetry previously only inspected response.status, so when fetch() itself rejected with a network error the rejection propagated on attempt #1 with no retry. Now it catches operation rejections and classifies them via isRetryableNetworkError — FetchError system type, OS errnos (ECONNRESET/ECONNREFUSED/ETIMEDOUT/EHOSTUNREACH/ENETUNREACH/EPIPE/ENOTFOUND/EAI_AGAIN) and message patterns (socket hang up/premature close/aborted) — retrying within the existing attempt/timeout budget. Thrift (response.buffer()) and CloudFetch (response.arrayBuffer()) body reads were moved inside the retried block so mid-stream "Premature close" is retried too.

Idempotency unchanged: ExecuteStatement and other write-y Thrift methods keep NullRetryPolicy — no double-execution risk.

2. Fix CloudFetch prefetch-rejection leak

CloudFetchResultHandler.fetchNext fans out up to cloudFetchConcurrentDownloads downloads but only awaits the head. If the head rejects and the consumer stops iterating, the remaining in-flight promises had no handler and surfaced as unhandledRejection. Each download is now reflected into a SettledDownload value the instant it's created, so a prefetched download can never become an unhandled rejection. The error is re-thrown only when that task is actually consumed, so observable behavior is unchanged.

These are orthogonal: #1 makes a transient reset recover; #2 makes the terminal-failure path clean (including once the retry budget is exhausted).

Test plan

  • HttpRetryPolicy unit suite: 14/14 passing, incl. real ECONNRESET and "Premature close" cases.
  • New regression test: prefetched downloadTasks settle to { ok: false } (fulfilled, not rejected) when downloads fail.
  • Standalone checks against the built dist:
    • downloads all fail + consumer bails → 0 leaked unhandledRejections, fetchNext throws cleanly.
    • real node-fetch GET against a local server that RSTs its first 2 connections → recovers transparently on attempt Update CHANGELOG #3 (HTTP 200).

Live before/after (real warehouse)

Verified end-to-end against a real staging warehouse on AWS us-west-2, draining the original failing query (catalog_sales, 664,893 rows / ~100MB) over the Thrift backend with CloudFetch on, 3 attempts each. Same machine, same warehouse, same query — only the build differs.

Build thrift + cloudFetch thrift + NO cloudFetch (control)
base main (no fixes) FAIL 0/3FetchError: socket hang up on the S3 result-file GETs, surfacing as leaked unhandledRejections ✅ OK 3/3
this PR (both fixes) PASS 3/3 — all 664,893 rows drained, 0 stray async errors ✅ OK 3/3

The non-CloudFetch control passes on both builds, isolating the failure to the CloudFetch download path. The fix turns a reproducible 0/3 failure into a clean 3/3 pass.

Note: the live run used useLZ4Compression: false (the test box lacks the lz4 native module). The failure and fix are at the socket/fetch layer — upstream of LZ4 decompression — so the LZ4-on production path is fixed by the same code; it just wasn't exercised directly in this run.

This pull request and its description were written by Isaac.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Two complementary fixes for the Thrift + CloudFetch ECONNRESET ("socket hang
up") failures when draining large (~100MB) results.

1. Retry transient network errors in HttpRetryPolicy (supersedes/folds in #398,
   fixes #260). invokeWithRetry now catches operation rejections and classifies
   them via isRetryableNetworkError (ECONNRESET/ECONNREFUSED/ETIMEDOUT/... plus
   "socket hang up"/"premature close" message patterns), retrying within the
   existing attempt/timeout budget. Thrift and CloudFetch body reads were moved
   inside the retried block so mid-stream "Premature close" is retried too.
   ExecuteStatement and other write-y methods keep NullRetryPolicy (no
   double-execution risk).

2. Fix CloudFetch prefetch-rejection leak. fetchNext fans out up to
   cloudFetchConcurrentDownloads downloads but only awaits the head; if the head
   rejects and the consumer stops iterating, the remaining in-flight promises
   surfaced as unhandledRejection. Each download is now reflected into a
   SettledDownload value at creation time, so a prefetched download can never
   become an unhandled rejection. The error is re-thrown only when the task is
   consumed, so observable behavior is unchanged.

Testing:
- HttpRetryPolicy unit suite 14/14 (incl. real ECONNRESET / "Premature close").
- Standalone checks against the built dist: zero leaked rejections when
  downloads fail and the consumer bails; a real node-fetch ECONNRESET against a
  local RST-injecting server recovers transparently (recovers on attempt 3).

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
…odejs into fix/cloudfetch-retry-network-errors

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
@msrathore-db msrathore-db force-pushed the fix/cloudfetch-retry-network-errors branch from 1ec8e27 to 80e446f Compare June 8, 2026 13:10
@msrathore-db msrathore-db added this pull request to the merge queue Jun 8, 2026
Merged via the queue into main with commit 7040a95 Jun 8, 2026
19 checks passed
msrathore-db added a commit that referenced this pull request Jun 8, 2026
…ckoff + operation-status fields

Squashed rebase of the SEA kernel-parity batch onto current main
(replaces the prior multi-commit branch, which had diverged with
unsigned commits and a merge-commit history that conflicted with the
since-merged #424/#426). Net feature set unchanged:

- mTLS (clientCertPem/clientKeyPem) + custom HTTP headers + User-Agent
  entry on the SEA path, with PEM and header-token (CR/LF/NUL)
  validation in SeaAuth.
- Retry/backoff tuning forwarded to the kernel (omitting unset knobs).
- Operation-status fields surfaced through op.status(): numModifiedRows,
  displayMessage, diagnosticInfo, errorDetailsJson (async + sync paths),
  memoized once terminal; Thrift wire synthesis maps the same fields.
- preserveBigNumericPrecision connection option (Thrift + SEA) and
  DATE/large-number parameter type-inference fix.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
pull Bot pushed a commit to rebeccalarner/databricks-sql-nodejs that referenced this pull request Jun 8, 2026
…ckoff + operation-status fields (databricks#420)

* [SEA-NodeJS] Kernel-parity batch: mTLS + custom headers/UA + retry/backoff + operation-status fields

Squashed rebase of the SEA kernel-parity batch onto current main
(replaces the prior multi-commit branch, which had diverged with
unsigned commits and a merge-commit history that conflicted with the
since-merged databricks#424/databricks#426). Net feature set unchanged:

- mTLS (clientCertPem/clientKeyPem) + custom HTTP headers + User-Agent
  entry on the SEA path, with PEM and header-token (CR/LF/NUL)
  validation in SeaAuth.
- Retry/backoff tuning forwarded to the kernel (omitting unset knobs).
- Operation-status fields surfaced through op.status(): numModifiedRows,
  displayMessage, diagnosticInfo, errorDetailsJson (async + sync paths),
  memoized once terminal; Thrift wire synthesis maps the same fields.
- preserveBigNumericPrecision connection option (Thrift + SEA) and
  DATE/large-number parameter type-inference fix.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

* fix(sea): address review findings + bump KERNEL_REV to kernel main; prettier

Review findings (GopalDB / code-review agent), each validated:

1. DBSQLParameter DATE off-by-one in non-UTC timezones — `toISOString()`
   converts to UTC before slicing, so a local-constructed Date shifts a
   day in offset zones. Use local calendar accessors
   (getFullYear/getMonth/getDate). Validated end-to-end on a live SEA
   warehouse under TZ=Australia/Sydney: a `new Date(2024,2,14)` now
   round-trips as 2024-03-14 (was 2024-03-13). Regression test sabotages
   toISOString so it guards in any timezone (incl. the UTC CI runner).

2. SeaOperationBackend.status() (sync path) could block indefinitely —
   it gated Succeeded on `fetchHandlePromise` *existing*, but that is
   assigned synchronously when getFetchHandle() is first called while
   result() may still be pending; status() would then report Succeeded
   early AND await the pending result() via readRichStatusFields(). Gate
   on `blockingStatement` (set only after result() resolves) instead.

3. Async-path progress callback omitted the rich-status fields on the
   terminal Succeeded tick (the sync path includes them) — parity break
   for DML numModifiedRows. Merge the rich fields into the Succeeded
   callback in waitUntilReadyAsync.

Also:
- KERNEL_REV -> 9c2e2378 (kernel main, databricks#144 merged). Binding contract
  (native/sea/index.d.ts) already matched the bumped surface.
- prettier --write (lint fix) + new regression tests for findings 1-3.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

---------

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants