Skip to content

fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463)#471

Open
eleanorjboyd wants to merge 1 commit into
mainfrom
fix/463-bound-resolve-spawn
Open

fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463)#471
eleanorjboyd wants to merge 1 commit into
mainfrom
fix/463-bound-resolve-spawn

Conversation

@eleanorjboyd
Copy link
Copy Markdown
Member

Fixes #463.

Problem

pet.resolve p90 > 200s (p95 > 350s, p99 = 122s on Windows) for the cache_stale cohort tracked by the Python Environments extension. The extension's envSelection foreground path waits on this resolve, so when it hangs the user sees a 30s+ hang detected by the watchdog.

Root cause: get_interpreter_details in crates/pet-python-utils/src/env.rs calls std::process::Command::output() on the executable. That is a synchronous, unbounded wait — there is no timeout, no kill. On Windows this is the textbook way to stall on a stale cached path:

  • A cached path pointing at ~/AppData/Local/Microsoft/WindowsApps/python.exe for an uninstalled Store install — invoking the App Execution Alias stub can open the Store UI and block until dismissed.
  • A cached path on a network share whose mount has gone away — CreateProcess blocks on SMB.
  • An EDR / antivirus product stalling CreateProcess for tens of seconds while it scans the exe.

The per-exe cache Mutex is also held across this spawn, so a single hung resolve blocks every other resolve of the same path.

Fix

Replace Command::output() with spawn() + bounded try_wait() polling. On timeout (15 s default) we kill() the child, wait() to reap it, and return None. Behavior on success is unchanged.

let mut child = new_silent_command(executable)
    .args(["-c", PYTHON_INFO_CMD])
    .stdin(Stdio::null())
    .stdout(Stdio::piped())
    .stderr(Stdio::piped())
    .spawn()?;

let deadline = Instant::now() + RESOLVE_SPAWN_TIMEOUT;
loop {
    match child.try_wait() {
        Ok(Some(_)) => break,
        Ok(None) if Instant::now() >= deadline => {
            let _ = child.kill();
            let _ = child.wait();
            return None;
        }
        Ok(None) => thread::sleep(Duration::from_millis(25)),
        Err(_) => { let _ = child.kill(); let _ = child.wait(); return None; }
    }
}

15 s is well above the worst observed legitimate spawn time (low single-digit seconds even on cold cache + Defender) and well below the extension's hang-detection threshold, so legitimate slow spawns still succeed while pathological hangs are bounded.

Why not a separate worker thread + channel?

try_wait() polling avoids spawning an extra thread per resolve and avoids ownership gymnastics around the Child. The 25 ms poll interval adds at most a single poll of latency to a fast spawn (a typical spawn returns in 50–200 ms), which is negligible vs. the JSONRPC round-trip.

Why not drop the cache Mutex during the spawn?

The mutex serializes concurrent resolves of the same exe so they share one spawn rather than racing. With the spawn now bounded, holding the lock for ≤15 s is acceptable; dropping it would be a larger behavior change worth doing separately.

Test

Added a Unix regression test that writes a shell script which sleeps for 60 s, calls the bounded resolver with a 200 ms timeout, and asserts:

  • the call returns None,
  • elapsed time is well under the script's sleep duration (the child was actually killed).
running 1 test
test env::tests::get_interpreter_details_times_out_on_hanging_executable ... ok

Test is gated to #[cfg(all(test, unix))] because constructing a hang-on-launch executable on Windows requires a different fixture; the production code path is exercised on all platforms.

Verification

  • cargo fmt --all — clean
  • cargo clippy --all -- -D warnings — clean
  • cargo test -p pet-python-utils — passes including the new regression test
  • ./scripts/rust-precommit.sh=== ALL PRE-COMMIT CHECKS PASSED ===

Expected impact

Based on the extension team's June 1 report, this should close most of the remaining ~1.57 pp bad-experience gap on insiders vs. v1.30 stable: envSelection hangs (dominant stage across win32 / linux / darwin, globalScopeDeferred = deferred) all flow through this code path via the extension's cache-stale resolve.

 #463)

Replace the unbounded `Command::output()` call in
`get_interpreter_details` with a `spawn()` + bounded `try_wait()`
polling loop. On timeout the child is killed and `None` is returned so
the resolve path completes promptly instead of stalling the foreground
env-selection flow.

Stale cached Python paths on Windows (Store stubs, vanished network
shares, EDR-stalled `CreateProcess`) could otherwise block `resolve`
for tens to hundreds of seconds, tripping the extension's hang
watchdog. Telemetry showed p99 of 122s on Windows and `envSelection`
hangs dominating across all three platforms.

Adds a Unix regression test that proves a hanging fake executable is
killed near the configured timeout rather than blocking the full
sleep duration.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Performance Report (Linux) ➖

Metric PR (P50) PR (P95) Baseline (P50) Delta Change
Server Startup 1ms 1ms 1ms 0ms 0%
Full Refresh 66ms 678ms 56ms 10ms 10.0%

Results based on 10 iterations. P50 = median, P95 = 95th percentile.


Legend
  • 🚀 Significant speedup (>100ms faster)
  • ✅ Faster than baseline
  • ➖ No significant change
  • 🔺 Slower than baseline (>100ms)
  • ⚠️ Significant slowdown (>500ms)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Test Coverage Report (Linux)

Metric Value
Current Coverage 79.9%
Base Branch Coverage 79.8%
Delta .1% ✅

Coverage increased! Great work!

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Performance Report (macOS)

Metric PR (P50) PR (P95) Baseline (P50) Delta
Server Startup 105ms 799ms 51ms 54ms
Full Refresh 218ms 30324ms 96ms 122ms

Results based on 10 iterations. P50 = median, P95 = 95th percentile.


Legend
  • 🚀 Significant speedup (>100ms faster)
  • ✅ Faster than baseline
  • ➖ No significant change
  • 🔺 Slower than baseline (>100ms)
  • ⚠️ Significant slowdown (>500ms)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Test Coverage Report (Windows)

Metric Value
Current Coverage 76.91%
Base Branch Coverage 76.94%
Delta -0.03% ❌

Coverage decreased. Please add tests for new code.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Performance Report (Windows) ➖

Metric PR (P50) PR (P95) Baseline (P50) Delta Change
Server Startup 10ms 13ms 8ms 2ms 25%
Full Refresh 128ms 1228ms 118ms 10ms 8.5%

Results based on 10 iterations. P50 = median, P95 = 95th percentile.


Legend
  • 🚀 Significant speedup (>100ms faster)
  • ✅ Faster than baseline
  • ➖ No significant change
  • 🔺 Slower than baseline (>100ms)
  • ⚠️ Significant slowdown (>500ms)

@eleanorjboyd
Copy link
Copy Markdown
Member Author

@karthiknadig could you take over here or open a new PR the way you like it for this fix? Lmk if the attached issue doesn't make sense or need clarity

}
thread::sleep(Duration::from_millis(25));
}
Err(err) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot generated:
[unverified] wait_with_output() reads stdout/stderr to EOF. If the spawned process is a wrapper (conda shim, pyenv shim) that forks a grandchild inheriting the piped stdout fd, EOF won't arrive until the grandchild closes the fd — re-introducing unbounded blocking after the poll loop. Consider taking child.stdout/child.stderr before the loop and draining with a bounded read, or using child.stdout.take() + manual read_to_end with a second deadline after the loop.

[verified]

Ok(child) => child,
Err(err) => {
error!(
"Failed to spawn Python to resolve info {:?}: {}",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot generated:
The exit status from try_wait() is discarded (Ok(Some(_status)) => break). Consider checking status.success() here and returning None early with a diagnostic log (including stderr) for non-zero exits, rather than falling through to JSON parsing on potentially empty/garbled stdout.

[verified]

Copy link
Copy Markdown

@rchiodo rchiodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved via Review Center.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

resolve hangs for tens to hundreds of seconds on Windows for invalidated cached paths

3 participants