fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463)#471
fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463)#471eleanorjboyd wants to merge 1 commit into
Conversation
#463) Replace the unbounded `Command::output()` call in `get_interpreter_details` with a `spawn()` + bounded `try_wait()` polling loop. On timeout the child is killed and `None` is returned so the resolve path completes promptly instead of stalling the foreground env-selection flow. Stale cached Python paths on Windows (Store stubs, vanished network shares, EDR-stalled `CreateProcess`) could otherwise block `resolve` for tens to hundreds of seconds, tripping the extension's hang watchdog. Telemetry showed p99 of 122s on Windows and `envSelection` hangs dominating across all three platforms. Adds a Unix regression test that proves a hanging fake executable is killed near the configured timeout rather than blocking the full sleep duration.
Performance Report (Linux) ➖
Legend
|
Test Coverage Report (Linux)
Coverage increased! Great work! |
Performance Report (macOS)
Legend
|
Test Coverage Report (Windows)
Coverage decreased. Please add tests for new code. |
Performance Report (Windows) ➖
Legend
|
|
@karthiknadig could you take over here or open a new PR the way you like it for this fix? Lmk if the attached issue doesn't make sense or need clarity |
| } | ||
| thread::sleep(Duration::from_millis(25)); | ||
| } | ||
| Err(err) => { |
There was a problem hiding this comment.
Copilot generated:
[unverified] wait_with_output() reads stdout/stderr to EOF. If the spawned process is a wrapper (conda shim, pyenv shim) that forks a grandchild inheriting the piped stdout fd, EOF won't arrive until the grandchild closes the fd — re-introducing unbounded blocking after the poll loop. Consider taking child.stdout/child.stderr before the loop and draining with a bounded read, or using child.stdout.take() + manual read_to_end with a second deadline after the loop.
[verified]
| Ok(child) => child, | ||
| Err(err) => { | ||
| error!( | ||
| "Failed to spawn Python to resolve info {:?}: {}", |
There was a problem hiding this comment.
Copilot generated:
The exit status from try_wait() is discarded (Ok(Some(_status)) => break). Consider checking status.success() here and returning None early with a diagnostic log (including stderr) for non-zero exits, rather than falling through to JSON parsing on potentially empty/garbled stdout.
[verified]
Fixes #463.
Problem
pet.resolvep90 > 200s (p95 > 350s, p99 = 122s on Windows) for thecache_stalecohort tracked by the Python Environments extension. The extension'senvSelectionforeground path waits on thisresolve, so when it hangs the user sees a 30s+ hang detected by the watchdog.Root cause:
get_interpreter_detailsincrates/pet-python-utils/src/env.rscallsstd::process::Command::output()on the executable. That is a synchronous, unbounded wait — there is no timeout, no kill. On Windows this is the textbook way to stall on a stale cached path:~/AppData/Local/Microsoft/WindowsApps/python.exefor an uninstalled Store install — invoking the App Execution Alias stub can open the Store UI and block until dismissed.CreateProcessblocks on SMB.CreateProcessfor tens of seconds while it scans the exe.The per-exe cache
Mutexis also held across this spawn, so a single hung resolve blocks every other resolve of the same path.Fix
Replace
Command::output()withspawn()+ boundedtry_wait()polling. On timeout (15 s default) wekill()the child,wait()to reap it, and returnNone. Behavior on success is unchanged.15 s is well above the worst observed legitimate spawn time (low single-digit seconds even on cold cache + Defender) and well below the extension's hang-detection threshold, so legitimate slow spawns still succeed while pathological hangs are bounded.
Why not a separate worker thread + channel?
try_wait()polling avoids spawning an extra thread per resolve and avoids ownership gymnastics around theChild. The 25 ms poll interval adds at most a single poll of latency to a fast spawn (a typical spawn returns in 50–200 ms), which is negligible vs. the JSONRPC round-trip.Why not drop the cache
Mutexduring the spawn?The mutex serializes concurrent resolves of the same exe so they share one spawn rather than racing. With the spawn now bounded, holding the lock for ≤15 s is acceptable; dropping it would be a larger behavior change worth doing separately.
Test
Added a Unix regression test that writes a shell script which
sleeps for 60 s, calls the bounded resolver with a 200 ms timeout, and asserts:None,Test is gated to
#[cfg(all(test, unix))]because constructing a hang-on-launch executable on Windows requires a different fixture; the production code path is exercised on all platforms.Verification
cargo fmt --all— cleancargo clippy --all -- -D warnings— cleancargo test -p pet-python-utils— passes including the new regression test./scripts/rust-precommit.sh—=== ALL PRE-COMMIT CHECKS PASSED ===Expected impact
Based on the extension team's June 1 report, this should close most of the remaining ~1.57 pp bad-experience gap on insiders vs. v1.30 stable:
envSelectionhangs (dominant stage across win32 / linux / darwin,globalScopeDeferred = deferred) all flow through this code path via the extension's cache-stale resolve.