fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463) by eleanorjboyd · Pull Request #471 · microsoft/python-environment-tools

eleanorjboyd · 2026-06-01T15:34:24Z

Fixes #463.

Problem

pet.resolve p90 > 200s (p95 > 350s, p99 = 122s on Windows) for the cache_stale cohort tracked by the Python Environments extension. The extension's envSelection foreground path waits on this resolve, so when it hangs the user sees a 30s+ hang detected by the watchdog.

Root cause: get_interpreter_details in crates/pet-python-utils/src/env.rs calls std::process::Command::output() on the executable. That is a synchronous, unbounded wait — there is no timeout, no kill. On Windows this is the textbook way to stall on a stale cached path:

A cached path pointing at ~/AppData/Local/Microsoft/WindowsApps/python.exe for an uninstalled Store install — invoking the App Execution Alias stub can open the Store UI and block until dismissed.
A cached path on a network share whose mount has gone away — CreateProcess blocks on SMB.
An EDR / antivirus product stalling CreateProcess for tens of seconds while it scans the exe.

The per-exe cache Mutex is also held across this spawn, so a single hung resolve blocks every other resolve of the same path.

Fix

Replace Command::output() with spawn() + bounded try_wait() polling. On timeout (15 s default) we kill() the child, wait() to reap it, and return None. Behavior on success is unchanged.

let mut child = new_silent_command(executable)
    .args(["-c", PYTHON_INFO_CMD])
    .stdin(Stdio::null())
    .stdout(Stdio::piped())
    .stderr(Stdio::piped())
    .spawn()?;

let deadline = Instant::now() + RESOLVE_SPAWN_TIMEOUT;
loop {
    match child.try_wait() {
        Ok(Some(_)) => break,
        Ok(None) if Instant::now() >= deadline => {
            let _ = child.kill();
            let _ = child.wait();
            return None;
        }
        Ok(None) => thread::sleep(Duration::from_millis(25)),
        Err(_) => { let _ = child.kill(); let _ = child.wait(); return None; }
    }
}

15 s is well above the worst observed legitimate spawn time (low single-digit seconds even on cold cache + Defender) and well below the extension's hang-detection threshold, so legitimate slow spawns still succeed while pathological hangs are bounded.

Why not a separate worker thread + channel?

try_wait() polling avoids spawning an extra thread per resolve and avoids ownership gymnastics around the Child. The 25 ms poll interval adds at most a single poll of latency to a fast spawn (a typical spawn returns in 50–200 ms), which is negligible vs. the JSONRPC round-trip.

Why not drop the cache `Mutex` during the spawn?

The mutex serializes concurrent resolves of the same exe so they share one spawn rather than racing. With the spawn now bounded, holding the lock for ≤15 s is acceptable; dropping it would be a larger behavior change worth doing separately.

Test

Added a Unix regression test that writes a shell script which sleeps for 60 s, calls the bounded resolver with a 200 ms timeout, and asserts:

the call returns None,
elapsed time is well under the script's sleep duration (the child was actually killed).

running 1 test
test env::tests::get_interpreter_details_times_out_on_hanging_executable ... ok

Test is gated to #[cfg(all(test, unix))] because constructing a hang-on-launch executable on Windows requires a different fixture; the production code path is exercised on all platforms.

Verification

cargo fmt --all — clean
cargo clippy --all -- -D warnings — clean
cargo test -p pet-python-utils — passes including the new regression test
./scripts/rust-precommit.sh — === ALL PRE-COMMIT CHECKS PASSED ===

Expected impact

Based on the extension team's June 1 report, this should close most of the remaining ~1.57 pp bad-experience gap on insiders vs. v1.30 stable: envSelection hangs (dominant stage across win32 / linux / darwin, globalScopeDeferred = deferred) all flow through this code path via the extension's cache-stale resolve.

#463) Replace the unbounded `Command::output()` call in `get_interpreter_details` with a `spawn()` + bounded `try_wait()` polling loop. On timeout the child is killed and `None` is returned so the resolve path completes promptly instead of stalling the foreground env-selection flow. Stale cached Python paths on Windows (Store stubs, vanished network shares, EDR-stalled `CreateProcess`) could otherwise block `resolve` for tens to hundreds of seconds, tripping the extension's hang watchdog. Telemetry showed p99 of 122s on Windows and `envSelection` hangs dominating across all three platforms. Adds a Unix regression test that proves a hanging fake executable is killed near the configured timeout rather than blocking the full sleep duration.

github-actions · 2026-06-01T15:35:40Z

Performance Report (Linux) ➖

Metric	PR (P50)	PR (P95)	Baseline (P50)	Delta	Change
Server Startup	1ms	1ms	1ms	0ms	0%
Full Refresh	66ms	678ms	56ms	10ms	10.0%

Results based on 10 iterations. P50 = median, P95 = 95th percentile.

Legend

🚀 Significant speedup (>100ms faster)
✅ Faster than baseline
➖ No significant change
🔺 Slower than baseline (>100ms)
⚠️ Significant slowdown (>500ms)

github-actions · 2026-06-01T15:35:45Z

Test Coverage Report (Linux)

Metric	Value
Current Coverage	79.9%
Base Branch Coverage	79.8%
Delta	.1% ✅

Coverage increased! Great work!

github-actions · 2026-06-01T15:35:46Z

Performance Report (macOS)

Metric	PR (P50)	PR (P95)	Baseline (P50)	Delta
Server Startup	105ms	799ms	51ms	54ms
Full Refresh	218ms	30324ms	96ms	122ms

Results based on 10 iterations. P50 = median, P95 = 95th percentile.

Legend

🚀 Significant speedup (>100ms faster)
✅ Faster than baseline
➖ No significant change
🔺 Slower than baseline (>100ms)
⚠️ Significant slowdown (>500ms)

github-actions · 2026-06-01T15:35:51Z

Test Coverage Report (Windows)

Metric	Value
Current Coverage	76.91%
Base Branch Coverage	76.94%
Delta	-0.03% ❌

Coverage decreased. Please add tests for new code.

github-actions · 2026-06-01T15:35:55Z

Performance Report (Windows) ➖

Metric	PR (P50)	PR (P95)	Baseline (P50)	Delta	Change
Server Startup	10ms	13ms	8ms	2ms	25%
Full Refresh	128ms	1228ms	118ms	10ms	8.5%

Results based on 10 iterations. P50 = median, P95 = 95th percentile.

Legend

🚀 Significant speedup (>100ms faster)
✅ Faster than baseline
➖ No significant change
🔺 Slower than baseline (>100ms)
⚠️ Significant slowdown (>500ms)

eleanorjboyd · 2026-06-01T15:46:28Z

@karthiknadig could you take over here or open a new PR the way you like it for this fix? Lmk if the attached issue doesn't make sense or need clarity

rchiodo · 2026-06-01T16:58:41Z

+                }
+                thread::sleep(Duration::from_millis(25));
+            }
+            Err(err) => {


Copilot generated:
[unverified] wait_with_output() reads stdout/stderr to EOF. If the spawned process is a wrapper (conda shim, pyenv shim) that forks a grandchild inheriting the piped stdout fd, EOF won't arrive until the grandchild closes the fd — re-introducing unbounded blocking after the poll loop. Consider taking child.stdout/child.stderr before the loop and draining with a bounded read, or using child.stdout.take() + manual read_to_end with a second deadline after the loop.

[verified]

rchiodo · 2026-06-01T16:58:58Z

+        Ok(child) => child,
+        Err(err) => {
+            error!(
+                "Failed to spawn Python to resolve info {:?}: {}",


Copilot generated:
The exit status from try_wait() is discarded (Ok(Some(_status)) => break). Consider checking status.success() here and returning None early with a diagnostic log (including stderr) for non-zero exits, rather than falling through to JSON parsing on potentially empty/garbled stdout.

[verified]

rchiodo

Approved via Review Center.

eleanorjboyd assigned karthiknadig Jun 1, 2026

rchiodo reviewed Jun 1, 2026

View reviewed changes

rchiodo approved these changes Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463)#471

fix: bound resolve spawn with timeout to prevent indefinite hangs (Fixes #463)#471
eleanorjboyd wants to merge 1 commit into
mainfrom
fix/463-bound-resolve-spawn

eleanorjboyd commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

eleanorjboyd commented Jun 1, 2026

Uh oh!

rchiodo Jun 1, 2026

Uh oh!

rchiodo Jun 1, 2026

Uh oh!

rchiodo left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

eleanorjboyd commented Jun 1, 2026

Problem

Fix

Why not a separate worker thread + channel?

Why not drop the cache Mutex during the spawn?

Test

Verification

Expected impact

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Report (Linux) ➖

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Coverage Report (Linux)

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Report (macOS)

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Coverage Report (Windows)

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Report (Windows) ➖

Uh oh!

eleanorjboyd commented Jun 1, 2026

Uh oh!

rchiodo Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

rchiodo Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

rchiodo left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Why not drop the cache `Mutex` during the spawn?

github-actions Bot commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading