Skip to content

[ports] Unpack ports atomically to improve concurrency robustness#27161

Merged
sbc100 merged 1 commit into
emscripten-core:mainfrom
sbc100:fix-ports-concurrency
Jun 23, 2026
Merged

[ports] Unpack ports atomically to improve concurrency robustness#27161
sbc100 merged 1 commit into
emscripten-core:mainfrom
sbc100:fix-ports-concurrency

Conversation

@sbc100

@sbc100 sbc100 commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

When building ports in parallel, different processes can race on checking
the port's up-to-date status. Currently, we unpack directly into the target
directory and then write the marker file. This creates a window of time
where the directory exists but is incomplete and lacks the marker.

If another process checks the port status during this window, it will
conclude the port is invalid/incomplete, and aggressively delete the
directory and its cached builds to re-unpack it. This can disrupt other
processes currently compiling from that directory or lead to
EM_CACHE_IS_LOCKED errors when child processes try to rebuild deleted
libraries.

By unpacking to a temporary directory and atomically renaming it to the
final destination, we ensure that the target directory either does not
exist at all or is 100% complete with the marker. This eliminates the
partially unpacked state and leverages the atomicity of directory
rename (which is highly coherent in OS VFS layers) to reduce the race
window to zero.

This should help resolve intermittent bot failures on macOS and Windows,
where VFS metadata latency can cause processes to see inconsistent directory
states.

Hopefully this will address #27160, although time will tell.

When building ports in parallel, different processes can race on checking
the port's up-to-date status. Currently, we unpack directly into the target
directory and then write the marker file. This creates a window of time
where the directory exists but is incomplete and lacks the marker.

If another process checks the port status during this window, it will
conclude the port is invalid/incomplete, and aggressively delete the
directory and its cached builds to re-unpack it. This can disrupt other
processes currently compiling from that directory or lead to
EM_CACHE_IS_LOCKED errors when child processes try to rebuild deleted
libraries.

By unpacking to a temporary directory and atomically renaming it to the
final destination, we ensure that the target directory either does not
exist at all or is 100% complete with the marker. This eliminates the
partially unpacked state and leverages the atomicity of directory
rename (which is highly coherent in OS VFS layers) to reduce the race
window to zero.

This should help resolve intermittent bot failures on macOS and Windows,
where VFS metadata latency can cause processes to see inconsistent directory
states.
@sbc100 sbc100 requested a review from dschuff June 23, 2026 01:09
@sbc100 sbc100 changed the title ports: Unpack ports atomically to improve concurrency robustness [ports] Unpack ports atomically to improve concurrency robustness Jun 23, 2026
@sbc100 sbc100 changed the title [ports] Unpack ports atomically to improve concurrency robustness ports: Unpack ports atomically to improve concurrency robustness Jun 23, 2026
@sbc100 sbc100 changed the title ports: Unpack ports atomically to improve concurrency robustness [ports] Unpack ports atomically to improve concurrency robustness Jun 23, 2026
Comment thread tools/ports/__init__.py
utils.safe_ensure_dirs(fullname)
shutil.unpack_archive(filename=fullpath, extract_dir=fullname)
utils.write_file(marker, url + '\n')
unpack_dir = fullname + '.tmp'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it could still happen that a second process could see that the port directory doesn't exist, enter unpack() and and then they'd just both be clobbering each other in the temp directory?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unpack operation is all within the with cache.lock('unpack port'): block

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW its not clear to me that this is going to completely solve this issue, its more of and experiment, and it seems completely harmless and a good idea anyway.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what's happening now is that the process that loses the race is deleting the directory before it takes the lock?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it seems like what is happening is that the filesystem lock can be released, but there is still come risk the other process might not see the stamp file that we created while holding the lock.

It seems like this is especially true on windows but we have seen it on all platforms. Seems kind of scary but that is the current analysis that gemini has come up with. i.e. secondary process gets the lock but only seem a partial stamp file or no stamp file.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See point (4) in the analysis in the bug report:

The Race: If up_to_date('sdl2') in Process B returns False (which can happen due to filesystem latency in reflecting the marker file written
by Process A, especially on macOS/Windows), Process B thinks the port is invalid.

In particular the AI seems to think there is such as things as filesystem latency where one process could acquire the lock but then not observe all the effects of the previous process that was holding it. This seems like it should be impossible to me. I would hope that filesystem operations would be sequentially consistent, especially when process are using filelock.py to communicate.

But I think this PR is worth landing to see if it makes any difference.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. yeah it makes sense that this change would fix that problem, and seems straightforward enough to try. (Maybe we shouldn't use "Fixes: 27160" until we know that the issue is actually fixed)

@sbc100 sbc100 merged commit 87a25dc into emscripten-core:main Jun 23, 2026
39 checks passed
@sbc100 sbc100 deleted the fix-ports-concurrency branch June 23, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants