Name and publisher normalization for cross reference#724
Conversation
| # Package family names | ||
| \b[-.A-Za-z0-9]+_[a-z0-9]{13}\b | ||
| # Locales for name normalization | ||
| \b\p{Lu}{2,3}(-(CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b |
There was a problem hiding this comment.
I'd generally suggest doing:
| \b\p{Lu}{2,3}(-(CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b | |
| \b\p{Lu}{2,3}(?:-(?:CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b |
Basically, if you aren't capturing you should tell perl you aren't.
Otherwise, if you have a rule like:
\b([A-Za-z])\1{3,}\b
(which is earlier in this file), you may be unhappy.
There was a problem hiding this comment.
The entire file is run as one big regex? I would assume that the capture reference would be per-line. I don't mind changing it, I just didn't expect that the captures really mattered for this use case.
There was a problem hiding this comment.
Yeah, it's run as one big regex.
It's a lazy optimization that only trips things up occasionally.
I need to look into whether it's worth treating each line as a unique regex / how to make that work.
…e the result file
| L"QUZ-BO", L"QUZ-EC", L"QUZ-PE", L"RM-CH", L"RO-RO", L"RU-RU", L"RW-RW", L"SAH-RU", L"SA-IN", L"SE-FI", L"SE-NO", | ||
| L"SE-SE", L"SI-LK", L"SK-SK", L"SL-SI", L"SMA-NO", L"SMA-SE", L"SMJ-NO", L"SMJ-SE", L"SMN-FI", L"SMS-FI", L"SQ-AL", | ||
| L"SV-FI", L"SV-SE", L"SW-KE", L"SYR-SY", L"TA-IN", L"TE-IN", L"TH-TH", L"TK-TM", L"TN-ZA", L"TR-TR", L"TT-RU", | ||
| L"UG-CN", L"UK-UA", L"UR-PK", L"VI-VN", L"WO-SN", L"XH-ZA", L"YO-NG", L"ZH-CN", L"ZH-HK", L"ZH-MO", L"ZH-SG", |
There was a problem hiding this comment.
L"ZH-CN" [](start = 86, length = 8)
The comment said the values must be in sorted order but looks like no?
There was a problem hiding this comment.
They also need to be folded, which is a harder thing to expect people to do. So I fold and sort them into another container in the constructor. I can update the comment to better reflect this.
| { | ||
| std::wstring result = Utility::Normalize(ConvertToUTF16(value)); | ||
| Trim(result); | ||
| size_t atPos = result.find(L"@@", 3); |
There was a problem hiding this comment.
@@ [](start = 45, length = 2)
what's special with this?
There was a problem hiding this comment.
I have no idea; the starting point for this code was something doing similar work, and I left the majority of it alone.
Misspellings found, please review:
To accept these changes, run the following commands from this repository on this branch |
Change
This change adds code for creating normalized strings from package names and publishers. It is not actually used anywhere yet, but it is intended for use in correlating locally installed packages with remote ones, as well correlating one version to the next. I also plan to use the values to create "nicer" Ids for locally installed packages that are not found in a remote source.
In order to support all of Unicode, the code does not use STL's regex, but instead an ICU regex wrapper is implemented to enable global support. This wrapper's interface is purpose built for this use case, so it may need to be adapted as needed for other uses.
The expressions and their use were adapted from code that is used to do similar work for app compatibility on Windows. The goal is to remove versions and extraneous strings from names and publishers to create values unique to the package. The code also attempts to capture the architecture and locale if it is present in the name.
A large set of real world examples can be found in the test data files. These contain real name
InputNames.txtand publisherInputPublishers.txtstrings from packages found in winget-pkgs, as well as a few others added to smoke test proper functionality outside of English and other European character sets. The result fileNormalizationInitialIds.txtcontains the strings that result from normalizing both name and publisher and appending them asPublisher + . + Name.Microsoft Reviewers: Open in CodeFlow