Skip to content

Name and publisher normalization for cross reference#724

Merged
JohnMcPMS merged 15 commits into
microsoft:masterfrom
JohnMcPMS:arpnorm
Feb 6, 2021
Merged

Name and publisher normalization for cross reference#724
JohnMcPMS merged 15 commits into
microsoft:masterfrom
JohnMcPMS:arpnorm

Conversation

@JohnMcPMS

@JohnMcPMS JohnMcPMS commented Jan 28, 2021

Copy link
Copy Markdown
Member

Change

This change adds code for creating normalized strings from package names and publishers. It is not actually used anywhere yet, but it is intended for use in correlating locally installed packages with remote ones, as well correlating one version to the next. I also plan to use the values to create "nicer" Ids for locally installed packages that are not found in a remote source.

In order to support all of Unicode, the code does not use STL's regex, but instead an ICU regex wrapper is implemented to enable global support. This wrapper's interface is purpose built for this use case, so it may need to be adapted as needed for other uses.

The expressions and their use were adapted from code that is used to do similar work for app compatibility on Windows. The goal is to remove versions and extraneous strings from names and publishers to create values unique to the package. The code also attempts to capture the architecture and locale if it is present in the name.

A large set of real world examples can be found in the test data files. These contain real name InputNames.txt and publisher InputPublishers.txt strings from packages found in winget-pkgs, as well as a few others added to smoke test proper functionality outside of English and other European character sets. The result file NormalizationInitialIds.txt contains the strings that result from normalizing both name and publisher and appending them as Publisher + . + Name.

Microsoft Reviewers: Open in CodeFlow

@JohnMcPMS JohnMcPMS requested a review from a team as a code owner January 28, 2021 23:49
Comment thread .github/actions/spelling/patterns.txt Outdated
# Package family names
\b[-.A-Za-z0-9]+_[a-z0-9]{13}\b
# Locales for name normalization
\b\p{Lu}{2,3}(-(CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd generally suggest doing:

Suggested change
\b\p{Lu}{2,3}(-(CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b
\b\p{Lu}{2,3}(?:-(?:CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b

Basically, if you aren't capturing you should tell perl you aren't.

Otherwise, if you have a rule like:

\b([A-Za-z])\1{3,}\b

(which is earlier in this file), you may be unhappy.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire file is run as one big regex? I would assume that the capture reference would be per-line. I don't mind changing it, I just didn't expect that the captures really mattered for this use case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's run as one big regex.

It's a lazy optimization that only trips things up occasionally.

I need to look into whether it's worth treating each line as a unique regex / how to make that work.

L"QUZ-BO", L"QUZ-EC", L"QUZ-PE", L"RM-CH", L"RO-RO", L"RU-RU", L"RW-RW", L"SAH-RU", L"SA-IN", L"SE-FI", L"SE-NO",
L"SE-SE", L"SI-LK", L"SK-SK", L"SL-SI", L"SMA-NO", L"SMA-SE", L"SMJ-NO", L"SMJ-SE", L"SMN-FI", L"SMS-FI", L"SQ-AL",
L"SV-FI", L"SV-SE", L"SW-KE", L"SYR-SY", L"TA-IN", L"TE-IN", L"TH-TH", L"TK-TM", L"TN-ZA", L"TR-TR", L"TT-RU",
L"UG-CN", L"UK-UA", L"UR-PK", L"VI-VN", L"WO-SN", L"XH-ZA", L"YO-NG", L"ZH-CN", L"ZH-HK", L"ZH-MO", L"ZH-SG",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L"ZH-CN" [](start = 86, length = 8)

The comment said the values must be in sorted order but looks like no?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They also need to be folded, which is a harder thing to expect people to do. So I fold and sort them into another container in the constructor. I can update the comment to better reflect this.

{
std::wstring result = Utility::Normalize(ConvertToUTF16(value));
Trim(result);
size_t atPos = result.find(L"@@", 3);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ [](start = 45, length = 2)

what's special with this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea; the starting point for this code was something doing similar work, and I left the majority of it alone.

@yao-msft yao-msft left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@github-actions

github-actions Bot commented Feb 5, 2021

Copy link
Copy Markdown

Misspellings found, please review:

  • URIs
To accept these changes, run the following commands from this repository on this branch
pushd $(git rev-parse --show-toplevel)
perl -e '
my $new_expect_file=".github/actions/spelling/expect.txt";
use File::Path qw(make_path);
make_path ".github/actions/spelling";
open FILE, q{<}, $new_expect_file; chomp(my @words = <FILE>); close FILE;
my @add=qw('"URIs "');
my %items; @items{@words} = @words x (1); @items{@add} = @add x (1);
@words = sort {lc($a) cmp lc($b)} keys %items;
open FILE, q{>}, $new_expect_file; for my $word (@words) { print FILE "$word\n" if $word =~ /\w/; };
close FILE;'
popd

@JohnMcPMS JohnMcPMS merged commit f4882c2 into microsoft:master Feb 6, 2021
@JohnMcPMS JohnMcPMS deleted the arpnorm branch February 6, 2021 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants