Name and publisher normalization for cross reference by JohnMcPMS · Pull Request #724 · microsoft/winget-cli

JohnMcPMS · 2021-01-28T23:49:05Z

Change

This change adds code for creating normalized strings from package names and publishers. It is not actually used anywhere yet, but it is intended for use in correlating locally installed packages with remote ones, as well correlating one version to the next. I also plan to use the values to create "nicer" Ids for locally installed packages that are not found in a remote source.

In order to support all of Unicode, the code does not use STL's regex, but instead an ICU regex wrapper is implemented to enable global support. This wrapper's interface is purpose built for this use case, so it may need to be adapted as needed for other uses.

The expressions and their use were adapted from code that is used to do similar work for app compatibility on Windows. The goal is to remove versions and extraneous strings from names and publishers to create values unique to the package. The code also attempts to capture the architecture and locale if it is present in the name.

A large set of real world examples can be found in the test data files. These contain real name InputNames.txt and publisher InputPublishers.txt strings from packages found in winget-pkgs, as well as a few others added to smoke test proper functionality outside of English and other European character sets. The result file NormalizationInitialIds.txt contains the strings that result from normalizing both name and publisher and appending them as Publisher + . + Name.

Microsoft Reviewers: Open in CodeFlow

jsoref · 2021-01-28T23:55:25Z

 # Package family names
 \b[-.A-Za-z0-9]+_[a-z0-9]{13}\b
+# Locales for name normalization
+\b\p{Lu}{2,3}(-(CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b


I'd generally suggest doing:

Suggested change

\b\p{Lu}{2,3}(-(CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b

\b\p{Lu}{2,3}(?:-(?:CANS|CYRL|LATN|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b

Basically, if you aren't capturing you should tell perl you aren't.

Otherwise, if you have a rule like:

\b([A-Za-z])\1{3,}\b

(which is earlier in this file), you may be unhappy.

The entire file is run as one big regex? I would assume that the capture reference would be per-line. I don't mind changing it, I just didn't expect that the captures really mattered for this use case.

Yeah, it's run as one big regex.

It's a lazy optimization that only trips things up occasionally.

I need to look into whether it's worth treating each line as a unique regex / how to make that work.

…e the result file

yao-msft · 2021-02-04T21:06:31Z

+                L"QUZ-BO", L"QUZ-EC", L"QUZ-PE", L"RM-CH", L"RO-RO", L"RU-RU", L"RW-RW", L"SAH-RU", L"SA-IN", L"SE-FI", L"SE-NO",
+                L"SE-SE", L"SI-LK", L"SK-SK", L"SL-SI", L"SMA-NO", L"SMA-SE", L"SMJ-NO", L"SMJ-SE", L"SMN-FI", L"SMS-FI", L"SQ-AL",
+                L"SV-FI", L"SV-SE", L"SW-KE", L"SYR-SY", L"TA-IN", L"TE-IN", L"TH-TH", L"TK-TM", L"TN-ZA", L"TR-TR", L"TT-RU",
+                L"UG-CN", L"UK-UA", L"UR-PK", L"VI-VN", L"WO-SN", L"XH-ZA", L"YO-NG", L"ZH-CN", L"ZH-HK", L"ZH-MO", L"ZH-SG",


L"ZH-CN" [](start = 86, length = 8)

The comment said the values must be in sorted order but looks like no?

They also need to be folded, which is a harder thing to expect people to do. So I fold and sort them into another container in the constructor. I can update the comment to better reflect this.

yao-msft · 2021-02-04T21:28:19Z

+            {
+                std::wstring result = Utility::Normalize(ConvertToUTF16(value));
+                Trim(result);
+                size_t atPos = result.find(L"@@", 3);


@@ [](start = 45, length = 2)

what's special with this?

I have no idea; the starting point for this code was something doing similar work, and I left the majority of it alone.

yao-msft

github-actions · 2021-02-05T23:30:33Z

Misspellings found, please review:

URIs

To accept these changes, run the following commands from this repository on this branch

pushd $(git rev-parse --show-toplevel)
perl -e '
my $new_expect_file=".github/actions/spelling/expect.txt";
use File::Path qw(make_path);
make_path ".github/actions/spelling";
open FILE, q{<}, $new_expect_file; chomp(my @words = <FILE>); close FILE;
my @add=qw('"URIs "');
my %items; @items{@words} = @words x (1); @items{@add} = @add x (1);
@words = sort {lc($a) cmp lc($b)} keys %items;
open FILE, q{>}, $new_expect_file; for my $word (@words) { print FILE "$word\n" if $word =~ /\w/; };
close FILE;'
popd

JohnMcPMS added 7 commits January 19, 2021 13:07

Norm code mostly complete, refactor for lifetime and test next

ac40b10

Work done prior to ICU transition

0191135

ICU regex wrapper, NameNorm needs testing

95ba831

Finish up name normalization

eaa314e

Fix typo and add locale regex to patterns

c339a48

Exclude test files containing real world data

f287525

Add new words to expect

8c10c7f

JohnMcPMS requested a review from a team as a code owner January 28, 2021 23:49

jsoref reviewed Jan 28, 2021

View reviewed changes

JohnMcPMS added 5 commits January 28, 2021 17:13

Update pattern per feeback and fix comparison

63eba08

Update test to be expected == actual and add an ignored test to updat…

fcb7e84

…e the result file

Change name to avoid potential test break

e38c1ef

Update version regex to allow arbitrary connected bits after a version

0bc9055

Forgot to actually update the expected data

27bb5f6

yao-msft reviewed Feb 4, 2021

View reviewed changes

yao-msft approved these changes Feb 4, 2021

View reviewed changes

JohnMcPMS added 2 commits February 5, 2021 14:38

Update comments for feedback

58f272f

Merge from master

e8b784d

Add URIs

ad2be0b

JohnMcPMS merged commit f4882c2 into microsoft:master Feb 6, 2021

JohnMcPMS deleted the arpnorm branch February 6, 2021 00:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Name and publisher normalization for cross reference#724

Name and publisher normalization for cross reference#724
JohnMcPMS merged 15 commits into
microsoft:masterfrom
JohnMcPMS:arpnorm

JohnMcPMS commented Jan 28, 2021 •

edited by ghost

Loading

Uh oh!

jsoref Jan 28, 2021

Uh oh!

JohnMcPMS Jan 29, 2021

Uh oh!

jsoref Jan 29, 2021

Uh oh!

yao-msft Feb 4, 2021

Uh oh!

JohnMcPMS Feb 5, 2021

Uh oh!

yao-msft Feb 4, 2021

Uh oh!

JohnMcPMS Feb 5, 2021

Uh oh!

yao-msft left a comment

Uh oh!

github-actions Bot commented Feb 5, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	\b\p{Lu}{2,3}(-(CANS\|CYRL\|LATN\|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b
	\b\p{Lu}{2,3}(?:-(?:CANS\|CYRL\|LATN\|MONG))?-\p{Lu}{2}(?![A-Z])(?:-VALENCIA)?\b

Uh oh!

Conversation

JohnMcPMS commented Jan 28, 2021 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change

Microsoft Reviewers: Open in CodeFlow

Uh oh!

jsoref Jan 28, 2021

Choose a reason for hiding this comment

Uh oh!

JohnMcPMS Jan 29, 2021

Choose a reason for hiding this comment

Uh oh!

jsoref Jan 29, 2021

Choose a reason for hiding this comment

Uh oh!

yao-msft Feb 4, 2021

Choose a reason for hiding this comment

Uh oh!

JohnMcPMS Feb 5, 2021

Choose a reason for hiding this comment

Uh oh!

yao-msft Feb 4, 2021

Choose a reason for hiding this comment

Uh oh!

JohnMcPMS Feb 5, 2021

Choose a reason for hiding this comment

Uh oh!

yao-msft left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Feb 5, 2021

Misspellings found, please review:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JohnMcPMS commented Jan 28, 2021 •

edited by ghost

Loading