Submission

Name: Bill Jouris

Date: 9 Apr 2024

Original Public Comment: String Similarity Review Guidelines

Other Comments

At the presentation in San Juan, I raised the issue of Underlining. You asked that I include it in a comment, so here you go:

When I look at a domain name, the software will automatically have changed the color of the font (which is not a problem) and underlined the name (which can be). First, various diacritics are written below the line. Underlining overlays them and makes them effectively invisible. When the user is not familiar with the diacritic involved, he will not even know to look for it. But even when he knows about it, he will have great difficulty, at best, seeing it. Second, there are letters, in several scripts, which differ from each other only by lines, dots, etc. that occur in the same place where the underline goes. One obvious example, from the Latin script, involves the letter G (in sans serif fonts) and the letter Q.

As a result of this potential for confusion, I believe that any review for string similarity needs to include consideration of whether underlining is going to obscure the distinction between the code points being used.
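As a rough illustration of what such a consideration might look like in software, the sketch below flags labels that differ only in combining marks placed below the base letter, which is where an underline is drawn. It is not taken from the Guidelines. The function names are hypothetical, and the choice of canonical combining classes 202 (attached below) and 220 (below) as a stand-in for "hidden by underlining" is my assumption.

```python
import unicodedata

# Assumption: marks with these canonical combining classes sit where an
# underline is drawn. 202 = attached below (e.g. cedilla), 220 = below
# (e.g. dot below).
BELOW_CLASSES = {202, 220}


def below_line_marks(label: str) -> list[str]:
    """Return the Unicode names of below-the-line combining marks in a label."""
    decomposed = unicodedata.normalize("NFD", label)
    return [
        unicodedata.name(ch, f"U+{ord(ch):04X}")
        for ch in decomposed
        if unicodedata.combining(ch) in BELOW_CLASSES
    ]


def strip_below_marks(label: str) -> str:
    """Decompose a label and drop the marks an underline might cover."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(
        ch for ch in decomposed
        if unicodedata.combining(ch) not in BELOW_CLASSES
    )


def differ_only_below(a: str, b: str) -> bool:
    """True if two labels are distinct but match once below marks are removed."""
    na = unicodedata.normalize("NFD", a)
    nb = unicodedata.normalize("NFD", b)
    return na != nb and strip_below_marks(na) == strip_below_marks(nb)
```

A check like this only catches diacritics that Unicode encodes as separate combining marks. It would not catch cases such as G and Q, where the distinguishing stroke is part of the base letter itself; those would still need visual review.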

I also have some other comments on the document:

· Section 3.1: there is an additional complexity which arises when users are confronted with a label in the script used for their native language but, because the script has been extended for use in multiple, possibly totally unrelated, languages, the label contains elements with which they are unfamiliar. Annex A covers this; it should be referenced explicitly here.

· Section 3.1: it would be well to recognize explicitly that even attentive users do not generally look at what appears in the browser address bar. They look at the link they are presented with.

· Section 3.2.2: You might wish to consider situations like this: a native speaker of Japanese will recognize a label written in kanji. But he would also recognize, as the same word, the label spelled out in kana, or even in romaji, even though the three look nothing alike. This could be covered in detail in Annex C, but should at least be mentioned here.

· Section 5.3.1: I wonder if an example might be found where the distinction is something which would be obscured by the underlining routinely added by word processing or web browser software. (See my note on underlining above.)

· Annex A: it would be good to speak about font differences and their possible impact, as for these scripts (especially Latin script), numerous different fonts are in common use. (For example, serif and sans serif fonts in the Latin script.) In most cases, font differences are trivial, although they can cause users unfamiliar with particular diacritics to see them as font peculiarities. But there are cases where the difference is substantial, such as the lowercase letter g, which can change from a single-storey form to a double-storey form, depending on the font.

· Section 9.1: it appears that this section uses “registry” to mean the set of gTLDs, whereas the term as generally used within ICANN means something rather different, focused on SLDs. This is confusing on first reading. (Or, if the authors did mean the term as it is commonly used, it is not clear why it would arise at all in a document about TLDs.) Some clarification may be in order.

· Section 9: it would be helpful to have more detail about how the Index (Variant) Label would be calculated. I can think of a couple of possibilities, but my imagination fails to find one which would have the specified property of uniqueness. Sample calculations might help here as well. [Section 16.1 does not appear, yet, to provide useful input for this.] Note: Section 11 seems to suggest that this, like the rest of the heuristic, is still only an aspiration.

· Section 10.1, step 1, paragraph 2: change “…basis of semantic of phonetic equivalence” to “…basis of semantic or phonetic equivalence”

· Section 10.1, step 4: It is not immediately obvious how anyone would know of a proposed label, in order to request manual screening. Is there an unlisted step somewhere that would publish a list of requested labels for public review?

· Section 10, step 5: it seems like this contemplates something for confusables akin to allocatable (vs blocked) variants. If so, how will these be arrived at?

· Section 11: this appears to be saying that, contra Section 9, it might well not be possible to find a way to automate prescreening. Is that the intended reading?

· Annex C: Japanese uses three different phonetic scripts (hiragana, katakana, and romaji), in addition to characters (kanji). A Japanese speaker will recognize a word as the same, whether it is written in kanji or in one of the phonetic scripts. Accordingly, it seems like those should be considered “similar” for the purposes of the similarity review.
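To make the Japanese point a little more concrete, here is a minimal sketch of the easiest piece of it: treating a katakana spelling and a hiragana spelling of a label as the same. It relies on the fact that the katakana letters U+30A1 through U+30F6 sit exactly 0x60 above their hiragana counterparts in Unicode. The function names are hypothetical, and nothing here comes from the Guidelines. Matching kanji or romaji to kana would need dictionary data and is not attempted.

```python
KATAKANA_START = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_END = 0x30F6    # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance from each katakana letter to its hiragana twin


def fold_katakana_to_hiragana(label: str) -> str:
    """Map katakana letters to hiragana; leave every other code point alone."""
    folded = []
    for ch in label:
        cp = ord(ch)
        if KATAKANA_START <= cp <= KATAKANA_END:
            folded.append(chr(cp - KANA_OFFSET))
        else:
            folded.append(ch)
    return "".join(folded)


def same_kana_spelling(a: str, b: str) -> bool:
    """True if two labels match once katakana is folded to hiragana."""
    return fold_katakana_to_hiragana(a) == fold_katakana_to_hiragana(b)
```

For example, `same_kana_spelling("トウキョウ", "とうきょう")` compares the katakana and hiragana spellings of the same word after folding. The kanji spelling would not be matched by this sketch.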

Summary of Submission

Underlining, which is commonly added by software, needs to be considered when deciding if two labels are similar.

There are also questions and comments about various sections in the document.
