A recent study carried out by Governance Primer on behalf of the Universal Acceptance Steering Group (UASG) identified trends in the acceptance of all domain names in software hosted on GitHub, the largest open-source repository host globally. This research builds on previous efforts to identify the underlying issues that cause problems when applications need to handle Internationalized Domain Names (IDNs) and new gTLDs, particularly when it comes to email addresses.
The goal was to obtain real-world data on the usage of software libraries, which are prewritten pieces of code that act much like building blocks: they provide a specific functionality required by the developer so that the wheel does not have to be reinvented every time a feature needs to be implemented. For example, the Pillow library for Python provides image processing functionality, so that an application that edits digital images does not need to code pixel manipulation from scratch, nor common features such as transparency, blur, sharpening, and so on.
In the case of domain names, the libraries of particular concern are the ones that deal with validation, allowing the input of certain characters and structures while disallowing others. To give a practical example, our previous research tested the validation of email addresses in the “form” field (often found in contact sections) of the top 1,000 websites in the world according to Alexa’s rankings. These results were subsequently updated in 2020 by another team, and this is what the acceptance landscape looked like:
Test case | 2017 | 2019 | 2020 |
---|---|---|---|
ascii@ascii.newshort | 91% | 97% | 98% |
ascii@ascii.newlong | 78% | 84% | 84% |
ascii@idn.ascii | 45% | 50% | 47% |
Unicode@ascii.ascii | 14% | 13% | 18% |
Unicode@idn.idn | 8% | 8% | 11% |
Right-to-left (RTL) | 8% | 7% | 11% |
What these results tell us is that the code being deployed on the Web is fairly competent at dealing with new gTLDs of four characters or fewer, but already starts to struggle with longer ones, and acceptance drops dramatically when IDNs are introduced. The question that followed these findings was: what does the landscape look like when it comes to software? While many validation processes are carried out on the Web, several others happen in non-Web applications.
To perform this analysis, the most used programming languages in open-source software, Java and Python, were targeted, and a crawler was created to aggregate all valid (as per GitHub’s guidelines) software projects and extract their “dependency” file. This file tells anybody who wants to work with a given application which libraries it relies upon, so that those can be included in the final software for it to perform its tasks correctly.
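To make the idea of a “dependency” file concrete, the following is a minimal sketch of how library occurrences could be counted across projects. It assumes Python projects that declare dependencies in a pip-style requirements.txt; the sample data, function names, and parsing rules are hypothetical illustrations and do not describe the crawler actually used in the study.

```python
import re
from collections import Counter

# Hypothetical sample data: project name -> contents of its requirements.txt.
SAMPLE_PROJECTS = {
    "project-a": "idna==2.10\nrequests>=2.25\n# a comment\n",
    "project-b": "Django>=3.1\nemail_validator\nidna\n",
    "project-c": "validators==0.18.2\n-r base.txt\n",
}

# A requirement name ends at the first version specifier, extra, marker, or space.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def parse_requirements(text):
    """Return the set of normalized library names declared in one file."""
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments (simplification)
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        match = _NAME.match(line)
        if match:
            # Lowercase and unify separators so "email_validator" and
            # "email-validator" are counted as the same library.
            names.add(match.group(0).lower().replace("_", "-"))
    return names


def count_library_usage(projects):
    """Count in how many projects each library is declared."""
    counts = Counter()
    for text in projects.values():
        counts.update(parse_requirements(text))
    return counts


usage = count_library_usage(SAMPLE_PROJECTS)
for name, total in usage.most_common(3):
    print(name, total)
```

Counting each library once per project, rather than once per line, matches the “occurrence in projects” framing used in the tables of results; other dependency formats would need their own parsers.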
While some lists of the most used libraries do exist, their methodology is not based on the direct sampling of projects, and they do not provide enough metadata for correlations to be made between projects and the libraries they use. This means that it would be hard to map out which projects use an insufficient library and to engage with them to stimulate changes to their codebase, such as adopting a library that is compliant with Universal Acceptance. Further, metadata about the projects was collected to generate a ranking of the most relevant applications (based on an algorithm that considered data points such as the number of forks), which is a feature not provided by GitHub.
Thanks to the “Universal Acceptance Compliance of Some Programming Language Libraries and Frameworks” study, the compliance status of some libraries was already known, and the team proceeded to evaluate the status of others that were deemed to be relevant. In essence, a library that makes use of the newer IDNA2008 standard is “UA-Ready”, while one that makes use of the older IDNA2003 standard is “Not UA-Ready.” There is also the possibility that it follows neither, which leads to a reasonable assumption that it is “Not UA-Ready.”
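The practical difference between the two standards can be seen with a single domain name. The sketch below uses only the Python standard library, whose built-in "idna" codec implements IDNA2003. Because the standard library has no IDNA2008 codec, the helper `naive_a_label` is a hypothetical stand-in that merely Punycode-encodes a label and performs none of the validity checks that IDNA2008 requires; it is there only to show the expected shape of the result.

```python
# Python's built-in "idna" codec follows IDNA2003. Under that standard the
# German sharp s is mapped to "ss" before encoding, so the name silently
# becomes a different, ASCII-only domain.
idna2003 = "faß.de".encode("idna")


def naive_a_label(label):
    """Hypothetical helper: Punycode-encode a label without any IDNA2008 checks."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


# Under IDNA2008 the character "ß" is valid and is preserved, so the label is
# encoded as-is and prefixed with "xn--".
idna2008_like = ".".join(naive_a_label(part) for part in "faß.de".split("."))

print(idna2003)       # b'fass.de'
print(idna2008_like)  # xn--fa-hia.de
```

The two outputs point to different domain names, which is why an application that relies on an IDNA2003 implementation may resolve or reject names that are perfectly valid under IDNA2008. In real code, a dedicated IDNA2008 library should be used instead of a hand-rolled helper like the one above.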
Incorporating a UA-Ready library does not automatically make an application able to accept all domain names, as, unfortunately, other factors are involved, including whether the library is implemented correctly by developers. However, knowing the status of each library makes decision-making around resource allocation for engagement and remediation much more rational, as priorities can be better established, for example: “testing and reaching out to projects that place highly in the ranking and are likely to be UA-Ready.”
The results are presented below.
Java
“RegEx via annotations” seems to be a popular method of performing validation in Java, which is unfavorable to the UASG’s interests, as it is not a uniform way of validating strings: any arbitrary expression can be used to make the check. This means we cannot be sure of what kind of processing is being performed under the hood, but it is likely not helping the application become UA-Ready. The most relevant libraries making use of this method are validation-api, ranking 55th, and its derivative hibernate-validator, placing even higher at 21st. springfox-bean-validators also ranks quite high at 79th.
Library | Occurrence (projects) | Status |
---|---|---|
hibernate-validator | 62963 | Not UA-Ready. RegEx via annotations; Hibernate implementation of validation-api. |
validation-api | 25190 | Not UA-Ready. RegEx via annotations. |
springfox-bean-validators | 12501 | Not UA-Ready. RegEx via annotations; SpringFox implementation of validation-api. |
commons-validator | 4906 | Not UA-Ready. Relies on a static list of TLDs from 2017. |
icu4j | 886 | UA-Ready. IDNA2008. |
libidn | 29 | Not UA-Ready. IDNA2003, deprecated and ported to the Java language as “java.net.IDN”. |
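The risk of validating with free-form regular expressions can be illustrated with a short sketch. It is written in Python for brevity rather than as a Java annotation, and the pattern is a hypothetical example of the kind commonly found in validation rules; it is not taken from any of the libraries listed above.

```python
import re

# Hypothetical pattern: ASCII-only local part and domain, with the TLD
# limited to between two and four letters.
NAIVE_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$")

SAMPLES = [
    "user@example.com",          # legacy TLD
    "user@example.shop",         # short new gTLD
    "user@example.photography",  # long new gTLD
    "user@bücher.de",            # IDN label in the domain
    "用户@example.com",           # Unicode local part
    "用户@例子.世界",              # Unicode local part and IDN domain
]


def accepted(address):
    """Return True if the naive pattern accepts the address."""
    return NAIVE_EMAIL.fullmatch(address) is not None


for address in SAMPLES:
    print(address, "accepted" if accepted(address) else "rejected")
```

Only the first two addresses pass. A pattern like this rejects long new gTLDs and every address that contains non-ASCII characters, and because each project can write its own expression, the behavior cannot be assessed at the library level.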
Python
Out of the entire Python dataset, the idna module ranks 6th overall in terms of usage, which is a favorable result for the UASG’s interests. It can also be a key argument in engaging with the Python language developers to port that module to the language’s core, replacing the default IDNA2003 implementation. This would be a significant gain for a programming language that is in increasing demand.
Library | Occurrence in projects | Status |
---|---|---|
idna | 70789 | UA-Ready. IDNA2008. |
validators | 1660 | Not UA-Ready. Email validation based on Django validator; URL validation based on RegEx. |
email_validator | 1178 | UA-Ready. IDNA2008. |
pyicu | 243 | UA-Ready. IDNA2008. |
idna_ssl | 10 | UA-Ready. IDNA2008. |
The complete study can be found at this link.
Many thanks to project contributors Sávyo Vinícius de Morais, Edson Celio Ferreira Araujo, and Jonas Mendes Fiorini.