Querying Internationalized Domain Names

Recently, I’ve been working a lot on functionality that involves querying DNS records. Since this was a relatively untouched area for me, it required a lot of manual sanity checks to make sure I understood what was happening. I’ve mostly done them using dig - a wonderful little tool for poking at DNS records. If you have not used it, I highly recommend playing with it. Be that as it may, I eventually ran into a problem with certain domains that confounded me.

Digging Internationalized Domains

After dig-ing a bunch of domains for some time, I realized that something was missing from my mental model. Consider a webpage like https://casaè.it/. It’s a fully working webpage: you can visit it and browse it, which means it has DNS records set up. However, when we try to dig it, we get the following response:

> dig casaè.it

; <<>> DiG 9.10.6 <<>> casaè.it
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 18928
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;casa\195\168.it. IN A

;; AUTHORITY SECTION:
it. 3600 IN SOA dns.nic.it. hostmaster.nic.it. 2023070815 10800 900 604800 3600

;; Query time: 90 msec
;; SERVER: 192.168.0.1#53(192.168.0.1)
;; WHEN: Sat Jul 08 16:25:19 EEST 2023
;; MSG SIZE rcvd: 93

In short, we are told that no such domain exists: we get status: NXDOMAIN and no answer section at all. What gives? How can it not exist if we just visited it?

Well, it turns out there’s something called an Internationalized Domain Name (IDN).
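
Before getting to what that means, the question section of the output above already holds a clue: dig asked for casa\195\168.it, not casaè.it. Those \DDD sequences are decimal byte values, and 195 168 is what è looks like when encoded as UTF-8. Here is a rough sketch of that escaping, assuming the terminal sent the name as UTF-8. Note that dig_style_escape is a hypothetical helper written for illustration, not part of dig, and it only mimics the escaping of bytes outside printable ASCII:

```python
def dig_style_escape(name: str) -> str:
    """Render a name roughly the way dig prints it in the question section.

    Hypothetical helper: printable ASCII bytes are kept as-is, every other
    byte is written as a backslash followed by its three-digit decimal value.
    """
    parts = []
    for byte in name.encode("utf-8"):
        if 0x21 <= byte <= 0x7E:
            parts.append(chr(byte))
        else:
            parts.append(f"\\{byte:03d}")
    return "".join(parts)


print(dig_style_escape("casaè.it"))  # casa\195\168.it
```

In other words, the query asked about the raw UTF-8 bytes of the name, which is not the form in which the name is stored in DNS.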

What Are Internationalized Domain Names?

IDNs, put simply, are domain names that contain at least one character outside the traditional ASCII set [1][2][3]. This includes the top-level domain, too (for example, a TLD such as .ευ).

IDNs are not that common in the wild. Even today, more than 25 years after the original IDN proposal, most domains in the “western” world do not contain non-Latin characters. This is understandable, as it’s often easier to remember and type a domain name made up purely of Latin characters, especially if you cater to a multilingual audience whose members might not share a common language or might use substantially different alphabets. Because most domain names are made up of ASCII characters, some people are not even aware that internationalized domain names are possible.

Working With IDNs: Problem

Just because such domains are uncommon does not mean that they do not exist. Sooner or later you will come across one, and chances are it will not be at a convenient time. Keeping that in mind, how can we work with them?

As we’ve seen before, we cannot just dig <IDN> and call it a day, as we’ll find no DNS records. That is because a domain name is a string that can contain only letters of the Latin alphabet (a-z, A-Z), digits (0-9), the minus sign (-) and the period (.) [3][4][5]:

1. A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.). Note that periods are only allowed when
they serve to delimit components of "domain style names". (See
RFC-921, "Domain Name System Implementation Schedule", for
background). No blank or space characters are permitted as part of a
name. No distinction is made between upper and lower case.

Source: https://www.rfc-editor.org/rfc/rfc952

Or, more succinctly, from Wikipedia:

The DNS, which performs a lookup service to translate mostly user-friendly names into network addresses for locating Internet resources, is restricted in practice to the use of ASCII characters, a practical limitation that initially set the standard for acceptable domain names.
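
The character restriction quoted above can be sketched as a small check. This is only an illustration: uses_only_allowed_characters is a hypothetical name, and the function looks at the character set alone, ignoring the length limits and the other rules from the RFCs:

```python
import re

# One label: letters, digits and the minus sign only.
_LABEL = re.compile(r"[A-Za-z0-9-]+")


def uses_only_allowed_characters(domain: str) -> bool:
    """Return True if every dot-separated label uses only A-Z, a-z, 0-9 and '-'.

    Hypothetical helper for illustration; a single trailing dot is tolerated.
    """
    if domain.endswith("."):
        domain = domain[:-1]
    labels = domain.split(".")
    return all(_LABEL.fullmatch(label) is not None for label in labels)


print(uses_only_allowed_characters("muenchen.de"))  # True
print(uses_only_allowed_characters("casaè.it"))     # False
```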

This raises two obvious questions:

  • How can domains with non-latin characters exist if non-ASCII characters are not allowed?
  • How to query (e.g., dig) those domains?

Working With IDNs: Solution

The answer to both of these questions is punycode - a representation of Unicode characters using a subset of ASCII characters [6]. Since domain names do not support Unicode characters, they are encoded in punycode instead. For example, something like münchen.de would be converted to xn--mnchen-3ya.de. As you can see, the punycode representation contains only ASCII letters, digits and hyphens. This makes it possible to use the domain in DNS queries:

> dig münchen.de

; <<>> DiG 9.10.6 <<>> münchen.de
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 35792
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;m\195\188nchen.de. IN A

;; AUTHORITY SECTION:
de. 6555 IN SOA f.nic.de. dns-operations.denic.de. 1690115334 7200 7200 3600000 7200

;; Query time: 8 msec
;; SERVER: 192.168.0.1#53(192.168.0.1)
;; WHEN: Sun Jul 23 15:41:25 EEST 2023
;; MSG SIZE rcvd: 103

> dig xn--mnchen-3ya.de

; <<>> DiG 9.10.6 <<>> xn--mnchen-3ya.de
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53412
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 6, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;xn--mnchen-3ya.de. IN A

;; ANSWER SECTION:
xn--mnchen-3ya.de. 86400 IN A 194.246.148.58

;; AUTHORITY SECTION:
xn--mnchen-3ya.de. 85565 IN NS dns2.ascio.com.
xn--mnchen-3ya.de. 85565 IN NS ns01e.muenchen.de.
xn--mnchen-3ya.de. 85565 IN NS dns2.epag.net.
xn--mnchen-3ya.de. 85565 IN NS dns1.ascio.com.
xn--mnchen-3ya.de. 85565 IN NS ns02e.muenchen.de.
xn--mnchen-3ya.de. 85565 IN NS dns1.epag.net.

;; ADDITIONAL SECTION:
dns1.epag.net. 85548 IN A 212.123.35.78
dns2.epag.net. 85548 IN A 212.123.32.78

;; Query time: 36 msec
;; SERVER: 192.168.0.1#53(192.168.0.1)
;; WHEN: Sun Jul 23 15:41:45 EEST 2023
;; MSG SIZE rcvd: 236

Observe that the second query (dig xn--mnchen-3ya.de) returned an A record, along with a bunch of additional information, even though the first one (dig münchen.de) did not. To be 100% sure, we can try digging casaè.it, too, which translates to xn--casa-8oa.it:

> dig xn--casa-8oa.it

; <<>> DiG 9.10.6 <<>> xn--casa-8oa.it
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51846
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;xn--casa-8oa.it. IN A

;; ANSWER SECTION:
xn--casa-8oa.it. 300 IN A 104.21.89.246
xn--casa-8oa.it. 300 IN A 172.67.166.45

;; AUTHORITY SECTION:
xn--casa-8oa.it. 10265 IN NS doug.ns.cloudflare.com.
xn--casa-8oa.it. 10265 IN NS ruth.ns.cloudflare.com.

;; Query time: 16 msec
;; SERVER: 192.168.0.1#53(192.168.0.1)
;; WHEN: Sun Jul 23 15:42:42 EEST 2023
;; MSG SIZE rcvd: 131
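
The xn-- names used in the queries above can be reproduced one label at a time: each non-ASCII label is punycode-encoded and given an xn-- prefix, while ASCII labels are left alone. Here is a minimal sketch using Python's built-in punycode codec. Note that to_a_label is a hypothetical helper, and it deliberately skips the normalization steps (such as case folding) that a full IDNA implementation performs:

```python
def to_a_label(label: str) -> str:
    """Convert a single label to its ASCII form.

    Hypothetical helper for illustration: no normalization is applied,
    so it should not be used as a replacement for a real IDNA library.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


print(to_a_label("münchen"))  # xn--mnchen-3ya
print(to_a_label("casaè"))    # xn--casa-8oa
print(to_a_label("de"))       # de
```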

An astute reader will notice that the translation is not as simple as substituting Unicode letters with some combination of ASCII characters. The translation algorithm is out of scope for this post, but you can read about its details in Wikipedia’s article on punycode [6]. You can also easily find tools online to do the conversion - just google “punycode converter” or similar. Many programming languages have this encoding built in, too. For example, in Python we can do:

>>> "münchen.de".encode("idna").decode("utf8")
'xn--mnchen-3ya.de'

Whereas in Java/Scala this is as easy as:

import java.net.IDN

IDN.toASCII("münchen.de")
// returns "xn--mnchen-3ya.de"

You may wonder what would happen if a plain ASCII domain were run through the same conversion. In fact, it comes out unchanged:

>>> "munchen.de".encode("idna").decode("utf8")
'munchen.de'

Thus, if you think you may encounter IDNs in your system, it might be a good idea to run each domain through a punycode converter first. This will ensure you can query the domains in the form DNS expects. It will also save you the embarrassment of assuming a DNS record does not exist when in fact it does, just under a different encoding.
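
As a rough sketch of that advice, the conversion can be wrapped in a pair of small helpers built on the same idna codec used earlier. The names to_query_name and to_display_name are hypothetical, and the sketch assumes that Python's built-in codec is good enough for the domains you expect to see:

```python
def to_query_name(domain: str) -> str:
    """Return the ASCII form of a domain, suitable for passing to DNS tools."""
    return domain.encode("idna").decode("ascii")


def to_display_name(query_name: str) -> str:
    """Return the Unicode form of a domain, for showing to people."""
    return query_name.encode("ascii").decode("idna")


print(to_query_name("casaè.it"))             # xn--casa-8oa.it
print(to_query_name("example.com"))          # example.com
print(to_display_name("xn--mnchen-3ya.de"))  # münchen.de
```

Applying to_query_name to every domain at the boundary of the system means the rest of the code only ever sees names that can be queried directly.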

Conclusion

DNS is a wonderful system that has stood the test of time. Its usefulness and flexibility cannot be overstated. However, since it is an old system, its handling of modern use cases is not always intuitive, as IDNs show. Nevertheless, given the restrictions within which the system has to work, I think it’s a testament to DNS engineers’ cleverness and ingenuity that it can be adapted to unforeseen use cases without massive disruption to the way DNS works.

Sources

  1. https://en.wikipedia.org/wiki/Internationalized_domain_name
  2. https://newgtlds.icann.org/en/about/idns
  3. https://datatracker.ietf.org/doc/html/rfc952
  4. https://datatracker.ietf.org/doc/html/rfc819
  5. https://datatracker.ietf.org/doc/html/rfc1035#section-2.3.1
  6. https://en.wikipedia.org/wiki/Punycode