Clarification of the situation regarding DNSSEC on 4. 4. 2017


As always, we want to be open and keep nothing secret.

On Tuesday, April 4, complications occurred with .cz domains that are secured using DNSSEC technology, resulting in their partial unavailability from some networks. We would like to explain what exactly happened and how it affected each service.

What exactly happened

A technical error in the generation of DNSSEC keys caused problems with domain name resolution (and with domain authentication for mail services) for anyone using DNS servers that validate DNSSEC. Fortunately, such servers are still a minority, so only a few percent of visitors, those on networks whose DNS servers validate DNSSEC, could not reach the domains. According to our measurements, this affected about 3-5% of traffic (our first estimates were higher, at about 10%). Affected networks included, for example, part of the O2 network and networks that use Google’s DNS servers.

The ZSK key for DNSSEC is regularly and automatically replaced once a month by the robot. This is done by generating a new key 5 days before the key expires and adding it to the zone. We have been using this procedure since 2011 and until Monday 3.4.2017 it worked without any problems.

Unfortunately, on Monday, the new key was not created. According to the logs, the robot started as usual and reported no error for the process. As a result, the technicians received no warning that something had gone wrong during key creation, and our monitoring did not detect a failure either, because the robot had been triggered in a completely normal way.

Unfortunately, we had no monitoring for this situation. As a result, the old key expired and the chain of trust was broken.
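To illustrate the kind of check that was missing, here is a minimal sketch of an independent monitor. It does not trust the robot's exit status; it looks at the keys themselves and warns when no successor key exists inside the 5-day lead time described above. The `ZoneKey` record, the function name and the key tags are hypothetical and are not taken from our actual system. The sketch follows this article's wording and speaks of a key that "expires".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ZoneKey:
    """Hypothetical record of one ZSK in a zone (illustrative data model only)."""
    key_tag: int
    expires: datetime


def rollover_alerts(keys, now, lead_time=timedelta(days=5)):
    """Return warnings based on the keys actually present, not on the robot's status."""
    alerts = []
    valid = [k for k in keys if k.expires > now]
    if not valid:
        alerts.append("no valid ZSK in zone: chain of trust is broken")
        return alerts
    latest = max(k.expires for k in valid)
    # If even the longest-lived key expires within the lead time,
    # the successor key was never generated and published.
    if latest - now <= lead_time:
        alerts.append(f"no successor ZSK published; last key expires {latest.isoformat()}")
    return alerts
```

Run once a day, a check like this would raise a warning as soon as the successor key is overdue, even when the generating job itself reports success.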

ISP resolvers that strictly validate DNSSEC then returned an error to their users. This affected roughly 3-5% of the Internet users who tried to access a .cz domain while it had an invalid key.

We immediately began resolving the situation by regenerating keys for all domains for which we manage DNS records, which is over 179 thousand.

Unfortunately, our system for generating zone files was not powerful enough for this situation at the time. Generating new keys and then zone files is very demanding on CPU and disk. Although we dedicated practically the whole server to this activity, it was not enough.

We prioritized renewing keys first for customers who contacted our customer support, then for customers who have services with us, then for parking pages, and finally for domains without an A record.
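The ordering above can be sketched as a simple priority sort. This is only an illustration under our own assumptions: the `Priority` names, the function and the sample domain names are hypothetical, and the real queue was not necessarily implemented this way.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means the zone is regenerated earlier (hypothetical labels)."""
    SUPPORT_REQUEST = 0   # customer wrote to customer support
    ACTIVE_SERVICE = 1    # customer has services with us
    PARKED = 2            # parking page
    NO_A_RECORD = 3       # domain without an A record


def regeneration_order(domains):
    """Take (name, Priority) pairs and return names in processing order.

    sorted() is stable, so domains with equal priority keep their input order.
    """
    return [name for name, _prio in sorted(domains, key=lambda d: d[1])]


queue = regeneration_order([
    ("parked.example", Priority.PARKED),
    ("shop.example", Priority.ACTIVE_SERVICE),
    ("empty.example", Priority.NO_A_RECORD),
    ("urgent.example", Priority.SUPPORT_REQUEST),
])
```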

To speed up the process, our developers and engineers worked together on a new temporary solution for regenerating keys. It ran alongside the original system and had computing power dedicated to it. Once it was running, we got down to 0.7 to 2 seconds per domain, with 30 CPU threads working simultaneously. We deployed the solution in the morning, which significantly accelerated the generation. We will improve this solution in the future and deploy it as part of our DNS.

With changes like these, it is important not only to generate new zone files but also to make sure they are consistent and error-free, to distribute everything to our 4 DNS servers (2 in the Czech Republic, 1 in Germany and 1 in the Netherlands), and then to reload all zones. Considering that about 179,000 domains were involved, you can understand that such an “operation” was very demanding.
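For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. This sketch assumes that 0.7 to 2 seconds is the time one thread needs per domain and that the 30 threads scale perfectly; both are assumptions made here for illustration, and the estimate ignores distribution to the DNS servers and zone reloads.

```python
def estimated_hours(domains, seconds_per_domain, threads):
    """Wall-clock hours to regenerate all zones, assuming perfect parallelism."""
    return domains * seconds_per_domain / threads / 3600


low = estimated_hours(179_000, 0.7, 30)   # best case: 0.7 s per domain
high = estimated_hours(179_000, 2.0, 30)  # worst case: 2 s per domain
print(f"{low:.1f} to {high:.1f} hours")   # prints "1.2 to 3.3 hours"
```

Under these assumptions the generation alone takes on the order of one to three hours, before any consistency checks, distribution and reloads.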

Our keys expired before 19:00 on Tuesday evening. We started to register the first problems before 21:00, and at 21:40 we were already restoring domains. The next day, everything was resolved at about 15:30. The last active services (except for parked domains) were fully functional at about 13:30, after which zones were generated for domains that have no services.

We reported on the whole situation on social networks (the first post went out a few minutes after the first problems), on the status pages of our services, and in the chat and service administration for our customers. Over the course of the night, we updated everything several times, with several staff members working on it at the same time.

Our website had instructions on how to get your service restored faster: just write to us or click on update DNS settings in the customer administration. This generated the zone with priority, and within 30 minutes the service was fully operational from all networks.

How the DNSSEC outage affected our services

The affected domains were .cz domains with DNSSEC (i.e. websites and e-mails on these domains).

Emails

Let’s start with the emails. For security reasons, our mailserver verifies whether an email is actually being sent from the domain it claims. It uses our DNS to do this. Since there was no valid DNSSEC record, the mailserver refused to send the email. This only applied to emails sent via SMTP; emails sent from PHP scripts went out fine.

We solved the problem with the emails by switching this verification from our DNS to the DNS servers of CZ.NIC.
An addendum to our original statement, “Yes, it is a surprise that CZ.NIC’s DNS servers do not validate DNSSEC.”, with a clarification from CZ.NIC (April 21, 2017):
“Of course our DNS validates. In your case it partly worked because unbound (which we use) by default keeps accepting expired signatures for 10% of the expiration-inception period, with a minimum of one hour and a maximum of 24 hours.

So I guess we can conclude that our ODVR indeed did not return an error for some time even though DNSSEC had expired.”
We apologize for the inaccurate wording. It was certainly not meant against CZ.NIC; we are only describing the whole situation.
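The rule quoted by CZ.NIC can be written down as a small calculation. The sketch below only restates the 10% rule with its one-hour and 24-hour limits as given in the quote; the function names are ours, and the signature lifetimes used afterwards are hypothetical examples, not the real values from the incident. As far as we know, the two limits correspond to unbound's `val-sig-skew-min` and `val-sig-skew-max` settings.

```python
from datetime import datetime, timedelta


def signature_grace(inception, expiration, fraction=0.10,
                    minimum=timedelta(hours=1), maximum=timedelta(hours=24)):
    """Extra time for which an expired signature is still accepted.

    10% of the signature lifetime, clamped between one hour and 24 hours.
    """
    lifetime = expiration - inception
    return min(max(lifetime * fraction, minimum), maximum)


def accepted_until(inception, expiration):
    """Latest moment at which a resolver applying this rule still accepts the signature."""
    return expiration + signature_grace(inception, expiration)
```

For example, a signature valid for 30 days would get 3 days by the 10% rule, which is capped at 24 hours, while a signature valid for 5 days would get 12 hours. This is consistent with a resolver continuing to answer for some time after the signatures had expired.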

Websites

.cz domains whose DNS is secured with DNSSEC appeared to be unavailable to any visitor whose resolver validates DNSSEC. This affected an estimated 3-5% of visitors. These were visitors from the Czech Republic who use Google DNS, some visitors on the O2 and Vodafone networks, and some on other networks.

We arrived at this figure by comparing accesses to web hosting in the logs of our DDoS protection, which monitors the network in detail, and by comparing data transfers reported by our connectivity providers.

So it was not the case that everything was unavailable or that random requests failed. Rather, some domains were unavailable or less available from some networks, depending on which DNS servers a particular visitor used.

Traffic comparison

The following graph shows the total traffic from 16:00 on 3.4.2017 to 16:00 on 4.4.2017. At the top is incoming traffic from the Internet to us and at the bottom is outgoing traffic from us to the Internet (to visitors).

For comparison, total traffic from 16:00 on 3.4.2017 to 16:00 on 6.4.2017.

As you can see in the graphs, no big drops in traffic are visible during the incident. Most resolvers do not validate DNSSEC.

Of course, the overall traffic includes web hosting, VPS, dedicated servers and WEDOS disk. That’s why we’ve also selected a chart of 88,000 pure NoLimit web hosting accounts over IPv4 and IPv6, broken down by type of communication. The chart covers the period from 16:00 on 3.4.2017 to 16:00 on 6.4.2017. Our DDoS protection can check and evaluate everything in real time in great detail.

Individual networks

We wrote that only some networks were affected by the unavailability. For comparison, here are the transfer charts for the biggest ones, covering the period from 16:00 on 3.4.2017 to 16:00 on 6.4.2017.

Telefonica Czech Republic, a.s. (O2)

UPC Liberty Global Operations B.V.

T-Mobile Czech Republic a.s.

Vodafone Czech Republic a.s.

What we’ve done to make sure it never happens again and planned improvements

We have improved our monitoring for situations like this, and in addition to automatic checks we will also carry out regular manual checks.

The faster zone file generator means that DNS changes are applied more quickly.

We also plan to modify our web interface for emails so that some of the restrictions disappear. After all, once a user has logged into the web interface, it is not necessary to verify whether they are sending an email from the domain they are logged into. The goal is to achieve a state where you can send email through our web interface at all times.

Customer support

During the evening, our customer support was overloaded as they received several thousand emails and chats with requests related to the above problem.

If an operator cannot accept a chat because they have more open chats than they can handle, the conversation “drops” into requests after a certain period of time (similar to contact form messages). As soon as an operator has “free hands”, they go through the requests and gradually answer them by email.

Our colleagues dealt with the requests one by one, and everything was handled during the night. In the morning we answered all questions as they came in.

Conclusion

We apologize for the complications. We have to admit that this was an unpleasant mistake, and we were initially worried that the impact would be much worse than it turned out to be. We have learned from this experience and will make adjustments to prevent it from happening again.