Report on the technical problems in our datacenter on the afternoon of 23 May and the morning of 24 May 2014.
First of all, we would like to apologize to all our customers for the service outage that occurred on Friday afternoon, 23 May, and Saturday morning, 24 May, when technical problems resulted in a power interruption to our servers.
We therefore take full responsibility and will not make excuses about bad weather, force majeure, or coincidence. The blame is entirely on our side, and we should have anticipated that this situation could happen.
We also apologize for the first report on Friday, which misinformed you that we had everything under control and the problem was solved. There was no malicious intent; we did not want to lie to anyone. We managed to get the first three-minute outage under control, and by the time we published the report, more than 95% of all our services were up and running. We brought the remaining non-functional services back online later. We are sorry about that.
Everything our company went through this Friday and Saturday is a huge lesson. We will now keep our backup resources under much closer scrutiny: we will immediately increase the testing time for our backup power supplies, we are ordering a new UPS on Monday, and other measures needed to better secure our services will follow. As always, we will keep you informed.
Once again, we sincerely apologize. We will now take the liberty of describing the entire two days in detail so that you can have an accurate picture of what happened here. An improbable series of coincidences led to the failure of several different systems, even though these are new devices in a redundant design.
The whole situation arose from a confluence of several related problems. The first cause was the heat that gave rise to the storms, which caused significant fluctuations in the electricity grid. Those fluctuations damaged one of the air conditioners as the air conditioning started up, and the main circuit breaker of the entire datacenter tripped after overheating due to a power surge. The diesel generator started automatically, but after 15 minutes it stopped working due to a failure of the generator engine's cooling system (the last check was on Thursday, the day before; we do this regularly every week, and the generator showed no signs of a fault), so the UPS took over and kept the servers running for another 33 minutes. At that point we were already in contact with the electricity supplier, who was sending a repairman to us. The backup air conditioning was running the whole time. Before we could replace the damaged circuit breaker with an emergency workaround, the batteries were drained, causing a complete power outage lasting 3 minutes. After that, almost everything started working again, but as we mentioned above, we announced on our website that everything was under control; this was misinformation, because a small part of our services was not yet functional.
After about an hour, the power grid was still fluctuating and the UPS on one power branch was damaged, which short-circuited our power distribution again. It was necessary to replace the emergency main circuit breaker workaround with a new, fully functional main circuit breaker, which we had obtained in the meantime. Unfortunately, the UPS on one power branch was so badly damaged that it could not be used, and the batteries of the other UPS were so depleted from the previous outage that everything went down a second time, this time for 13 minutes. After that, the main circuit breaker was replaced and the electricity supply was restored.
Most of the servers were up and running immediately after this second outage. Only a few percent of the servers (unfortunately, that is a few thousand customers) had problems for a longer period of time, and we managed to restore one mail server only on Saturday morning.
Since Friday afternoon, the entire company has been working intensively to eliminate the consequences of the outage. Unfortunately, the situation was all the more complicated because the outage also damaged the primary and backup firewalls in our offices, which meant that we could not access the internet from our offices, nor the servers that, for security reasons, may only be accessed from computers in our offices. This made restoring the remaining non-functional services (a handful of servers) very slow and complicated. We had to set up a replacement (third) firewall, and after it was put into operation we gradually restored all services.
Because the UPS on one power branch was damaged, the servers were powered through only one branch, which is very risky. We therefore agreed with the UPS supplier that, once the batteries of the working UPS were recharged, we would switch the damaged UPS to bypass mode and thus supply the servers with power from the second branch as well. This operation was scheduled for 8 a.m. and was supposed to be a routine procedure without any disruption.
This morning, a technician came to repair the UPS (it had been switched to bypass). Everything seemed to be resolved, and the repair looked fine according to all available measurements, but it did not go well; more precisely, it did not reveal the defective component. When the UPS was switched over, a short circuit occurred and the power supply was interrupted. Unfortunately, the batteries that had served as the backup power supply on Friday evening were not yet fully charged, so they powered the servers for only 20 minutes. Before the technician could reinstall the new main breaker, a second outage occurred, lasting a full 15 minutes. After this outage, our technicians restored all services in a very short time.
We spent the rest of the day figuring out how to avoid such a situation once and for all. The generator was repaired on Saturday and we believe that such a confluence of so many unlikely events will not happen again. And even if it does, we are better prepared for it. We’re also ordering a brand new UPS on Monday. We will keep you informed about further improvements to prevent similar outages.
Believe us, no one here was idle, and we took the situation very seriously. Colleagues who were off shift also came in to work and tried to help. As soon as it was possible to take phone calls, 10 administrators were available to patiently answer your questions and explain the situation.
We are aware that when we host tens of thousands of clients, everything has to work 100% of the time, and that even the protection mechanisms that deal with power outages, attacks, and similar events must themselves have backups and further fallback options, so that clients feel nothing but satisfaction with our services.
We will automatically compensate all affected customers with free services (to be resolved within the next month). Customers who are entitled to higher compensation under the terms of the contract will be dealt with individually.
We will learn from the above issues, and we once again apologize to all our clients. We would like to thank our employees, suppliers, and partner companies for helping to resolve the situation. We believe it is very unlikely that such a coincidence will happen again, in which not only the primary but also the backup power sources were damaged, causing major complications.
Further information:
Currently, customers whose VPS stopped at a selection screen before the OS boots (GRUB, a crash repair tool, ...) may still have a problem. The solution is to use the KVM console; in the worst case, we will perform a free restore from backup. Additionally, some tables in the web hosting databases may be corrupted, and you will need to contact us so we can help repair them. We estimate that these problems affect several dozen customers, but unfortunately it is not in our power to find out exactly who is affected, so the customers concerned need to contact us themselves.
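If you would like a quick way to check whether your web hosting database was affected, a script along the lines of the sketch below may help. It is only an illustration, assuming a MySQL-based database and the pymysql client; the connection details and database name are placeholders that you would replace with your own. If any table reports anything other than "OK", please contact us and we will repair it together with you rather than attempting a repair yourself.

import pymysql

# Illustrative sketch only: connection details and database name are placeholders.
conn = pymysql.connect(host="localhost", user="db_user",
                       password="db_password", database="my_database")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            # CHECK TABLE reports the table status; anything other than "OK"
            # in the last column may indicate corruption.
            cur.execute(f"CHECK TABLE `{table}`")
            for tbl, op, msg_type, msg_text in cur.fetchall():
                print(tbl, op, msg_type, msg_text)
finally:
    conn.close()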
In conclusion, we apologize once again to all customers for the problems and for the limited possibility of communication at the time of the outage. We have already arranged alternative means of communication in the event of a similar crisis situation. From Friday afternoon until 11 o'clock today, we were fully dedicated to resolving the whole situation; all damaged devices are now fully functional (except for one UPS, which will be replaced by a new one).