Murphy’s Laws work, or: everything that doesn’t kill us makes us stronger

Ten days ago, we had a major incident: power problems that disrupted the services we provide to our customers.

I am writing this article as the founder, chairman of the board and majority owner of WEDOS Internet, a.s., i.e. as Josef Grill. This is not an official statement, because we already published that a week ago, but rather my personal reaction to the whole situation we had to deal with a few days ago.

Apology to all customers

First of all, let me say once again how very sorry I am to all our clients.

I am fully aware of the gravity of the situation and of the problem that arose, and I will take the liberty of explaining a few things.

WEDOS has been on the market for less than 4 years, and in more than 2.5 years this is practically the first time we have had a problem like this, that is, one affecting all services simultaneously. I myself have over 17 years of experience with hosting and over 13 years of running my own datacenter, and I have never experienced a similar situation. I hope I never will again: on the one hand, statistically it is almost impossible that I will live to see it, and on the other hand, we will learn from the whole situation so that it does not happen again.

Apologies

We will set up a free week for all clients for all services (except domains, where this is not possible). We will do it during June, because our system does not currently support this and we have to reprogram it first. So please be patient; there is no need to ask us for anything.

Clients who have special availability arrangements or services with increased availability should contact us; we will deal with those cases on an individual basis.

Do you want to know how much we value our clients? For the record, providing a free week means about CZK 1.5 million less revenue for us. We took this step because we value our clients. Under the contractual conditions we would only have owed significantly lower compensation, which would have meant almost nothing for us economically (tens of thousands of crowns). Because we value our clients, we gave ourselves a very strict (and expensive) punishment.

In addition, last week we ordered a new UPS (for more than 500 thousand crowns), which arrived from France on Saturday and is already in operation, replacing the damaged older UPS. See the photo gallery below.

I’m very glad you trust us. Thank you.

Although it was claimed in many discussions that customers are moving away from us, those claims came mostly from anonymous commenters, that is, critics who do not host with us. To date, I know of only a handful of customers who have moved their websites away from us. On the contrary, you have expressed strong support for us in the discussions, and those who supported us almost always signed their names and listed their websites. We are very grateful for this support and appreciate it very much.

So we can only speculate about who wrote the “anti-WEDOS” posts… But the charts show that customers trust us, and that is what matters.

My absence, or Murphy’s Laws at work

I have been in the hosting business for 17 years (in May it was exactly 17). I always had my laptop with me, and I always arranged my travel so that I would have internet access; even on the road I worked and handled everything. For the first time in 17 years, I went on a trip abroad without a laptop and with limited internet access. And of course, “it” happened…

During Friday afternoon and Saturday morning, I handled a total of over 450 phone calls, communicating with colleagues, customers and suppliers to resolve the matter as quickly as possible.

What the main problem was, or why we didn’t communicate

Technically, we have already described the problems in another article. Here, however, I would like to explain why our communication was so limited.

Everyone who knows us well knows that we don’t hide anything, so we certainly didn’t hide this situation or any of its causes. Unfortunately, so many coincidences piled up that the whole situation became very complicated and our communication was not optimal.

Problem number 1 – technical

The biggest communication problem was caused by the fact that after Friday’s outage both firewalls for our offices were damaged, and all our employees were “cut off” from the internet for several hours until we obtained another (third) replacement firewall, which had to be shipped to us. This created a relatively long delay in communication.

Problem number 2 – security settings

All administration systems are accessible only from our offices; there is no access from anywhere else, and even from the offices only via special cables and special firewalls. Because both firewalls were damaged, we were unable to communicate with clients in any other way or to fix some of the broken services (several servers were involved).

Unfortunately, our security measures are so stringent that the situation could not be handled in any other way.

Problem number 3 – organizational and human

You may argue that we could have posted more on social media. Yes, you are right, but we had no internet access in our offices, and for security reasons everything is restricted to our network and our PCs. It should be noted that only 2-3 people have access to the company’s social media accounts; one of them was out of the country and another was stuck in the office without a connection.

Once the replacement firewall was commissioned, everything was fine again. Colleagues were just a bit afraid to respond to some posts, because at first they did not know exactly what the problem was and did not want to post inaccurate information. At the same time, we put maximum effort into service recovery and into communication with our customers, who sent us thousands of email inquiries and chat messages within a few hours. The whole thing was handled at full force: virtually all colleagues were at work on Friday afternoon and night and again on Saturday during the day.

What doesn’t kill us makes us stronger…

We will learn from the whole problem. We will adjust our crisis plan and make many, many changes in our company.

What has already changed?

On Saturday, we installed a new UPS to replace the damaged one. The new UPS not only has a longer backup time (15 minutes at full load), but also better features and better protection against fluctuations in the external power grid. It is also more economical, which saves on electricity bills. It arrived from France in record time.

We have ordered a device that will allow us to connect “out” even if our LAN in the offices is not working.

We have modified our scenarios for dealing with crisis situations.

More colleagues will have access to social media accounts.

What else will change?

We’re going to change some things about organizational and security settings.

We have already adjusted some things about communication and we are still tweaking it. Organizationally it should not be a problem: there are over 20 of us in the company, so we will share the work.

We probably have to relax some security rules and adjust some settings so that some things can be handled from somewhere other than our offices and our LAN.

We will get hands-on training for crisis situations.

We will test the motor-generator under load. Until now we regularly tested whether it starts, whether it runs for about 10 minutes, whether it has diesel, etc., but this time it malfunctioned after about 15 minutes of running under full load. A sketch of the kind of test we have in mind follows.
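
For illustration only: the rated load, tolerance and the telemetry hook read_generator_output_kw below are assumptions for the sketch, not our real tooling or figures.

```python
# Sketch of a sustained full-load generator test (assumed numbers,
# hypothetical telemetry hook; not our actual tooling).
import time

RATED_LOAD_KW = 200        # assumed full rated load of the generator
TEST_DURATION_S = 60 * 60  # keep it under load for a full hour, not 10 minutes
TOLERANCE = 0.95           # allow small dips below rated output

def read_generator_output_kw() -> float:
    """Hypothetical hook into the generator's telemetry."""
    raise NotImplementedError

def full_load_test() -> bool:
    start = time.monotonic()
    while time.monotonic() - start < TEST_DURATION_S:
        output = read_generator_output_kw()
        if output < RATED_LOAD_KW * TOLERANCE:
            elapsed = time.monotonic() - start
            print(f"FAIL after {elapsed:.0f} s: output dropped to {output:.0f} kW")
            return False
        time.sleep(10)  # sample every 10 seconds
    print("PASS: generator sustained full load for the whole test window")
    return True
```

The point of the change is the duration: a generator that starts fine and runs for 10 minutes can still fail at minute 15 under real load, which is exactly what happened to us.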

We need to come up with a technical solution for similar situations, so that information about what is happening can be published quickly and easily. So far this has only been possible in the administration, in the chat and on the front page, but that very system was impossible to access. We need to expand it, simplify access to it, and at the same time come up with an alternative website where similar crisis situations will be published. A sketch of the idea follows.
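
The notice could be pushed to a page hosted completely outside our own network, so that it stays reachable even when our infrastructure is down. The endpoint and token below are hypothetical placeholders, not an existing WEDOS service.

```python
# Sketch: publish an incident notice to an externally hosted status page.
# STATUS_ENDPOINT and API_TOKEN are hypothetical placeholders.
import json
import time
import urllib.request

STATUS_ENDPOINT = "https://status.example-external-host.com/api/update"
API_TOKEN = "..."  # kept somewhere reachable without the office LAN

def publish_status(message: str) -> None:
    payload = json.dumps({
        "message": message,
        "timestamp": int(time.time()),
    }).encode("utf-8")
    req = urllib.request.Request(
        STATUS_ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("status page replied:", resp.status)

publish_status("Power incident in the datacenter; restoration in progress.")
```

The key design choice is that neither the hosting of the page nor the credentials depend on our LAN, our firewalls or our datacenter.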

We will add monitoring of each customer’s individual services to the administration, as sketched below.
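
A minimal sketch of what such per-service checks could look like; the hostnames and ports are placeholders, not a real customer’s services.

```python
# Minimal per-service reachability checks (placeholder hosts/ports).
import socket

def check_service(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# One customer's services as they might appear in the administration:
services = {
    "web (HTTP)":  ("www.example.cz", 80),
    "mail (SMTP)": ("mail.example.cz", 25),
    "mail (IMAP)": ("mail.example.cz", 143),
}

for name, (host, port) in services.items():
    state = "OK" if check_service(host, port) else "DOWN"
    print(f"{name:12s} {state}")
```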

We will prepare a system that will send an email to all customers in a similar situation, informing them that something is wrong. We already have such a system, but it is designed primarily for planned outages, not for a problem like this one. In addition, there is the complication that you need to send, for example, over 100,000 emails in an extremely short time without being blocked by the various freemail providers that clients use; a sketch of the throttling idea follows.
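
A naive burst of 100,000 messages gets rate-limited or greylisted by the big freemail providers, so the core of such a system is grouping and throttling. Below is a sketch with invented limits and the actual SMTP call abstracted away:

```python
# Sketch of per-domain throttled bulk sending. The batch size and pause
# are invented placeholders, not real provider policies.
import time
from collections import defaultdict
from itertools import islice

PER_DOMAIN_BATCH = 50      # assumed safe batch size per provider
PAUSE_BETWEEN_BATCHES = 5  # seconds between batches; placeholder value

def group_by_domain(addresses):
    """Group recipient addresses by their mail domain."""
    groups = defaultdict(list)
    for addr in addresses:
        groups[addr.rsplit("@", 1)[-1].lower()].append(addr)
    return groups

def send_incident_mail(addresses, send_one):
    """send_one(addr) stands in for the real SMTP delivery call."""
    for domain, addrs in group_by_domain(addresses).items():
        it = iter(addrs)
        while batch := list(islice(it, PER_DOMAIN_BATCH)):
            for addr in batch:
                send_one(addr)
            time.sleep(PAUSE_BETWEEN_BATCHES)  # stay under rate limits
```

A real sender would also interleave the domains in parallel, so that pausing for one provider does not delay delivery to all the others.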

We will prepare a few more modifications. We will keep you informed.

What are we up to?

We have been preparing a project for Datacentre WEDOS 2 for some time, which we will accelerate. We want to have two independent buildings, two independent technologies and thus full redundancy in case of similar accidents.

In the second datacenter, for example, we want to have an extra motor-generator on the second power branch at full 100% power, which the current solution does not have and which is not common in other datacenters either. We would like to build the new datacentre from the ground up to meet TIER IV certification, the highest level of “reliability”.

Yes, it won’t happen right away, but it won’t take many, many years either. I now know that our decision to build a second datacentre is the right one.

Reaction to some comments or response to criticism

Communication problems

As I wrote above, the communication problems were not intentional, but a consequence of the whole problem. We don’t really keep anything secret here, especially not such a major event.

Coincidence of many coincidences

It was not one thing that went wrong. Several things failed, simultaneously or in sequence, as a result of a problem in the power grid. Everything was related to the electricity supply, which is backed up and protected by surge protectors and current protectors, and yet the outage still occurred.

Wiring

We have two full power supply branches, each with a separate UPS, and a motor-generator on one branch. Each branch leads to the servers completely separately: one branch to one power supply and the other branch to the second power supply in the same server. The switches also have dual power supplies, which is not a common solution. Each power supply branch is fully featured and has sufficient power and capacity to deliver 100% performance on its own.

The probability of failure of any individual item is once every few years. Because everything is redundant with us, the probability is reduced further, so the “chance” that something goes wrong is really minimal. Unfortunately for us, a combination of factors came together and caused a real blackout. The likelihood of several things failing at the same time is minimal, but it happened. We have to learn our lesson.
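
To illustrate why we considered a simultaneous failure almost impossible, here is a back-of-the-envelope calculation with assumed numbers (illustration only, not our measured failure rates): if each of two independent branches fails on average once every 3 years and a failure takes about 4 hours to repair, the rate of both being down at once is roughly

```latex
% Illustrative figures only, not measured failure rates.
\lambda_{\text{both}} \approx \lambda_A \cdot (\lambda_B \cdot r)
  = \frac{1}{3\,\text{yr}} \cdot \left( \frac{1}{3\,\text{yr}} \cdot \frac{4\,\text{h}}{8760\,\text{h/yr}} \right)
  \approx 5 \times 10^{-5}\,\text{yr}^{-1},
```

that is, once in roughly 20,000 years. The catch, and the lesson of this incident, is the word “independent”: a disturbance in the external grid is a common cause that hits both branches at the same moment, and no amount of in-house redundancy changes that.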

Everyone can see for themselves how our wiring is done.

We didn’t cut any corners

We didn’t cut any corners. We didn’t skimp on anything when we built our datacenter, not even on the wiring. We even have surge protectors and current protectors everywhere.

DNS servers

It was also claimed that we had the wrong DNS servers and that a subsequent problem arose because of that. It is not true. We have a very robust DNS solution for your domains: 4 DNS servers, each under a different domain extension (.cz, .eu, .net and .com), each on a different physical server and IP address range, in 3 different locations, in 3 different countries, with 3 different providers, and on 2 different software solutions. There really was no problem there. You can check this kind of diversity yourself, as sketched below.
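
The sketch below uses the third-party dnspython package (pip install dnspython) to list a domain’s nameservers; example.cz is a placeholder, substitute any domain.

```python
# List a domain's NS records and their addresses (placeholder domain).
# Requires the third-party dnspython package: pip install dnspython
import dns.resolver

def show_nameservers(domain: str) -> None:
    for rr in sorted(dns.resolver.resolve(domain, "NS"), key=str):
        ns_name = str(rr.target)
        addrs = [str(a) for a in dns.resolver.resolve(ns_name, "A")]
        # With nameservers spread across .cz/.eu/.net/.com, different
        # networks, providers and software, no single failure (a TLD
        # outage, one provider, one IP range, one software bug) can
        # silence all of them at once.
        print(f"{ns_name:30s} {', '.join(addrs)}")

show_nameservers("example.cz")
```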

Data loss or repair speed

There was a bit of a problem with one of our several mailservers. We had to decide whether to bring it back quickly but lose some data, or to handle the situation more slowly and without data loss. We opted for the slower solution and nobody lost any data. Yes, that one mailserver was not restored until Saturday morning, i.e. about 16 hours of trouble, but nobody lost anything, and that is the bottom line.

Telephones and not receiving calls

Our phones did not receive calls because everything runs over the internet, and when our LAN went down, so did our VoIP phones.

Teamwork

The whole problem got solved because we are not a one-man show, and my absence was not a major problem; perhaps only the communication suffered. There are more than 20 of us in the company and everyone has precisely defined tasks.

I would also like to thank the contractors who helped us to repair the damaged equipment and get all the systems up and running in record time.

In closing, I apologize again

Please accept our apologies once again and trust that we have learned and will learn from this unpleasant problem and will take many measures to prevent a similar situation from happening again.

We are aware that over 11% of the Czech internet is hosted with us, so when there are problems the impact is really big: roughly every ninth website in the Czech Republic becomes unreachable. That is why we must approach everything as responsibly as possible.

Perhaps for the last time, let me apologize once more; now I will focus on our internal adjustments and improvements and on the new services we are preparing for our clients.

Gallery