How our website came under attack, or: someone doesn't like us again

Our development is moving in the right direction. It’s easy to tell: someone is really going after us again.

Dear customers, we would like to explain why our website and customer administration were unavailable on 2 February 2018 in the afternoon and at night. We have always been honest with you and we intend to stay that way, because that is the only way you can have full confidence in our services.

Attack at 17:54

Around 6 pm, a very strong DDoS attack hit our hosting.wedos.com website. At the same time, smaller, harder-to-detect attacks targeted some other servers, especially the new web hosts that we have been setting up on HPE Moonshot since the end of November.

The main attack used a mix of ICMP and UDP packets, which on its own would not be a problem for us, as we routinely filter them in large volumes. At the same time, however, there was a massive flood of TCP SYN packets meant to overwhelm our web servers. This overloaded the IPS/IDS protection, which is also used by the NoLimit web hosts set up before November last year. The IPS/IDS protection became “clogged” with requests and could not keep up with processing packets and connections. The DDoS protection mechanism was supposed to step in automatically, relieve the IDS/IPS protection and bring the situation back to normal. However, this did not happen. Unfortunately, the fail-safe mechanism that should have taken the IDS/IPS protection out of the path for the attacked network did not work either. After a few minutes a technician did this by disconnecting the cable. From that moment the attacked traffic no longer passed through the IDS/IPS protection but went around it via the backup routes (which is why disconnecting the link, whether in configuration or by hand, was enough).
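
The fail-safe described above is essentially a fail-open rule: when the inspection device can no longer keep up, traffic is steered around it instead of being held up. A minimal sketch of such a decision could look like the following. The class name, the counters and both thresholds are hypothetical and chosen only for illustration; they do not describe the actual IDS/IPS configuration.

```python
from dataclasses import dataclass


@dataclass
class InspectionStats:
    """Hypothetical counters sampled from an inline IDS/IPS device."""
    queue_depth: int          # packets currently waiting for inspection
    queue_capacity: int       # maximum number of packets the queue can hold
    dropped_last_second: int  # packets dropped because the queue was full


def should_bypass(stats: InspectionStats,
                  max_fill: float = 0.9,
                  max_drops: int = 1000) -> bool:
    """Return True when traffic should be routed around the device (fail open)."""
    fill = stats.queue_depth / stats.queue_capacity
    return fill >= max_fill or stats.dropped_last_second >= max_drops


# Illustrative values only (assumptions, not measurements from the incident).
healthy = InspectionStats(queue_depth=1_000, queue_capacity=100_000,
                          dropped_last_second=0)
clogged = InspectionStats(queue_depth=100_000, queue_capacity=100_000,
                          dropped_last_second=50_000)

print(should_bypass(healthy), should_bypass(clogged))
```

In this sketch the bypass is triggered either by a nearly full inspection queue or by a high drop rate, so one failing signal does not prevent the other from opening the bypass.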

The attack was primarily directed at all of our WEDOS.xxy domains. Our main sites run on separate servers, which were overloaded. The individual domains (the individual TLDs) are on different parts of the network and on different servers, and these also saw outages or slowdowns. Other services such as VPS ON, VPS, VPS SSD and dedicated servers were not affected by this attack.

Attack at 21:40

Another, much stronger attack came at around 21:40. The scenario was the same, but this time we were prepared. A technician manually bypassed the IPS/IDS protection, so the shared web hosts set up before November last year experienced only an outage or slowdown of a few seconds.

This second attack was about twice as strong as the one in the early evening. About 2 million connection requests per second were arriving at our main site in the form of TCP SYN packets, i.e. packets that merely ask to establish a connection and then wait for a response. The attack also included several million more packets per second, mostly UDP, intended to saturate our links to the servers. Since we have redesigned the backbone to X times 100 Gbps, this was not a threat, and the links towards the attacked servers were clogged for only about 1 second. Then the DDoS protection kicked in and we filtered these UDP packets before they entered our network. The problem was the TCP SYN packets.
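
To put the 2 million connection requests per second in perspective, a back-of-the-envelope calculation shows how little bandwidth they need. The sketch below assumes that every SYN arrives as a minimum-size 64-byte Ethernet frame plus 20 bytes of preamble and inter-frame gap. The frame size and the function name are assumptions for illustration, not values measured during the attack.

```python
# Assumption: each SYN is a minimum-size Ethernet frame (64 bytes) and
# occupies 20 more bytes on the wire (preamble + inter-frame gap).
FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20


def syn_flood_gbps(packets_per_second: float) -> float:
    """Approximate link bandwidth consumed by a SYN flood, in Gbps."""
    bits_per_packet = (FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
    return packets_per_second * bits_per_packet / 1e9


# The article mentions about 2 million connection requests per second.
print(f"{syn_flood_gbps(2_000_000):.2f} Gbps")
```

Under these assumptions the SYN packets add up to roughly 1.3 Gbps, a small fraction of a single 100 Gbps link. This fits the observation that the difficulty came from the sheer number of connection requests rather than from raw bandwidth.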

When the attackers found out that our other services were up and running, they targeted the servers running the new HPE Moonshot web hosts. These servers are built for heavy load in both hardware and software (a single HPE Moonshot with OpenNebula handles tens of millions of files). Unfortunately, the rapid stream of requests produced too many log entries. There is plenty of space for such cases and the servers had a huge reserve, but a protection mechanism in Docker (or a related component) preemptively shut everything down, and some servers kept shutting down repeatedly with a message that 87% of the space on some disks had been exceeded. Before our technicians and developers resolved this by adjusting the mechanism, the storage rebooted and then had to be synchronized between the servers in the cloud. This showed up as unavailability and a subsequent slowdown of some new web hosts. We have been working on the new web hosts and storage very intensively for the last month, and we were delighted that they had been running for 11 days without a single hiccup and, above all, extremely fast. We will write a separate article about it.

This problem has already been solved. The developers are also preparing several improvements to prevent such situations.

We had TCP SYN filtering

We already had TCP SYN filtering in the original version of the network, so why did it not work this time? After the switch to X times 100 Gbps, the protection against TCP SYN floods stopped working properly, because we completely changed the routing logic. Everything is now backed up fully automatically (every router and filter in the network has “its brother” as an online backup, which both prevents outages and allows traffic balancing), and routing is no longer symmetrical. Packets entering our network “flow” along a different route than packets leaving it. In that situation TCP SYN packet filtering does not work, because the filter has to see both directions of a connection.

During the night from Friday to Saturday we adjusted the routing so that we could filter these packets. We do this selectively: we pick out the attacked traffic and send it along a different route than the rest, so that the outgoing packets take the same path as the incoming ones, which makes filtering possible.
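
Why the return path matters can be shown with a toy model of a stateful filter that only trusts a connection once it has observed the complete three-way handshake. This is a simplified illustration, not our actual filter. The class, the function and the packet representation are hypothetical.

```python
from enum import Enum


class State(Enum):
    SYN_SEEN = 1
    SYNACK_SEEN = 2
    ESTABLISHED = 3


class HandshakeTracker:
    """Toy stateful filter tracking TCP handshakes per (client, server) pair."""

    def __init__(self):
        self.flows = {}

    def observe(self, src, dst, flags):
        if flags == "SYN":
            self.flows[(src, dst)] = State.SYN_SEEN
        elif flags == "SYN-ACK" and self.flows.get((dst, src)) is State.SYN_SEEN:
            # The reply travels server -> client, so the flow key is reversed.
            self.flows[(dst, src)] = State.SYNACK_SEEN
        elif flags == "ACK" and self.flows.get((src, dst)) is State.SYNACK_SEEN:
            self.flows[(src, dst)] = State.ESTABLISHED

    def established(self, client, server):
        return self.flows.get((client, server)) is State.ESTABLISHED


def run(handshake, filter_sees_outbound):
    """Replay a handshake; outbound packets reach the filter only if routed via it."""
    tracker = HandshakeTracker()
    for src, dst, flags, direction in handshake:
        if direction == "in" or filter_sees_outbound:
            tracker.observe(src, dst, flags)
    return tracker.established("client", "server")


HANDSHAKE = [
    ("client", "server", "SYN", "in"),
    ("server", "client", "SYN-ACK", "out"),
    ("client", "server", "ACK", "in"),
]

print(run(HANDSHAKE, filter_sees_outbound=True))   # symmetric routing
print(run(HANDSHAKE, filter_sees_outbound=False))  # asymmetric routing
```

With symmetric routing the model recognises the legitimate connection. With asymmetric routing it never sees the SYN-ACK, so a genuine client looks the same as a flood of unanswered SYN packets.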

New filter in action

Our priority was to keep the hosting services running, which is why our own website was down for so long. Your services always come first. One part of the team was dealing with several problematic servers (about 7 out of the roughly 1,500 we have), another was working on resolving the situation as quickly as possible, and a third was working on modifications to prevent such problems in future. In addition, we posted a message on social media and my colleague responded there. We have many years of experience with similar situations, so setting up teams and coordinating their activities in a crisis is essentially routine. Everyone knows whom to inform, whom to get further instructions from, and whom to tell what they have already arranged.

Of course, part of our team was constantly busy repelling the attack. We also decided to put a new f80 filter in front of our site. This is a new filter, not yet tested in live operation, which our developers and engineers have been preparing for VPS SSD, VPS ON and the upcoming WEDOS Cloud. Its deployment had been delayed by the network upgrade to X times 100 Gbps. The new f80 filter allows traffic to be filtered by country of origin. If your service is meant for a limited audience, access from anywhere else is just unnecessary load. It also serves as an emergency solution for situations like the one we were in.

On the new f80 filter we allowed access only from the Czech Republic, Slovakia and Poland, and connected it to the network. It immediately started filtering traffic and our website was running at full speed. The server load dropped to normal straight away. No slow loading; everything ran as if nothing had happened. Yet moments earlier the server had been (literally) under fire from millions of requests per second. The new f80 filter has been designed from the start for huge loads, but until you test it in real operation you don’t know how effective it will be or how much it can handle in the long run. Trying it out in “combat” conditions on our own site, where we basically risked nothing, was ideal. It passed the test with flying colours 🙂

New filtering options for our clients

We want to offer this new filtering method to our clients for all services (including web hosting). Clients will be able to choose the countries from which they want to allow access to their server (the service will primarily be for VPS, VPS ON and WEDOS Cloud). For example, if you know your server’s visitors come from the Czech Republic and Slovakia, you allow only these two countries. Or if you only want visitors from Europe, you allow access only from Europe. Of course, it will be possible to make exceptions in the settings. Everything can be set up in the administration.
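
The idea of a country allowlist with exceptions can be sketched in a few lines. This is only an illustration of the principle, not the f80 filter itself. The prefix-to-country table is hypothetical and uses reserved documentation address ranges; a real filter would rely on a proper geolocation database. The function names are made up as well.

```python
import ipaddress

# Hypothetical prefix-to-country table built from documentation ranges.
PREFIX_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): "CZ",
    ipaddress.ip_network("198.51.100.0/24"): "SK",
    ipaddress.ip_network("203.0.113.0/24"): "US",
}


def country_of(address):
    """Return the country code for an address, or None if it is unknown."""
    ip = ipaddress.ip_address(address)
    for network, country in PREFIX_COUNTRY.items():
        if ip in network:
            return country
    return None


def is_allowed(address, allowed_countries, exceptions=()):
    """Allow traffic from listed countries or from explicitly excepted networks."""
    ip = ipaddress.ip_address(address)
    if any(ip in ipaddress.ip_network(net) for net in exceptions):
        return True
    return country_of(address) in allowed_countries


ALLOWED = {"CZ", "SK"}
print(is_allowed("192.0.2.10", ALLOWED))    # address mapped to CZ
print(is_allowed("203.0.113.5", ALLOWED))   # address mapped to US
print(is_allowed("203.0.113.5", ALLOWED, exceptions=["203.0.113.0/28"]))
```

In this sketch exceptions are checked first, so a single partner network can be let through even when its country is blocked. Addresses with an unknown country are rejected.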

We now know it works and we will not be afraid to offer this service to our customers.

We still want to complete the transition of the network to X times 100 Gbps, where essentially all that remains is to launch IPv6 for the new services (new web hosts, VPS ON and WEDOS Cloud). We are working on that.

Conclusion

More happened on Friday, but that would be a script for a feature film :). When we open our second datacentre and hold open days, we will be happy to tell you about it in detail. We have nothing to be ashamed of. We knew what was happening and how to resolve it with minimal impact on our customers’ services. Attacks like these could have lasted for hours and been repeated every day. In fact, they were repeated the next day, but you didn’t notice anything, because we were prepared for them and this time our DDoS protection reacted flawlessly.

The attacks were mainly directed at us: our main website, the administration and also domains that we don’t use much, such as wedos.cz (which is just a redirect). Someone must not like us. Maybe we should run an emergency discount on our new services for at least a week 🙂