Last Friday afternoon we struggled with connectivity issues on our VPS platform. We have prided ourselves on openness since we started, so let us take stock and explain what actually happened.
What happened
On Friday, 12 August 2016, from 13:39 to 16:20, availability of our VPS was degraded. During this time the virtual servers had a packet loss rate of roughly 50% (sometimes more, sometimes less); the servers themselves worked without any problems.
The cause was an entirely new type of attack on our infrastructure. Over the following 72 hours we made several adjustments to the network, some of which had to be applied immediately, and this caused several other minor outages, especially for virtual servers using IPv6.
The situation also affected the availability of our website and customer support to some extent. Our website was sometimes slower and at times unresponsive. Customer support was overloaded and unable to respond to all client requests, and the chat (our main communication tool) was sometimes down because it was also under attack. So we tried to at least briefly report on the issues on social media.
The web hosts were completely free of outages. The dedicated servers were affected for about a minute before we redirected them to another router.
Approximately 5,900 services were affected. Unfortunately, these were virtual servers, which often host demanding projects. The other services – 80,000 web hosts (130,000 websites in total), 11,000 WEDOS disks and 246,000 domains – worked without problems.
Introductory apology
First of all, we apologize to our customers for the complications that occurred. We are very sorry for the situation described above, but we took a lot of experience away from it. It has certainly moved us forward and will enable us to offer better services in the future.
Why did it take so long to explain?
The reason is that we have been analysing the whole situation for some time to piece together exactly what happened, and at the same time we have had to take several countermeasures to ensure that a similar situation does not happen again. We could not publish some of the information until the situation had been accurately verified and mapped out. Likewise, we could not release the details before being prepared for a similar situation in the future – otherwise we would just be handing out instructions for causing it again.
DDoS protection and its development at WEDOS
About 2 years ago, we started dealing with a larger wave of DDoS attacks. First we handled the situation manually, then we tested several different solutions (including “boxes” worth tens of millions). In the end we decided to go with a solution built mostly on open source. We can control it, we know what it does and we can extend it. Unfortunately, some threats appear for the first time, like the one on Friday, and we first need to “teach” our protection system so that next time it handles everything by itself. For that we need to analyze the situation thoroughly and make some adjustments and settings.
After several months of development and testing, we deployed the solution in the autumn of 2014 and have been continuously improving and expanding it. In essence, 11 very powerful servers evaluate all the flows in the network and, based on that, apply various filtering rules or make changes in the network.
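As a rough idea of what such flow evaluation means in practice, here is a deliberately simplified sketch in Python. It is not our production system, and the threshold and data layout are made up for illustration: the protection continuously counts packets per destination over a short window and flags destinations whose rate stands out.

```python
# Simplified sketch of flow-based DDoS detection: count packets per destination
# over a short window and flag destinations whose rate exceeds a threshold.
# Thresholds and structure are illustrative only, not our real configuration.
from collections import Counter

WINDOW_SECONDS = 10
PPS_THRESHOLD = 50_000            # made-up example threshold

def evaluate_window(flows):
    """flows: iterable of (dst_ip, packet_count) records collected in one window."""
    per_destination = Counter()
    for dst_ip, packets in flows:
        per_destination[dst_ip] += packets

    mitigations = []
    for dst_ip, packets in per_destination.items():
        if packets / WINDOW_SECONDS > PPS_THRESHOLD:
            # In a real system this would install a filtering or scrubbing rule.
            mitigations.append(dst_ip)
    return mitigations

# Example: one destination receiving an abnormal packet rate in the window.
sample = [("203.0.113.10", 900_000), ("203.0.113.20", 12_000)]
print(evaluate_window(sample))    # ['203.0.113.10']
```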
In 21 months, we have detected and filtered out an incredible 257,000 attacks on us or our customers. The most common targets of attacks are our website and our routers. Some attacks reached tens of Gbps and millions of packets per second. The system worked and still works great. Coincidentally, some time ago we celebrated a year without (unplanned) outages.
In designing the protection, we based it on situations we had already dealt with and on the forms and methods of attack we were seeing almost daily.
So far, 97% of all attacks have been on IPv4, even though about 13% of our traffic is already IPv6. Only a relatively small proportion of attacks targeted IPv6. The system always picked everything up and sorted it out, and we could work in peace without having to deal with anything.
We have invested several million and thousands of hours of work into DDoS protection, and we are constantly pushing the envelope. Our network hosts the most web hosts and virtual servers in the Czech Republic.
Attackers constantly come up with new threats and new methods of attack. Previously it was “brute force”, where the goal was to clog and overload our infrastructure. We gradually increased the capacity of our Internet uplinks (up to the current 70 Gbps) and learned to filter it all ever better, so brute force is no longer as effective. That is why attackers turn to other methods: intelligent attacks rather than brute force, using various packets that cause problems on the network or on the end machines. These new types of attacks are harder to detect and defend against. When attackers come up with something new, we have to analyse the whole situation and then prepare the system to defend against the new threat in the future. That is exactly what happened on Friday – a new threat, analysis, and now several days of implementing countermeasures.
We don’t sleep and that’s why IDS/IPS
About a year ago we started to prepare IDS/IPS protection for our customers. This is an intelligent form of protection against attacks on customer applications. For example, it can block attackers who overload websites and exhaust free worker processes, try to crack customer passwords or exploit various security vulnerabilities in content management systems. We are able to protect customer sites because filtering rules covering tens of thousands of the most common security threats are applied, online and in real time. So, for example, if a potential attacker sends a request to a page on your website that may be vulnerable, and that request looks like an attack, it is blocked before it ever reaches the server.
This additional method of protection is technically very demanding and very complex to prepare, set up, configure and operate. We also tried ready-made (paid) solutions (including hardware), but after all the tests we again decided on open-source solutions, paying for some parts. During this spring we worked on a large number of modifications to our network in order to offer this protection (in the first phase) to web hosts. You would not believe how much work it is, but also how effective. Only about a week after deployment, we reduced the average data flow into our network by about 200,000 packets per second. It was simply unnecessary traffic that was hitting our servers and generating load, or was a security risk. The system blocked all of it automatically.
Why do we mention it? Because we were a little concerned whether this protection (and all the changes associated with it) had anything to do with the situation on Friday. We got a bit nervous and it was the first thing we turned off. We are now gradually re-enabling the system for web hosts.
IDS/IPS is another huge step forward. For web hosting, the results are very positive. It’s millions more invested in protecting our clients and another huge effort. We have been working on the issue for about a year very intensively.
Symptoms of last week’s situation
It was a completely new form of attack: extremely sophisticated, probably long planned and professionally prepared. It was not a coincidence or a cheap attempt for a few dollars – the attackers even had some of the computing power used for the attacks ready directly on our own (normally paid) services.
The attack was not dangerous in terms of raw force: at the peak we saw about 1.5 million problematic packets per second coming in from the outside. It was not dangerous in terms of data flow either, because the vast majority of it deliberately consisted of so-called TCP SYN requests. These are small packets carrying little data, whose purpose is to overload the target (exhaust all available connection slots so the server cannot accept any more connections). Protection against TCP SYN floods is not entirely simple, but we have had it working and tested for a long time, and under normal circumstances nobody would even notice such an attack. Many tools can generate a relatively strong TCP SYN attack – easy, cheap, effective, and you need almost nothing.
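To make the mechanism concrete, here is a generic sketch (not a description of our actual filter or its thresholds): a SYN flood works by opening many half-finished TCP handshakes, so one simple detection signal is the ratio of SYN packets to completed handshakes per source or destination.

```python
# Generic illustration of why SYN floods are cheap to send and how a simple
# heuristic can spot them: many SYNs, very few completed handshakes (ACKs).
# Purely illustrative numbers and logic, not our actual protection.

def syn_flood_suspect(syn_count: int, ack_count: int, min_syns: int = 1000) -> bool:
    """Flag a source/destination whose handshakes almost never complete."""
    if syn_count < min_syns:
        return False                      # too little traffic to judge
    completion_ratio = ack_count / syn_count
    return completion_ratio < 0.1         # <10% of SYNs ever complete -> suspicious

print(syn_flood_suspect(syn_count=500_000, ack_count=1_200))   # True
print(syn_flood_suspect(syn_count=2_000, ack_count=1_900))     # False
```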
Strong attacks were conducted from inside our network against virtual servers, our website, our routers and other VPSs in our network.
You can only be thrown overboard by those who are on the same boat as you
Yes, really. The biggest problem last week was that we were attacked by our own customers. What? Yes – our customers, or rather several customer VPSs, attacked us, our website and our routers from the inside.
Unfortunately, we did not foresee such a situation in the design of our protection. We had no protection in place to filter attacks between machines inside our own network; we were only considering it and preparing its deployment.
The attackers were faster: they ordered various VPSs from us, paid for them and attacked us from the inside. They attacked us (our website and routers) and other customers' VPSs hosted with us. The attacks were quite strong – many times stronger than those from the outside. At the peak it was about 6 million problematic packets per second, “multiplied” in various ways by the responses from the attacked VPSs and routers circulating around the network.
We don't know whether the attackers simply bought paid services or whether the VPSs of regular customers were also exploited (through some security vulnerabilities); we can't rule out the latter either.
What actually happened?
There was a widespread attack on a few hundred VPSs over IPv4, which was probably a diversionary maneuver for a much stronger attack over IPv6. The IPv6 traffic was far stronger and more aggressive. The attacks were directed at hundreds of thousands of IPv6 addresses that are used by (or allocated to) virtual servers hosted with us. These were not IP addresses actually used by the individual services, but machine-generated IPv6 addresses.
Our website was receiving tens of thousands of requests per second from the outside – and far more from the internal network and from our own virtual servers.
The problem was that the virtual servers were attacking our website and our routers. Our website was receiving connection requests in the order of several hundred thousand per second.
The routers responded to these requests with replies saying that we do not host such an IP address. Since we were receiving about 1.5 million requests per second for non-existent IP addresses (or existing but not actually used ones), the routers were sending responses back and the data flow effectively doubled. Even that alone would not have been a big deal.
Other parts of the attack took place inside our network (on the VPSs), in the form of bulk generation of IPv6 attacks from the ranges allocated to individual VPSs. At the same time we also saw attacks from IPv6 addresses that are not in our ranges at all, with the traffic being generated from VPSs hosted with us.
Our border routers were also receiving quite a lot of traffic from outside with spoofed IPv6 source addresses, which we successfully blocked.
The problem occurred when the routers and the individual DDoS protection nodes began applying various security settings, computing various checks and counters for the IPv6 traffic, and trying to protect the individual IP addresses that were under attack. At that point they were trying to keep counters for the huge number of attacking IP addresses (which is done by prefixes) and, above all, to protect the IPv6 addresses the attack was aimed at. Several hundred thousand of those were under active attack, and overall the system was trying to protect virtually our entire range… At this point technical problems appeared. The routers and the protection were so overloaded that they stopped responding appropriately. The machines ran out of memory, and that is where the problems started.
The result was that a physical interface eventually went down, and at that point the individual devices were unreachable within our backbone network, so the default routing rules took over. Packets then travelled back and forth across our network until their TTL expired. This looping happened within the backbone; it was an effect of the problem, not its cause, and it did not affect functionality.
How did we handle it?
First we had to find out what was going on and how. Because some parts of the DDoS protection (especially the online sensors) were overloaded, we were partly paralyzed and looked for the causes wherever we could. Instead of having information from the sensors in real time, we had no information at all, or only with a long delay. This made the situation very complicated and it was very difficult to find the cause.
We then separated IPv4 traffic from IPv6 traffic onto a different router, redirected the traffic of other services to yet another router, and this stabilized the situation and bought us time to deal with the IPv6 traffic.
On one side we set up various restrictions and filters; on the other we tried to figure out who was attacking. We shut down several attacking VPSs and throttled other problematic ones. Step by step we stabilized the situation. In the meantime we were also dealing with the DDoS protection itself, which lacked the RAM and computing power for such a widespread attack, where it had to maintain protection for hundreds of thousands, millions or even billions of IP addresses that were being attacked – and, on the other side, that were attacking…
Where was the problem on our side?
Generosity
From the beginning we have assigned each VPS an IPv6 /112 prefix, which is 65,536 IPv6 addresses per VPS. That is a huge amount. Currently we have 393,216,000 IPv6 addresses allocated (for VPSs alone). Yes, a full 393 million. Just for comparison – there are at most about 4 billion IPv4 addresses worldwide. So our VPSs alone are allocated roughly 10% of the total number of IPv4 addresses used in the whole world. This is a huge number. Maybe we should have been more careful, but this is a worldwide standard, and some blacklists block entire /112 networks, so smaller ranges would in turn cause complications for other clients.
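For the curious, the arithmetic behind these numbers can be checked in a few lines. This is just a quick sketch; the 6,000-prefix figure is simply what the stated totals imply, not an official count of our VPSs.

```python
# Quick check of the IPv6 allocation figures mentioned above.
prefix_len = 112
addrs_per_vps = 2 ** (128 - prefix_len)                 # 65,536 addresses in a /112
total_allocated = 393_216_000                           # figure quoted in the text
implied_prefix_count = total_allocated // addrs_per_vps # = 6,000 such prefixes
ipv4_space = 2 ** 32                                    # ~4.29 billion IPv4 addresses

print(addrs_per_vps)                                    # 65536
print(implied_prefix_count)                             # 6000
print(f"{total_allocated / ipv4_space:.1%}")            # ~9.2% of the entire IPv4 space
```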
Root VPS
We did not limit what clients could do. Our root VPSs gave clients freedom: they could install whatever they wanted and configure whatever address ranges they wanted on the network interface, so they could build various network setups between VPSs. Because of this, a single VPS could realistically use tens of thousands of different IPv6 addresses, which becomes a problem when multiple clients do it and you then want (or need) to protect them.
The scale and, above all, the breadth of the attack
We did not count on hundreds of thousands of IP addresses being attacked simultaneously. The individual elements of the protection were not dimensioned for this, and we had not encountered such a widespread attack before. However, we have already made the necessary adjustments to prevent a similar situation from happening again.
Fighting attacks from within
As mentioned, we had thought about this but had not yet implemented it. We have now added detailed traffic monitoring to the internal network. Until now we had monitoring at the entrance to our network, before the DDoS protection and after it. Inside the network we only had NetFlow, which is delayed and not suitable for dealing with DDoS attacks. We now have a detailed online overview and are able to assess and resolve such situations better (even automatically).
What was that all about? A little theory
To borrow the description of IPv6 from Wikipedia: “Examples are commonly given to show how absurdly large the IPv6 address space is. It contains a total of 2¹²⁸ (roughly 3.4×10³⁸) addresses, which corresponds to about 5×10²⁸ addresses for each of the 7 billion people alive today, or 2⁵² addresses for every star in the known universe (a million times more addresses per star than IPv4 allowed for our whole planet). It is sometimes compared to the number of atoms in the known universe (10⁷⁸ to 10¹⁰⁰).” This shows how huge the number of IPv6 addresses ready for use is. Compared with roughly 4 billion IPv4 addresses, the difference is enormous.
WEDOS has an allocated IPv6 range of /32, which is a whopping 4,294,967,296 /64 subnets, each of which contains 18,446,744,073,709,551,616 IPv6 addresses. The result is that WEDOS has 7.922816251×10²⁸ IPv6 addresses assigned to it, all of them routed to WEDOS, and an attack could theoretically target any of them. For such a large number of IP addresses, no technology can count packets per address and protect them individually from DDoS attacks.
Compare this to IPv4, where we have 10,240 IPv4 addresses.
Surely you understand there is a difference between protecting 7.922816251×10²⁸ IP addresses and protecting 10,240 of them – and, on the other side, defending against a very different number of potential attackers. With IPv6 there are up to 3.4×10³⁸ potential source addresses, while with IPv4 there are only about 4 billion (2³² ≈ 4×10⁹). These are the differences. Big differences.
When you are dealing with DDoS protection, you have to count packets for each IP address or range you are protecting, and monitor (and classify) what the packets are and how malicious they are. For IPv4 this accounting is easily doable (given the counts above), but for IPv6… You can always aggregate somehow, but with IPv6 you quickly hit huge numbers, and if you aggregate too much, the protection becomes ineffective.
If you want real protection and do not want to rely on plain blackholing (simply dropping all traffic to an address), but instead want to filter packets and deliver only the benign ones, you have to do far more calculations and track more things. You are then dealing not only with destination IP addresses, but also with source ranges, traffic per protocol and other parameters (traffic per port, packet sizes). And that is the crux of the problem: you need computing power and memory to keep all those counters. The question is whether, with today's technology, there is any way to calculate all of this at such a scale.
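To illustrate the state problem, here is a simplified sketch (not our actual implementation): even a minimal per-destination packet counter grows with every distinct target address it sees, so an attack spread across hundreds of thousands of IPv6 destinations inflates memory use unless you aggregate to coarser prefixes – and aggregating too coarsely blinds the filter.

```python
# Simplified illustration of per-destination counting vs. prefix aggregation.
# Not our production code – just a sketch of why wide IPv6 attacks are costly.
from collections import Counter
from ipaddress import IPv6Address, IPv6Network

per_address = Counter()   # one counter per attacked /128 -> explodes under a wide attack
per_prefix = Counter()    # one counter per covering /64  -> far fewer entries, less detail

def account(addr: IPv6Address) -> None:
    per_address[addr] += 1
    # Aggregate to the covering /64; coarser aggregation saves memory,
    # but makes it harder to tell victims from innocent neighbours.
    per_prefix[IPv6Network((int(addr), 64), strict=False)] += 1

# A wide attack: many distinct machine-generated destinations inside one /64.
base = int(IPv6Address("2001:db8:0:1::"))
for i in range(100_000):
    account(IPv6Address(base + i))

print(len(per_address))   # 100000 counters for a single /64 under attack
print(len(per_prefix))    # 1 counter after aggregation
```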
IPv6 kills
Similar situations are likely to hit many other providers. IPv6 is modern and we are big promoters of it ourselves, but Friday's situation was a big sobering experience. Big. We would almost rather stop offering IPv6…
With IPv6, we are faced with situations that do not exist with IPv4 and never will. This is just one of them. Quite essential.
With IPv6, there are protection issues, setup issues. Everything needs much more system resources than IPv4.
Measures in place
Network
We have adjusted some things on the network, splitting IPv4 traffic onto different routers than IPv6 traffic – a significant step in terms of stability. We have modified the filtering rules for IPv6. We now send individual IPv6 addresses to so-called blackholing much earlier, so their traffic is dropped before it enters our network and they disappear from the global routing tables.
Security
We’ve modified parts of our DDoS protection, added 512 GB of RAM to the sensors, added more sensors to the network. The new sensors monitor traffic behind our routers, directly between services. Up until now, we had online sensors on the backbone routers, behind DDoS filters, but we did not have online monitoring of network traffic between individual servers on our network. We find Netflow slow and insufficient, so we have added additional online sensors.
We are considering whether to deploy IDS/IPS also at VPS. This has been proven to work for web hosts. Traffic has dropped, the number of network incidents has dropped, and the number of compromised sites has dropped.
Protection
All VPSs will continue to be allocated a /112 range (65,536 IPv6 addresses per VPS), but in practice only individual IP addresses registered in the customer administration will actually be usable. We have this solution ready and will activate it next week. We will first notify all affected clients and only then begin restricting. Each customer with a VPS will automatically have the IPv6 address number one (from their range) active and will have to register any others in the administration first. You can already pre-register enabled IPv6 addresses in the customer administration, but for now this has no effect on functionality. Everything will become active after the weekend.
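As an illustration of what “address number one from their range” means (a hypothetical example using a documentation prefix, not a customer's actual allocation), the automatically active address is simply the first host address of the assigned /112:

```python
from ipaddress import IPv6Network

# Hypothetical /112 allocated to a single VPS (example prefix, not a real allocation).
vps_range = IPv6Network("2001:db8:1234:5678:9abc:def0:1111::/112")

print(vps_range.num_addresses)    # 65536 addresses in the /112
first_active = vps_range[1]       # "address number one" – active automatically
print(first_active)               # 2001:db8:1234:5678:9abc:def0:1111:1
# Any other address from the range would have to be registered in the administration first.
```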
Restrictions on root services
We have already taken some smaller measures, so it is no longer possible to abuse other IP addresses and the like on a VPS; this now applies to IPv6 as well, whereas previously some of these things were already blocked only for IPv4. We do not want to limit the nature of root services, but we are looking for a suitable solution that will not have a negative impact on clients.
Monitoring and social media status
We will add the ability to report a more widespread issue on social networks, not just the individual problems with individual servers that we could publish before.
Once again, we apologise. We certainly do not intend to keep anything quiet and we are not hiding any problems. Believe us, that is not our style.
Restrictions on certain packet types
For VPSs we have started to slightly limit some types of traffic, and we automatically block exceptionally high volumes (in case of attacks). This applies, for example, to ICMP packets, so during strong attacks ICMP packets may not always get through (for example, pings may not always be answered). Some protocols may be restricted, but without affecting the virtual server itself. It is in the interest of quality of service that traffic on the other protocols important for the operation of the VPS always keeps flowing. It is true that this limitation may, for example, cause false alarms in various monitoring tools. We apologize for this and believe it will only be a temporary situation.
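For readers curious how such per-protocol limiting typically works, here is a minimal token-bucket sketch. It is a generic illustration, not our actual filter or its thresholds: packets are allowed as long as tokens remain, and exceptionally high bursts are dropped.

```python
import time

class TokenBucket:
    """Generic token-bucket limiter: 'rate' tokens added per second, up to 'burst'."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True          # packet passes
        return False             # bucket empty -> packet dropped

# Illustrative limit: 100 ICMP packets/s with a burst of 200 (made-up numbers).
icmp_limit = TokenBucket(rate=100, burst=200)
# In a real filter this decision would be made per packet, per VPS:
# forward the packet if icmp_limit.allow() else drop it.
```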
More details – sorry, but no…
Sorry we’re not releasing more details about the attack. We could write what tools were used to carry out the attack, etc., but we don’t want to give instructions on how to “harm” and attack IPv6. That wouldn’t help anyone. Not us, not other service providers. Thank you for your understanding.
Sorry also for not disclosing further details of the measures taken; the reasons are similar. We do not want to give out information about our protection and thereby give potential attackers hints on how to cause harm on our network. Thank you for your understanding.
We hope you appreciate our openness and the fact that we have explained and described the whole situation at length rather than settling for an excuse or vague justification. We described what happened, where we made mistakes, what we have changed and what we still need to change. Openness and no secrets – that is our style. The whole situation has moved us forward and has again allowed us to improve our services.
IPv4 for all
Because of what happened, we have decided to give every VPS an IPv4 address for free (or for a crown per month), even though we have to buy them at a high price. We have run out of IPv4 addresses – we had practically no free ones, so we had to buy them, charge clients for them and push clients towards IPv6. Now that has backfired on us. From this perspective we will give everyone IPv4, because it places far smaller demands on filtering and we can filter it much better and more reliably. At the same time, we will separate IPv4 and IPv6 traffic for clients, so that in case of similar complications one of the two will always remain available. We believe we will not have to deal with a similar situation again.
Answers to individual questions that have been asked in the past few days
Question: Why doesn't the DDoS show up anywhere when I look at yesterday's transfer charts? (Why the charts don't show the DDoS)
A: The public graphs show the flows that are behind the primary DDoS protection. This means that the flow coming from outside is mostly filtered out and not visible on the charts. The problem was not so much the flow from the outside as from the inside. At the same time, it was not a large data flow, but a large number of small packets that consumed free slots on individual VPS.
Question: When I tested this, at certain times/intervals packets would get looped between two WEDOS routers and then dropped due to TTL expiry.
A: Yes, this happened when an interface on a router went down and the router was looking for an alternate path. Default routes then applied and packets could loop like that. This happened in our backbone network, where it was not a problem. Our main problem was behind the VPS router, which was overloaded from the inside.
Question: Why couldn’t I get in touch with customer support?
A: Customer support did their best on Friday afternoon. It is holiday season, when the support workload is normally lower, so there were not dozens of staff on duty but, for example, four plus the technicians. Unfortunately, when our website or chat is under attack, their availability suffers and things do not always work perfectly. The phones were ringing, but there was no way to serve everyone… Dozens of emails with questions were coming in, and at such a moment you don't know what to do first. That is why we post messages in the administration and the chat, and shorter updates on social media. All remaining effort goes into solving the problems and handling the situation.
Question: Why wasn’t there an immediate alert on some special WEDOS Status social media accounts?
A: The automated status system was not prepared for such a widespread situation. Based on previous experience we had introduced automated status reports for individual services. We take this as an incentive to improve and will work on it immediately.
Question (added after publication): what will be the compensation?
A: Compensation is handled by our sales department. Just write via the contact form.
Question (added after publication): The text contains stylistic and grammatical errors.
A: Yes, you are right. There are both stylistic and grammatical errors in the text. We are sorry. We wanted to communicate what happened and explain the situation. We could have written an apology and nothing more, but that is not our style. We could have closed the matter with any explanation, but we wrote the truth and also devoted space to a theoretical explanation of the problem so that everyone could understand it. We apologize for any mistakes in the Czech language or in style. Right now the technical issues around the whole situation are what matters to us, and that is what we are focusing on. Thank you for your understanding.
Question (added after publication): That captcha is pretty annoying. I could understand it for logging into the administration, once, to verify that I'm not a robot, but not every time I add an address – when I add 10 addresses, it asks me for 10 verifications…
A: We agree it is annoying, but it is a defence against bots (clients using bots) that would otherwise register hundreds, thousands or tens of thousands of IPv6 addresses, which would make the whole measure useless.
Question (added after publication): I'm sorry, but although this story sounds very plausible, one thing doesn't seem right. Your network has really never been attacked from inside the VPSs before?
A: We could make up anything, but we wrote the truth. There have been attacks from the inside before, but they were over IPv4.
Question (added after publication): Why didn't you have protection against insider attacks applied as a precaution? Sorry, but you seem like total networking amateurs to me, even though your other innovations are excellent.
A: We had such protection for IPv4. With IPv6 the situation is much more complicated and there is far less experience with IPv6 protection worldwide. The number of IP addresses is many times larger, so the protection is much more complicated…
Question (added after publication): In my opinion, your internal routing is just crazy. Why wasn't the third route from ČD Telematika active (according to the graph), and why does the second route have graphs only since this “attack”?
A: The graphs did not record anything because SNMP data collection was not working at the time of the problem and the routers were not responding correctly to provide the data. We admitted our mistakes in the text. Believe us, a routing error would be the easiest explanation and certainly a smaller problem than searching for what happened, explaining it and finding a solution to prevent it in the future.
If you have any further questions, don’t hesitate to ask. We will be happy to answer.
Conclusion
Once again, we apologize to all affected VPS users. We believe we have learned enough from this situation. We have now put several measures in place and are preparing more to prevent a similar situation from happening again.
At the moment, we can perhaps add that we are working on the construction of a second datacenter and on the preparation of a new service that will be launched for public tests in the coming weeks. The new service will be a combination of web hosting and VPS. It will be a “sort of” managed VPS, where you will have the parameters and performance of the VPS, but we will take care of the management of the VPS (intended for web hosting). There will be a lot of the news you’ve been asking for.
The new data centre is already standing and we are waiting for the delivery of the windows and then we will start installing the technology in the interior. We would like to remind you that it will be one of the most secure datacentres in the Czech Republic (servers underground and behind 30-110 cm of reinforced concrete), which will also be the most economical and most environmentally friendly datacentre (at least) in the Czech Republic. The servers will be cooled in an oil bath and the waste heat will be used to heat the town’s swimming pool.
At the same time, we are also preparing other news in our offer, but more about that next time…