What’s going on in the network, or a few words about DDoS attacks and planned network modifications


Let us inform you about what is currently happening and what we are preparing (in terms of the network).

Description of the situation on the afternoon of 26 August

Some of you (about 1/3) had trouble accessing some services this afternoon between 1:14 pm and 2:00 pm. The problem was caused by an extensive failure of the IP network at ČD Telematika (hereinafter referred to as ČDT), one of the suppliers of connectivity for our datacenter.

Roughly 2/3 of clients had no problem, apart from about 30 seconds while the routing tables were being recalculated and the paths to our network were changing (which could have added another 30-60 seconds or so).

Update at 20:00:

At 19:33, a strong attack on our network began, and at the time of writing it is still ongoing. Details below.

What exactly happened?

At ČD Telematika, a backbone router failed in such a way that it kept accepting packets but did not forward them anywhere (simply put, it threw them into a black hole). Unfortunately, this router was accepting packets on both sides, from us as well as from the internet, so the problem could not be diagnosed immediately. This was the cause of the extensive failure of the entire IP network of ČDT, one of the leading ISPs in the Czech Republic.

Simply put, the router was “pretending” to be fine on all sides, but it was not forwarding anything.

The backup ČDT router, which we have connected on a second route (to our other router), was overloaded and therefore handled only a small part of the traffic (about 10%).

Update, 27 August, 20:00

According to today’s information from ČDT, the crash of their routers was caused by the attack on our network. We have no reason to doubt this, and it only illustrates the power and execution of the DDoS attack that took place. The evening attack caused a (much smaller) complication in their network as well. We discussed the whole situation with ČDT several times today, and we believe that together we will find a way to defend ourselves successfully.

Why didn’t the backup help?

The backup did not help precisely because the primary ČDT router “pretended” to be fine instead of reporting an outage or malfunction. If an actual failure had occurred, all traffic (out of and into our network) would have been automatically rerouted through another supplier.

The backup ČDT router was also connected, but it was merely overloaded, so traffic was not automatically redirected to another supplier either.

Unfortunately, nothing was visible from our side or from the outside, and the problem with the ČDT router could not be diagnosed. It was essentially a problem out on the internet, completely outside our network.
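
For illustration only: the kind of check that can catch this failure mode is an active data-plane probe that measures packet loss through each upstream separately, regardless of whether the BGP session is up. The sketch below is a minimal, hypothetical example (the target addresses and thresholds are made up, and it simply shells out to ping); it is not our actual monitoring.

```python
#!/usr/bin/env python3
"""Minimal sketch of a data-plane health probe.

A BGP session can stay "up" while the router behind it black-holes
traffic, so instead of trusting the control plane we ping reference
hosts through each upstream and alarm on packet loss.
All addresses and thresholds below are made-up examples.
"""
import re
import subprocess

# Hypothetical probe targets reachable only via the given upstream
# (e.g. enforced by policy routing on the monitoring host).
UPSTREAMS = {
    "upstream-A": "192.0.2.10",    # TEST-NET address, placeholder only
    "upstream-B": "198.51.100.10",
}
LOSS_ALARM_PCT = 50  # above this, consider the path effectively dead


def packet_loss(target: str, count: int = 10, timeout_s: int = 1) -> float:
    """Return packet loss in percent for `count` pings to `target`."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), target],
        capture_output=True, text=True,
    )
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    return float(match.group(1)) if match else 100.0


if __name__ == "__main__":
    for name, target in UPSTREAMS.items():
        loss = packet_loss(target)
        status = "DEGRADED - consider withdrawing routes" if loss >= LOSS_ALARM_PCT else "OK"
        print(f"{name}: {loss:.0f}% loss -> {status}")
```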

Our monitoring registered the outage at 13:14, almost immediately. We first checked whether there was a problem on the optical routes; there was none. We then contacted the connectivity provider Kaora, who also confirmed that there was no problem on their side. Next we contacted ČDT support, where we were told that they had an extensive IP service outage but did not yet know the cause, so we kept trying to narrow down where the problem was. We first tried manually disconnecting one of the routes, which did not help, and then manually disconnected the second route to ČDT. While these changes were being made, the BGP routing tables were being rebuilt, which caused a problem of roughly 30-60 seconds on virtually all connectivity.

After disconnecting both peering links, which also serve as upstream connections carrying about 1/3 of our traffic, connectivity was immediately restored in full.

Today’s problem was beyond our control, and at the same time it was practically impossible to resolve it faster than in about 40-45 minutes (during which time about 1/3 of requests were affected). We apologize to our clients for the complications.

Update on the situation between 19:33 and 20:00

At 19:33, a strong DDoS attack started and saturated the route to our provider Kaora. Because of the afternoon problems with ČDT, we did not have their network connected at the time; we had planned to reconnect it during the night so as not to risk any problems during the day.

Given the circumstances, we were forced to bring the ČDT links back up so that traffic was again split between two routes and the impact of the attack was reduced. During the changes there was a network outage of about 3 minutes, roughly 5 minutes of significantly reduced availability, and about 15 minutes of degraded availability (with packet loss from some directions). We are still dealing with the situation; a strong attack is still coming from various locations around the world and is already being handled by the route providers “above us”, which is why there is still packet loss from some directions.

Recurring DDoS attacks

Since we began operating, we have regularly informed you about DDoS attacks against our company, and recently about their increasing frequency and intensity. In recent weeks, we have indeed been going through a major stress test of our network.

Believe us, we are giving the whole matter our utmost attention at all times.

The power is growing

When we reported last year on attacks of 2-3 Gbps, we thought that was a lot. In winter it was already 6 Gbps. In spring we surpassed 10 Gbps, and now in summer we have already set records approaching 20 Gbps. Twice now we have intercepted attacks that stayed above 17 Gbps for extended periods and peaked even higher for short periods. These are values that are not at all common in the Czech Republic, and probably not elsewhere either…

What happens during a DDoS attack

There are different types of DDoS attacks. Some target individual services, others our routers, others our main site, and others our IP ranges. Each one is different and it is hard to generalize, but in every case the attacker, or rather the computers he controls, is trying to take our network (or some of its services) out of operation. Literally millions of packets per second pour into our routers or servers; they do nothing useful, they only strain the infrastructure and take capacity away from the services that actually need it.

Sometimes a server no longer has the capacity to accept connections from real clients because several million other computers have “asked” it for a connection within a few seconds. In other cases the switch or the server’s uplink becomes saturated. Individual end servers are connected at 1 Gbps, so when an attack aimed at a single server exceeds 1 Gbps, that server’s connectivity is effectively knocked out.
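
To put rough numbers on this (our own illustrative arithmetic, not a measurement), the short calculation below shows why a 1 Gbps server uplink is easy to saturate: with minimum-size 64-byte packets it carries on the order of 1.5 million packets per second, so an attack of several Gbps, or several million packets per second, is already far more than the link can pass.

```python
# Rough, illustrative arithmetic: how many small packets fit through a link,
# and how badly an attack oversubscribes it. Numbers are examples, not measurements.

FRAME_OVERHEAD_BYTES = 20          # preamble (8) + inter-frame gap (12) on Ethernet
MIN_FRAME_BYTES = 64               # minimum Ethernet frame size


def max_pps(link_bps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Theoretical packets-per-second limit of a link for a given frame size."""
    bits_per_frame = (frame_bytes + FRAME_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


server_link = 1e9                   # 1 Gbps server uplink
attack_rate = 17e9                  # e.g. the ~17 Gbps attacks mentioned above

print(f"1 Gbps link, 64B packets:   ~{max_pps(server_link):,.0f} pps")
print(f"17 Gbps flood vs 1 Gbps:    ~{attack_rate / server_link:.0f}x oversubscribed")
print(f"17 Gbps flood, 64B packets: ~{max_pps(attack_rate):,.0f} pps hitting the edge")
```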

What can be done about it?

DDoS attacks are hard to defend against. Very hard. Some can be handled easily, others much less so.

We have almost always handled DDoS attacks with a smile. However, as their strength grows, we have already had two minor problems and a third one today (the latter probably would not have occurred if the ČDT peering had been connected).

We are preparing major changes and modifications so that everything works as it should, even during DDoS attacks.

How do we defend ourselves now?

It’s dangerous to give anything away, because it could help the attackers a lot.

We now have various rules set up on our routers and servers. Everything is monitored, practically in real time (within seconds). We are in contact with our connectivity providers and we deal with every attack together, immediately. At the same time, our connectivity is now primarily routed through a vendor that offers DDoS protection, but any protection only recognizes an attack after a certain amount of time, and every attack is different, so not everything works as it should.

One problem is that part of our connectivity reaches us via a second network and other providers. Another, much more significant problem is that when the attacks are this strong, the routes of the providers, including those “above us”, literally become clogged. Part of the routes then work (for example from the Czech Republic, Slovakia and the rest of Europe), while routes from overseas or Asia do not. Other times the attacks come from, say, South America, and the routes from there become clogged.
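
Purely as an illustration of what “monitoring within seconds” can look like in its simplest form, here is a toy sketch that samples interface byte counters from /proc/net/dev on a Linux host and flags a sudden jump above a fixed threshold. Our actual detection (and the commercial systems mentioned below) is far more sophisticated; the interface name and threshold are arbitrary example values.

```python
#!/usr/bin/env python3
"""Toy traffic-spike detector: sample /proc/net/dev and flag sudden jumps.

This is only a simplified illustration of threshold-based detection;
the interface name and threshold are arbitrary example values.
"""
import time

INTERFACE = "eth0"          # example interface name
THRESHOLD_BPS = 800e6       # alarm when inbound rate exceeds ~800 Mbps
INTERVAL_S = 2              # sampling interval


def rx_bytes(interface: str) -> int:
    """Read the received-bytes counter for `interface` from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0])   # first field = RX bytes
    raise ValueError(f"interface {interface!r} not found")


if __name__ == "__main__":
    previous = rx_bytes(INTERFACE)
    while True:
        time.sleep(INTERVAL_S)
        current = rx_bytes(INTERFACE)
        rate_bps = (current - previous) * 8 / INTERVAL_S
        previous = current
        if rate_bps > THRESHOLD_BPS:
            print(f"ALERT: {INTERFACE} inbound {rate_bps/1e6:.0f} Mbps "
                  f"exceeds {THRESHOLD_BPS/1e6:.0f} Mbps threshold")
        else:
            print(f"{INTERFACE}: {rate_bps/1e6:.1f} Mbps")
```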

What we are preparing

We are planning a major overhaul of the network. We will replace the edge routers with ones that have many times higher capacity and also offer 40 Gbps ports. The total routing capacity at the border between our network and the internet will be 2.56 Tbps, with a packet throughput of 1.904 billion packets per second. At the same time, tomorrow we want to put into operation a commercial DDoS attack detection system, which we want to test, and if the tests go well, we will buy it (or keep it).
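
For context, and purely as our own back-of-the-envelope check rather than anything from a datasheet, these two figures fit the usual vendor convention of quoting a 64-byte-packet forwarding rate against full-duplex switching capacity:

```python
# Back-of-the-envelope check that ~1.9 billion pps is consistent with 2.56 Tbps,
# assuming the common vendor convention: full-duplex capacity and 64-byte frames
# (64 B frame + 20 B preamble/inter-frame gap = 672 bits on the wire).

switching_capacity_bps = 2.56e12        # 2.56 Tbps, both directions combined
one_direction_bps = switching_capacity_bps / 2
bits_per_min_frame = (64 + 20) * 8      # 672 bits per minimum-size frame

forwarding_rate_pps = one_direction_bps / bits_per_min_frame
print(f"~{forwarding_rate_pps / 1e9:.2f} billion packets per second")  # ~1.9 billion pps
```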

Today we agreed with a vendor that operates an Arbor solution, and through whom our foreign connectivity already flows, that we will expand our cooperation and tighten detection, significantly improving the protection of our network.

At the same time, we are working on an additional fibre optic route to increase the total capacity to 40 Gbps.

In the coming days, we want to rebuild our entire infrastructure. This step has been in the pipeline for some time and could not be accelerated.

The attack detection we currently have in Hluboká will be moved directly to Prague. This should prevent the routes to us from literally becoming clogged.

Just for the record, the current situation is a priority for us in all respects, and the total investment will run to a seven-figure sum in crowns. On top of that, licences and commercial filtering will cost us hundreds of thousands of crowns every month. We believe it will help.

Unfortunately, it can’t be done any faster. For example, it takes several weeks to deliver the hardware, and the preparation and everything related is not an afternoon’s work.

Get ready, we’re off

The new hardware for the network modifications will arrive on Thursday, and we will get right to work. We will make the changes gradually: within a week we should have a significant portion of the modifications done, but we estimate it will take about a month to finish everything.

We want to do everything at night and without outages, but these are very major network modifications that may cause minor and very short-term unavailability (during the night). We will keep you informed about everything on our website.

We’re going to make big changes, big adjustments. In just a few days, we start…

Finally…

We are very, very sorry for the trouble caused to all our clients, and our priority is to take the steps described above so that we limit the impact of any further similar complications.

We believe that we have clearly demonstrated in recent weeks that we have significantly improved crisis communication in similar situations.

It is obvious that somebody really “cares” about us, because organizing such a “hunt” is not free and it must really be worth it to someone…

Once again, in conclusion, we can only write that what does not kill us makes us stronger. Just as we resolved the spring power-supply complications, we will resolve this one too. By the way, we would just like to add that the second motor-generator, which has been on order the whole time, will arrive in a few days; we are only waiting for it to be delivered to the site.