The share of blocked traffic exceeded 60%, or how we have made progress in protecting your websites


It’s been about four and a half years since we started filtering traffic for web hosts with our IPS/IDS protection, an advanced threat detection and elimination system. The original filtering rules can no longer be compared to today’s: they are far more lenient, yet more effective than ever before.

Three levels of protection

Before someone can access your website, they have to go through three levels of protection.

  1. DDoS protection – First of all, it’s the sensors of our massive DDoS protection. They are mainly looking for non-standard (above limit) suspicious traffic. If they find it, they will divert the traffic through the switch to the powerful servers where they will start filtering. In March 2020, we stopped a 44.5 Gbps DDoS attack. Nobody noticed anything.
  2. Pre-emptive protection – We introduced pre-emptive protection in April 2019. It is very fast filtering based on the analysis of accesses from all our web servers, which we collect in one central location and evaluate in real time.
  3. IPS/IDS protection – Finally comes IPS/IDS protection, which scrutinizes incoming and outgoing traffic. Based on more than 20,000 rules that are continuously added and modified (manually and automatically), it can detect both known threats and potential ones. The downside is that we can’t filter HTTPS traffic yet, but soon we’ll be able to do that too.
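The article doesn’t name the IDS engine in use, but to give an idea of what one of those 20,000+ rules might look like, here is a hypothetical Suricata-style signature that drops probes for a vulnerable WordPress plugin (the plugin path and sid are invented for this example):

```
# Hypothetical rule: drop HTTP requests probing for a known-vulnerable
# WordPress plugin file (illustrative path and sid, not a real ruleset entry)
drop http $EXTERNAL_NET any -> $HOME_NET any ( \
    msg:"Probe for vulnerable WP plugin"; \
    http.uri; content:"/wp-content/plugins/vulnerable-plugin/"; \
    classtype:web-application-attack; sid:1000001; rev:1; )
```

Thousands of such signatures, matched against each request, are what lets the system catch known exploit attempts before they reach the web server.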

This does not include the persistent rules on the servers and routers themselves. So every packet heading to our servers is assessed three times before it reaches them. The whole process takes a few microseconds; the small delay is worth it.

We have been developing the protection system for several years and are constantly improving it.

The number of blocked accesses is growing

This week, I noticed that the number of accesses blocked by pre-emptive and IPS/IDS protection exceeded 60% of all traffic. The overwhelming majority of accesses are blocked already at the pre-emptive stage. We were a bit worried that we might be blocking too much, but compared to previous years we are actually more lenient towards robots thanks to smarter rules. This has also reduced the proportion of false positives, and we block for shorter periods of time.

The explanation is probably simple. Hardware, connectivity and other things related to the Internet keep getting cheaper, so attackers simply have more resources. It’s like email spam: there is more and more of it (but we can block it very well).

A lot of blocked accesses may not directly want to exploit a vulnerability, but are just looking to see if it happens to be present on the target site. We’re blocking that, of course. Usually automatically, but we can also intervene manually, as in the case of the PHP framework Nette vulnerability or in the case where we blocked the exploitation of a “leaky” WordPress plugin. And there are many more such examples…

Different rules for different servers, or people come before robots

It’s all very complicated, and there is a reputation system in place. It’s all about collecting big data, analyzing it and setting the right rules. We collect logs from all servers, use dozens of different blacklists (including paid ones) and download various threat databases (including paid ones), and our filters behave intelligently. All of this changes our filtering dynamically, in real time, even several times per second.
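The post doesn’t reveal implementation details, but the core idea of a reputation system — turning several signals about an IP address into a block decision — can be sketched in a few lines of Python (all names, weights and thresholds here are invented for illustration):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Reputation:
    """Per-IP signals collected from logs, blacklists and tripped filters."""
    blacklist_hits: int = 0                           # matches in external blacklists
    filter_trips: list = field(default_factory=list)  # timestamps of tripped rules

    def score(self, now=None, window=300.0):
        """Combine signals into a single number; weights are made up."""
        now = time.time() if now is None else now
        recent = [t for t in self.filter_trips if now - t < window]
        return 10 * self.blacklist_hits + 3 * len(recent)

def should_block(rep, now=None, threshold=9):
    """Block when the combined reputation score crosses a threshold."""
    return rep.score(now) >= threshold
```

A real system would feed scores from many data sources and adjust the threshold per server, but the shape — signals in, dynamic decision out — is the same.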

Although we collect data from all NoLimit and WMS hosts, individual servers may have individual rules and settings that change dynamically over time.

For example, if several websites on a server get more traffic at the same time than all its reserves can absorb — say as a result of a successful advertising campaign — the server may limit the access of robots. These situations are quite extreme, but they can happen occasionally. The aim is to keep the service running, even if the robots have to wait an hour.

Of course, this doesn’t mean that you’ll immediately start getting unavailability warnings from monitoring. Restrictions refer to repeated accesses that substantially burden the server with excessive activity.

Just to give you an idea, a “normal” aggressive robot can make a thousand requests per second, and we can’t just wait it out, because customers would notice immediately. An overloaded web server struggles to breathe, and in some cases the only remedy would be a restart. That means long minutes of slow loading of your website followed by downtime. We simply have to defend against that, and we do. Our customers come first.
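A common way to cap an aggressive robot without blocking it outright — the “limit the number of accesses” mentioned above — is a token bucket. This is a generic sketch, not the provider’s actual mechanism; the rate and capacity numbers are arbitrary:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `capacity`."""
    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Refill tokens for elapsed time, then spend one if available."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A robot hammering the server exhausts its bucket and gets refused until tokens refill, while a well-paced crawler never notices the limit.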

With robots, it’s what they do, not what they are

Nowadays, you can’t rely on a robot claiming to be, say, GoogleBot. If we allow all “googlebots” to move around the servers unrestricted within some whitelist, it will not end well.

By the way, we have specialized filters that look for fake bots. If someone pretends to be a robot that only accesses from certain IP addresses and we suddenly see access from a different IP address, we can block that access, significantly limit its request rate (as a trial), or restrict what it can do – for example, it cannot submit forms or access the WordPress administration.
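One standard way to catch fake bots — and the one Google itself recommends for verifying Googlebot — is a reverse-DNS lookup on the client IP, followed by a confirming forward lookup on the returned hostname. A minimal sketch; the resolver callbacks are injectable so the logic can be tested without network access:

```python
import socket

def verify_bot(ip, allowed_suffixes, reverse=None, forward=None):
    """Verify a crawler's identity: reverse-DNS the IP, check that the
    hostname belongs to one of the bot's documented domains, then
    forward-DNS the hostname and confirm it resolves back to the same IP.
    `reverse`/`forward` default to real DNS lookups via the socket module."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname(host))
    try:
        host = reverse(ip)
    except OSError:
        return False
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return forward(host) == ip
    except OSError:
        return False
```

A client claiming to be Googlebot from an IP whose reverse record isn’t under googlebot.com or google.com (or whose forward lookup doesn’t match) fails the check and can be treated as a fake.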

That’s what smart protection is all about: lots of options, with a compromise for almost every situation.

Therefore, we mainly monitor the activity of IP addresses. If an IP address trips one or more filters at once, we block it for a limited time. If it doesn’t calm down, it gets blocked for longer, and then longer still…
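The escalating block described above — longer and then longer still — is essentially exponential backoff. A tiny sketch; the base duration, growth factor and cap are invented for illustration:

```python
def block_duration(offences, base=60, factor=4, cap=86400):
    """Escalating block time: first offence 60 s, then 4x longer for each
    repeat, capped at one day. All numbers are illustrative."""
    return min(cap, base * factor ** (offences - 1))
```

So a one-off misbehaving client is barely inconvenienced, while a persistent attacker quickly ends up blocked for hours.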

Search engine robots have exceptions, but they have to behave

We all love search engines and want all of our content to be available to them as soon as possible. That’s why we treat all known search engines differently than, say, robots that collect marketing data.

Search engine robots take into account that they can overload a website or server, so they also react to various warning signs, such as a limited number of accesses in a certain time or increasing response times.
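One of the warning signs a crawler can respect is declared directly by the site owner in robots.txt. For example, the Crawl-delay directive (honored by SeznamBot, Bing and Yandex, but ignored by Google, which paces itself) asks a crawler to slow down; the paths here are illustrative:

```
# Illustrative robots.txt: ask a well-behaved crawler to pace itself
User-agent: SeznamBot
Crawl-delay: 5

User-agent: *
Disallow: /admin/
```

This only works for crawlers that choose to obey it, which is exactly why server-side limits remain necessary for the rest.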

Google, for example, has managed to gauge the limits of our servers so accurately that it practically never hits them. This can be seen in the following graph. The top graph shows the number of GoogleBot accesses and the bottom shows how many accesses were blocked. This is a 7-day chart with hourly resolution.

Only a fraction of accesses are blocked. For GoogleBot we record specific IPv4 addresses.

It’s amazing how nicely Google can spread the load over time. In contrast, SeznamBot once in a while calls “Run!” and the fans in the servers kick into high gear because the CPUs need to be overclocked to higher performance 🙂

That’s also why it has a larger number of blocked requests. Even so, this is only a negligible fraction.

Other highlights:

  • Both are most interested in the robots.txt file
    • Google had 7,785,218 requests in the last 7 days
    • Seznam had 6,129,646 requests in the last 7 days
  • Google is also quite actively looking for the ads.txt file – 667,876 requests

Robots by number of accesses

In the following table you will find the accesses of the 45 most active robots (according to how they identified themselves) over the last 7 days. These robots have therefore passed through DDoS protection, pre-emptive protection and IPS/IDS protection. Scripts and generic robots such as crawler, robot, Python request, Apache-HttpClient etc. have been removed from the table.

Robot To whom it belongs Number of requests
Googlebot Google 58 230 724
bingbot Bing 56 851 376
SeznamBot Seznam 48 036 897
YandexBot Yandex 13 631 396
MJ12bot Majestic 11 453 595
FacebookBot Facebook 10 818 181
Googlebot-Image Google 9 615 900
AdsBot-Google Google 5 865 139
UptimeRobot UptimeRobot 5 487 728
Adsbot Google 3 696 208
SemrushBot SEMrush 3 163 259
Mediapartners-Google Google 2 955 552
ZoominfoBot ZoomInfo 2 837 140
serpstatbot Serpstat 2 477 853
Seekport Crawler 2 406 872
Applebot Apple 2 134 281
heritrix Internet Archive 1 779 452
PetalBot Aspiegel 1 758 849
BingPreview Bing 1 662 438
WP Fastest Cache Preload Bot 1 501 166
DotBot Moz 1 402 027
YandexImages Yandex 1 367 933
dns-crawler CZ.NIC 1 239 963
aranhabot Amazon 1 230 600
Pinterestbot Pinterest 1 039 024
AhrefsBot Ahrefs 987 148
Datanyze Datanyze 870 914
Heurekabot 822 107
ptolemaiaBot 767 102
Mail.RU_Bot 722 754
de/bot 593 131
Mediatoolkitbot Mediatoolkit 558 655
DuckDuckBot DuckDuckGo 532 675
magpie-crawler Brandwatch 425 309
AimySitemapCrawler Aimy 403 751
PingdomBot Pingdom 387 363
Sogou web spider Sogou 372 302
BLEXBot WebMeUp 333 534
CFNetwork Apple 332 778
SimplePie SimplePie 283 075
Electron 264 931
DuckDuckGo-Favicons-Bot DuckDuckGo 251 105
Seznam-Zbozi-robot Seznam 247 458
Amazonbot Amazon 239 876

A few years ago, we would have blocked some of these robots. Today, we have smart filters that just limit them when it’s in the interest of our customers. Otherwise, we do not prevent them from browsing the web.

What’s currently wrong

Currently, our biggest problems are with CDNs such as CloudFlare, through which a large number of attacks pass. Their users often don’t realise that they have to pay extra for more advanced protection. We cannot filter that traffic with IPS/IDS because HTTPS is used, and IP addresses can’t be restricted either, as everything arrives from shared CloudFlare IP addresses. CloudFlare is just one big complication: domains that use CloudFlare don’t use our DNS either, so we can’t migrate them quickly or defend them in any other way. But we’ll write about that next time.

In the future, we will solve this by improving IPS/IDS protection, where we will also check traffic over HTTPS.

Other than that, in all that time we’ve had only a small number of incidents (in the low tens) where we blocked a little more than we probably should have. We consider that a fantastic result.


Otherwise, we are extremely satisfied with the IPS/IDS protection. It blocks a lot of malicious traffic that no one misses, and we rarely encounter false positives. Most of the time when we investigate a reported problem, we find attacks from compromised computers, or someone rather clumsily running penetration tests. They may then find that they can’t log into their CMS administration for a few hours because IPS/IDS is blocking their POST requests.

Of course, among the more than 20,000 different filters, sometimes something suspicious can get in by mistake. If you find something like this, just drop us a line via the contact form and we’ll check it out.

We consider our protections to be our greatest competitive advantage. You won’t find a solution this complex elsewhere… And we’ve only described it superficially, because a full description would fill hundreds of articles like this one.