The share of blocked traffic exceeded 60%, or how we have made progress in protecting your websites


It’s been about four and a half years since we started filtering traffic for web hosts with our IPS/IDS protection, an advanced threat detection and elimination system. The original filtering rules can no longer be compared to today’s: they are far more lenient, yet more effective than ever before.

Three levels of protection

Before someone can access your website, they have to go through three levels of protection.

  1. DDoS protection – First of all, it’s the sensors of our massive DDoS protection. They are mainly looking for non-standard (above limit) suspicious traffic. If they find it, they will divert the traffic through the switch to the powerful servers where they will start filtering. In March 2020, we stopped a 44.5 Gbps DDoS attack. Nobody noticed anything.
  2. Pre-emptive protection – We introduced pre-emptive protection in April 2019. It is very fast filtering based on the analysis of accesses from all our web servers, which we collect in one central location and evaluate in real time.
  3. IPS/IDS protection – Finally comes IPS/IDS protection, which scrutinizes incoming and outgoing traffic. Based on more than 20,000 rules that are continuously added and modified (manually and automatically), it can detect both known threats and potential ones. The downside is that we can’t filter HTTPS traffic yet, but soon we’ll be able to do that too.
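The article doesn’t name the IDS engine in use, but to give an idea of what one of those 20,000+ rules might look like, here is a hypothetical Suricata-style signature that drops probes for a vulnerable WordPress plugin (the plugin path and sid are invented for this example):

```
# Hypothetical rule: drop HTTP requests probing for a known-vulnerable
# WordPress plugin file (illustrative path and sid, not a real ruleset entry)
drop http $EXTERNAL_NET any -> $HOME_NET any ( \
    msg:"Probe for vulnerable WP plugin"; \
    http.uri; content:"/wp-content/plugins/vulnerable-plugin/"; \
    classtype:web-application-attack; sid:1000001; rev:1; )
```

Thousands of such signatures, matched against each request, are what lets the system catch known exploit attempts before they reach the web server.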

This does not include the persistent rules on the servers and routers themselves. So every packet heading to our servers is assessed three times before it reaches them. The whole process takes a few microseconds; the small delay is worth it.

We have been developing the protection system for several years and are constantly improving it.

The number of blocked accesses is growing

This week, I noticed that the number of accesses blocked by pre-emptive and IPS/IDS protection exceeded 60% of all traffic. The overwhelming majority of accesses are blocked already at the pre-emptive stage. We were a bit worried that we might be blocking too much, but compared to previous years we are actually more lenient towards robots thanks to smarter rules. This has also reduced the proportion of false positives, and we block for shorter periods of time.

The explanation is probably simple. Hardware, connectivity and other things related to the Internet keep getting cheaper, so attackers simply have more resources. It’s like email spam: there is more and more of it (but we can block it very well).

A lot of blocked accesses may not directly want to exploit a vulnerability, but are just looking to see if it happens to be present on the target site. We’re blocking that, of course. Usually automatically, but we can also intervene manually, as in the case of the PHP framework Nette vulnerability or in the case where we blocked the exploitation of a “leaky” WordPress plugin. And there are many more such examples…

Different rules for different servers, or people come before robots

It’s all very complicated, and there is a reputation system in place. It’s all about collecting big data, analyzing it and setting the right rules. We collect logs from all servers, use dozens of different blacklists (including paid ones) and download various threat databases (including paid ones), and our filters behave intelligently. All of this changes our filtering dynamically, in real time, even several times per second.
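The post doesn’t reveal implementation details, but the core idea of a reputation system — turning several signals about an IP address into a block decision — can be sketched in a few lines of Python (all names, weights and thresholds here are invented for illustration):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Reputation:
    """Per-IP signals collected from logs, blacklists and tripped filters."""
    blacklist_hits: int = 0                           # matches in external blacklists
    filter_trips: list = field(default_factory=list)  # timestamps of tripped rules

    def score(self, now=None, window=300.0):
        """Combine signals into a single number; weights are made up."""
        now = time.time() if now is None else now
        recent = [t for t in self.filter_trips if now - t < window]
        return 10 * self.blacklist_hits + 3 * len(recent)

def should_block(rep, now=None, threshold=9):
    """Block when the combined reputation score crosses a threshold."""
    return rep.score(now) >= threshold
```

A real system would feed scores from many data sources and adjust the threshold per server, but the shape — signals in, dynamic decision out — is the same.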

Although we collect data from all NoLimit and WMS hosts, individual servers may have individual rules and settings that change dynamically over time.

For example, if several websites on a server get more traffic at the same time than all its reserves can absorb — say as a result of a successful advertising campaign — the server may limit the access of robots. These situations are quite extreme, but they can happen occasionally. The aim is to keep the service running, even if the robots have to wait an hour.

Of course, this doesn’t mean that you’ll immediately start getting unavailability warnings from monitoring. Restrictions refer to repeated accesses that substantially burden the server with excessive activity.

Just to give you an idea, a “normal” aggressive robot can make a thousand requests per second, and we can’t just wait it out, because customers would notice immediately. An overloaded web server struggles to breathe, and in some cases the only remedy would be a restart. That means long minutes of slow loading of your website followed by downtime. We simply have to defend against that, and we do. Our customers come first.
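A common way to cap an aggressive robot without blocking it outright — the “limit the number of accesses” mentioned above — is a token bucket. This is a generic sketch, not the provider’s actual mechanism; the rate and capacity numbers are arbitrary:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `capacity`."""
    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Refill tokens for elapsed time, then spend one if available."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A robot hammering the server exhausts its bucket and gets refused until tokens refill, while a well-paced crawler never notices the limit.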

With robots, it’s what they do, not what they are

Nowadays, you can’t rely on a robot claiming to be, say, GoogleBot. If we allow all “googlebots” to move around the servers unrestricted within some whitelist, it will not end well.

By the way, we have specialized filters that look for fake bots. If someone pretends to be a robot that only accesses from certain IP addresses and we suddenly see access from a different IP address, we can block that access, significantly limit its request rate (as a trial), or restrict what it can do – for example, it cannot submit forms or access the WordPress administration.
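One standard way to catch fake bots — and the one Google itself recommends for verifying Googlebot — is a reverse-DNS lookup on the client IP, followed by a confirming forward lookup on the returned hostname. A minimal sketch; the resolver callbacks are injectable so the logic can be tested without network access:

```python
import socket

def verify_bot(ip, allowed_suffixes, reverse=None, forward=None):
    """Verify a crawler's identity: reverse-DNS the IP, check that the
    hostname belongs to one of the bot's documented domains, then
    forward-DNS the hostname and confirm it resolves back to the same IP.
    `reverse`/`forward` default to real DNS lookups via the socket module."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname(host))
    try:
        host = reverse(ip)
    except OSError:
        return False
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return forward(host) == ip
    except OSError:
        return False
```

A client claiming to be Googlebot from an IP whose reverse record isn’t under googlebot.com or google.com (or whose forward lookup doesn’t match) fails the check and can be treated as a fake.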

That’s what smart protection is all about: lots of options, with a compromise for almost every situation.

Therefore, we mainly monitor the activity of IP addresses. If an IP address trips one or more filters at once, we block it for a limited time. If it doesn’t calm down, it gets blocked for longer, and then longer still…
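The escalating block described above — longer and then longer still — is essentially exponential backoff. A tiny sketch; the base duration, growth factor and cap are invented for illustration:

```python
def block_duration(offences, base=60, factor=4, cap=86400):
    """Escalating block time: first offence 60 s, then 4x longer for each
    repeat, capped at one day. All numbers are illustrative."""
    return min(cap, base * factor ** (offences - 1))
```

So a one-off misbehaving client is barely inconvenienced, while a persistent attacker quickly ends up blocked for hours.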

Search engine robots have exceptions, but they have to behave

We all love search engines and want all of our content to be available to them as soon as possible. That’s why we treat all known search engines differently than, say, robots that collect marketing data.

Search engine robots take into account that they can overload a website or server, so they also react to various warning signs, such as a limited number of accesses in a certain time or increasing response times.
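One of the warning signs a crawler can respect is declared directly by the site owner in robots.txt. For example, the Crawl-delay directive (honored by SeznamBot, Bing and Yandex, but ignored by Google, which paces itself) asks a crawler to slow down; the paths here are illustrative:

```
# Illustrative robots.txt: ask a well-behaved crawler to pace itself
User-agent: SeznamBot
Crawl-delay: 5

User-agent: *
Disallow: /admin/
```

This only works for crawlers that choose to obey it, which is exactly why server-side limits remain necessary for the rest.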

Google, for example, has managed to gauge the limits of our servers so accurately that it practically never hits them. This can be seen in the following graph. The top graph shows the number of GoogleBot accesses and the bottom shows how many accesses were blocked. This is a 7-day chart with hourly resolution.

Only a fraction of accesses are blocked. For GoogleBot we record specific IPv4 addresses.

It’s amazing how nicely Google can spread the load over time. In contrast, SeznamBot once in a while calls “Run!” and the fans in the servers kick into high gear because the CPUs need to be overclocked to higher performance 🙂

That’s also why it has a larger number of blocked requests. Even so, this is only a negligible fraction.

Other highlights:

  • Both are most interested in the robots.txt file
    • Google had 7,785,218 requests in the last 7 days
    • Seznam had 6,129,646 requests in the last 7 days
  • Google is also quite actively looking for the ads.txt file – 667,876 requests

Robots by number of accesses

In the following table you will find the accesses of the 45 most active robots (according to how they identified themselves) over the last 7 days. These robots have therefore passed through DDoS protection, pre-emptive protection and IPS/IDS protection. Scripts and generic robots such as crawler, robot, Python request, Apache-HttpClient etc. have been removed from the table.

Robot To whom it belongs Number of requests
Googlebot Google 58 230 724
bingbot Bing 56 851 376
SeznamBot Seznam 48 036 897
YandexBot Yandex 13 631 396
MJ12bot Majestic 11 453 595
FacebookBot Facebook 10 818 181
Googlebot-Image Google 9 615 900
AdsBot-Google Google 5 865 139
UptimeRobot UptimeRobot 5 487 728
Adsbot Google 3 696 208
SemrushBot SEMrush 3 163 259
Mediapartners-Google Google 2 955 552
ZoominfoBot ZoomInfo 2 837 140
serpstatbot Serpstat 2 477 853
Seekport Crawler 2 406 872
Applebot Apple 2 134 281
heritrix Internet Archive 1 779 452
PetalBot Aspiegel 1 758 849
BingPreview Bing 1 662 438
WP Fastest Cache Preload Bot 1 501 166
DotBot Moz 1 402 027
YandexImages Yandex 1 367 933
dns-crawler CZ.NIC 1 239 963
aranhabot Amazon 1 230 600
Pinterestbot Pinterest 1 039 024
AhrefsBot Ahrefs 987 148
Datanyze Datanyze 870 914
Heurekabot 822 107
ptolemaiaBot 767 102
Mail.RU_Bot 722 754
de/bot 593 131
Mediatoolkitbot Mediatoolkit 558 655
DuckDuckBot DuckDuckGo 532 675
magpie-crawler Brandwatch 425 309
AimySitemapCrawler Aimy 403 751
PingdomBot Pingdom 387 363
Sogou web spider Sogou 372 302
BLEXBot WebMeUp 333 534
CFNetwork Apple 332 778
SimplePie SimplePie 283 075
Electron 264 931
DuckDuckGo-Favicons-Bot DuckDuckGo 251 105
Seznam-Zbozi-robot Seznam 247 458
Amazonbot Amazon 239 876

A few years ago, we would have blocked some of these robots. Today, we have smart filters that just limit them when it’s in the interest of our customers. Otherwise, we do not prevent them from browsing the web.

What’s currently wrong

Currently, our biggest problems are with CDNs such as CloudFlare, through which a large number of attacks pass. Their users often don’t realise that they have to pay extra for more advanced protection. We cannot filter that traffic with IPS/IDS because HTTPS is used, and IP addresses can’t be restricted either, as everything arrives from shared CloudFlare IP addresses. CloudFlare is just one big complication: domains that use CloudFlare don’t use our DNS either, so we can’t migrate them quickly or defend them in any other way. But we’ll write about that next time.

In the future, we will solve this by improving IPS/IDS protection, where we will also check traffic over HTTPS.

Other than that, in all that time we’ve had only a small number of incidents (in the low tens) where we blocked a little more than we probably should have. We consider that a fantastic result.


Otherwise, we are extremely satisfied with the IPS/IDS protection. It blocks a lot of malicious traffic that no one misses, and we rarely encounter false positives. Most of the time when we investigate a reported problem, we find attacks from compromised computers, or someone rather clumsily running penetration tests. They may then find that they can’t log into their CMS administration for a few hours because IPS/IDS is blocking their POST requests.

Of course, among the more than 20,000 different filters, sometimes something suspicious can get in by mistake. If you find something like this, just drop us a line via the contact form and we’ll check it out.

We consider our protections to be our greatest competitive advantage. You won’t find a solution this complex elsewhere… And we’ve only described it superficially, because a full description would fill hundreds of articles like this one.