As a popular website, stackoverflow.com gets a lot of attention. Some of it is good, like the time we were nominated for a Webby Award. Other times, that attention is distinctly less good, as when we get targeted by distributed denial of service (DDoS) attacks.
For the past few months, we've been the target of ongoing DDoS attacks. These attacks have come in two varieties: our API has been hit by application-layer attacks, while the main site has been subject to volume-based attacks. Both exploit the surfaces that we expose to the internet.
We're still getting hit regularly, but thanks to our SRE and DBRE teams, along with some code changes made by our Public Platform team, we've been able to minimize the impact on our users' experience. Some of these attacks are now visible only in our logs and dashboards.
We wanted to share some of the general tactics we've used to dampen the effects of DDoS attacks, so that others facing the same kinds of attacks can mitigate them.
Botnet attacks on expensive SQL queries
In two application-layer attacks, an attacker leveraged a very large botnet to trigger a very expensive query. Some back-end servers hit 100% CPU utilization during these attacks. What made this extra challenging is that the attack was distributed over a huge pool of IP addresses; some IPs sent only two requests, so rate limiting by IP address would have been ineffective.
We had to create a filter that separated the malicious requests from the legitimate ones so we could block those specific requests. Initially, the filter was a bit overzealous, but over time we refined it to identify only the malicious requests.
Once we mitigated that attack, the attackers regrouped and tried targeting user pages by requesting extremely high page counts. To avoid detection or bans, they incremented the page number their bots requested. This subverted our earlier controls by attacking a different area of the site while still exploiting the same vulnerability. In response, we put a filter in place to identify and block the malicious traffic.
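A simple guard against the high-page-count trick is to flag pagination requests far beyond what real users ever reach. Here's a minimal sketch; the `MAX_PAGE` threshold and function names are hypothetical illustrations, not Stack Overflow's actual code:

```python
# Hypothetical guard for a paginated endpoint. MAX_PAGE is an
# illustrative threshold; your traffic logs will tell you how deep
# legitimate users actually paginate.
MAX_PAGE = 500

def is_suspicious_page_request(page: int, page_size: int) -> bool:
    """Flag requests for pages far beyond what legitimate users visit."""
    if page < 1 or page_size < 1:
        return True   # malformed pagination parameters
    if page > MAX_PAGE:
        return True   # nobody browses to page 10,000 by hand
    return False

def handle_request(page: int, page_size: int = 30):
    """Return an HTTP-style (status, body) tuple for a page request."""
    if is_suspicious_page_request(page, page_size):
        return 429, "Too Many Requests"  # or 403/404, per your policy
    return 200, f"page {page}"
```

The exact response code matters less than keeping the expensive database query from ever running for these requests.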
These API routes, like any API that pulls data from a database, are critical to the day-to-day functioning of Stack Overflow. To protect routes like these from DDoS attacks, here's what you can do:
- Insist that every API call be authenticated. This will help identify malicious users. If requiring authentication for all API calls isn't possible, set stricter limits for anonymous/unauthenticated traffic.
- Minimize the amount of data a single API call can return. When we build our front page question list, we don't retrieve all the data for every question. We paginate, lazy load only the data in the viewport, and request only the data that will be visible (that is, we don't request the text for every answer until loading the question page itself).
- Rate-limit all API calls. This goes hand in hand with minimizing data per call; to get large amounts of data, the attacker will need to call the API multiple times. Nobody needs to call your API 100 times per second.
- Filter malicious traffic before it hits your application. HAProxy load balancers sit between all requests and our servers to balance the amount of traffic across our servers. But that doesn't mean all traffic has to go to one of those servers. Implement thorough, easily queryable logs so malicious requests can be quickly identified and blocked.
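To make the rate-limiting advice concrete, here's a minimal per-caller token-bucket sketch. All names are illustrative assumptions; in production you'd key on an API token and back the state with something shared across servers (e.g. Redis) rather than in-process dictionaries:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow `rate` requests per second per caller, with bursts up to
    `capacity`. Each caller gets an independent bucket."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)  # tokens per caller
        self.last = defaultdict(time.monotonic)      # last refill time

    def allow(self, caller: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[caller]
        self.last[caller] = now
        # Refill tokens for elapsed time, capped at bucket capacity.
        self.tokens[caller] = min(
            self.capacity, self.tokens[caller] + elapsed * self.rate
        )
        if self.tokens[caller] >= 1:
            self.tokens[caller] -= 1
            return True
        return False
```

A burst-tolerant limiter like this lets legitimate clients make a handful of quick calls while still capping the sustained rate any one caller can achieve.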
Whack-a-mole on malicious IPs
We were also subject to some volume-based attacks. A botnet sent a huge number of `POST` requests to `stackoverflow.com/questions/`. This one was easy: since we don't use a trailing slash on that URL, we blocked all traffic to that specific path.
The attacker figured it out, dropped the trailing slash, and came back at us. Instead of just reactively blocking every route the attacker hit, we collected the botnet IPs and blocked them through our CDN, Fastly. This attacker took three swings at us: the first two caused us some difficulties, but once we collected the IPs from the second attack, we could block the third attack immediately. The malicious traffic never even made it to our servers.
A new volume-based attack, possibly from the same attacker, took a different approach. Instead of throwing the entire botnet at us, they activated just enough bots to disrupt the site. We'd put those IPs on our CDN's blocklist, and the attacker would send the next wave at us. It was like a game of Whack-a-mole, except not fun and we didn't win any prizes at the end.
Instead of having our incident teams scramble and ban IPs as they came in, we automated it like good little SREs. We created a script that would check our traffic logs for IPs behaving a specific way and automatically add them to the ban list. Our response time improved with each attack. The attacker kept going until they got bored or ran out of IPs to throw at us.
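The core of that kind of log-scan-and-ban automation is small. Here's a sketch under stated assumptions: the combined-log-style format, the `/questions` path, the threshold, and the `update_blocklist` hook are all hypothetical, and the real version would push the result to the CDN rather than to an in-memory set:

```python
import re
from collections import Counter

# Assumed log line shape: '<ip> - - [timestamp] "POST /path HTTP/1.1" ...'
LOG_PATTERN = re.compile(r'^(\d+\.\d+\.\d+\.\d+) .* "(POST|GET) (\S+)')

def find_abusive_ips(log_lines, path_prefix="/questions", threshold=100):
    """Count POSTs per IP to a targeted path; return IPs over threshold."""
    hits = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group(2) == "POST" and m.group(3).startswith(path_prefix):
            hits[m.group(1)] += 1
    return {ip for ip, n in hits.items() if n >= threshold}

def update_blocklist(abusive_ips, blocklist):
    """Merge newly found IPs into the blocklist; in a real deployment,
    this is where you'd call your CDN's API. Returns the new entries."""
    new = abusive_ips - blocklist
    blocklist |= new
    return new
```

Run on a schedule (or streamed from the log pipeline), this turns each new wave of bots into a blocklist update instead of a human scramble.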
Volume-based attacks can be more insidious. They look like regular traffic, just more of it. Even when a botnet is focused on a single URL, you can't always just block the URL; legitimate traffic hits that page, too. Here are a few takeaways from our efforts:
- Block weird URLs. If you start seeing trailing slashes where you don't use them, or `POST` requests to invalid paths, flag and block those requests. If you have other catch-all pages and start seeing strange URLs coming in, block them.
- Block malicious IPs even when legitimate traffic can originate from them. This does cause some collateral damage, but it's better to block some legitimate traffic than to be down for all traffic.
- Automate your blocklist. The problem with blocking a botnet manually is the toil involved in identifying a bot and sending its IPs to your blocklist. If you can recognize the patterns of a bot and automate blocking based on those patterns, your response time will go down and your uptime will go up.
- Tar pitting is a good way to slow down botnets and mitigate volume-based attacks. The idea is to reduce the number of requests a botnet can send by increasing the time between requests.
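Tar pitting can be as simple as delaying responses to flagged clients so each bot connection stays tied up instead of immediately freeing the bot to fire again. A minimal async sketch, with a hypothetical delay value and whatever flagging heuristic identified the bots:

```python
import asyncio

TARPIT_DELAY = 10.0  # seconds; illustrative value, tune per attack

async def handle(request_ip: str, flagged_ips: set,
                 delay: float = TARPIT_DELAY) -> str:
    """Serve normal clients immediately, but make flagged clients wait
    before getting any response. Each delayed response holds one bot
    connection open, throttling the rate the botnet can sustain."""
    if request_ip in flagged_ips:
        await asyncio.sleep(delay)
    return "200 OK"
```

Because the sleep is asynchronous, thousands of tar-pitted connections cost the server almost nothing while each one slows its bot down.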
Other things we learned
By dealing with several DDoS attacks back-to-back, we were able to learn from them and improve our overall infrastructure and resiliency. We're not about to say thanks to the botnets, but nothing teaches better than a crisis. Here are a few of the big overall lessons we learned.
Invest in monitoring and alerting. We identified a few gaps in our monitoring protocols that could have alerted us to these attacks sooner. The application-layer attacks in particular had telltale signs that we could add to our monitoring portfolio. In general, improving our tooling has helped us respond faster and maintain site uptime.
Automate all the things. Because we were dealing with multiple DDoS attacks in a row, we could better spot the patterns in our workflow. When an SRE sees a pattern, they automate it, which is exactly what we did. By letting our systems handle the repetitive work, we reduced our response time.
Write it all down. If you can't automate it, record it for future firefighters. It can be hard to step back during a crisis and take notes, but we managed to take some time and create runbooks for future attacks. The next time a botnet floods us with traffic, we've got a head start on handling it.
Talk to your users. Tor exit nodes were the source of a significant amount of traffic during one of the volume attacks, so we blocked them. That didn't sit well with legitimate users who happened to use the same IPs. Users started some wild speculation, blaming Chinese Communists for preventing anonymous access to the site (to be fair, that's half right: I'm Chinese). We had no intention of blocking Tor access entirely, but it was preventing other users from reaching the site, so we posted on Meta to explain the situation before the pitchforks came out en masse. We're now adding communication tasks and tooling to our incident response runbooks so we can be more proactive about informing users.
DDoS attacks can sometimes come with success on the internet. We've gotten a lot of attention over the last 12 years, and some of it is bound to be negative. If you find yourselves on the receiving end of a botnet's attention, we hope the lessons we've learned can help you out as well.