The past two years have been an exciting time at Stack Overflow. While we still run stackoverflow.com in an on-premises datacenter, we have taken on the journey of migrating Stack Overflow for Teams to Microsoft Azure.
Stack Overflow for Teams is our private SaaS product for internal knowledge sharing and collaboration within organizations. It’s basically your own private stackoverflow.com. Being a SaaS product, it’s very well suited to run in the cloud with flexible scaling and abstraction of the hardware we run on. But because of its origin as an outgrowth of Stack Overflow, it began its life using the same infrastructure and hardware as the public sites.
This two-part blog series will discuss our cloud journey, the decisions we made, and the things we learned while moving Stack Overflow for Teams to Azure.
Why go to the cloud?
Since the beginning, Stack Overflow and its sites ran on physical hardware designed to maximize the efficient use of resources so that we could run on the smallest possible number of servers. We were pretty proud of this: one of the original servers hung on the wall of our New York City office long after it stopped working.
But once we saw our Stack Overflow for Teams business take off, running everything on a fixed number of machines was no longer feasible. We want to move away from having engineers physically go to our data center to handle hardware issues and upgrade hardware. Our engineers should focus on what adds the most value for our customers, and that’s not maintaining physical infrastructure.
Another benefit that cloud migration brings us is the ability to maintain security compliance frameworks such as SOC 2. Azure makes this a lot easier: their data centers maintain multiple compliance attestations and certifications, and their tooling helps keep our resources compliant. We went through our first SOC 2 process in 2020, and it can be time consuming. Azure would simplify a lot of this.
An additional benefit is that with virtual infrastructure, Azure makes it a lot easier to spin up additional ephemeral environments where we can test new features and infrastructure changes without disrupting other developers.
So how did you do it?
Three years ago, we came up with our plan to split our Stack Overflow for Teams product and move just the Business tier into Azure. This would mean getting SOC 2 Type II accreditation for our Business customers as quickly as possible while keeping our Free and Basic customers on-premises and making the migration a future problem.
As we executed this plan, we found out that it was incredibly difficult to split our product across two locations without severely impacting the user experience. Suddenly, users needed to know which environment their account was linked to when trying to sign in. While we could build something to help with logins, this split between the Business and Basic/Free environments proved fatal when it came to integrations (Slack, Jira, Microsoft Teams), since we can’t control their app installation process without creating two separate apps, which the app stores didn’t allow.
After working on this path for almost a year, we decided to pivot to a new plan: move all three tiers of Teams and all existing customers to Azure at once.
V2 consisted of multiple phases:
- Phase I: Move Stack Overflow for Teams from stackoverflow.com to stackoverflowteams.com
- Phase II: Decouple Teams and stackoverflow.com infrastructure within the data center
- Phase III: Build a cloud environment in Azure as a read-only replica while the datacenter remains as primary
- Phase IV: Switch the Azure environment to be the primary environment customers use
- Phase V: Remove the link to the on-premises datacenter
In this blog post we will discuss Phase I. The second post will cover the other phases.
A multi-tenant Stack Exchange Network
When we started Teams six years ago, it was part of stackoverflow.com. Our company is known for our Stack Exchange Network, and we wanted to give developers a familiar feeling and integrate with their daily usage of stackoverflow.com. Naturally, it made sense to let Teams users access their private sites from the sidebar of the public site.
Now to understand how we built Teams, you first have to know how we architected the Stack Exchange Network. You may be most familiar with stackoverflow.com, but if you look at https://stackexchange.com/sites, there’s a huge list of sites (173 at last count) that all use the same Q&A foundation that we originally built for stackoverflow.com.
This foundation is multi-tenant. We have a central SQL Server database named “Sites” that contains data shared across the network. Most important for this discussion: the Sites database contains a list of network sites. Each network site then has its own content database that contains all the users, posts, votes, and other data for that specific site. All of this data belongs to publicly accessible sites, so the controls on them were pretty simple and uniform.
Each site has a host address such as stackoverflow.com, superuser.com, or cooking.stackexchange.com. Every time a request comes into our app, we check the host and see if it matches one of our known network sites. That is why you will see the following when you visit a non-existing site such as https://idontexist.stackexchange.com/:
Teams adds another level of multi-tenancy where we have a site (the parent) that hosts Teams (the children). In the past, when you hit stackoverflow.com/c/yourteam, the first layer of multi-tenancy through the Sites database would bring you to stackoverflow.com, and then the team name was used to find the content of your team.
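The two-level lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation: the `SITES` and `TEAMS` dictionaries, the `resolve` function, and the database names are all hypothetical stand-ins for the Sites database and the per-team lookup.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical stand-in for the list of network sites in the Sites database.
SITES = {
    "stackoverflow.com": "StackOverflow",
    "superuser.com": "SuperUser",
    "cooking.stackexchange.com": "Cooking",
}

# Hypothetical stand-in for teams hosted under a parent site:
# (parent host, team slug) -> team content database.
TEAMS = {
    ("stackoverflow.com", "yourteam"): "Team_yourteam",
}


def resolve(url: str) -> Optional[str]:
    """Return the content database name for a URL, or None if nothing matches."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # First level: match the host against the known network sites.
    if host not in SITES:
        return None

    # Second level: a /c/<team> path selects a team under the parent site.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "c":
        return TEAMS.get((host, segments[1].lower()))

    return SITES[host]
```

In this sketch an unknown host such as idontexist.stackexchange.com falls out at the first level, while a team URL only resolves if both the parent host and the team slug are known.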
Functionally, this gave us what we needed to make it work, but because this was private customer data instead of public site data, we also needed to think about securing this data.
A secure data center
Historically, because we only hosted publicly available data, engineers had a lot of permissions to access the servers and databases to troubleshoot issues. For example, at our scale, we sometimes have performance issues that we can most easily solve by connecting to a machine and creating a memory dump.
This wouldn’t work for our Teams product. Teams contains private customer data that we as engineers should never be able to access. So to make sure that we can secure customer data, we had to make some changes to the data center as shown in the following diagram:
On the left side, you see our DMZ. This is what we had before launching Teams. The DMZ is where your request as a user comes in, where the Sites database is located, and where all the content databases for the different Stack Exchange sites live.
Now if you hit https://stackoverflow.com/c/my-team, your request gets intercepted and forwarded to the right side of the diagram: the TFZ (teams firewall zone or teams fun zone, depending on who you ask). The TFZ is completely locked down. Engineers don’t have access outside of documented break-the-glass situations, and customer data can’t be queried.
This does mean that although all the Teams databases are inside the TFZ, the Sites database is shared between Teams and Stack Overflow.
Getting Teams into Azure meant that we had to separate Teams from Public on a functional level all the way down to the hardware. Splitting off Teams to its own hardware and databases was a complex project. What scared us most was that certain steps were big bang steps with no possibility of rolling back. We had to get things right or fix forward, and that added a lot of risk.
That’s why we looked at making the individual phases smaller and less risky. We decided the first thing we could do was move Teams from stackoverflow.com to its own domain: stackoverflowteams.com.
Moving to stackoverflowteams.com
We can already run multiple sites from our application, and sites can have a parent. We came up with the idea of making Teams a separate site in the Stack Exchange network with some special settings and making all Teams children of this new site. The new parent site could have its own domain: stackoverflowteams.com. We would build out this new site in small steps and figure out all the user experience things we had to change. This way, we could decouple the infrastructure changes from the user experience changes, making things simpler and less risky, and eliminate some of those big bang steps altogether.
We added a new entry in the Sites DB for our new ‘Stack Overflow for Teams’ site and added a new site type, TeamsShellSite, that we could use in code to differentiate between a regular Stack Exchange site such as stackoverflow.com and our new site: stackoverflowteams.com.
The TeamsShellSite became the new parent for all individual Teams. If you go to stackoverflowteams.com while not logged in, you will see a welcome page with some information and the option to create a free team and log in. This is served from the TeamsShellSite.
If you do have a team and go to stackoverflowteams.com/c/your-team, your request still hits the DMZ, the base host address is mapped to the new TeamsShellSite record, and your requests get forwarded to the TFZ.
Removing stackoverflow.com as a parent required a lot of changes. Anything previously hosted on stackoverflow.com now had to be handled by the TeamsShellSite, including the account page, navigation between Teams, third-party integrations, and customer configurations for SSO and SCIM. We also had to make sure to have redirects in place so customers could access a Team through both stackoverflow.com/c/your-team and stackoverflowteams.com/c/your-team.
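The team URL redirect mentioned above could look roughly like the following sketch. It is written in Python for illustration only; `MOVED_TEAMS` and `redirect_target` are hypothetical names, and the real matching logic is not described in this post.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "stackoverflow.com"
NEW_HOST = "stackoverflowteams.com"

# Hypothetical set of team slugs whose parent is already the new site.
MOVED_TEAMS = {"your-team"}


def redirect_target(url: str) -> Optional[str]:
    """Return the new-domain URL for a moved team, or None if no redirect applies."""
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != OLD_HOST:
        return None

    # Only /c/<team>/... paths belong to Teams.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "c":
        return None
    if segments[1].lower() not in MOVED_TEAMS:
        return None

    # Keep path, query, and fragment; only the host changes.
    return urlunsplit((parts.scheme, NEW_HOST, parts.path, parts.query, parts.fragment))
```

Keeping the path and query intact is what lets an old bookmark or link on stackoverflow.com land on the same page under stackoverflowteams.com.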
This was especially important for our authentication and ChatOps integrations. A lot of our customers have set up integrations like SSO, SCIM, Jira, Slack, and Microsoft Teams. These integrations point to stackoverflow.com, and we wanted to make sure we wouldn’t break them when we migrated a Team to stackoverflowteams.com. We also wanted to point our integrations to a new subdomain, integrations.stackoverflowteams.com, to decouple them from our host domain.
As you can imagine, this took a lot of testing to make sure we covered all edge cases. For example, until we updated all our integrations to point to integrations.stackoverflowteams.com, we added redirects from stackoverflow.com. However, it turned out the Jira integration installation page didn’t work with redirects due to an embedded iframe Jira uses. We had to work around that limitation by changing host headers instead of redirecting for that specific page. All these changes were the bulk of the work for moving Teams away from Stack Overflow as a parent site and to a new domain.
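To make the difference between redirecting and changing the host header concrete, here is a small Python sketch. Everything specific in it is an assumption: the `/integrations/` path prefix, the Jira installation path, and the `handle` function are hypothetical, and the post does not say where in the stack the rewrite was done.

```python
from typing import Dict, Tuple

OLD_HOST = "stackoverflow.com"
INTEGRATIONS_HOST = "integrations.stackoverflowteams.com"

# Hypothetical path prefix for integration endpoints on the old host.
INTEGRATION_PREFIX = "/integrations/"
# Hypothetical path of the one page that cannot follow redirects.
NO_REDIRECT_PATHS = {"/integrations/jira/install"}


def handle(headers: Dict[str, str], path: str) -> Tuple[str, Dict[str, str]]:
    """Decide how an integration request on the old host reaches the new host.

    Returns ("redirect", {"Location": ...}), ("forward", rewritten headers),
    or ("pass", original headers) when the request is not affected.
    """
    on_old_host = headers.get("Host", "").lower() == OLD_HOST
    if not on_old_host or not path.startswith(INTEGRATION_PREFIX):
        return "pass", headers

    if path in NO_REDIRECT_PATHS:
        # Serve the page in place: copy the headers and swap only the Host value.
        rewritten = dict(headers)
        rewritten["Host"] = INTEGRATIONS_HOST
        return "forward", rewritten

    return "redirect", {"Location": f"https://{INTEGRATIONS_HOST}{path}"}
```

With a redirect the client has to make a second request to the new host, while with a rewritten host header the client never sees a redirect response at all.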
Once we had the code changes to support the TeamsShellSite in place, we could start moving Teams from stackoverflow.com to stackoverflowteams.com by changing the parent of a site and updating the cache. We created some internal helpers to move a single team or a batch of Teams to a new parent to make this process easy and painless. The big advantage was that we could easily move a team but also move it back if something went wrong. This made all the changes we had to do less scary since we knew it wasn’t a big bang change without a rollback.
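A helper of this kind might be sketched as below. This is an illustrative Python model under stated assumptions, not the internal tooling itself: `Team`, `TeamDirectory`, and all method names are hypothetical, and an in-memory dictionary stands in for both the database and the cache.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Team:
    slug: str
    parent: str  # host of the parent site


class TeamDirectory:
    """Hypothetical in-memory model of team records plus a lookup cache."""

    def __init__(self, teams: Iterable[Team]) -> None:
        self._teams: Dict[str, Team] = {t.slug: t for t in teams}
        # Cached slug -> parent lookup, as a request handler might consult it.
        self._cache: Dict[str, str] = {t.slug: t.parent for t in self._teams.values()}

    def parent_of(self, slug: str) -> str:
        return self._cache[slug]

    def move(self, slug: str, new_parent: str) -> str:
        """Re-parent one team, update its cache entry, and return the old parent."""
        team = self._teams[slug]
        previous = team.parent
        team.parent = new_parent
        self._cache[slug] = new_parent
        return previous

    def move_batch(self, slugs: Iterable[str], new_parent: str) -> Dict[str, str]:
        """Re-parent several teams and return what is needed to undo the move."""
        return {slug: self.move(slug, new_parent) for slug in slugs}

    def rollback(self, previous: Dict[str, str]) -> None:
        """Move every team back to the parent recorded before the batch."""
        for slug, parent in previous.items():
            self.move(slug, parent)
```

Returning the previous parents from `move_batch` is the design choice that makes the operation reversible: the same `move` call that migrates a team also moves it back.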
We started by migrating our own internal Teams: if we were going to break something, we should be the ones to feel it. Once we had our own Teams working, we started migrating customers. We first moved all small, free Teams, then our Basic tier, and finally our Business tier.
We ran into an issue with caching. If a team was moved from stackoverflow.com to stackoverflowteams.com and a user tried to access it on stackoverflow.com, our code looked at the cached data, couldn’t find the site, and reloaded all sites in the cache, only to then figure out it should redirect. That happened every time. Now, reloading all the cached sites is a very expensive operation, so yes, we might have taken the site down once or twice, but that only added to all the fun.
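The reload-on-every-miss pattern can be reproduced in a few lines. The Python sketch below is an assumption-laden model: `SiteCache` and its fields are hypothetical, and the `remember_misses` option shows one common mitigation (remembering keys that were not found), which is not necessarily how the issue was actually resolved.

```python
from typing import Callable, Dict, Optional, Set


class SiteCache:
    """Hypothetical cache that reloads every site when a lookup misses."""

    def __init__(self, load_all: Callable[[], Dict[str, str]],
                 remember_misses: bool = False) -> None:
        self._load_all = load_all
        self._remember_misses = remember_misses
        self._sites = load_all()
        self._misses: Set[str] = set()
        self.reloads = 0  # counts the expensive full reloads after startup

    def get(self, key: str) -> Optional[str]:
        if key in self._sites:
            return self._sites[key]
        if self._remember_misses and key in self._misses:
            # Known miss: skip the expensive reload (assumed mitigation).
            return None

        # Cache miss: reload everything, which is the expensive step.
        self._sites = self._load_all()
        self.reloads += 1
        if key not in self._sites:
            self._misses.add(key)
        return self._sites.get(key)
```

Without `remember_misses`, every request for a moved team triggers a full reload. With it, only the first request does, at the cost of having to clear the remembered misses whenever a team is moved back.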
In December 2022, we completed this first phase after almost a year of work. The customer-facing changes were now done, and all customers received communication and support for the changes they had to make. We were now successfully running all of Stack Overflow for Teams on its own domain!
Now we could move on to Phase II: removing the dependency on the shared Sites database and removing the dependency we have on the DMZ.