
Failover Systems and LoginRadius’ 99.99% Uptime

Do you remember the Amazon outage that affected several high-profile customers back in 2011? On April 21st, widely used sites such as Reddit and Quora were brought down, and many others experienced latency or were knocked offline, too. Early that morning, as part of normal scaling activities, Amazon staff performed a network change. However, the change was made incorrectly and, after the staff attempted to correct it through a rollback, the inner mechanisms of the service left it unable to serve read and write requests. Those inner mechanisms are outside the scope of this article, but they are described in the service disruption summary that Amazon published. What interests us here are the strategies that companies can use to mitigate the negative effects on their products when a service they rely on fails. These strategies make up what is known as a failover system.

A failover system is a set of mechanisms that perform failover. In computer networking, failover is the process of switching to a redundant or standby server or network upon the abnormal termination of a previously active server or network. The benefit of a failover system is clear: with one in place, your product or service can remain available even when adverse events take place. Uptime, the time during which a service is operational, is crucial to the success of your business. If your services are unavailable, you are likely to lose customers and any revenue they would have generated. In an article entitled ‘Lessons Netflix Learned from the AWS Outage,’ Netflix describes the manual process it used to deal with the Amazon outage and acknowledges the importance of automating this process in the future so it can keep scaling its service. That is, the company acknowledges the importance of having an automated failover system in place instead of relying on a team of top engineers manually making changes every time there is an issue.

How is an automated failover system set up? Let us consider the simplest scenario: setting up a failover system for one server. By definition, in addition to the main server, we need a redundant standby server that we can switch to whenever the main one fails. Since the standby server must provide the same functionality as the main one, it must be configured identically. Additionally, we need a tool that ensures client requests are routed to the standby server when the main one fails. We can achieve this by using a DNS failover tool. DNS (the Domain Name System) translates human-readable hostnames into IP addresses, and a DNS failover tool makes sure the “dictionary” (the DNS records) used for this translation is updated in the event of an outage. DNS failover tools know when to update the records by periodically checking the main server’s status. With these three components (a main server, a standby server, and a DNS failover tool) and some configuration, you can set up a simple, automated failover system. Of course, there are other considerations to take into account when setting up the failover system, such as hosting the standby server in a different geographic area from the main server and using different companies to host each one. That way, your servers will be less likely to go down simultaneously.
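To make the health-check loop concrete, here is a minimal Python sketch of the logic a DNS failover tool might follow. It is an illustration under stated assumptions, not a description of any particular product: the class name `DnsFailover`, the `failure_threshold` parameter, the hostname, and the IP addresses are all hypothetical, and an in-memory dictionary stands in for the real DNS records. Switching back to the main server once it recovers is also an assumption of this sketch.

```python
import socket
from typing import Callable, Dict


def tcp_health_check(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


class DnsFailover:
    """Hypothetical sketch: points a hostname at a standby IP after repeated failures."""

    def __init__(
        self,
        hostname: str,
        primary_ip: str,
        standby_ip: str,
        check: Callable[[str], bool] = tcp_health_check,
        failure_threshold: int = 3,
    ) -> None:
        self.hostname = hostname
        self.primary_ip = primary_ip
        self.standby_ip = standby_ip
        self.check = check
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        # Stand-in for the DNS records; a real tool would call its DNS provider here.
        self.dns_table: Dict[str, str] = {hostname: primary_ip}

    def poll_once(self) -> str:
        """Run one health check, update the record if needed, and return the active IP."""
        if self.check(self.primary_ip):
            # Main server is healthy: reset the counter and route traffic to it.
            self.consecutive_failures = 0
            self.dns_table[self.hostname] = self.primary_ip
        else:
            self.consecutive_failures += 1
            # Only fail over after several misses, so one dropped check does not flip DNS.
            if self.consecutive_failures >= self.failure_threshold:
                self.dns_table[self.hostname] = self.standby_ip
        return self.dns_table[self.hostname]
```

In practice, `poll_once` would run on a timer, and the dictionary update would be replaced by a call to your DNS provider. Keep in mind that clients cache DNS answers for the record's TTL, so a short TTL is needed for the switch to take effect quickly.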

At LoginRadius, we have set up automated failover systems in all layers of our architecture, which is why we can commit to 99.99% uptime on a monthly basis. With our services, your customers can keep engaging with your business even when an individual server or provider fails, which helps you avoid losing those customers and the revenue they generate. If you want to get technical about our availability architecture, click here.
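To put the 99.99% figure in perspective, the short Python sketch below converts an uptime percentage into the downtime it allows. The function name `allowed_downtime_minutes` and the 30-day month are assumptions made for illustration; the exact measurement window is not specified above.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted over `days` at the given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# For a 30-day month, 99.99% uptime leaves a budget of roughly 4.32 minutes of downtime.
monthly_budget = allowed_downtime_minutes(99.99)
```

A budget that small leaves little room for a manual response, which is why the switch to a standby server needs to be automated.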


About Ruben Gonzalez

Ruben is a Computer Science student at The University of British Columbia. His main interests include web development, AI, classical music, and social justice. Originally from Colombia, he is also fluent in Spanish. Check out his LinkedIn and GitHub profiles.
