
Freshworks deploys Spotinst in production to automate AWS-Opsworks on Spot Instances

Since its founding, Freshworks’ product suite has expanded to 11 products, including Freshsales, for sales teams; Freshrelease, for project management; and Freshcaller, a solution for call centers.

Freshworks has raised $250 million to date and is backed by Accel, Sequoia Capital, and CapitalG.

Freshworks now has over 2,000 employees across 10 global office locations and over 150,000 customers spread across 197 countries.

Freshworks’ Cloud Team:

The Site Reliability Engineering (SRE) team is the technology group within Freshworks responsible for the availability, performance, latency, efficiency, change management, monitoring, emergency response, and capacity planning of Freshworks’ cloud infrastructure.

In the company’s early years, the SRE team focused on developing and optimizing the underlying platform infrastructure to support new product releases and prepare for scale as the number of customers steadily increased.

As of today, the SRE team is managing:

  • Thousands of servers on AWS (CloudFront, Route53, KMS, DynamoDB, Opsworks, etc.) 
  • 0.5 million requests per minute (peak traffic) 
  • 4 million DB reads per minute
  • 15 TB of logs per day

Adopting a cost-efficient mindset 

As the company scaled, Freshworks’ AWS bill grew too, and the company wanted to optimize its cloud computing infrastructure costs.

“In the beginning, we had never allocated a budget for our infrastructure costs, as it was considered part of the operational costs of running our applications, but as the company grew, we realized that cost efficiency is becoming a necessity when running at scale,” said Pradeep Thangavel, Engineering Manager, Site Reliability Engineering.

To optimize cloud spending, the SRE team received a budget for infrastructure costs and was required to provide an uptime SLA for the Freshworks platform within that budget.

“Before we were introduced to Spotinst, the only main cost-saving strategy we were able to adopt was purchasing prepaid RIs (Reserved Instances),” said Pradeep Thangavel.

AWS EC2 Reserved Instances provide a significant discount (up to 75%) compared to On-Demand instances in exchange for a one- or three-year commitment.

However, purchasing RIs is a financial commitment. Moreover, RIs are not always fully utilized, and they do not help with scaling out during peak traffic.

 

Spot Automation – The Challenge

Freshworks saw Spot Instances as the most effective cost-reduction approach, with the potential to reduce infrastructure costs by up to 80%.

Spot Instances are spare AWS EC2 capacity offered at a significant discount of up to 90% compared to On-Demand. AWS can reclaim them on short notice, but they suit a large variety of workloads and are also useful when workloads need to scale.

However, the AWS frameworks that form the backbone of Freshworks’ architecture were not designed to run on Spot Instances, and modifying the architecture was not a viable option either.

One of Freshworks’ main framework components is ‘AWS Opsworks’, an AWS configuration management service that the SRE team uses to automate the configuration of Freshworks products. 

“The main challenge for us was to integrate both Spot Instances and AWS Opsworks to work together because each has its own lifecycle,” said Pradeep Thangavel.

Had the SRE team started using Spot Instances for their Opsworks EC2 instances on their own, they would have been burdened with the overhead of managing the termination and launching of instances while maintaining the same configuration.

On top of that, AWS gives only a two-minute interruption notice before terminating a Spot Instance. Without automation, that short window may leave the application unavailable, breaching the uptime SLA and directly impacting the company’s business.
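
To make the two-minute window concrete, here is a minimal sketch of how a script might read an interruption notice and work out how much time is left. It assumes the JSON shape that EC2 exposes on the instance metadata path /latest/meta-data/spot/instance-action; the helper name and the sample timestamps are illustrative and do not come from Freshworks’ setup.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Return the seconds left before the action time in a Spot interruption notice."""
    notice = json.loads(notice_json)
    # The "time" field is a UTC timestamp such as 2024-01-01T12:02:00Z.
    action_time = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (action_time - now).total_seconds()


# Illustrative payload; on a real instance it would be fetched from the
# metadata service instead of being hard-coded.
sample_notice = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

remaining = seconds_until_interruption(sample_notice, now)
print(f"{remaining:.0f} seconds left to drain and replace this instance")
```

Everything a team needs to do for a replacement (drain, deregister, launch, configure) has to fit inside that remaining time, which is why doing it by hand is risky.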

“After thorough research, we came to realize that reliably managing Spot Instances is a massive automation challenge for us,” said Pradeep Thangavel.

Spotinst Elastigroup – Spot Automation Solution

Seamless Integration:

To address these challenges, Freshworks was looking for a fully managed Spot automation solution that would meet the company’s requirements.

When the SRE team was introduced to Spotinst Elastigroup, they were impressed with the fact that it easily integrated with the frameworks they were using, and the migration was a ‘one-time setup’.

Spotinst Elastigroup for AWS is a SaaS platform that provisions, manages, and scales compute infrastructure and saves up to 80% on the cloud-compute costs, by reliably leveraging Spot Instances for the EC2 workloads.

To integrate with Opsworks, the SRE team mapped the Opsworks layers directly to Spotinst Elastigroups. Every Opsworks layer, which comprises hundreds of EC2 instances, is mapped to its own dedicated Elastigroup.
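
The one-to-one mapping can be pictured with a small data-structure sketch. This is not the Spotinst or Opsworks API: the `Layer` and `Elastigroup` classes, the `eg-` naming scheme, and the layer names and sizes are all hypothetical, chosen only to show one dedicated group per layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str             # hypothetical Opsworks layer name
    instance_count: int   # how many EC2 instances the layer runs


@dataclass(frozen=True)
class Elastigroup:
    name: str             # hypothetical group name
    target_capacity: int  # instances the group should keep running


def map_layers_to_groups(layers: list[Layer]) -> dict[str, Elastigroup]:
    """Create exactly one dedicated group per layer, sized to match it."""
    mapping: dict[str, Elastigroup] = {}
    for layer in layers:
        if layer.name in mapping:
            raise ValueError(f"duplicate layer: {layer.name}")
        mapping[layer.name] = Elastigroup(f"eg-{layer.name}", layer.instance_count)
    return mapping


layers = [Layer("web", 300), Layer("workers", 150)]
groups = map_layers_to_groups(layers)
for layer_name, group in groups.items():
    print(layer_name, "->", group.name, group.target_capacity)
```

Keeping the mapping strictly one-to-one means each layer’s configuration and capacity can be reasoned about in isolation.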

“Each of our R&D teams operates on a separate individual framework, and luckily the integration between Spotinst and our workloads on Opsworks, EKS, and Rancher, was seamless,” said Pradeep Thangavel.

Handling Spot Interruptions:

“In most cases, Spotinst was able to predict an interruption 15 minutes prior to AWS’ notification and immediately scheduled the EC2 Spot Instance for replacement, and in edge cases, we were even notified 20 minutes in advance,” said Pradeep Thangavel.

When the SRE team first designed the company’s cloud architecture, they weighed many factors to support scaling EC2 instances. As a result, the infrastructure already had several layers of elasticity when it moved to Spot Instances.

Freshworks’ cloud workloads were ideal candidates to run on Spot Instances because 90% of them are stateless applications that can tolerate interruptions.

Apart from that, the SRE team dedicated a lot of time and effort to adjusting their startup and shutdown scripts so that they properly drain and disconnect EC2 instances from the load balancer while, in parallel, spinning up a new EC2 instance with the same configuration.
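
The drain-and-replace pattern described above can be sketched as follows. This is an illustration only, not Freshworks’ scripts: `LoadBalancer` is a hypothetical in-memory stand-in, and `drain` and `launch_replacement` are placeholders for real load balancer and EC2 calls.

```python
from concurrent.futures import ThreadPoolExecutor


class LoadBalancer:
    """Hypothetical in-memory stand-in for a real load balancer."""

    def __init__(self, instances):
        self.instances = set(instances)

    def deregister(self, instance_id):
        self.instances.discard(instance_id)

    def register(self, instance_id):
        self.instances.add(instance_id)


def drain(lb, instance_id):
    # Stop routing new requests to the instance. A real shutdown script would
    # also wait for in-flight requests to finish before the instance exits.
    lb.deregister(instance_id)
    return instance_id


def launch_replacement(config):
    # Placeholder for launching an instance with the same configuration.
    return f"replacement-for-{config['role']}"


def replace_instance(lb, instance_id, config):
    """Drain the old instance and launch its replacement concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        drained = pool.submit(drain, lb, instance_id)
        launched = pool.submit(launch_replacement, config)
        drained.result()
        new_id = launched.result()
    # Only start sending traffic once the replacement exists.
    lb.register(new_id)
    return new_id


lb = LoadBalancer({"i-old", "i-other"})
new_id = replace_instance(lb, "i-old", {"role": "web"})
print(sorted(lb.instances))
```

Running the two steps in parallel rather than in sequence is what shortens the period during which the layer is one instance short.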

This smooth scaling mechanism gave them comfortable headroom to tolerate Spot interruptions.

Guaranteeing SLA:

Spotinst was founded in 2015 and is striving to revolutionize the way companies manage and orchestrate their cloud-compute workloads.

Spotinst was a natural choice for Freshworks because it commits to a 99.9% uptime SLA, which directly addressed the SRE team’s challenge of ensuring a 99.8% uptime SLA for Freshworks’ applications.
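
To put the two SLA figures in perspective, the short calculation below converts an uptime percentage into an allowed-downtime budget. The 30-day measurement window is an assumption for illustration; the source does not say over what period either SLA is measured.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime SLA over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


for sla in (99.9, 99.8):
    budget = allowed_downtime_minutes(sla)
    print(f"{sla}% uptime allows about {budget:.1f} minutes of downtime per 30 days")
```

Under that assumption, 99.9% leaves roughly 43 minutes per 30 days and 99.8% roughly 86 minutes, so the platform’s commitment is tighter than the target the SRE team had to meet.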

On top of that, Spotinst Elastigroup’s ‘Fall-back to On-Demand’ feature keeps the cluster highly available by falling back to an On-Demand instance when the Spot market is unstable or has no capacity for that instance type.
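
The fall-back decision can be illustrated with a few lines of code. This is a simplified sketch of the idea, not how Elastigroup implements it: `choose_capacity` is a hypothetical helper, and the availability map stands in for whatever market signals the real service uses.

```python
def choose_capacity(
    instance_types: list[str], spot_available: dict[str, bool]
) -> tuple[str, str]:
    """Prefer Spot in order of preference; fall back to On-Demand otherwise."""
    for instance_type in instance_types:
        if spot_available.get(instance_type, False):
            return ("spot", instance_type)
    # No usable Spot capacity: keep the cluster whole with On-Demand.
    return ("on-demand", instance_types[0])


preferred = ["m5.large", "c5.large"]
print(choose_capacity(preferred, {"m5.large": False, "c5.large": True}))
print(choose_capacity(preferred, {"m5.large": False, "c5.large": False}))
```

The point of the design is that capacity is never left unfilled: cost is the variable that gives way when Spot is unavailable, not availability.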

“Our journey with Spotinst started in 2016 with a small PoC, and as our confidence in the platform grew stronger, we gradually on-boarded more AWS accounts, and today we are running hundreds of EC2 instances across several AWS accounts with Spotinst Elastigroup,” said Pradeep Thangavel.

Immediate Cost Reduction and Visibility:

After the initial workload migration to Spotinst Elastigroup, the SRE team immediately observed a massive reduction in their cloud-compute spending: an average of 65% in savings on the instances migrated so far, compared with running them solely on On-Demand instances.

In addition, the Spotinst Elastigroup dashboard gave them deeper visibility into their cloud-compute costs, allowing them to stay on top of their spending at any given time.