AWS <> Spotinst: Workload Automation on EC2 Spot Instances

Intro 

Over the past decade, the cloud has gradually become the place where companies develop, test and run their applications. Whether they are enterprises or SMBs, the majority of businesses are already running workloads on the AWS public cloud or planning to do so in the near future, as on-premise data centers gradually become obsolete. This transition from on-premise data centers to the AWS cloud was natural and necessary for engineering teams, following the dramatic expansion of AWS’ cloud service offerings. Since launching in 2006, Amazon Web Services has positioned itself as the world leader in the public cloud, owning over 65% of the market.

Running applications on the public cloud has provided DevOps engineers with a wider variety of computing services, immediate deployment and elasticity, no hardware maintenance, better resource utilization, easier scaling, and lower operational costs. Meanwhile, in order to meet the rising demand for cloud computing resources and to ensure that compute capacity is always available for its customers, AWS must constantly build new data centers and expand existing ones.

However, continually adding compute resources to cloud data centers to meet demand has resulted in excess capacity: compute resources that are not used and remain idle. In order to make the most of the situation and to promote a more granular utilization of its data centers, AWS has decided to sell that excess capacity to the market at a significant discount (up to 80%).

In this blog post, we will cover the challenges of cost optimization in the public cloud, how AWS EC2 Spot Instances help reduce cloud compute costs, and how Spotinst’s solutions can automate the entire process of leveraging Spot Instances for production workloads.

The challenges of Cloud Cost-Optimization and Infrastructure Automation

With the many advantages of migrating from an on-premise data center to the public cloud come great challenges, specifically in the domain of cost optimization. Infrastructure planning that is not cost-mindful can quickly result in unwieldy infrastructure costs.

The key factors that affect a company’s cloud bill are: 

  • Network usage – Transferring data between Availability Zones or regions 
  • Storage – Storing data on EBS volumes and in S3 buckets 
  • Compute – The amount of resources consumed in terms of CPU and memory 
  • Cost of Service – Cloud providers charge a fixed fee per service

Additionally, one of the main challenges of managing and monitoring the AWS cloud bill is forecasting the cloud-compute usage of application workloads. In order to stay within a predefined budget, DevOps engineers are required to forecast compute resource consumption during both routine and peak traffic. 

Furthermore, DevOps teams need to rightsize their applications to the most suitable instance type in order to avoid ‘Oversizing’, which can lead to under-utilized instances. 

During peak traffic, the infrastructure should scale instances automatically in order to support the upcoming application load, and when the peak traffic is reduced, the cluster should scale down instances back to normal.  

Running production workloads with On-demand instances (pay per usage) is costly and can increase the cloud bill dramatically. 

One of the main strategies AWS offers to reduce cloud-compute costs is purchasing pre-paid RIs (Reserved Instances) to cover the applications’ requirements. Purchasing pre-paid RIs provides AWS customers with a discount on the instance cost, and the size of the discount differs between the 1-year and 3-year commitment plans. 

The main challenge with pre-paid RIs is forecasting the applications’ requirements during routine and peak traffic. In order to fully benefit from the discount, the RIs need to be utilized most of the time.  
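To make "utilized most of the time" concrete, here is a minimal sketch of the break-even arithmetic. It assumes an RI is billed for every hour of the term whether or not it is used, while On-Demand is billed only for hours actually used. The function names and the prices are hypothetical, chosen only for illustration, and are not real AWS rates.

```python
def ri_break_even_utilization(on_demand_hourly: float, ri_effective_hourly: float) -> float:
    """Fraction of hours an RI must be in use to cost no more than On-Demand.

    The RI costs ri_effective_hourly for every hour of the term, while
    On-Demand costs on_demand_hourly only for the hours actually used.
    """
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return ri_effective_hourly / on_demand_hourly


def ri_saves_money(on_demand_hourly: float, ri_effective_hourly: float,
                   expected_utilization: float) -> bool:
    """True when the forecast utilization is above the break-even point."""
    return expected_utilization > ri_break_even_utilization(
        on_demand_hourly, ri_effective_hourly
    )


# Hypothetical prices: 0.10 per hour On-Demand, 0.06 per hour effective RI rate.
break_even = ri_break_even_utilization(0.10, 0.06)
print(f"RI pays off above {break_even:.0%} utilization")
```

With these assumed prices, an RI that is expected to sit idle more than about 40% of the time would cost more than simply paying the On-Demand rate for the hours used.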

What are Spot Instances? Benefits and challenges

Spot Instances are transforming the way engineering teams consume public cloud services. Spot Instances are short-lived instances offered by AWS at a very low cost compared to On-Demand or Reserved Instances. AWS uses the Spot market as a way to monetize its excess capacity. The price of Spot Instances varies with supply and demand, but users can save up to 80% compared to On-Demand instances.

AWS has offered Amazon EC2 Spot Instances since 2009. EC2 Spot Instances can be used alongside other AWS services such as EMR, Auto Scaling groups, ALB/ELB, Elastic Container Service (ECS), Elastic Kubernetes Service (EKS) and AWS Batch.

However, running production workloads on Spot Instances is tricky and requires planning, because AWS provides only a 2-minute notification before terminating a Spot Instance, with no SLA guaranteed. 
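As a rough illustration of how an application can detect that 2-minute notification, the sketch below polls the EC2 instance metadata service, which exposes a pending interruption at the spot/instance-action path and returns HTTP 404 when none is scheduled. The function names are hypothetical, the fetch function only works from inside an EC2 instance, and the sample notice is made-up data in the documented shape.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

TOKEN_URL = "http://169.254.169.254/latest/api/token"
ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(body: str, now: datetime) -> float:
    """Parse an instance-action notice and return the seconds left until it takes effect."""
    notice = json.loads(body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def fetch_interruption_notice(timeout: float = 1.0):
    """Return the notice body, or None when no interruption is scheduled (HTTP 404).

    Only meaningful when called from inside an EC2 instance.
    """
    token_request = urllib.request.Request(
        TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_request, timeout=timeout) as response:
            token = response.read().decode()
        action_request = urllib.request.Request(
            ACTION_URL, headers={"X-aws-ec2-metadata-token": token}
        )
        with urllib.request.urlopen(action_request, timeout=timeout) as response:
            return response.read().decode()
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None
        raise


# Made-up sample notice, evaluated against a fixed clock for illustration.
sample = '{"action": "terminate", "time": "2024-01-01T00:02:00Z"}'
clock = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, clock)
print(f"{remaining:.0f} seconds left to drain connections and save state")
```

An application would typically call the fetch function every few seconds and, once a notice appears, stop accepting new work and hand off whatever it can in the time remaining.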

In addition, Spot capacity is not guaranteed: when the Spot pool for a given instance type is exhausted, cloud customers can face service interruptions as they try to quickly switch to another instance. Data consistency, data loss, active sessions, and in-flight HTTP requests are also concerns. For example, not knowing what happens to data on attached network drives when a Spot Instance is terminated is one of the challenges engineering teams face when handling interruptions.

AWS EC2 Spot instances are useful for various fault-tolerant and flexible applications, such as big data, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD and other test & development workloads. With EC2 Spot Fleet, you could use automation scripts to move workloads to other available instances (including on-demand instances) for long-running workloads to exist beyond the average lifespan of the Spot Instance. 

Because Spot Instances can be terminated, only applications that can handle interruptions, also known as ‘stateless’ applications, are ideal candidates to run on them. 

Besides the challenge of handling application interruption, engineering teams are also required to develop the failover process in case the Spot Instance is terminated. The failover process is necessary in order to manage and handle the application’s availability.  

EC2 Workload Automation – Spotinst Elastigroup

In order to address the challenges of cloud workload automation on Spot Instances, Spotinst has developed its flagship product, Elastigroup for AWS, a platform in which DevOps engineers can manage, provision and scale compute infrastructure on AWS.  

Spot Instances with SLA

Spotinst Elastigroup leverages AWS excess capacity, in the form of Spot Instances, in order to provide its users with a compute cluster at up to 80% lower cost. Based on historical and statistical data, Spotinst Elastigroup predicts interruptions approximately 15 minutes ahead of time and automatically migrates instances to different instance types and Availability Zones. In cases in which the Spot market is unstable or unavailable for a particular instance type, Spotinst Elastigroup will fall back to an On-Demand instance, in order to ensure high availability and consistency.

Elastigroup will also make sure that the preemption is done gradually to ensure service uptime.

A perfect blend of Spot, RIs and On-Demand

Spotinst Elastigroup’s cost-efficient strategy does not rely solely on Spot Instances but also on ‘RI utilization’ prior to provisioning Spots. That means that if the AWS account has pre-purchased RIs, Elastigroup will first use the already paid-for compute, and only once it is fully utilized will it begin provisioning Spot Instances and On-Demand instances. 
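The ordering described above can be sketched as a simple priority allocation. This is only an illustration of the idea, not Elastigroup’s actual algorithm: the function name, its parameters, and the assumption that capacity is counted in whole, interchangeable instances are all hypothetical.

```python
def plan_capacity(target: int, unused_ris: int, spot_available: int) -> dict:
    """Split a target instance count across purchase options by priority.

    Priority order: already-paid Reserved Instances first, then Spot,
    then On-Demand as the fallback for whatever is still missing.
    """
    if min(target, unused_ris, spot_available) < 0:
        raise ValueError("all arguments must be non-negative")
    reserved = min(target, unused_ris)
    spot = min(target - reserved, spot_available)
    on_demand = target - reserved - spot
    return {"reserved": reserved, "spot": spot, "on_demand": on_demand}


# Hypothetical cluster: 10 instances wanted, 3 unused RIs, 5 Spot Instances obtainable.
plan = plan_capacity(target=10, unused_ris=3, spot_available=5)
print(plan)
```

In this made-up case the two instances that neither RIs nor the Spot market can cover fall back to On-Demand.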

At any given time, Elastigroup automatically scales the application on the best possible mix of instance types – Spot, Reserved, or On-Demand, while guaranteeing a 99.99% SLA.  

Automatic & Predictive Scaling

In terms of infrastructure scaling, Spotinst Elastigroup automatically scales the cluster based on either metrics or events and also offers predictive auto-scaling capabilities.

Spotinst Elastigroup also provides advanced health monitoring: when an EC2 instance is marked as unhealthy, it is scheduled for replacement, and once the new instance is healthy, it is automatically registered to the load balancer.

Bring your Own Tools

Spotinst Elastigroup integrates with many AWS services, including ALB/ELB, ASG, ECS, EKS, EMR, Beanstalk, CodeDeploy, OpsWorks and more. 

With a few clicks, the user can easily provision instances to new or existing clusters, and can also automate the entire process via the API or provisioning tools such as CloudFormation, Ansible, and Terraform.

In addition to that, Spotinst Elastigroup offers integrations with Chef, Rancher, Nomad, Docker Swarm, as well as native Kubernetes operations for pod management and distribution.   

Deep Visibility & Analytics 

Besides managing, provisioning and automating cloud workloads, Spotinst Elastigroup provides users with deeper visibility into their clusters. This visibility comes in the form of an account dashboard that includes information such as a live view of infrastructure costs (potential costs of running on On-Demand instances vs. actual costs of running on Spot Instances, and the savings percentage), running hours per RI, On-Demand and Spot, and a mapping of AWS resources. 

On the cluster level, the user can see the instance distribution per AZ, daily cluster cost, CPU and memory utilization, replacements, and a drill-down into the Spot Market info. 

On top of that, Spotinst Elastigroup includes Elastigroup budgets, a tool that can help users govern and administer their cloud compute spending. 
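The savings percentage in such a view boils down to comparing the potential On-Demand cost with the actual cost. A minimal sketch of that arithmetic follows; the function name and the cost figures are hypothetical and do not come from the Elastigroup dashboard.

```python
def savings_percent(potential_on_demand_cost: float, actual_cost: float) -> float:
    """Percentage saved relative to running the same hours on On-Demand."""
    if potential_on_demand_cost <= 0:
        raise ValueError("potential_on_demand_cost must be positive")
    return (potential_on_demand_cost - actual_cost) / potential_on_demand_cost * 100


# Hypothetical month: 1000 at On-Demand rates, 250 actually paid.
monthly_savings = savings_percent(1000.0, 250.0)
print(f"Savings: {monthly_savings:.1f}%")
```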

Stateful Applications 

As mentioned previously, leveraging Spot Instances in order to lower cloud-compute costs has traditionally been limited to stateless applications that can handle interruptions. 

The concept of data integrity and consistency is crucial when managing workloads. This aspect may be trivial when running with On-Demand instances, but it’s not so trivial while working with EC2 Spot Instances, which are conceptually ephemeral and can be revoked at any given moment.

In order to address this challenge, Spotinst Elastigroup has built-in support for stateful applications, expanding the reach of Spot Instances to additional use cases.

Elastigroup Stateful allows Spotinst’s customers to run stateful workloads that cannot otherwise handle interruptions, such as databases, Elasticsearch and more. 

Container Management – Spotinst Ocean 

Over the past few years, the cloud has been evolving and migrating to a containerized, microservices-based architecture. 

This evolution is expressed in various container management technologies such as ECS and Kubernetes.

Although Spotinst Elastigroup has built-in support for ECS and Kubernetes based clusters, we have decided to revolutionize the way organizations manage their container workloads by providing a dedicated platform.

Spotinst Ocean is our serverless compute engine that abstracts containers from the underlying infrastructure and allows engineering teams to focus their time and efforts on building applications and shipping containers, rather than on selecting VMs, keeping them utilized, and configuring scaling policies for when the application reaches peak traffic.

With Spotinst Ocean, engineering teams no longer need to worry about managing VMs to run their container workloads, as Ocean will always select the most suitable instance type on which the containers will run. 

Pod-driven Autoscaling

Spotinst Ocean is powered by Spotinst’s Pod-driven Auto-Scaler, which watches for pods that are scheduled to run and, if the cluster has insufficient resources, launches a new node to accommodate them. Besides simplifying the scale-up process of the cluster, the Auto-Scaler helps reduce costs by automatically scaling down to the minimum number of instances when resources are not required. Spotinst Ocean also automatically re-schedules pods from under-utilized nodes onto other nodes for higher resource utilization, optimizing the cluster for performance and cost without any action required on the user’s end.
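The general idea of pod-driven scaling can be illustrated with a toy first-fit placement loop. This is a simplified sketch and not Ocean’s actual algorithm: the Pod and Node classes, the fixed node size, and the function names are hypothetical, and real schedulers consider many more constraints.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu: float     # requested vCPUs
    memory: float  # requested GiB


@dataclass
class Node:
    name: str
    cpu: float
    memory: float
    pods: list = field(default_factory=list)

    def fits(self, pod: Pod) -> bool:
        """True when the node still has room for the pod's requests."""
        free_cpu = self.cpu - sum(p.cpu for p in self.pods)
        free_memory = self.memory - sum(p.memory for p in self.pods)
        return pod.cpu <= free_cpu and pod.memory <= free_memory


def schedule(pending: list, nodes: list, node_cpu: float = 4.0,
             node_memory: float = 16.0) -> list:
    """Place each pending pod on the first node with room, adding nodes as needed."""
    for pod in pending:
        if pod.cpu > node_cpu or pod.memory > node_memory:
            raise ValueError(f"pod {pod.name} does not fit on any node size")
        target = next((node for node in nodes if node.fits(pod)), None)
        if target is None:
            # Scale up: no existing node can host this pod.
            target = Node(f"node-{len(nodes) + 1}", node_cpu, node_memory)
            nodes.append(target)
        target.pods.append(pod)
    return nodes


def removable_nodes(nodes: list) -> list:
    """Scale-down candidates: nodes that are no longer running any pod."""
    return [node for node in nodes if not node.pods]


nodes = [Node("node-1", cpu=4.0, memory=16.0)]
pending = [Pod("a", 3.0, 4.0), Pod("b", 2.0, 4.0), Pod("c", 1.0, 2.0)]
schedule(pending, nodes)
for node in nodes:
    print(node.name, [pod.name for pod in node.pods])
```

In this example pod "b" does not fit next to pod "a", so a second node is launched, while pod "c" is packed back onto the first node to keep utilization high.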

Automatic Pod & Instances right-sizing

Spotinst Ocean also provides ‘right-sizing’ capabilities that show engineering teams the actual resource consumption of their pods and recommend the amount of CPU and memory the pods need in order to operate. This helps prevent over-provisioning of the cluster, which over time can decrease the AWS cloud bill.  

With Spotinst Ocean, users gain deeper visibility into their Kubernetes clusters, with a holistic view of the pods that are running or scheduled to run, cost show-back, instance distribution, and more. 

Spotinst Ocean is available for native Kubernetes clusters, EKS, GKE, and soon for ECS.
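One common way to turn observed consumption into a recommendation is to take a high percentile of the usage samples and add some headroom. The sketch below shows that approach under stated assumptions: it is not necessarily how Ocean computes its recommendations, and the function name, the percentile, the headroom and the sample values are all hypothetical.

```python
def recommend_request(samples: list, percentile: int = 95, headroom: float = 0.15) -> float:
    """Recommend a resource request from observed usage samples.

    Uses the nearest-rank percentile of the samples, then adds a safety
    margin so that short bursts above typical usage still fit.
    """
    if not samples:
        raise ValueError("at least one usage sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range 1..100")
    ordered = sorted(samples)
    # Nearest-rank method with integer arithmetic: rank = ceil(percentile * n / 100).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1] * (1 + headroom)


# Hypothetical CPU usage samples for one pod, in millicores.
cpu_samples = [100, 120, 110, 130, 150, 140, 160, 170, 180, 200]
recommended = recommend_request(cpu_samples, percentile=90, headroom=0.25)
print(f"Recommended CPU request: {recommended:.0f}m")
```

If this pod currently requested, say, 500 millicores, the gap between the request and the recommendation would be capacity the cluster pays for without using.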

Please check out this blog for a deeper drill down into Spotinst Ocean’s capabilities, and why it’s the go-to product for running containers in the cloud in the most automated cost-efficient way.

Conclusion 

In this blog post, we have covered the concepts of excess capacity and how AWS’ EC2 Spot Instances are changing the way businesses consume cloud compute infrastructure. 

Leveraging Spot Instances is considered one of the most effective strategies for running cost-efficient workloads in the public cloud. However, managing and orchestrating Spot Instances and applying an automated post-interruption failover process remains an overhead for engineering teams.

Luckily, Spotinst has developed Elastigroup and Ocean in order to tackle these challenges by easing and automating the entire process, so engineering teams can focus their time on what they do best: building applications.