This post was originally written in Japanese (link to the original article) and was translated in order to share the great experience that ABEJA, an AI integration platform, had using Elastigroup and Amazon ECS for their complex AI workloads.
By using Spot Instances, EC2 costs can be reduced by around 70-80% compared to On-Demand pricing. However, this comes with the risk that these instances can be terminated with only a two-minute warning. Spotinst is a service that removes this risk. Spotinst has several offerings, but in this post, we will focus mainly on Elastigroup.
Starting off with a conclusion may seem strange, but I’ll cut directly to the point: There is no reason not to use Spotinst thanks to the savings it can bring.
So, What is Elastigroup?
Elastigroup uses a machine learning algorithm to identify spare-capacity instances (such as Spot Instances or Low-Priority VMs) that are about to be terminated by cloud service providers such as AWS or Azure. Prior to termination, Elastigroup automatically migrates the application to a different available Spot Instance, across multiple AZs or instance families, and falls back to On-Demand if no Spot capacity is available, to prevent any downtime. While we were testing this fallback measure, Elastigroup automatically switched to On-Demand several times without our input. Being able to predict Spot terminations around 15-20 minutes in advance and automatically switch to available Spot or On-Demand Instances is a great feature. For the user, it can help remove the risk of Spot Instances and leaves us confident enough to run mission-critical workloads on Spotinst.
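To make the fallback order concrete, here is a minimal sketch of the decision described above: prefer an available Spot market in any AZ or instance family, and fall back to On-Demand only when none qualifies. This is not Elastigroup's actual algorithm. The `Market` type, the `interruption_risk` score, and the `max_risk` threshold are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Market:
    """One purchasable pool: an instance type in one AZ (hypothetical model)."""
    instance_type: str
    availability_zone: str
    lifecycle: str             # "spot" or "on-demand"
    available: bool            # assumed signal: capacity can be obtained right now
    interruption_risk: float   # assumed score in [0, 1]; lower is safer


def choose_replacement(markets: List[Market], max_risk: float = 0.5) -> Optional[Market]:
    """Pick the safest available Spot market, else fall back to On-Demand."""
    spot = [
        m for m in markets
        if m.lifecycle == "spot" and m.available and m.interruption_risk <= max_risk
    ]
    if spot:
        return min(spot, key=lambda m: m.interruption_risk)
    # No acceptable Spot market: fall back to On-Demand to avoid downtime.
    on_demand = [m for m in markets if m.lifecycle == "on-demand" and m.available]
    return on_demand[0] if on_demand else None
```

Calling `choose_replacement` with a list that contains no usable Spot market returns the first available On-Demand entry, which mirrors the fallback behavior we observed during testing.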
ECS Integration and Container-driven Autoscaling
Elastigroup integrates with ECS natively, and that’s the function we use most often at ABEJA. Elastigroup is smart enough to scale the cluster up and down based on the requirements of the ECS Tasks. It detects events caused by insufficient resources (Insufficient CPU, etc.) when ECS deploys containers and automatically scales up the EC2 Instances. Conversely, if there are too many hosts and resources are left over, the cluster is automatically scaled down. This is a very cost-effective solution. If you’re using ECS without Spotinst, there is no automatic function for this, so you must create one yourself. Elastigroup does it automatically.
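As an illustration of what "container-driven" scaling means, the sketch below counts how many new hosts are needed for tasks that cannot be placed on the existing ones. It is a simple first-fit placement written for this post, not Elastigroup's scheduler. The `Task` and `Host` types and the function name are hypothetical; CPU is expressed in ECS CPU units (1024 units per vCPU) and memory in MiB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    cpu: int      # ECS CPU units (1024 units = 1 vCPU)
    memory: int   # MiB


@dataclass
class Host:
    cpu_free: int
    memory_free: int


def instances_to_add(pending: List[Task], hosts: List[Host],
                     new_cpu: int, new_memory: int) -> int:
    """Count the new hosts needed to place every pending task (first fit)."""
    pool = [Host(h.cpu_free, h.memory_free) for h in hosts]  # copy; input is not mutated
    existing = len(pool)
    # Place the largest tasks first so small ones can fill the gaps.
    for task in sorted(pending, key=lambda t: (t.cpu, t.memory), reverse=True):
        if task.cpu > new_cpu or task.memory > new_memory:
            raise ValueError("task does not fit on the chosen instance size")
        target = next(
            (h for h in pool
             if h.cpu_free >= task.cpu and h.memory_free >= task.memory),
            None,
        )
        if target is None:
            # No host has room: this is the "insufficient resources" case.
            target = Host(new_cpu, new_memory)
            pool.append(target)
        target.cpu_free -= task.cpu
        target.memory_free -= task.memory
    return len(pool) - existing
```

The same bookkeeping run in reverse (hosts whose tasks would fit elsewhere) is what makes scale-down possible without starving containers.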
By setting Headroom in Spotinst, it is possible to always have the resources required to schedule new containers. With EC2 Autoscaling, by contrast, scaling is driven by how much of the host CPU / MEM is consumed. When instance types are mixed, this makes it very troublesome to work out how many more containers will actually fit, and the calculation has to be redone every time another instance type is added. Headroom ensures proper resource allocation.
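The difference between host-level CPU / MEM metrics and Headroom can be shown with a small calculation. The sketch below assumes Headroom is expressed as a number of units, each with a CPU and memory size, and that a unit must fit entirely on one host. The function and parameter names are illustrative, not Spotinst's API.

```python
from typing import List, Tuple


def free_units(hosts: List[Tuple[int, int]], cpu_per_unit: int,
               memory_per_unit: int) -> int:
    """Count whole units that fit, host by host (a unit cannot span hosts).

    Each host is a (free_cpu_units, free_memory_mib) pair.
    """
    return sum(
        min(cpu // cpu_per_unit, mem // memory_per_unit)
        for cpu, mem in hosts
    )


def headroom_shortfall(hosts: List[Tuple[int, int]], cpu_per_unit: int,
                       memory_per_unit: int, num_units: int) -> int:
    """Return how many more units of capacity are needed to satisfy the headroom."""
    return max(0, num_units - free_units(hosts, cpu_per_unit, memory_per_unit))
```

For example, two hosts with (1500 CPU units, 4096 MiB) and (600 CPU units, 512 MiB) free have 2100 CPU units and 4608 MiB free in total, which looks like room for four units of 512 CPU / 1024 MiB. Counted per host, only two units actually fit, because the second host is short on memory. This is the kind of gap that aggregate host metrics hide.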
Instance Safe Drain
When terminating an EC2 container host, the containers on that host are still running, so application downtime will occur unless you move them to a different host before deleting it. ECS has a draining function, but you must do the draining yourself before you can terminate the host.
If you’re using Spotinst, this becomes effortless. Elastigroup uses AWS ECS API calls to communicate with the ECS cluster’s scheduler to make sure your desired Tasks and Services are operating as expected. Whenever an EC2 instance is scheduled for replacement, whether due to Scale Down activity or a Spot replacement, Elastigroup invokes deregisterContainerInstance to notify the ECS scheduler and force rescheduling of the containers that run on the host, and it safely drains the instance from the attached Elastic Load Balancers.
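For comparison, this is roughly what doing the drain yourself looks like. The sketch is not how Elastigroup is implemented. It assumes `ecs` is an object with the same methods as a boto3 ECS client (for example `boto3.client("ecs")`); the function name, polling interval, and timeout are hypothetical choices.

```python
import time


def drain_and_deregister(ecs, cluster: str, container_instance_arn: str,
                         poll_seconds: float = 10,
                         timeout_seconds: float = 300) -> bool:
    """Drain one container instance, wait for its tasks to stop, then deregister it.

    Returns True if the instance was empty before deregistration.
    """
    # Ask ECS to stop placing tasks here and to reschedule service tasks elsewhere.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    deadline = time.monotonic() + timeout_seconds
    drained = False
    while True:
        tasks = ecs.list_tasks(
            cluster=cluster, containerInstance=container_instance_arn
        )["taskArns"]
        if not tasks:
            drained = True
            break
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_seconds)
    # Force deregistration only if tasks are still running after the timeout.
    ecs.deregister_container_instance(
        cluster=cluster,
        containerInstance=container_instance_arn,
        force=not drained,
    )
    return drained
```

Only after this returns is it safe to terminate the EC2 instance itself, and the load balancer deregistration still has to be handled separately.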
Blue Green Deployment
Blue Green Deployment is another feature of Elastigroup’s ECS integration. To briefly explain B/G deployments: the current live production environment is called “blue”, while the new environment (with the new version of your software) is called “green”. With Elastigroup, this type of deployment is supported natively, and you can even set a percentage of your Elastigroup to deploy. The new servers must first pass an ELB health check, and then a (configurable) grace period elapses before further servers are deployed.
Quick note: In fact, you can manage on-demand instances with Elastigroup so you can use the B/G function even for On-Demand Instances.
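One way to picture a percentage-based rollout is as a sequence of batches, where each batch has to pass its health check and grace period before the next one starts. The sketch below only computes the batch sizes. The function name and the rounding rule (round up, at least one instance per batch) are assumptions for illustration, not Elastigroup's documented behavior.

```python
import math
from typing import List


def batch_sizes(total_instances: int, batch_percentage: int) -> List[int]:
    """Split a group into rollout batches of roughly batch_percentage each."""
    if not 0 < batch_percentage <= 100:
        raise ValueError("batch_percentage must be in (0, 100]")
    # Assumed rounding: round up so that every batch replaces at least one instance.
    size = max(1, math.ceil(total_instances * batch_percentage / 100))
    batches = []
    remaining = total_instances
    while remaining > 0:
        step = min(size, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

With 10 instances and a 30% batch size this yields four batches, the last one smaller than the others.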
Control the balance between Spot Instances and On-Demand Instances
You can change the ratio with the slider or set the desired number of On-Demand instances.
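The two ways of setting the balance (a ratio, or a fixed number of On-Demand instances) lead to the same kind of split. Here is a minimal sketch, with hypothetical parameter names and an assumed rounding rule that gives any fractional instance to On-Demand:

```python
from typing import Optional, Tuple


def split_capacity(target: int,
                   spot_percentage: Optional[int] = None,
                   on_demand_count: Optional[int] = None) -> Tuple[int, int]:
    """Return (spot, on_demand) instance counts for a target capacity."""
    if (spot_percentage is None) == (on_demand_count is None):
        raise ValueError("set exactly one of spot_percentage or on_demand_count")
    if on_demand_count is not None:
        # Fixed number of On-Demand instances, capped at the target capacity.
        on_demand = min(max(on_demand_count, 0), target)
    else:
        if not 0 <= spot_percentage <= 100:
            raise ValueError("spot_percentage must be between 0 and 100")
        # Assumption: round Spot down, so the remainder goes to On-Demand.
        on_demand = target - (target * spot_percentage) // 100
    return target - on_demand, on_demand
```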
Spot Market Scoring
Variety of Instance Types
All AWS Spot Instances types are supported. The following are some of them:
Can stateful Servers be managed by Spotinst?
I’m not currently using this feature, but it seems a powerful one. My Elastigroups run stateless workloads, but it seems that even a stateful application can use Spot Instances (although some functions may not be available). When I went to re:Invent 2017 and saw Spotinst’s booth, I asked them about stateful support. These are some of the specifications:
- Always keep a Snapshot of Target Instance
- When termination is set to occur, the instance is stopped
- Take an incremental Snapshot again after instance shutdown (this Snapshot is quick because only the differences since the last Snapshot are captured)
- Start up a new instance (resume here)
By the way, there are also options to maintain the Private IP, etc., so even stateful servers may have a chance to reduce costs.
Support is very proactive. There are also extensive help and API documents, but whenever these are not enough or you want a human touch there is always the live chat feature. Even with the live chat going to a real-life human 24/7, the response time is very short.
The pricing model is an interesting one that I don’t often see: the fee is 20% of the amount saved compared to On-Demand costs. Nothing is taken upfront, so it is cheaper than normal On-Demand usage from the start.
In other words, you give up 20% of the savings you would get by using Spot Instances yourself, but (in the reference example above) it is still 64% cheaper than On-Demand.
Taking the Autoscaler, Headroom, etc. into account, Spotinst was cheaper for us even including the 20% fee, and it is much cheaper than developing and implementing an Autoscaler or Headroom yourself.
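The arithmetic behind the 64% figure can be checked with a few lines. The sketch below assumes a Spot discount of 80% off On-Demand (the upper end of the 70-80% range mentioned at the start of this post); the function name and the sample cost of 100 are illustrative.

```python
from typing import Tuple


def cost_with_fee(on_demand_cost: float, spot_discount: float,
                  fee_rate: float = 0.20) -> Tuple[float, float]:
    """Return (total cost, net saving rate vs. On-Demand) with a savings-based fee."""
    spot_cost = on_demand_cost * (1 - spot_discount)
    savings = on_demand_cost - spot_cost
    fee = savings * fee_rate          # the fee is a share of the savings, not of the bill
    total = spot_cost + fee
    return total, 1 - total / on_demand_cost


# Assumed example: 100 of On-Demand spend at an 80% Spot discount.
total, net_saving = cost_with_fee(100.0, 0.80)
print(f"total cost: {total:.2f}, net saving: {net_saving:.0%}")
```

Under these assumptions the Spot cost is 20, the fee is 20% of the 80 saved (16), and the total of 36 is 64% below On-Demand. Because the fee is a fraction of the savings, the net result stays below On-Demand for any positive discount.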
This feature creates a report of your existing, Cloud-Native infrastructure (such as Autoscaling Groups, instances behind Load Balancers, containerized environments, etc.) and the potential savings that could be achieved if you were using Spotinst for them. To begin saving on these, it is often a simple process taking less than 2 minutes!
Import Existing Resources
There is also an import function to integrate with the following tools and to migrate those existing resources automatically to Spotinst.
There is no reason not to use Spotinst thanks to the savings it can bring.
Check out the original Japanese blog post by Abeja here