A client I work with was recently contemplating a move from an EC2 deployment (with some dedicated metal) to a self-hosted or EKS Kubernetes deployment, and was looking for help with the heavy lifting of rebuilding their ecosystem to support a containerized workflow. Their initial concern was the unknown performance hit they would take from running solely on containers, versus their existing deployment.
The platform is fairly simple: one public API, one private API, a cluster of internal workers with high network utilization, and a handful of other AWS services: S3, RDS, SQS, MSK, and Elasticache.
Before I was brought in to consult on this transition, their team had run some basic initial benchmarks, which showed no decrease in performance on a test EKS cluster, and even an INCREASE in performance for the worker cluster, to which I responded…
“Huh…”
This certainly should not be the case*. I was initially suspicious of this claim due to the very nature of virtualization and containerization at large: abstraction.
Regardless of what you run in a container, there should be some amount of measurable overhead versus its closer-to-infrastructure counterpart. Each layer up from bare metal will show some observable performance degradation, due to the various instruction translations that must be passed down through each layer to the actual hardware, especially when dealing with storage and networking.
Example:
hardware<>OS<>containerd<>container
should be observably slower than
hardware<>hypervisor<>VM
and both of these slower than
hardware<>OS
In the case of an EC2 deployment, running code will already be two levels above bare metal, running atop the hypervisor layer, and containers would be three levels up. So I set out to run my own benchmarks of this stack, to verify their initial findings of improved speed and to figure out how this could be possible.
Testing Strategy
There were only a few things I wanted to confirm in my benchmark, all related to timing rather than resource usage, which should be negligible for this stack. I also wanted to make sure I tested a few different scenarios so the difference between each deployment would be obvious, and I categorized them as follows:
1. Bare Metal — baseline top speed with worst scaling flexibility and low reliability
2. VMs/EC2 Instances — reasonable speed with lesser scaling flexibility and lesser reliability
3. ECS — reasonable speed with reasonable scaling flexibility and high reliability
4. EKS — lesser speed with maximum scaling flexibility and high reliability
Since this is an event-driven architecture, I wanted to make sure some specific timings get measured reliably:
1. API response time at 100 requests per second
2. Database transaction time from API to RDS
3. Worker timings from start to finish
4. Round-trip time from API request to response payload with DB fetch
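As a rough illustration of the kind of timing being measured, here is a minimal stdlib-only Python helper that times an arbitrary callable over repeated runs and averages the result. The callable standing in for an API call is hypothetical; the actual load generation was done with Artillery, not this harness.

```python
import statistics
import time

def mean_duration_ms(fn, runs=3):
    """Time `fn` over `runs` calls and return the mean wall-clock
    duration in milliseconds (mirrors the 3-runs-averaged setup)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(durations)

# Toy stand-in for an API round trip: sleep roughly 10 ms
print(f"{mean_duration_ms(lambda: time.sleep(0.01)):.1f} ms")
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic, high-resolution clock intended for exactly this kind of interval measurement.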
Environment Setup
For brevity, I’ll skip the deep details about the specific configuration of each platform, since we’re really just looking to measure general overhead in similarly configured environments, deployed as they normally would be.
The basics:
Metal
Type: a1.metal
OS: Ubuntu 22.04.1
Net: VPC + ELB for ingress
Deployment consists of APIs and workers running on the same single instance.
EC2
Type: 2x m6in.2xlarge
OS: Ubuntu 22.04.1
Net: VPC + ENA + ALB for ingress
Deployment consists of APIs on one node, with workers on a second node.
ECS
Type: 3x m6a.2xlarge
OS: Ubuntu 22.04.1
Net: VPC + ENA + ALB for ingress
Deployment consists of one ECS service and task definition per application service, with each task getting 2 vCPU / 4 GB memory.
EKS
Type: 3x m6a.2xlarge
OS: Amazon Linux 2 + k8s 1.24 + VPC CNI + AWS Operators
Net: VPC + ALB for ingress via ALB Operator
Deployment consists of one pod per service, with each pod getting 2 vCPU / 4 GB memory.
All will be using the same db.m5.xlarge RDS instance running PostgreSQL 13.8, which is cleared between test runs. The deployed software stack is Python 3.9 front-to-back, with multi-threaded workers. Each environment will be tested singularly with newly provisioned assets, and not running at the same time in the same VPC.
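The multi-threaded worker layout can be sketched roughly as below: a small thread pool draining a batch of jobs. The handler here is a hypothetical stand-in; the real workers consume queued jobs and perform database transactions against RDS.

```python
from concurrent.futures import ThreadPoolExecutor

def run_workers(jobs, handler, threads=4):
    """Process jobs concurrently with a small thread pool and return
    results in submission order (ThreadPoolExecutor.map preserves order)."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(handler, jobs))

# Toy usage with a trivial handler
print(run_workers([1, 2, 3], lambda n: n * n))  # → [1, 4, 9]
```

Threads (rather than processes) suit this workload because the workers are I/O-bound on the network and database, where the GIL is released while waiting.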
Test Setup
To test, I created a few simple scenarios with Artillery scaled for 100 requests per second. One scenario will only perform a simple POST on the API, one scenario will trigger a database transaction and return, another will enqueue jobs for workers, and the final one will perform a GET that will return data from the database. These will be run 3 times each, and averaged as the result.
TLDR Results
These are the results I expected to see from a very simplistic benchmark of this type, for a variety of reasons.
We can see that metal is hands-down the fastest, shockingly by 90% in the API tests versus EKS. This is most likely due primarily to the networking interchange with the VPC CNI*, and less to pod overhead, where we can expect to lose something like 30% performance in high-I/O workloads according to other popular benchmarks.
The DB transaction timings are much closer in speed across all platforms, with only a 10.5% gap between metal and EC2, but a 36% gap between metal and EKS. This seems pretty reasonable, and shows that timings between ECS and EKS are much closer as expected.
The worker timings are definitely the most interesting result here. We should expect to see ECS and EKS performing fairly close together due to the similar architecture they employ. Both are container-based with similar resource definitions, and should have similar access to the same levels of networking, but we see a 19.5% difference between the two here. These workers are doing multiple database transactions and a final commit, so this could also be networking related versus a local resource speed issue.
Finally, we see that our API+DB request timings scale almost exactly to the previous API and DB test timings, which is as expected. The one major piece that could skew these as “warm-up” outliers would be database caching, which could drastically reduce these overall timings if repeated tests were performed on the same data, but probably not enough to skew each closer to metal.
Conclusions
The results are clear, but not shocking except in one case, which could definitely be investigated more deeply. For this specific client, it seems it would make more sense to deploy on ECS for the extra ~25% performance benefit. I’m not a huge fan of ECS whatsoever, but my reasoning here is pretty simple, and I would confidently apply it to any hosting situation: any amount of overhead you find in lost performance is cash burned. X% better performance is X% better spending on your hosting costs across the board, if the capabilities are essentially the same.
It’s better to figure out the few pieces of a platform that you need, and find the easiest platform for a team of your size to manage. In this case, EKS doesn’t match what this team needs, which is simply a more flexible and reliable platform to deploy to, one that reduces management overhead. Once all the heavy lifting of getting a solid container workflow in place is out of the way, it will be simple for them to migrate to some flavor of Kubernetes in the future if needed.
*PS: After talking through these results with the client, we were able to resolve the timing differences between my benchmark measurements and theirs. In their testing, they had deployed a PostgreSQL instance directly into the EKS cluster, inadvertently bypassing the networking overhead of EKS+VPC+RDS altogether.
While this possibly resolves specific performance issues you may find with a similar setup, you lose all the benefits of HA, scalability, and monitoring that a hosted database service provides. It’s not something I ever recommend to clients (especially smaller teams), as the management overhead rarely clears the performance-versus-cost hurdle in this type of case.
*NOTE: We have no way to inspect the networking layer in AWS directly to confirm this, but I did at least have my suspicions confirmed by AWS Support directly, with some suggestions to try network configurations other than the VPC CNI and see if there is any improvement. I may re-test this in the future and write a follow-up if there are major improvements. In general, a 42ms response time isn’t objectively “bad”, but next to the alternatives it seems less than ideal. There are also plenty of posts out there detailing networking performance issues on EKS versus other similar Kubernetes hosting platforms and configurations, so this is a known issue. The trade-off is a massive difference in the configurability and flexibility of an EKS cluster versus metal or long-lived EC2 instances.
