Projects

(Part 0) Deploying Kubernetes' Applications: The Problem

Over the holiday break, I spent a lot of my leisure coding time rethinking the way we deploy applications to Kubernetes. The blog series this post kicks off will explore how we migrated from an overly simplistic deploy strategy to one that gives us the flexibility we need to deploy more complex applications. To ensure a solid foundation, in this first post, we’ll define our requirements for deploying Kubernetes applications and evaluate whether our previous systems and strategies met these requirements (spoiler alert… they didn’t).

(Part 3) Reducing the Cost of Running a Personal k8s Cluster: Conclusion

Overall impact

In parts one and two of this series, we sought to reduce our AWS costs by optimizing our computing, networking, and storage expenditures. Since this post is the final one in the series, let’s consider how we did in aggregate. Before any resource optimizations, we had the following bill:

master ec2 (1 m3.medium): (1 * 0.067 $/hour * 24 * 30) = 48.24
nodes ec2 (2 t2.medium): (2 * 0.0464 $/hour * 24 * 30) = 66.82
master ebs (1 64GB gp2): (1 * 64 * .1 $ per GB/month) = 6.40
nodes ebs (2 128GB gp2): (2 * 128 * .1 $ per GB/month) = 25.60
elb (1 classic): (1 * .025 $/hour * 24 * 30) = 18.00
total: 165.06
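
For anyone who wants to re-run the arithmetic above as prices or instance counts change, here is a minimal Python sketch of the same pre-optimization bill. The hourly and per-GB rates are simply the ones from the line items above, assuming a 30-day month.

```python
# Rough monthly cost model for the pre-optimization cluster (30-day month).
# Rates are the on-demand prices used in the line items above.
HOURS_PER_MONTH = 24 * 30

line_items = {
    "master ec2 (1 m3.medium)": 1 * 0.067 * HOURS_PER_MONTH,
    "nodes ec2 (2 t2.medium)":  2 * 0.0464 * HOURS_PER_MONTH,
    "master ebs (1 64GB gp2)":  1 * 64 * 0.1,   # $0.10 per GB-month
    "nodes ebs (2 128GB gp2)":  2 * 128 * 0.1,
    "elb (1 classic)":          1 * 0.025 * HOURS_PER_MONTH,
}

for name, cost in line_items.items():
    print(f"{name}: ${cost:.2f}")

total = sum(line_items.values())
print(f"total: ${total:.2f} per month (~${total * 12 / 1000:.1f}K per year)")
```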

After our resource optimizations, we have the following bill:

(Part 2) Reducing the Cost of Running a Personal k8s Cluster: Volumes and Load Balancers

In the previous post in this series, we showed how utilizing Spot Instances and Reserved Instances reduces the annual bill for running our Kubernetes cluster from ~$2K to ~$1.2K. In this post, we’ll pursue cost reduction for storage and networking resources, our final two prominent, unoptimized costs. Our quick calculations from the first post in this series show that, with the default Kops configuration, we pay ~$360 annually for EBS (storage) and ~$216 annually for ELBs (networking), for an annual total of just over $500.

(Part 1) Reducing the Cost of Running a Personal k8s Cluster: EC2 Instances

Introduction

In my last blog post, I introduced our goal of decreasing the cost of running a personal k8s cluster and made the case for why doing so is important. We also did some quick calculations showing that EC2 instances are the most expensive part of our cluster, costing ~$115 per month, or ~$1.4K per year. There’s no time like the present to start decreasing EC2 costs, so let’s get down to business.

(Part 0) Reducing the Cost of Running a Personal k8s Cluster: Introduction

For the last couple of months, I’ve spent the majority of my non-work coding time creating a Kubernetes cluster of my own. My central thesis for this work is that Kubernetes is one of the best platforms for individual developers who want to self-host multiple applications with “production” performance needs (e.g. hosting a blog, a private GitLab, a Nextcloud instance). Supporting this thesis requires multiple forms of evidence.

First, we need to show that deploying and maintaining multiple applications with Kubernetes is doable and enjoyable without quitting our jobs and becoming full-time sysadmins for personal projects. Our previous blog posts on deploying this blog via Kubernetes and on setting up monitoring and alerting, as well as much of the work outlined in my personal k8s roadmap, focus on deploying and maintaining multiple applications via Kubernetes, so I hope that we’ve demonstrated, and will continue to demonstrate, Kubernetes’ power and ease.

(Part 4) SLO Implementation: Alerting

I’m pretty excited to be writing this blog post, as it is the final one in our SLO Implementation series.

In this final post, we’ll discuss using Prometheus Alerting Rules and Alertmanager to notify us when our blog is violating its SLO. Adding this alerting ensures we will be aware of any severe issues our users may face, and allows us to minimize the error budget spent by each incident.

Alerting Strategies

Our first step is deciding the algorithm we’ll use to determine when we receive an alert. The Site Reliability Workbook has useful guidance in this area. At the highest level, they state our goal “is to be notified for any significant event: an event that consumes a large fraction of the error budget.” They then define four categories by which we can measure an alerting strategy: Precision (i.e. proportion of events detected that were actually significant), Recall (i.e. proportion of actual significant events detected), Detection time (i.e. how long it takes us to notify during a significant event), and Reset time (i.e. how long after an issue is resolved before the alert stops firing).
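
To make “an event that consumes a large fraction of the error budget” concrete, here is a minimal Python sketch of a burn-rate check in the spirit of the Workbook’s guidance. The 99.9% SLO target, the one-hour lookback, and the 2% budget threshold below are placeholder numbers for illustration, not the values we actually alert on.

```python
# Hypothetical burn-rate check; SLO target, windows, and thresholds are placeholders.
SLO_TARGET = 0.999                 # 99.9% success over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests may fail
BUDGET_WINDOW_HOURS = 30 * 24

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return observed_error_ratio / ERROR_BUDGET

def should_alert(observed_error_ratio: float,
                 lookback_hours: float,
                 max_budget_fraction: float) -> bool:
    """Alert if the lookback window alone spends too much of the 30-day budget."""
    budget_spent = burn_rate(observed_error_ratio) * (lookback_hours / BUDGET_WINDOW_HOURS)
    return budget_spent >= max_budget_fraction

# A 1% error ratio is a 10x burn rate; sustained for one hour it spends ~1.4%
# of the monthly budget, which stays under a 2%-per-hour paging threshold.
print(burn_rate(0.01))                                                  # 10.0
print(should_alert(0.01, lookback_hours=1, max_budget_fraction=0.02))  # False
print(should_alert(0.05, lookback_hours=1, max_budget_fraction=0.02))  # True
```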

(Part 3) SLO Implementation: Deploying Grafana

For the past couple of weeks, our Prometheus cluster has been quietly polling this blog’s web server for metrics. Now that we’re collecting the data, our next job is to make the data provide value. Our data provides value when it assists us in understanding our application’s past and current SLO adherence, and when it improves our actual SLO adherence. In this blog post, we’ll focus on the first of the two aforementioned value propositions. Specifically, we will create visualizations of metrics pertaining to our SLO using Grafana.

(Part 2) SLO Implementation: Prometheus Up & Running

For all of you just itching to deploy another application to your Kubernetes cluster, this post is for you.

In it, I’ll be discussing deploying Prometheus, the foundation of our planned monitoring and alerting, to our Kubernetes cluster. This post covers only getting Prometheus up and running on the cluster; I’ll leave setting up monitoring, alerting, and useful visualizations for a later blog post in the series.

Personal k8s Cluster Roadmap

The Problem

So far, my ideas for experimenting with my personal Kubernetes cluster have been spread out across discrete blog posts. As a result, it’s difficult for me, and I imagine for y’all as readers, to track a prioritized list of projects.

I also think that, in the future, it will be useful for us to be able to easily see which projects have been completed and which have not. Particularly as I plan to accompany all projects with blog posts, having a discoverable enumeration of my personal cluster’s features will give readers an easy way to know on which subjects they can expect guidance.

(Part 1) SLO Implementation: Release the Metrics

In the blog post overviewing our SLO implementation, I listed configuring our blog to expose metrics for Prometheus to scrape as the first step. To fulfill that promise, this post examines the steps required to take our static website and serve it via a production web server that exposes the latency and success metrics our SLO needs.

A brief examination of Prometheus metrics

Application monitoring has two fundamental components: instrumentation and exposition. Instrumentation refers to measuring and recording different quantities and states. Exposition refers to making metrics available to some consumer, which for us is Prometheus. We’ll discuss both in the context of our static website.
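
As a concrete (if simplified) illustration of these two components, the Python sketch below uses the prometheus_client library to instrument a stand-in request handler with a request counter and a latency histogram, then exposes both over HTTP for Prometheus to scrape. Our blog is a static site served by a dedicated web server rather than a Python application, so treat this only as an illustration of instrumentation versus exposition; the metric names and port are hypothetical.

```python
import random
import time

# pip install prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

# Instrumentation: measure and record what happens inside the application.
REQUESTS = Counter(
    "blog_http_requests_total",              # hypothetical metric name
    "HTTP requests served, labeled by status code.",
    ["code"],
)
LATENCY = Histogram(
    "blog_http_request_duration_seconds",    # hypothetical metric name
    "Time spent serving each HTTP request.",
)

@LATENCY.time()
def handle_request() -> None:
    """Stand-in for a real request handler."""
    time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(code="200").inc()

if __name__ == "__main__":
    # Exposition: serve collected metrics at http://localhost:8000/metrics
    # so a Prometheus scrape job can consume them.
    start_http_server(8000)
    while True:
        handle_request()
```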