Introduction In my last blog post, I introduced our goal of decreasing the cost of running a personal k8s cluster, and made the case for why decreasing the cost is important. We also did some quick calculations which showed that EC2 instances are the most expensive part of our cluster, costing ~$115 per month or ~$1.4K per year. There’s no time like the present to actually start decreasing EC2 costs, so let’s get down to business.
For the last couple of months, I’ve spent the majority of my non-work coding time creating a Kubernetes of my own. My central thesis for this work is that Kubernetes is one of the best platforms for individual developers who want to self-host multiple applications with “production” performance needs (e.g. hosting a blog, a private Gitlab, a NextCloud, etc.). Supporting this thesis requires multiple forms of evidence.
I’m pretty excited to be writing this blog post, as it is the final one in our SLO Implementation series. In this final post, we’ll discuss using Prometheus Alerting Rules and Alertmanager to notify us when our blog is violating its SLO. Adding this alerting ensures we will be aware of any severe issues our users may face, and allows us to minimize the error budget spent by each incident.
For the past couple of weeks, our Prometheus cluster has been quietly polling this blog’s web server for metrics. Now that we’re collecting the data, our next job is to make the data provide value. Our data provides value when it helps us understand our application’s past and current SLO adherence, and when it improves our actual SLO adherence. In this blog post, we’ll focus on the first of these two value propositions.
For all of you just itching to deploy another application to your Kubernetes cluster, this post is for you. In it, I’ll be discussing deploying Prometheus, the foundation of our planned monitoring and alerting, to our Kubernetes cluster. This post will only discuss getting the Prometheus cluster running on our Kubernetes cluster. I’ll leave setting up monitoring, alerting, and useful visualizations for a later blog post in the series.
The Problem So far, my ideas for experimenting with my personal Kubernetes cluster have been spread out across discrete blog posts. As a result, it’s difficult for me, and I imagine y’all as the readers, to track a prioritized list of projects. I also think that, in the future, it will be useful for us to be able to easily see which projects have been completed and which have not.
In the blog post overviewing our SLO implementation, I listed configuring our blog to expose the metrics for Prometheus to scrape as the first step. To fulfill that promise, this post examines the necessary steps for taking our static website and serving it via a production web server which exposes the latency and success metrics our SLO needs. A brief examination of Prometheus metrics Application monitoring has two fundamental components: instrumentation and exposition.
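To make the instrumentation/exposition split concrete, here is a minimal, standard-library-only sketch. It is not the blog's actual setup: the `Counter` class and the `blog_http_requests_total` metric name are hypothetical, made up for illustration, and a real deployment would more likely use an existing Prometheus client library or a web server exporter. Instrumentation is the `inc` call the application makes when something happens; exposition is the `expose` call that renders the current values in Prometheus' text format for scraping.

```python
from collections import defaultdict


class Counter:
    """A hypothetical, hand-rolled counter metric keyed by label values."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self.values = defaultdict(float)

    def inc(self, *label_values, amount=1.0):
        # Instrumentation: application code calls this when an event occurs.
        if len(label_values) != len(self.label_names):
            raise ValueError("expected one value per label name")
        self.values[label_values] += amount

    def expose(self):
        # Exposition: render the current state in Prometheus' text format.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self.values.items()):
            pairs = zip(self.label_names, label_values)
            labels = ",".join(f'{k}="{v}"' for k, v in pairs)
            suffix = f"{{{labels}}}" if labels else ""
            lines.append(f"{self.name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


# Assumed metric name and label, for illustration only.
requests_total = Counter(
    "blog_http_requests_total", "Total HTTP requests served.", ["code"]
)
requests_total.inc("200")
requests_total.inc("200")
requests_total.inc("500")
exposition = requests_total.expose()
print(exposition)
```

In this sketch, a scrape of the metrics endpoint would simply return the `exposition` string. Prometheus then stores each labelled line as a separate time series.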
My last two blog posts enumerated this blog’s SLO and error budget. Our next logical step is adding the monitoring and alerting infrastructure which will transform our SLO usage from theoretical to practical. Like creating a Kubernetes of One’s Own, this project contains multiple steps which we’ll explore over multiple blog posts. While this series focuses on achieving this goal for this blog’s specific SLO, the techniques are applicable to many scenarios.
In my last blog post, I publicized an SLO for this blog. I also mentioned that, in the future, I’d couple the SLO with an error budget and error budget policy. Well, the future is today, because this post will define error budgets and error budget policies and their benefits, before proposing a specific error budget and error budget policy to accompany our previously defined SLO. What are Error Budgets and Error Budget Policies?
Background I recently started reading The Site Reliability Workbook, which is the companion book to the excellent Site Reliability Engineering: How Google Runs Production Systems. These books devote considerable attention to Service Level Objectives (SLOs), which are a way of defining a given level of service that users can expect. More technically, an SLO is a collection of Service Level Indicators (SLIs), metrics that measure whether our service is providing value, and their acceptable ranges.
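One way to picture "a collection of SLIs and their acceptable ranges" is as a small data structure. The sketch below is only an illustration: the `SLI` and `SLO` classes, the indicator names, and the thresholds are all hypothetical and are not this blog's actual SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLI:
    """A single indicator plus the range of values we consider acceptable."""

    name: str
    minimum: float  # lowest acceptable value, inclusive
    maximum: float  # highest acceptable value, inclusive

    def is_met(self, measured: float) -> bool:
        return self.minimum <= measured <= self.maximum


@dataclass(frozen=True)
class SLO:
    """A collection of SLIs; the objective holds only if every SLI is in range."""

    slis: tuple

    def is_met(self, measurements: dict) -> bool:
        return all(sli.is_met(measurements[sli.name]) for sli in self.slis)


# Hypothetical indicators and thresholds, chosen only for illustration.
example_slo = SLO(
    slis=(
        SLI("success_ratio", minimum=0.99, maximum=1.0),
        SLI("p99_latency_seconds", minimum=0.0, maximum=0.4),
    )
)

healthy = example_slo.is_met(
    {"success_ratio": 0.995, "p99_latency_seconds": 0.25}
)
degraded = example_slo.is_met(
    {"success_ratio": 0.97, "p99_latency_seconds": 0.25}
)
print(healthy, degraded)
```

Here `healthy` is true because both measurements fall inside their ranges, while `degraded` is false because the success ratio drops below its assumed minimum.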