Minimalistic Monitoring and Alerting for your Kubernetes Cluster with Prometheus and Alertmanager

Prometheus and Grafana are incredibly powerful tools for website and application monitoring. Unfortunately, with great power comes great complexity and, if not properly controlled, ballooning costs. In this article I am going to cover setting up a minimal instance of Prometheus and Alertmanager to monitor a web application and do basic alerting on one metric.

Author
Humberto Evans

I wrote this guide after I found that most guides on getting started with Prometheus led to a bloated setup that would set you up for failure in the future. I wanted to receive an alert on my phone when one or more of our Kubernetes nodes was running low on resources. Should be easy, right? Most guides I found about website monitoring with Prometheus had me scrape every available metric. With that configuration, a Prometheus instance will quickly consume more resources than the app you are monitoring.

The "ingest everything and worry about it later approach" can work for some situations. But at Heii On-Call we prefer to keep things minimal to start and grow as the application requires it. This is generally good engineering practice.

The best way to control Prometheus is to keep a tight, well-documented leash on the metrics you ingest. If you start this process now, and ingest only critical metrics that you actually intend to use, the amount of money you spend on Prometheus itself can stay quite small.

What is Prometheus?

Prometheus is an open source monitoring tool. It works by scraping targets over HTTP and collecting the metrics those targets expose in Prometheus's format. Prometheus is only the agent that collects and stores metrics. It is not responsible for generating the metrics. Those are application specific and will be different depending on what you are monitoring. For our use case, we are monitoring nodes running in a Kubernetes cluster. I won't spend too much time covering how Prometheus works. I highly recommend reading their official documentation to learn more.

Minimal Prometheus with Minimal Config

Let's start with a minimal Prometheus configuration. First, create a namespace where the monitoring infrastructure will live. Creating a separate namespace lets you isolate different parts of your production system.

kubectl create namespace monitoring

Next we create a ConfigMap to store the configuration for Prometheus. Create a file called prometheus-config.yaml and add the following:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  namespace: monitoring
  labels:
    name: prometheus-server-conf
data:
  prometheus.rules: |-
    groups: []
  prometheus.yml: |
    global:
      scrape_interval: 10s
      evaluation_interval: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers: []
    scrape_configs: []

Notice that the keys scrape_configs, alertmanagers, and the groups for prometheus.rules are all empty. With this configuration Prometheus will start, but it will not scrape any targets.

Now apply that configuration to your cluster:

kubectl apply -f prometheus-config.yaml

Now, let's create the Prometheus Deployment and Service. Create a file called prometheus.yaml and add the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - "--storage.tsdb.retention.time=12h"
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          resources:
            requests:
              memory: 32Mi
            limits:
              memory: 128Mi
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-storage-volume
          emptyDir:
            sizeLimit: 500Mi

---

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090

A few things to note:

  • We are choosing a 12h retention period. This is a reasonable default, but you can change it as your needs grow.
  • We are only requesting 32 MiB of RAM and limiting it to 128 MiB. This is a relatively small Prometheus deployment that will do just fine when you are getting started. Remember our golden rule for Prometheus: only ingest metrics you are actually using.

Create the deployment and the service by running:

kubectl apply -f prometheus.yaml

Verify that you have a running Prometheus pod by checking the running pods in the namespace you created:

kubectl get pods -n monitoring

At this point you can also check the Prometheus web UI by port forwarding to the new service:

kubectl port-forward -n monitoring svc/prometheus 9090:9090

Then in your browser visit localhost:9090. It should look like this:

Empty Prometheus

Click on Status -> Targets and notice that there are no targets to scrape.

Prometheus with no scrape targets

Find your first metric

We conceived of this guide when we wanted to monitor and alert on memory usage on our Kubernetes nodes, so our first metric will be the total memory used. In Kubernetes, node level metrics are exposed by the kubelet at the /metrics/cadvisor endpoint on each node. The metrics exposed here are cadvisor metrics. You can find their excellent documentation here. Based on this we know what we want to monitor is container_memory_working_set_bytes, which tracks the memory used by containers that cannot be evicted under memory pressure (coincidentally also what the OOM killer is looking for).

Before we dive into having Prometheus ingest metrics, we are going to hit the metrics endpoint we want to scrape. This is important to get a sense of which metrics are being exposed by the agent. In our case we are going to be using the Kubernetes API to proxy through to the node and query the cadvisor metrics. I do not recommend ingesting any metrics endpoint you have not manually queried yourself.

From your command line run:

kubectl proxy

This proxies the Kubernetes API to port 8001 on your computer. Then in your browser go to this URL (replacing NODE_NAME with the name of a node):

http://localhost:8001/api/v1/nodes/{NODE_NAME}/proxy/metrics/cadvisor
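
If you are not sure what your nodes are called, you can list them with:

kubectl get nodes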

You should see all the metrics that the node is making available. Do a find for container_memory_working_set_bytes and observe the metrics that are available. Note how cadvisor is reporting metrics for every cgroup in the hierarchy. This means if you are not careful it is very easy to accidentally double count or look at the wrong thing later when you are aggregating metrics. For now, we are only interested in the total RAM being used by all the running containers, which means we can safely ingest only one metric per node. If you wish to monitor metrics about individual pods or containers you will need to be more nuanced about which metrics you ingest.
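
For illustration, the relevant lines look roughly like the following (these values and cgroup paths are made up; yours will differ). Note the single entry with id="/" that covers the whole node, alongside entries for deeper cgroups:

container_memory_working_set_bytes{id="/"} 2.473946112e+09
container_memory_working_set_bytes{id="/kubepods.slice"} 1.29236992e+09
container_memory_working_set_bytes{id="/kubepods.slice/kubepods-burstable.slice"} 8.3886080e+08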

Ingest metrics for monitoring with Prometheus

When we hit the metrics endpoint from our computer, we were using a port forward from the kubectl CLI that has your admin level permissions. In order to allow the Prometheus pod to discover which nodes are available in the cluster and be able to query the Kubernetes API, we need to give it RBAC permissions. We therefore need to add a ClusterRole that will let pods in our namespace query the nodes/proxy resource. Create a file called prometheus-cluster-role.yaml and add the following:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  verbs: ["get", "list", "watch"]

---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring

Then apply the cluster role:

kubectl apply -f prometheus-cluster-role.yaml

Note that we are granting the namespace access to the nodes/proxy resources as well as the nodes resource. We will be using the nodes/proxy endpoint to query the nodes, and we need the nodes endpoint to list the available nodes (more on why we need this later). If you are using Prometheus built-in service discovery for other resources, like endpoints or pods, you will need to add those resources to the list as well.
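
If you want to sanity check the permissions before moving on, kubectl can impersonate the service account that Prometheus runs as (the default service account in the monitoring namespace) and test the access:

kubectl auth can-i get nodes/proxy --as=system:serviceaccount:monitoring:default

This should print yes.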

Now let's add configuration for Prometheus to start ingesting metrics. Open up prometheus-config.yaml and replace the empty scrape_configs list with the following entry.

    scrape_configs:
      - job_name: 'node-cadvisor'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        metric_relabel_configs:
          - source_labels: [__name__,id]
            regex: container_memory_working_set_bytes;/$
            action: keep

We are adding a new job called 'node-cadvisor'. You can add multiple jobs in the future to scrape other metrics.

Understanding the Prometheus scrape job configuration

Let’s break down the configuration block above to understand what each section is doing. First we set:

       kubernetes_sd_configs:
          - role: node

This block is telling Prometheus to use the Kubernetes service discovery mechanism (that's what the sd is for). Specifying role: node tells Prometheus to discover nodes. Prometheus Kubernetes service discovery will discover one target per node in your cluster and process each one. There are a few different types of roles available and I highly encourage you to peruse the official documentation. The service and pod roles can be particularly useful.

The tls_config and bearer_token_file blocks are necessary for Prometheus to be able to hit the Kubernetes API over HTTPS.

The relabel_configs block is where you configure which targets get scraped, as well as how the label set for each target is used. Relabeling steps apply in order as defined from top to bottom. Relabeling is an extremely powerful tool, so it is worth taking some time to understand its capabilities. By default, every target has a set of labels associated with it:

  • __address__ is set to the host:port of the target. In our case of Kubernetes node discovery it is set to the internal node IP and kubelet port
  • __metrics_path__ defaults to /metrics but can be changed in the scrape_config configuration
  • Other __meta_* labels are usually set by the service discovery mechanism

A particularly powerful feature is that inside relabel_configs you can alter both __address__ and __metrics_path__ to change how each target is scraped. Our first relabel block:

          - target_label: __address__
            replacement: kubernetes.default.svc:443

Sets the address to the constant kubernetes.default.svc:443. This address is always available to any pod and points at the Kubernetes API.

The second relabel step is a bit more complicated:

          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

Here we want to replace __metrics_path__ with the path to the metrics for our node. In each relabel_configs step Prometheus takes the source_labels and concatenates their values using a separator (";" by default). In this case, since we only have one label, it is just the value of __meta_kubernetes_node_name set by the service discovery mechanism. The regex lets you match and extract a portion of that concatenated value; here we use (.+), which matches the whole thing. Then the target label is set to whatever is in replacement, with any matching groups from the regular expression substituted in. Since our regex captures the whole node name as a group, ${1} is replaced with the node name. This set of relabels means Prometheus will attempt to scrape https://kubernetes.default.svc:443/api/v1/nodes/{NODE_NAME}/proxy/metrics/cadvisor for every node discovered by the service discovery mechanism.
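
To make this concrete, here is roughly what the relevant target labels would look like for a hypothetical node named worker-1 (the IP and port are made up). As discovered by the node role:

__address__                  = 10.0.0.12:10250
__metrics_path__             = /metrics
__meta_kubernetes_node_name  = worker-1

And after the two relabel steps:

__address__                  = kubernetes.default.svc:443
__metrics_path__             = /api/v1/nodes/worker-1/proxy/metrics/cadvisor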

I highly recommend taking a read through the relabel config documentation. It is a bit dense, but once you get the hang of it there is practically nothing you can't make it do for each target. Note that you can also use the keep and drop actions to drop targets entirely, to prevent scraping parts of your system you are not interested in monitoring.
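
As a sketch, a keep rule like the following would restrict scraping to nodes carrying a hypothetical monitor=true Kubernetes label (node labels are exposed to relabeling as __meta_kubernetes_node_label_<labelname>):

          - source_labels: [__meta_kubernetes_node_label_monitor]
            regex: "true"
            action: keep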

Now that we have defined how to scrape each target, the metric_relabel_configs block lets us relabel individual metrics. This is the mechanism we will use to ingest the single metric we are interested in and ignore the rest. If we simply left out the metric_relabel_configs block, Prometheus would ingest every metric exposed by the target.

        metric_relabel_configs:
          - source_labels: [__name__,id]
            regex: container_memory_working_set_bytes;/$
            action: keep

The grammar for metric_relabel_configs is the same as the target relabel_configs. Each item in metric_relabel_configs is applied from top to bottom, and the set of metrics left at the end is ingested into Prometheus. Our goal is to ingest only container_memory_working_set_bytes for the top level cgroup, which is the series whose id label is exactly "/". To do that we use source_labels to concatenate the metric name and the value of the id label (separated by a semicolon), and then the regex keeps only the one series we are interested in.
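
To see how the rule evaluates, here are a few hypothetical series along with the concatenated source_labels value the regex is tested against:

container_memory_working_set_bytes{id="/"}                ->  "container_memory_working_set_bytes;/"                (kept)
container_memory_working_set_bytes{id="/kubepods.slice"}  ->  "container_memory_working_set_bytes;/kubepods.slice"  (dropped)
container_cpu_usage_seconds_total{id="/"}                 ->  "container_cpu_usage_seconds_total;/"                 (dropped)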

This step again gives you a lot of power to do pretty much anything you want with labels. A common use case is to drop or keep individual labels using the labeldrop and labelkeep actions. If you do this, make sure your metrics remain unique after any label drops or renames; a non-unique metric will cause bad data to be ingested.

Now save the file and apply the changes to the config.

kubectl apply -f prometheus-config.yaml

This pushes the new config to Kubernetes, but we still need to tell Prometheus to reload the configuration.

First start up a port forward to Prometheus again (if you killed the old one):

kubectl port-forward -n monitoring svc/prometheus 9090:9090

Then do

curl -X POST http://localhost:9090/-/reload

The /-/reload endpoint is a special URL we enabled with the --web.enable-lifecycle flag in the Prometheus deployment; it lets us reload the configuration via a POST request. If you do not wish to enable this feature you can simply restart the deployment.
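
If you prefer to restart instead of reloading, something like this does the trick:

kubectl rollout restart deployment/prometheus -n monitoring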

Prometheus should now be recognizing the targets and ingesting metrics.

Now visit http://localhost:9090/ in your browser and click on Status -> Targets to verify the node-cadvisor job is up.

Prometheus with cadvisor target

Now click on Graph and enter container_memory_working_set_bytes in the search bar to see a graph of the metric over time.

Prometheus node RAM graph
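
If you find raw byte counts hard to read, you can do the unit conversion directly in the expression, for example to plot the same data in gibibytes:

container_memory_working_set_bytes / 1024 / 1024 / 1024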

Set up Heii On-Call to receive notifications

Now let's set up Heii On-Call to receive the alerts for our new monitor. In your Heii On-Call organization create a new trigger of type Manual and give it a name. In this case we’ll call it “Cluster RAM Alert”:

Heii On-Call manual trigger form

This creates a trigger that can be fired via a POST request to a URL, which is exactly what Prometheus Alertmanager can be set up to do.

Heii On-Call trigger

Click Create Trigger and Heii On-Call will create a trigger for you and give you its ID.

You will also need to create an API key for Prometheus to authenticate with. Refer to the documentation for API keys.

Set up Alertmanager for alerts and notifications

Now that we have data and a trigger to hit, we can set up Alertmanager so we can receive alerts. Alertmanager is the component of the Prometheus ecosystem responsible for grouping and deduplicating firing alerts and sending them to the external systems you desire. Create a new file called alertmanager-config.yaml and add the following:

kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  config.yml: |-
    global:
    route:
      receiver: heii-on-call
      group_wait: 10s
      repeat_interval: 30m
      routes: []

    receivers:
    - name: heii-on-call
      webhook_configs:
        - url: "https://api.heiioncall.com/triggers/YOUR-TRIGGER-ID-HERE/alert"
          http_config:
            follow_redirects: false
            authorization:
              credentials: "YOUR-HEII-ON-CALL-API-KEY-HERE" 

This is a minimal configuration that will send every triggering alert to this single trigger. This is a great way to get started, but once you have more alerts and more teams you can map different alerts to different triggers.
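
As a rough sketch of what that could look like later on (the DatabaseDown alert name and database-team receiver here are made up), you would replace the empty routes list with entries that match on alert labels:

      routes:
        - matchers:
            - alertname="DatabaseDown"
          receiver: database-team

Each extra receiver then gets its own entry under receivers, pointing at its own Heii On-Call trigger URL.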

Apply the configuration with kubectl apply -f alertmanager-config.yaml. Then create a file called alertmanager.yaml and add the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:latest
        args:
          - "--config.file=/etc/alertmanager/config.yml"
          - "--storage.path=/alertmanager"
        ports:
          - containerPort: 9093
        resources:
            requests:
              memory: 16Mi
            limits:
              memory: 32Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: alertmanager
        emptyDir: {}

---

apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  type: ClusterIP
  ports:
    - port: 9093
      targetPort: 9093

This creates a Deployment and a Service for Alertmanager, just like we did for Prometheus, with the Service exposed on port 9093 inside the cluster. Note again that we are constraining the resources tightly at first, because that is all Alertmanager should really need for a small deployment. Alertmanager also has a web interface; if you wish, you can check it out through a port forward:
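
kubectl port-forward -n monitoring svc/alertmanager 9093:9093

Then visit localhost:9093 in your browser.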

Define an alert for our metric

Now we can define an alert rule for Prometheus. Open the prometheus-config.yaml file and change the alerting section to add one Alertmanager configuration block:

    alerting:
      alertmanagers:
        - scheme: http
          static_configs:
            - targets:
              - "alertmanager:9093"

This merely tells Prometheus about the Alertmanager service we deployed earlier.

Then, also within prometheus-config.yaml, change the prometheus.rules key to add a single group with a single rule:

  prometheus.rules: |-
    groups:
    - name: Node Memory is too high
      rules:
      - alert: High Node Memory
        expr: container_memory_working_set_bytes > 3.8e9
        for: 1m

This sets up an alert called High Node Memory that fires whenever container_memory_working_set_bytes stays above 3.8GB for 1 minute. The expr here can be any PromQL expression, so you have enormous flexibility in what combination of metrics you alert on. You should change 3.8e9 to the threshold above which you want to alert; this example runs on a 4GB machine, so we set our threshold at 3.8GB.
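
Alert rules can also carry labels and annotations, which Prometheus attaches to the alert and Alertmanager passes along to its receivers. A sketch of the same rule with a severity label and a summary annotation (both names are just examples) would be:

      - alert: High Node Memory
        expr: container_memory_working_set_bytes > 3.8e9
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "A Kubernetes node has been using more than 3.8GB of RAM for over a minute"

Labels like severity become handy later if you want Alertmanager to route different alerts to different triggers.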

Apply the updated configuration and reload Prometheus like before:
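
kubectl apply -f prometheus-config.yaml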

curl -X POST http://localhost:9090/-/reload

If your port forward is still active you can go to http://localhost:9090/alerts and you can see the alert we added:

Prometheus high RAM alert

Done

That is it! With this minimal setup you can enjoy lightweight alerting on critical metrics, and you can easily expand it to CPU metrics, pod- or endpoint-specific metrics, and even application-level metrics. With very little overhead you can now receive push notifications, or configure different on-call rotation schedules to handle the alerts. By being specific about ingesting only important metrics, you can gain all the visibility and monitoring you need, without the ongoing DevOps resources and financial costs of a much heavier ingest-everything approach.

Happy Monitoring!