Prometheus and Grafana are incredibly powerful tools for website and application monitoring. Unfortunately with great power comes great complexity, and, if not properly controlled, ballooning costs. In this article I am going to cover setting up a minimal instance of Prometheus and Alertmanager to monitor a web application and do basic alerting on one metric.
I wrote this guide after I found that most guides on getting started with Prometheus led to a bloated setup that would set you up for failure in the future. I wanted to receive an alert on my phone when one or more of our Kubernetes nodes was running low on resources. Should be easy, right? Most guides I found about website monitoring with Prometheus had me scrape every possible metric. With that configuration, a Prometheus instance will quickly consume more resources than the app you are monitoring.
The "ingest everything and worry about it later approach" can work for some situations. But at Heii On-Call we prefer to keep things minimal to start and grow as the application requires it. This is generally good engineering practice.
The best way to control Prometheus is to keep a very tight, well-documented leash on the metrics you ingest. If you start this practice now and ingest only the critical metrics you actually intend to use, the amount of money you spend on Prometheus itself can stay quite small.
Prometheus is an open source monitoring tool. It works by scraping targets over HTTP and collecting the metrics those targets expose in Prometheus's text format. Prometheus is only the agent that collects and stores metrics; it is not responsible for generating them. Metrics are application specific and will differ depending on what you are monitoring. For our use case, we are monitoring nodes running in a Kubernetes cluster. I won't spend too much time covering how Prometheus works; I highly recommend reading the official documentation to learn more.
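For a sense of what that text format looks like, here are a few illustrative lines as a target might expose them (the metric name is a standard one from Prometheus client libraries; the value is made up):

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.5056e+07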
Let's start with a minimal Prometheus configuration. First, create a namespace where the monitoring infrastructure will live. Creating a separate namespace lets you isolate different parts of your production system.
kubectl create namespace monitoring
Next we create a ConfigMap to store the configuration for Prometheus. Create a file called prometheus-config.yaml and add the following:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.rules: |-
    groups: []
  prometheus.yml: |
    global:
      scrape_interval: 10s
      evaluation_interval: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers: []
    scrape_configs: []
Notice that the keys scrape_configs, alertmanagers, and the groups for prometheus.rules are all empty. With this configuration Prometheus will start, but it will not scrape any targets.
Now apply that configuration to your cluster:
kubectl apply -f prometheus-config.yaml
Now, let's create the Prometheus Deployment and Service. Create a file called prometheus.yaml and add the following:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - "--storage.tsdb.retention.time=12h"
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          resources:
            requests:
              memory: 32Mi
            limits:
              memory: 128Mi
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-storage-volume
          emptyDir:
            sizeLimit: 500Mi
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
A few things to note about this manifest:

- --storage.tsdb.retention.time=12h keeps only 12 hours of data, which keeps storage needs small.
- --web.enable-lifecycle lets us reload the configuration later with an HTTP POST instead of restarting the pod.
- The memory request and limit (32Mi and 128Mi) are deliberately tiny; with only a handful of metrics ingested, Prometheus does not need much.
- Storage is an emptyDir volume capped at 500Mi, so metrics do not survive a pod restart. That is fine for alerting on live data; add a PersistentVolume later if you need durable history.
Create the deployment and the service by running:
kubectl apply -f prometheus.yaml
Verify that you have a running Prometheus pod by checking the running pods in the namespace you created:
kubectl get pods -n monitoring
At this point you can also check the Prometheus web UI by port forwarding to the new service:
kubectl port-forward -n monitoring svc/prometheus 9090:9090
Then visit http://localhost:9090 in your browser and you should see the Prometheus web UI.
Click on Status -> Targets and notice that there are no targets to scrape yet.
We conceived of this guide when we wanted to monitor and alert on memory usage on our Kubernetes nodes, so our first metric will be the total memory used. In Kubernetes, node-level metrics are exposed by the kubelet at the /metrics/cadvisor endpoint on each node. The metrics exposed there are cAdvisor metrics, and cAdvisor's documentation covers them well. Based on this, we know the metric we want to monitor is container_memory_working_set_bytes, which tracks the memory used by containers that cannot be evicted under memory pressure (coincidentally, also what the OOM killer looks at).
Before we dive into having Prometheus ingest metrics, we are going to hit the metrics endpoint we want to scrape. This is important to get a sense of which metrics are being exposed by the agent. In our case we are going to be using the Kubernetes API to proxy through to the node and query the cadvisor metrics. I do not recommend ingesting any metrics endpoint you have not manually queried yourself.
From your command line run:
kubectl proxy
This starts a local proxy to the Kubernetes API on port 8001 of your computer. Then in your browser go to this URL (replacing NODE_NAME with the name of a node):
http://localhost:8001/api/v1/nodes/{NODE_NAME}/proxy/metrics/cadvisor
You should see all the metrics that the node is making available. Do a find for container_memory_working_set_bytes and observe the metrics that are available. Note how cAdvisor reports metrics for every cgroup in the hierarchy. This means that if you are not careful, it is very easy to accidentally double count or look at the wrong thing later when you are aggregating metrics. For now, we are only interested in the total RAM used by all the running containers, which means we can safely ingest only one metric per node. If you wish to monitor metrics about individual pods or containers, you will need to be more nuanced about which metrics you ingest.
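To make the cgroup hierarchy concrete, the lines you find will look roughly like the following (trimmed to the id label, with made-up values); only the id="/" series represents the node-wide total:

container_memory_working_set_bytes{id="/"} 2.93464e+09
container_memory_working_set_bytes{id="/kubepods"} 2.14820e+09
container_memory_working_set_bytes{id="/kubepods/burstable"} 1.07374e+09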
When we hit the metrics endpoint from our computer, we were going through kubectl proxy, which carries your admin-level permissions. In order for the Prometheus pod to discover which nodes are available in the cluster and to query the Kubernetes API itself, we need to give it RBAC permissions. We therefore add a ClusterRole that lets pods in our namespace query the nodes/proxy resource. Create a file called prometheus-cluster-role.yaml and add the following:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: default
    namespace: monitoring
Then apply the cluster role.
kubectl apply -f prometheus-cluster-role.yaml
Note that we are granting the default service account in the monitoring namespace access to the nodes/proxy resource as well as the nodes resource. We will use the nodes/proxy endpoint to query the nodes, and we need the nodes endpoint to list the available nodes (more on why we need this later). If you are using Prometheus's built-in service discovery for other resources, like endpoints or pods, you will need to add those resources to the list as well, as sketched below.
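For example, here is a sketch of what the rules section could grow into if you later add pod and endpoint discovery (grant only what you actually use):

rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]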
Now let's add configuration for Prometheus to start ingesting metrics. Open up prometheus-config.yaml and change the scrape_configs map to add the following entry:
scrape_configs:
  - job_name: 'node-cadvisor'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    metric_relabel_configs:
      - source_labels: [__name__,id]
        regex: container_memory_working_set_bytes;/$
        action: keep
We are adding a new job called 'node-cadvisor'. You can add multiple jobs in the future to scrape other metrics.
Let’s break down the configuration block above to understand what each section is doing. First we set:
kubernetes_sd_configs:
  - role: node
This block tells Prometheus to use its Kubernetes service discovery mechanism (that's what the sd stands for). Specifying role: node tells Prometheus to discover nodes: Kubernetes service discovery will produce one target per node in your cluster and process each one. There are a few different role types available, and I highly encourage you to peruse the official documentation; the service and pod roles can be particularly useful, as sketched below.
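As a taste of what that looks like, here is a minimal sketch of a pod-role job (the job name and the prometheus.io/scrape annotation are assumptions for illustration; the annotation is a common convention, not something built into Kubernetes) that keeps only pods opting in via that annotation:

- job_name: 'my-app-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      regex: "true"
      action: keep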
The tls_config and bearer_token_file settings are what allow Prometheus to talk to the Kubernetes API over HTTPS, authenticating with the service account token mounted into the pod.
The relabel_configs block is where you configure which targets get scraped, as well as how the label set for each target is built. Relabeling steps are applied in order, from top to bottom. Relabeling is an extremely powerful tool, so it is worth taking some time to understand its capabilities. By default, every target has a set of labels associated with it:

- __address__ is set to the host:port of the discovered target
- __metrics_path__ defaults to /metrics but can be changed in the scrape_config configuration
- __meta_* labels are usually set by the service discovery mechanism

The extremely powerful feature is that inside relabel_configs you can alter both __address__ and __metrics_path__ to change how the target is scraped! Our first relabel block:
- target_label: __address__
  replacement: kubernetes.default.svc:443
This sets the address to the constant kubernetes.default.svc:443, which is always available from any pod to reach the Kubernetes API.
The second relabel step is a bit more complicated:
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
Here we want to replace __metrics_path__ with the path to the metrics for our node. In each relabel_configs step, Prometheus takes the source_labels and concatenates their values using a separator (";" by default). In this case, since we only have one source label, the result is just the value of __meta_kubernetes_node_name set by the service discovery mechanism. The regex lets you match and extract a portion of that value; here we use (.+), which captures the whole thing. The target label is then set to the replacement, with any capture groups from the regular expression substituted in. Since our regex captures the whole node name, ${1} is replaced with the node name. This pair of relabel steps means Prometheus will attempt to scrape https://kubernetes.default.svc:443/api/v1/nodes/{NODE_NAME}/proxy/metrics/cadvisor for every node discovered by the service discovery mechanism.
I highly recommend taking a read through the relabel config documentation. It is a bit dense, but once you get the hang of it there is practically nothing you can't make it do for each target. Note that you can also use the keep and drop actions to drop targets entirely, to avoid scraping parts of your system you are not interested in monitoring.
Now that we have defined how to scrape each target, the metric_relabel_configs block lets us relabel individual metrics. This is the mechanism we will use to ingest the single metric we are interested in and ignore the rest. If we simply left out the metric_relabel_configs block, Prometheus would ingest every metric exposed by the target.
metric_relabel_configs:
  - source_labels: [__name__,id]
    regex: container_memory_working_set_bytes;/$
    action: keep
The grammar for metric_relabel_configs is the same as for the target-level relabel_configs. Each item is applied from top to bottom, and the set of metrics left at the end is ingested into Prometheus. Our goal is to ingest only container_memory_working_set_bytes for the top-level cgroup, i.e. the series whose id label is exactly "/". To do that, we use source_labels to concatenate the metric name (__name__) and the value of the id label (separated by a semicolon), then use the regex with the keep action to keep only the one series we are interested in.
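To make the keep rule concrete, here is how three example series evaluate (values and extra labels omitted):

container_memory_working_set_bytes{id="/"}         -> "container_memory_working_set_bytes;/"         -> matches, kept
container_memory_working_set_bytes{id="/kubepods"} -> "container_memory_working_set_bytes;/kubepods" -> no match, dropped
container_cpu_usage_seconds_total{id="/"}          -> "container_cpu_usage_seconds_total;/"          -> no match, dropped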
This step again gives you a lot of power to do almost anything you want with labels. A common use case is to drop or keep individual labels using the labelkeep or labeldrop actions. If you do this, make sure your series remain unique after any label drops or renames; non-unique series will cause bad data to be ingested.
Now save the file and apply the changes to the config.
kubectl apply -f prometheus-config.yaml
This pushes the new config to Kubernetes, but we still need to tell Prometheus to reload the configuration. (Note that it can take up to a minute or so for the updated ConfigMap to propagate to the pod's mounted volume.)
First, start up a port forward to Prometheus again (if you killed the old one):
kubectl port-forward -n monitoring svc/prometheus 9090:9090
Then run:
curl -X POST http://localhost:9090/-/reload
The /-/reload path is a special URL we enabled in the Prometheus deployment (via the --web.enable-lifecycle flag) that lets us reload the configuration via a POST request. If you do not wish to enable this feature, you can simply restart the deployment, as shown below.
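If you prefer the restart route, a standard rollout restart works (note that with the emptyDir storage volume, a restart also discards previously ingested data):

kubectl rollout restart -n monitoring deployment/prometheus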
Prometheus should now be recognizing the targets and ingesting metrics.
Now visit http://localhost:9090/ in your browser and click on Status -> Targets to verify the node-cadvisor job is up.
Now click on Graph and enter container_memory_working_set_bytes in the search bar to see a graph of the metric over time.
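Since the raw values are in bytes, a simple PromQL expression like the following (just an illustrative query) makes the graph easier to read as gigabytes, with one series per node:

container_memory_working_set_bytes / 1e9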
Now let's set up Heii On-Call to receive the alerts for our new monitor. In your Heii On-Call organization, create a new trigger of type Manual and give it a name; in this case we'll call it "Cluster RAM Alert".
This creates a trigger that can be fired via a POST request to a URL, which is exactly what Prometheus Alertmanager can be set up to do.
Click Create Trigger and Heii On-Call will create the trigger for you and give you its ID.
You will also need to create an API key for Prometheus to authenticate with. Refer to the documentation for API keys.
Now that we have data and a trigger to hit, we can set up Alertmanager so we can receive alerts. Alertmanager is a component of Prometheus that is responsible for aggregating any firing alerts and sending them to the external system you desire. Create a new file called alertmanager-config.yaml and add the following:
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  config.yml: |-
    global:
    route:
      receiver: heii-on-call
      group_wait: 10s
      repeat_interval: 30m
      routes: []
    receivers:
      - name: heii-on-call
        webhook_configs:
          - url: "https://api.heiioncall.com/triggers/YOUR-TRIGGER-ID-HERE/alert"
            http_config:
              follow_redirects: false
              authorization:
                credentials: "YOUR-HEII-ON-CALL-API-KEY-HERE"
This is a minimal configuration that will send every triggering alert to this single trigger. This is a great way to get started, but once you have more alerts and more teams you can map different alerts to different triggers.
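As a rough sketch of where that can go later (the team label, the database-team receiver, and the second trigger ID are hypothetical, and authorization details are omitted for brevity), child routes can match on alert labels and deliver to different receivers:

route:
  receiver: heii-on-call
  routes:
    - matchers:
        - team="database"
      receiver: database-team
receivers:
  - name: heii-on-call
    webhook_configs:
      - url: "https://api.heiioncall.com/triggers/YOUR-TRIGGER-ID-HERE/alert"
  - name: database-team
    webhook_configs:
      - url: "https://api.heiioncall.com/triggers/ANOTHER-TRIGGER-ID-HERE/alert"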
Apply the configuration. Then create a file called alertmanager.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest
          args:
            - "--config.file=/etc/alertmanager/config.yml"
            - "--storage.path=/alertmanager"
          ports:
            - containerPort: 9093
          resources:
            requests:
              memory: 16Mi
            limits:
              memory: 32Mi
          volumeMounts:
            - name: config-volume
              mountPath: /etc/alertmanager
            - name: alertmanager
              mountPath: /alertmanager
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: alertmanager
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  type: ClusterIP
  ports:
    - port: 9093
      targetPort: 9093
This creates a Deployment for Alertmanager and a Service, just like we did for Prometheus. Note again that we are constraining the resources significantly, because that is all Alertmanager should need for a small deployment. The Service exposes port 9093. Alertmanager also has a web interface; if you wish, you can run a port forward and check it out.
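For example, reusing the same pattern we used for Prometheus:

kubectl port-forward -n monitoring svc/alertmanager 9093:9093

Then visit http://localhost:9093 in your browser.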
Define an alert for our metric
Now we can define an alert rule for Prometheus. Open the prometheus-config.yaml file and change the alerting section to add one Alertmanager configuration block:
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "alertmanager:9093"
This merely tells Prometheus about the Alertmanager service we deployed earlier.
Then, also within prometheus-config.yaml, change the prometheus.rules key to add a single group with a single rule:
prometheus.rules: |-
  groups:
    - name: Node Memory is too high
      rules:
        - alert: HighNodeMemory
          expr: container_memory_working_set_bytes > 3.8e9
          for: 1m
This sets an alert called HighNodeMemory to fire whenever container_memory_working_set_bytes is greater than 3.8 GB for 1 minute. The expr here can be any PromQL expression, so you have immense flexibility in what combinations of metrics to alert on. You should change 3.8e9 to the threshold above which you want to alert; this example runs on a 4 GB machine, so we set our threshold at 3.8 GB.
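For example, if you later also ingest cAdvisor's machine_memory_bytes metric (one more series per node), you could alert on a percentage of total node memory instead of a fixed byte threshold. A sketch, assuming both series carry the node's instance label:

- alert: HighNodeMemoryPercent
  expr: container_memory_working_set_bytes / on(instance) machine_memory_bytes > 0.95
  for: 1m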
Apply the configuration with kubectl apply -f prometheus-config.yaml, then reload Prometheus like before:
curl -X POST http://localhost:9090/-/reload
If your port forward is still active, you can go to http://localhost:9090/alerts and see the alert we added.
That is it! With this minimal setup you can enjoy lightweight alerting on critical metrics, and you can expand this easily to CPU metrics, pod or endpoint specific metrics and even application level metrics. With very little overhead you can now receive push notifications, or configure different on-call rotation schedules to handle the alerts. By being specific about ingesting only important metrics, you can gain all the visibility and monitoring you need, without the ongoing DevOps resources and financial costs of a much heavier ingest-everything approach.
Happy Monitoring!