Uptime monitoring is an important part of operating a website or app. When you are just getting started, though, it is just as important not to get lost in the weeds and to keep working on your product. In this guide I'll walk you through the bare necessities of uptime monitoring for a new website.
You did it! Your app or website is feature complete. You are ready to make it public and show the world your delightful new corner of the internet. An item lingering on the to-do list (or kanban board, if that is your thing) is to set up some sort of uptime monitoring. After all, your small team (maybe just you?) won't catch every issue instantly, and you want to know if prod is broken.
If you read up on this topic you will find many articles and guides on integrating complete APM (Application Performance Monitoring) solutions that can alert you on all sorts of metrics from your application. You can find entire online courses about PagerDuty alternatives and Datadog alternatives. These solutions might be great, but they are NOT FOR YOU (yet). They are made for teams with dedicated engineers who tune the performance of an app or its deployment configuration in response to the metrics. They are complicated, they are expensive, and they are not what you need. Right now you need a solution that takes 5 minutes to set up and then gets out of your way, yet leaves you the option of integrating more nuanced monitoring down the line. This is a guide to a barebones uptime monitoring setup with Heii On-Call.
The first step is to set up a monitor that will request your publicly accessible webpage and let you know if it is still serving requests at all. In Heii On-Call this monitor is called an Outbound Probe. After creating your first Organization in Heii On-Call you are automatically given a default Rotation and a Service, which are enough to get started. Click into the Service and create a New Trigger.
Give your trigger a name and fill in the URL you wish to monitor.
The timeout value is actually fairly nuanced, and picking well can mean the difference between getting so many alerts that you end up ignoring them (false positives) and missing real downtime (false negatives). A value of "5 minutes" means that Heii On-Call will alert you if your site is detected as down for at least 5 minutes. When picking this value, remember that the internet is a large distributed system. Parts of the network will fail randomly and self-heal without you having to do anything about it. With this probe we are not trying to measure a website's overall availability percentage. We just want to know if something is systematically broken and needs to be dealt with by a human. We recommend a 5 minute timeout to start. This is more than enough time for ISPs, public clouds, and hosting providers to smooth out any small temporary networking blips.
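To make the "down for at least 5 minutes" rule concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Heii On-Call's actual implementation: the `should_alert` function, the one-minute check cadence, and the sample data are all assumptions made up for this example.

```python
from datetime import datetime, timedelta


def should_alert(checks, timeout):
    """Decide whether the current outage has lasted at least `timeout`.

    `checks` is a list of (timestamp, ok) pairs, oldest first.
    Any successful check resets the outage clock, so short blips
    that self-heal never reach the timeout.
    """
    down_since = None
    for timestamp, ok in checks:
        if ok:
            down_since = None          # site recovered, forget the blip
        elif down_since is None:
            down_since = timestamp     # first failure of a new outage
    if down_since is None:
        return False
    return checks[-1][0] - down_since >= timeout


start = datetime(2024, 1, 1, 12, 0)


def minute_checks(results):
    """Hypothetical helper: one check result per minute, starting at `start`."""
    return [(start + timedelta(minutes=i), ok) for i, ok in enumerate(results)]


blip = minute_checks([True, False, False, True, True, True])
outage = minute_checks([True, False, False, False, False, False, False])

print(should_alert(blip, timedelta(minutes=5)))    # False
print(should_alert(outage, timedelta(minutes=5)))  # True
```

The two-minute blip never pages anyone, while the outage that is still going after five minutes does. That is the trade-off the timeout controls.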
Click “Create” and Heii On-Call will perform a HEAD request to your URL. If the request fails or returns an error, Heii On-Call will alert you based on your notification settings.
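If you are curious what a probe like this boils down to, the sketch below performs a single HEAD request with Python's standard library. It is a rough approximation for illustration, not Heii On-Call's code: the `probe` function name, the 10 second request timeout, and the choice to treat any 2xx or 3xx response as "up" are assumptions.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout_seconds=10.0):
    """Return True if a HEAD request to `url` gets a non-error response.

    urllib raises HTTPError for 4xx/5xx responses and URLError/OSError
    for DNS, TCP, TLS, and timeout failures, so every one of those
    layers ends up reported as "down".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

Calling `probe("https://example.com/")` returns `True` or `False` for whatever URL you pass in. A hosted monitor adds the parts you do not want to build yourself: running the check on a schedule from outside your infrastructure, and notifying you when it keeps failing.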
If you have any other public endpoints that are useful to monitor, such as an API health check endpoint, create a separate Outbound Probe for each of those.
It may seem simple, but configuring this one check means your site is now constantly being monitored for many common issues at the DNS, TCP, SSL certificate, HTTP server, framework, and application layers of your stack. For a new website or app, this single probe gives you the most coverage for the least effort.
If you do not have any background processes or cron jobs that are part of your service (kudos to you for keeping your system simple), move on to step 3.
If you do have vital background processes that MUST run, setting up monitoring for them is easy with an Inbound Liveness check. Inbound Liveness checks provide an HTTP endpoint for your code to check in with periodically, and alert you if your system fails to check in. This is perfect for making sure regularly scheduled cron jobs or other background processes are running.
To set one up, create a new trigger, and change the type to Inbound Liveness. Give it a name.
The timeout here is also nuanced and worth a little thought. In this case the timeout is the amount of time since the last check-in that Heii On-Call will wait before sending you an alert. The value you enter should be the interval at which the cron job runs, plus a bit of buffer time to cover variability in how long the job takes to run. For example, if you have an email-sending cron job that runs every 30 minutes, you might make the timeout 35 minutes. If you have a database cleanup script that runs once a day, you might make the timeout 26 hours (remember, there is a 25-hour day when clocks fall back in the autumn!).
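The arithmetic is simple enough to check in a few lines of Python. The `liveness_timeout` helper is hypothetical, and the fall-back example assumes a job scheduled at 03:00 local time in US Eastern time, where clocks went back on 3 November 2024.

```python
from datetime import datetime, timedelta, timezone


def liveness_timeout(interval, buffer):
    """Timeout = how often the job runs + slack for run-time variability."""
    return interval + buffer


# Email job every 30 minutes, with 5 minutes of slack.
print(liveness_timeout(timedelta(minutes=30), timedelta(minutes=5)))  # 0:35:00

# Daily cleanup job, with 2 hours of slack.
daily = liveness_timeout(timedelta(hours=24), timedelta(hours=2))
print(daily)  # 1 day, 2:00:00

# Why the slack matters: two consecutive 03:00 local runs across the
# autumn clock change are 25 real hours apart, not 24.
edt = timezone(timedelta(hours=-4))  # daylight time, before the change
est = timezone(timedelta(hours=-5))  # standard time, after the change
run_before = datetime(2024, 11, 2, 3, 0, tzinfo=edt)
run_after = datetime(2024, 11, 3, 3, 0, tzinfo=est)
gap = run_after - run_before
print(gap)           # 1 day, 1:00:00
print(gap <= daily)  # True
```

A 24 hour timeout would have raised a false alarm on that night. The 26 hour timeout absorbs the longer day and still leaves an hour for the job itself.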
Click “Create” and take note of the newly created ID. This is the ID that your application code needs to check in with. Check the documentation for more information on how to make your application code send an HTTP request to check in with the Inbound Liveness check.
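One design point is worth sketching regardless of how the HTTP call itself looks: check in only after the job has actually succeeded, so that a crashing job goes quiet and the timeout fires. In the Python sketch below, `run_with_check_in` is a hypothetical wrapper, and `check_in` is a placeholder for whatever function you write to send the check-in request described in the documentation. The real endpoint and request format are deliberately not shown here.

```python
import logging
from typing import Callable


def run_with_check_in(job: Callable[[], None], check_in: Callable[[], None]) -> bool:
    """Run `job`, then call `check_in` only if the job finished without error.

    `check_in` is a placeholder: supply a function that sends the
    check-in HTTP request for your Inbound Liveness trigger.
    """
    try:
        job()
    except Exception:
        # No check-in on failure: the liveness timeout will expire
        # and the monitoring service will alert you.
        logging.exception("job failed, skipping liveness check-in")
        return False
    check_in()
    return True
```

With this shape, a job that fails or hangs simply never checks in, which is exactly the situation the Inbound Liveness timeout exists to catch.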
If you have event-driven background workers that run when needed but not on a regular schedule, it’s still a good idea to use an Inbound Liveness check to ensure your job queue and worker processes stay up and running. To do this, periodically inject a background job that does nothing except perform an Inbound Liveness check-in. This will catch any serious downtime in your message queue or worker deployments.
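Here is a minimal sketch of that heartbeat pattern, using Python's in-process `queue.Queue` as a stand-in for a real job queue. All of the names are hypothetical, and `check_in` is again a placeholder for the function that sends your check-in request. In a real deployment the queue would be your message broker and the worker a separate process.

```python
import queue
import threading


def worker(jobs):
    """Run jobs from the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        job()


def heartbeat_loop(jobs, check_in, interval_seconds, stop):
    """Enqueue `check_in` as an ordinary job every `interval_seconds`.

    The check-in only happens if the job travels through the queue and
    a worker picks it up, so a dead queue or dead workers mean silence.
    """
    while True:
        jobs.put(check_in)
        if stop.wait(interval_seconds):  # returns True once stop is set
            break


def start_heartbeat(jobs, check_in, interval_seconds):
    """Start the heartbeat in a background thread; returns (thread, stop event)."""
    stop = threading.Event()
    thread = threading.Thread(
        target=heartbeat_loop,
        args=(jobs, check_in, interval_seconds, stop),
        daemon=True,
    )
    thread.start()
    return thread, stop
```

The important detail is that the heartbeat is enqueued rather than sent directly. If the scheduler called `check_in` itself, the check would keep passing even when every worker was down.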
This is the minimum you absolutely need, and you should not do any more when just getting started. At this stage your job is to build your product and talk to your users, not tune your CPU load metrics. All you really need to know is that the app is up and running, and following steps 1 and 2 above will monitor that for you, continuously. Once your website starts growing, and you begin running into scaling issues, then it is time to start thinking about whether you need things like on-call scheduling and Datadog alternatives.