Status

Welcome! Heii On-Call empowers engineering teams with streamlined on-call rotation scheduling, instant iOS/Android alerts, flexible escalation plans, and an extensive set of monitoring tools including synthetic HTTP probes, cronjob check-ins, and API integrations with platforms like Prometheus and Datadog.

This page explains how we monitor the components of our own multi-tier application, and displays the live status of the triggers we use to do it.

If you'd like to build a similar live status dashboard for your team (and have the on-call engineer get alerts when there's an issue!), sign up and start creating your own triggers.

Website

Our public-facing website is a Ruby on Rails application. We monitor this with a number of Outbound Probes, covering our homepage, a health check endpoint, and key HTTP protocol and subdomain redirects:

Rails Homepage	up 50 ms 3s ago
Rails Health Check Controller	up 155 ms 30s ago
www to root domain redirect	up 13 ms 25s ago
http to https redirect	up 22 ms 4s ago
http://www redirect	up 10 ms 25s ago
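
For readers building similar redirect probes, the core assertion can be sketched as a small pure function. This is an illustrative sketch, not the Outbound Probe implementation: the function name `redirect_ok`, the accepted status codes, and the normalization rules (case-insensitive scheme and host, empty path treated as `/`) are all assumptions.

```python
from typing import Optional
from urllib.parse import urlsplit

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_ok(status: int, location: Optional[str], expected_url: str) -> bool:
    """Return True if a response is a redirect that points at expected_url."""
    if status not in REDIRECT_STATUSES or not location:
        return False
    got, want = urlsplit(location), urlsplit(expected_url)
    # Compare scheme and host case-insensitively; treat an empty path as "/".
    return (
        got.scheme.lower() == want.scheme.lower()
        and got.netloc.lower() == want.netloc.lower()
        and (got.path or "/") == (want.path or "/")
    )
```

A probe would fetch the URL without following redirects, then pass the status code and `Location` header to a check like this one.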

API Server

Our API server is written in Crystal and lives at https://api.heiioncall.com, receiving check-ins and state changes from your applications and APMs (see our docs). We monitor this with Outbound Probes as well:

API Subdomain Homepage	up 39 ms 2s ago
API Subdomain Health Check Endpoint	up 42 ms 2s ago

SSL Certificates

Our website and API endpoints are secured by 90-day SSL certificates issued by Let's Encrypt. We monitor cert-manager's automated certificate rotation with Outbound Probes:

heiioncall.com SSL Certificate Valid 2+ Days	up 35 ms 51s ago
heiioncall.com SSL Certificate Valid 14+ Days	up 21 ms 49s ago
api.heiioncall.com SSL Certificate Valid 2+ Days	up 20 ms 2s ago
api.heiioncall.com SSL Certificate Valid 14+ Days	up 17 ms 2s ago
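
The "Valid N+ Days" idea can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not how the probes above are implemented: the function names are hypothetical, and the input is the `notAfter` string in the format returned by `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` looks like "Jun 11 12:00:00 2030 GMT"; `now` must be
    timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def cert_valid_for(not_after: str, min_days: int, now: datetime) -> bool:
    """True if the certificate remains valid for at least `min_days` more days."""
    return days_until_expiry(not_after, now) >= min_days
```

Running the same check at two thresholds, such as 14 and 2 days, separates an early warning that renewal has stalled from an urgent alert that expiry is close.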

Outbound Prober

Our Outbound Prober is a sharded cluster of Crystal processes. We use an Inbound Liveness trigger to make sure the entire shard space is being probed at all times. We also monitor a diverse control group of popular hosting providers so we can suppress false-positive alerts when it's more likely that our own connectivity is intermittent:

Full Shard Space Coverage	checked in 12s ago
Control Group Majority Up and Fresh	checked in 39s ago
Outbound Probe Escalator	checked in 39s ago
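
The two ideas in the paragraph above, full shard coverage and a control-group majority, can be sketched as follows. This is an illustration only: the representation of shard ownership as half-open integer ranges, the strict-majority rule, and both function names are assumptions, and the freshness of control results is not modeled.

```python
from typing import Dict, Iterable, Tuple


def covers_full_space(ranges: Iterable[Tuple[int, int]], total_shards: int) -> bool:
    """True if half-open [start, end) ranges together cover 0..total_shards."""
    covered_up_to = 0
    for start, end in sorted(ranges):
        if start > covered_up_to:
            return False  # gap: some shards before `start` are not probed
        covered_up_to = max(covered_up_to, end)
    return covered_up_to >= total_shards


def control_group_majority_up(results: Dict[str, bool]) -> bool:
    """True if strictly more than half of the control hosts responded."""
    up = sum(1 for ok in results.values() if ok)
    return up * 2 > len(results)
```

A prober would check in with its liveness trigger only while `covers_full_space` holds, and would hold back customer-facing alerts whenever `control_group_majority_up` is false, since a failing control group suggests the prober's own connectivity is the problem.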

Background Job Queue

We use Sidekiq as a job queue for asynchronous and scheduled background jobs. We have a WorkerHeartbeatJob that hits an Inbound Liveness trigger to verify our scheduler and workers are running continuously:

Worker Heartbeat	checked in 12s ago
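
The semantics of a liveness trigger fed by a heartbeat job can be modeled in a few lines. This is a conceptual sketch, not the Heii On-Call implementation: the class name, the `timeout_seconds` field, and the choice to treat a trigger that has never checked in as alerting are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LivenessTrigger:
    """Alerts when no check-in has arrived within `timeout_seconds`."""

    timeout_seconds: float
    last_check_in: Optional[float] = None  # seconds since some epoch

    def check_in(self, now: float) -> None:
        """Record a heartbeat at time `now`."""
        self.last_check_in = now

    def is_alerting(self, now: float) -> bool:
        """True if the heartbeat is missing or older than the timeout."""
        if self.last_check_in is None:
            return True  # assumption: never checked in counts as down
        return now - self.last_check_in > self.timeout_seconds
```

Because the heartbeat job is itself scheduled and executed by the queue, a missed check-in points at either the scheduler or the workers, which is why one trigger covers both.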

ToolServ Microservice

We have an internal Crystal microservice that our Rails code calls for one-off queries. There are different approaches and tradeoffs to monitoring internal services that are not publicly exposed. Since what we care about is that this microservice is reachable from Rails, we monitor it by pointing an Outbound Probe at a Rails endpoint that in turn calls the ToolServ health check:

ToolServ Health Check	up 149 ms 46s ago
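
The pass-through pattern described above can be sketched independently of any framework. The sketch below is in Python purely for illustration (the real endpoint lives in Rails); the function name, the injected `call_toolserv` callable, and the response bodies are hypothetical.

```python
from typing import Callable, Tuple


def toolserv_health_via_rails(call_toolserv: Callable[[], bool]) -> Tuple[int, str]:
    """Return (HTTP status, body) for a public endpoint that proxies an internal check.

    `call_toolserv` stands in for whatever client call reaches the internal
    service's health check; it returns True when the service is healthy.
    """
    try:
        healthy = call_toolserv()
    except Exception:
        # A timeout or connection error counts as unhealthy in this sketch.
        healthy = False
    if healthy:
        return 200, "ok"
    return 503, "toolserv unavailable"
```

The tradeoff is that a failure of the public application also fails this probe, so the result is best read alongside the application's own health check.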

Deployed Version Convergence

Our production Rails app, API server, Sidekiq workers, Outbound Prober, and ToolServ are all built and deployed from one Git monorepo. While they may temporarily run different versions during a rolling deployment, all production processes should eventually be running the same commit ID. We have a job that checks in with an Inbound Liveness trigger only if exactly one commit ID is currently running in production, ensuring we're alerted to any incomplete or inconsistent deployments that don't converge within a few minutes:

Same Git Commit Check	checked in 39s ago
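
The convergence condition itself is small enough to sketch directly. This assumes, hypothetically, that each production process reports its commit ID into a mapping keyed by process name; how the IDs are actually collected is not described here.

```python
from typing import Dict


def deployment_converged(commit_by_process: Dict[str, str]) -> bool:
    """True if every reporting process runs the same commit ID.

    An empty mapping is treated as not converged, so that a job which
    fails to collect any data does not check in.
    """
    return len(set(commit_by_process.values())) == 1
```

A periodic job would check in only when this returns True, so a rollout that stays split across two commits for longer than the trigger's timeout raises an alert.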

Postgres Database

We monitor some housekeeping tasks on our primary Postgres database, such as ensuring successful logical database backups and verifying the correct state of schema migrations, primarily using Inbound Liveness triggers that are checked in on success:

Logical Backup	checked in 15m10s ago
Postgres Disk Usage Monitor	checked in 39s ago
Schema Migrations Up-to-Date (Rails Check)	up 30 ms 58s ago
Schema Migrations Up-to-Date (Sidekiq Check)	checked in 39s ago
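
A schema migration check boils down to a set difference between the migrations the code defines and the versions recorded in the database. The sketch below assumes both are available as collections of version strings (Rails records applied versions in its `schema_migrations` table); the function name is hypothetical.

```python
from typing import Iterable, List


def pending_migrations(defined: Iterable[str], applied: Iterable[str]) -> List[str]:
    """Return migration versions present in the code but not yet applied."""
    return sorted(set(defined) - set(applied))
```

An empty result means the schema is up to date for the code doing the check. Running the check from more than one process type helps catch the case where different processes are running different code.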

Miscellaneous Background Jobs

Many of our on-call schedule rotation features are time-dependent, such as updating who's currently on call. These tasks run as scheduled jobs in Sidekiq. We monitor the completion of these important jobs with Inbound Liveness triggers:

All Shift Changes Notify	checked in 40s ago
Schedule Future Shifts	checked in 18m24s ago
Session Cleanup	checked in 3m34s ago
Shift Cleanup	checked in 23m10s ago
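
The "check in on completion" pattern can be expressed as a wrapper around a job. This is a generic sketch rather than Sidekiq code: the decorator name and the injected `check_in` callable (which would send the actual check-in request) are assumptions.

```python
import functools
from typing import Any, Callable


def check_in_on_success(check_in: Callable[[], None]) -> Callable:
    """Build a decorator that calls `check_in` only after the job returns normally."""

    def decorator(job: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(job)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = job(*args, **kwargs)  # an exception here skips the check-in
            check_in()
            return result

        return wrapper

    return decorator
```

Because a failed or never-scheduled job produces no check-in, one liveness timeout covers both cases without separate error reporting.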

External APIs

We monitor critical third-party APIs with scheduled jobs that check in with Inbound Liveness triggers once they verify connectivity and authentication success:

Amazon SES SMTP	checked in 57m40s ago
Google Firebase Cloud Messaging API	checked in 57m40s ago
Slack API	checked in 57m40s ago
Stripe API	checked in 57m40s ago
Telegram API	checked in 57m39s ago

External Monitoring

We're in a special situation: we can't rely on Heii On-Call to monitor Heii On-Call itself, because we might not get alerted if something in the critical alerting flow breaks! Therefore, we monitor Heii On-Call with three external services running on distinct infrastructure in geographically distributed locations. We then monitor these external services with Inbound Liveness triggers so that we can get alerted if they stop probing our special health check endpoints. As long as one of these is up, we should be notified when any downtime occurs:

External Monitor 1A	checked in 5s ago
External Monitor 1B	checked in 16s ago
External Monitor 2A	checked in 4s ago
External Monitor 2B	checked in 10s ago
External Monitor 3A	checked in 11s ago
External Monitor 3B	checked in 5s ago
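
The redundancy argument above ("as long as one of these is up") can be sketched as a freshness check over the monitors' last check-in times. The function names, the use of plain timestamps in seconds, and the single shared timeout are assumptions for illustration.

```python
from typing import Dict, List


def fresh_monitors(last_check_in: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Names of monitors whose latest check-in is within `timeout` seconds of `now`."""
    return sorted(
        name for name, ts in last_check_in.items() if now - ts <= timeout
    )


def still_covered(last_check_in: Dict[str, float], now: float, timeout: float) -> bool:
    """True if at least one external monitor is still actively probing."""
    return len(fresh_monitors(last_check_in, now, timeout)) >= 1
```

Alerting on each stale monitor individually, rather than waiting for `still_covered` to become false, leaves time to restore redundancy before the last monitor is lost.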

Inspired to monitor your own website or application? Get Started »