Welcome! Heii On-Call empowers engineering teams with streamlined on-call rotation scheduling, instant iOS/Android alerts, flexible escalation plans, and an extensive set of monitoring tools including synthetic HTTP probes, cron job check-ins, and API integrations with platforms like Prometheus and Datadog.
This page explains how we monitor the components of our own multi-tier application and displays the live status of each trigger we use to do so.
If you'd like to build a similar live status dashboard for your team (and have the on-call engineer get alerts when there's an issue!), sign up and start creating your own triggers.
Our public-facing website is a Ruby on Rails application. We monitor it with several Outbound Probes that cover our homepage, a health check endpoint, and the key protocol and subdomain redirects (http to https, and www to the root domain):
Trigger | Status |
---|---|
Rails Homepage | up 332 ms 4s ago |
Rails Health Check Controller | up 107 ms 29s ago |
www to root domain redirect | up 171 ms 9s ago |
http to https redirect | up 3 ms 36s ago |
http://www redirect | up 14 ms 42s ago |
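A minimal sketch, in Python, of how a probe might decide whether a redirect response counts as "up". The function name, the accepted status codes, and the `example.com` URLs are assumptions for illustration, not a description of how Heii On-Call's prober is implemented.

```python
from typing import Optional
from urllib.parse import urlsplit

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_ok(status: int, location: Optional[str], expected_url: str) -> bool:
    """Return True if the response redirects to expected_url."""
    if status not in REDIRECT_STATUSES or not location:
        return False
    got, want = urlsplit(location), urlsplit(expected_url)
    # Scheme and host are case-insensitive; an empty path is treated as "/".
    return (
        got.scheme.lower() == want.scheme.lower()
        and got.netloc.lower() == want.netloc.lower()
        and (got.path or "/") == (want.path or "/")
    )
```

For example, `redirect_ok(301, "https://example.com/", "https://example.com")` passes, while a `200` response or a redirect to a different host does not.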
Our API server is written in Crystal and lives at https://api.heiioncall.com, receiving check-ins and state changes from your applications and APMs (see our docs). We monitor this with Outbound Probes as well:
Trigger | Status |
---|---|
API Subdomain Homepage | up 108 ms 28s ago |
API Subdomain Health Check Endpoint | up 176 ms 28s ago |
Our Outbound Prober is a sharded cluster of Crystal processes. We use an Inbound Liveness trigger to make sure the entire shard space is being probed at all times. We also probe a diverse control group of popular hosting providers, so that we can suppress false-positive alerts when a failed probe is more likely caused by intermittent connectivity on our side than by a real outage:
Trigger | Status |
---|---|
Full Shard Space Coverage | checked in 0s ago |
Control Group Majority Up and Fresh | checked in 0s ago |
Outbound Probe Escalator | checked in 1s ago |
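The control-group idea can be sketched as follows. This is a hypothetical Python illustration, not the actual prober logic: the `ControlResult` type, the strict-majority rule, and the 60-second freshness window are all assumptions chosen to mirror the "Majority Up and Fresh" trigger name.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ControlResult:
    up: bool            # did the last probe of this control target succeed?
    age_seconds: float  # time since that probe completed


def control_group_healthy(results: Sequence[ControlResult],
                          max_age_seconds: float = 60.0) -> bool:
    """True when a strict majority of control targets are up and fresh."""
    if not results:
        return False
    good = sum(1 for r in results if r.up and r.age_seconds <= max_age_seconds)
    return good * 2 > len(results)


def should_alert(target_up: bool, results: Sequence[ControlResult],
                 max_age_seconds: float = 60.0) -> bool:
    """Alert on a failed probe only if our own connectivity looks healthy."""
    return (not target_up) and control_group_healthy(results, max_age_seconds)
```

If most control targets are down or stale, the failure is attributed to the prober's own network and the alert is held back.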
We use Sidekiq as a job queue for asynchronous and scheduled background jobs. We have a WorkerHeartbeatJob that hits an Inbound Liveness trigger to verify our scheduler and workers are running continuously:
Trigger | Status |
---|---|
Worker Heartbeat | checked in 1s ago |
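Conceptually, a liveness trigger of this kind stays healthy only while check-ins keep arriving within some timeout. The sketch below is a simplified Python model under that assumption; the class name, the timeout value, and the explicit `now` arguments are hypothetical and do not reflect the Heii On-Call API.

```python
from typing import Optional


class LivenessTrigger:
    """Toy model: alive only if a check-in arrived within the timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_check_in: Optional[float] = None

    def check_in(self, now: float) -> None:
        # Called by the heartbeat job each time it runs successfully.
        self.last_check_in = now

    def is_alive(self, now: float) -> bool:
        # A trigger that has never checked in is not considered alive.
        if self.last_check_in is None:
            return False
        return now - self.last_check_in <= self.timeout_seconds
```

If the scheduler or the workers stop, the heartbeat job stops checking in, and the trigger goes stale once the timeout elapses.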
We have an internal Crystal microservice, ToolServ, that our Rails code calls for one-off queries. There are different approaches and tradeoffs to monitoring non-exposed internal services. Since we care that this microservice is reachable from Rails, we point an Outbound Probe at a Rails endpoint that in turn calls the ToolServ health check:
Trigger | Status |
---|---|
ToolServ Health Check | up 211 ms 44s ago |
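The pass-through pattern can be sketched like this. It is a framework-agnostic Python illustration rather than the actual Rails controller: `call_internal` stands in for whatever client call reaches the internal service, and the status codes and messages are assumptions.

```python
from typing import Callable, Tuple


def proxied_health_check(call_internal: Callable[[], bool]) -> Tuple[int, str]:
    """Map the internal service's health onto an HTTP status for the probe."""
    try:
        healthy = call_internal()
    except Exception as exc:
        # Timeouts, refused connections, and similar failures all mean the
        # web tier cannot reach the service, which is what we want to detect.
        return 503, f"internal service unreachable: {type(exc).__name__}"
    if healthy:
        return 200, "ok"
    return 503, "internal service reported unhealthy"
```

Because the probe goes through the web tier, it fails both when the internal service is down and when the path between the two is broken.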
We monitor some housekeeping tasks on our primary Postgres database, such as ensuring that logical database backups succeed and verifying that schema migrations are up to date. These checks primarily use Inbound Liveness triggers, which receive a check-in on success:
Trigger | Status |
---|---|
Logical Backup | checked in 59m50s ago |
Postgres Disk Usage Monitor | checked in 0s ago |
Schema Migrations Up-to-Date (Rails Check) | up 198 ms 1m0s ago |
Schema Migrations Up-to-Date (Sidekiq Check) | checked in 1s ago |
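One way to think about a "migrations up to date" check is as a set difference between the migration versions shipped with the code and the versions recorded in the database. The Python sketch below assumes both lists are already available as version strings; the function names and sample versions are hypothetical, and this is not necessarily how our checks are implemented.

```python
from typing import Iterable, List


def pending_migrations(on_disk: Iterable[str], applied: Iterable[str]) -> List[str]:
    """Versions present in the codebase but not yet applied to the database."""
    return sorted(set(on_disk) - set(applied))


def migrations_up_to_date(on_disk: Iterable[str], applied: Iterable[str]) -> bool:
    # Extra applied versions (e.g. from a newer deploy) are not treated as pending.
    return not pending_migrations(on_disk, applied)
```

A check like this would only check in, or return a successful response, when `migrations_up_to_date` is true.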
Many of our on-call schedule rotation features are time-dependent, such as updating who's currently on call. These tasks run as scheduled jobs in Sidekiq. We monitor the completion of these important jobs with Inbound Liveness triggers:
Trigger | Status |
---|---|
All Shift Changes Notify | checked in 1s ago |
Schedule Future Shifts | checked in 2m49s ago |
Session Cleanup | checked in 47m39s ago |
Shift Cleanup | checked in 7m59s ago |
We monitor critical third-party APIs with scheduled jobs that check in with Inbound Liveness triggers once they have verified connectivity and successful authentication:
Trigger | Status |
---|---|
Amazon SES SMTP | checked in 41m59s ago |
Google Firebase Cloud Messaging API | checked in 41m59s ago |
Slack API | checked in 41m59s ago |
Stripe API | checked in 41m59s ago |
Telegram API | checked in 41m58s ago |
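The "check in only on success" pattern can be sketched as a small wrapper. In this Python illustration, `verify` and `check_in` are hypothetical callables standing in for the third-party API test and the call that notifies the liveness trigger; neither is a real Heii On-Call or vendor function.

```python
from typing import Callable


def run_and_check_in(verify: Callable[[], bool], check_in: Callable[[], None]) -> bool:
    """Run verify(); call check_in() only if it succeeds without raising."""
    try:
        ok = bool(verify())
    except Exception:
        # In this sketch any error counts as a failed verification. A real
        # job would likely also log the exception or re-raise it.
        ok = False
    if ok:
        check_in()
    return ok
```

When verification fails, no check-in is sent, so the trigger eventually goes stale and the failure surfaces as a missed check-in.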
We're in a special situation: we can't rely on Heii On-Call to monitor Heii On-Call itself, because we might not get alerted if something in the critical alerting flow breaks. Therefore, we monitor Heii On-Call with three external services running on distinct infrastructure in geographically distributed locations. We then monitor these external services with Inbound Liveness triggers, so that we get alerted if any of them stops probing our special health check endpoints. As long as at least one of the external services is up, we should be notified when any downtime occurs:
Trigger | Status |
---|---|
External Monitor 1A | checked in 16s ago |
External Monitor 1B | checked in 29s ago |
External Monitor 2A | checked in 19s ago |
External Monitor 2B | checked in 18s ago |
External Monitor 3A | checked in 10s ago |
External Monitor 3B | checked in 10s ago |