Documentation

If you can't find the answer to what you are trying to do here, please send us an email at heii@heiioncall.com. We're happy to help!


Triggers

Triggers define a specific condition that can cause an alert to be sent to the current on-call individual for a service. Typically, triggers will be specific concerns that may go wrong in your system, such as "Database CPU load is greater than 90%" or "95th percentile web request latency is greater than 1s".

Triggers have three possible states: resolved, alerting, and acknowledged.

After the underlying issue is fixed, the user is responsible for ensuring the trigger is marked as resolved so that it can fire again in the future.

Trigger Types

Heii On-Call supports three types of triggers: manual triggers, inbound liveness triggers, and outbound probe triggers.

Manual Triggers

Manual triggers can be triggered via an API call or manually from the web UI.

The most common way to trigger an alert is via an API call. This is the easiest way to set up an integration with your backend or your existing monitoring solution. To trigger an alert, simply perform an HTTP POST request to https://api.heiioncall.com/triggers/{TRIGGER_ID}/alert, and be sure to include the Authorization: Bearer {API_KEY} header. Try the curl examples below:

curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert

curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/acknowledge

curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/resolve

If your integration cannot send the Authorization: Bearer header, you can instead pass the API key as the token URL parameter. This is not recommended, but it is available.

curl -X POST 'https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert?token={API_KEY}'

These endpoints are idempotent, so it is safe to call them multiple times. Furthermore, if a trigger is already in the acknowledged state, calling the alert endpoint again will have no effect.

Most of the time, you will be programmatically setting a trigger's state with alert or resolve endpoints. The expectation is that the individual responding will acknowledge the incident directly from the Heii On-Call UI, not via API. But, if you want to acknowledge through the API, we won't stop you.
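
For application code, the same three calls can be made without shelling out to curl. The following is a minimal Python sketch using only the standard library. The function names build_trigger_request and set_trigger_state are illustrative and not part of any official Heii On-Call client, and the 15-second timeout is an assumed default.

```python
import urllib.request

API_BASE = "https://api.heiioncall.com"  # base URL taken from the text above
ALLOWED_ACTIONS = {"alert", "acknowledge", "resolve"}


def build_trigger_request(trigger_id: str, action: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /triggers/{trigger_id}/{action}."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return urllib.request.Request(
        url=f"{API_BASE}/triggers/{trigger_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def set_trigger_state(trigger_id: str, action: str, api_key: str, timeout: float = 15.0) -> int:
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError for connection problems.
    """
    request = build_trigger_request(trigger_id, action, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling set_trigger_state(trigger_id, "alert", api_key) sends the alert. Keeping request construction separate from sending makes the integration easy to unit test without touching the network.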

Inbound Liveness Triggers

Heii On-Call also has an inbound liveness trigger type. This trigger type will automatically transition to alerting if no external service has checked in within a preconfigured time period. Inbound liveness triggers are great for making sure that cronjobs are still running, for example.

To programmatically check in on an inbound liveness trigger, simply perform an HTTP POST to https://api.heiioncall.com/triggers/{TRIGGER_ID}/checkin and set the Authorization: Bearer {API_KEY} header.

curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/checkin

Calling checkin will automatically transition the trigger to resolved.
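
As an illustration of the cronjob use case, the sketch below checks in only after a job finishes without raising. It assumes the job is a Python callable; run_job_then_checkin and send_checkin are hypothetical helper names, not part of Heii On-Call.

```python
import urllib.request
from typing import Callable

CHECKIN_URL = "https://api.heiioncall.com/triggers/{trigger_id}/checkin"


def send_checkin(trigger_id: str, api_key: str, timeout: float = 15.0) -> int:
    """POST to the checkin endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        CHECKIN_URL.format(trigger_id=trigger_id),
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_job_then_checkin(job: Callable[[], None], checkin: Callable[[], object]) -> bool:
    """Run the job; check in only if it completed without raising.

    When the job fails, no check-in is sent, so the liveness trigger
    transitions to alerting once its configured period elapses.
    """
    try:
        job()
    except Exception:
        return False
    checkin()
    return True
```

A cron entry point might then call run_job_then_checkin(nightly_backup, lambda: send_checkin(trigger_id, api_key)), where nightly_backup stands in for your own job.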

Outbound Probe Triggers

An outbound probe trigger uses Heii On-Call's servers to continuously monitor the availability and uptime of your HTTP-exposed services.

By default, we periodically send a HEAD request to your specified HTTP(S) endpoint, such as an API health check endpoint, or your website's home page. (Currently, this probe is sent once per minute, but the probe interval may be adaptive in the future.) Paid organizations may also choose to send GET/POST/PUT/DELETE/PATCH/OPTIONS requests, and may specify custom request headers and a request body.

By default, a timely HTTP response of 200 OK or 204 No Content is considered up. Any other result is considered down. (Redirects will not be followed, so be sure to specify your URL precisely.) Any issues at the DNS, TCP, TLS, or HTTP layers will also be considered down, and the details are shown on the trigger page.

Any up result will transition the trigger to resolved. The outbound probe trigger's timeout configures how long your endpoint must be continuously down before the trigger will automatically transition to alerting. This timeout represents a tradeoff between alert latency and potential false positives, such as intermittent connectivity problems that resolve without intervention. A longer timeout gives more opportunity for any intermittent glitches to resolve themselves before alerting. A shorter timeout brings a real, actionable outage to your attention a few minutes sooner. We recommend a minimum of 3 minutes, and a typical timeout of 5 minutes to 15 minutes for most applications.
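
To make the tradeoff concrete, here is a small simulation of "continuously down for at least the timeout". It is only an illustration under simplifying assumptions (exactly one probe per minute, downtime measured from the first failed probe) and is not a description of how Heii On-Call implements it. The name first_alert_minute is hypothetical.

```python
from typing import Optional, Sequence


def first_alert_minute(results: Sequence[bool], timeout_minutes: int) -> Optional[int]:
    """Return the minute at which continuous downtime first reaches the timeout.

    results[i] is True when the probe at minute i was up. Any up result
    resets the downtime clock. Returns None if the timeout is never reached.
    """
    down_since: Optional[int] = None
    for minute, is_up in enumerate(results):
        if is_up:
            down_since = None  # an up result clears the outage
            continue
        if down_since is None:
            down_since = minute  # first failed probe of this outage
        if minute - down_since >= timeout_minutes:
            return minute
    return None


# A two-minute glitch followed by a real outage starting at minute 4.
probes = [True, False, False, True, False, False, False, False]
print(first_alert_minute(probes, timeout_minutes=3))  # 7: the glitch never alerts
print(first_alert_minute(probes, timeout_minutes=1))  # 2: the glitch alerts
```

With the 3-minute timeout the short glitch is absorbed and only the sustained outage alerts; with the 1-minute timeout the glitch itself pages someone.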

Note that by default we use a HEAD request to query your endpoint while economizing on bandwidth usage. Per RFC 7231 section 4.3.2, "The HEAD method is identical to GET except that the server MUST NOT send a message body in the response", and this method "is often used for testing hypertext links for validity, accessibility, and recent modification", such as our outbound probes.

If you're writing a health check endpoint for use with Heii On-Call, simply make an ordinary GET endpoint which returns with 200 OK or 204 No Content on success, and returns any other status code on failure. We have confirmed that for many popular web frameworks (including Ruby on Rails, Sinatra, Express.js, Django, Flask, Kemal), the full GET handler will be run by a HEAD request.
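
If you are not using one of those frameworks, a health check can be sketched with Python's standard library alone. Note that http.server does not route HEAD to the GET handler automatically, so the sketch defines do_HEAD explicitly. The /health path, the port, the 503 failure status, and the dependencies_healthy placeholder are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def dependencies_healthy() -> bool:
    """Hypothetical placeholder: replace with real checks, e.g. a database ping."""
    return True


def health_response(path: str, healthy: bool) -> Tuple[int, bytes]:
    """Map a request path and health flag to (status code, body)."""
    if path != "/health":
        return 404, b"not found\n"
    if healthy:
        return 200, b"ok\n"
    return 503, b"unhealthy\n"


class HealthHandler(BaseHTTPRequestHandler):
    def _respond(self, include_body: bool) -> None:
        status, body = health_response(self.path, dependencies_healthy())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self) -> None:
        self._respond(include_body=True)

    def do_HEAD(self) -> None:
        # Run the same checks as GET, but send headers only.
        self._respond(include_body=False)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Serve the health check until interrupted."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint locally. Any non-200/204 status works for the failure case; 503 is simply a conventional choice.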

Paid organizations may also choose to send GET/POST/PUT/DELETE/PATCH/OPTIONS requests with custom headers and a body. Please be aware that unlike HEAD requests, these will send both request and response bodies and may incur higher data transfer costs on your end, and we may impose a per-request byte transfer limit to protect our services.

Optionally, you may define additional response criteria required for a response to be considered up:

Request Details

Alert Body

You may optionally add a small HTTP body to your /triggers/{TRIGGER_ID}/alert request, which transitions a trigger to alerting. Heii On-Call will store the body (up to 10240 bytes) and display it on the trigger page when a trigger is alerting. This is especially useful when integrating with third-party monitoring solutions to include metadata about the triggering alert, such as a direct link to the system that generated the alert.

You can add data as JSON:

curl -X POST \
  -H 'Authorization: Bearer {API_KEY}' \
  -H 'Content-Type: application/json' \
  -d '{"host":"server1", "status":"onfire"}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert

or form-encoded:

curl -X POST \
  -H 'Authorization: Bearer {API_KEY}' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'host=server1&status=onfire' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert

or as good old plain text:

curl -X POST \
  -H 'Authorization: Bearer {API_KEY}' \
  -H 'Content-Type: text/plain' \
  -d 'server 1 is on fire' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert

This flexibility should give you plenty of room to integrate with your favorite monitoring solutions. As always, we suggest starting simple. The simplest is to skip the body. Beyond that, most of the time, something as quick as: "Alert from {Your Monitoring Service}" is enough to help the on-call person quickly track down the issue, and you can add complexity down the line.
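
When the body is generated by code, it can help to keep it within the 10240-byte limit on the client side. The sketch below builds a JSON alert request and falls back to a short plain-text note if the JSON would be too large. How the server treats oversized bodies is not stated here, so the fallback is an assumption, and encode_alert_body and build_alert_request are hypothetical names.

```python
import json
import urllib.request
from typing import Tuple

MAX_BODY_BYTES = 10240  # stored body limit stated in the text above


def encode_alert_body(details: dict) -> Tuple[bytes, str]:
    """Return (body, content type), staying within MAX_BODY_BYTES."""
    body = json.dumps(details).encode("utf-8")
    if len(body) <= MAX_BODY_BYTES:
        return body, "application/json"
    # Assumption: rather than rely on unspecified server behavior for
    # oversized bodies, fall back to a short plain-text note.
    note = f"alert details omitted: JSON body was {len(body)} bytes"
    return note.encode("utf-8"), "text/plain"


def build_alert_request(trigger_id: str, api_key: str, details: dict) -> urllib.request.Request:
    """Build (but do not send) an alert request carrying the encoded body."""
    body, content_type = encode_alert_body(details)
    return urllib.request.Request(
        url=f"https://api.heiioncall.com/triggers/{trigger_id}/alert",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )
```

The request can then be sent with urllib.request.urlopen(request, timeout=15), with the timeout chosen to suit your application.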

Timeouts and Retries

Adding alerts or checkins to your application code is subject to the same concerns as adding any other network request into your code: those network requests may fail, or may take an arbitrarily long time to complete. These failures could be caused by any combination of: issues on our end, issues with our hosting providers, issues with your hosting providers, or issues with the network connections between the two.

Make sure you set an appropriate timeout for DNS lookups, TCP connection, and the overall request to protect your application code from being blocked. For example:

curl -X POST \
  --retry 5 --retry-all-errors --retry-max-time 15 \
  -H 'Authorization: Bearer {API_KEY}' \
  https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert

The --retry 5 --retry-all-errors --retry-max-time 15 options ensure that the request will be retried on failure, and that curl stops retrying after no more than 15 seconds. To also cap how long each individual attempt may take, add --max-time. In Python, consider requests.post(..., timeout=15). In Ruby, consider: Faraday.post(...) { |request| request.options.timeout = 15 }.
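
For application code that needs both retries and an overall time budget, the idea can be sketched in Python as follows. This loosely mirrors the curl options above rather than reproducing them exactly; call_with_retries and send_alert_once are hypothetical names, and the one-second pause between attempts is an assumption. Note that urllib's timeout applies to each blocking socket operation, so the overall bound is approximate.

```python
import time
import urllib.request
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    attempt: Callable[[float], T],
    max_attempts: int = 5,
    max_total_seconds: float = 15.0,
    pause_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attempt(timeout) until it succeeds, retrying on OSError.

    Each attempt receives the remaining time budget as its timeout.
    Gives up once max_attempts or max_total_seconds is exhausted.
    """
    deadline = clock() + max_total_seconds
    last_error: Optional[BaseException] = None
    for attempt_number in range(1, max_attempts + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            return attempt(remaining)
        except OSError as error:  # includes URLError, HTTPError, and timeouts
            last_error = error
        if attempt_number < max_attempts:
            sleep(min(pause_seconds, max(0.0, deadline - clock())))
    raise RuntimeError("request did not succeed within the retry budget") from last_error


def send_alert_once(trigger_id: str, api_key: str, timeout: float) -> int:
    """Make a single alert attempt and return the HTTP status code."""
    request = urllib.request.Request(
        f"https://api.heiioncall.com/triggers/{trigger_id}/alert",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A caller would combine the two as call_with_retries(lambda timeout: send_alert_once(trigger_id, api_key, timeout)). Because the alert endpoint is idempotent, retrying it is safe.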

API Keys

API Keys are used when programmatically transitioning trigger states. You may create as many keys as you wish for each organization. We recommend using a different key for each external service you connect to Heii On-Call, but it's okay to start with just one.

An API Key is a string of the form: heii_{{OPAQUE_SECURE_TOKEN}} and should be stored securely.

You may use the test token heii_TEST_{{OPAQUE_SECURE_TOKEN}} in testing your integration. This test token will verify connectivity to our backend, but will not modify any Triggers. You may create a test-only API Key for use in your continuous integration tests, for example.
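
One way to keep keys out of source code is to read them from the environment. The sketch below is a minimal example of that approach; the environment variable name HEII_ONCALL_API_KEY and both function names are assumptions for illustration, and the prefix checks rely only on the key formats described above.

```python
import os

KEY_PREFIX = "heii_"
TEST_KEY_PREFIX = "heii_TEST_"


def load_api_key(env_var: str = "HEII_ONCALL_API_KEY") -> str:
    """Read the API key from the environment and sanity-check its prefix."""
    key = os.environ.get(env_var, "")
    if not key.startswith(KEY_PREFIX):
        raise RuntimeError(f"{env_var} is not set to a heii_ API key")
    return key


def is_test_key(api_key: str) -> bool:
    """True for test tokens, which verify connectivity without modifying triggers."""
    return api_key.startswith(TEST_KEY_PREFIX)
```

A continuous integration suite could assert is_test_key(load_api_key()) before running, to avoid paging anyone from a test run.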


Services

Services represent a single logical subsystem. While a trigger represents the particular reason why a subsystem is broken (e.g. API health check endpoint timeout), a service is the subsystem that is broken (e.g. API).

Many teams start with just a single service, something like Application, and then down the line split responsibilities across smaller services like Frontend and Backend. Triggers belong to services, so when a trigger transitions to alerting, it means there is something wrong with the service.

You can have as many services as you want in Heii On-Call. In practice, most organizations find that two to five services is enough to split out their major subsystems.


Rotations

Rotations represent a group of people that, together, are responsible for one or more services.

Rotations are specified in an easy-to-edit text format: one shift specification per line, with each line in EMAIL, DURATION format. This makes editing a rotation as easy as editing lines in a text file.

Typically, rotations map 1:1 to engineering teams in your organization, and every member of the team rotates through on-call responsibility on a regular cadence.

Small teams can almost always get away with having a single Default rotation. Heii On-Call offers you the flexibility to set up your rotations in a way that works for your team right now, and will evolve with you as you grow.

Duration Specifier

DURATION defines when a user's shift ends, and is specified in a human-readable format as either:

Here are examples of how you can specify duration:

In practice, most teams have on-call rotations that are about 1 week long and are anchored to a specific handoff time on a workday, which is why our default DURATION is until Mon 9:00am PT.

Unless a from clause is specified, a shift starts when the last shift ends. (An explicit from can be used to intentionally create a gap in the schedule.)

Example Rotations

A rotation with three people, each on call for one week and handing off to the next person on the following Monday morning at 9:00am Pacific Time (PT is a shortcut for America/Los_Angeles):

alice@example.com, until Mon 9:00am PT
bob@example.com, until Mon 9:00am PT
carol@example.com, until Mon 9:00am PT

A day and night rotation, with Alice on call every day from 7:30am to 7:30pm Eastern Time (ET is a shortcut for America/New_York), and Bob on call overnight:

alice@example.com, until 7:30pm ET
bob@example.com, until 7:30am ET

A weekday-only, daytime-only rotation, with nobody set to be paged overnight or on weekends (possibly useful for dev/staging environments or other non-production systems):

alice@example.com, from Mon 9:00am PT until Mon 5:00pm PT
bob@example.com, from Tue 9:00am PT until Tue 5:00pm PT
carol@example.com, from Wed 9:00am PT until Wed 5:00pm PT
darlene@example.com, from Thu 9:00am PT until Thu 5:00pm PT
erin@example.com, from Fri 9:00am PT until Fri 5:00pm PT
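
To illustrate the EMAIL, DURATION line format, the following sketch splits rotation text into its parts. It recognizes only the "until ..." and "from ... until ..." forms shown in the examples above and keeps any other DURATION as raw text. It does not interpret days, times, or time zones, and it is not Heii On-Call's own parser. ShiftSpec and parse_rotation are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches "until X" or "from X until Y"; other DURATION forms are left unparsed.
_DURATION = re.compile(r"^(?:from\s+(?P<start>.+?)\s+)?until\s+(?P<until>.+)$")


@dataclass
class ShiftSpec:
    email: str
    duration: str          # raw DURATION text
    start: Optional[str]   # text after "from", if present
    until: Optional[str]   # text after "until", if present


def parse_rotation(text: str) -> List[ShiftSpec]:
    """Parse one shift specification per non-blank line."""
    specs: List[ShiftSpec] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        email, separator, duration = line.partition(",")
        if not separator:
            raise ValueError(f"expected 'EMAIL, DURATION': {line!r}")
        duration = duration.strip()
        match = _DURATION.match(duration)
        specs.append(
            ShiftSpec(
                email=email.strip(),
                duration=duration,
                start=match.group("start") if match else None,
                until=match.group("until") if match else None,
            )
        )
    return specs
```

For the line "alice@example.com, from Mon 9:00am PT until Mon 5:00pm PT", this yields start "Mon 9:00am PT" and until "Mon 5:00pm PT".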


Shifts

A shift is a time window during which a specific person is responsible for being on-call for a rotation.

Shifts are created automatically from your rotation. Heii On-Call will always create shifts covering at least two weeks into the future. If you wish to populate more shifts in advance, simply use the Add Full Rotation button, which adds one full rotation's worth of shifts into the future.

If something comes up and someone needs to trade, simply edit the shift and assign a different user. If something comes along and ruins your carefully laid plans, you can simply delete all shifts from some point in time, and re-add shifts as necessary.

Heii On-Call lets you easily move blocks of shifts, shrink shifts, and otherwise edit the upcoming schedule as things come up. The system tries to do the right thing when everything is going according to plan, but lets you edit things as much as you want when reality strikes.

Heii On-Call will only automatically schedule future shifts after any existing scheduled shifts end. It will never change the existing set of shifts, or resolve any gaps between existing shifts, without your explicit action.

Notifications

When a trigger transitions from resolved to alerting, the user who is currently on call in the Rotation that is responsible for the Service will be notified based on their configured Notification Channels. Heii On-Call supports numerous notification channels; however, we highly recommend that each user configure the Heii On-Call mobile app for iOS or Android as their primary notification channel to ensure timely delivery of notifications and the ability to deliver critical alerts. For redundancy, we recommend that each user also configure at least one other channel, with a suggested delay of 5-15 minutes, in case the user can't be reached through their primary notification channel.

Follow-Up Notifications

While a trigger remains in the alerting state, Heii On-Call will continue to send notifications to the individual on call. The interval between these notifications starts at 5 minutes, and becomes longer after significant time has elapsed and the trigger is still alerting. The expectation is that a user who receives a notification will acknowledge or resolve the trigger to prevent being notified again.

In some cases you may want this notification interval to be longer. In this case you can set the minimum notification interval parameter on a trigger to a duration such as 12 hours. This means that Heii On-Call will only send a follow-up notification at most every 12 hours. The interval for follow-up notifications may be longer than what is set in this parameter after significant time has elapsed and the trigger remains unacknowledged.

The minimum notification interval parameter only applies to follow-up notifications sent to the same user. If a different user is to be notified because of a shift change or an escalation strategy, they will get their first notification immediately.

Critical Alerts

By default, notifications on the Heii On-Call mobile app break through Do Not Disturb settings and always make an audible sound, regardless of mute or volume settings. This behavior can be configured on a per-trigger basis by unchecking the Critical Alert on Mobile Apps? checkbox when creating or editing a trigger. When a trigger is not set to critical, a notification will still be delivered to the mobile device, but it will respect the current notification and volume settings on the mobile device.


Escalations

When a trigger transitions from resolved to alerting, the individual currently on-call is immediately alerted using their primary notification channel. The individual on-call will continue to be notified on their primary channel until the trigger is acknowledged or resolved.

In addition to continuing to ping the on-call individual, you can optionally set one or more Escalation Strategies for each rotation. Escalation strategies define one or more secondary on-call individuals who will receive an alert after a configurable amount of time if the primary on-call has not acknowledged the page.

Heii On-Call has multiple types of Escalation Strategies to fit different team workflows.

You can add multiple escalation strategies to alert after different amounts of time to create an escalation ladder. A common setup is to first escalate to the person previously on-call after 15 minutes, followed by the entire rotation after 30 minutes, and finally a single user who is typically a manager (or a founder who has not learned to let go) after 45 minutes.

Escalation strategies only continue to run while the Trigger is still alerting. As soon as any user sets the Trigger to acknowledged, no further notifications will be sent.
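
The escalation ladder described above can be pictured as a list of (delay, target) steps that apply only while the trigger is still alerting. The sketch below models the common 15/30/45-minute setup. It is purely illustrative, with hypothetical names (EscalationStep, escalations_due) and free-text targets, and says nothing about how Heii On-Call schedules escalations internally.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the trigger started alerting
    target: str         # free-text description of who is notified


# The "common setup" from the text above.
LADDER = [
    EscalationStep(15, "person previously on-call"),
    EscalationStep(30, "entire rotation"),
    EscalationStep(45, "single user (e.g. a manager)"),
]


def escalations_due(
    ladder: Sequence[EscalationStep], minutes_alerting: int, still_alerting: bool
) -> List[str]:
    """Return the targets whose escalation delay has elapsed.

    Once the trigger is acknowledged or resolved (still_alerting is
    False), no escalation applies.
    """
    if not still_alerting:
        return []
    ordered = sorted(ladder, key=lambda step: step.after_minutes)
    return [step.target for step in ordered if minutes_alerting >= step.after_minutes]


print(escalations_due(LADDER, 20, still_alerting=True))   # ['person previously on-call']
print(escalations_due(LADDER, 50, still_alerting=False))  # []
```

Twenty minutes into an unacknowledged alert, only the first step has fired; acknowledging the trigger at any point stops the ladder.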

In some cases you may not want a particular Trigger to follow the escalation path of the Rotation. This is often the case for non-critical alerts where it is okay if the individual on-call does not acknowledge the issue promptly. You can uncheck the escalate to secondary on-call? checkbox on a Trigger to prevent escalation strategies from running for specific Triggers.

You can add escalation strategies by clicking on the Actions menu on a Rotation. A single Rotation can have multiple escalation strategies.


Integrating with Heii On-Call

Heii On-Call uses simple webhooks (just a fancy name for an HTTP POST request) to trigger alerts. This means it is trivial to send alerts from your own application code, as well as the vast majority of third-party monitoring and crash reporting solutions.

Here are instructions on how to send webhooks from:

You're not limited to the list above. Simply search for "webhook" in the documentation of your monitoring solution of choice.

If you can't find a way to integrate Heii On-Call with your third-party system, send us an email at heii@heiioncall.com. We would be happy to see if we can find a good way to integrate.


tl;dr