If you can't find the answer to what you are trying to do here, please send us an email at heii@heiioncall.com. We're happy to help!
A trigger defines a specific condition that can cause an alert to be sent to the current on-call individual for a service. Typically, each trigger corresponds to a specific thing that may go wrong in your system, such as "Database CPU load is greater than 90%" or "95th percentile web request latency is greater than 1s".
Triggers have three possible states:
resolved: The state of a trigger when everything is OK. (This is the default state.)
alerting: When a trigger transitions to alerting, the currently on-call individual is notified via their preferred notification channel.
acknowledged: Once the on-call individual sees the alert and indicates that they are responding, the trigger transitions to acknowledged.
After the underlying issue is fixed, the user is responsible for ensuring the trigger is marked as resolved so that it can fire again in the future.
Heii On-Call supports three types of triggers: manual triggers, inbound liveness triggers, and outbound probe triggers.
Manual triggers can be triggered via an API call or manually from the web UI.
The most common way to trigger an alert is via an API call. This is the easiest way to set up an integration with your backend or your existing monitoring solution. To trigger an alert, simply perform an HTTP POST request to https://api.heiioncall.com/triggers/{TRIGGER_ID}/alert, and be sure to include the Authorization: Bearer {API_KEY} header. Try the curl examples below:
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/acknowledge
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/resolve
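The same three calls can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names build_trigger_request and send_request are hypothetical, and the host is taken from the prose above.

```python
import urllib.request

API_BASE = "https://api.heiioncall.com"  # host as written in the prose above
ACTIONS = {"alert", "acknowledge", "resolve"}


def build_trigger_request(trigger_id: str, action: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one trigger action."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return urllib.request.Request(
        f"{API_BASE}/triggers/{trigger_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def send_request(request: urllib.request.Request, timeout: float = 15.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the integration easy to unit test without touching the network.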
If your integration modality does not allow you to send the Authorization: Bearer header, then you can include the API key as the token URL parameter. This is not recommended, but it is available.
curl -X POST 'https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert?token={API_KEY}'
These endpoints are idempotent, so it is safe to call them multiple times. Furthermore, if a trigger is already in the acknowledged state, calling the alert endpoint again will have no effect.
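One way to picture these rules is as a small transition table. The sketch below is an illustrative model only, not Heii On-Call's implementation; the names TRANSITIONS and next_state are hypothetical, and the entry for calling acknowledge on a resolved trigger is an assumption, since the text does not cover that case.

```python
# (current state, endpoint called) -> resulting state
TRANSITIONS = {
    ("resolved", "alert"): "alerting",
    ("alerting", "alert"): "alerting",            # repeated call, no change
    ("acknowledged", "alert"): "acknowledged",    # no effect, per the text above
    ("alerting", "acknowledge"): "acknowledged",
    ("acknowledged", "acknowledge"): "acknowledged",
    ("resolved", "acknowledge"): "resolved",      # ASSUMPTION: not stated in the docs
    ("resolved", "resolve"): "resolved",
    ("alerting", "resolve"): "resolved",
    ("acknowledged", "resolve"): "resolved",
}


def next_state(state: str, endpoint: str) -> str:
    """Return the state a trigger would be in after calling `endpoint`."""
    return TRANSITIONS[(state, endpoint)]
```

Idempotency here means that calling the same endpoint twice in a row leaves the trigger in the same state as calling it once.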
Most integrations only need the alert or resolve endpoints. The expectation is that the individual responding will acknowledge the incident directly from the Heii On-Call UI, not via the API. But if you want to acknowledge through the API, we won't stop you.
Heii On-Call also has an inbound liveness trigger type. This trigger type will automatically transition to alerting if no external service has checked in within a preconfigured time period. Inbound liveness triggers are great for making sure that cron jobs are still running, for example.
To programmatically check in on an inbound liveness trigger, simply perform an HTTP POST to https://api.heiioncall.com/triggers/{TRIGGER_ID}/checkin and set the Authorization: Bearer {API_KEY} header.
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/checkin
Calling checkin will automatically transition the trigger to resolved.
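For a scheduled job, a common pattern is to check in only after the job has finished successfully, so that a crashed or stuck job eventually lets the trigger alert. The sketch below assumes that pattern; post_checkin and run_job_with_checkin are hypothetical names, and the check-in is passed in as a callable so the wrapper can be tested without network access.

```python
import urllib.request
from typing import Callable

CHECKIN_URL = "https://api.heiioncall.com/triggers/{trigger_id}/checkin"


def post_checkin(trigger_id: str, api_key: str, timeout: float = 15.0) -> int:
    """POST a check-in for an inbound liveness trigger; returns the HTTP status."""
    request = urllib.request.Request(
        CHECKIN_URL.format(trigger_id=trigger_id),
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_job_with_checkin(job: Callable[[], None], checkin: Callable[[], object]) -> None:
    """Run the job, then check in. If the job raises, the check-in is skipped."""
    job()
    checkin()
```

In a cron script this might be called as run_job_with_checkin(nightly_backup, lambda: post_checkin(TRIGGER_ID, API_KEY)), where nightly_backup, TRIGGER_ID, and API_KEY are placeholders for your own values.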
An outbound probe trigger uses Heii On-Call's servers to continuously monitor the availability and uptime of your HTTP-exposed services.
We periodically send a HEAD request to your specified HTTP(S) endpoint, such as your website's homepage, or an API health check endpoint. (Currently, this probe is sent once per minute, but the probe interval may be adaptive in the future.) By default, a timely HTTP response of 200 OK or 204 No Content is considered up. Any other result is considered down. (Redirects will not be followed, so be sure to specify your URL precisely.) Any issues at the DNS, TCP, TLS, or HTTP layers will also be considered down, and the details are shown on the trigger page.
Any up result will transition the trigger to resolved. The outbound probe trigger's timeout configures how long your endpoint must be continuously down before the trigger will automatically transition to alerting. This timeout represents a tradeoff between alert latency and potential false positives, such as intermittent connectivity problems that resolve without intervention. A longer timeout gives more opportunity for any intermittent glitches to resolve themselves before alerting. A shorter timeout brings a real, actionable outage to your attention a few minutes sooner. We recommend an absolute minimum of 3 minutes, and a typical timeout of 5 minutes to 15 minutes for most applications.
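The interaction between once-per-minute probes and the timeout can be illustrated with a small simulation. This is only a sketch of the behavior described above, not Heii On-Call's implementation: the function name first_alert_minute is hypothetical, and measuring the down period from the first failed probe is an assumption made for illustration.

```python
from typing import Iterable, Optional


def first_alert_minute(results: Iterable[bool], timeout_minutes: int) -> Optional[int]:
    """Given one probe result per minute (True = up), return the minute index
    at which the trigger would start alerting, or None if it never does.

    ASSUMPTION: the endpoint counts as continuously down from its first
    failed probe until the next successful probe.
    """
    down_since = None
    for minute, is_up in enumerate(results):
        if is_up:
            down_since = None  # any up result resets the clock
            continue
        if down_since is None:
            down_since = minute
        if minute - down_since >= timeout_minutes:
            return minute
    return None
```

With a 3 minute timeout, a two-minute blip followed by recovery never alerts, while an outage that persists for three minutes after the first failed probe does.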
Note that we use a HEAD request to query your endpoint while economizing on bandwidth usage. Per RFC 7231 section 4.3.2, "The HEAD method is identical to GET except that the server MUST NOT send a message body in the response", and this is "often used for testing hypertext links for validity, accessibility, and recent modification", such as our outbound probes.
If you're writing a health check endpoint for use with Heii On-Call, simply make an ordinary GET endpoint which returns 200 OK or 204 No Content on success, and returns any other status code on failure. We have confirmed that for many popular web frameworks (including Ruby on Rails, Sinatra, Express.js, Django, Flask, Kemal), the full GET handler will be run by a HEAD request.
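As a rough illustration, a health check endpoint can be written with nothing but Python's standard library. Unlike the frameworks listed above, http.server does not route HEAD requests to the GET handler, so this sketch defines both. The names system_is_healthy, health_status, and HealthHandler are hypothetical, and 503 is just one possible failure status.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def system_is_healthy() -> bool:
    """Hypothetical placeholder: replace with real checks (database ping, etc.)."""
    return True


def health_status() -> int:
    """204 No Content when healthy; any other status is treated as down."""
    return 204 if system_is_healthy() else 503


class HealthHandler(BaseHTTPRequestHandler):
    def _respond(self) -> None:
        self.send_response(health_status())
        self.end_headers()  # no body is sent for either method

    def do_GET(self) -> None:
        self._respond()

    def do_HEAD(self) -> None:
        # http.server needs an explicit HEAD handler; run the same check as GET
        self._respond()


def make_server(host: str = "127.0.0.1", port: int = 8000) -> HTTPServer:
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), HealthHandler)
```

Running the same check for GET and HEAD keeps the probe's view of the service consistent with what you see when testing the endpoint by hand.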
Optionally, you may define additional response criteria required for a response to be considered up: you may specify an expected status code such as 301, or you may add a space and a URL to also assert the value of the Location response header, for example 301 https://example.com/redirect-target. This feature is typically used for monitoring HTTP redirects.
You may optionally add a small HTTP body to your /triggers/{TRIGGER_ID}/alert request, which transitions a trigger to alerting. Heii On-Call will store the body (up to 10240 bytes) and display it on the trigger page when a trigger is alerting. This is especially useful when integrating with third-party monitoring solutions to include metadata about the triggering alert, such as a direct link to the system that generated the alert.
You can add data as JSON:
curl -X POST \
  -H 'Authorization: Bearer {API_KEY}' \
  -H 'Content-Type: application/json' \
  -d '{"host":"server1", "status":"onfire"}' \
  https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
or form-encoded:
curl -X POST \
  -H 'Authorization: Bearer {API_KEY}' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'host=server1&status=onfire' \
  https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
or as good old plain text:
curl -X POST \
  -H 'Authorization: Bearer {API_KEY}' \
  -H 'Content-Type: text/plain' \
  -d 'server 1 is on fire' \
  https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
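The JSON variant might look like this in Python. This is a hedged sketch: build_alert_request is a hypothetical name, and refusing to send a body larger than 10240 bytes is a local design choice made here, since the text above only says that up to 10240 bytes are stored.

```python
import json
import urllib.request

MAX_BODY_BYTES = 10240  # storage limit quoted in the text above


def build_alert_request(trigger_id: str, api_key: str, details: dict) -> urllib.request.Request:
    """Build (but do not send) an alert request carrying a JSON body."""
    body = json.dumps(details).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        # ASSUMPTION: fail locally rather than send more than would be stored
        raise ValueError(f"body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    return urllib.request.Request(
        f"https://api.heiioncall.com/triggers/{trigger_id}/alert",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The returned request can be passed to urllib.request.urlopen with a timeout when you are ready to send it.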
This flexibility should give you plenty of room to integrate with your favorite monitoring solutions. As always, we suggest starting simple. The simplest is to skip the body. Beyond that, most of the time, something as quick as: "Alert from {Your Monitoring Service}" is enough to help the on-call person quickly track down the issue, and you can add complexity down the line.
Adding alerts or checkins to your application code is subject to the same concerns as adding any other network request into your code: those network requests may fail, or may take an arbitrarily long time to complete. These failures could be caused by any combination of: issues on our end, issues with our hosting providers, issues with your hosting providers, or issues with the network connections between the two.
Make sure you set an appropriate timeout for DNS lookups, TCP connection, and the overall request to protect your application code from being blocked. For example:
curl -X POST \
  --retry 5 --retry-all-errors --retry-max-time 15 \
  -H 'Authorization: Bearer {API_KEY}' \
  https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
The --retry 5 --retry-all-errors --retry-max-time 15 options ensure that the request will be retried on failure, and that no further retries are started once 15 seconds have elapsed. (A single attempt can still run longer than that; curl's --max-time option caps the duration of each attempt.) In Python, consider requests.post(..., timeout=15). In Ruby, consider: Faraday.post(...) { |request| request.options.timeout = 15 }.
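A retry loop with an overall time budget, loosely mirroring the curl flags above, could be sketched as follows. The names post_with_retries and send_alert_once are hypothetical, the exponential backoff is an illustrative choice, and the sender is injected so the retry logic can be exercised without network access.

```python
import time
import urllib.request
from typing import Callable


def send_alert_once(trigger_id: str, api_key: str, timeout: float) -> int:
    """One attempt. Note that urllib's timeout applies to each blocking
    socket operation, not to the request as a whole."""
    request = urllib.request.Request(
        f"https://api.heiioncall.com/triggers/{trigger_id}/alert",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def post_with_retries(
    send_once: Callable[[float], int],
    retries: int = 5,
    max_total_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call send_once(timeout) until it returns a 2xx status, the retries
    are used up, or the overall time budget is spent."""
    deadline = clock() + max_total_seconds
    for attempt in range(retries + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        try:
            if 200 <= send_once(remaining) < 300:
                return True
        except OSError:
            pass  # URLError, HTTPError, and socket timeouts are OSError subclasses
        if attempt < retries:
            # Back off, but never sleep past the deadline
            sleep(min(2 ** attempt, max(0.0, deadline - clock())))
    return False
```

A caller might use post_with_retries(lambda t: send_alert_once(TRIGGER_ID, API_KEY, t)), with TRIGGER_ID and API_KEY standing in for your own values, and treat a False result as something to log rather than as a reason to crash.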
API Keys are used when programmatically transitioning trigger states. You may create as many keys as you wish for each organization. We recommend using a different key for each external service you connect to Heii On-Call, but it's okay to start with just one.
An API Key is a string of the form heii_{{OPAQUE_SECURE_TOKEN}} and should be stored securely.
You may use the test token heii_TEST_{{OPAQUE_SECURE_TOKEN}} in testing your integration. This test token will verify connectivity to our backend, but will not modify any Triggers. You may create a test-only API Key for use in your continuous integration tests, for example.
Services represent a single logical subsystem. While a trigger represents the particular reason why a subsystem is broken (e.g. "API health check endpoint timeout"), a service is the subsystem that is broken (e.g. "API").
Many teams start with just a single service, something like Application, and then down the line split responsibilities across smaller services like Frontend and Backend. Triggers belong to services, so when a trigger transitions to alerting, it means there is something wrong with the service.
You can have as many services as you want in Heii On-Call. In practice, most organizations find that two to five services is enough to split out their major subsystems.
Rotations represent a group of people that, together, are responsible for one or more services.
Rotations are specified in an easy-to-edit text format: one shift specification per line, with each line in EMAIL, DURATION format. This makes editing a rotation as easy as editing lines in a text file.
Typically, rotations will map 1:1 to engineering teams in your organization, and every member of the team rotates through on-call responsibility on a regular cadence.
Small teams can almost always get away with having a single Default rotation. Heii On-Call offers you the flexibility to set up your rotations in a way that works for your team right now, and will evolve with you as you grow.
DURATION defines when a user's shift ends, and is specified in a human-readable format as one of:
until [DATE_SPECIFIER] TIME TIMEZONE, or
from [DATE_SPECIFIER] TIME TIMEZONE until [DATE_SPECIFIER] TIME TIMEZONE, or
for NUMBER TIME_UNIT
For example:
until Mon 9:00am PT (default)
until 10:30pm Asia/Tokyo
until 1st Fri of the month at 2:00pm ET
from 9:00am CT until 5:00pm CT
for 7 days
for 12 hours
In practice, most teams have on-call rotations that are about 1 week long and are anchored to a specific handoff time on a workday, which is why our default DURATION is until Mon 9:00am PT.
Unless a from clause is specified, a shift starts when the last shift ends. (An explicit from can be used to intentionally create a gap in the schedule.)
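To make the until and for forms concrete, here is a sketch of how a shift's end time could be computed from its start. It is not Heii On-Call's scheduling code: the function names are hypothetical, only the simple weekday form of until is covered, and treating a shift that starts exactly at the handoff time as ending one week later is an assumption.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def shift_end_until(start: datetime, weekday: int, hour: int, minute: int, tz: tzinfo) -> datetime:
    """End of a shift such as 'until Mon 9:00am PT' (weekday: Monday=0).

    `start` must be timezone-aware.
    ASSUMPTION: the shift ends at the first matching time strictly after `start`.
    """
    local = start.astimezone(tz)
    candidate = local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - local.weekday()) % 7)
    if candidate <= local:
        candidate += timedelta(days=7)
    return candidate


def shift_end_for(start: datetime, days: int = 0, hours: int = 0) -> datetime:
    """End of a shift such as 'for 7 days' or 'for 12 hours'."""
    return start + timedelta(days=days, hours=hours)
```

On Python 3.9 or later, zoneinfo.ZoneInfo("America/Los_Angeles") can be passed as tz so that the handoff stays at 9:00am local time across daylight saving changes.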
A weekly rotation among three people, handing off on Mondays at 9:00am Pacific time (PT is a shortcut for America/Los_Angeles):
alice@example.com, until Mon 9:00am PT
bob@example.com, until Mon 9:00am PT
carol@example.com, until Mon 9:00am PT
Alice on call during the day, Eastern time (ET is a shortcut for America/New_York), and Bob on call overnight:
alice@example.com, until 7:30pm ET
bob@example.com, until 7:30am ET
alice@example.com, from Mon 9:00am PT until Mon 5:00pm PT
bob@example.com, from Tue 9:00am PT until Tue 5:00pm PT
carol@example.com, from Wed 9:00am PT until Wed 5:00pm PT
darlene@example.com, from Thu 9:00am PT until Thu 5:00pm PT
erin@example.com, from Fri 9:00am PT until Fri 5:00pm PT
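Because the rotation format is one EMAIL, DURATION pair per line, it is straightforward to read or generate from a script. The parser below is a minimal sketch with a hypothetical name, parse_rotation. It only splits each line into its two parts and does not interpret the DURATION, and it assumes both parts are always present.

```python
from typing import List, Tuple


def parse_rotation(text: str) -> List[Tuple[str, str]]:
    """Split rotation text into (email, duration) pairs, skipping blank lines."""
    shifts = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        email, separator, duration = line.partition(",")
        email, duration = email.strip(), duration.strip()
        # ASSUMPTION: every line carries both an email and a duration
        if not separator or "@" not in email or not duration:
            raise ValueError(f"line {line_number}: expected 'EMAIL, DURATION', got {raw!r}")
        shifts.append((email, duration))
    return shifts
```

Splitting on the first comma only keeps the DURATION text intact, whatever it contains.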
A shift is a time window during which a specific person is responsible for being on-call for a rotation.
Shifts are created automatically from your rotation. Heii On-Call will always create shifts covering at least two weeks into the future. If you wish to populate more shifts in advance, you can simply use the Add Full Rotation button. This will add one full rotation's worth of shifts into the future.
If something comes up and someone needs to trade, simply edit the shift and assign a different user. If something comes along and ruins your carefully laid plans, you can simply delete all shifts from some point in time, and re-add shifts as necessary.
Heii On-Call lets you easily move blocks of shifts, shrink shifts, and otherwise edit the upcoming schedule as things come up. The system tries to do the right thing when everything is going according to plan, but lets you edit things as much as you want when reality strikes.
When a trigger transitions from resolved to alerting, the individual currently on-call is immediately alerted using their primary notification channel. The individual on-call will continue to be notified on their primary channel until the trigger is acknowledged or resolved.
In addition to continuing to ping the on-call individual, you can optionally set one or more Escalation Strategies for each rotation. Escalation strategies can be set to alert someone else after a configurable number of minutes if the primary on-call has not acknowledged the page.
You can add escalation strategies by clicking on the Actions menu on a Rotation. A single Rotation can have multiple escalation strategies.
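The escalation behavior described above can be modeled in a few lines. This is an illustrative sketch rather than a description of Heii On-Call's internals: people_to_notify is a hypothetical name, and representing each escalation strategy as a (minutes, contact) pair is an assumption.

```python
from typing import List, Sequence, Tuple


def people_to_notify(
    minutes_since_alert: int,
    acknowledged: bool,
    primary: str,
    escalations: Sequence[Tuple[int, str]],
) -> List[str]:
    """Return who should currently be notified for an alerting trigger.

    escalations: (after_minutes, contact) pairs, one per escalation strategy.
    """
    if acknowledged:
        return []  # notifications stop once the page is acknowledged
    notify = [primary]  # the primary on-call keeps being pinged
    for after_minutes, contact in sorted(escalations):
        if minutes_since_alert >= after_minutes:
            notify.append(contact)
    return notify
```

For example, with strategies at 10 and 30 minutes, only the primary is paged at first, the second contact joins after 10 minutes, and the third after 30.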
Heii On-Call uses simple webhooks (just a fancy name for an HTTP POST request) to trigger alerts. This means it is trivial to send alerts from your own application code, as well as the vast majority of third-party monitoring and crash reporting solutions.
Here are instructions on how to send webhooks from:
You're not limited to the list above. Simply search for "webhook" in the documentation of your monitoring solution of choice.
If you can't find a way to integrate Heii On-Call with your third-party system, send us an email at heii@heiioncall.com. We would be happy to see if we can find a good way to integrate.
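A typical starting point is one Default Rotation and one Application Service.
When a Trigger transitions to alerting, the individual with the current shift on the rotation responsible for that service receives an alert notification from Heii On-Call.
(A Trigger can also transition to alerting when it does not receive a checkin POST request before timeout, in the case of inbound liveness Triggers.)
(Or when your endpoint has been continuously down for the timeout, in the case of outbound probe Triggers.)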