If you can't find the answer to what you are trying to do here, please send us an email at heii@heiioncall.com. We're happy to help!
Triggers define a specific condition that can cause an alert to be sent to the current on-call individual for a service. Typically, triggers will be specific concerns that may go wrong in your system, such as "Database CPU load is greater than 90%" or "95th percentile web request latency is greater than 1s".
Triggers have three possible states:
resolved: This is the state of a trigger when everything is ok. (This is the default state.)
alerting: When a trigger transitions to alerting, the currently on-call individual is notified via their preferred notification channel.
acknowledged: Once the on-call individual sees the alert and indicates that they are responding, the trigger is transitioned to acknowledged.
After the underlying issue is fixed, the user is responsible for ensuring the trigger is marked as resolved so that it can fire again in the future.
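The three states can be pictured as a small state machine. The following is a minimal Python sketch for illustration only: TriggerState, TRANSITIONS, and apply are hypothetical names rather than part of the Heii On-Call API, and the transition table is inferred from the descriptions in this document (any action not listed is assumed to leave the state unchanged).

```python
from enum import Enum


class TriggerState(Enum):
    RESOLVED = "resolved"          # default state: everything is ok
    ALERTING = "alerting"          # the on-call individual is being notified
    ACKNOWLEDGED = "acknowledged"  # someone has indicated they are responding


# Hypothetical transition table inferred from the text, not an official spec.
# Keys are (current state, action); a missing key means "no change".
TRANSITIONS = {
    (TriggerState.RESOLVED, "alert"): TriggerState.ALERTING,
    (TriggerState.ALERTING, "acknowledge"): TriggerState.ACKNOWLEDGED,
    (TriggerState.ALERTING, "resolve"): TriggerState.RESOLVED,
    (TriggerState.ACKNOWLEDGED, "resolve"): TriggerState.RESOLVED,
}


def apply(state: TriggerState, action: str) -> TriggerState:
    """Return the next state, or the same state if the action has no effect."""
    return TRANSITIONS.get((state, action), state)
```

In this model a trigger must return to resolved before a new alert can notify anyone, which is why marking it resolved after the fix matters.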
Heii On-Call supports three types of triggers: manual triggers, inbound liveness triggers, and outbound probe triggers.
Manual triggers can be triggered via an API call or manually from the web UI.
The most common way to trigger an alert is via an API call. This is the easiest way to set up an integration with your backend or your existing monitoring solution. To trigger an alert, perform an HTTP POST request to https://api.heiioncall.com/triggers/{TRIGGER_ID}/alert, and be sure to include the Authorization: Bearer {API_KEY} header. Try the curl examples below:
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/acknowledge
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/resolve
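If you would rather call these endpoints from application code than shell out to curl, the same requests can be built with Python's standard library. This is a sketch under stated assumptions, not an official client: build_trigger_request and send are hypothetical helper names, and the base URL is the one written in the prose above.

```python
import urllib.request

API_BASE = "https://api.heiioncall.com"  # base URL as written in this document


def build_trigger_request(trigger_id: str, action: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for alert, acknowledge, or resolve."""
    if action not in ("alert", "acknowledge", "resolve"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        f"{API_BASE}/triggers/{trigger_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def send(request: urllib.request.Request, timeout: float = 15.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, send(build_trigger_request(trigger_id, "alert", api_key)) would perform the same call as the first curl command.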
If your integration does not allow you to send the Authorization: Bearer header, you can instead include the token as the token URL parameter. This is not recommended, but it is available.
curl -X POST https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert?token={API_KEY}
These endpoints are idempotent, so it is safe to call them multiple times. Furthermore, if a trigger is already in the acknowledged state, calling the alert endpoint again will have no effect.
Most integrations only need the alert or resolve endpoints. The expectation is that the individual responding will acknowledge the incident directly from the Heii On-Call UI, not via the API. But if you want to acknowledge through the API, we won't stop you.
Heii On-Call also has an inbound liveness trigger type. This trigger type will automatically transition to alerting if no external service has checked in within a preconfigured time period. Inbound liveness triggers are great for making sure that cron jobs are still running, for example.
To programmatically check in on an inbound liveness trigger, perform an HTTP POST to https://api.heiioncall.com/triggers/{TRIGGER_ID}/checkin and set the Authorization: Bearer {API_KEY} header.
curl -X POST -H 'Authorization: Bearer {API_KEY}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/checkin
Calling checkin will automatically transition the trigger to resolved.
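A common pattern for cron jobs is to check in only after the job has finished successfully, so that a crashed or stuck job eventually causes an alert. The sketch below illustrates that idea; run_and_check_in and make_check_in are hypothetical helpers, not part of any Heii On-Call library.

```python
import urllib.request
from typing import Callable


def run_and_check_in(job: Callable[[], None], check_in: Callable[[], None]) -> bool:
    """Run job; call check_in only if the job finished without raising."""
    try:
        job()
    except Exception:
        # No check-in is sent, so the liveness trigger will transition to
        # alerting once its preconfigured time period elapses.
        return False
    check_in()
    return True


def make_check_in(trigger_id: str, api_key: str, timeout: float = 15.0) -> Callable[[], None]:
    """Return a function that POSTs to the checkin endpoint described above."""
    def check_in() -> None:
        request = urllib.request.Request(
            f"https://api.heiioncall.com/triggers/{trigger_id}/checkin",
            method="POST",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request, timeout=timeout):
            pass  # the response body is not needed
    return check_in
```

A nightly job could then end with run_and_check_in(do_backup, make_check_in(trigger_id, api_key)), where do_backup stands in for your own job function.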
An outbound probe trigger uses Heii On-Call's servers to continuously monitor the availability and uptime of your HTTP-exposed services.
By default, we periodically send a HEAD request to your specified HTTP(S) endpoint, such as an API health check endpoint, or your website's home page. (Currently, this probe is sent once per minute, but the probe interval may be adaptive in the future.) Paid organizations may also choose to send GET/POST/PUT/DELETE/PATCH/OPTIONS requests, and may specify custom request headers and a request body.
By default, a timely HTTP response of 200 OK or 204 No Content is considered up. Any other result is considered down. (Redirects will not be followed, so be sure to specify your URL precisely.) Any issues at the DNS, TCP, TLS, or HTTP layers will also be considered down, and the details are shown on the trigger page.
Any up result will transition the trigger to resolved. The outbound probe trigger's timeout configures how long your endpoint must be continuously down before the trigger will automatically transition to alerting. This timeout represents a tradeoff between alert latency and potential false positives, such as intermittent connectivity problems that resolve without intervention. A longer timeout gives more opportunity for any intermittent glitches to resolve themselves before alerting. A shorter timeout brings a real, actionable outage to your attention a few minutes sooner. We recommend a minimum of 3 minutes, and a typical timeout of 5 minutes to 15 minutes for most applications.
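To make the up/down rule and the timeout concrete, here is a small Python model of the decision. It is an illustration only, not Heii On-Call's implementation: is_up and probe_decision are hypothetical names, and measuring the timeout from the first down result of the current streak is an assumption.

```python
from typing import List, Optional, Tuple

UP_STATUSES = {200, 204}  # default "up" statuses per the text above


def is_up(status: Optional[int]) -> bool:
    """status is None when the probe failed at the DNS, TCP, TLS, or HTTP layer."""
    return status in UP_STATUSES


def probe_decision(results: List[Tuple[float, Optional[int]]], timeout_s: float, now: float) -> str:
    """Decide what a run of probe results implies for the trigger.

    results holds (timestamp_seconds, status) pairs, oldest first.
    Returns "resolved", "alerting", or "unchanged".
    """
    down_since = None
    for ts, status in results:
        if is_up(status):
            down_since = None   # any up result ends the down streak
        elif down_since is None:
            down_since = ts     # first down result of a new streak
    if not results:
        return "unchanged"
    if down_since is None:
        return "resolved"
    # Assumption: alert once the endpoint has been down for the full timeout.
    return "alerting" if now - down_since >= timeout_s else "unchanged"
```

With a 3 minute timeout and one probe per minute, a single failed probe followed by a success never alerts in this model, which is the false-positive protection the timeout is meant to provide.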
Note that by default we use a HEAD request to query your endpoint while economizing on bandwidth usage. Per RFC 7231 section 4.3.2, "The HEAD method is identical to GET except that the server MUST NOT send a message body in the response", and this is often used "for testing hypertext links for validity, accessibility, and recent modification", such as our outbound probes.
If you're writing a health check endpoint for use with Heii On-Call, simply make an ordinary GET endpoint which returns 200 OK or 204 No Content on success, and returns any other status code on failure. We have confirmed that for many popular web frameworks (including Ruby on Rails, Sinatra, Express.js, Django, Flask, Kemal), the full GET handler will be run by a HEAD request.
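As a rough illustration of such an endpoint, the sketch below uses only Python's standard library. Unlike the frameworks listed above, http.server does not run the GET handler for HEAD requests on its own, so the handler aliases do_HEAD to do_GET. The check_health function is a hypothetical placeholder for your own checks, and the 503 failure code and port 8000 are arbitrary choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_health() -> bool:
    """Hypothetical placeholder: replace with real checks (database ping, etc.)."""
    return True


def status_for(healthy: bool) -> int:
    # 204 No Content counts as "up"; 503 (or any other code) counts as "down".
    return 204 if healthy else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(status_for(check_health()))
        self.end_headers()

    # http.server answers 501 for methods without a do_* handler,
    # so reuse the GET handler for the default HEAD probe.
    do_HEAD = do_GET


def serve(port: int = 8000) -> None:
    """Serve the health check until interrupted (port is an arbitrary example)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```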
Paid organizations may also choose to send GET/POST/PUT/DELETE/PATCH/OPTIONS requests with custom headers and a body. Please be aware that unlike HEAD requests, these will send both request and response bodies and may incur higher data transfer costs on your end, and we may impose a per-request byte transfer limit to protect our services.
Optionally, you may define additional response criteria required for a response to be considered up: you may specify an expected status code such as 301, or you may add a space and a URL to also assert the value of the Location response header, for example 301 https://example.com/redirect-target. This feature is typically used for monitoring HTTP redirects.
You may optionally add a small HTTP body to your /triggers/{TRIGGER_ID}/alert request, which transitions a trigger to alerting. Heii On-Call will store the body (up to 10240 bytes) and display it on the trigger page when a trigger is alerting. This is especially useful when integrating with third-party monitoring solutions to include metadata about the triggering alert, such as a direct link to the system that generated the alert.
You can add data as JSON:
curl -X POST \
-H 'Authorization: Bearer {API_KEY}' \
-H 'Content-Type: application/json' \
-d '{"host":"server1", "status":"onfire"}' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
or form-encoded:
curl -X POST \
-H 'Authorization: Bearer {API_KEY}' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'host=server1&status=onfire' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
or as good old plain text:
curl -X POST \
-H 'Authorization: Bearer {API_KEY}' \
-H 'Content-Type: text/plain' \
-d 'server 1 is on fire' https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
This flexibility should give you plenty of room to integrate with your favorite monitoring solutions. As always, we suggest starting simple. The simplest is to skip the body. Beyond that, most of the time, something as quick as: "Alert from {Your Monitoring Service}" is enough to help the on-call person quickly track down the issue, and you can add complexity down the line.
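For integrations written in Python, the JSON variant might look like the sketch below. The helper name build_alert_with_body is hypothetical, and rejecting oversized bodies locally is an assumption made for illustration; this document only states that up to 10240 bytes are stored, not what happens beyond that.

```python
import json
import urllib.request

MAX_BODY_BYTES = 10240  # storage limit stated above


def build_alert_with_body(trigger_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an alert request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        # Assumption: fail early rather than rely on server-side handling.
        raise ValueError(f"body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    return urllib.request.Request(
        f"https://api.heiioncall.com/triggers/{trigger_id}/alert",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The returned request can be sent with urllib.request.urlopen(request, timeout=15).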
Adding alerts or checkins to your application code is subject to the same concerns as adding any other network request into your code: those network requests may fail, or may take an arbitrarily long time to complete. These failures could be caused by any combination of: issues on our end, issues with our hosting providers, issues with your hosting providers, or issues with the network connections between the two.
Make sure you set an appropriate timeout for DNS lookups, TCP connection, and the overall request to protect your application code from being blocked. For example:
curl -X POST \
--retry 5 --retry-all-errors --retry-max-time 15 \
-H 'Authorization: Bearer {API_KEY}' \
https://api.heiioncall.com./triggers/{TRIGGER_ID}/alert
The --retry 5 --retry-all-errors --retry-max-time 15 options ensure that the request will be retried on failure, but that retrying stops after no more than 15 seconds. (To also cap a single attempt, curl provides --connect-timeout and --max-time.) In Python, consider requests.post(..., timeout=15). In Ruby, consider: Faraday.post(...) { |request| request.options.timeout = 15 }.
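The same retry-within-a-budget idea can be sketched in Python. This is one possible approach rather than a recommended implementation: post_with_retries is a hypothetical helper, the doubling delay is an arbitrary choice, and send_once is whatever function performs a single attempt with its own per-request timeout.

```python
import time
from typing import Callable


def post_with_retries(
    send_once: Callable[[], int],
    attempts: int = 5,
    max_total_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send_once until it succeeds, attempts run out, or the time budget is spent."""
    deadline = time.monotonic() + max_total_s
    for attempt in range(attempts):
        try:
            send_once()
            return True
        except OSError:  # covers URLError, HTTPError, and socket timeouts
            delay = 2 ** attempt  # 1s, 2s, 4s, ... between tries
            if attempt == attempts - 1 or time.monotonic() + delay > deadline:
                break
            sleep(delay)
    return False
```

Returning False instead of raising lets the calling code decide whether a failed alert should be logged, queued, or ignored, so a monitoring outage never blocks your main code path.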
API Keys are used when programmatically transitioning trigger states. You may create as many keys as you wish for each organization. We recommend using a different key for each external service you connect to Heii On-Call, but it's okay to start with just one.
An API Key is a string of the form heii_{{OPAQUE_SECURE_TOKEN}} and should be stored securely.
You may use the test token heii_TEST_{{OPAQUE_SECURE_TOKEN}} in testing your integration. This test token will verify connectivity to our backend, but will not modify any Triggers. You may create a test-only API Key for use in your continuous integration tests, for example.
Services represent a single logical subsystem. While a trigger represents the particular reason why a subsystem is broken (e.g. "API health check endpoint timeout"), a service is the subsystem that is broken (e.g. "API").
Many teams start with just a single service, something like Application, and then down the line split responsibilities across smaller services like Frontend and Backend. Triggers belong to services, so when a trigger transitions to alerting, it means there is something wrong with the service.
You can have as many services as you want in Heii On-Call. In practice, most organizations find that two to five services is enough to split out their major subsystems.
Rotations represent a group of people that, together, are responsible for one or more services.
Rotations are specified in an easy-to-edit text format: one shift specification per line, with each line in EMAIL, DURATION format. This makes editing a rotation as easy as editing lines in a text file.
Typically, rotations will map 1:1 to engineering teams in your organization, and every member of the team rotates through on-call responsibility on a regular cadence.
Small teams can almost always get away with having a single Default rotation. Heii On-Call offers you the flexibility to set up your rotations in a way that works for your team right now, and will evolve with you as you grow.
DURATION defines when a user's shift ends, and is specified in a human-readable format as one of:
until [DATE_SPECIFIER] TIME TIMEZONE, or
from [DATE_SPECIFIER] TIME TIMEZONE until [DATE_SPECIFIER] TIME TIMEZONE, or
for NUMBER TIME_UNIT
For example:
until Mon 9:00am PT (default)
until 10:30pm Asia/Tokyo
until 1st Fri of the month at 2:00pm ET
from 9:00am CT until 5:00pm CT
for 7 days
for 12 hours
In practice, most teams have on-call rotations that are about 1 week long and are anchored to a specific handoff time on a workday, which is why our default DURATION is until Mon 9:00am PT.
Unless a from clause is specified, a shift starts when the last shift ends. (An explicit from can be used to intentionally create a gap in the schedule.)
Weekly shifts with a handoff every Monday at 9:00am Pacific (PT is a shortcut for America/Los_Angeles):
alice@example.com, until Mon 9:00am PT
bob@example.com, until Mon 9:00am PT
carol@example.com, until Mon 9:00am PT
Alice on call during the day (ET is a shortcut for America/New_York), and Bob on call overnight:
alice@example.com, until 7:30pm ET
bob@example.com, until 7:30am ET
Weekday daytime coverage, with a different person each day:
alice@example.com, from Mon 9:00am PT until Mon 5:00pm PT
bob@example.com, from Tue 9:00am PT until Tue 5:00pm PT
carol@example.com, from Wed 9:00am PT until Wed 5:00pm PT
darlene@example.com, from Thu 9:00am PT until Thu 5:00pm PT
erin@example.com, from Fri 9:00am PT until Fri 5:00pm PT
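Because the rotation format is line-oriented, it is straightforward to read or generate from a script. The sketch below splits each line into its EMAIL and DURATION parts and records which of the three DURATION forms it uses. It is a simplified illustration, not Heii On-Call's parser: ShiftSpec and parse_rotation are hypothetical names, and treating a line with no DURATION as the default is an assumption.

```python
from typing import List, NamedTuple


class ShiftSpec(NamedTuple):
    email: str
    duration: str  # raw DURATION text, e.g. "until Mon 9:00am PT"
    kind: str      # "until", "from", or "for"


DEFAULT_DURATION = "until Mon 9:00am PT"  # default named in the text above


def parse_rotation(text: str) -> List[ShiftSpec]:
    """Parse one 'EMAIL, DURATION' shift specification per line."""
    specs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        email, _, duration = line.partition(",")
        # Assumption: a missing DURATION falls back to the documented default.
        duration = duration.strip() or DEFAULT_DURATION
        kind = duration.split()[0]
        if kind not in ("until", "from", "for"):
            raise ValueError(f"unrecognized DURATION: {duration!r}")
        specs.append(ShiftSpec(email.strip(), duration, kind))
    return specs
```

The sketch deliberately leaves the date, time, and timezone text unparsed, since the full grammar of DATE_SPECIFIER is not spelled out here.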
A shift is a time window during which a specific person is responsible for being on call for a rotation.
Shifts are created automatically from your rotation. Heii On-Call will always create shifts covering at least two weeks into the future. If you wish to populate more shifts in advance, use the Add Full Rotation button. This will add one full rotation's worth of shifts into the future.
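As a mental model of how shifts are laid end to end from a rotation, consider the following sketch. It is illustrative only and handles just fixed-length shifts (the "for NUMBER TIME_UNIT" form); build_shifts is a hypothetical function, not how Heii On-Call generates its schedule.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Shift = Tuple[str, datetime, datetime]  # (email, start, end)


def build_shifts(
    rotation: List[Tuple[str, timedelta]],
    start: datetime,
    horizon: timedelta = timedelta(weeks=2),
) -> List[Shift]:
    """Cycle through the rotation until shifts cover at least `horizon`."""
    if not rotation or any(length <= timedelta(0) for _, length in rotation):
        raise ValueError("rotation needs at least one entry with a positive length")
    shifts: List[Shift] = []
    cursor, i = start, 0
    while cursor < start + horizon:
        email, length = rotation[i % len(rotation)]
        # Each shift starts exactly where the previous one ended.
        shifts.append((email, cursor, cursor + length))
        cursor += length
        i += 1
    return shifts
```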
If something comes up and someone needs to trade, simply edit the shift and assign a different user. If something comes along and ruins your carefully laid plans, you can simply delete all shifts from some point in time, and re-add shifts as necessary.
Heii On-Call lets you easily move blocks of shifts, shrink shifts, and otherwise edit the upcoming schedule as things come up. The system tries to do the right thing when everything is going according to plan, but lets you edit things as much as you want when reality strikes.
When a trigger transitions from resolved to alerting, the user that is currently on call in the Rotation that is responsible for the Service will be notified based on their configured Notification Channels. Heii On-Call supports numerous notification channels; however, we highly recommend that each user configure the Heii On-Call mobile app for iOS or Android as their primary notification channel to ensure timely delivery of notifications and the ability to deliver critical alerts. For redundancy, we recommend each user also configure at least one other channel, with a suggested delay of 5-15 minutes, in case the user can't be reached through their primary notification channel.
While a trigger remains in the alerting state, Heii On-Call will continue to send notifications to the individual on call. The interval between these notifications starts at 5 minutes, and becomes longer after significant time has elapsed and the trigger is still alerting. The expectation is that a user that receives a notification will acknowledge or resolve the trigger to prevent being notified again.
In some cases you may want this notification interval to be longer. In this case you can set the minimum notification interval parameter on a trigger to a duration such as 12 hours. This means that Heii On-Call will send a follow-up notification at most once every 12 hours. The interval for follow-up notifications may be longer than what is set in this parameter after significant time has elapsed and the trigger remains unacknowledged.
The minimum notification interval parameter only applies to follow-up notifications sent to the same user. If a different user is to be notified because of a shift change or an escalation strategy, they will get their first notification immediately.
By default, notifications on the Heii On-Call mobile app break through Do Not Disturb settings and always make an audible sound, regardless of mute or volume settings. This behavior can be configured on a per-trigger basis by unchecking the "Critical Alert on Mobile Apps?" checkbox when creating or editing a trigger. When a trigger is not set to critical, a notification will still be delivered to the mobile device, but it will respect the current notification and volume settings on the mobile device.
When a trigger transitions from resolved to alerting, the individual currently on-call is immediately alerted using their primary notification channel. The individual on-call will continue to be notified on their primary channel until the trigger is acknowledged or resolved.
In addition to continuing to ping the on-call individual, you can optionally set one or more Escalation Strategies for each rotation. Escalation strategies define one or more secondary on-call individuals that will receive an alert after a configurable amount of time if the primary on-call has not acknowledged the page.
Heii On-Call has multiple types of Escalation Strategies to fit different team workflows.
You can add multiple escalation strategies to alert after different amounts of time to create an escalation ladder. A common setup is to first escalate to the person previously on-call after 15 minutes, followed by the entire rotation after 30 minutes, and finally a single user who is typically a manager (or a founder who has not learned to let go) after 45 minutes.
Escalation strategies only continue to run while the Trigger is still alerting. As soon as any user sets the Trigger to acknowledged, no further notifications will be sent.
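The escalation ladder from the common setup above can be modeled in a few lines. This is purely an illustration of the described behavior: LADDER and escalation_targets are hypothetical names, and the labels stand in for whatever strategies you configure on the rotation.

```python
from typing import List, Tuple

# Hypothetical ladder mirroring the common setup described above:
# (minutes the trigger has been alerting, who is additionally notified)
LADDER: List[Tuple[int, str]] = [
    (15, "previous on-call"),
    (30, "entire rotation"),
    (45, "manager"),
]


def escalation_targets(
    state: str, minutes_alerting: int, ladder: List[Tuple[int, str]] = LADDER
) -> List[str]:
    """Return the escalation steps that have fired so far for a trigger."""
    if state != "alerting":
        # Escalation only runs while the trigger is still alerting.
        return []
    return [who for after, who in ladder if minutes_alerting >= after]
```

In this model, acknowledging at minute 20 means only the first step ever fires, which matches the rule that an acknowledged trigger stops the ladder.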
In some cases you may not want a particular Trigger to follow the escalation path of the Rotation. This is often the case for non-critical alerts where it is okay if the individual on-call does not acknowledge the issue promptly. You can uncheck the "escalate to secondary on-call?" checkbox on a Trigger to prevent escalation strategies from running for specific Triggers.
You can add escalation strategies by clicking on the Actions menu on a Rotation. A single Rotation can have multiple escalation strategies.
Heii On-Call uses simple webhooks (just a fancy name for an HTTP POST request) to trigger alerts. This means it is trivial to send alerts from your own application code, as well as the vast majority of third-party monitoring and crash reporting solutions.
Here are instructions on how to send webhooks from:
You're not limited to the list above. Simply search for "webhook" in the documentation of your monitoring solution of choice.
If you can't find a way to integrate Heii On-Call with your third-party system, send us an email at heii@heiioncall.com. We would be happy to see if we can find a good way to integrate.
Default Rotation and one Application Service.
alerting, the individual with the current shift on the rotation responsible for that service receives an alert notification from Heii On-Call.
checkin POST request before timeout, in the case of inbound liveness Triggers.)
outbound probe Triggers.)