Automatically stop an alarm when it goes into repair


We're currently using Grafana to trigger alarms in xMatters whenever an alert fires. Grafana automatically sends out another notification whenever an alert goes into repair.

Is it possible to automatically acknowledge or stop the previous alarm, and prevent further escalation in xMatters, when we receive a repair notification from Grafana?


Comments

9 comments

  • Absolutely. You could write something in the Integration Builder, or you could use Flow Designer. It would probably be quicker to test it out in Flow Designer, so I'd recommend starting there. You'd want to use the Get Events and Terminate Events steps to do this: have your flow identify repair notifications, then find and terminate the events in question.

    More info about these flow steps is available here:

    https://help.xmatters.com/ondemand/xmodwelcome/flowdesigner/flow-tools.htm?cshid=FlowFindEventsStep#GetEvents
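    In Flow Designer these steps are wired together in the canvas rather than written as code, but the decision they make can be sketched in plain JavaScript. This is a minimal sketch, assuming a Grafana repair arrives with `state` set to `"ok"` (as the Grafana data later in this thread shows) and that each event returned by the lookup has hypothetical `id` and `properties` fields; `planTerminations` is an illustrative name, not an xMatters function:

```javascript
// Hypothetical helper: decide which existing events a Grafana
// notification should terminate. Assumes a repair has state "ok"
// and each event object carries an id and a properties map.
function planTerminations(notification, events) {
  // Only repair notifications trigger terminations.
  if (notification.state !== "ok") {
    return [];
  }
  // Terminate every event that was raised for the same Grafana rule.
  return events
    .filter((event) => event.properties.ruleId === notification.ruleId)
    .map((event) => event.id);
}

// Sample events (made-up IDs) standing in for a Get Events result.
const events = [
  { id: "evt-1", properties: { ruleId: 52 } },
  { id: "evt-2", properties: { ruleId: 7 } },
];

console.log(planTerminations({ state: "ok", ruleId: 52 }, events)); // [ 'evt-1' ]
console.log(planTerminations({ state: "alerting", ruleId: 52 }, events)); // []
```

    Matching on the Grafana rule ID keeps a repair for one rule from terminating events raised by another.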

  • Thanks Travis,

    You're right. We haven't migrated to Flow Designer yet, so we're still using Integration Builder scripts.

    Most of your examples make sense; however, I'm a bit uncertain what I should pass in the part below:

    if( data.SOMEPROPERTY == 'REPAIR' ) {
        // We'll need some criteria to search for the events
        util.terminateEvents({'PROPNAME': PROPERTY});
    }

    I logged an alert from Grafana with the below data:

    title: [Alerting] Test Alert
    state: ok
    ruleId: 52
    ruleName: Test Alert

    Based on the above, I guess it'd be something like the following:

    // Inspect the incoming data to see if we should
    // terminate the existing events.
    if( data.state == 'ok' ) {
        // We'll need some criteria to search for the events
        util.terminateEvents({'ruleId': data.ruleId});
    }
  • Yep, looks about right. 

    Just before you test it, I'd throw this line in to make sure the value is what you expect:

    console.log( 'state value: "' + data.state + '"' );
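    Wrapping the value in quotes makes stray whitespace visible, and either whitespace or unexpected casing would make `data.state == 'ok'` false. A slightly more thorough variant of the same check, as a sketch; `describeValue` is a hypothetical helper name, not part of the `Util` library:

```javascript
// Hypothetical debugging helper: render a value in quotes together with
// its type and (for strings) its length, so hidden whitespace or a
// number-vs-string mismatch shows up clearly in the log.
function describeValue(label, value) {
  const details = typeof value === "string"
    ? `${typeof value}, length ${value.length}`
    : typeof value;
  return `${label}: "${value}" (${details})`;
}

console.log(describeValue("state value", "ok"));  // state value: "ok" (string, length 2)
console.log(describeValue("state value", "ok ")); // state value: "ok " (string, length 3)
console.log(describeValue("ruleId", 52));         // ruleId: "52" (number)
```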

    Happy Coding!

  • We have implemented Travis' suggestion and are trialing automatic termination of alarms when a repair is received from Grafana. Unfortunately, it appears that alarms still get escalated even though a repair has been received.

    In the log extracts below, the following important events take place:

    1. A notification is sent by xMatters at 12:03 GMT+1
    2. A repair is received from Grafana at 11:20 GMT (12:20 GMT+1)
    3. Another notification is sent by xMatters at 12:23 GMT+1
    4. Additional notifications are sent by xMatters from 12:23 GMT+1 to 15:03 GMT+1

    The notifications in 3) and 4) should not have been sent, as the alarm should have been terminated automatically upon receiving the repair from Grafana in 2).
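    To double-check the ordering across the two time zones, here is the arithmetic using the timestamps from the log extracts below (the HTTP `date` header is in GMT; the event log is in CET, which is GMT+1 in January). The variable names are illustrative only:

```javascript
// Timestamps copied from the logs in this post. The HTTP Date header is
// in GMT; the xMatters event log is in CET (GMT+1 in January).
const firstNotification = new Date("2020-01-20T12:13:46.988+01:00");
const repairReceived = new Date("2020-01-20T11:20:41Z");
const firstEscalation = new Date("2020-01-20T12:23:47.275+01:00");

// Minutes elapsed between two Date objects.
const minutesBetween = (a, b) => (b - a) / 60000;

console.log(repairReceived.toISOString()); // 2020-01-20T11:20:41.000Z
console.log(repairReceived < firstEscalation); // true
console.log(minutesBetween(repairReceived, firstEscalation).toFixed(1)); // 3.1
console.log(minutesBetween(firstNotification, firstEscalation).toFixed(1)); // 10.0
```

    So the repair arrived roughly three minutes before the escalation fired, which confirms the terminate ran in time but did not stop the event.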

    Log entries from the Activity Stream for the Grafana Integration

    > GET https://cellpointmobile.xmatters.com/api/xm/1/events?status=ACTIVE&propertyName=ruleId&propertyValue=31 HTTP/1.1
    > Accept: text/plain, application/json, application/cbor, application/*+json, */*
    > User-Agent: Xerus (EndpointClient)
    > X-Trace: fad7ef78-02f3-cb9f-6ae2-dda6b362be45,6f3dfe19-9168-4a96-adf6-640667e7318c,e0ff9397-5edd-4400-b286-93c3bc0cc5b4
    > Content-Type: application/json; charset=UTF-8
    > Authorization: Bearer ********
    > X-Integration-UUID: 52f54293-6f69-4d03-badb-ae5f145ac1a7
    > X-Flow-Trace: 52f54293-6f69-4d03-badb-ae5f145ac1a7:1579519236559
    > Content-Length: 0


    < HTTP/1.1 200 OK OK
    < date: Mon, 20 Jan 2020 11:20:41 GMT
    < x-trace: fad7ef78-02f3-cb9f-6ae2-dda6b362be45,6f3dfe19-9168-4a96-adf6-640667e7318c,e0ff9397-5edd-4400-b286-93c3bc0cc5b4,8f997c54-3418-4bca-87ed-60414904c157
    < x-application-context: application:overrides
    < content-type: application/json;charset=utf-8
    < x-xss-protection: 1; mode=block
    < cache-control: no-cache, no-store, max-age=0, must-revalidate
    < pragma: no-cache
    < expires: 0
    < x-envoy-upstream-service-time: 104
    < server: envoy
    < transfer-encoding: chunked
    < X-Robots-Tag: noindex
    < X-FRAME-OPTIONS: SAMEORIGIN
    < Strict-Transport-Security: max-age=31536000; includeSubDomains; preload;
    < X-Content-Type-Options: nosniff
    < Via: 1.1 google
    < Alt-Svc: clear
    {"count":0,"total":0,"data":[],"links":{"self":"/api/xm/1/events?limit=100&propertyValue=31&offset=0&propertyName=ruleId&status=ACTIVE"}}

    > GET https://cellpointmobile.xmatters.com/api/xm/1/events?status=SUSPENDED&propertyName=ruleId&propertyValue=31 HTTP/1.1
    > Accept: text/plain, application/json, application/cbor, application/*+json, */*
    > User-Agent: Xerus (EndpointClient)
    > Content-Type: application/json; charset=UTF-8
    > Authorization: Bearer ********
    > X-Integration-UUID: 52f54293-6f69-4d03-badb-ae5f145ac1a7
    > X-Trace: fad7ef78-02f3-cb9f-6ae2-dda6b362be45,6f3dfe19-9168-4a96-adf6-640667e7318c,5bdbba4d-4b80-4d42-a972-b0afc6b4a459
    > X-Flow-Trace: 52f54293-6f69-4d03-badb-ae5f145ac1a7:1579519236559
    > Content-Length: 0


    < HTTP/1.1 200 OK OK
    < date: Mon, 20 Jan 2020 11:20:41 GMT
    < x-trace: fad7ef78-02f3-cb9f-6ae2-dda6b362be45,6f3dfe19-9168-4a96-adf6-640667e7318c,5bdbba4d-4b80-4d42-a972-b0afc6b4a459,9edeffd5-9139-4615-9ca2-2f476ffa9a28
    < x-application-context: application:overrides
    < content-type: application/json;charset=utf-8
    < x-xss-protection: 1; mode=block
    < cache-control: no-cache, no-store, max-age=0, must-revalidate
    < pragma: no-cache
    < expires: 0
    < x-envoy-upstream-service-time: 62
    < server: envoy
    < transfer-encoding: chunked
    < X-Robots-Tag: noindex
    < X-FRAME-OPTIONS: SAMEORIGIN
    < Strict-Transport-Security: max-age=31536000; includeSubDomains; preload;
    < X-Content-Type-Options: nosniff
    < Via: 1.1 google
    < Alt-Svc: clear
    {"count":0,"total":0,"data":[],"links":{"self":"/api/xm/1/events?limit=100&propertyValue=31&offset=0&propertyName=ruleId&status=SUSPENDED"}}

    Terminating event(s) for Rule: ELB - Active Connections(31) due to State: "ok" from Grafana with Result: true
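    The two requests above differ only in the `status` filter: the library searches ACTIVE events and then SUSPENDED events for the same property name and value. A sketch that rebuilds those request paths, assuming the query parameters are exactly the ones visible in the log; `buildEventSearchPaths` is a hypothetical name, not part of the `Util` library:

```javascript
// Hypothetical reconstruction of the two lookups seen in the Activity
// Stream: one GET per event status, filtered by a single property.
function buildEventSearchPaths(propertyName, propertyValue) {
  return ["ACTIVE", "SUSPENDED"].map((status) => {
    // URLSearchParams keeps insertion order and percent-encodes values.
    const query = new URLSearchParams({
      status,
      propertyName,
      propertyValue: String(propertyValue),
    });
    return `/api/xm/1/events?${query.toString()}`;
  });
}

for (const path of buildEventSearchPaths("ruleId", 31)) {
  console.log(path);
}
// /api/xm/1/events?status=ACTIVE&propertyName=ruleId&propertyValue=31
// /api/xm/1/events?status=SUSPENDED&propertyName=ruleId&propertyValue=31
```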

    Log extract from the Event log for the alarm

    Jan 20, 2020 12:13:46.988 CET Device Android Push User A - Android Phone will be notified
    Jan 20, 2020 12:13:46.989 CET Device Email User A - Work Email will be notified
    Jan 20, 2020 12:13:46.990 CET Device Voice User A - Work Phone will be notified
    Jan 20, 2020 12:13:46.990 CET Device Email User A - Work Email is Active
    Jan 20, 2020 12:13:47.229 CET Device Email User A - Work Email - Notification will be processed by (x)Matters Trial SMTP Email PP
    Jan 20, 2020 12:13:47.230 CET Device Voice User A - Work Phone is Active
    Jan 20, 2020 12:13:47.254 CET Device Voice User A - Work Phone - Alternate Protocol Providers are unavailable for this device.
    Jan 20, 2020 12:13:47.254 CET Device Voice Service Provider (x)matters Voice Gateway does not have a valid and enabled Protocol Provider
    Jan 20, 2020 12:13:47.255 CET Device Android Push User A - Android Phone is Active
    Jan 20, 2020 12:13:47.367 CET Device Android Push User A - Android Phone - Notification will be processed by (x)Matters GCM PP
    Jan 20, 2020 12:13:47.447 CET Device Email User A - Work Email - Notification delivered
    Jan 20, 2020 12:13:47.488 CET Device Android Push User A - Android Phone - Notification delivered
    Jan 20, 2020 12:23:47.275 CET User TOPS - Weekly 24/7 Shift - member User B (user.b) - escalation has occurred to the next level

    Jan 20, 2020 12:23:47.396 CET Device Email User B - Work Email will be notified
    Jan 20, 2020 12:23:47.397 CET Device Android Push User B - Android Phone will be notified
    Jan 20, 2020 12:23:47.400 CET Device Android Push User B - Android Phone is Active
    Jan 20, 2020 12:23:47.647 CET Device Android Push User B - Android Phone - Notification will be processed by (x)Matters GCM PP
    Jan 20, 2020 12:23:47.648 CET Device Email User B - Work Email is Active
    Jan 20, 2020 12:23:47.713 CET Device Email User B - Work Email - Notification will be processed by (x)Matters Trial SMTP Email PP
    Jan 20, 2020 12:23:47.867 CET Device Email User B - Work Email - Notification delivered
    Jan 20, 2020 12:23:47.892 CET User TOPS - Weekly 24/7 Shift - member User C (user.c) - escalation has occurred to the next level
    Jan 20, 2020 12:23:47.906 CET Device Android Push User B - Android Phone - Notification delivered

  • Hey Jonatan! Glad to hear you're making progress!

    For the events in points 3 and 4, what is the "ruleId" property value? The terminate call shows that it didn't find any events with a "ruleId" of "31":

    {"count":0,"total":0,"data":[],"links":{"self":"/api/xm/1/events?limit=100&propertyValue=31&offset=0&propertyName=ruleId&status=ACTIVE"}}

    Is that value populated in those events and if so, what is the value? 
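    A `count` of 0 means the lookup matched nothing, so there was nothing to terminate, even though the script logged `Result: true`. Presumably `terminateEvents` reports success when it finds nothing; that is an assumption, since the library source isn't shown here. A sketch of reading such a response, assuming each entry in `data` exposes an `id` field (this response is empty, so it doesn't show the real event shape); `eventIdsFromSearch` is a hypothetical name:

```javascript
// Hypothetical helper: pull the event IDs out of a GET /api/xm/1/events
// response body. Assumes each entry in "data" carries an "id" field.
function eventIdsFromSearch(responseBody) {
  const parsed = JSON.parse(responseBody);
  return (parsed.data || []).map((event) => event.id);
}

// Response body copied from the quote above.
const body = '{"count":0,"total":0,"data":[],"links":{"self":"/api/xm/1/events?limit=100&propertyValue=31&offset=0&propertyName=ruleId&status=ACTIVE"}}';

const ids = eventIdsFromSearch(body);
console.log(ids.length); // 0, so there is nothing to terminate
```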

  • Thanks for the analysis, Travis. Shouldn't the ruleId be populated for all events?

    Here's the complete script we use; please note the lines marked in bold:

    // Instantiate the library for use
    var util = require('Util');

    var payload = JSON.parse(request.body); //Use this if the payload is a JSON object in the request body

    /** Customize form properties **/
    trigger.properties.Title = payload.title;
    trigger.properties.State = payload.state;
    trigger.properties.ruleId = payload.ruleId;
    trigger.properties.ruleName = payload.ruleName;
    trigger.properties.ruleUrl = payload.ruleUrl;

    // Terminate the existing events for Rule
    if (trigger.properties.State == 'ok')
    {
        var result = util.terminateEvents({'ruleId': trigger.properties.ruleId});
        console.log('Terminating event(s) for Rule: '+ trigger.properties.ruleName +'('+ trigger.properties.ruleId +') due to State: "'+ trigger.properties.State +'" from Grafana with Result: '+ result);
    }
    // Discard event for [No Data] notifications sent from Grafana
    else if (!payload.evalMatches || payload.evalMatches.length === 0)
    {
        console.log('Title: '+ trigger.properties.Title);
        console.log('State: '+ trigger.properties.State);
        console.log('Rule ID: '+ trigger.properties.ruleId);
        console.log('Rule Name: '+ trigger.properties.ruleName);
        console.log('Rule URL: '+ trigger.properties.ruleUrl);
        console.log('Ignoring this trigger due to evalMatches being empty');
    }
    // Create new Event
    else
    {
        // Set the event priority:
        trigger.priority = "High";

        trigger.properties.Message = payload.message +'\n'+ 'Value: ' + payload.evalMatches[0].value + ' ' + 'Metric: ' + payload.evalMatches[0].metric;

        form.post(trigger);
    }
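    One way to check each branch of the script above without posting to xMatters is to pull the routing into a pure function. This is a sketch, not the script's actual structure: `classifyGrafanaPayload` and `buildMessage` are hypothetical names, and the sample payload values (message text, metric name) are made up for illustration:

```javascript
// Hypothetical refactor of the script's routing into a pure function,
// so each branch can be tested without calling form.post.
function classifyGrafanaPayload(payload) {
  if (payload.state === "ok") {
    return "terminate"; // repair: stop the existing events for this rule
  }
  if (!payload.evalMatches || payload.evalMatches.length === 0) {
    return "ignore"; // [No Data] notification: nothing to alert on
  }
  return "create"; // genuine alert: post a new event
}

// Builds the Message property the same way the script does.
function buildMessage(payload) {
  const match = payload.evalMatches[0];
  return payload.message + "\nValue: " + match.value + " Metric: " + match.metric;
}

// Made-up sample payload for an alerting notification.
const alerting = {
  state: "alerting",
  ruleId: 31,
  message: "Too many connections",
  evalMatches: [{ value: 120, metric: "ActiveConnectionCount" }],
};

console.log(classifyGrafanaPayload({ state: "ok", ruleId: 31 })); // terminate
console.log(classifyGrafanaPayload({ state: "no_data", evalMatches: [] })); // ignore
console.log(classifyGrafanaPayload(alerting)); // create
console.log(buildMessage(alerting));
```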

    Please let me know if I'm missing something above that prevents the event from being found by its ruleId.

    Also note that the following line imports the script you previously provided in this thread, as you suggested:

    // Instantiate the library for use
    var util = require('Util');

  • Ah! I bet our search is the problem. Instead of passing just "ruleId", pass the property name with its "#en" locale suffix, like so:

    var result = util.terminateEvents({'ruleId#en': trigger.properties.ruleId});
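    If the localized name ever has to go into a REST query string by hand (the `Util` library presumably handles this internally, which is an assumption), the `#` must be percent-encoded; a raw `#` starts the URL fragment and cuts off everything after it. A small sketch using the standard `URL` and `URLSearchParams` classes; the host name is a placeholder:

```javascript
// The "#en" suffix marks the English-localized property name.
const propertyName = "ruleId#en";

// Unencoded, everything from "#" onward is parsed as the URL fragment.
// example.xmatters.com is a placeholder host.
const naive = new URL(
  "https://example.xmatters.com/api/xm/1/events?propertyName=" + propertyName + "&propertyValue=31"
);
console.log(naive.searchParams.get("propertyName")); // ruleId
console.log(naive.hash); // #en&propertyValue=31

// URLSearchParams percent-encodes "#" as %23, keeping it in the query.
const query = new URLSearchParams({ propertyName, propertyValue: "31" });
console.log(query.toString()); // propertyName=ruleId%23en&propertyValue=31
```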

    Give that a shot and let me know how it goes. 

  • Thanks for the reply, Travis. I have updated the script per your suggestion.

    However, it seems counterintuitive that the ruleId property is localized and thus has to be suffixed with #en in the lookup. In my mind, an ID is never localized, as its purpose is to provide a unique reference.

    Is there any reasoning behind this approach that would be useful to understand?

  • Actually, all properties are localized. The API call that searches for events by property requires the localized property name, which makes sense: if you localize your properties, you need to search the right one.

    However, it would be nice to have a default, so that when you don't pass a locale, one is assumed, especially for instances that only use one language.
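    Until such a default exists, a small wrapper could apply it on the script side. This is a hypothetical sketch, not an xMatters feature: `withDefaultLocale` and `localizeCriteria` are illustrative names, and it assumes, as the fix above suggests, that `terminateEvents` accepts criteria keyed by the localized property name, e.g. `util.terminateEvents(localizeCriteria({ ruleId: trigger.properties.ruleId }))`:

```javascript
// Hypothetical helper illustrating the proposed default: append a locale
// suffix only when the caller did not already supply one.
function withDefaultLocale(propertyName, defaultLocale = "en") {
  return propertyName.includes("#") ? propertyName : `${propertyName}#${defaultLocale}`;
}

// Apply the default to every key of a search-criteria object.
function localizeCriteria(criteria, defaultLocale = "en") {
  const result = {};
  for (const [name, value] of Object.entries(criteria)) {
    result[withDefaultLocale(name, defaultLocale)] = value;
  }
  return result;
}

console.log(withDefaultLocale("ruleId"));      // ruleId#en
console.log(withDefaultLocale("ruleId#de"));   // ruleId#de
console.log(localizeCriteria({ ruleId: 31 })); // { 'ruleId#en': 31 }
```

    Names that already carry a suffix are left alone, so single-language and multi-language instances can share the same code path.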

    I'll see about getting an enhancement request opened for that. 

    I hope that helps.
    Happy Friday!
