How does nas storm protection work?

Products

DX Unified Infrastructure Management (Nimsoft / UIM), Unified Infrastructure Management for Mainframe, CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

Environment

Release: UIM 20.3 or higher

Resolution

The nas probe (3.60 or higher) includes a built-in storm protection feature that prevents large, continuous event storms from a robot or probe from causing problems for the nas. The algorithm has the nas maintain a “quarantine list” of possible offenders.

The size of this list is configurable (storm_capacity), and elements are added, removed, or moved to the top depending on message frequency.
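The bounded list described above can be pictured with a minimal Python sketch. This is only an illustration, not the nas implementation: the class name QuarantineList, the touch method, and the eviction rule (drop the entry that has gone longest without a message) are assumptions, since the article does not say exactly how entries are ranked.

```python
from collections import OrderedDict


class QuarantineList:
    """Hypothetical bounded list of signatures; active entries move to the top."""

    def __init__(self, capacity):
        self.capacity = capacity          # plays the role of storm_capacity
        self._entries = OrderedDict()     # signature -> message count

    def touch(self, signature):
        """Record one message for a signature and move it to the top."""
        count = self._entries.pop(signature, 0) + 1
        self._entries[signature] = count  # re-inserting puts it at the end (= top)
        if len(self._entries) > self.capacity:
            # Assumed eviction rule: drop the entry seen least recently.
            self._entries.popitem(last=False)
        return count

    def signatures(self):
        """Return signatures ordered from top (most recent) to bottom."""
        return list(reversed(self._entries))
```

With a capacity of 2, touching "a", "b", "a" and then "c" leaves "c" and "a" in the list, and "b" is dropped.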

The event “signature” is constructed from the source, domain, robot [, probe-id [, supp_key]] elements of the inbound alarm message.

If the number of alarms matching the “signature” exceeds a threshold (storm_threshold) within a specified time window (storm_timewindow), subsequent alarms are quarantined by re-publishing the message to the configured subject (storm_subject). The default subject is NAS_QUARANTINE.
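The threshold check can be sketched in Python as a per-signature sliding window. This is an illustration under stated assumptions, not the nas code: the class StormDetector and its accept method are hypothetical, timestamps are plain seconds, and an alarm is treated as quarantined when the window already holds storm_threshold alarms for the same signature.

```python
from collections import defaultdict, deque


class StormDetector:
    """Hypothetical sliding-window counter keyed by alarm signature."""

    def __init__(self, threshold, timewindow):
        self.threshold = threshold        # plays the role of storm_threshold
        self.timewindow = timewindow      # plays the role of storm_timewindow (seconds assumed)
        self._arrivals = defaultdict(deque)

    def accept(self, signature, now):
        """Return True if the alarm arriving at time `now` would be quarantined."""
        arrivals = self._arrivals[signature]
        # Forget arrivals that have slid out of the time window.
        while arrivals and now - arrivals[0] >= self.timewindow:
            arrivals.popleft()
        arrivals.append(now)
        # Quarantine once the count in the window exceeds the threshold.
        return len(arrivals) > self.threshold
```

With a threshold of 3 and a 60-second window, the fourth and fifth alarms arriving within a few seconds are quarantined, while an alarm arriving after the window has emptied is processed normally again.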

The quarantined alarm is not registered with the nas, and a log entry is generated when the first set of messages is placed in quarantine. The alarm message text and severity level can be overridden via a Raw Configure edit of the nas probe:

setup > storm_message
setup > storm_severity_level

storm_message supports variable expansion from the message header, for example:

Placing alarm(s) from $domain:$origin:$robot:$prid:suppkey=$supp_key, total:%d
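To show how the template above might expand, here is a small Python sketch. It only mimics the substitution for illustration: the function expand_storm_message and the sample header values are hypothetical, and the assumption that %d is replaced by the running total of quarantined alarms is inferred from the "total:" label, not confirmed by the nas documentation.

```python
from string import Template

STORM_MESSAGE = (
    "Placing alarm(s) from $domain:$origin:$robot:$prid:suppkey=$supp_key, total:%d"
)


def expand_storm_message(template, header, total):
    """Fill $variables from the message header, then insert the total count."""
    # safe_substitute leaves unknown $variables untouched instead of raising.
    text = Template(template).safe_substitute(header)
    return text.replace("%d", str(total))


# Hypothetical header values, for illustration only.
header = {
    "domain": "Dom1",
    "origin": "hubA",
    "robot": "web01",
    "prid": "cdm",
    "supp_key": "disk/C",
}
message = expand_storm_message(STORM_MESSAGE, header, 3000)
print(message)
# Placing alarm(s) from Dom1:hubA:web01:cdm:suppkey=disk/C, total:3000
```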

storm_severity_level takes a numeric severity value, for example:

storm_severity_level = 5

This changes the alarm severity to Critical.

The storm_protection value determines which elements make up the “signature” key:

0. disabled

1. source, domain, robot, probe-id and supp_key

2. source, domain, robot, probe-id

3. source, domain, robot
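The mapping above can be expressed as a short Python sketch. The function build_signature and the dictionary keys used for the alarm (including "prid" for probe-id) are hypothetical names chosen for illustration; the real nas reads these fields from the inbound alarm message.

```python
# Signature fields per storm_protection value, as listed above.
SIGNATURE_FIELDS = {
    1: ("source", "domain", "robot", "prid", "supp_key"),
    2: ("source", "domain", "robot", "prid"),
    3: ("source", "domain", "robot"),
}


def build_signature(alarm, storm_protection):
    """Return the signature tuple for an alarm, or None when protection is disabled."""
    if storm_protection == 0:
        return None
    fields = SIGNATURE_FIELDS[storm_protection]
    # Missing fields are treated as empty strings (an assumption of this sketch).
    return tuple(alarm.get(field, "") for field in fields)


alarm = {
    "source": "10.0.0.5",
    "domain": "Dom1",
    "robot": "web01",
    "prid": "cdm",
    "supp_key": "disk/C",
}
```

For this sample alarm, level 3 groups everything from robot web01 under one signature, while level 1 tracks each suppression key separately, so a broader level quarantines more alarms once the threshold is crossed.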

To enable nas storm protection, open the nas GUI, select the General tab, and pick a type of protection from the Storm protection dropdown menu. You can choose between Suppression-ID, Robot, or Probe as the source of your message filter.

Once enabled, you can choose your own Storm Subject, which replaces the subject in the message header for messages exceeding the threshold. You can then set the threshold, that is, the number of alarms within a set interval of time at which the nas considers an alarm storm to be in progress. The Storm capacity determines how many messages are retained in the transaction log and how many are discarded. This is configured in the nas GUI and has a default value.

The nas determines that the storm has died down using the same logic, i.e. 3000 messages per 5 minutes; when this condition is no longer true, it returns to a normal state. Keep in mind that these times are asymmetric. If 2990 alarms arrive in the first 10 seconds and 10 more alarms arrive at the 4:50 mark, the threshold is reached at 4:50, and the storm will be over about 10 seconds after it started. This is because the first batch arrived so close to the start of the window that it slides out of the 5-minute window almost immediately after the threshold is reached.
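The timing in this example can be checked with a short Python simulation. It is a sketch under assumptions: the 2990 early alarms are spread evenly over the first 10 seconds, the storm is considered active while the 5-minute window holds at least 3000 alarms, and the state is sampled once per second. The helper count_in_window is a hypothetical name.

```python
from bisect import bisect_right

THRESHOLD = 3000   # alarms
WINDOW = 300       # seconds (5 minutes)


def count_in_window(arrivals, now, window):
    """Count sorted arrival times t with now - window < t <= now."""
    return bisect_right(arrivals, now) - bisect_right(arrivals, now - window)


# 2990 alarms spread over the first 10 seconds, then 10 more at 4:50 (290 s).
first_batch = [10 * i / 2989 for i in range(2990)]
late_batch = [290.0] * 10
arrivals = sorted(first_batch + late_batch)

# Sample the window once per second for 10 minutes.
active = [t for t in range(0, 601)
          if count_in_window(arrivals, t, WINDOW) >= THRESHOLD]
storm_start = active[0]        # first second with 3000 alarms in the window
storm_end = active[-1] + 1     # first second after the count drops again

print(storm_start, storm_end, storm_end - storm_start)
# 290 300 10
```

The storm condition first holds at 290 seconds and stops holding at 300 seconds, when the earliest alarm slides out of the window, which matches the 10-second duration described above.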

The time window is the duration for which quarantined messages are published back to the Nimsoft message bus (NimBUS) before the storm is considered to have died down. It is a sliding window, similar to the samples value in the cdm probe.