Each proposal below is a change to the AI triage system derived from feedback your team has submitted. For each one we show what triggered the idea, exactly what would change in the pipeline, what it still won't handle, and the open questions we need answered before implementing it. This page is read-only — add feedback on individual alerts to keep building the signal.
Before triage, the pipeline looks up the device in ServiceNow CMDB to get routing data (support group, assignment group) and CI criticality. All lookups are non-blocking — triage proceeds even if CMDB is unavailable.
| Lookup Method | Badge | When Used |
|---|---|---|
| N-able Device ID → u_nable_id | Device ID | When N-central is configured to send customTags.deviceId; hard link, most reliable |
| Device name → CI name | Name match | Current fallback; parsed from the N-able details field |
| Device IP → CI ip_address | IP match | Second fallback if name match fails |
| No match found | No match | CMDB fields blank; alert still triaged with standard rules |
To enable the hard link: Configure N-central's Custom PSA notification to include customTags.deviceId = N-able Device ID. No code change needed — the pipeline will automatically use Device ID once it appears in the webhook.
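The fallback order in the table above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `resolve_ci`, `CmdbMatch`, the `query_cmdb` callable, and the alert keys `deviceId`, `deviceName`, and `deviceIp` are all hypothetical names. Only the CI fields `u_nable_id`, `name`, and `ip_address` and the badge labels come from the table.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CmdbMatch:
    badge: str           # "Device ID", "Name match", "IP match", or "No match"
    ci: Optional[dict]   # matched CI record, or None when nothing matched


def resolve_ci(alert: dict,
               query_cmdb: Callable[[str, str], Optional[dict]]) -> CmdbMatch:
    """Try each lookup in priority order; never raise, so triage is not blocked."""
    attempts = [
        ("Device ID", "u_nable_id", alert.get("deviceId")),    # hard link
        ("Name match", "name", alert.get("deviceName")),       # current fallback
        ("IP match", "ip_address", alert.get("deviceIp")),     # second fallback
    ]
    for badge, ci_field, value in attempts:
        if not value:
            continue  # the webhook did not carry this identifier
        try:
            ci = query_cmdb(ci_field, value)
        except Exception:
            ci = None  # CMDB unavailable: treat as a miss and move on
        if ci is not None:
            return CmdbMatch(badge, ci)
    return CmdbMatch("No match", None)
```

Because a missing `deviceId` is simply skipped, the same function starts returning the Device ID badge as soon as that value appears in the payload, which matches the "no code change needed" behavior described above.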
When N-able fires a CREATE, the pipeline checks for an existing active alert with the same Device and Service before writing a new record.
| Scenario | Outcome | Side Effect |
|---|---|---|
| No active alert exists for this Device + Service | New record created | firstSeenAt and durationThresholdMins set |
| Active alert exists — incoming severity equal to or lower than existing | Suppressed | dupeSuppressedCount incremented; existing ticket ID returned to N-able |
| Active alert exists — incoming severity higher than existing | Escalation — existing record updated | toState, durationThresholdMins, lastNote updated in place; no new record |
durationThresholdMins is stored on every new alert and indicates how long a condition should persist before a ServiceNow ticket is warranted. The field is captured for future use; the current incident gate (see ServiceNow Incident Creation below) does not consult it. The rules below are evaluated in order, top to bottom.
| Condition | Threshold | Rationale |
|---|---|---|
| Service name contains connect | 5 min | Connectivity losses are urgent |
| toState is failed or critical | 30 min | Hard failures escalate faster than warnings |
| toState is warning | 120 min | Warnings are often transient; 2hr buffer before actioning |
| All other cases | 60 min | Default |
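As a rough sketch, the threshold table reads as an ordered chain of checks in which the first matching rule decides the value. The function name `duration_threshold_mins` is hypothetical, and the case-insensitive comparison is an assumption the table does not state.

```python
def duration_threshold_mins(service_name: str, to_state: str) -> int:
    """Return the persistence threshold in minutes; rules are checked in table order."""
    service = service_name.lower()   # assumption: matching ignores case
    state = to_state.lower()
    if "connect" in service:
        return 5      # connectivity losses are urgent
    if state in ("failed", "critical"):
        return 30     # hard failures
    if state == "warning":
        return 120    # warnings are often transient
    return 60         # default


print(duration_threshold_mins("Connectivity", "warning"))
print(duration_threshold_mins("Disk - C:", "failed"))
```

The order matters: a connectivity service in the warning state gets 5 minutes rather than 120, because the service-name rule is checked first.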
State severity ordering: normal=0 → warning=1 → failed=2 → critical=3. Used to determine suppress vs escalate on dedup.
| Transition | Dedup Action | Guidance |
|---|---|---|
| Normal → Warning | New or suppressed (0→1) | P4 at most unless connectivity. Ticket only if persists >120 min. |
| Normal → Failed | New record | No pre-existing alert, always creates. Threshold 30 min. |
| Warning → Failed | Escalation (1→2) | Existing record updated in place; threshold drops to 30 min. |
| Warning → Warning | Suppressed (1→1) | Duplicate suppressed; dupeSuppressedCount incremented. |
| Failed → Warning | Suppressed (2→1 recovery) | Recovery: lower severity suppressed. N-able RESOLVE closes the record. |
| Any → Normal (RESOLVE) | Resolved | N-able sends RESOLVE action; record marked resolved. No dedup applies. |
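The suppress-versus-escalate decision for an incoming CREATE reduces to one comparison on the severity ordering. The sketch below assumes the existing active alert's state is passed in as `existing_state` (or `None` when no active alert exists for the Device + Service); `dedup_action` and its return labels are hypothetical names.

```python
from typing import Optional

# Ordering from the text: normal=0, warning=1, failed=2, critical=3
SEVERITY_RANK = {"normal": 0, "warning": 1, "failed": 2, "critical": 3}


def dedup_action(existing_state: Optional[str], incoming_state: str) -> str:
    """Decide what to do with an incoming CREATE for a given Device + Service."""
    if existing_state is None:
        return "create"      # no active alert: write a new record
    incoming = SEVERITY_RANK[incoming_state.lower()]
    existing = SEVERITY_RANK[existing_state.lower()]
    if incoming > existing:
        return "escalate"    # update the existing record in place
    return "suppress"        # equal or lower: increment dupeSuppressedCount


print(dedup_action("warning", "failed"))
print(dedup_action("failed", "warning"))
```

A state outside the four listed values raises `KeyError` in this sketch; how the real pipeline ranks unknown states is not specified here.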
Before any AI call, the triage consumer applies cheap deterministic rules to skip the AI entirely for obvious noise. Skipped alerts get a synthetic TriageResults row with actionable=false, severity=P4, confidence=1.0 and preFiltered=true — so they appear in stats but never reach the incident gate.
| Condition | Suppression Reason | Why |
|---|---|---|
| toState ∈ {normal, healthy, up, ok, running, passed, good} | recovery_event | Recovery transition: clearance signal, not an incident |
| Same device + service was resolved within last 5 min | flap_within_5min | True flap; defeats the receiver-level dedup that only catches concurrent active alerts |
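The two pre-triage rules can be illustrated with a small function that returns a suppression reason, or `None` when the alert should go on to the AI. This is a sketch under assumptions: `pre_triage_suppression`, `synthetic_result`, the `last_resolved_at` argument, and the `suppressionReason` key are hypothetical, and treating exactly 5 minutes as still inside the window is a guess at the boundary.

```python
from datetime import datetime, timedelta
from typing import Optional

RECOVERY_STATES = {"normal", "healthy", "up", "ok", "running", "passed", "good"}
FLAP_WINDOW = timedelta(minutes=5)


def pre_triage_suppression(to_state: str,
                           last_resolved_at: Optional[datetime],
                           now: datetime) -> Optional[str]:
    """Return a suppression reason, or None if the alert needs AI triage."""
    if to_state.lower() in RECOVERY_STATES:
        return "recovery_event"
    if last_resolved_at is not None:
        elapsed = now - last_resolved_at
        # Same device + service was resolved within the last 5 minutes
        if timedelta(0) <= elapsed <= FLAP_WINDOW:
            return "flap_within_5min"
    return None


def synthetic_result(reason: str) -> dict:
    """Synthetic triage row for a skipped alert (field values from the text)."""
    return {
        "actionable": False,
        "severity": "P4",
        "confidence": 1.0,
        "preFiltered": True,
        "suppressionReason": reason,  # hypothetical field name
    }
```

Both checks are plain comparisons on data the consumer already holds, which is what keeps them cheap relative to a model call.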
After the pre-triage filter and before any AI call, the triage consumer checks a cache keyed by (deviceName, serviceName, fromState, toState). If a matching triage result was produced in the last 60 minutes, the cached result is reused — no AI call is made. Cache rows are written to the TriageCache table every time the AI produces a fresh result.
Cache hits are tagged on the alert with triageCacheHit=true and on the TriageResults row with cacheHit=true, cacheSourceAlertId=<prior alert id> for traceability. Reasoning, recommendation, and severity are copied verbatim from the source.
Why this works: most triage decisions are a function of the alert signature, not the specific incident. If MK-VE-WCC-C / Disk - C: / Normal → Failed was P2/actionable an hour ago, the same signature on the same device is almost certainly still P2/actionable now. Pairs with the pre-triage filter — flap storms for a single signature collapse to one AI call per hour.
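A minimal in-memory sketch of the signature cache follows. The real pipeline stores rows in the TriageCache table; the `TriageCache` class here, its `get`/`put` methods, and the `id` and `cachedAt` keys are hypothetical stand-ins. The key tuple, the 60-minute window, and the `cacheHit` and `cacheSourceAlertId` tags come from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

CACHE_TTL = timedelta(minutes=60)


class TriageCache:
    """In-memory stand-in for the TriageCache table."""

    def __init__(self) -> None:
        self._rows: dict = {}

    @staticmethod
    def key(alert: dict) -> tuple:
        return (alert["deviceName"], alert["serviceName"],
                alert["fromState"], alert["toState"])

    def get(self, alert: dict, now: datetime) -> Optional[dict]:
        row = self._rows.get(self.key(alert))
        if row is None or now - row["cachedAt"] > CACHE_TTL:
            return None  # miss or stale: caller falls through to the AI
        hit = dict(row["result"])  # reasoning, recommendation, severity copied as-is
        hit["cacheHit"] = True
        hit["cacheSourceAlertId"] = row["alertId"]
        return hit

    def put(self, alert: dict, result: dict, now: datetime) -> None:
        # Written every time the AI produces a fresh result
        self._rows[self.key(alert)] = {
            "result": dict(result), "alertId": alert["id"], "cachedAt": now,
        }
```

Keying on the full transition (fromState and toState) rather than the device alone means a Warning → Failed alert never reuses a verdict produced for Normal → Warning on the same service.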
Claude (timer-driven, every 2 minutes, batches of 10) classifies alerts that survive the pre-triage filter. The system prompt enforces aggressive suppression: default to non-actionable (P4) unless there is evidence of sustained impact.
actionable=true requires ALL three: (1) condition is sustained, not a transient spike; (2) there is a specific action an engineer can take right now; (3) inaction would cause user impact or service degradation. Otherwise actionable=false.
correlated_symptom_of_upstream_failure, non-actionable.

reopenCount > 5 → chronic_flapper, severity reduced one tier.

| Severity | Definition | Examples | Expected Action |
|---|---|---|---|
| P1 | Critical — outage or data loss risk | Domain controller unreachable, RAID failure, hypervisor down | Immediate on-call; ticket within 15 min |
| P2 | High — significant degradation, customer-impacting | Memory >90%, disk >95%, connectivity failed on primary link | Ticket within 1 hour |
| P3 | Medium — performance warning, no immediate impact | CPU sustained >80%, memory warning, backup job late | Ticket within next business day |
| P4 | Low — informational or transient | Normal → Warning (short duration), patch pending | Log only; ticket only if condition persists beyond threshold |
A separate timer (incidentConsumer, every 5 minutes, batches of 10) reads TriageResults and creates incidents in thrivetest.service-now.com via the scripted REST endpoint /api/thrive/thrive_ai_triage/incident.
| Gate | Behavior |
|---|---|
| CREATE_INCIDENTS | Master switch; currently true. |
| actionable === true | Required. Pre-filtered + AI-suppressed alerts never qualify. |
| Severity ≤ MAX_SEVERITY_FOR_INCIDENT (3) | P1, P2, P3 create incidents. P4 is logged only. |
| incidentCreated !== true | Idempotent: already-stamped triage rows are skipped. |
| Alert status === 'active' | If the alert auto-resolved between triage and incident creation, no incident is created. |
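Taken together, the gates above amount to a single boolean check per triage row. The sketch below assumes the severity is stored as a string such as "P2" and that the triage row and alert are plain dictionaries; `should_create_incident` is a hypothetical name, while `CREATE_INCIDENTS`, `MAX_SEVERITY_FOR_INCIDENT`, and the field names come from the table.

```python
CREATE_INCIDENTS = True          # master switch
MAX_SEVERITY_FOR_INCIDENT = 3    # P1, P2, P3 qualify; P4 is logged only


def should_create_incident(triage: dict, alert: dict) -> bool:
    """Return True only if every gate in the table passes."""
    # Assumption: severity is stored as "P1".."P4"; strip the prefix to compare
    severity_num = int(triage["severity"].lstrip("Pp"))
    return (
        CREATE_INCIDENTS
        and triage.get("actionable") is True
        and severity_num <= MAX_SEVERITY_FOR_INCIDENT
        and triage.get("incidentCreated") is not True   # idempotency guard
        and alert.get("status") == "active"             # skip auto-resolved alerts
    )
```

Checking the alert status last, at incident-creation time, is what covers the case where an alert resolves in the window between the 2-minute triage timer and the 5-minute incident timer.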
On success, the incident's sys_id and number are stamped back onto both TriageResults and AlertRegistry. The Incident column on the alerts list links directly to the incident in ServiceNow.
The incident description includes device/service/state, CMDB CI + assignment group, AI investigation notes, recommended action, and the originating Alert ID for traceability.