Alerts

Incidents created (7d)
 
Incidents created (30d)
 
Alerts received (7d)
 
AI calls saved (7d)
pre-filter + cache
Feedback (7d)
 
Device Site Service State Change Triage Received Status Incident

Webhook Log

Time Action Status Ticket # Device

Feedback


Reviewers

Reviewer Total Correct / Wrong Last review

Recent feedback

When Reviewer Verdict Device / Service AI severity Notes Incident

Proposed Changes

Each proposal below is a change to the AI triage system derived from feedback your team has submitted. For each one we show what triggered the idea, exactly what would change in the pipeline, what it still won't handle, and the open questions we need answered before implementing it. This page is read-only — add feedback on individual alerts to keep building the signal.


Alert Detail

How It Works

CMDB Enrichment

Before triage, the pipeline looks up the device in ServiceNow CMDB to get routing data (support group, assignment group) and CI criticality. All lookups are non-blocking — triage proceeds even if CMDB is unavailable.

Lookup Method | Badge | When Used
N-able Device ID → u_nable_id | Device ID | When N-central is configured to send customTags.deviceId; hard link, most reliable
Device name → CI name | Name match | Current fallback; parsed from N-able details field
Device IP → CI ip_address | IP match | Second fallback if name match fails
No match found | No match | CMDB fields blank; alert still triaged with standard rules

To enable the hard link: Configure N-central's Custom PSA notification to include customTags.deviceId = N-able Device ID. No code change needed — the pipeline will automatically use Device ID once it appears in the webhook.
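The fallback order in the table can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the AlertInput and CmdbCi shapes, the enrich function, and the findCi callback (a stand-in for the real ServiceNow query, shown synchronous for brevity) are all assumptions. Only customTags.deviceId, u_nable_id, name, and ip_address come from the description above.

```typescript
type Badge = "Device ID" | "Name match" | "IP match" | "No match";
type CiField = "u_nable_id" | "name" | "ip_address";

// Hypothetical webhook shape; deviceName is assumed to be already parsed
// from the N-able details field.
interface AlertInput {
  customTags?: { deviceId?: string };
  deviceName?: string;
  deviceIp?: string;
}

// Hypothetical subset of a CMDB CI record.
interface CmdbCi {
  u_nable_id?: string;
  name?: string;
  ip_address?: string;
}

interface Enrichment {
  badge: Badge;
  ci?: CmdbCi;
}

// Stand-in for the real CMDB query; it may throw when CMDB is unavailable.
type FindCi = (field: CiField, value: string) => CmdbCi | undefined;

function enrich(alert: AlertInput, findCi: FindCi): Enrichment {
  // Attempts in priority order: hard link first, then name, then IP.
  const attempts: Array<[Badge, CiField, string | undefined]> = [
    ["Device ID", "u_nable_id", alert.customTags?.deviceId],
    ["Name match", "name", alert.deviceName],
    ["IP match", "ip_address", alert.deviceIp],
  ];
  try {
    for (const [badge, field, value] of attempts) {
      if (!value) continue; // field absent from the webhook, try the next method
      const ci = findCi(field, value);
      if (ci) return { badge, ci };
    }
  } catch {
    // Non-blocking: a CMDB failure leaves the fields blank and triage proceeds.
  }
  return { badge: "No match" };
}
```

Because the Device ID attempt is simply first in the list, a webhook that starts carrying customTags.deviceId would use the hard link without any change to this logic, which mirrors the "no code change needed" note above.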

Alert Deduplication

When N-able fires a CREATE, the pipeline checks for an existing active alert with the same Device and Service before writing a new record.

Scenario | Outcome | Side Effect
No active alert exists for this Device + Service | New record created | firstSeenAt and durationThresholdMins set
Active alert exists, incoming severity equal to or lower than existing | Suppressed | dupeSuppressedCount incremented; existing ticket ID returned to N-able
Active alert exists, incoming severity higher than existing | Escalation; existing record updated | toState, durationThresholdMins, lastNote updated in place; no new record
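A minimal sketch of the three scenarios, assuming a hypothetical in-memory registry keyed by device and service. The record shape, the handleCreate name, and the key format are illustrative, and durationThresholdMins is left out for brevity. The state ranking is the one given under State Transition Severity.

```typescript
type State = "normal" | "warning" | "failed" | "critical";

// Ordering from the State Transition Severity section.
const RANK: Record<State, number> = { normal: 0, warning: 1, failed: 2, critical: 3 };

// Hypothetical record shape; only the field names mentioned in the table are from the source.
interface AlertRecord {
  id: string;
  toState: State;
  status: "active" | "resolved";
  firstSeenAt: number;
  dupeSuppressedCount: number;
  lastNote: string;
}

interface IncomingCreate {
  id: string;
  device: string;
  service: string;
  toState: State;
  note: string;
}

type Outcome = "new" | "suppressed" | "escalation";
type Registry = { [key: string]: AlertRecord | undefined };

function handleCreate(
  registry: Registry,
  incoming: IncomingCreate,
  nowMs: number,
): { outcome: Outcome; ticketId: string } {
  const key = incoming.device + "|" + incoming.service;
  const existing = registry[key];

  // Scenario 1: no active alert for this device + service, so write a new record.
  if (existing === undefined || existing.status !== "active") {
    registry[key] = {
      id: incoming.id,
      toState: incoming.toState,
      status: "active",
      firstSeenAt: nowMs,
      dupeSuppressedCount: 0,
      lastNote: incoming.note,
    };
    return { outcome: "new", ticketId: incoming.id };
  }

  // Scenario 2: equal or lower severity, so suppress and return the existing ticket ID.
  if (RANK[incoming.toState] <= RANK[existing.toState]) {
    existing.dupeSuppressedCount += 1;
    return { outcome: "suppressed", ticketId: existing.id };
  }

  // Scenario 3: higher severity, so update the existing record in place.
  existing.toState = incoming.toState;
  existing.lastNote = incoming.note;
  return { outcome: "escalation", ticketId: existing.id };
}
```

In every branch the caller gets back a single ticket ID per active device and service pair, which is what lets N-able keep referring to the original record.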

Duration Thresholds (captured, not yet enforced)

durationThresholdMins is stored on every new alert and indicates how long a condition should persist before a ServiceNow ticket is warranted. The field is captured for future use; the current incident gate (see ServiceNow Incident Creation below) does not consult it. The conditions in the table below are evaluated in order, and the first match sets the threshold.

Condition | Threshold | Rationale
Service name contains connect | 5 min | Connectivity losses are urgent
toState is failed or critical | 30 min | Hard failures escalate faster than warnings
toState is warning | 120 min | Warnings are often transient; 2hr buffer before actioning
All other cases | 60 min | Default
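The ordered evaluation can be sketched as a short function. The function name and the case-insensitive matching are assumptions; the conditions and minute values are those in the table.

```typescript
// Returns the threshold in minutes; conditions are checked top to bottom.
function durationThresholdMins(serviceName: string, toState: string): number {
  // Assumption: matching is case-insensitive.
  const service = serviceName.toLowerCase();
  const state = toState.toLowerCase();

  if (service.indexOf("connect") !== -1) return 5; // connectivity losses are urgent
  if (state === "failed" || state === "critical") return 30; // hard failures
  if (state === "warning") return 120; // warnings are often transient
  return 60; // default
}
```

Order matters here: a connectivity service in the warning state gets 5 minutes, not 120, because the service-name check runs first.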

State Transition Severity

State severity ordering: normal=0 → warning=1 → failed=2 → critical=3. Used to determine suppress vs escalate on dedup.

Transition | Dedup Action | Guidance
Normal → Warning | New or suppressed (0→1) | P4 at most unless connectivity. Ticket only if persists >120 min.
Normal → Failed | New record | No pre-existing alert, always creates. Threshold 30 min.
Warning → Failed | Escalation (1→2) | Existing record updated in place; threshold drops to 30 min.
Warning → Warning | Suppressed (1→1) | Duplicate suppressed; dupeSuppressedCount incremented.
Failed → Warning | Suppressed (2→1 recovery) | Recovery; lower severity suppressed. N-able RESOLVE closes the record.
Any → Normal (RESOLVE) | Resolved | N-able sends RESOLVE action; record marked resolved. No dedup applies.

Pre-Triage Noise Filter

Before any AI call, the triage consumer applies cheap deterministic rules to skip the AI entirely for obvious noise. Skipped alerts get a synthetic TriageResults row with actionable=false, severity=P4, confidence=1.0 and preFiltered=true — so they appear in stats but never reach the incident gate.

Condition | Suppression Reason | Why
toState ∈ {normal, healthy, up, ok, running, passed, good} | recovery_event | Recovery transition; clearance signal, not an incident
Same device + service was resolved within last 5 min | flap_within_5min | True flap; defeats the receiver-level dedup that only catches concurrent active alerts
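The two rules can be sketched as a pure function that returns either a synthetic result or null (meaning the alert continues to the next stage). The preFilter name, the suppressionReason field name, and the lastResolvedAtMs parameter (the most recent resolve time for the same device and service, looked up by the caller) are assumptions for illustration.

```typescript
const RECOVERY_STATES = ["normal", "healthy", "up", "ok", "running", "passed", "good"];
const FLAP_WINDOW_MS = 5 * 60 * 1000;

type SuppressionReason = "recovery_event" | "flap_within_5min";

// Synthetic TriageResults row written for skipped alerts.
interface SyntheticTriage {
  actionable: false;
  severity: "P4";
  confidence: number;
  preFiltered: true;
  suppressionReason: SuppressionReason; // hypothetical field name
}

function synthetic(reason: SuppressionReason): SyntheticTriage {
  return { actionable: false, severity: "P4", confidence: 1.0, preFiltered: true, suppressionReason: reason };
}

function preFilter(
  toState: string,
  lastResolvedAtMs: number | undefined,
  nowMs: number,
): SyntheticTriage | null {
  // Rule 1: recovery transitions are clearance signals, not incidents.
  if (RECOVERY_STATES.indexOf(toState.toLowerCase()) !== -1) {
    return synthetic("recovery_event");
  }
  // Rule 2: the same device + service was resolved within the last 5 minutes.
  if (lastResolvedAtMs !== undefined && nowMs - lastResolvedAtMs <= FLAP_WINDOW_MS) {
    return synthetic("flap_within_5min");
  }
  return null; // not obvious noise, continue to the next stage
}
```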

Triage Signature Cache

After the pre-triage filter and before any AI call, the triage consumer checks a cache keyed by (deviceName, serviceName, fromState, toState). If a matching triage result was produced in the last 60 minutes, the cached result is reused — no AI call is made. Cache rows are written to the TriageCache table every time the AI produces a fresh result.

Cache hits are tagged on the alert with triageCacheHit=true and on the TriageResults row with cacheHit=true, cacheSourceAlertId=<prior alert id> for traceability. Reasoning, recommendation, and severity are copied verbatim from the source.

Why this works: most triage decisions are a function of the alert signature, not the specific incident. If MK-VE-WCC-C / Disk - C: / Normal → Failed was P2/actionable an hour ago, the same signature on the same device is almost certainly still P2/actionable now. Pairs with the pre-triage filter — flap storms for a single signature collapse to one AI call per hour.
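A minimal in-memory sketch of the signature cache described above. The real pipeline stores rows in the TriageCache table; the class shape, method names, and the TriageOutcome fields other than severity, reasoning, and recommendation are assumptions.

```typescript
const CACHE_TTL_MS = 60 * 60 * 1000; // 60-minute reuse window

interface Signature {
  deviceName: string;
  serviceName: string;
  fromState: string;
  toState: string;
}

interface TriageOutcome {
  severity: string;
  actionable: boolean;
  reasoning: string;
  recommendation: string;
}

interface CacheRow {
  result: TriageOutcome;
  sourceAlertId: string;
  producedAtMs: number;
}

interface CacheHit extends TriageOutcome {
  cacheHit: true;
  cacheSourceAlertId: string;
}

// JSON-encode the tuple so separators inside names cannot cause key collisions.
function signatureKey(sig: Signature): string {
  return JSON.stringify([sig.deviceName, sig.serviceName, sig.fromState, sig.toState]);
}

class TriageCache {
  private rows: { [key: string]: CacheRow | undefined } = {};

  // Called whenever the AI produces a fresh result.
  put(sig: Signature, alertId: string, result: TriageOutcome, nowMs: number): void {
    this.rows[signatureKey(sig)] = { result, sourceAlertId: alertId, producedAtMs: nowMs };
  }

  // Returns a copy of the prior result, tagged for traceability, or undefined on a miss.
  lookup(sig: Signature, nowMs: number): CacheHit | undefined {
    const row = this.rows[signatureKey(sig)];
    if (row === undefined || nowMs - row.producedAtMs > CACHE_TTL_MS) return undefined;
    return {
      severity: row.result.severity,
      actionable: row.result.actionable,
      reasoning: row.result.reasoning,
      recommendation: row.result.recommendation,
      cacheHit: true,
      cacheSourceAlertId: row.sourceAlertId,
    };
  }
}
```

Only fresh AI results are written with put, so a cache hit never extends the window; a flapping signature therefore still triggers a new AI call once the 60 minutes have passed.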

AI Triage

Claude (timer-driven, every 2 minutes, batches of 10) classifies alerts that survive the pre-triage filter and miss the signature cache. The system prompt enforces aggressive suppression: default to non-actionable (P4) unless there is evidence of sustained impact.

actionable=true requires ALL three: (1) condition is sustained, not a transient spike; (2) there is a specific action an engineer can take right now; (3) inaction would cause user impact or service degradation. Otherwise actionable=false.

Common alerts the prompt classifies as P4 / non-actionable

  • Agent check-in delays (transient, resolves within minutes)
  • Backup job warnings (unless 3+ consecutive failures)
  • AV/EDR definition age (handled by scheduled update process)
  • Patch pending (informational, handled by patch management)
  • CPU or memory spike under 15 minutes on non-critical servers
  • Single disk I/O spike without sustained degradation
  • Certificate expiry more than 14 days away
  • Network utilization spike under 10 minutes
  • Windows service restart (unless restart loop or critical service)

Other prompt rules

  • Cascading failures: upstream is root cause → upstream P2/actionable, downstream marked correlated_symptom_of_upstream_failure, non-actionable.
  • Disk < 90% on non-critical storage → P4 at most.
  • reopenCount > 5 → chronic_flapper, severity reduced one tier.
  • CMDB Importance 1 (Critical) or 2 (High): raise severity one tier on P2/P3 boundary.

Severity guide

Severity | Definition | Examples | Expected Action
P1 | Critical: outage or data loss risk | Domain controller unreachable, RAID failure, hypervisor down | Immediate on-call; ticket within 15 min
P2 | High: significant degradation, customer-impacting | Memory >90%, disk >95%, connectivity failed on primary link | Ticket within 1 hour
P3 | Medium: performance warning, no immediate impact | CPU sustained >80%, memory warning, backup job late | Ticket within next business day
P4 | Low: informational or transient | Normal → Warning (short duration), patch pending | Log only; ticket only if condition persists beyond threshold

ServiceNow Incident Creation

A separate timer (incidentConsumer, every 5 minutes, batches of 10) reads TriageResults and creates incidents in thrivetest.service-now.com via the scripted REST endpoint /api/thrive/thrive_ai_triage/incident.

Gate | Behavior
CREATE_INCIDENTS | Master switch; currently true.
actionable === true | Required. Pre-filtered + AI-suppressed alerts never qualify.
Severity ≤ MAX_SEVERITY_FOR_INCIDENT (3) | P1, P2, P3 create incidents. P4 is logged only.
incidentCreated !== true | Idempotent; already-stamped triage rows are skipped.
Alert status === 'active' | If the alert auto-resolved between triage and incident creation, no incident is created.
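The gates combine as a plain conjunction, sketched below. The GateConfig and TriageRow shapes, the shouldCreateIncident name, and the assumption that severity is stored as a string like "P3" are illustrative; the gate names and the threshold of 3 come from the table.

```typescript
interface GateConfig {
  CREATE_INCIDENTS: boolean;
  MAX_SEVERITY_FOR_INCIDENT: number;
}

// Hypothetical subset of a TriageResults row.
interface TriageRow {
  actionable: boolean;
  severity: string; // assumed format: "P1" to "P4"
  incidentCreated?: boolean;
}

// "P3" -> 3; lower numbers are more severe.
function severityNumber(severity: string): number {
  return Number(severity.slice(1));
}

function shouldCreateIncident(cfg: GateConfig, row: TriageRow, alertStatus: string): boolean {
  return (
    cfg.CREATE_INCIDENTS && // master switch
    row.actionable === true && // pre-filtered and AI-suppressed rows fail here
    severityNumber(row.severity) <= cfg.MAX_SEVERITY_FOR_INCIDENT && // P4 is logged only
    row.incidentCreated !== true && // idempotency: skip already-stamped rows
    alertStatus === "active" // skip alerts that auto-resolved after triage
  );
}
```

Since the consumer runs on a timer, the incidentCreated check is what makes a repeated run over the same TriageResults rows safe.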

On success, the incident's sys_id and number are stamped back onto both TriageResults and AlertRegistry. The Incident column on the alerts list links directly to the incident in ServiceNow.

The incident description includes device/service/state, CMDB CI + assignment group, AI investigation notes, recommended action, and the originating Alert ID for traceability.