Alerts

Incidents created (7d)

Incidents created (30d)

Alerts received (7d)

AI calls saved (7d, pre-filter + cache)

Feedback (7d)

Device	Site	Service	State Change	Triage	Received	Status	Incident

Webhook Log

Time	Action	Status	Ticket #	Device

Feedback


Reviewers

Reviewer	Total	Correct / Wrong	Last review

Recent feedback

When	Reviewer	Verdict	Device / Service	AI severity	Notes	Incident

Proposed Changes

Each proposal below is a change to the AI triage system derived from feedback your team has submitted. For each one we show what triggered the idea, exactly what would change in the pipeline, what it still won't handle, and the open questions we need answered before implementing it. This page is read-only — add feedback on individual alerts to keep building the signal.



Alert Detail

How It Works

CMDB Enrichment

Before triage, the pipeline looks up the device in ServiceNow CMDB to get routing data (support group, assignment group) and CI criticality. All lookups are non-blocking — triage proceeds even if CMDB is unavailable.

Lookup Method	Badge	When Used
N-able Device ID → `u_nable_id`	Device ID	When N-central is configured to send `customTags.deviceId` — hard link, most reliable
Device name → CI `name`	Name match	Current fallback — parsed from N-able `details` field
Device IP → CI `ip_address`	IP match	Second fallback if name match fails
No match found	No match	CMDB fields blank; alert still triaged with standard rules

To enable the hard link, configure N-central's Custom PSA notification to include `customTags.deviceId` set to the N-able Device ID. No code change is needed: the pipeline will automatically use the Device ID once it appears in the webhook.
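The lookup order in the table above can be sketched as a simple fallback chain. This is a minimal illustration, not the pipeline's actual code: the `AlertIdentity` shape, the `lookupCi` name, the exact-match comparison, and the in-memory array standing in for ServiceNow are all assumptions. It also assumes that a Device ID with no matching CI falls through to the name and IP lookups.

```typescript
// In-memory sketch of the lookup order. The real pipeline queries ServiceNow;
// here a plain array stands in for the CMDB (an assumption for illustration).
type LookupBadge = "Device ID" | "Name match" | "IP match" | "No match";

interface CmdbCi {
  u_nable_id?: string;
  name?: string;
  ip_address?: string;
}

// Hypothetical alert shape: only customTags.deviceId is named on this page.
interface AlertIdentity {
  deviceId?: string;   // from customTags.deviceId, when N-central sends it
  deviceName?: string; // parsed from the N-able details field
  deviceIp?: string;
}

interface LookupResult {
  badge: LookupBadge;
  ci: CmdbCi | null;
}

function lookupCi(alert: AlertIdentity, cmdb: CmdbCi[]): LookupResult {
  // Each attempt: badge to report, value from the alert, CI field to compare.
  const attempts: Array<[LookupBadge, string | undefined, keyof CmdbCi]> = [
    ["Device ID", alert.deviceId, "u_nable_id"],
    ["Name match", alert.deviceName, "name"],
    ["IP match", alert.deviceIp, "ip_address"],
  ];
  for (const [badge, value, field] of attempts) {
    if (!value) continue; // nothing to match on, try the next method
    const ci = cmdb.find((c) => c[field] === value);
    if (ci) return { badge, ci };
  }
  // No match: CMDB fields stay blank and triage proceeds with standard rules.
  return { badge: "No match", ci: null };
}
```

Because the Device ID attempt comes first, an alert that carries `customTags.deviceId` never needs the name or IP fallback as long as the CI has `u_nable_id` populated.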

Alert Deduplication

When N-able fires a CREATE, the pipeline checks for an existing active alert with the same Device and Service before writing a new record.

Scenario	Outcome	Side Effect
No active alert exists for this Device + Service	New record created	`firstSeenAt` and `durationThresholdMins` set
Active alert exists — incoming severity equal to or lower than existing	Suppressed	`dupeSuppressedCount` incremented; existing ticket ID returned to N-able
Active alert exists — incoming severity higher than existing	Escalation — existing record updated	`toState`, `durationThresholdMins`, `lastNote` updated in place; no new record
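The three scenarios above amount to a small decision function. The sketch below is illustrative only: the `handleCreate` name, the record shapes, and the in-memory `Map` standing in for the alert store are assumptions, and the `durationThresholdMins` update is left out for brevity. The state ranking is the one given under State Transition Severity on this page.

```typescript
type State = "normal" | "warning" | "failed" | "critical";

// Ranking taken from the State Transition Severity section of this page.
const STATE_RANK: Record<State, number> = { normal: 0, warning: 1, failed: 2, critical: 3 };

interface AlertRecord {
  device: string;
  service: string;
  toState: State;
  firstSeenAt: string;
  dupeSuppressedCount: number;
  lastNote?: string;
}

interface IncomingCreate {
  device: string;
  service: string;
  toState: State;
  note?: string;
}

type DedupOutcome = "created" | "suppressed" | "escalated";

// `active` is a hypothetical in-memory stand-in for the store of active alerts.
function handleCreate(
  active: Map<string, AlertRecord>,
  incoming: IncomingCreate,
  now: Date,
): DedupOutcome {
  const key = `${incoming.device}|${incoming.service}`;
  const existing = active.get(key);

  if (!existing) {
    // No active alert for this Device + Service: write a new record.
    const record: AlertRecord = {
      device: incoming.device,
      service: incoming.service,
      toState: incoming.toState,
      firstSeenAt: now.toISOString(),
      dupeSuppressedCount: 0,
    };
    if (incoming.note !== undefined) record.lastNote = incoming.note;
    active.set(key, record);
    return "created";
  }

  if (STATE_RANK[incoming.toState] <= STATE_RANK[existing.toState]) {
    // Equal or lower severity: suppress and count the duplicate.
    existing.dupeSuppressedCount += 1;
    return "suppressed";
  }

  // Higher severity: escalate by updating the existing record in place.
  existing.toState = incoming.toState;
  if (incoming.note !== undefined) existing.lastNote = incoming.note;
  return "escalated";
}
```

In every branch the store holds at most one active record per Device + Service pair, which is what lets the existing ticket ID be returned to N-able on suppression.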

Duration Thresholds (captured, not yet enforced)

`durationThresholdMins` is stored on every new alert and indicates how long a condition should persist before a ServiceNow ticket is warranted. The field is captured for future use; the current incident gate (see ServiceNow Incident Creation below) does not consult it. The conditions below are evaluated in order, and the first match sets the threshold.

Condition	Threshold	Rationale
Service name contains `connect`	5 min	Connectivity losses are urgent
`toState` is `failed` or `critical`	30 min	Hard failures escalate faster than warnings
`toState` is `warning`	120 min	Warnings are often transient; 2hr buffer before actioning
All other cases	60 min	Default
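Read top to bottom, the table behaves like a first-match rule chain. A minimal sketch follows; the function name and the case-insensitive matching are assumptions, not confirmed details of the pipeline.

```typescript
// First matching rule wins, in the same order as the table above.
// Lower-casing both inputs is an assumption made for this sketch.
function computeDurationThresholdMins(serviceName: string, toState: string): number {
  const service = serviceName.toLowerCase();
  const state = toState.toLowerCase();

  if (service.includes("connect")) return 5;                   // connectivity losses are urgent
  if (state === "failed" || state === "critical") return 30;  // hard failures
  if (state === "warning") return 120;                        // often transient
  return 60;                                                  // default
}
```

Rule order matters: a connectivity service in the `warning` state gets 5 minutes, not 120, because the service-name check runs before the state checks.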

State Transition Severity

State severity ordering: normal=0 → warning=1 → failed=2 → critical=3. Dedup uses this ordering to decide whether an incoming alert is suppressed or escalates the existing record.

Transition	Dedup Action	Guidance
`Normal → Warning`	New or suppressed (0→1)	P4 at most unless connectivity. Ticket only if persists >120 min.
`Normal → Failed`	New record	No pre-existing alert, always creates. Threshold 30 min.
`Warning → Failed`	Escalation (1→2)	Existing record updated in place; threshold drops to 30 min.
`Warning → Warning`	Suppressed (1→1)	Duplicate suppressed; `dupeSuppressedCount` incremented.
`Failed → Warning`	Suppressed (2→1 recovery)	Recovery — lower severity suppressed. N-able RESOLVE closes the record.
Any → Normal (RESOLVE)	Resolved	N-able sends RESOLVE action; record marked resolved. No dedup applies.

Pre-Triage Noise Filter

Before any AI call, the triage consumer applies cheap deterministic rules that skip the AI entirely for obvious noise. Skipped alerts get a synthetic TriageResults row with `actionable=false`, `severity=P4`, `confidence=1.0`, and `preFiltered=true`, so they appear in stats but never reach the incident gate.

Condition	Suppression Reason	Why
`toState` ∈ {normal, healthy, up, ok, running, passed, good}	`recovery_event`	Recovery transition — clearance signal, not an incident
Same device + service was `resolved` within last 5 min	`flap_within_5min`	True flap; it slips past the receiver-level dedup, which only catches concurrent active alerts
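The two rules above can be expressed as one pure function that returns a suppression reason or `null`. This is a sketch under assumptions: the function names, the `suppressionReason` field, the trimming and lower-casing of `toState`, and the inclusive 5-minute boundary are illustrative choices, not confirmed details of the triage consumer.

```typescript
const RECOVERY_STATES = new Set(["normal", "healthy", "up", "ok", "running", "passed", "good"]);
const FLAP_WINDOW_MS = 5 * 60 * 1000;

type SuppressionReason = "recovery_event" | "flap_within_5min";

// `lastResolvedAt` is when the same device + service was last resolved,
// or null if there is no such record (hypothetical parameter).
function preFilterReason(
  toState: string,
  lastResolvedAt: Date | null,
  now: Date,
): SuppressionReason | null {
  if (RECOVERY_STATES.has(toState.trim().toLowerCase())) {
    return "recovery_event";
  }
  if (lastResolvedAt !== null && now.getTime() - lastResolvedAt.getTime() <= FLAP_WINDOW_MS) {
    return "flap_within_5min";
  }
  return null; // not obvious noise, continue to the cache and the AI
}

// Synthetic TriageResults row for a skipped alert, using the values stated above.
function syntheticTriageRow(reason: SuppressionReason) {
  return {
    actionable: false,
    severity: "P4",
    confidence: 1.0,
    preFiltered: true,
    suppressionReason: reason, // hypothetical field name
  };
}
```

Keeping the filter free of I/O means it can run on every alert in a batch before any AI call is attempted.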

Triage Signature Cache

After the pre-triage filter and before any AI call, the triage consumer checks a cache keyed by (deviceName, serviceName, fromState, toState). If a matching triage result was produced in the last 60 minutes, the cached result is reused — no AI call is made. Cache rows are written to the TriageCache table every time the AI produces a fresh result.

Cache hits are tagged on the alert with triageCacheHit=true and on the TriageResults row with cacheHit=true, cacheSourceAlertId=<prior alert id> for traceability. Reasoning, recommendation, and severity are copied verbatim from the source.

Why this works: most triage decisions are a function of the alert signature, not the specific incident. If MK-VE-WCC-C / Disk - C: / Normal → Failed was P2/actionable an hour ago, the same signature on the same device is almost certainly still P2/actionable now. The cache pairs with the pre-triage filter: flap storms for a single signature collapse to one AI call per hour.
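A 60-minute signature cache of this kind can be sketched as follows. The class and field names are assumptions, and an in-memory `Map` stands in for the TriageCache table; only the key fields, the 60-minute window, and the copied fields come from the description above.

```typescript
interface Signature {
  deviceName: string;
  serviceName: string;
  fromState: string;
  toState: string;
}

interface CachedTriage {
  severity: string;
  actionable: boolean;
  reasoning: string;
  recommendation: string;
  sourceAlertId: string; // becomes cacheSourceAlertId on a hit
  cachedAtMs: number;
}

const CACHE_TTL_MS = 60 * 60 * 1000;

// In-memory stand-in for the TriageCache table (assumption for illustration).
class TriageSignatureCache {
  private rows = new Map<string, CachedTriage>();

  private static key(sig: Signature): string {
    // JSON-encode the tuple so separators inside names cannot collide.
    return JSON.stringify([sig.deviceName, sig.serviceName, sig.fromState, sig.toState]);
  }

  // Written every time the AI produces a fresh result.
  put(sig: Signature, result: Omit<CachedTriage, "cachedAtMs">, nowMs: number): void {
    this.rows.set(TriageSignatureCache.key(sig), { ...result, cachedAtMs: nowMs });
  }

  // Returns the cached result if one was produced in the last 60 minutes.
  get(sig: Signature, nowMs: number): CachedTriage | undefined {
    const row = this.rows.get(TriageSignatureCache.key(sig));
    if (!row || nowMs - row.cachedAtMs > CACHE_TTL_MS) return undefined;
    return row;
  }
}
```

On a hit, the caller would copy `reasoning`, `recommendation`, and `severity` verbatim and record `sourceAlertId` for traceability; on a miss, the alert goes to the AI and the fresh result is written back with `put`.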

AI Triage

Claude classifies alerts that survive the pre-triage filter and miss the signature cache. The triage consumer is timer-driven: it runs every 2 minutes and processes alerts in batches of 10. The system prompt enforces aggressive suppression: default to non-actionable (P4) unless there is evidence of sustained impact.

actionable=true requires ALL three: (1) condition is sustained, not a transient spike; (2) there is a specific action an engineer can take right now; (3) inaction would cause user impact or service degradation. Otherwise actionable=false.

Common alerts the prompt classifies as P4 / non-actionable

Agent check-in delays (transient, resolves within minutes)
Backup job warnings (unless 3+ consecutive failures)
AV/EDR definition age (handled by scheduled update process)
Patch pending (informational, handled by patch management)
CPU or memory spike under 15 minutes on non-critical servers
Single disk I/O spike without sustained degradation
Certificate expiry more than 14 days away
Network utilization spike under 10 minutes
Windows service restart (unless restart loop or critical service)

Other prompt rules

Cascading failures: when an upstream failure is the root cause, the upstream alert is P2/actionable; downstream alerts are marked `correlated_symptom_of_upstream_failure` and non-actionable.
Disk < 90% on non-critical storage → P4 at most.
reopenCount > 5 → chronic_flapper, severity reduced one tier.
CMDB Importance 1 (Critical) or 2 (High): raise severity one tier on P2/P3 boundary.

Severity guide

Severity	Definition	Examples	Expected Action
P1	Critical — outage or data loss risk	Domain controller unreachable, RAID failure, hypervisor down	Immediate on-call; ticket within 15 min
P2	High — significant degradation, customer-impacting	Memory >90%, disk >95%, connectivity failed on primary link	Ticket within 1 hour
P3	Medium — performance warning, no immediate impact	CPU sustained >80%, memory warning, backup job late	Ticket within next business day
P4	Low — informational or transient	Normal → Warning (short duration), patch pending	Log only; ticket only if condition persists beyond threshold

ServiceNow Incident Creation

A separate timer (incidentConsumer, every 5 minutes, batches of 10) reads TriageResults and creates incidents in thrivetest.service-now.com via the scripted REST endpoint /api/thrive/thrive_ai_triage/incident.

Gate	Behavior
`CREATE_INCIDENTS`	Master switch — currently true.
`actionable === true`	Required. Pre-filtered + AI-suppressed alerts never qualify.
Severity ≤ `MAX_SEVERITY_FOR_INCIDENT` (3)	P1, P2, P3 create incidents. P4 is logged only.
`incidentCreated !== true`	Idempotent — already-stamped triage rows are skipped.
Alert `status === 'active'`	If the alert auto-resolved between triage and incident creation, no incident is created.
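Taken together, the gates in the table form a single conjunction. The sketch below is illustrative: the function name, the interface shapes, and the numeric severity encoding (P1 = 1 through P4 = 4) are assumptions. The gate names and comparisons follow the table.

```typescript
// Hypothetical row shape; severity is encoded as 1..4 for P1..P4 (assumption).
interface TriageRow {
  actionable: boolean;
  severity: number;
  incidentCreated?: boolean;
}

interface GateConfig {
  CREATE_INCIDENTS: boolean;
  MAX_SEVERITY_FOR_INCIDENT: number;
}

function shouldCreateIncident(row: TriageRow, alertStatus: string, cfg: GateConfig): boolean {
  return (
    cfg.CREATE_INCIDENTS &&                          // master switch
    row.actionable === true &&                       // pre-filtered and AI-suppressed rows fail here
    row.severity <= cfg.MAX_SEVERITY_FOR_INCIDENT && // P1..P3 pass, P4 is logged only
    row.incidentCreated !== true &&                  // idempotency: skip already-stamped rows
    alertStatus === "active"                         // skip alerts that resolved in the meantime
  );
}
```

With `MAX_SEVERITY_FOR_INCIDENT` set to 3, an actionable P3 row on an active alert passes, while the same row at P4, or on an alert whose status is no longer `active`, does not.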

On success, the incident's sys_id and number are stamped back onto both TriageResults and AlertRegistry. The Incident column on the alerts list links directly to the incident in ServiceNow.

The incident description includes device/service/state, CMDB CI + assignment group, AI investigation notes, recommended action, and the originating Alert ID for traceability.