Measure what customers feel, what gets resolved, how fast it happens, how safely it runs, and how much it costs. Put KPIs into eight buckets: customer outcomes, speed, model/recognition quality, task/tool success, handover quality, reliability, compliance/safety, and economics.
- Customer outcomes
- CSAT (post-call satisfaction survey) and NPS: track by intent, hour, and language.
- Sentiment delta: change from call start to end; target positive shift.
- First Contact Resolution (FCR): issue resolved without recontact within X days.
- No-repeat within 72 hours: percent of calls that don’t trigger follow-ups on the same issue.
- Abandonment rate: callers who drop before engagement or during long silences.
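The no-repeat metric above can be sketched in a few lines. This is a minimal illustration, assuming call records shaped as (customer_id, intent, timestamp) tuples, which is a hypothetical log schema:

```python
from datetime import timedelta

def no_repeat_rate(calls, window_hours=72):
    """Share of calls NOT followed by another call on the same
    (customer, intent) pair within the window. `calls` is a list of
    (customer_id, intent, timestamp) tuples with datetime timestamps."""
    calls = sorted(calls, key=lambda c: c[2])
    repeats = 0
    for i, (cust, intent, ts) in enumerate(calls):
        for cust2, intent2, ts2 in calls[i + 1:]:
            if (cust2, intent2) == (cust, intent) and ts2 - ts <= timedelta(hours=window_hours):
                repeats += 1  # this call was followed up on the same issue
                break
    return 1 - repeats / len(calls)
```

The same scan, run with a longer window and recontacts matched across channels, approximates FCR.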
- Speed and responsiveness
- Time to answer (ASA) and time to first word (TTFW): speed from connect to AI speaking.
- End-to-end handle time (AHT) for contained calls; resolution time for multi-step journeys.
- Latency p50/p95 per turn: ASR, LLM/reasoning, TTS; barge-in responsiveness.
- Queue time to human: when an escalation occurs.
- Callback time met SLA: when scheduling replaces live transfer.
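Per-turn latency percentiles reduce to a nearest-rank computation over stage timings. A sketch, where the stage keys (asr_ms, llm_ms, tts_ms) are illustrative names for your instrumentation:

```python
import math

def pctl(values, p):
    """Nearest-rank percentile of a non-empty list of latencies."""
    vs = sorted(values)
    return vs[max(0, math.ceil(p / 100 * len(vs)) - 1)]

def turn_latency_report(turns, stages=("asr_ms", "llm_ms", "tts_ms")):
    """turns: list of dicts of per-stage latencies in milliseconds."""
    report = {}
    for stage in stages:
        vals = [t[stage] for t in turns]
        report[stage] = {"p50": pctl(vals, 50), "p95": pctl(vals, 95)}
    return report
```

Watch p95 more than p50: one slow stage in a turn is enough to break conversational rhythm, and averages hide it.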
- Model and recognition quality
- Intent recognition accuracy: correct top intent on first try (by ground-truth set).
- Entity/slot capture accuracy: IDs, dates, amounts captured and validated correctly.
- ASR quality: word error rate (WER) and entity WER; out-of-vocabulary error rate.
- Groundedness rate: answers supported by approved sources; hallucination rate.
- Clarification effectiveness: % of low-confidence turns successfully resolved after one clarification.
- Escalation confidence calibration: low-confidence triggers that correctly needed a handover.
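WER is word-level edit distance over a reference transcript. A self-contained sketch, good for spot checks; a real pipeline would normalize casing and punctuation first:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between token sequences,
    divided by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / len(r)
```

Entity WER is the same computation restricted to tokens tagged as entities (IDs, dates, amounts), which is where misrecognition hurts most.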
- Task and tool success (what the AI actually completes)
- Containment rate: % of conversations resolved without human transfer.
- Tool success rate: successful API actions (payments, IDV, bookings) / attempts.
- RAG hit rate: retrieval returns the right doc/snippet; doc freshness coverage.
- Authentication success rate: verified identity without human help.
- Payment success rate (PCI-safe flows): tokenization complete and receipt issued.
- Scheduling/booking completion rate; reschedule/cancel success.
- Callback completion rate and within-SLA completion.
- Link engagement: SMS/email click-through for instructions or documents.
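Containment and per-tool success rates are simple counts once the events are logged. The record shapes below (a boolean `transferred` flag, (tool, succeeded) pairs) are assumptions about your schema:

```python
from collections import defaultdict

def containment_rate(conversations):
    """conversations: dicts with a boolean 'transferred' flag."""
    contained = sum(1 for c in conversations if not c["transferred"])
    return contained / len(conversations)

def tool_success_rates(attempts):
    """attempts: (tool_name, succeeded) pairs -> success rate per tool."""
    tally = defaultdict(lambda: [0, 0])  # tool -> [successes, total]
    for tool, ok in attempts:
        tally[tool][0] += bool(ok)
        tally[tool][1] += 1
    return {tool: s / n for tool, (s, n) in tally.items()}
```

Breaking tool success out per tool matters: a 95% blended rate can hide a payments flow failing half the time.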
- Handover quality (when AI and humans collaborate)
- Transfer rate: % of conversations handed to humans (aim for smart transfers, not just a low number).
- Time to human: from transfer decision to human pick-up.
- Warm transfer context completeness: identity verified, summary, attempted steps, disposition included.
- No-repeat after transfer: customer doesn’t need to restate info; human resolves in one go.
- Minutes saved on escalations: time the AI saved the human (prefilled fields, summary, reduced ACW).
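Context completeness is easy to score when the handover payload has a fixed schema. The required field names below are illustrative, not a standard:

```python
REQUIRED_CONTEXT = ("identity_verified", "summary", "attempted_steps", "disposition")

def context_completeness(handover):
    """Fraction of required fields present and non-empty in a handover payload."""
    return sum(1 for f in REQUIRED_CONTEXT if handover.get(f)) / len(REQUIRED_CONTEXT)
```

Averaged over all transfers, this gives a single warm-transfer quality number; any payload scoring below 1.0 is a candidate for the fix backlog.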
- Reliability and resilience
- Availability/uptime by region; incident minutes outside SLO.
- Error rate by type: ASR failures, API timeouts, LLM errors, tool exceptions.
- Telephony health: connect rate, drop rate, jitter/packet loss beyond thresholds.
- Rate limiting/backoff events and graceful degradation success (message delivered, callback set).
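Availability against an SLO and error rate by type both reduce to ratios over the event log. A sketch, assuming one event entry per turn (either an error-type string or None):

```python
from collections import Counter

def availability(period_minutes, incident_minutes):
    """Uptime fraction: minutes inside SLO over total minutes in the period."""
    return (period_minutes - incident_minutes) / period_minutes

def error_rates(turn_events):
    """turn_events: one entry per turn, an error-type string or None."""
    total = len(turn_events)
    counts = Counter(e for e in turn_events if e is not None)
    return {etype: n / total for etype, n in counts.items()}
```

For context, a 99.5% weekly availability target allows roughly 50 minutes of incident time out of the 10,080 minutes in a week.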
- Compliance and safety
- Consent capture rate (recording and outreach, jurisdiction-aware).
- Redaction efficacy: PII/PHI/PAN leakage rate in transcripts/logs (target: near zero).
- PCI compliance adherence: DTMF masking engaged where needed; zero PAN/CVV in prompts/logs.
- Policy adherence: responses adhere to approved content; risky-topic deflection success.
- Data subject request SLA: export/delete completed on time.
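Redaction efficacy can be spot-checked by scanning post-redaction transcripts for residual sensitive-looking strings. The regexes below are deliberately simple illustrations; a production audit should use a vetted PII/PCI detection library, not hand-rolled patterns:

```python
import re

PAN_LIKE = re.compile(r"\b(?:\d[ -]?){13,19}\b")        # card-number-shaped digit runs
EMAIL_LIKE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # email-shaped strings

def leakage_rate(redacted_transcripts):
    """Share of supposedly-redacted transcripts still containing
    PAN- or email-like strings (target: near zero)."""
    leaked = sum(
        1 for t in redacted_transcripts
        if PAN_LIKE.search(t) or EMAIL_LIKE.search(t)
    )
    return leaked / len(redacted_transcripts)
```

Any nonzero result should page an owner, since each hit is a potential compliance incident rather than a trend to watch.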
- Economics and capacity impact
- Cost per resolved interaction (AI-contained vs escalated vs human-only).
- Containment-adjusted cost savings: baseline vs post-AI period.
- Agent assist impact: AHT reduction, ACW reduction, suggestion acceptance rate.
- Volume shift: % of total volume handled after-hours; language coverage without added headcount.
- ROI: savings + revenue protection (reduced churn/retention rescues) minus AI stack costs.
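Cost per resolved interaction and the ROI line above can be sketched as below. The segment keys are assumptions, and expressing ROI as a ratio of net value to AI stack cost is one common convention; some teams report the net value itself:

```python
def cost_per_resolved(costs, resolved):
    """Per-segment unit cost, e.g. keys 'ai_contained', 'escalated', 'human_only'."""
    return {seg: costs[seg] / resolved[seg] for seg in costs}

def roi_ratio(savings, revenue_protected, ai_stack_cost):
    """(savings + revenue protection - AI stack cost) / AI stack cost."""
    return (savings + revenue_protected - ai_stack_cost) / ai_stack_cost
```

Keeping the segments separate is the point: the AI-contained unit cost only means something next to the escalated and human-only baselines.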
Quick checklist
- Define clear outcome labels (resolved, escalated, callback set, abandoned).
- Instrument turn-level events and timestamps; capture confidences and retrieved sources.
- Maintain gold-standard test sets and human QA workflows.
- Segment KPIs by intent, hour, language, and region; publish a weekly scorecard.
- Tie KPIs to actions: a named owner for each metric, threshold alerts, and a backlog of fixes.
- Protect privacy in analytics: redact, tokenize, limit access, and audit exports.
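The segmentation step above is a group-by over call records. A minimal sketch with illustrative field names:

```python
from collections import defaultdict

def segmented_mean(calls, metric, by=("intent", "language")):
    """Mean of `metric` per segment key, e.g. CSAT by (intent, language)."""
    groups = defaultdict(list)
    for call in calls:
        groups[tuple(call[k] for k in by)].append(call[metric])
    return {seg: sum(vals) / len(vals) for seg, vals in groups.items()}
```

Run weekly per KPI and diff against the prior week; segments that move while the blended number stays flat are where the fixes are hiding.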
Success isn’t one number. Track a balanced set of KPIs that reflect customer happiness, speed, correctness, safe operations, and cost. Instrument from day one, audit weekly, run comparisons against human baselines, and use the insights to tune prompts, content, and routing. That’s how you turn an AI voice agent into a reliable, measurable business asset.