When handling sensitive data, always start from the principle that less data means less risk. Use data minimization, strong encryption, redaction and tokenization, zero-trust access, and purpose-built flows (PCI-safe for payments, HIPAA-compliant for health). Keep sensitive values out of models and logs, limit retention, and prove controls with audits.
What counts as sensitive
- Payment data: PAN, CVV, expiry, bank account numbers, tokens
- Health data: PHI (diagnoses, treatments, member IDs), biometric identifiers
- Government IDs: SSN/NIN, driver’s license, passport
- Authentication secrets: passwords, OTPs, recovery codes
- Special categories (GDPR): health, biometrics, racial/ethnic origin, etc.
Core principles
- Collect the minimum necessary; only for clear, lawful purposes
- Keep sensitive values out of transcripts, prompts, and general logs
- Process through specialized, certified systems; never in general LLMs
- Encrypt everywhere; restrict access rigorously; monitor continuously
- Delete promptly according to policy; retain only what is legally required
Security foundations
- In transit: TLS 1.2+ for signaling/APIs, SRTP/DTLS-SRTP for media
- At rest: AES-256 with cloud KMS/HSM, per-tenant keys, periodic rotation
- Network: private networking (VPC/PrivateLink), IP allowlists, mTLS to back-end services
- Access: SSO/MFA, least-privilege RBAC/ABAC, just-in-time access, audited exports
- Environment isolation: no production data in non-prod; use synthetic data for testing
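
The in-transit and mTLS items above can be sketched with Python's standard `ssl` module. This is a minimal illustration, not a hardened configuration; the function name and the certificate path parameters are assumptions made for the example.

```python
import ssl
from typing import Optional


def make_client_context(
    client_cert: Optional[str] = None,
    client_key: Optional[str] = None,
    ca_bundle: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything below TLS 1.2."""
    # create_default_context turns on certificate and hostname verification.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_bundle)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert:
        # Presenting a client certificate is what makes the connection mutual TLS.
        # The paths are hypothetical and would point at files issued by your own CA.
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

Called with no arguments, the context still verifies the server and enforces the TLS floor; passing a certificate and key adds the client half of mTLS for calls to back-end services.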
Payment information (PCI DSS)
- Scope reduction
  - Collect card data via DTMF masking/tone suppression; pause/resume recording
  - Optional web handoff to a hosted payment page; keep PAN/CVV out of the voice pipeline
- Tokenization
  - Replace PAN with tokens from a PCI-certified gateway; store only tokens and last 4 digits
- No-go list
  - Never send PAN/CVV to LLMs, transcripts, analytics, or support tickets
  - Never store CVV after authorization
- Evidence
  - Annual assessments (SAQ or ROC), quarterly vulnerability scans, segmented network, least-privilege access
  - Receipt storage without PAN; encryption and strict retention
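
As a rough sketch of the tokenization step, the code below keeps only a gateway token and the last four digits. It assumes this logic runs inside the PCI-scoped capture component rather than the voice pipeline. The `tokenize` callable is a hypothetical stand-in for a gateway call, and the 13 to 19 digit length check is an assumption about typical card numbers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StoredCard:
    token: str  # opaque reference issued by the gateway
    last4: str  # safe to display or read back for confirmation


def to_stored_card(pan: str, tokenize: Callable[[str], str]) -> StoredCard:
    """Exchange a PAN for a token and keep only the token and last 4 digits."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    if not 13 <= len(digits) <= 19:  # assumed typical PAN length range
        raise ValueError("unexpected card number length")
    # `tokenize` is hypothetical; the PAN itself is never placed in the record.
    return StoredCard(token=tokenize(digits), last4=digits[-4:])


def spoken_confirmation(card: StoredCard) -> str:
    """Confirm the card to the caller without repeating the full number."""
    return f"card ending in {card.last4}"
```

Everything downstream (receipts, support tickets, analytics) would then work from `StoredCard` only.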
Health data (HIPAA and GDPR Art. 9)
- Contracting
  - Business Associate Agreements with all PHI-handling vendors
- Minimum necessary and segregation
  - Separate PHI stores; deny default access for non-care teams
- De-identification
  - Remove HIPAA Safe Harbor identifiers for analytics; use limited data sets with DUAs where needed
- Model handling
  - Do not train models on PHI; prefer VPC/on-prem inference or PHI-isolated providers
- Logging and retention
  - Redact PHI from transcripts/logs before analytics; short TTL caches; policy-driven retention
- Patient rights
  - Support access/amendment where applicable; secure portals for disclosures
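
The de-identification item could look roughly like the sketch below. The field names are hypothetical and cover only a subset of the Safe Harbor identifier categories. Dates are assumed to be ISO 8601 strings (YYYY-MM-DD), and special cases such as ages over 89 or small-population ZIP codes are not handled.

```python
from typing import Any, Dict

# Hypothetical field names; a real schema must map every Safe Harbor category.
DIRECT_IDENTIFIERS = {"name", "phone", "email", "ssn", "member_id", "address"}
DATE_FIELDS = {"date_of_birth", "admission_date"}


def deidentify(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with direct identifiers dropped and dates cut to the year."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # drop the field entirely
        if key in DATE_FIELDS:
            # Assumes "YYYY-MM-DD"; keep only the year component.
            out[f"{key}_year"] = str(value)[:4]
        else:
            out[key] = value
    return out
```

A drop-list like this fails open for unknown fields, so an allowlist of analytics-safe fields may be the safer design in practice.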
Government IDs and KYC
- Capture via OTP, document verification services, or masked DTMF
- Hash or tokenize identifiers; never send raw values to general-purpose LLMs
- Retain only as long as required by KYC/AML laws; encrypt at rest; monitor access
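
For the hashing option, a plain hash of a short numeric ID can be brute-forced, so the sketch below uses a keyed HMAC instead. The function name is illustrative, and the key is assumed to come from a KMS or secrets manager rather than from code.

```python
import hashlib
import hmac


def id_fingerprint(raw_id: str, key: bytes) -> str:
    """Keyed fingerprint of an identifier, usable for matching without storing it."""
    # Normalize so "000-12-3456" and "000 12 3456" map to the same fingerprint.
    normalized = "".join(ch for ch in raw_id if ch.isalnum()).upper()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

The fingerprint supports duplicate detection and lookup, but it cannot be reversed to recover the ID. Where the raw value is needed later, tokenization through a vault would be the alternative.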
Redaction and masking
- Real time
  - Detect and mask numbers that look like PAN/SSN; tone-mask in audio; pause recording during sensitive steps
- Post-processing
  - Auto-redact PII/PHI in transcripts before storage, indexing, or analytics
- Structured capture
  - Use validators and class-based grammars to reduce miscapture; confirm critical fields back to the user without repeating full sensitive values
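
A minimal sketch of pattern-based transcript redaction follows. The regular expressions and placeholder labels are assumptions for illustration. The Luhn checksum is used only to cut false positives on long digit runs, and a production system would likely add ML-based entity detection on top.

```python
import re

# US SSN written with hyphens, e.g. 000-12-3456.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSN-like and Luhn-valid card-like numbers with placeholders."""

    def mask_pan(m: re.Match) -> str:
        digits = re.sub(r"\D", "", m.group())
        return "[REDACTED_PAN]" if luhn_ok(digits) else m.group()

    text = SSN_RE.sub("[REDACTED_SSN]", text)
    return PAN_RE.sub(mask_pan, text)
```

For example, `redact("My card is 4111 1111 1111 1111")` returns `"My card is [REDACTED_PAN]"`, while a 13-digit order number that fails the Luhn check is left untouched.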
Model and vendor data handling
- Default to data isolation; opt out of provider training on your data
- Region pinning and data residency; Standard Contractual Clauses or Data Privacy Framework for cross-border transfers
- Limit prompts to non-sensitive context; prefer retrieval from secure KBs and deterministic APIs
- Prefer private/VPC inference for regulated workloads; monitor for prompt injection attempts and block tool misuse
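
One way to limit prompts to non-sensitive context is an explicit allowlist applied before any prompt is assembled. The sketch below assumes hypothetical field names; the point is that new fields stay out of prompts until someone deliberately adds them.

```python
from typing import Any, Dict

# Hypothetical field names: only these may ever reach a prompt.
PROMPT_SAFE_FIELDS = ("first_name", "plan_tier", "open_ticket_topic")


def build_prompt_context(customer: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; everything else stays behind the API."""
    return {k: customer[k] for k in PROMPT_SAFE_FIELDS if k in customer}
```

Sensitive lookups (balances, claims, card status) would then go through deterministic APIs that return only the result the conversation needs.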
Purpose limitation and lawful basis
- Payments: legitimate interest/contract necessity; store consent for recurring charges
- Health: explicit consent or applicable legal basis; disclose processing purposes clearly
- Recording: comply with one-party or all-party consent rules; play the correct disclosure for each locale; log the consent outcome
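
Logging the consent outcome might be sketched as below. The locale keys, disclosure IDs, and record fields are all hypothetical. The only design choice being illustrated is failing closed when no disclosure is configured for a locale.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical disclosure script IDs keyed by locale.
DISCLOSURES = {"en-US": "rec_notice_en_us", "de-DE": "rec_notice_de_de"}


@dataclass(frozen=True)
class ConsentRecord:
    call_id: str
    locale: str
    disclosure_id: str
    granted: bool
    recorded_at: str  # ISO 8601 timestamp


def log_consent(call_id: str, locale: str, granted: bool,
                now: Optional[datetime] = None) -> ConsentRecord:
    """Record which disclosure was played and whether the caller agreed."""
    if locale not in DISCLOSURES:
        # Fail closed: without a configured disclosure, do not proceed to record.
        raise LookupError(f"no recording disclosure configured for {locale}")
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return ConsentRecord(call_id, locale, DISCLOSURES[locale], granted, ts)
```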
Retention, residency, and deletion
- Configurable retention by data type (audio, transcripts, tokens, PHI)
- Localize storage/processing in required regions; separate EU/UK/US
- Automated deletion with verified erasure; immutable/WORM archives only where legally mandated (e.g., FINRA/MiFID)
- Data subject requests: search, export, and delete across systems; documented timelines
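
Retention by data type can be expressed as a small policy table plus an expiry check that a deletion job consults. The windows below are placeholders, not recommendations; real values come from policy and applicable law.

```python
from datetime import datetime, timedelta

# Hypothetical retention windows per data type.
RETENTION = {
    "audio": timedelta(days=30),
    "transcript": timedelta(days=90),
    "payment_token": timedelta(days=365),
}


def is_expired(data_type: str, created_at: datetime, now: datetime) -> bool:
    """True once an item has reached the end of its retention window."""
    # Unknown data types raise KeyError instead of defaulting to "keep forever".
    return now - created_at >= RETENTION[data_type]
```

Passing `now` explicitly keeps the check easy to test and lets a batch job evaluate every item against the same instant.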
Monitoring, audit, and incident response
- Tamper-evident audit logs for access, exports, and admin actions; stream to SIEM
- Real-time alerts for anomalous queries, large exports, failed mTLS, or excessive PII in prompts
- Regular pen tests, vulnerability scans, and access reviews
- Incident runbooks: containment, forensics, regulator/customer notification within required timelines
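
Tamper evidence in audit logs is often achieved by chaining entries with hashes, so that editing any earlier entry invalidates everything after it. The sketch below shows the idea with an in-memory list. The entry layout is an assumption, and the chain only helps if the latest hash is also anchored somewhere the writer cannot modify (for example the SIEM).

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: List[Dict], event: Dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: List[Dict]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```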
What to never do
- No PAN/CVV/SSN/PHI in LLM prompts, summaries, or analytics datasets
- No plaintext secrets in code or logs; no unsecured exports
- No indiscriminate retention “just in case”
- No vendor usage without DPA/BAA, deletion SLAs, and audit reports
Managing sensitive data safely is about disciplined design and operations: collect less, process through the right specialized paths, encrypt and isolate, keep values out of models and logs, and delete quickly. With these controls, you can deliver fast, helpful voice experiences without compromising privacy or compliance.