Healthcare AI Pipeline: From Research to Clinician Assist
Problem
A healthcare partner needed an AI-assisted workflow for a specific clinical task. Requirements: (1) inference in under 500 ms at p99, (2) every prediction traceable to a model version and input hash, (3) no PII in logs or telemetry, (4) clinician always in the loop: the system suggests, the human decides. We had to integrate with existing EHR-adjacent systems and pass security and compliance review.
Architecture
Data flows one way: EHR/API → our ingestion layer (de-identify where needed, hash for idempotency) → feature store → model serving → result + metadata stored and returned to the client app. All steps are logged with correlation IDs; no PHI leaves the boundary. The clinician UI shows the suggestion, the confidence score (when available), and the option to accept, reject, or edit.
Flow: EHR → Ingestion (de-id) → Features → Model → Audit DB; result and metadata to Clinician UI.
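The boundary step above can be sketched in a few lines. This is a minimal illustration, not the production code: the PHI field names and the record shape are hypothetical, and the real feed is HL7/FHIR, but it shows the pattern of stripping identifiers, hashing the raw input for idempotency and audit, and minting a correlation ID at the boundary.

```python
import hashlib
import json
import uuid

# Illustrative PHI field names; the real list comes from the HL7/FHIR schema.
PHI_FIELDS = {"name", "dob", "mrn", "address"}

def deidentify(record: dict) -> tuple[dict, str, str]:
    """Strip PHI at the boundary, hash the raw input for idempotency,
    and mint a correlation ID that follows the request downstream."""
    # Deterministic hash over the canonicalized raw record; the audit
    # trail stores this hash, never the record itself.
    canonical = json.dumps(record, sort_keys=True).encode()
    input_hash = hashlib.sha256(canonical).hexdigest()
    # Only non-PHI fields continue on to the feature store.
    features = {k: v for k, v in record.items() if k not in PHI_FIELDS}
    correlation_id = str(uuid.uuid4())
    return features, input_hash, correlation_id
```

The same input always produces the same hash, which is what makes duplicate deliveries from the upstream feed safe to detect and drop.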
Tech stack
- Ingestion: Python service consuming HL7/FHIR messages; de-identification and hashing for idempotency; output to feature store and to audit (hashes only, no PHI).
- Model serving: ONNX Runtime for inference; model version pinned and logged with every request.
- Storage: PostgreSQL for audit (request ID, model version, input hash, output, timestamp); no PII. Feature store for derived inputs; retention per policy.
- Orchestration: Airflow for periodic retraining and evaluation; pipelines versioned in Git.
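The audit write described above is the simplest piece of the stack, so here is a sketch of it. SQLite stands in for PostgreSQL to keep the example self-contained, and the column names are illustrative rather than the production schema; the key property is that only hashes, IDs, and outputs are ever written.

```python
import sqlite3
import time

def init_audit(conn: sqlite3.Connection) -> None:
    # One row per prediction: request ID, pinned model version,
    # input hash, output, timestamp. No PHI columns exist at all.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit (
               request_id    TEXT PRIMARY KEY,
               model_version TEXT NOT NULL,
               input_hash    TEXT NOT NULL,
               output        TEXT NOT NULL,
               ts            REAL NOT NULL
           )"""
    )

def record_prediction(conn: sqlite3.Connection, request_id: str,
                      model_version: str, input_hash: str,
                      output: str) -> None:
    conn.execute(
        "INSERT INTO audit VALUES (?, ?, ?, ?, ?)",
        (request_id, model_version, input_hash, output, time.time()),
    )
    conn.commit()
```

Because the schema simply has no place to put raw inputs, "no PII in the audit DB" is enforced by construction rather than by review.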
Tradeoffs
- Latency vs. explainability: We could have added explainability (e.g. feature importance) at the cost of extra compute and latency. We shipped with confidence scores and left full explainability for a later phase when we had latency headroom.
- Flexibility vs. compliance: We locked down the production path (no ad-hoc queries, no raw data export) to satisfy compliance. Analytics and research used a separate, approved pipeline with strict access control.
- Model freshness vs. stability: We chose scheduled retrains (e.g. monthly) with full validation rather than continuous deployment. That reduced the risk of regressions and kept the audit trail clean: one pinned model version per period.
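The scheduled-retrain tradeoff implies a promotion gate at the end of each run. A minimal sketch, assuming AUC as the validation metric and illustrative thresholds (the actual metric and margins were set per task): the candidate must clear an absolute floor and beat the currently deployed model by a margin, otherwise the pipeline keeps serving the pinned version.

```python
def should_promote(candidate_auc: float, current_auc: float,
                   min_auc: float = 0.80, margin: float = 0.005) -> bool:
    """Gate a retrained model: it must clear an absolute quality floor
    AND beat the currently deployed model by a small margin; anything
    less keeps the pinned version in production."""
    return candidate_auc >= min_auc and candidate_auc >= current_auc + margin
```

The margin matters: a candidate that is only epsilon better is not worth the churn of a new model version in the audit trail.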
Metrics
- Inference: p99 latency under 500 ms; target met with ONNX Runtime and adequately provisioned hardware.
- Audit: 100% of predictions written to audit DB with model version and input hash.
- Compliance: zero PII in application logs or telemetry; only hashes and IDs.
- Uptime: 99.95% over the measurement period; incidents were unrelated to the model (e.g. DB failover).
Lessons
- Auditability from day one. Designing the audit log and correlation IDs into the first version avoided a painful retrofit and made compliance review straightforward.
- Clinician in the loop is a product constraint. The UI had to make it obvious that the AI suggests and the human decides; we avoided any flow that could auto-apply without confirmation.
- De-identify at the boundary. Keeping PHI out of our logs and telemetry required discipline at ingestion; once we had the pattern, it scaled to new data sources.
- Version everything. Model version, pipeline version, and config version in the audit trail made debugging and rollback tractable.
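The "de-identify at the boundary" discipline also applied to application logs. A sketch of one defense-in-depth layer, assuming Python's standard logging module: a filter that redacts anything matching a known identifier pattern before a record is emitted. The MRN regex is illustrative; in practice the primary control was field-level stripping at ingestion, with patterns like this as a backstop.

```python
import logging
import re

# Illustrative pattern: medical record numbers like "MRN: 12345".
MRN_PATTERN = re.compile(r"\bMRN\s*[:#]?\s*\d+\b", re.IGNORECASE)

class RedactPHIFilter(logging.Filter):
    """Backstop filter: redact identifier-like substrings so they
    never reach handlers, files, or downstream telemetry."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format lazy %-args first, then redact the final message.
        message = record.getMessage()
        record.msg = MRN_PATTERN.sub("[REDACTED]", message)
        record.args = None
        return True
```

Attaching the filter to every handler (rather than one logger) is what makes it hard to bypass accidentally when new modules add their own loggers.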