Claims AI Reliability Platform
A governed release and monitoring system for deciding when a claims knowledge assistant is safe enough to scale — and when it should be restricted, reviewed, or rolled back.
I built this project around a harder question than “can the assistant answer claims questions?” The real question was whether a Swiss insurer could trust that assistant enough to release it into controlled production use.
That meant the project had to go beyond a useful interface or a decent retrieval flow. It needed governed release behavior, benchmark discipline, escalation-sensitive evaluation, live telemetry, incident logic, review workflows, and containment recommendations if live behavior later became unsafe.
Release discipline before deployment, and containment logic after deployment.
Why a normal prototype was not enough
A promising internal assistant is still a risk if the business cannot tell when it is safe to release, when it should escalate, and how unsafe live behavior should be contained once the system is already in use.
The real business tension
In a claims setting, helpful is not enough. The assistant also has to be governable.
Management did not need a prettier prototype. It needed evidence. The company had to know whether a release candidate handled high-risk boundary cases correctly, whether escalation behavior held up under evaluation, and whether live degradation could still be detected after approval.
That changed the shape of the project. Instead of stopping at answer generation, I built around operational trust.
What made the problem operationally serious
- A fluent answer is not the same as a safe answer
- Claims workflows have boundary cases where escalation matters more than helpfulness
- Management needed release evidence before broader adoption, not confidence based on a good demo
- Even an approved release still needed monitoring in case live behavior later degraded
What I actually built
The project is best understood as a controlled release and monitoring system around a claims knowledge assistant, not as a standalone chatbot.
Governed assistant
A FastAPI claims knowledge assistant built with release-aware behavior, explicit retrieval control, and boundary-sensitive answer logic rather than open-ended assistant behavior.
Benchmark release gate
A structured evaluation layer across smoke, release-gate, boundary, and change-sensitive benchmark suites, with blocker logic, scoring, and release recommendation.
Live monitoring and incident logic
A telemetry and monitoring layer that converts unsafe live interactions into incidents, review queue items, and containment recommendations instead of leaving risk hidden inside logs.
Packaged runtime with persistence
A Docker and PostgreSQL-backed runtime that persists telemetry and monitoring state so the system feels operational, not just locally demoable.
The release-gate story
The release process was not ceremonial. It materially changed the outcome. Early release candidates failed on scoring, blocker logic, and high-risk handling. The final one only passed once the benchmark results were good enough to justify scaled production.
The benchmark gate forced iteration. It did not just describe quality. It changed the release path.
The bug that changed the release outcome
The most credible part of the release story is that the benchmark exposed a real boundary-handling defect in the assistant logic rather than just producing abstract scores.
What the benchmark caught
- A legal-liability / out-of-scope boundary case entered a product-line boundary branch first
- The answer was blocked, but it did not behave like a true mandatory escalation path
- That meant the system was technically restrictive, but operationally weaker than it should have been
- The benchmark exposed that behavior, the service logic was fixed, and the release decision improved because of it
The difference between “blocked” and “properly escalated” mattered enough to affect the release decision.
Live monitoring and rollback story
The project’s strongest Phase 3 proof is that approval was not treated as the end of the story. A release that had already been cleared could still later be detected as unsafe in live use.
What happened in live monitoring
- A deliberately unsafe post-release interaction was injected into the live workflow
- The assistant answered directly where escalation should have happened
- That event triggered multiple incidents across boundary, escalation, and grounding-related logic
- The platform also created a P1 review queue item and raised release degradation incidents
- The top containment action became rollback_release
Why this changes the project’s weight
A lot of AI case studies stop once the release is approved. This one keeps going. It proves that an approved release can still degrade in practice, generate operational signals, open review workflows, and trigger a stronger containment recommendation.
That is what makes the platform feel mature: it treats production trust as something that must be maintained, not assumed.
Control room in degraded live state
The approved release later entered a critical state after unsafe live behavior triggered incidents and changed the top containment action to rollback.
Unsafe behavior translated into operational action
The platform did not stop at red flags. It generated incidents, opened a P1 review queue item, and made the release operationally reviewable.
Packaged runtime and operational realism
The runtime extension strengthened the project, but the important part is how it was done. Infrastructure was added where it created real operational value, not just to make the stack sound bigger.
What the packaging added
- Docker and Docker Compose packaged the assistant, control room, and PostgreSQL-backed persistence into one runnable environment
- PostgreSQL was used where it added real value: telemetry storage, monitoring storage, and operational state persistence
- The governed retrieval path was deliberately kept intact rather than rewritten just to force infrastructure into the answer path
- That was a disciplined choice: the retrieval behavior was already explicit, benchmarked, and central to release-gate behavior
The assistant, monitoring layer, and persisted operational state could run together inside one packaged environment.
Why this project matters in the portfolio
This is one of the strongest projects in the portfolio because it shows a more mature AI question than “can I build it?” It asks whether the business can govern it, test it, trust it, and contain it once it is live.
This project is stronger than a normal GenAI case because it proves release discipline, operational monitoring, and rollback readiness rather than just answer generation.
It shows that I can work on the harder problem behind enterprise AI adoption: not only whether a system can answer, but whether the business can trust, govern, and contain it.
Next step
Want to explore more of the portfolio or get in touch?
You can go back to the selected projects section or contact me directly by email.