
Applied ML / Insurance Analytics / Human-in-the-Loop AI

Claims Risk Scoring & Triage Model

A claims triage model that moves from a useful-but-not-ready baseline to an enriched, explainable, controlled-pilot-ready review workflow.

Data Science / AI Risk Analytics / Human-in-the-Loop

Built a claims risk scoring workflow that first identifies when an intake-only model is not ready for deployment, then improves it through enriched operational signals, threshold optimization, explainability, watchlist design and controlled pilot governance.

Business tension

The business was not blocked by model accuracy alone. It was blocked by whether the review queue would be useful, manageable and safe enough for human reviewers.

Final operating point

Controlled pilot ready, not automated decision-making.

The optimized Gradient Boosting model meets the target operating criteria for a controlled human-in-the-loop rollout. The model prioritizes review; it does not approve, reject or settle claims.

ROC-AUC 0.907
Precision 75.1%
Recall 78.1%
Review queue 28.9%

Before vs after enrichment

The project became valuable when the first model was treated as a readiness signal, not a success story.

The first model could capture risk, but only by creating a broad and inefficient review queue. The enriched version improved both risk capture and operational efficiency.

Stage | Precision | Recall | Queue | Decision
Intake-only model | 50.2% | 72.0% | 39.8% | Not ready
Enriched optimized model | 75.1% | 78.1% | 28.9% | Controlled pilot ready

Model readiness journey

The important part is not just the final model. It is the decision process.

The first model was not forced into production. Instead, that result was treated as a readiness decision, and the project simulates the next realistic step: shadow-mode learning and stronger data signals.

Phase 1: Not deployment-ready

Intake-only model

The first model used only claim intake data. It had useful signal, but the review queue was too broad and not clean enough to justify operational deployment.

Precision 50.2%
Recall 72.0%
Queue 39.8%

Phase 2: Controlled pilot ready

Enriched optimized model

The second iteration enriched the feature layer with LLM-style text risk signals, document quality, provider history and early lifecycle indicators.

Precision 75.1%
Recall 78.1%
Queue 28.9%

Management decision

Do not force weak models into production. Enrich data, re-train, optimize thresholds, and deploy only as controlled decision support.
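As a minimal sketch of what "optimize thresholds" means here: sweep the precision-recall curve and pick the lowest score threshold that satisfies explicit operating targets, including a review-queue capacity limit. The target values below are illustrative defaults, not the project's actual criteria, which live in its readiness reports.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_review_threshold(y_true, scores,
                          min_precision=0.70, min_recall=0.75, max_queue=0.30):
    """Return the lowest score threshold meeting all operating targets, or None.

    Hypothetical targets: the real operating point (75.1% precision, 78.1%
    recall, 28.9% queue at score >= 0.29) was chosen the same way.
    """
    scores = np.asarray(scores)
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision[i] / recall[i] correspond to predicting positive at thresholds[i]
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        queue_share = (scores >= t).mean()  # fraction of claims sent to review
        if p >= min_precision and r >= min_recall and queue_share <= max_queue:
            return t
    return None  # no threshold satisfies the targets: the model is not ready
```

Returning `None` is the point: a model that cannot hit the targets at any threshold is a readiness signal, not a deployment candidate.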

Business problem

The claims team cannot review every claim with the same intensity.

Simple claims should move through the standard process quickly. Higher-risk claims should be identified earlier, before they become delayed, disputed, escalated or operationally expensive.

The practical question is not whether a model can classify claims. It is whether it can create a useful review queue without overwhelming human reviewers.

Deployment boundary

The model supports triage. It does not make final claim decisions.

The final system is approved only as a review-support layer for a governed human-in-the-loop pilot. Human reviewers retain final decision authority.

  • No automated approval.
  • No automated rejection.
  • No automatic payout decision.
  • No legal or fraud determination without review.

Final triage policy

A three-level operating model, not a binary prediction.

The final policy separates the main priority review queue from a small watchlist layer for near-threshold claims with strong business signals.

1. Priority human review: risk_score ≥ 0.29
   850 claims · 75.1% precision

2. Watchlist / light review: 0.25 ≤ score < 0.29 plus strong drivers
   50 claims · 9 extra high-risk captured

3. Standard processing: below watchlist criteria
   2,045 claims · standard workflow

Combined priority + watchlist view: 900 claims reviewed or monitored · 71.9% precision · 79.2% recall · 30.6% coverage
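The three-band routing rule can be sketched as a single function. The two thresholds (0.29 priority, 0.25 watchlist floor) are the operating points quoted above; `has_strong_driver` is a stand-in for the business-signal check, whose exact rules are not shown here.

```python
def triage(risk_score: float, has_strong_driver: bool = False) -> str:
    """Route a claim into one of the three review bands.

    `has_strong_driver` is a hypothetical flag for the "strong business
    signals" condition that gates the watchlist band.
    """
    if risk_score >= 0.29:
        return "priority_review"
    if 0.25 <= risk_score < 0.29 and has_strong_driver:
        return "watchlist"
    return "standard"
```

Note that a near-threshold claim without strong drivers falls through to standard processing; the watchlist is deliberately small.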

Technical workflow

From raw claims to pilot governance.

The project is structured as a reproducible ML pipeline, not a one-off notebook. Each step answers a business readiness question before moving closer to pilot deployment.

Synthetic claims data → Data validation → Leakage-safe model dataset → Baseline model → Threshold analysis → Data enrichment → Optimized Gradient Boosting → Case explanations → Watchlist policy → Monitoring plan

Core stack

Applied ML with business-facing controls.

The technical stack is intentionally practical: strong enough to show applied Data Science capability, but still realistic for a business-facing ML workflow.

Python · Pandas · Scikit-learn · Gradient Boosting · Feature Engineering · Threshold Analysis · LLM-style Text Signals · Model Monitoring · Human-in-the-Loop AI

Modelling: classification, thresholding, model comparison
Reliability: leakage control, monitoring, pilot pause criteria
Business use: review queue design, explanations, human oversight

What this project proves

The value is the full applied Data Science decision journey.

This is not positioned as a perfect model or an autonomous AI system. It is positioned as a realistic ML workflow where model performance, operational capacity and governance all matter.

  • Built an end-to-end ML triage workflow, not an isolated notebook.
  • Controlled for leakage by separating intake data from downstream claim outcomes.
  • Translated model scores into an operational review policy with capacity constraints.
  • Improved model readiness through richer business signals rather than model complexity alone.
  • Added case-level explanations so reviewers can inspect and challenge recommendations.
  • Defined monitoring, pause criteria and governance boundaries before pilot rollout.
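The leakage control above amounts to a hard column split: only fields known at claim intake may feed the model, and every downstream outcome is kept out of the feature frame. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

# Hypothetical column split; the project's actual schema is not shown here.
INTAKE_COLS = ["claim_amount", "claim_type", "days_to_report", "has_third_party"]
OUTCOME_COLS = ["was_disputed", "was_escalated", "settlement_days", "high_risk"]

def build_model_frame(raw, target="high_risk"):
    """Return (X, y) with every downstream outcome column kept out of X."""
    X = raw[INTAKE_COLS].copy()   # features: intake-time fields only
    y = raw[target]               # label: the one outcome we predict
    leaked = set(OUTCOME_COLS) & set(X.columns)
    assert not leaked, f"outcome columns leaked into features: {leaked}"
    return X, y
```

Making the split an explicit allowlist, rather than dropping known-bad columns, means a newly added outcome field cannot silently leak in.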

Explainability

Reviewers get reasons, not just scores.

The case-level explanation layer translates model outputs into reviewer-friendly business drivers: urgency, contradictions, legal language, missing documentation, third-party involvement, document quality and early lifecycle risk signals.

The explanations are not causal proof. They are operational context for a human reviewer during a controlled rollout.
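One way such a layer can work, sketched with hypothetical feature names and a SHAP-style attribution input: keep the top positive risk drivers and map them to plain-language reasons a reviewer can act on.

```python
# Hypothetical mapping from model features to reviewer-facing reasons; the
# real driver names come from the project's explainability report.
REASON_LABELS = {
    "text_urgency_score": "urgent or pressuring language in claim text",
    "text_contradiction_flag": "contradictory statements across documents",
    "legal_language_flag": "legal or threat language detected",
    "missing_docs_count": "missing documentation",
    "doc_quality_score": "low document quality",
    "provider_history_risk": "provider history risk",
}

def explain_case(contributions: dict, top_n: int = 3) -> list:
    """Return the top-N positive risk drivers as plain-language reasons.

    `contributions` is assumed to hold per-feature attributions (e.g.
    SHAP-style values); only drivers that push risk up are surfaced.
    """
    positive = [(f, v) for f, v in contributions.items() if v > 0]
    positive.sort(key=lambda fv: fv[1], reverse=True)
    return [REASON_LABELS.get(f, f) for f, _ in positive[:top_n]]
```

The output is deliberately a short list of reasons, not attribution numbers: reviewers challenge "missing documentation", not "+0.2 on feature 7".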

Monitoring

The pilot has explicit stop conditions.

Weekly monitoring tracks review queue volume, precision, recall, false negatives, score distribution, data quality, enrichment health and reviewer feedback.

The pilot should pause if precision falls below 55%, recall falls below 70%, enrichment failures exceed 5%, or the priority queue exceeds 40% for two consecutive weeks.
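Those stop conditions are simple enough to encode directly. A minimal sketch, assuming weekly metric dicts with illustrative key names (not the project's actual monitoring schema):

```python
def should_pause(weekly: list) -> bool:
    """Check the pilot's stop conditions against weekly metrics.

    `weekly` is a list of per-week metric dicts, most recent last; the key
    names here are assumptions for illustration.
    """
    latest = weekly[-1]
    if latest["precision"] < 0.55 or latest["recall"] < 0.70:
        return True
    if latest["enrichment_failure_rate"] > 0.05:
        return True
    # priority queue above 40% for two consecutive weeks
    if len(weekly) >= 2 and all(w["queue_share"] > 0.40 for w in weekly[-2:]):
        return True
    return False
```

Note the asymmetry: a precision or recall breach pauses immediately, while a queue overrun must persist for two weeks, so a single busy week does not stop the pilot.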

Repository evidence

Three files that document the decision.

The full repository includes datasets, scripts and reports. These are the three most important evidence files for understanding the final recommendation.

  • reports/final_model_readiness_report.md
  • reports/gradient_boosting_optimization_report.md
  • reports/case_level_explainability_report.md

Business conclusion

The strongest signal is not only the model performance. It is the full readiness journey.

This project demonstrates how to move from a weak-but-promising model to a governed review-support rollout: identify the limitation, enrich the data, re-train, optimize thresholds, add explanations, define monitoring and keep humans in control.

Next step

Want to see how this fits into the full portfolio?

Explore the rest of the applied analytics portfolio or reach out directly if you want to discuss the project, the modelling decisions, or the business reasoning behind the controlled pilot recommendation.