INFO 7375 · Final Project · Northeastern University

Clinical Literature Intelligence System

A production-grade clinical decision support system combining RAG, Reinforcement Learning, and LLM synthesis to help physicians find, evaluate, and apply peer-reviewed medical evidence.

UCB Bandit · REINFORCE · PubMed RAG · GRADE Evidence · Citation Grounding · RLHF Feedback · ICD-10 RAG · Groq Llama 3.3 70B · Hallucination Trap · Knowledge Graph · NNT Extractor
View on GitHub · See how it works

+4.01% · UCB over random
2.735 · Cohen's d (large)
72.6% · REINFORCE loss reduction
10/10 · Benchmark tests pass
47 · ICD-10 guideline sections
$0 · Total infrastructure cost

Six-stage AI pipeline

Every clinical question passes through a sequential pipeline — no stage is skipped, no data is simulated.

01 / CLASSIFY
Context detection
Question → one of 4 clinical contexts: drug efficacy, epidemiology, mechanism, treatment comparison
02 / BANDIT
UCB arm selection
Persistent SQLite UCB bandit selects the optimal PubMed query strategy from 5 arms
03 / RETRIEVE
Live PubMed query
NCBI E-utilities API — real articles only, no cached or simulated data ever shown
04 / GRADE
GRADE methodology
Groq Llama 3.3 70B grades each abstract: High / Moderate / Low / Very Low evidence
05 / RANK
REINFORCE policy
Trained policy network re-ranks articles by predicted clinical utility using 6 input features
06 / GROUND
Citation grounding
Every LLM sentence mapped to source passage via Jaccard overlap — unverified claims flagged
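
Stage 06 is the easiest to make concrete: a sentence counts as grounded only if its token overlap with some retrieved passage clears a threshold. A minimal sketch of the Jaccard check (the 0.3 threshold and bare-bones tokeniser are illustrative, not the app's exact values):

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a production version would also drop stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ground_sentence(sentence: str, passages: list[str], threshold: float = 0.3):
    """Map one LLM sentence to its best-matching source passage.

    Returns (passage_index, score) when the best overlap clears the
    threshold, or (None, score) so the caller can flag the claim.
    """
    if not passages:
        return None, 0.0
    sent = tokens(sentence)
    scores = [jaccard(sent, tokens(p)) for p in passages]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]
```

Anything below the threshold is surfaced as an unverified claim rather than silently kept.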
System architecture
Clinical question (free text / PICO) → Context classify (4 contexts, rule-based) → UCB bandit (5 query arms, SQLite persistence, RLHF updates) → PubMed retrieve (NCBI E-utils, real articles) → GRADE evidence (Groq LLM, 4-level grading) → REINFORCE rank (3-layer MLP, policy gradient, 6 features) → Grounded summary (citation verified). RLHF feedback loop: physician ratings update bandit rewards.
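
The front of that flow, context detection, is deliberately rule-based. A minimal sketch in the same spirit; the keyword lists are illustrative assumptions, not the repository's actual vocabulary:

```python
# Hypothetical keyword rules for the four clinical contexts; the real
# classifier's vocabulary may differ.
CONTEXT_KEYWORDS = {
    "treatment_comparison": ["versus", " vs ", "compared to", "better than"],
    "drug_efficacy": ["efficacy", "effective", "reduce", "response rate"],
    "epidemiology": ["prevalence", "incidence", "risk factor", "mortality rate"],
    "mechanism": ["mechanism", "pathway", "how does", "pathophysiology"],
}

def classify_context(question: str) -> str:
    """Return the first context whose keywords appear; default to drug_efficacy."""
    q = f" {question.lower()} "
    for context, keywords in CONTEXT_KEYWORDS.items():
        if any(k in q for k in keywords):
            return context
    return "drug_efficacy"

print(classify_context("Is semaglutide better than liraglutide for weight loss?"))
# -> treatment_comparison
```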

Five tabs, eight tools

Built for physicians, researchers, and clinical coders — not just as a demo.

Clinical search
Free text or PICO builder → full 6-stage pipeline → citation-grounded summary → RLHF feedback loop. Real PubMed only.
Tab 1
ICD-10 coding assistant
TF-IDF RAG over 47 CMS ICD-10-CM FY2024 guideline sections. Groq synthesises structured coding answers with sequencing rules.
Tab 2
Treatment comparison
Side-by-side evidence for two treatments using independent UCB bandit queries. Head-to-head verdict with GRADE winner.
Advanced Analysis
NNT / NNH extractor
Regex + Groq LLM extracts Number Needed to Treat (NNT), Number Needed to Harm (NNH), absolute risk reduction (ARR), and relative risk reduction (RRR) from abstracts. NNT is the most actionable statistic in EBM (see the sketch after this list).
Advanced Analysis
Structured extraction
Any abstract → 13-field JSON card: study design, sample size, effect size, limitations, funding. SQLite-cached, downloadable.
Advanced Analysis
Knowledge graph
Interactive Plotly network from PubMed MeSH metadata. Nodes = articles, edges = shared terms + co-authors. Any research field.
Advanced Analysis
Benchmark suite
10 automated test cases including TC10: the CARDIAC-PREVENT hallucination trap — a trial that does not exist. All 10 pass.
Tab 4
System analytics
Live bandit learning curve, RLHF feedback stats, per-context arm reward matrix. Full RL interpretability dashboard.
Tab 5
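
The NNT extractor above rests on a small piece of arithmetic: when an abstract reports an absolute risk reduction instead of an NNT, the two are interchangeable via NNT = 1/ARR. A hedged sketch of the regex path (patterns are illustrative; the real tool pairs broader regexes with a Groq LLM pass):

```python
import re

# Illustrative patterns only; the production extractor uses broader regexes.
NNT_PATTERN = re.compile(r"\bNNT\D{0,20}?(\d+(?:\.\d+)?)", re.IGNORECASE)
ARR_PATTERN = re.compile(
    r"absolute risk reduction\D{0,20}?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE
)

def extract_nnt(abstract: str):
    """Return the NNT if stated; otherwise derive it as 1/ARR when ARR is given."""
    if m := NNT_PATTERN.search(abstract):
        return float(m.group(1))
    if m := ARR_PATTERN.search(abstract):
        arr = float(m.group(1)) / 100   # percent -> proportion
        return round(1 / arr, 1)        # NNT = 1 / ARR
    return None

text = "Treatment cut events by an absolute risk reduction of 4.0% over 5 years."
print(extract_nnt(text))  # -> 25.0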
GRADE Evidence Hierarchy

Evidence quality is always visible

CLIS V2 never flattens the evidence pyramid. Every article shows its GRADE level. A physician can immediately see whether a recommendation comes from a systematic review of RCTs or a single expert opinion.

100% study design accuracy · Groq + rule-based fallback
GRADE A — HIGH · Systematic review · Meta-analysis
GRADE B — MODERATE · RCT · Well-designed trial
GRADE C — LOW · Cohort · Case-control · Observational
GRADE D — VERY LOW · Expert opinion · Case report
Higher = stronger evidence · Lower = weaker evidence
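
The rule-based fallback behind the 100% figure can be as small as a publication-type lookup that mirrors the pyramid above. A sketch under assumed keyword rules (the strings the app actually matches may differ):

```python
# Publication-type keywords -> GRADE level, ordered strongest first.
# Illustrative mapping; the app's Groq prompt and fallback rules may differ.
GRADE_RULES = [
    ("A — High", ["systematic review", "meta-analysis"]),
    ("B — Moderate", ["randomized controlled trial", "randomised", "rct"]),
    ("C — Low", ["cohort", "case-control", "observational", "cross-sectional"]),
    ("D — Very Low", ["case report", "expert opinion", "editorial"]),
]

def grade_study(publication_types: list[str]) -> str:
    """Return the strongest GRADE level matched by the article's publication types."""
    text = " ".join(publication_types).lower()
    for grade, keywords in GRADE_RULES:
        if any(k in text for k in keywords):
            return grade
    return "D — Very Low"  # conservative default when nothing matches

print(grade_study(["Randomized Controlled Trial", "Multicenter Study"]))  # B — Moderate
```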

Statistically validated performance

All results from 5-seed experiments. p-values computed via Welch's t-test.
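
The statistics are standard enough to reproduce in a few lines. A sketch of the Welch's t-test and Cohen's d computation over per-seed mean rewards, using SciPy; the arrays are placeholders, not the experiment's outputs:

```python
import numpy as np
from scipy import stats

# Placeholder per-seed mean rewards; substitute the notebooks' actual outputs.
ucb = np.array([0.581, 0.584, 0.580, 0.583, 0.582])
rnd = np.array([0.560, 0.558, 0.561, 0.559, 0.560])

# Welch's t-test: no equal-variance assumption between the two conditions.
t, p = stats.ttest_ind(ucb, rnd, equal_var=False)

# Cohen's d with the pooled standard deviation (equal group sizes).
pooled_sd = np.sqrt((ucb.var(ddof=1) + rnd.var(ddof=1)) / 2)
d = (ucb.mean() - rnd.mean()) / pooled_sd

print(f"t={t:.2f}  p={p:.4f}  Cohen's d={d:.2f}")
```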

+4.01% · UCB Bandit improvement over random
p=0.0048 · Cohen's d=2.735 (large effect) · 5 seeds × 200 rounds
100% · Arm identification accuracy
Optimal query strategy selected in all 4 contexts · all 5 seeds
72.6% · REINFORCE policy loss reduction
±5.7% across 5 seeds · 300 episodes each · PyTorch CPU
10/10 · Benchmark tests passing
3 critical tests + hallucination trap · works without Groq key
47 · ICD-10 guideline sections indexed
CMS ICD-10-CM Official Guidelines FY2024 · TF-IDF RAG · SQLite cache
8 · Production tools built
4,500+ lines across app.py + 8 tool modules
RL Performance

UCB Bandit vs random baseline

Tested across 5 seeds, 200 rounds each. The bandit learns domain-optimal query strategies through exploration, consistently outperforming random arm selection.

Mean reward: UCB bandit 0.5820 · random baseline 0.5596
[Learning curve: UCB reward climbs and converges above the flat random baseline over 200 rounds per seed · p = 0.0048 · Cohen's d = 2.735]
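
The selection rule behind that curve is classic UCB1: exploit the best empirical mean, padded by a confidence bonus that shrinks as an arm is played. A minimal in-memory sketch (arm names and the exploration constant are illustrative; the app persists counts and rewards in SQLite):

```python
import math

class UCBBandit:
    """UCB1 over query-strategy arms; illustrative, in-memory version."""

    def __init__(self, arms: list[str], c: float = 2.0):
        self.arms = arms
        self.c = c                          # exploration weight (illustrative)
        self.counts = {a: 0 for a in arms}
        self.totals = {a: 0.0 for a in arms}

    def select(self) -> str:
        # Play each arm once before trusting the confidence bounds.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.totals[a] / self.counts[a]
            + self.c * math.sqrt(math.log(t) / self.counts[a]),
        )

    def update(self, arm: str, reward: float) -> None:
        """Called after retrieval quality (or RLHF feedback) is scored."""
        self.counts[arm] += 1
        self.totals[arm] += reward

# Hypothetical arm names standing in for the five query strategies.
bandit = UCBBandit(["broad", "mesh_terms", "recent_rct", "review_filter", "combined"])
```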

Seven experiment notebooks

Each notebook is self-contained and reproducible. All results match the reported metrics.

NB1
Contextual UCB Bandit
Domain-informed priors, 5-seed validation, regret curves, arm identification accuracy
NB2
REINFORCE Policy Gradient
Policy network training, baseline variance reduction, episode reward curves (sketch after this list)
NB3
Full Pipeline Integration
End-to-end pipeline: bandit → PubMed → GRADE → REINFORCE → summary
NB3b
Live PubMed Validation
Real NCBI API calls, article retrieval confirmation, zero simulated data
NB4
Statistical Validation
Welch t-test, Cohen's d, power analysis, 5-seed cross-validation
NB5
GRADE Tool Evaluation
Study design classification, 6 study types, Groq + rule-based accuracy
NB6
Ablation Study
Component contribution: with/without bandit, REINFORCE, citation grounding
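
NB2's training loop is vanilla REINFORCE with a baseline for variance reduction. A compact PyTorch sketch under assumed shapes: 6 features per article as in stage 05 and 300 episodes as reported; the layer sizes and rewards are placeholders:

```python
import torch
import torch.nn as nn

# 6 article features (stage 05) -> scalar utility; layer sizes are illustrative.
policy = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline = 0.0  # running reward mean for variance reduction

for episode in range(300):
    features = torch.randn(8, 6)               # placeholder: 8 candidate articles
    scores = policy(features).squeeze(-1)      # one utility logit per article
    probs = torch.softmax(scores, dim=0)
    pick = torch.multinomial(probs, 1).item()  # sample an article to surface
    reward = torch.rand(()).item()             # placeholder clinical-utility reward

    baseline = 0.9 * baseline + 0.1 * reward
    loss = -torch.log(probs[pick]) * (reward - baseline)  # REINFORCE with baseline

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```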

Built entirely for $0

Every component uses free APIs, free tiers, or open source libraries.

LLM
Groq · Llama 3.3 70B
Free tier · <1s latency · rule-based fallback
Literature
PubMed NCBI API
E-utilities · free · 10 req/sec with key
RL Framework
PyTorch (CPU)
Custom UCB + REINFORCE implementation
Vector Store
TF-IDF + SQLite
Zero external deps · stdlib only (sketch below)
ICD-10 Data
CMS FY2024
47 guideline sections · official source
UI
Streamlit
Python-native · light theme via config.toml
Persistence
SQLite (stdlib)
Bandit state · ICD cache · struct cache
Visualisation
Plotly
Knowledge graph · learning curve charts
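
The "stdlib only" claim for the vector store is plausible because TF-IDF plus cosine similarity fits comfortably in pure Python. A minimal sketch of how such an index over non-empty guideline sections can be built (tokenisation and weighting are deliberately simplified):

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_index(sections: list[str]) -> list[dict[str, float]]:
    """One sparse TF-IDF vector (term -> weight) per guideline section."""
    tokenized = [tokenize(s) for s in sections]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n = len(sections)
    index = []
    for toks in tokenized:
        tf = Counter(toks)
        index.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return index

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A query would be tokenised the same way, weighted with the corpus IDF values, and scored against each of the 47 section vectors with cosine before the top sections go to Groq for synthesis.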

Safety by design

Clinical AI safety is engineering, not an afterthought. Every feature reflects a specific safety decision.

Hallucination prevention
Citation grounding maps every LLM sentence to a source passage. TC10 verifies the system refuses to fabricate results for non-existent trials.
Evidence transparency
GRADE methodology grades are always shown. Grade A (RCT/meta-analysis) is visually distinct from Grade D (expert opinion). No evidence pyramid flattening.
Real evidence only
CLIS V2 only displays articles retrieved live from PubMed. If PubMed returns nothing, the app shows an error — never simulated or fabricated citations.
Data privacy
No patient data is stored. Clinical queries are session-scoped. Only anonymised query text and bandit feedback persist to SQLite.
RL interpretability
The System Analytics tab shows exactly which query strategy the bandit selected, current reward estimates, and how RLHF feedback shifted the policy.
Decision support only
Every interface element explicitly labels CLIS V2 as decision support, not autonomous clinical decision-making. Physician review required.
Benchmark Suite

10/10 tests passing

Including TC10 — the hallucination trap. A query about the CARDIAC-PREVENT trial, which does not exist. The system must refuse to fabricate results.

01
Simple treatment query
Baseline · PASS
02
Conflicting evidence
Conflict detection · PASS
03
Evidence limitation
CKD population · PASS
04
Guideline recency
Aspirin update · PASS
05
Drug interaction
CRITICAL · PASS
06
Pediatric population
CRITICAL · PASS
07
Rare disease evidence
Evidence quality · PASS
08
ICD-10 coding
E11.22 + N18.x · PASS
09
Emerging evidence
GLP-1 / recency · PASS
10
Hallucination trap
CARDIAC-PREVENT · REFUSED
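
The trap works because refusal is structural, not prompted: with zero retrieved articles there is nothing to summarise. A hypothetical sketch of such a guard (function names and messages are illustrative, not the repository's API):

```python
def synthesize_or_refuse(question: str, articles: list[dict]) -> dict:
    """Refusal-first guard: with zero retrieved articles, no LLM call is made."""
    if not articles:
        return {"refused": True,
                "message": "No matching PubMed evidence found; refusing to answer."}
    return {"refused": False, "message": f"Summarizing {len(articles)} articles..."}

# TC10 in miniature: a fabricated trial retrieves nothing, so the guard fires.
result = synthesize_or_refuse("Outcomes of the CARDIAC-PREVENT trial?", [])
assert result["refused"]
```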

Project slides

10-slide deck covering the problem, solution, pipeline, RL results, benchmarks, ethics, and tech stack.

CLIS_RL_Technical_Report.pdf  ·  docs/

Explore the full system

All code, notebooks, trained models, and results are publicly available on GitHub.

View on GitHub · Read the docs