INFO 7375 · Final Project · Northeastern University

Clinical Literature Intelligence System

A production-grade clinical decision support system combining RAG, Reinforcement Learning, and LLM synthesis to help physicians find, evaluate, and apply peer-reviewed medical evidence.

UCB Bandit · REINFORCE · PubMed RAG · GRADE Evidence · Citation Grounding · RLHF Feedback · ICD-10 RAG · Groq Llama 3.3 70B · Hallucination Trap · Knowledge Graph · NNT Extractor
View on GitHub · See how it works

+4.01% · UCB over random
2.735 · Cohen's d (large)
72.6% · REINFORCE loss reduction
10/10 · Benchmark tests pass
47 · ICD-10 guideline sections
$0 · Total infrastructure cost

Six-stage AI pipeline

Every clinical question passes through a sequential pipeline — no stage is skipped, no data is simulated.

01 / CLASSIFY
Context detection
Question → one of 4 clinical contexts: drug efficacy, epidemiology, mechanism, treatment comparison
02 / BANDIT
UCB arm selection
Persistent SQLite UCB bandit selects the optimal PubMed query strategy from 5 arms
03 / RETRIEVE
Live PubMed query
NCBI E-utilities API — real articles only, no cached or simulated data ever shown
04 / GRADE
GRADE methodology
Groq Llama 3.3 70B grades each abstract: High / Moderate / Low / Very Low evidence
05 / RANK
REINFORCE policy
Trained policy network re-ranks articles by predicted clinical utility using 6 input features
06 / GROUND
Citation grounding
Every LLM sentence mapped to source passage via Jaccard overlap — unverified claims flagged
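
Stage 06 is the easiest to make concrete: a sentence counts as grounded only if its token overlap with some retrieved passage clears a threshold. A minimal sketch of the Jaccard check (the 0.3 threshold and bare-bones tokeniser are illustrative, not the app's exact values):

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a production version would also drop stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ground_sentence(sentence: str, passages: list[str], threshold: float = 0.3):
    """Map one LLM sentence to its best-matching source passage.

    Returns (passage_index, score) when the best overlap clears the
    threshold, or (None, score) so the caller can flag the claim.
    """
    if not passages:
        return None, 0.0
    sent = tokens(sentence)
    scores = [jaccard(sent, tokens(p)) for p in passages]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]
```

Anything below the threshold is surfaced as an unverified claim rather than silently kept.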
System architecture
Clinical question (free text / PICO) → Context classify (4 contexts, rule-based) → UCB bandit (5 query arms, SQLite persistence, RLHF updates) → PubMed retrieve (NCBI E-utils, real articles) → GRADE evidence (Groq LLM, 4-level grading) → REINFORCE rank (3-layer MLP, policy gradient, 6 features) → Grounded summary (citation verified). RLHF feedback loop: physician ratings update bandit rewards.
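
The front of that flow, context detection, is deliberately rule-based. A minimal sketch in the same spirit; the keyword lists are illustrative assumptions, not the repository's actual vocabulary:

```python
# Hypothetical keyword rules for the four clinical contexts; the real
# classifier's vocabulary may differ.
CONTEXT_KEYWORDS = {
    "treatment_comparison": ["versus", " vs ", "compared to", "better than"],
    "drug_efficacy": ["efficacy", "effective", "reduce", "response rate"],
    "epidemiology": ["prevalence", "incidence", "risk factor", "mortality rate"],
    "mechanism": ["mechanism", "pathway", "how does", "pathophysiology"],
}

def classify_context(question: str) -> str:
    """Return the first context whose keywords appear; default to drug_efficacy."""
    q = f" {question.lower()} "
    for context, keywords in CONTEXT_KEYWORDS.items():
        if any(k in q for k in keywords):
            return context
    return "drug_efficacy"

print(classify_context("Is semaglutide better than liraglutide for weight loss?"))
# -> treatment_comparison
```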

Five tabs, eight tools

Built for physicians, researchers, and clinical coders — not just as a demo.

Clinical search
Free text or PICO builder → full 6-stage pipeline → citation-grounded summary → RLHF feedback loop. Real PubMed only.
Tab 1
ICD-10 coding assistant
TF-IDF RAG over 47 CMS ICD-10-CM FY2024 guideline sections. Groq synthesises structured coding answers with sequencing rules.
Tab 2
Treatment comparison
Side-by-side evidence for two treatments using independent UCB bandit queries. Head-to-head verdict with GRADE winner.
Advanced Analysis
NNT / NNH extractor
Regex + Groq LLM extracts Number Needed to Treat (NNT), Number Needed to Harm (NNH), absolute risk reduction (ARR), and relative risk reduction (RRR) from abstracts. NNT is the most actionable statistic in EBM (see the sketch after this list).
Advanced Analysis
Structured extraction
Any abstract → 13-field JSON card: study design, sample size, effect size, limitations, funding. SQLite-cached, downloadable.
Advanced Analysis
Knowledge graph
Interactive Plotly network from PubMed MeSH metadata. Nodes = articles, edges = shared terms + co-authors. Any research field.
Advanced Analysis
Benchmark suite
10 automated test cases including TC10: the CARDIAC-PREVENT hallucination trap — a trial that does not exist. All 10 pass.
Tab 4
System analytics
Live bandit learning curve, RLHF feedback stats, per-context arm reward matrix. Full RL interpretability dashboard.
Tab 5
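
The NNT extractor above rests on a small piece of arithmetic: when an abstract reports an absolute risk reduction instead of an NNT, the two are interchangeable via NNT = 1/ARR. A hedged sketch of the regex path (patterns are illustrative; the real tool pairs broader regexes with a Groq LLM pass):

```python
import re

# Illustrative patterns only; the production extractor uses broader regexes.
NNT_PATTERN = re.compile(r"\bNNT\D{0,20}?(\d+(?:\.\d+)?)", re.IGNORECASE)
ARR_PATTERN = re.compile(
    r"absolute risk reduction\D{0,20}?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE
)

def extract_nnt(abstract: str):
    """Return the NNT if stated; otherwise derive it as 1/ARR when ARR is given."""
    if m := NNT_PATTERN.search(abstract):
        return float(m.group(1))
    if m := ARR_PATTERN.search(abstract):
        arr = float(m.group(1)) / 100   # percent -> proportion
        return round(1 / arr, 1)        # NNT = 1 / ARR
    return None

text = "Treatment cut events by an absolute risk reduction of 4.0% over 5 years."
print(extract_nnt(text))  # -> 25.0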
GRADE Evidence Hierarchy

Evidence quality is always visible

CLIS V2 never flattens the evidence pyramid. Every article shows its GRADE level. A physician can immediately see whether a recommendation comes from a systematic review of RCTs or a single expert opinion.

100% study design accuracy · Groq + rule-based fallback
GRADE A — HIGH · Systematic review · Meta-analysis
GRADE B — MODERATE · RCT · Well-designed trial
GRADE C — LOW · Cohort · Case-control · Observational
GRADE D — VERY LOW · Expert opinion · Case report
Higher = stronger evidence · Lower = weaker evidence
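
The rule-based fallback behind the 100% figure can be as small as a publication-type lookup that mirrors the pyramid above. A sketch under assumed keyword rules (the strings the app actually matches may differ):

```python
# Publication-type keywords -> GRADE level, ordered strongest first.
# Illustrative mapping; the app's Groq prompt and fallback rules may differ.
GRADE_RULES = [
    ("A — High", ["systematic review", "meta-analysis"]),
    ("B — Moderate", ["randomized controlled trial", "randomised", "rct"]),
    ("C — Low", ["cohort", "case-control", "observational", "cross-sectional"]),
    ("D — Very Low", ["case report", "expert opinion", "editorial"]),
]

def grade_study(publication_types: list[str]) -> str:
    """Return the strongest GRADE level matched by the article's publication types."""
    text = " ".join(publication_types).lower()
    for grade, keywords in GRADE_RULES:
        if any(k in text for k in keywords):
            return grade
    return "D — Very Low"  # conservative default when nothing matches

print(grade_study(["Randomized Controlled Trial", "Multicenter Study"]))  # B — Moderate
```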

Statistically validated performance

All results from 5-seed experiments. p-values computed via Welch's t-test.
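
The statistics are standard enough to reproduce in a few lines. A sketch of the Welch's t-test and Cohen's d computation over per-seed mean rewards, using SciPy; the arrays are placeholders, not the experiment's outputs:

```python
import numpy as np
from scipy import stats

# Placeholder per-seed mean rewards; substitute the notebooks' actual outputs.
ucb = np.array([0.581, 0.584, 0.580, 0.583, 0.582])
rnd = np.array([0.560, 0.558, 0.561, 0.559, 0.560])

# Welch's t-test: no equal-variance assumption between the two conditions.
t, p = stats.ttest_ind(ucb, rnd, equal_var=False)

# Cohen's d with the pooled standard deviation (equal group sizes).
pooled_sd = np.sqrt((ucb.var(ddof=1) + rnd.var(ddof=1)) / 2)
d = (ucb.mean() - rnd.mean()) / pooled_sd

print(f"t={t:.2f}  p={p:.4f}  Cohen's d={d:.2f}")
```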

+4.01% · UCB Bandit improvement over random
p=0.0048 · Cohen's d=2.735 (large effect) · 5 seeds × 200 rounds
100% · Arm identification accuracy
Optimal query strategy selected in all 4 contexts · all 5 seeds
72.6% · REINFORCE policy loss reduction
±5.7% across 5 seeds · 300 episodes each · PyTorch CPU
10/10 · Benchmark tests passing
3 critical tests + hallucination trap · works without Groq key
47 · ICD-10 guideline sections indexed
CMS ICD-10-CM Official Guidelines FY2024 · TF-IDF RAG · SQLite cache
8 · Production tools built
4,500+ lines across app.py + 8 tool modules
RL Performance

UCB Bandit vs random baseline

Tested across 5 seeds, 200 rounds each. The bandit learns domain-optimal query strategies through exploration, consistently outperforming random arm selection.

Mean reward: UCB bandit 0.5820 · random baseline 0.5596
[Learning curve: UCB reward climbs and converges above the flat random baseline over 200 rounds per seed · p = 0.0048 · Cohen's d = 2.735]
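
The selection rule behind that curve is classic UCB1: exploit the best empirical mean, padded by a confidence bonus that shrinks as an arm is played. A minimal in-memory sketch (arm names and the exploration constant are illustrative; the app persists counts and rewards in SQLite):

```python
import math

class UCBBandit:
    """UCB1 over query-strategy arms; illustrative, in-memory version."""

    def __init__(self, arms: list[str], c: float = 2.0):
        self.arms = arms
        self.c = c                          # exploration weight (illustrative)
        self.counts = {a: 0 for a in arms}
        self.totals = {a: 0.0 for a in arms}

    def select(self) -> str:
        # Play each arm once before trusting the confidence bounds.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.totals[a] / self.counts[a]
            + self.c * math.sqrt(math.log(t) / self.counts[a]),
        )

    def update(self, arm: str, reward: float) -> None:
        """Called after retrieval quality (or RLHF feedback) is scored."""
        self.counts[arm] += 1
        self.totals[arm] += reward

# Hypothetical arm names standing in for the five query strategies.
bandit = UCBBandit(["broad", "mesh_terms", "recent_rct", "review_filter", "combined"])
```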

Seven experiment notebooks

Each notebook is self-contained and reproducible. All results match the reported metrics.

NB1
Contextual UCB Bandit
Domain-informed priors, 5-seed validation, regret curves, arm identification accuracy
NB2
REINFORCE Policy Gradient
Policy network training, baseline variance reduction, episode reward curves (sketch after this list)
NB3
Full Pipeline Integration
End-to-end pipeline: bandit → PubMed → GRADE → REINFORCE → summary
NB3b
Live PubMed Validation
Real NCBI API calls, article retrieval confirmation, zero simulated data
NB4
Statistical Validation
Welch t-test, Cohen's d, power analysis, 5-seed cross-validation
NB5
GRADE Tool Evaluation
Study design classification, 6 study types, Groq + rule-based accuracy
NB6
Ablation Study
Component contribution: with/without bandit, REINFORCE, citation grounding
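
NB2's training loop is vanilla REINFORCE with a baseline for variance reduction. A compact PyTorch sketch under assumed shapes: 6 features per article as in stage 05 and 300 episodes as reported; the layer sizes and rewards are placeholders:

```python
import torch
import torch.nn as nn

# 6 article features (stage 05) -> scalar utility; layer sizes are illustrative.
policy = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline = 0.0  # running reward mean for variance reduction

for episode in range(300):
    features = torch.randn(8, 6)               # placeholder: 8 candidate articles
    scores = policy(features).squeeze(-1)      # one utility logit per article
    probs = torch.softmax(scores, dim=0)
    pick = torch.multinomial(probs, 1).item()  # sample an article to surface
    reward = torch.rand(()).item()             # placeholder clinical-utility reward

    baseline = 0.9 * baseline + 0.1 * reward
    loss = -torch.log(probs[pick]) * (reward - baseline)  # REINFORCE with baseline

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```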

Built entirely for $0

Every component uses free APIs, free tiers, or open source libraries.

LLM
Groq · Llama 3.3 70B
Free tier · <1s latency · rule-based fallback
Literature
PubMed NCBI API
E-utilities · free · 10 req/sec with key
RL Framework
PyTorch (CPU)
Custom UCB + REINFORCE implementation
Vector Store
TF-IDF + SQLite
Zero external deps · stdlib only (sketch below)
ICD-10 Data
CMS FY2024
47 guideline sections · official source
UI
Streamlit
Python-native · light theme via config.toml
Persistence
SQLite (stdlib)
Bandit state · ICD cache · struct cache
Visualisation
Plotly
Knowledge graph · learning curve charts
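
The "stdlib only" claim for the vector store is plausible because TF-IDF plus cosine similarity fits comfortably in pure Python. A minimal sketch of how such an index over non-empty guideline sections can be built (tokenisation and weighting are deliberately simplified):

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_index(sections: list[str]) -> list[dict[str, float]]:
    """One sparse TF-IDF vector (term -> weight) per guideline section."""
    tokenized = [tokenize(s) for s in sections]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n = len(sections)
    index = []
    for toks in tokenized:
        tf = Counter(toks)
        index.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return index

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A query would be tokenised the same way, weighted with the corpus IDF values, and scored against each of the 47 section vectors with cosine before the top sections go to Groq for synthesis.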

Safety by design

Clinical AI safety is engineering, not an afterthought. Every feature reflects a specific safety decision.

Hallucination prevention
Citation grounding maps every LLM sentence to a source passage. TC10 verifies the system refuses to fabricate results for non-existent trials.
Evidence transparency
GRADE methodology grades are always shown. Grade A (RCT/meta-analysis) is visually distinct from Grade D (expert opinion). No evidence pyramid flattening.
Real evidence only
CLIS V2 only displays articles retrieved live from PubMed. If PubMed returns nothing, the app shows an error — never simulated or fabricated citations.
Data privacy
No patient data is stored. Clinical queries are session-scoped. Only anonymised query text and bandit feedback persist to SQLite.
RL interpretability
The System Analytics tab shows exactly which query strategy the bandit selected, current reward estimates, and how RLHF feedback shifted the policy.
Decision support only
Every interface element explicitly labels CLIS V2 as decision support, not autonomous clinical decision-making. Physician review required.
Benchmark Suite

10/10 tests passing

Including TC10 — the hallucination trap. A query about the CARDIAC-PREVENT trial, which does not exist. The system must refuse to fabricate results.

01
Simple treatment query
Baseline · PASS
02
Conflicting evidence
Conflict detection · PASS
03
Evidence limitation
CKD population · PASS
04
Guideline recency
Aspirin update · PASS
05
Drug interaction
CRITICAL · PASS
06
Pediatric population
CRITICAL · PASS
07
Rare disease evidence
Evidence quality · PASS
08
ICD-10 coding
E11.22 + N18.x · PASS
09
Emerging evidence
GLP-1 / recency · PASS
10
Hallucination trap
CARDIAC-PREVENT · REFUSED
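
The trap works because refusal is structural, not prompted: with zero retrieved articles there is nothing to summarise. A hypothetical sketch of such a guard (function names and messages are illustrative, not the repository's API):

```python
def synthesize_or_refuse(question: str, articles: list[dict]) -> dict:
    """Refusal-first guard: with zero retrieved articles, no LLM call is made."""
    if not articles:
        return {"refused": True,
                "message": "No matching PubMed evidence found; refusing to answer."}
    return {"refused": False, "message": f"Summarizing {len(articles)} articles..."}

# TC10 in miniature: a fabricated trial retrieves nothing, so the guard fires.
result = synthesize_or_refuse("Outcomes of the CARDIAC-PREVENT trial?", [])
assert result["refused"]
```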

Project slides

10-slide deck covering the problem, solution, pipeline, RL results, benchmarks, ethics, and tech stack.

CLIS_RL_Technical_Report.pdf  ·  docs/

Explore the full system

All code, notebooks, trained models, and results are publicly available on GitHub.

View on GitHub · Read the docs