5 A Single AMR Prediction

This notebook walks through the simplest case: predicting antimicrobial resistance for one species and one antibiotic using data from a single hospital site and year.

We call the pipeline functions directly (no caching) so each step is transparent.

(ns amr-book.single-prediction
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Choosing a scenario

We’ll predict Cefepime resistance in E. coli from DRIAMS-A 2018 — one of the cases reported by Weis et al.:

(def params
  {:site :A
   :year 2018
   :species bacteria/E-coli
   :antibiotic :Cefepime})

Stage 1 — Prepare raw data

Load DRIAMS metadata, join with available spectra, and filter to the chosen species/antibiotic. The result has a :ri column (true = resistant or intermediate) and a :path column pointing to each raw spectrum file.

(def raw-data
  (learning/prepare-raw-data params))

(tc/row-count raw-data)

(tc/select-columns raw-data [:code :species :ri :path])

raw-data [Escherichia coli / Cefepime / A / 2018] [1400 4]:

Stage 2 — Prepare ML data

This is the expensive step. For each spectrum:

1. Load the raw .txt.gz file
2. Preprocess (sqrt, smooth, baseline, normalize) via Ripple
3. Bin to 6,000 features (3 Da bins over [2000, 20000] Da)

The result is a dataset with columns :ri, :x0, :x1, … :x5999.

(def ml-params
  {:preprocessing-params {:smooth-window 21 :smooth-polynomial 3}
   :binning-params {:range [2000 20000] :step 3}})

(def ml-data
  (learning/prepare-ml-data raw-data ml-params))

{:rows (tc/row-count ml-data)
 :features (dec (count (tc/column-names ml-data)))}

{:rows 1400, :features 6000}
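The binning arithmetic can be sketched in plain Clojure. This is only an illustration under stated assumptions, not the pipeline's implementation: the names bin-index and bin-spectrum are hypothetical, the range is treated as half-open, and summing intensities within a bin is an assumed aggregation (the real pipeline may aggregate differently). It shows why a step of 3 Da over [2000, 20000] yields 6,000 features.

```clojure
;; Hypothetical helpers; only the :range and :step values come from ml-params above.
(def binning-params {:range [2000 20000] :step 3})

(defn bin-index
  "Index of the bin containing m/z value `mz`, or nil when mz is outside [lo, hi)."
  [mz {[lo hi] :range step :step}]
  (when (and (>= mz lo) (< mz hi))
    (long (quot (- mz lo) step))))

(defn bin-spectrum
  "Sums intensities of [mz intensity] peaks into fixed-width bins.
  Summation is an assumption made for this sketch."
  [peaks {[lo hi] :range step :step :as params}]
  (let [n-bins (long (Math/ceil (/ (- hi lo) (double step))))]
    (reduce (fn [bins [mz intensity]]
              (if-let [i (bin-index mz params)]
                (update bins i + intensity)
                bins))
            (vec (repeat n-bins 0.0))
            peaks)))

;; Three peaks fall inside the range, one falls outside and is ignored.
(def binned
  (bin-spectrum [[2000.5 1.0] [2001.0 2.0] [2003.2 4.0] [25000.0 9.0]]
                binning-params))

(count binned)      ; number of features: (20000 - 2000) / 3
(subvec binned 0 2) ; the first two bins
```

Each position i of the resulting vector would correspond to a feature column such as :x0 or :x1 in the dataset described above.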

Stage 3 — Train/test split

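We use a holdout split with a fixed random seed for reproducibility. The split requires at least 100 samples and returns nil otherwise. In the output below, about two-thirds of the rows go to training and the rest to testing.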

(def split-data
  (learning/split ml-data {:seed 1}))

{:train (tc/row-count (:train split-data))
 :test (tc/row-count (:test split-data))}

{:train 933, :test 467}
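A seeded holdout split can be sketched in plain Clojure as follows. This is not the implementation of learning/split: the function name holdout-split is hypothetical, the 2/3 training fraction is inferred from the 933/467 counts above, and the shuffling method is an assumption, so the actual row assignment will differ from the library's.

```clojure
(defn holdout-split
  "Shuffles rows with a seeded RNG and splits them into :train and :test.
  Returns nil when there are fewer than `min-samples` rows.
  The defaults (2/3 and 100) are assumptions taken from the text and output above."
  [rows {:keys [seed train-fraction min-samples]
         :or {train-fraction 2/3 min-samples 100}}]
  (when (>= (count rows) min-samples)
    (let [shuffled (java.util.ArrayList. ^java.util.Collection (vec rows))
          ;; The same seed always yields the same permutation.
          _ (java.util.Collections/shuffle shuffled (java.util.Random. (long seed)))
          n-train (long (* train-fraction (count rows)))]
      {:train (vec (take n-train shuffled))
       :test (vec (drop n-train shuffled))})))

;; 1400 stand-in rows (plain integers here instead of dataset rows).
(def demo-split (holdout-split (range 1400) {:seed 1}))

{:train (count (:train demo-split))
 :test (count (:test demo-split))}
```

Fixing the seed matters because the metrics computed later depend on which rows land in the test set; with the same seed, the same split is produced on every run.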

Stage 4 — Train

Train an XGBoost binary classifier on the training set:

(def model
  (learning/train split-data
                  {:model-type :xgboost/classification
                   :round 50
                   :num-class 2}))

Stage 5 — Predict

Generate probability predictions on the test set. The result has three columns: 0 (probability of susceptible), 1 (probability of resistant/intermediate), and :ri (the true label):

(def predictions
  (learning/predict split-data model))

predictions

predictions [467 3]:

0	1	:ri
0.05187543	0.94812459	true
0.99304366	0.00695636	false
0.96061611	0.03938394	false
0.97968948	0.02031045	false
0.99701047	0.00298953	false
0.99649519	0.00350485	false
0.99945921	0.00054080	false
0.99365848	0.00634146	false
0.93335623	0.06664377	false
0.99423039	0.00576961	false
…	…	…
0.98908663	0.01091336	false
0.99442768	0.00557238	false
0.99780887	0.00219112	false
0.99719256	0.00280746	false
0.99448317	0.00551683	false
0.99723178	0.00276819	false
0.99872690	0.00127309	false
0.99666721	0.00333284	false
0.50909549	0.49090451	false
0.99798101	0.00201905	false
0.97578853	0.02421154	false
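To turn these probabilities into class decisions, one can apply a threshold to column 1 and compare against :ri. The sketch below uses three rows copied from the table above, represented as plain maps. The helper names and the map representation are hypothetical, and the 0.5 threshold is an illustrative choice, not something the pipeline prescribes.

```clojure
;; Three rows copied from the predictions table above, as plain maps.
(def sample-predictions
  [{0 0.05187543 1 0.94812459 :ri true}
   {0 0.99304366 1 0.00695636 :ri false}
   {0 0.50909549 1 0.49090451 :ri false}])

(defn predicted-resistant?
  "True when the probability in column 1 reaches the threshold."
  [row threshold]
  (>= (get row 1) threshold))

(defn confusion-counts
  "Counts true/false positives and negatives at the given threshold."
  [rows threshold]
  (frequencies
   (map (fn [row]
          (let [predicted (predicted-resistant? row threshold)
                actual (:ri row)]
            (cond (and predicted actual) :tp
                  predicted :fp
                  actual :fn
                  :else :tn)))
        rows)))

(confusion-counts sample-predictions 0.5) ; => {:tp 1, :tn 2}
(confusion-counts sample-predictions 0.4) ; => {:tp 1, :tn 1, :fp 1}
```

Lowering the threshold from 0.5 to 0.4 flips the borderline third row into a false positive, which is why threshold-free metrics such as ROCAUC and PRAUC are useful for comparing models.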

Stage 6 — Measure

Evaluate with ROCAUC (area under the ROC curve) and PRAUC (area under the precision-recall curve):

(def metrics
  (learning/measure split-data predictions))

metrics

{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7233731598301811,
 :ROCAUC 0.8683092085001545}

Reading the results

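ROCAUC of 0.5 corresponds to random ranking and 1.0 to perfect separation; the value of about 0.87 here means the model ranks a resistant case above a susceptible one about 87% of the time.
PRAUC is especially informative when resistance is rare (low :pri), because its chance-level baseline equals the prevalence rather than 0.5; here 0.72 compares against a baseline of about 0.18.
:pri is the prevalence of resistance (resistant or intermediate) in the test set, here 85 of 467 samples.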
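These three quantities can be made concrete with a small sketch in plain Clojure. It is not the implementation behind learning/measure: the function names are hypothetical, ROCAUC is computed by brute-force pairwise comparison, and average precision is used as one common estimator of the area under the precision-recall curve (the library may use a different one). The four scored examples are made up for illustration.

```clojure
;; Each example is [score label], where label is true for resistant/intermediate.
(def toy-scored
  [[0.9 true] [0.8 false] [0.7 true] [0.6 false]])

(defn prevalence
  "Fraction of examples whose label is true."
  [scored]
  (/ (count (filter second scored)) (count scored)))

(defn roc-auc
  "Probability that a random positive outscores a random negative (ties count 0.5).
  Assumes both classes are present."
  [scored]
  (let [pos (keep (fn [[score label]] (when label score)) scored)
        neg (keep (fn [[score label]] (when-not label score)) scored)
        wins (for [p pos, n neg]
               (cond (> p n) 1.0
                     (== p n) 0.5
                     :else 0.0))]
    (/ (reduce + wins) (* (count pos) (count neg)))))

(defn average-precision
  "Mean of the precision values at the rank of each positive, by descending score.
  Assumes at least one positive; tied scores are not treated specially."
  [scored]
  (let [labels (map second (sort-by first > scored))
        hits (reductions + (map #(if % 1 0) labels)) ; cumulative positives
        precisions (map (fn [label hit k] (when label (/ hit k)))
                        labels hits (iterate inc 1))
        at-positives (remove nil? precisions)]
    (double (/ (reduce + at-positives) (count at-positives)))))

(prevalence toy-scored)        ; => 1/2
(roc-auc toy-scored)           ; => 0.75
(average-precision toy-scored) ; 5/6 as a double
```

In the toy data, three of the four positive-negative pairs are ranked correctly, giving a ROCAUC of 0.75, while the two positives sit at ranks 1 and 3, giving precisions of 1 and 2/3.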

Summary

The six stages form a linear pipeline:

params → prepare-raw-data → prepare-ml-data → split → train → predict → measure

Each function takes the output of the previous stage plus its own configuration. Wrapping each stage with Pocket’s caching avoids recomputing expensive steps.
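The idea behind caching a stage can be illustrated with clojure.core/memoize. This is a stand-in only: it is not Pocket's API, it caches in memory for the current session, and expensive-stage is a hypothetical placeholder for a step such as prepare-ml-data.

```clojure
;; Counts how many times the underlying computation actually runs.
(def calls (atom 0))

(defn expensive-stage
  "Hypothetical stand-in for an expensive pipeline stage."
  [raw params]
  (swap! calls inc)
  (mapv #(* % (:scale params)) raw))

;; memoize keys the cache on the argument values.
(def cached-stage (memoize expensive-stage))

(def first-result (cached-stage [1 2 3] {:scale 2}))  ; computed
(def second-result (cached-stage [1 2 3] {:scale 2})) ; served from the cache

@calls ; => 1
```

Because each stage is a function of the previous output plus its own configuration, the arguments fully determine the result, which is what makes this kind of caching safe.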

source: notebooks/amr_book/single_prediction.clj

Stage 1 output of (tc/select-columns raw-data [:code :species :ri :path]):

:code	:species	:ri	:path
69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz
2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1.txt.gz
51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1.txt.gz
11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1.txt.gz
d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1	Escherichia coli	true	data/DRIAMS/DRIAMS-A/raw/2018/d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1.txt.gz
99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1.txt.gz
b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1.txt.gz
51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1.txt.gz
ad628afc-fc57-4084-ba92-90c18d938319_MALDI1	Escherichia coli	true	data/DRIAMS/DRIAMS-A/raw/2018/ad628afc-fc57-4084-ba92-90c18d938319_MALDI1.txt.gz
5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1.txt.gz
…	…	…	…
525421d5-7c97-4523-9d82-5be7b312955c_MALDI2	Escherichia coli	true	data/DRIAMS/DRIAMS-A/raw/2018/525421d5-7c97-4523-9d82-5be7b312955c_MALDI2.txt.gz
1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2	Escherichia coli	true	data/DRIAMS/DRIAMS-A/raw/2018/1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2.txt.gz
0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2.txt.gz
9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2.txt.gz
450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2.txt.gz
fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2	Escherichia coli	true	data/DRIAMS/DRIAMS-A/raw/2018/fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2.txt.gz
81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2.txt.gz
cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2.txt.gz
013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2.txt.gz
05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2	Escherichia coli	false	data/DRIAMS/DRIAMS-A/raw/2018/05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2.txt.gz
6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2	Escherichia coli	true	data/DRIAMS/DRIAMS-A/raw/2018/6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2.txt.gz