5  A Single AMR Prediction

This notebook walks through the simplest case: predicting antimicrobial resistance for one species and one antibiotic using data from a single hospital site and year.

We call the pipeline functions directly (no caching) so each step is transparent.

(ns amr-book.single-prediction
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Choosing a scenario

We’ll predict Cefepime resistance in E. coli from DRIAMS-A 2018 — one of the cases reported by Weis et al.:

(def params
  {:site :A
   :year 2018
   :species bacteria/E-coli
   :antibiotic :Cefepime})

Stage 1 — Prepare raw data

Load DRIAMS metadata, join with available spectra, and filter to the chosen species/antibiotic. The result has a :ri column (true = resistant or intermediate) and a :path column pointing to each raw spectrum file.

(def raw-data
  (learning/prepare-raw-data params))
(tc/row-count raw-data)
1400
(tc/select-columns raw-data [:code :species :ri :path])

raw-data [Escherichia coli / Cefepime / A / 2018] [1400 4]:

| :code | :species | :ri | :path |
|---|---|---|---|
| 69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz |
| 2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1.txt.gz |
| 51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1.txt.gz |
| 11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1.txt.gz |
| d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1.txt.gz |
| 99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1.txt.gz |
| b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1.txt.gz |
| 51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1.txt.gz |
| ad628afc-fc57-4084-ba92-90c18d938319_MALDI1 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/ad628afc-fc57-4084-ba92-90c18d938319_MALDI1.txt.gz |
| 5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1.txt.gz |
| 525421d5-7c97-4523-9d82-5be7b312955c_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/525421d5-7c97-4523-9d82-5be7b312955c_MALDI2.txt.gz |
| 1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2.txt.gz |
| 0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2.txt.gz |
| 9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2.txt.gz |
| 450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2.txt.gz |
| fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2.txt.gz |
| 81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2.txt.gz |
| cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2.txt.gz |
| 013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2.txt.gz |
| 05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2.txt.gz |
| 6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2.txt.gz |

Stage 2 — Prepare ML data

This is the expensive step. For each spectrum:

  1. Load the raw .txt.gz file.

  2. Preprocess (sqrt, smooth, baseline, normalize) via Ripple.

  3. Bin to 6,000 features (3 Da bins over [2000, 20000] Da).

The result is a dataset with columns :ri, :x0, :x1, … :x5999.

(def ml-params
  {:preprocessing-params {:smooth-window 21 :smooth-polynomial 3}
   :binning-params {:range [2000 20000] :step 3}})
(def ml-data
  (learning/prepare-ml-data raw-data ml-params))
{:rows (tc/row-count ml-data)
 :features (dec (count (tc/column-names ml-data)))}
{:rows 1400, :features 6000}
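
To make the binning arithmetic concrete, here is a minimal sketch in plain Clojure. It is not the pipeline's implementation: the function names bin-index and bin-spectrum are hypothetical, and both the half-open bins [lo, hi) and the choice to sum intensities within a bin are assumptions made for illustration. It only shows why a step of 3 Da over [2000, 20000] Da yields 6,000 features.

```clojure
;; Hypothetical helpers, not part of scicloj.amr.learning.
(def binning-params {:range [2000 20000] :step 3})

(defn bin-index
  "Index of the bin containing m/z value `mz`, or nil when `mz` falls
  outside the half-open range [lo, hi). Half-open bins are an assumption."
  [{[lo hi] :range step :step} mz]
  (when (and (>= mz lo) (< mz hi))
    (long (quot (- mz lo) step))))

(defn bin-spectrum
  "Sums the intensities of [mz intensity] pairs per bin and returns a
  vector with one entry per bin. Summing is an assumption of this sketch."
  [{[lo hi] :range step :step :as params} peaks]
  (let [n-bins (long (quot (- hi lo) step))]
    (reduce (fn [acc [mz intensity]]
              (if-let [i (bin-index params mz)]
                (update acc i + intensity)
                acc))
            (vec (repeat n-bins 0.0))
            peaks)))

;; (20000 - 2000) / 3 = 6000 bins, matching the :features count above.
(def example-features
  (bin-spectrum binning-params
                [[2000.0 1.0] [2002.9 2.0] [2003.0 4.0]
                 [19999.9 8.0] [20000.0 16.0]]))
```

In this sketch the first two peaks share bin 0, the third lands in bin 1, and the peak at exactly 20000 Da is dropped because the upper bound is treated as exclusive.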

Stage 3 — Train/test split

The data is divided with a holdout split that uses a fixed random seed, so reruns produce the same partition. learning/split requires at least 100 samples and returns nil otherwise.

(def split-data
  (learning/split ml-data {:seed 1}))
{:train (tc/row-count (:train split-data))
 :test (tc/row-count (:test split-data))}
{:train 933, :test 467}
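
The idea of a seeded holdout split can be sketched in plain Clojure as follows. This is an illustration, not the body of learning/split: the function name holdout-split is hypothetical, and the two-thirds training fraction is an assumption inferred from the 933/467 counts above. The shuffle itself will generally not reproduce the rows the pipeline selects.

```clojure
;; Hypothetical sketch of a seeded holdout split over a plain sequence of rows.
(defn holdout-split
  "Shuffles `rows` with a seeded java.util.Random and cuts them into
  :train and :test. Returns nil when there are fewer than `min-samples` rows."
  [rows {:keys [seed train-fraction min-samples]
         :or {seed 1 train-fraction 2/3 min-samples 100}}]
  (when (>= (count rows) min-samples)
    (let [shuffled (java.util.ArrayList. ^java.util.Collection (vec rows))
          _ (java.util.Collections/shuffle shuffled (java.util.Random. (long seed)))
          n-train (long (Math/floor (double (* train-fraction (count rows)))))]
      {:train (vec (take n-train shuffled))
       :test (vec (drop n-train shuffled))})))

;; With 1400 rows and a 2/3 fraction: floor(933.33) = 933 train, 467 test.
(def example-split (holdout-split (range 1400) {:seed 1}))
```

Because the generator is seeded, calling holdout-split twice with the same seed returns the same partition, which is the property the fixed seed is meant to provide.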

Stage 4 — Train

Train an XGBoost binary classifier on the training set:

(def model
  (learning/train split-data
                  {:model-type :xgboost/classification
                   :round 50
                   :num-class 2}))

Stage 5 — Predict

Generate probability predictions on the test set. The result has columns 0 (probability of susceptible) and 1 (probability of resistant/intermediate), plus the true label :ri for comparison:

(def predictions
  (learning/predict split-data model))
predictions

predictions [467 3]:

| 0 | 1 | :ri |
|---|---|---|
| 0.05187543 | 0.94812459 | true |
| 0.99304366 | 0.00695636 | false |
| 0.96061611 | 0.03938394 | false |
| 0.97968948 | 0.02031045 | false |
| 0.99701047 | 0.00298953 | false |
| 0.99649519 | 0.00350485 | false |
| 0.99945921 | 0.00054080 | false |
| 0.99365848 | 0.00634146 | false |
| 0.93335623 | 0.06664377 | false |
| 0.99423039 | 0.00576961 | false |
| 0.98908663 | 0.01091336 | false |
| 0.99442768 | 0.00557238 | false |
| 0.99780887 | 0.00219112 | false |
| 0.99719256 | 0.00280746 | false |
| 0.99448317 | 0.00551683 | false |
| 0.99723178 | 0.00276819 | false |
| 0.99872690 | 0.00127309 | false |
| 0.99666721 | 0.00333284 | false |
| 0.50909549 | 0.49090451 | false |
| 0.99798101 | 0.00201905 | false |
| 0.97578853 | 0.02421154 | false |
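
The probabilities become class labels only once a decision threshold is chosen. The sketch below, in plain Clojure, applies a threshold to three rows copied from the table above and counts the outcomes. The helper names classify and confusion-counts are hypothetical, and the pipeline shown here does not itself apply a threshold; the metrics in the next stage are threshold-free.

```clojure
;; Three rows copied from the predictions table; keys 0 and 1 are the
;; class-probability columns, :ri is the true label.
(def example-rows
  [{0 0.05187543 1 0.94812459 :ri true}
   {0 0.99304366 1 0.00695636 :ri false}
   {0 0.50909549 1 0.49090451 :ri false}])

(defn classify
  "Adds :predicted, true when the resistant/intermediate probability
  (column 1) reaches `threshold`."
  [threshold row]
  (assoc row :predicted (>= (get row 1) threshold)))

(defn confusion-counts
  "Counts rows by [actual predicted] pair at the given threshold."
  [threshold rows]
  (frequencies
   (map (fn [row]
          (let [{:keys [ri predicted]} (classify threshold row)]
            [ri predicted]))
        rows)))

(def counts-at-half (confusion-counts 0.5 example-rows))
(def counts-at-lower (confusion-counts 0.4 example-rows))
```

The third row, with a resistant probability of about 0.49, is the borderline case: it is labelled susceptible at a threshold of 0.5 and resistant at 0.4, which turns a true negative into a false positive.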

Stage 6 — Measure

Evaluate with ROCAUC (area under the ROC curve) and PRAUC (area under the precision-recall curve):

(def metrics
  (learning/measure split-data predictions))
metrics
{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7233731598301811,
 :ROCAUC 0.8683092085001545}

Reading the results

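  • ROCAUC close to 1.0 means the model separates resistant from susceptible cases well; 0.5 corresponds to random ranking. Here it is about 0.87.

  • PRAUC is especially informative when resistance is rare (low :pri), because a random classifier scores a PRAUC of roughly :pri rather than 0.5. Here PRAUC is about 0.72 against a baseline of about 0.18.

  • :pri is the prevalence of resistance in the test set: here 85 of 467 samples, about 18%.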
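
For readers who want to see what these numbers measure, here is a small sketch in plain Clojure. It is not how learning/measure computes its metrics, which this section does not show; the function names are hypothetical. ROCAUC is written in its pairwise form: the fraction of (resistant, susceptible) pairs in which the resistant sample receives the higher score, with ties counted as half.

```clojure
;; Hypothetical helpers for illustration only.
(defn prevalence
  "Fraction of labels that are true."
  [labels]
  (/ (count (filter true? labels)) (double (count labels))))

(defn roc-auc
  "Pairwise (Mann-Whitney) form of ROC AUC. Quadratic in the number of
  samples, which is fine for a sketch but not for large test sets."
  [scores labels]
  (let [pairs (map vector scores labels)
        pos (for [[s l] pairs :when l] s)
        neg (for [[s l] pairs :when (not l)] s)
        wins (reduce + 0.0
                     (for [p pos n neg]
                       (cond (> p n) 1.0
                             (== p n) 0.5
                             :else 0.0)))]
    (/ wins (* (count pos) (count neg)))))

;; Toy data: three of the four (positive, negative) pairs are ranked
;; correctly, so the AUC is 0.75.
(def toy-auc (roc-auc [0.9 0.8 0.3 0.1] [true false true false]))

;; 85 resistant samples out of 467 gives the :pri reported above (about 0.182).
(def toy-pri (prevalence (concat (repeat 85 true) (repeat 382 false))))
```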

Summary

The six stages form a linear pipeline:

params → prepare-raw-data → prepare-ml-data → split → train → predict → measure

Each function takes the output of the previous stage plus its own configuration. Wrapping each stage with Pocket’s caching avoids recomputing expensive steps.
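
The benefit of caching a stage can be illustrated with clojure.core/memoize. This is only an in-memory stand-in: Pocket's own API is not shown in this section, and the function slow-prepare below is a hypothetical placeholder for an expensive stage rather than one of the pipeline functions.

```clojure
;; Counts how many times the expensive body actually runs.
(def call-count (atom 0))

(defn slow-prepare
  "Hypothetical placeholder for an expensive stage such as data preparation."
  [params]
  (swap! call-count inc)
  {:prepared-for params})

;; memoize caches results keyed by the argument list.
(def cached-prepare (memoize slow-prepare))

(def first-result (cached-prepare {:site :A :year 2018}))
(def second-result (cached-prepare {:site :A :year 2018}))
```

After both calls, call-count holds 1: the second call with equal parameters is served from the cache. Changing any parameter produces a new cache key and triggers a fresh computation.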

source: notebooks/amr_book/single_prediction.clj