5 A Single AMR Prediction
This notebook walks through the simplest case: predicting antimicrobial resistance for one species and one antibiotic using data from a single hospital site and year.
We call the pipeline functions directly (no caching) so each step is transparent.
(ns amr-book.single-prediction
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Choosing a scenario
We’ll predict Cefepime resistance in E. coli from DRIAMS-A 2018 — one of the cases reported by Weis et al.:
(def params
  {:site :A
   :year 2018
   :species bacteria/E-coli
   :antibiotic :Cefepime})

Stage 1 — Prepare raw data
Load DRIAMS metadata, join with available spectra, and filter to the chosen species/antibiotic. The result has a :ri column (true = resistant or intermediate) and a :path column pointing to each raw spectrum file.
(def raw-data
  (learning/prepare-raw-data params))

(tc/row-count raw-data)

1400

(tc/select-columns raw-data [:code :species :ri :path])

raw-data [Escherichia coli / Cefepime / A / 2018] [1400 4]:
| :code | :species | :ri | :path |
|---|---|---|---|
| 69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz |
| 2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1.txt.gz |
| 51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1.txt.gz |
| 11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1.txt.gz |
| d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1.txt.gz |
| 99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1.txt.gz |
| b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1.txt.gz |
| 51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1.txt.gz |
| ad628afc-fc57-4084-ba92-90c18d938319_MALDI1 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/ad628afc-fc57-4084-ba92-90c18d938319_MALDI1.txt.gz |
| 5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1.txt.gz |
| … | … | … | … |
| 525421d5-7c97-4523-9d82-5be7b312955c_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/525421d5-7c97-4523-9d82-5be7b312955c_MALDI2.txt.gz |
| 1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2.txt.gz |
| 0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2.txt.gz |
| 9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2.txt.gz |
| 450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2.txt.gz |
| fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2.txt.gz |
| 81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2.txt.gz |
| cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2.txt.gz |
| 013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2.txt.gz |
| 05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2.txt.gz |
| 6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2.txt.gz |
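The :ri label described in this stage can be read as a simple mapping from a susceptibility category to a boolean. A minimal sketch in plain Clojure, assuming the metadata stores categories as the strings "R", "I", and "S" (an assumption; the actual encoding and the helper names `ri?` and `label-rows` are hypothetical and are not the `learning/prepare-raw-data` implementation):

```clojure
;; Hypothetical helpers, for illustration only.
(defn ri?
  "True when a category counts as resistant or intermediate, false when
  susceptible, nil when the category is missing or unrecognised."
  [category]
  (case category
    ("R" "I") true
    "S" false
    nil))

(defn label-rows
  "Adds :ri to each row map and drops rows without a usable label."
  [rows antibiotic]
  (->> rows
       (map (fn [row] (assoc row :ri (ri? (get row antibiotic)))))
       (filter (fn [row] (some? (:ri row))))
       vec))

;; Made-up rows; the :code values are placeholders, not DRIAMS identifiers.
(def example-rows
  [{:code "a" :Cefepime "S"}
   {:code "b" :Cefepime "R"}
   {:code "c" :Cefepime "I"}
   {:code "d" :Cefepime nil}])

(mapv :ri (label-rows example-rows :Cefepime))
```

Rows with no recorded result for the chosen antibiotic are dropped in this sketch, since they cannot serve as training or test labels.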
Stage 2 — Prepare ML data
This is the expensive step. For each spectrum:
1. Load the raw .txt.gz file.
2. Preprocess (sqrt, smooth, baseline, normalize) via Ripple.
3. Bin to 6,000 features (3 Da bins over [2000, 20000] Da).
The result is a dataset with columns :ri, :x0, :x1, … :x5999.
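The feature count follows from the binning parameters: (20000 - 2000) / 3 = 6000 bins. A minimal sketch of fixed-width binning in plain Clojure, assuming a spectrum is a sequence of `[mz intensity]` pairs and that intensities falling into the same bin are summed (both are assumptions for illustration; `bin-index` and `bin-spectrum` are hypothetical names, not the Ripple or `learning/prepare-ml-data` code):

```clojure
(def binning {:range [2000 20000] :step 3})

(defn bin-index
  "Index of the bin containing m/z value `mz`, or nil when out of range.
  The range is treated as half-open: lo <= mz < hi."
  [mz {[lo hi] :range step :step}]
  (when (and (<= lo mz) (< mz hi))
    (long (quot (- mz lo) step))))

(defn bin-spectrum
  "Sums the intensities of [mz intensity] peaks into fixed-width bins.
  Peaks outside the range are ignored."
  [peaks {[lo hi] :range step :step :as binning-params}]
  (let [n-bins (long (Math/ceil (double (/ (- hi lo) step))))
        empty-bins (vec (repeat n-bins 0.0))]
    (reduce (fn [bins [mz intensity]]
              (if-let [i (bin-index mz binning-params)]
                (update bins i + intensity)
                bins))
            empty-bins
            peaks)))

;; Made-up peaks: two fall in bin 0, one in bin 1, one in the last bin,
;; and two lie outside [2000, 20000).
(def example-peaks
  [[2000.0 1.0] [2001.5 2.0] [2003.0 4.0]
   [19999.9 8.0] [20000.0 16.0] [1500.0 32.0]])

(count (bin-spectrum example-peaks binning))
```

Each bin index `i` would then correspond to a feature column such as `(keyword (str "x" i))`.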
(def ml-params
  {:preprocessing-params {:smooth-window 21 :smooth-polynomial 3}
   :binning-params {:range [2000 20000] :step 3}})

(def ml-data
  (learning/prepare-ml-data raw-data ml-params))

{:rows (tc/row-count ml-data)
 :features (dec (count (tc/column-names ml-data)))}

{:rows 1400, :features 6000}

Stage 3 — Train/test split
A holdout split with a fixed random seed for reproducibility. Requires at least 100 samples (returns nil otherwise).
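To make the idea concrete, here is a minimal sketch of a seeded holdout split in plain Clojure. It assumes a train fraction of 2/3, inferred only from the row counts reported for this stage, and a Fisher-Yates shuffle driven by `java.util.Random`; `seeded-shuffle` and `holdout-split` are hypothetical names, and the real `learning/split` may work differently:

```clojure
(defn seeded-shuffle
  "Fisher-Yates shuffle driven by a seeded java.util.Random, so the same
  seed always yields the same order."
  [coll seed]
  (let [rng (java.util.Random. (long seed))]
    (loop [v (vec coll)
           i (dec (count v))]
      (if (pos? i)
        (let [j (.nextInt rng (inc i))]
          ;; Swap positions i and j, then move one position down.
          (recur (assoc v i (v j) j (v i)) (dec i)))
        v))))

(defn holdout-split
  "Returns {:train [...] :test [...]}, or nil when there are fewer than
  `min-samples` rows. The default train fraction is an assumption."
  [rows {:keys [seed train-fraction min-samples]
         :or {train-fraction 2/3 min-samples 100}}]
  (when (>= (count rows) min-samples)
    (let [shuffled (seeded-shuffle rows seed)
          n-train (long (* train-fraction (count rows)))]
      {:train (subvec shuffled 0 n-train)
       :test (subvec shuffled n-train)})))

(let [{:keys [train test]} (holdout-split (range 1400) {:seed 1})]
  {:train (count train) :test (count test)})
```

Because the shuffle depends only on the seed and the input order, rerunning with the same seed reproduces the same partition.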
(def split-data
  (learning/split ml-data {:seed 1}))

{:train (tc/row-count (:train split-data))
 :test (tc/row-count (:test split-data))}

{:train 933, :test 467}

Stage 4 — Train
Train an XGBoost binary classifier on the training set:
(def model
  (learning/train split-data
                  {:model-type :xgboost/classification
                   :round 50
                   :num-class 2}))
Stage 5 — Predict
Generate probability predictions on the test set. The result has columns 0 (probability of susceptible) and 1 (probability of resistant/intermediate), alongside the true label :ri:
(def predictions
  (learning/predict split-data model))

predictions

predictions [467 3]:
| 0 | 1 | :ri |
|---|---|---|
| 0.05187543 | 0.94812459 | true |
| 0.99304366 | 0.00695636 | false |
| 0.96061611 | 0.03938394 | false |
| 0.97968948 | 0.02031045 | false |
| 0.99701047 | 0.00298953 | false |
| 0.99649519 | 0.00350485 | false |
| 0.99945921 | 0.00054080 | false |
| 0.99365848 | 0.00634146 | false |
| 0.93335623 | 0.06664377 | false |
| 0.99423039 | 0.00576961 | false |
| … | … | … |
| 0.98908663 | 0.01091336 | false |
| 0.99442768 | 0.00557238 | false |
| 0.99780887 | 0.00219112 | false |
| 0.99719256 | 0.00280746 | false |
| 0.99448317 | 0.00551683 | false |
| 0.99723178 | 0.00276819 | false |
| 0.99872690 | 0.00127309 | false |
| 0.99666721 | 0.00333284 | false |
| 0.50909549 | 0.49090451 | false |
| 0.99798101 | 0.00201905 | false |
| 0.97578853 | 0.02421154 | false |
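The two probability columns sum to roughly 1 in every row, so column 1 alone carries the ranking. If hard labels were needed, one option is to threshold that column. The sketch below is illustrative only: the threshold values and the name `predicted-ri` are assumptions, and the pipeline itself evaluates the probabilities directly rather than thresholded labels. The three rows are copied from the table above.

```clojure
(def example-predictions
  [{0 0.05187543 1 0.94812459 :ri true}
   {0 0.99304366 1 0.00695636 :ri false}
   {0 0.50909549 1 0.49090451 :ri false}])

(defn predicted-ri
  "Hypothetical helper: calls a sample resistant/intermediate when the
  probability in column 1 reaches `threshold`."
  [row threshold]
  (>= (get row 1) threshold))

;; With a 0.5 threshold all three rows agree with the true :ri label.
(mapv #(predicted-ri % 0.5) example-predictions)

;; Lowering the threshold to 0.4 flips the borderline third row.
(mapv #(predicted-ri % 0.4) example-predictions)
```

The third row shows why a single threshold is a fragile summary: a small change in the cut-off changes the predicted class, while ranking-based metrics are unaffected.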
Stage 6 — Measure
Evaluate with ROCAUC (area under the ROC curve) and PRAUC (area under the precision-recall curve):
(def metrics
  (learning/measure split-data predictions))

metrics

{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7233731598301811,
 :ROCAUC 0.8683092085001545}

Reading the results
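ROCAUC close to 1.0 means the model ranks resistant cases above susceptible ones well; 0.5 corresponds to random guessing.
PRAUC is especially informative when resistance is rare (low :pri), because its random-guess baseline is the prevalence itself rather than 0.5. Here the PRAUC of 0.72 compares against a baseline of about 0.18.
:pri is the prevalence of resistance in the test set: here 85 of 467 samples, about 18%.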
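To show what these three numbers measure, here is a small sketch on toy data in plain Clojure. It computes ROCAUC as the fraction of (resistant, susceptible) pairs ranked correctly, and uses average precision as one common estimate of PRAUC. These are assumptions for illustration: `learning/measure` may use different estimators, and the function names below are hypothetical.

```clojure
(defn prevalence
  "Fraction of positive (true) labels."
  [labels]
  (/ (count (filter true? labels)) (double (count labels))))

(defn roc-auc
  "Fraction of (positive, negative) pairs in which the positive has the
  higher score; ties count as half. Needs at least one of each class."
  [scores labels]
  (let [pairs (map vector scores labels)
        pos (map first (filter second pairs))
        neg (map first (remove second pairs))
        wins (for [p pos, n neg]
               (cond (> p n) 1.0
                     (== p n) 0.5
                     :else 0.0))]
    (/ (reduce + wins) (* (count pos) (count neg)))))

(defn average-precision
  "Mean of precision-at-k taken at the rank k of each positive, after
  sorting by descending score. Needs at least one positive."
  [scores labels]
  (let [ranked (map second (sort-by first > (map vector scores labels)))
        n-pos (count (filter true? ranked))
        step (fn [{:keys [hits total]} [i label]]
               (if label
                 (let [hits (inc hits)]
                   {:hits hits :total (+ total (/ hits (double (inc i))))})
                 {:hits hits :total total}))]
    (/ (:total (reduce step {:hits 0 :total 0.0} (map-indexed vector ranked)))
       n-pos)))

;; Toy data: 2 resistant samples out of 5.
(def toy-scores [0.9 0.8 0.7 0.6 0.2])
(def toy-labels [true false true false false])

{:pri (prevalence toy-labels)
 :ROCAUC (roc-auc toy-scores toy-labels)
 :PRAUC (average-precision toy-scores toy-labels)}
```

In the toy data, 5 of the 6 resistant/susceptible pairs are ordered correctly, and the two resistant samples sit at ranks 1 and 3, giving precisions of 1 and 2/3.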
Summary
The six stages form a linear pipeline:
params → prepare-raw-data → prepare-ml-data → split → train → predict → measure
Each function takes the output of the previous stage plus its own configuration. Wrapping each stage with Pocket’s caching avoids recomputing expensive steps.
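The benefit of caching can be sketched with `clojure.core/memoize`, used here as an in-memory stand-in. This is not Pocket's API, which is not shown or assumed here; `expensive-stage` is a hypothetical placeholder for a slow stage such as spectrum preprocessing.

```clojure
;; Counts how many times the underlying stage actually runs.
(def calls (atom 0))

(defn expensive-stage
  "Hypothetical stand-in for a slow pipeline stage."
  [params]
  (swap! calls inc)
  (assoc params :prepared true))

;; memoize keys the cache on the argument values, so calling again with
;; equal params returns the stored result without rerunning the stage.
(def cached-stage (memoize expensive-stage))

(cached-stage {:site :A :year 2018})
(cached-stage {:site :A :year 2018})

@calls
```

The same principle applies stage by stage: when the inputs and configuration of a stage are unchanged, its cached output can be reused, and only the stages downstream of a change need to run again.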