9 ML Pipeline Walkthrough
This notebook demonstrates each stage of the MALDI-TOF AMR prediction pipeline, showing both direct (uncached) and cached variants side by side.
Here we focus on the caching pattern and how Pocket threads stages together.
(ns amr-book.ml-pipeline-walkthrough
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; AMR data loading utilities:
   [scicloj.amr.data.ingestion :as ingestion]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Ripple MALDI signal processing (https://scicloj.github.io/ripple):
   [scicloj.ripple.maldi :as ripple]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Example Configuration
We’ll use a single example throughout:
- Species: E. coli
- Antibiotic: Cefepime
- Site: A
- Year: 2018
(def example-params
  {:site :A
   :year 2018
   :species bacteria/E-coli
   :antibiotic :Cefepime})

example-params

{:site :A,
 :year 2018,
 :species "Escherichia coli",
 :antibiotic :Cefepime}

Stage 1: Prepare Raw Data
Loads metadata and raw spectra, then filters by species, antibiotic, site, and year. Returns a dataset with columns including :code, :Cefepime (resistance), and :path.
Without caching:
(def raw-data-no-cache
  (learning/prepare-raw-data example-params))

[(tc/row-count raw-data-no-cache)
 (take 10 (tc/column-names raw-data-no-cache))]

[1400
 (:code
  :site
  :year
  :path
  :data/DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz.code
  :column-0
  :Unnamed: 0
  :species
  :laboratory_species
  :Penicillin)]

With caching:
(def raw-data-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      deref))

16:32:50.292 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/prepare-raw-data /workspace/.cache/pocket/c4/(scicloj.amr.learning_prepare-raw-data {:antibiotic :Cefepime, :site :A, :species "Escherichia coli", :year 2018})

[(tc/row-count raw-data-cached)
 (= (tc/row-count raw-data-no-cache) (tc/row-count raw-data-cached))]

[1400 true]

Stage 2: Prepare ML Data
Preprocesses spectra (sqrt, smooth, baseline, normalize) and bins them. The resulting dataset has an :ri column (resistance indicator) and feature columns :x0, :x1, … :x5999.
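To make the binning arithmetic concrete: with a range of 2000 to 20000 and a step of 3 (the values this stage uses), there are (20000 - 2000) / 3 = 6000 bins, which matches the feature columns :x0 through :x5999. The following is a minimal sketch in plain Clojure, not the code used by the pipeline or by Ripple. The helper names (bin-count, bin-index, bin-spectrum), the half-open bin intervals, and the choice to sum intensities within a bin are all assumptions made for illustration.

```clojure
;; Hypothetical helpers for illustration; not part of scicloj.amr or Ripple.

(defn bin-count
  "Number of fixed-width bins covering the half-open interval [lo, hi)."
  [[lo hi] step]
  (quot (- hi lo) step))

(defn bin-index
  "Index of the bin containing mz, or nil when mz lies outside [lo, hi)."
  [[lo hi] step mz]
  (when (and (>= mz lo) (< mz hi))
    (long (quot (- mz lo) step))))

(defn bin-spectrum
  "Sums the intensities of [mz intensity] peaks into a vector of bins.
  Summation is an assumption; a real pipeline may aggregate differently."
  [mz-range step peaks]
  (reduce (fn [bins [mz intensity]]
            (if-let [i (bin-index mz-range step mz)]
              (update bins i + intensity)
              bins))
          (vec (repeat (bin-count mz-range step) 0.0))
          peaks))

(bin-count [2000 20000] 3)        ; => 6000
(bin-index [2000 20000] 3 2004.5) ; => 1
```

Peaks that fall outside the range are dropped in this sketch, so every spectrum yields a vector of the same length regardless of how many peaks it contains.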
(def ml-params
  {:preprocessing-params {:smooth-window 21 :smooth-polynomial 3}
   :binning-params {:range [2000 20000]
                    :step 3}})

Without caching (slow - preprocessing + binning 1400 spectra):

(def ml-data-no-cache
  (learning/prepare-ml-data raw-data-no-cache ml-params))

[(tc/row-count ml-data-no-cache)
 (take 5 (tc/column-names ml-data-no-cache))]

[1400 (:ri :x0 :x1 :x10 :x100)]

With caching:
(def ml-data-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      (learning/prepare-ml-data-cached ml-params)
      deref))

16:32:56.807 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/prepare-ml-data /workspace/.cache/pocket/88/88bc206e4ef3f49271d1a644a14e2c7676b0924c

[(tc/row-count ml-data-cached)
 (= (tc/row-count ml-data-no-cache) (tc/row-count ml-data-cached))]

[1400 true]

Stage 3: Split Data
Splits the data into train and test sets (default split of roughly 66/33). Requires at least 100 samples and returns nil otherwise. On success, returns a map with :train and :test datasets.
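The behavior described above (a seeded split with a minimum-size guard) can be sketched in a few lines. This is not the implementation of learning/split. The function name split-sketch, the option names other than :seed, the exact 2/3 training fraction, and the use of a shuffled java.util.ArrayList are assumptions for illustration.

```clojure
(import '(java.util ArrayList Collections Random))

(defn split-sketch
  "Seeded shuffle followed by a train/test cut. Returns nil when there are
  fewer than min-samples rows. The defaults are assumptions, not the
  pipeline's documented values."
  [rows {:keys [seed train-fraction min-samples]
         :or   {seed 0 train-fraction 2/3 min-samples 100}}]
  (when (>= (count rows) min-samples)
    (let [shuffled (ArrayList. ^java.util.Collection (vec rows))
          n-train  (long (* train-fraction (count rows)))]
      ;; A seeded Random makes the shuffle, and so the split, repeatable.
      (Collections/shuffle shuffled (Random. (long seed)))
      {:train (vec (take n-train shuffled))
       :test  (vec (drop n-train shuffled))})))

(let [{:keys [train test]} (split-sketch (range 1400) {:seed 1})]
  [(count train) (count test)])
;; => [933 467]
```

With 1400 rows, truncating 1400 × 2/3 gives 933 training rows and 467 test rows, the same sizes this stage reports.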
(def split-params
  {:seed 1})

Without caching:

(def split-data-no-cache
  (learning/split ml-data-no-cache split-params))

[(keys split-data-no-cache)
 (tc/row-count (:train split-data-no-cache))
 (tc/row-count (:test split-data-no-cache))]

[(:train :test) 933 467]

With caching:
(def split-data-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      (learning/prepare-ml-data-cached ml-params)
      (learning/split-cached split-params)
      deref))

16:32:57.067 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/split /workspace/.cache/pocket/45/4522e71e313babe42221a60ed2802d788f2dc034

[(keys split-data-cached)
 (tc/row-count (:train split-data-cached))
 (tc/row-count (:test split-data-cached))]

[(:train :test) 933 467]

Stage 4: Train Model
Trains an XGBoost classifier on the training set. Returns a trained model map with keys such as :model-data, :options, and :feature-columns.
(def train-params
  {:model-type :xgboost/classification
   :round 50
   :num-class 2})

Without caching:

(def model-no-cache
  (learning/train split-data-no-cache train-params))

(keys model-no-cache)

(:model-data
 :options
 :train-input-hash
 :id
 :feature-columns
 :target-columns
 :target-datatypes)

With caching:
(def model-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      (learning/prepare-ml-data-cached ml-params)
      (learning/split-cached split-params)
      (learning/train-cached train-params)
      deref))

16:33:06.320 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/train /workspace/.cache/pocket/f9/f9d3e786ea47b587a25b0dbe61e17836c0173a63

(keys model-cached)

(:model-data
 :options
 :train-input-hash
 :id
 :feature-columns
 :target-columns
 :target-datatypes)

Stage 5: Predict
Generates predictions on the test set. Returns a dataset with prediction columns 0 and 1 (one predicted score per class) and :ri (the actual labels).
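As a hedged illustration of how such a prediction dataset might be consumed, the sketch below turns the per-class scores into a predicted label and compares it with :ri. It assumes that column 1 holds the predicted probability of the resistant class and that 0.5 is a sensible cutoff; neither is stated by the pipeline. The sample rows are copied from the prediction output of this stage, and the function names are hypothetical.

```clojure
;; Rows mimic the prediction dataset's columns: 0, 1, and :ri.
(def sample-rows
  [{0 0.05187543 1 0.94812459 :ri true}
   {0 0.99304366 1 0.00695636 :ri false}
   {0 0.50909549 1 0.49090451 :ri false}])

(defn predicted-resistant?
  "True when the score in column 1 reaches the threshold.
  Treating column 1 as the resistant class is an assumption."
  [row threshold]
  (>= (get row 1) threshold))

(defn accuracy
  "Fraction of rows whose thresholded prediction equals the :ri label."
  [rows threshold]
  (/ (count (filter #(= (predicted-resistant? % threshold) (:ri %)) rows))
     (double (count rows))))

(mapv #(predicted-resistant? % 0.5) sample-rows) ; => [true false false]
(accuracy sample-rows 0.5)                       ; => 1.0
```

The third row, with scores close to 0.5 on both sides, shows why the cutoff matters: lowering the threshold to 0.4 would flip its predicted label.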
Without caching:
(def predictions-no-cache
  (learning/predict split-data-no-cache model-no-cache))

[(tc/row-count predictions-no-cache)
 (tc/column-names predictions-no-cache)]

[467 (0 1 :ri)]

predictions-no-cache

predictions [467 3]:

| 0 | 1 | :ri |
|---|---|---|
| 0.05187543 | 0.94812459 | true |
| 0.99304366 | 0.00695636 | false |
| 0.96061611 | 0.03938394 | false |
| 0.97968948 | 0.02031045 | false |
| 0.99701047 | 0.00298953 | false |
| 0.99649519 | 0.00350485 | false |
| 0.99945921 | 0.00054080 | false |
| 0.99365848 | 0.00634146 | false |
| 0.93335623 | 0.06664377 | false |
| 0.99423039 | 0.00576961 | false |
| … | … | … |
| 0.98908663 | 0.01091336 | false |
| 0.99442768 | 0.00557238 | false |
| 0.99780887 | 0.00219112 | false |
| 0.99719256 | 0.00280746 | false |
| 0.99448317 | 0.00551683 | false |
| 0.99723178 | 0.00276819 | false |
| 0.99872690 | 0.00127309 | false |
| 0.99666721 | 0.00333284 | false |
| 0.50909549 | 0.49090451 | false |
| 0.99798101 | 0.00201905 | false |
| 0.97578853 | 0.02421154 | false |

With caching:

(def predictions-cached
  (let [split-data (-> example-params
                       (learning/prepare-raw-data-cached)
                       (learning/prepare-ml-data-cached ml-params)
                       (learning/split-cached split-params))
        model (-> split-data
                  (learning/train-cached train-params))]
    (-> split-data
        (learning/predict-cached model)
        deref)))

16:33:06.448 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/predict /workspace/.cache/pocket/7f/7f84be3268acea8d79e2845e685a996a4acda8a5

[(tc/row-count predictions-cached)
 (tc/column-names predictions-cached)]

[467 (0 1 :ri)]

Stage 6: Measure Performance
Calculates the ROCAUC and PRAUC metrics. Returns a map with :n-train, :n-test, :pri (prevalence), :PRAUC, and :ROCAUC.
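For readers who want the metrics spelled out, here is a small sketch of prevalence and ROC AUC. It uses the pairwise definition of ROC AUC: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. This is not how learning/measure is implemented, and the function names are hypothetical. Prevalence is worth reporting next to PRAUC because a classifier with no skill scores a PR AUC close to the prevalence.

```clojure
(defn prevalence
  "Fraction of labels that are true."
  [labels]
  (/ (count (filter true? labels))
     (double (count labels))))

(defn roc-auc
  "Pairwise ROC AUC. Assumes both classes are present in labels.
  Runs in O(positives * negatives), which is fine for a sketch."
  [scores labels]
  (let [pairs (map vector scores labels)
        pos   (keep (fn [[s l]] (when l s)) pairs)
        neg   (keep (fn [[s l]] (when-not l s)) pairs)
        wins  (for [p pos, n neg]
                (cond (> p n)  1.0
                      (== p n) 0.5
                      :else    0.0))]
    (/ (reduce + wins)
       (* (count pos) (count neg)))))

(prevalence [true false true false])                  ; => 0.5
(roc-auc [0.9 0.8 0.3 0.1] [true false true false])   ; => 0.75
```

In the worked call, three of the four positive/negative pairs are ordered correctly, giving 3/4 = 0.75.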
Without caching:
(def metrics-no-cache
  (learning/measure split-data-no-cache predictions-no-cache))

metrics-no-cache

{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7233731598301811,
 :ROCAUC 0.8683092085001545}

With caching:
(def metrics-cached
  (let [split-data (-> example-params
                       (learning/prepare-raw-data-cached)
                       (learning/prepare-ml-data-cached ml-params)
                       (learning/split-cached split-params))
        model (-> split-data
                  (learning/train-cached train-params))
        predictions (-> split-data
                        (learning/predict-cached model))]
    (-> split-data
        (learning/measure-cached predictions)
        deref)))

16:33:06.464 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/measure /workspace/.cache/pocket/41/417a4449af575d32b0b597cbb26c7c79879a0220

metrics-cached

{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7304522704648382,
 :ROCAUC 0.8816753926701566}

Summary
The complete pipeline has 6 stages:
- prepare-raw-data: Load metadata and filter by species/antibiotic/site/year
- prepare-ml-data: Preprocess (sqrt, smooth, baseline, normalize) and bin spectra
- split: Create train/test split
- train: Train XGBoost classifier
- predict: Generate predictions on test set
- measure: Calculate ROCAUC and PRAUC metrics
Caching Benefits
The cached version enables:
- Reusing expensive computations: Preprocessing spectra is slow; cached results are read back from the cache instead of being recomputed
- Reproducible experiments: Cache keys are deterministic based on inputs
- Incremental development: Add pipeline stages without re-running earlier ones
- Parallel execution: Different scenarios can share cached preprocessing results
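The point about deterministic cache keys can be illustrated with a small sketch. It is not Pocket's actual key scheme. It assumes a key is built by canonicalizing the arguments (sorting map keys), printing them, and hashing the result with SHA-1; the function names are hypothetical. The cache paths in the log output of this notebook end in 40-character hexadecimal names, which is the length of a SHA-1 digest, but which hash Pocket uses is not stated here.

```clojure
(require '[clojure.walk :as walk])
(import 'java.security.MessageDigest)

(defn canonical
  "Replaces every map with a sorted map so printing ignores insertion order.
  Assumes the keys within one map are mutually comparable (e.g. all keywords)."
  [x]
  (walk/postwalk #(if (map? %) (into (sorted-map) %) %) x))

(defn sha1-hex
  "Hex-encoded SHA-1 digest of a string."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-1")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

(defn cache-key
  "Key derived only from the function name and its arguments."
  [fn-name & args]
  (sha1-hex (pr-str (canonical (cons fn-name args)))))

(= (cache-key 'prepare-raw-data {:site :A :year 2018})
   (cache-key 'prepare-raw-data {:year 2018 :site :A}))
;; => true
```

Because the key depends only on the function name and its inputs, two runs with the same parameters map to the same cache entry, and any change to a parameter produces a different key.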
Pipeline Composition
Notice the threading pattern in the cached version:
(-> params
    (learning/prepare-raw-data-cached)
    (learning/prepare-ml-data-cached ml-params)
    deref)

Each cached function returns a Cached (an IDeref) that gets passed to the next stage. deref forces evaluation at the end.
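The threading pattern above can be imitated with a toy version that makes the laziness visible. This sketch assumes nothing about Pocket's internals: the Cached type, the in-memory store, and the cached helper below are stand-ins written for this example, whereas Pocket's own cache persists to disk (as the "Cache hit (disk)" log lines show) and has its own API. The stage functions load-data and bin-data are hypothetical.

```clojure
;; In-memory stand-in for a cache, keyed by ids built from
;; function names and arguments.
(def store (atom {}))

;; Records which stage functions actually ran.
(def calls (atom []))

(deftype Cached [id thunk]
  clojure.lang.IDeref
  (deref [_]
    (if-let [entry (find @store id)]
      (val entry)                      ; cache hit: skip the computation
      (let [v (thunk)]                 ; cache miss: compute and remember
        (swap! store assoc id v)
        v))))

(defn arg-id [x] (if (instance? Cached x) (.-id ^Cached x) x))
(defn arg-value [x] (if (instance? Cached x) @x x))

(defn cached
  "Wraps a call to f in a Cached without running it. Upstream Cached
  arguments contribute their ids to the key and are only dereferenced
  when this value is."
  [f-name f & args]
  (Cached. (into [f-name] (map arg-id args))
           #(apply f (map arg-value args))))

(defn load-data [params]
  (swap! calls conj :load-data)
  (assoc params :rows 3))

(defn bin-data [data opts]
  (swap! calls conj :bin-data)
  (assoc data :step (:step opts)))

(defn load-data-cached [params] (cached :load-data load-data params))
(defn bin-data-cached [data opts] (cached :bin-data bin-data data opts))

(def pipeline
  (-> {:site :A}
      (load-data-cached)
      (bin-data-cached {:step 3})))

@calls     ; => [] (building the pipeline runs nothing)
@pipeline  ; => {:site :A, :rows 3, :step 3}
@calls     ; => [:load-data :bin-data]
@pipeline  ; served from the store
@calls     ; => [:load-data :bin-data] (no stage ran again)
```

The key design point is that each stage's id is derived from the upstream id rather than the upstream value, so a downstream cache hit can be detected without computing anything upstream.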