9  ML Pipeline Walkthrough

This notebook demonstrates each stage of the MALDI-TOF AMR prediction pipeline, showing both direct (uncached) and cached variants side by side.

Here we focus on the caching pattern and how Pocket threads stages together.

(ns amr-book.ml-pipeline-walkthrough
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; AMR data loading utilities:
   [scicloj.amr.data.ingestion :as ingestion]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Ripple MALDI signal processing (https://scicloj.github.io/ripple):
   [scicloj.ripple.maldi :as ripple]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Example Configuration

We’ll use a single example throughout:

  • Species: E. coli
  • Antibiotic: Cefepime
  • Site: A
  • Year: 2018

(def example-params
  {:site :A
   :year 2018
   :species bacteria/E-coli
   :antibiotic :Cefepime})
example-params
{:site :A,
 :year 2018,
 :species "Escherichia coli",
 :antibiotic :Cefepime}

Stage 1: Prepare Raw Data

Loads the metadata and raw spectra, then filters them by species, antibiotic, site, and year. Returns a dataset whose columns include :code, :Cefepime (the resistance label), and :path.

Without caching:

(def raw-data-no-cache
  (learning/prepare-raw-data example-params))
[(tc/row-count raw-data-no-cache)
 (take 10 (tc/column-names raw-data-no-cache))]
[1400
 (:code
  :site
  :year
  :path
  :data/DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz.code
  :column-0
  :Unnamed: 0
  :species
  :laboratory_species
  :Penicillin)]

With caching:

(def raw-data-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      deref))
16:32:50.292 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/prepare-raw-data /workspace/.cache/pocket/c4/(scicloj.amr.learning_prepare-raw-data {:antibiotic :Cefepime, :site :A, :species "Escherichia coli", :year 2018})
[(tc/row-count raw-data-cached)
 (= (tc/row-count raw-data-no-cache) (tc/row-count raw-data-cached))]
[1400 true]

Stage 2: Prepare ML Data

Preprocesses the spectra (sqrt, smooth, baseline, normalize) and bins them. The resulting dataset has an :ri column (resistance indicator) and one feature column per bin: :x0, :x1, … :x5999.

(def ml-params
  {:preprocessing-params {:smooth-window 21 :smooth-polynomial 3}
   :binning-params {:range [2000 20000]
                    :step 3}})
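
The feature count follows from the binning parameters. The sketch below assumes equal-width bins that tile the range without overlap; bin-count is a hypothetical helper, not part of scicloj.amr.learning.

```clojure
;; Hypothetical helper: number of equal-width bins covering [lo, hi).
(defn bin-count
  [{[lo hi] :range, step :step}]
  (quot (- hi lo) step))

(bin-count {:range [2000 20000] :step 3})
;; => 6000, matching feature columns :x0 through :x5999
```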

Without caching (slow: this preprocesses and bins all 1400 spectra):

(def ml-data-no-cache
  (learning/prepare-ml-data raw-data-no-cache ml-params))
[(tc/row-count ml-data-no-cache)
 (take 5 (tc/column-names ml-data-no-cache))]
[1400 (:ri :x0 :x1 :x10 :x100)]

With caching:

(def ml-data-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      (learning/prepare-ml-data-cached ml-params)
      deref))
16:32:56.807 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/prepare-ml-data /workspace/.cache/pocket/88/88bc206e4ef3f49271d1a644a14e2c7676b0924c
[(tc/row-count ml-data-cached)
 (= (tc/row-count ml-data-no-cache) (tc/row-count ml-data-cached))]
[1400 true]

Stage 3: Split Data

Splits the data into train and test sets (roughly 2:1 by default). At least 100 samples are required; with fewer, it returns nil. Otherwise it returns a map with :train and :test datasets.

(def split-params
  {:seed 1})

Without caching:

(def split-data-no-cache
  (learning/split ml-data-no-cache split-params))
[(keys split-data-no-cache)
 (tc/row-count (:train split-data-no-cache))
 (tc/row-count (:test split-data-no-cache))]
[(:train :test) 933 467]

With caching:

(def split-data-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      (learning/prepare-ml-data-cached ml-params)
      (learning/split-cached split-params)
      deref))
16:32:57.067 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/split /workspace/.cache/pocket/45/4522e71e313babe42221a60ed2802d788f2dc034
[(keys split-data-cached)
 (tc/row-count (:train split-data-cached))
 (tc/row-count (:test split-data-cached))]
[(:train :test) 933 467]
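
To make the role of :seed concrete, here is a minimal sketch of a seeded split over plain rows. It is not the implementation of learning/split: seeded-split, the two-thirds training fraction, and the truncation to a whole number of rows are assumptions, chosen because they are consistent with the 933/467 counts above.

```clojure
(import '(java.util ArrayList Collections Random))

;; Hypothetical sketch: shuffle with a seeded RNG, then cut at the
;; training fraction. Returns nil when there are too few rows.
(defn seeded-split
  [rows {:keys [seed train-fraction min-rows]
         :or {train-fraction 2/3, min-rows 100}}]
  (when (>= (count rows) min-rows)
    (let [shuffled (ArrayList. ^java.util.Collection (vec rows))
          _ (Collections/shuffle shuffled (Random. (long seed)))
          ;; long truncates, so 1400 rows give 933 training rows
          n-train (long (* train-fraction (count rows)))
          [train test] (split-at n-train shuffled)]
      {:train (vec train) :test (vec test)})))

(let [{:keys [train test]} (seeded-split (range 1400) {:seed 1})]
  [(count train) (count test)])
;; => [933 467]
```

Because the shuffle is driven only by the seed, calling it again with the same rows and seed yields the same split.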

Stage 4: Train Model

Trains an XGBoost classifier on the training set. Returns a trained model map with keys such as :model-data, :options, and :feature-columns.

(def train-params
  {:model-type :xgboost/classification
   :round 50
   :num-class 2})

Without caching:

(def model-no-cache
  (learning/train split-data-no-cache train-params))
(keys model-no-cache)
(:model-data
 :options
 :train-input-hash
 :id
 :feature-columns
 :target-columns
 :target-datatypes)

With caching:

(def model-cached
  (-> example-params
      (learning/prepare-raw-data-cached)
      (learning/prepare-ml-data-cached ml-params)
      (learning/split-cached split-params)
      (learning/train-cached train-params)
      deref))
16:33:06.320 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/train /workspace/.cache/pocket/f9/f9d3e786ea47b587a25b0dbe61e17836c0173a63
(keys model-cached)
(:model-data
 :options
 :train-input-hash
 :id
 :feature-columns
 :target-columns
 :target-datatypes)

Stage 5: Predict

Generates predictions on the test set. Returns a dataset with one prediction column per class (0 and 1) plus :ri, which holds the actual labels.

Without caching:

(def predictions-no-cache
  (learning/predict split-data-no-cache model-no-cache))
[(tc/row-count predictions-no-cache)
 (tc/column-names predictions-no-cache)]
[467 (0 1 :ri)]
predictions-no-cache

predictions [467 3]:

0           1           :ri
0.05187543  0.94812459  true
0.99304366  0.00695636  false
0.96061611  0.03938394  false
0.97968948  0.02031045  false
0.99701047  0.00298953  false
0.99649519  0.00350485  false
0.99945921  0.00054080  false
0.99365848  0.00634146  false
0.93335623  0.06664377  false
0.99423039  0.00576961  false
0.98908663  0.01091336  false
0.99442768  0.00557238  false
0.99780887  0.00219112  false
0.99719256  0.00280746  false
0.99448317  0.00551683  false
0.99723178  0.00276819  false
0.99872690  0.00127309  false
0.99666721  0.00333284  false
0.50909549  0.49090451  false
0.99798101  0.00201905  false
0.97578853  0.02421154  false

With caching:

(def predictions-cached
  (let [split-data (-> example-params
                       (learning/prepare-raw-data-cached)
                       (learning/prepare-ml-data-cached ml-params)
                       (learning/split-cached split-params))
        model (-> split-data
                  (learning/train-cached train-params))]
    (-> split-data
        (learning/predict-cached model)
        deref)))
16:33:06.448 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/predict /workspace/.cache/pocket/7f/7f84be3268acea8d79e2845e685a996a4acda8a5
[(tc/row-count predictions-cached)
 (tc/column-names predictions-cached)]
[467 (0 1 :ri)]
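
The prediction columns can be turned into hard labels by thresholding. The sketch below uses three rows copied from the output above as plain maps. It assumes that column 1 holds the score for the resistant class, which is consistent with the rows shown but is not stated by the pipeline; predicted-label is a hypothetical helper.

```clojure
;; Three rows copied from the predictions output, as plain maps.
(def sample-rows
  [{0 0.05187543, 1 0.94812459, :ri true}
   {0 0.99304366, 1 0.00695636, :ri false}
   {0 0.50909549, 1 0.49090451, :ri false}])

;; Hypothetical helper: call a sample resistant when the class-1 score
;; (assumed to be the resistant class) reaches the threshold.
(defn predicted-label
  [row threshold]
  (>= (get row 1) threshold))

(map #(predicted-label % 0.5) sample-rows)
;; => (true false false)

(map #(= (predicted-label % 0.5) (:ri %)) sample-rows)
;; => (true true true)
```

The third row sits close to the 0.5 threshold, so a lower threshold such as 0.4 would flip its predicted label.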

Stage 6: Measure Performance

Calculates the ROCAUC (area under the ROC curve) and PRAUC (area under the precision-recall curve) metrics. Returns a map with :n-train, :n-test, :pri (prevalence), :PRAUC, and :ROCAUC.

Without caching:

(def metrics-no-cache
  (learning/measure split-data-no-cache predictions-no-cache))
metrics-no-cache
{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7233731598301811,
 :ROCAUC 0.8683092085001545}

With caching:

(def metrics-cached
  (let [split-data (-> example-params
                       (learning/prepare-raw-data-cached)
                       (learning/prepare-ml-data-cached ml-params)
                       (learning/split-cached split-params))
        model (-> split-data
                  (learning/train-cached train-params))
        predictions (-> split-data
                        (learning/predict-cached model))]
    (-> split-data
        (learning/measure-cached predictions)
        deref)))
16:33:06.464 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/measure /workspace/.cache/pocket/41/417a4449af575d32b0b597cbb26c7c79879a0220
metrics-cached
{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7304522704648382,
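 :ROCAUC 0.8816753926701566}

Note that :PRAUC and :ROCAUC differ slightly from the uncached values above. The cached path loads a model that was trained in an earlier run (see the disk cache hit for train), whereas the uncached path retrains it here, so the two models are presumably not identical, for example if training is not fully deterministic across runs.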
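
For intuition about what :ROCAUC measures, here is a minimal sketch of the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. It is not how learning/measure computes the metric; roc-auc is a hypothetical helper, and the quadratic loop is only suitable for small inputs.

```clojure
;; Hypothetical helper: ROC AUC by direct pairwise comparison.
(defn roc-auc
  [scores labels]
  (let [pairs (map vector scores labels)
        pos (keep (fn [[s l]] (when l s)) pairs)     ; scores of positives
        neg (keep (fn [[s l]] (when-not l s)) pairs) ; scores of negatives
        wins (for [p pos, n neg]
               (cond (> p n) 1.0
                     (== p n) 0.5
                     :else 0.0))]
    (/ (reduce + wins) (* (count pos) (count neg)))))

(roc-auc [0.9 0.8 0.3 0.1] [true false true false])
;; => 0.75
```

In the toy call, three of the four positive/negative pairs are ranked correctly. As a side note, the reported :pri is consistent with 85 resistant samples among the 467 test rows (85/467 ≈ 0.182).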

Summary

The complete pipeline has 6 stages:

  1. prepare-raw-data: Load metadata and raw spectra, then filter by species/antibiotic/site/year
  2. prepare-ml-data: Preprocess (sqrt, smooth, baseline, normalize) and bin spectra
  3. split: Create train/test split
  4. train: Train XGBoost classifier
  5. predict: Generate predictions on test set
  6. measure: Calculate ROCAUC and PRAUC metrics

Caching Benefits

The cached version enables:

  • Reusing expensive computations: Preprocessing spectra is slow; cached results are instant
  • Reproducible experiments: Cache keys are deterministic based on inputs
  • Incremental development: Add pipeline stages without re-running earlier ones
  • Parallel execution: Different scenarios can share cached preprocessing results
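
To illustrate what a deterministic cache key can look like, here is a minimal sketch that hashes a function name together with its arguments. This is not Pocket's key derivation, which is not shown in this notebook: canonical, sha1-hex, cache-key, and the choice of SHA-1 are all assumptions made for the sketch.

```clojure
(require '[clojure.walk :as walk])
(import 'java.security.MessageDigest)

;; Replace every map with a sorted map so that printing does not depend
;; on insertion order. Assumes the keys of each map are mutually comparable.
(defn canonical
  [x]
  (walk/postwalk #(if (map? %) (into (sorted-map) %) %) x))

;; Hex-encoded SHA-1 digest of a string.
(defn sha1-hex
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-1")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

;; Hypothetical key: hash of the printed function symbol and arguments.
(defn cache-key
  [fn-sym args]
  (sha1-hex (pr-str [fn-sym (canonical args)])))

(def k1 (cache-key 'my.ns/stage [{:site :A :year 2018}]))
(def k2 (cache-key 'my.ns/stage [{:year 2018 :site :A}]))

(= k1 k2)
;; => true
(= k1 (cache-key 'my.ns/stage [{:site :A :year 2017}]))
;; => false
```

Equal inputs give equal keys regardless of map ordering, and any change to a parameter gives a different key, which is the property that lets different scenarios share earlier stages.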

Pipeline Composition

Notice the threading pattern in the cached version:

(-> example-params
    (learning/prepare-raw-data-cached)
    (learning/prepare-ml-data-cached ml-params)
    deref)

Each cached function returns a Cached value (an IDeref) rather than the result itself, and that value is passed as input to the next stage. The deref at the end forces evaluation of the whole chain.
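
The same shape can be imitated in plain Clojure, which may help when reading the threaded forms. In the sketch below, clojure.core/delay stands in for Cached: it is also an IDeref, but it only memoizes in memory, whereas the log lines above show Pocket reading from disk. force-arg, lazy-stage, and the two toy stages are hypothetical names.

```clojure
;; Illustration only: delay stands in for Pocket's Cached.

(defn force-arg
  "Derefs x if it is deref-able; returns it unchanged otherwise."
  [x]
  (if (instance? clojure.lang.IDeref x) @x x))

(defn lazy-stage
  "Hypothetical helper: wraps f so that calling it returns a delay.
  Deref-able arguments are forced only when that delay is."
  [f]
  (fn [& args]
    (delay (apply f (map force-arg args)))))

;; Records the order in which stage bodies actually run.
(def calls (atom []))

(def load-stage
  (lazy-stage (fn [{:keys [n]}]
                (swap! calls conj :load)
                (vec (range n)))))

(def scale-stage
  (lazy-stage (fn [xs {:keys [factor]}]
                (swap! calls conj :scale)
                (mapv #(* factor %) xs))))

(def pipeline
  (-> {:n 4}
      (load-stage)
      (scale-stage {:factor 10})))

@calls     ;; => []  (building the pipeline runs nothing)
@pipeline  ;; => [0 10 20 30]
@calls     ;; => [:load :scale]
```

Building the threaded form only wires stages together; the work happens at the final deref, upstream stages first.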

source: notebooks/amr_book/ml_pipeline_walkthrough.clj