6  Caching and Scenarios

Running the pipeline with direct function calls works, but preprocessing thousands of spectra takes minutes — and we don’t want to redo that work every time we change a downstream parameter.

Pocket solves this by caching each stage’s output on disk. When the same function is called again with the same arguments, the stored result is loaded instead of being recomputed.

This notebook shows the caching pattern and then uses it to evaluate multiple species/antibiotic combinations.

(ns amr-book.caching-and-scenarios
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

The caching pattern

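Pocket caches each pipeline stage’s output on disk, keyed by function identity + arguments. learning.clj (the scicloj.amr.learning namespace) defines named cached wrappers (e.g. learning/prepare-raw-data-cached) built with pocket/caching-fn. Each wrapper returns a Cached object: an IDeref that triggers computation (or loads from cache) only on deref. A Cached value can be passed straight to the next stage’s wrapper, so the pipeline below derefs only the final metrics.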
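To make the idea concrete, here is a minimal from-scratch sketch of a disk-backed lazy value. It is not Pocket's implementation or API: cache-dir, cache-file, and cached are hypothetical names, results are stored as EDN text, and the key is a SHA-1 of the printed function name and arguments (an assumption made for simplicity).

```clojure
(ns caching-sketch
  (:require [clojure.edn :as edn]
            [clojure.java.io :as io])
  (:import (java.security MessageDigest)))

;; Hypothetical cache location for this sketch (not Pocket's layout).
(def cache-dir
  (io/file (System/getProperty "java.io.tmpdir") "caching-sketch"))

(defn sha1-hex
  "Hex-encoded SHA-1 digest of a string."
  [^String s]
  (->> (.digest (MessageDigest/getInstance "SHA-1") (.getBytes s "UTF-8"))
       (map #(format "%02x" %))
       (apply str)))

(defn cache-file
  "File that holds the result of calling the function named fn-name on args."
  [fn-name args]
  (io/file cache-dir (str (sha1-hex (pr-str [fn-name args])) ".edn")))

(defn cached
  "Return an IDeref. Nothing runs until deref; then the result is read from
  disk if present, otherwise computed with (apply f args) and written out."
  [fn-name f & args]
  (reify clojure.lang.IDeref
    (deref [_]
      (let [file (cache-file fn-name args)]
        (if (.exists file)
          (edn/read-string (slurp file))
          (let [result (apply f args)]
            (io/make-parents file)
            (spit file (pr-str result))
            result))))))

;; Usage: count how often the underlying function really runs.
(def call-count (atom 0))

(defn slow-square [x]
  (swap! call-count inc)
  (* x x))

(def squared (cached 'slow-square slow-square 12))

@squared ; first deref computes (unless a file from an earlier run exists)
@squared ; second deref reads the stored EDN file
```

Creating the value costs nothing; only deref touches the disk or runs the function. That laziness is what lets a pipeline be described end to end and then evaluated only where a result is actually needed.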

Here is the full cached pipeline for one scenario:

(defn run-scenario
  "Run the AMR prediction pipeline for one species/antibiotic/site/year.
  Returns the evaluation metrics, or nil if insufficient data."
  [{:keys [site year species antibiotic]}]
  (let [raw-data (-> {:site site :year year
                      :species species :antibiotic antibiotic}
                     (learning/prepare-raw-data-cached))
        ml-data (-> raw-data
                    (learning/prepare-ml-data-cached
                     {:preprocessing-params {:smooth-window 21 :smooth-polynomial 3}
                      :binning-params {:range [2000 20000] :step 3}}))
        split-data (-> ml-data
                       (learning/split-cached {:seed 1}))
        model (-> split-data
                  (learning/train-cached
                   {:model-type :xgboost/classification
                    :round 50
                    :num-class 2}))
        predictions (-> split-data
                        (learning/predict-cached model))
        metrics (-> split-data
                    (learning/measure-cached predictions)
                    deref)]
    metrics))

Example: single scenario

(def example-metrics
  (run-scenario {:site :A
                 :year 2018
                 :species bacteria/E-coli
                 :antibiotic :Cefepime}))
16:18:09.643 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] INFO scicloj.pocket.impl.cache -- Mem-cache reconfigured: {:policy :lru, :threshold 8}
16:18:09.646 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/measure /workspace/.cache/pocket/41/417a4449af575d32b0b597cbb26c7c79879a0220
example-metrics
{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7304522704648382,
 :ROCAUC 0.8816753926701566}

Run it again. The same metrics come back from the cache, with no recomputation:

(def example-metrics-2
  (run-scenario {:site :A
                 :year 2018
                 :species bacteria/E-coli
                 :antibiotic :Cefepime}))
example-metrics-2
{:n-train 933,
 :n-test 467,
 :pri 0.18201284796573874,
 :PRAUC 0.7304522704648382,
 :ROCAUC 0.8816753926701566}

Running multiple scenarios

Now we can sweep over antibiotics cheaply, since the expensive preprocessing is cached and shared across scenarios that use the same spectra.

(def e-coli-antibiotics
  [:Cefepime :Ciprofloxacin :Ceftriaxone])
(def scenario-results
  (->> e-coli-antibiotics
       (map (fn [ab]
              (some-> (run-scenario {:site :A
                                     :year 2018
                                     :species bacteria/E-coli
                                     :antibiotic ab})
                      (assoc :antibiotic ab))))
       (remove nil?)
       tc/dataset))
16:18:09.652 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/measure /workspace/.cache/pocket/2b/2bc5e2f3d1449bd2e80faa653dae244a0f531531
16:18:09.653 [nREPL-session-af2642d0-19b6-4729-aca6-52b33b4e3d41] DEBUG scicloj.pocket.impl.cache -- Cache hit (disk): scicloj.amr.learning/measure /workspace/.cache/pocket/ce/ce129707cb58f164486cd5154cb49e9a78ad85b6
scenario-results

_unnamed [3 6]:

:n-train  :n-test  :pri        :PRAUC      :ROCAUC     :antibiotic
933       467      0.18201285  0.73045227  0.88167539  :Cefepime
933       467      0.30620985  0.69228608  0.78975654  :Ciprofloxacin
933       467      0.22269807  0.72280037  0.82506887  :Ceftriaxone
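
The sweep relies on run-scenario returning nil when there is too little data: some-> skips the assoc for a nil result, and remove then drops it. The following sketch isolates that pattern with a hypothetical stub-run-scenario, which stands in for the real pipeline and returns a placeholder map rather than real metrics.

```clojure
(defn stub-run-scenario
  "Hypothetical stand-in for run-scenario: pretends one antibiotic
  has insufficient data and returns nil for it."
  [{:keys [antibiotic]}]
  (when-not (= antibiotic :Ciprofloxacin)
    {:status :ok}))

(def stub-results
  (->> [:Cefepime :Ciprofloxacin :Ceftriaxone]
       (map (fn [ab]
              ;; some-> stops at nil, so assoc never turns a missing
              ;; result into a map that holds only :antibiotic.
              (some-> (stub-run-scenario {:antibiotic ab})
                      (assoc :antibiotic ab))))
       (remove nil?)
       vec))

stub-results
```

Using plain -> here would yield {:antibiotic :Ciprofloxacin} for the missing scenario, because assoc on nil creates a new map. That row would then appear in the dataset with empty metric columns.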

Comparing results

A quick summary table showing how well the model discriminates resistance for each antibiotic:

(-> scenario-results
    (tc/select-columns [:antibiotic :n-train :n-test :pri :ROCAUC :PRAUC])
    (tc/order-by [:ROCAUC] :desc))

_unnamed [3 6]:

:antibiotic     :n-train  :n-test  :pri        :ROCAUC     :PRAUC
:Cefepime       933       467      0.18201285  0.88167539  0.73045227
:Ceftriaxone    933       467      0.22269807  0.82506887  0.72280037
:Ciprofloxacin  933       467      0.30620985  0.78975654  0.69228608

The :pri column shows resistance prevalence. Antibiotics with very low or very high prevalence are harder to evaluate meaningfully: one class is rare, so the model has little signal to learn from.
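
Prevalence also gives PRAUC a reference point, since a classifier that ranks samples at random is expected to score a PRAUC close to the prevalence. The sketch below assumes :pri is the fraction of resistant isolates in the test set (in the table, :pri times :n-test lands on whole numbers, which is consistent with that reading). The names with-baseline, :n-resistant, and :prauc-lift are introduced here for illustration.

```clojure
;; Values copied from the scenario-results table in this notebook.
(def results
  [{:antibiotic :Cefepime      :n-test 467 :pri 0.18201285 :PRAUC 0.73045227}
   {:antibiotic :Ciprofloxacin :n-test 467 :pri 0.30620985 :PRAUC 0.69228608}
   {:antibiotic :Ceftriaxone   :n-test 467 :pri 0.22269807 :PRAUC 0.72280037}])

(defn with-baseline
  "Add the implied number of resistant test samples and the ratio of
  PRAUC to prevalence (the PRAUC expected from random ranking)."
  [{:keys [n-test pri PRAUC] :as row}]
  (assoc row
         :n-resistant (Math/round (double (* pri n-test)))
         :prauc-lift (/ PRAUC pri)))

(mapv with-baseline results)
```

Under that assumption the three antibiotics have 85, 143, and 104 resistant test samples, and PRAUC is roughly 4.0, 2.3, and 3.2 times the random baseline. Raw PRAUC values differ much less across the three antibiotics than these ratios do.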

What gets cached?

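Pocket stores results under $POCKET_BASE_CACHE_DIR/.cache/ (in the run logged above, entries land in /workspace/.cache/pocket/, in subdirectories named after the first two characters of each entry's 40-character hexadecimal key). Each pipeline stage gets its own cache entry, keyed by function + arguments. Because prepare-ml-data (the expensive step) depends only on the raw data and the preprocessing and binning parameters, it is computed once and reused across all downstream variations (different splits, hyperparameters, etc.).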
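
The reuse follows directly from keying each stage by function + arguments. The sketch below models stages as plain data and uses the printed description as a stand-in key. How Pocket actually derives its keys is not shown in this notebook, so stage, cache-key, and pipeline-stages are hypothetical, and the parameter maps are abbreviated.

```clojure
(defn stage
  "Describe a stage as data: a function name plus its arguments.
  Upstream stages are nested inside, so a stage's identity covers
  everything it depends on."
  [fn-name & args]
  {:fn fn-name :args (vec args)})

(defn cache-key
  "Stand-in key: the printed stage description.
  A real cache would typically hash this."
  [stage-desc]
  (pr-str stage-desc))

(defn pipeline-stages
  "Build stage descriptions for one run; only train-params varies."
  [train-params]
  (let [raw   (stage 'prepare-raw-data {:site :A :year 2018})
        ml    (stage 'prepare-ml-data raw {:smooth-window 21 :smooth-polynomial 3})
        split (stage 'split ml {:seed 1})
        model (stage 'train split train-params)]
    {:ml ml :split split :model model}))

(def rounds-50  (pipeline-stages {:round 50}))
(def rounds-100 (pipeline-stages {:round 100}))

;; Same upstream description, same key: the preprocessing entry is reused.
(= (cache-key (:ml rounds-50)) (cache-key (:ml rounds-100)))       ; => true

;; Different training parameters, different key: only training reruns.
(= (cache-key (:model rounds-50)) (cache-key (:model rounds-100))) ; => false
```

Changing a parameter invalidates only the stage that receives it and the stages downstream of it. Everything upstream keeps its key and therefore its cached result.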

Next steps

  • Add more species (S. aureus, K. pneumoniae) and their antibiotics
  • Sweep over sites (A–D) and years (2015–2018)
  • Compare against the results reported by Weis et al.
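
As a starting point for the site and year sweep, the scenario maps can be generated with for. This is a sketch under assumptions: the site keywords :B, :C, and :D are guessed by analogy with :A, and :species holds a placeholder keyword rather than the bacteria/E-coli definition used above.

```clojure
(def scenario-grid
  (for [site [:A :B :C :D]       ; assumed keywords for sites A to D
        year (range 2015 2019)   ; 2015 through 2018 inclusive
        antibiotic [:Cefepime :Ciprofloxacin :Ceftriaxone]]
    {:site site
     :year year
     :species :e-coli-placeholder ; the notebook passes bacteria/E-coli here
     :antibiotic antibiotic}))

(count scenario-grid) ; => 48 (4 sites x 4 years x 3 antibiotics)
(first scenario-grid)
```

Each map has the shape run-scenario expects, so the grid could be fed through the same map, some->, and remove pipeline used for the E. coli sweep.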
source: notebooks/amr_book/caching_and_scenarios.clj