4  Preprocessing DRIAMS Spectra

Before we can train a classifier, each raw MALDI-TOF spectrum must be preprocessed and converted into a fixed-length feature vector.

This notebook walks through every step of that transformation for one spectrum, then shows how the full dataset is assembled for machine learning.

(ns amr-book.preprocessing-driams
  (:require
   ;; AMR ML pipeline (prepare, train, predict, measure):
   [scicloj.amr.learning :as learning]
   ;; AMR data loading utilities:
   [scicloj.amr.data.ingestion :as ingestion]
   ;; Bacterial species definitions and antibiotic lists:
   [scicloj.amr.data.bacteria :as bacteria]
   ;; Ripple MALDI signal processing (https://scicloj.github.io/ripple):
   [scicloj.ripple.maldi :as ripple]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Column-level operations:
   [tablecloth.column.api :as tcc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

From the paper

Weis et al. describe the preprocessing as follows (preprint, pp 28–29):

The following preprocessing steps are performed using the R package MaldiQuant version 1.19: (1) the measured intensity is transformed with a square-root method to stabilize the variance, (2) smoothing using the Savitzky-Golay algorithm with half-window-size 10 is applied, (3) an estimate of the baseline is removed in 20 iterations of the SNIP algorithm, (4) the intensity is calibrated using the total-ion-current (TIC), and (5) the spectra are trimmed to values in a 2,000 to 20,000 Da range.

After preprocessing, each spectrum is represented by a set of measurements, each of them described by its corresponding mass-to-charge ratio and intensity. However, this representation results in each sample having potentially a different dimensionality […]. Since the machine learning methods used in this manuscript require their input to be a feature vector of fixed dimensionality, intensity measurements are binned using the bin size of 3 Da. […] each sample is represented by a vector containing 6,000 features.

Examining the actual R code reveals two details not mentioned in the text:

  • The Savitzky-Golay smoothing uses MALDIquant’s default polynomial order of 3 (not specified explicitly in their script).

  • The SNIP baseline removal is applied twice: removeBaseline(removeBaseline(spectra, method="SNIP", iterations=20)).

We now walk through each of these steps.

Choosing a scenario

We pick a single species / antibiotic / site / year combination — E. coli vs Cefepime from DRIAMS-A 2018:

(def params
  {:site :A
   :year 2018
   :species bacteria/E-coli
   :antibiotic :Cefepime})

prepare-raw-data loads the DRIAMS metadata, joins it with available spectrum files, and filters to our scenario. The result has a :path column (spectrum file) and a :ri column (true = resistant or intermediate).

(def raw-data
  (learning/prepare-raw-data params))
(tc/row-count raw-data)
1400
(tc/select-columns raw-data [:code :species :ri :path])

raw-data [Escherichia coli / Cefepime / A / 2018] [1400 4]:

:code :species :ri :path
69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz
2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1.txt.gz
51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1.txt.gz
11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1.txt.gz
d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1 Escherichia coli true data/DRIAMS/DRIAMS-A/raw/2018/d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1.txt.gz
99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1.txt.gz
b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1.txt.gz
51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1.txt.gz
ad628afc-fc57-4084-ba92-90c18d938319_MALDI1 Escherichia coli true data/DRIAMS/DRIAMS-A/raw/2018/ad628afc-fc57-4084-ba92-90c18d938319_MALDI1.txt.gz
5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1.txt.gz
525421d5-7c97-4523-9d82-5be7b312955c_MALDI2 Escherichia coli true data/DRIAMS/DRIAMS-A/raw/2018/525421d5-7c97-4523-9d82-5be7b312955c_MALDI2.txt.gz
1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2 Escherichia coli true data/DRIAMS/DRIAMS-A/raw/2018/1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2.txt.gz
0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2.txt.gz
9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2.txt.gz
450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2.txt.gz
fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2 Escherichia coli true data/DRIAMS/DRIAMS-A/raw/2018/fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2.txt.gz
81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2.txt.gz
cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2.txt.gz
013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2.txt.gz
05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2 Escherichia coli false data/DRIAMS/DRIAMS-A/raw/2018/05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2.txt.gz
6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2 Escherichia coli true data/DRIAMS/DRIAMS-A/raw/2018/6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2.txt.gz

Resistance distribution

How many spectra are resistant vs susceptible?

(def resistance-counts
  (-> raw-data
      (tc/group-by [:ri])
      (tc/aggregate {:count tc/row-count})))
resistance-counts

_unnamed [2 2]:

:ri :count
false 1163
true 237
(-> resistance-counts
    (tc/map-columns :label [:ri] #(if % "Resistant/Intermediate" "Susceptible"))
    (plotly/base {:=x :label
                  :=y :count
                  :=title "Resistance distribution — E. coli / Cefepime"
                  :=x-title ""
                  :=y-title "Number of spectra"})
    (plotly/layer-bar)
    plotly/plot)

Loading a single raw spectrum

Each spectrum is a gzipped text file with space-separated mass/intensity pairs. Let’s pick the first one:

(def spectrum-path
  (-> raw-data :path first))
(def raw-spectrum
  (ingestion/load-raw-spectrum spectrum-path))
raw-spectrum

data/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz [20808 2]:

:mass :intensity
1960.06193683 1348
1960.47603356 1375
1960.89017416 1201
1961.30435864 1070
1961.71858699 1223
1962.13285920 1176
1962.54717529 1128
1962.96153526 1164
1963.37593909 1083
1963.79038680 1187
20123.31533040 0
20124.65110966 4
20125.98693369 5
20127.32280249 1
20128.65871605 0
20129.99467439 6
20131.33067750 2
20132.66672538 0
20134.00281803 3
20135.33895544 4
20136.67513763 1
{:points (tc/row-count raw-spectrum)
 :mass-min (-> raw-spectrum :mass first)
 :mass-max (-> raw-spectrum :mass last)}
{:points 20808, :mass-min 1960.06193682729, :mass-max 20136.6751376317}
(-> raw-spectrum
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Raw spectrum"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (a.u.)"})
    (plotly/layer-line)
    plotly/plot)

Step-by-step preprocessing

The steps below follow the paper’s R code exactly:

  1. Square root transform (variance stabilization)
  2. Savitzky-Golay smoothing (half-window-size 10, polynomial order 3)
  3. SNIP baseline removal (20 iterations, applied twice)
  4. TIC normalization (total area → 1.0)

We apply each one separately so we can see its effect.

Step 1 — Square root transform

Taking the square root compresses the dynamic range, giving lower-intensity peaks more weight relative to dominant ones.

(def masses (:mass raw-spectrum))
(def raw-intensities (:intensity raw-spectrum))
(def sqrt-intensities
  (ripple/sqrt-transform raw-intensities))
(tc/dataset {:mass masses
             :raw-intensity raw-intensities
             :sqrt-intensity sqrt-intensities})

_unnamed [20808 3]:

:mass :raw-intensity :sqrt-intensity
1960.06193683 1348 36.71511950
1960.47603356 1375 37.08099244
1960.89017416 1201 34.65544690
1961.30435864 1070 32.71085447
1961.71858699 1223 34.97141690
1962.13285920 1176 34.29285640
1962.54717529 1128 33.58571125
1962.96153526 1164 34.11744422
1963.37593909 1083 32.90896534
1963.79038680 1187 34.45286635
20123.31533040 0 0.00000000
20124.65110966 4 2.00000000
20125.98693369 5 2.23606798
20127.32280249 1 1.00000000
20128.65871605 0 0.00000000
20129.99467439 6 2.44948974
20131.33067750 2 1.41421356
20132.66672538 0 0.00000000
20134.00281803 3 1.73205081
20135.33895544 4 2.00000000
20136.67513763 1 1.00000000
(-> (tc/dataset {:mass masses :intensity sqrt-intensities})
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "After square root transform"
                  :=x-title "m/z (Da)"
                  :=y-title "√Intensity"})
    (plotly/layer-line)
    plotly/plot)

Step 2 — Savitzky-Golay smoothing

A polynomial fit in a sliding window removes high-frequency noise while preserving peak shapes. The paper uses half-window-size 10 (= full window of 21 points) with MALDIquant’s default polynomial order of 3.

(def smoothed-intensities
  (ripple/savitzky-golay-smooth sqrt-intensities
                                {:window-size 21
                                 :polynomial-order 3}))
(tc/dataset {:mass masses
             :sqrt sqrt-intensities
             :smoothed smoothed-intensities})

_unnamed [20808 3]:

:mass :sqrt :smoothed
1960.06193683 36.71511950 35.91567123
1960.47603356 37.08099244 35.64373075
1960.89017416 34.65544690 35.38123010
1961.30435864 32.71085447 35.15532515
1961.71858699 34.97141690 34.75422774
1962.13285920 34.29285640 34.42380707
1962.54717529 33.58571125 34.01278408
1962.96153526 34.11744422 33.67445374
1963.37593909 32.90896534 33.39810535
1963.79038680 34.45286635 33.01740623
20123.31533040 0.00000000 1.83800122
20124.65110966 2.00000000 1.78236195
20125.98693369 2.23606798 1.66330419
20127.32280249 1.00000000 1.47755892
20128.65871605 0.00000000 1.34945531
20129.99467439 2.44948974 1.25414492
20131.33067750 1.41421356 1.20270645
20132.66672538 0.00000000 1.15549548
20134.00281803 1.73205081 1.12848318
20135.33895544 2.00000000 1.19463651
20136.67513763 1.00000000 1.20338784
(-> (tc/dataset {:mass masses
                 :sqrt sqrt-intensities
                 :smoothed smoothed-intensities})
    (tc/pivot->longer [:sqrt :smoothed]
                      {:target-columns :step
                       :value-column-name :intensity})
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=color :step
                  :=title "Smoothing effect (zoom in to see detail)"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity"})
    (plotly/layer-line)
    plotly/plot)

Step 3 — SNIP baseline removal (twice)

The SNIP algorithm estimates and subtracts the slowly varying baseline caused by chemical matrix effects.

The paper’s R code applies SNIP twice in succession: removeBaseline(removeBaseline(spectra, method="SNIP", iterations=20)). The second pass catches any residual baseline that the first pass left behind.

(def baseline-corrected-once
  (ripple/snip-baseline-removal smoothed-intensities
                                {:iterations 20}))
(def baseline-corrected
  (ripple/snip-baseline-removal baseline-corrected-once
                                {:iterations 20}))
(tc/dataset {:mass masses
             :smoothed smoothed-intensities
             :after-1st-snip baseline-corrected-once
             :after-2nd-snip baseline-corrected})

_unnamed [20808 4]:

:mass :smoothed :after-1st-snip :after-2nd-snip
1960.06193683 35.91567123 0.00000000 0.00000000
1960.47603356 35.64373075 0.07315121 0.07026769
1960.89017416 35.38123010 0.15574225 0.14997520
1961.30435864 35.15532515 0.27492898 0.26627840
1961.71858699 34.75422774 0.21892326 0.21206534
1962.13285920 34.42380707 0.23359428 0.22902233
1962.54717529 34.01278408 0.19205007 0.18976409
1962.96153526 33.67445374 0.22419190 0.22419190
1963.37593909 33.39810535 0.29585815 0.29585815
1963.79038680 33.01740623 0.26317367 0.26317367
20123.31533040 1.83800122 0.69635000 0.69000843
20124.65110966 1.78236195 0.64235673 0.63680786
20125.98693369 1.66330419 0.52494499 0.52018881
20127.32280249 1.47755892 0.34084572 0.33688224
20128.65871605 1.34945531 0.21438811 0.21121732
20129.99467439 1.25414492 0.12072373 0.11834564
20131.33067750 1.20270645 0.07093126 0.06934587
20132.66672538 1.15549548 0.02536630 0.02457360
20134.00281803 1.12848318 0.00000000 0.00000000
20135.33895544 1.19463651 0.02870100 0.02870100
20136.67513763 1.20338784 0.00000000 0.00000000
(-> (tc/dataset {:mass masses
                 :smoothed smoothed-intensities
                 :after-first-snip baseline-corrected-once
                 :after-second-snip baseline-corrected})
    (tc/pivot->longer [:smoothed :after-first-snip :after-second-snip]
                      {:target-columns :step
                       :value-column-name :intensity})
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=color :step
                  :=title "Baseline removal — single vs double SNIP"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity"})
    (plotly/layer-line)
    plotly/plot)

Step 4 — TIC normalization

Total Ion Current normalization scales the spectrum so its area (by the trapezoidal rule) equals 1.0. This makes spectra from different measurements comparable.

(def normalized-intensities
  (ripple/tic-normalize masses baseline-corrected
                        {:target-area 1.0}))
(tc/dataset {:mass masses
             :baseline-corrected baseline-corrected
             :normalized normalized-intensities})

_unnamed [20808 3]:

:mass :baseline-corrected :normalized
1960.06193683 0.00000000 0.00000000
1960.47603356 0.07026769 0.00000295
1960.89017416 0.14997520 0.00000630
1961.30435864 0.26627840 0.00001119
1961.71858699 0.21206534 0.00000891
1962.13285920 0.22902233 0.00000962
1962.54717529 0.18976409 0.00000797
1962.96153526 0.22419190 0.00000942
1963.37593909 0.29585815 0.00001243
1963.79038680 0.26317367 0.00001106
20123.31533040 0.69000843 0.00002900
20124.65110966 0.63680786 0.00002676
20125.98693369 0.52018881 0.00002186
20127.32280249 0.33688224 0.00001416
20128.65871605 0.21121732 0.00000888
20129.99467439 0.11834564 0.00000497
20131.33067750 0.06934587 0.00000291
20132.66672538 0.02457360 0.00000103
20134.00281803 0.00000000 0.00000000
20135.33895544 0.02870100 0.00000121
20136.67513763 0.00000000 0.00000000
(-> (tc/dataset {:mass masses :intensity normalized-intensities})
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "After TIC normalization"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (normalized)"})
    (plotly/layer-line)
    plotly/plot)

Trimming and binning

For machine learning we need a fixed-length feature vector. The DRIAMS protocol trims to [2000, 20000] Da and bins into 3 Da wide bins, producing 6,000 features.

Trimming

(def preprocessed
  (tc/dataset {:mass masses :intensity normalized-intensities}))
(def trimmed
  (ripple/trim-spectrum preprocessed {:range [2000 20000]}))
{:points-before (tc/row-count preprocessed)
 :points-after (tc/row-count trimmed)}
{:points-before 20808, :points-after 20609}
(-> trimmed
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Preprocessed spectrum (trimmed to 2000–20000 Da)"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (normalized)"})
    (plotly/layer-line)
    plotly/plot)

Binning

The m/z axis is partitioned into 3 Da bins and the intensities falling into each bin are summed, producing a vector of 6,000 features.

(def binned
  (ripple/bin-spectrum trimmed {:range [2000 20000] :step 3}))
(count binned)
6000

Here are the first few values:

(vec (take 10 binned))
[0.0014051348219647724
 0.001138190111135852
 6.487409344596183E-4
 1.6951863616377594E-4
 3.85875544996719E-5
 5.060839052549854E-4
 0.001299036331936515
 7.310534543564925E-4
 6.292270612348158E-4
 3.4378357407736145E-4]

Visualized as a line chart (bin index on the x-axis):

(-> (tc/dataset {:bin (range (count binned))
                 :intensity (vec binned)})
    (plotly/base {:=x :bin
                  :=y :intensity
                  :=title "Binned feature vector (6000 features)"
                  :=x-title "Bin index"
                  :=y-title "Summed intensity"})
    (plotly/layer-line)
    plotly/plot)

Processing all spectra

learning/prepare-ml-data applies preprocessing and binning to every spectrum in the dataset, producing a table with one row per spectrum and 6,000 feature columns (:x0 through :x5999) plus the :ri target.

(def preprocessing-params
  {:smooth-window 21
   :smooth-polynomial 3})
(def ml-params
  {:preprocessing-params preprocessing-params
   :binning-params {:range [2000 20000] :step 3}})
(def ml-data
  (learning/prepare-ml-data raw-data ml-params))
{:rows (tc/row-count ml-data)
 :columns (count (tc/column-names ml-data))}
{:rows 1400, :columns 6001}

The first few columns:

(-> ml-data
    (tc/select-columns (into [:ri] (map #(keyword (str "x" %)) (range 5)))))

ml-data [1400 6]:

:ri :x0 :x1 :x2 :x3 :x4
false 0.00140513 0.00113819 0.00064874 0.00016952 0.00003859
false 0.00056953 0.00087902 0.00049127 0.00003817 0.00003095
false 0.00111533 0.00072134 0.00032575 0.00029950 0.00105973
false 0.00115295 0.00092728 0.00054034 0.00022159 0.00115163
true 0.00097009 0.00105452 0.00046938 0.00013652 0.00011332
false 0.00024568 0.00015792 0.00013027 0.00025076 0.00006439
false 0.00060575 0.00084451 0.00045604 0.00006230 0.00017407
false 0.00035647 0.00010136 0.00061886 0.00050318 0.00012080
true 0.00084311 0.00217295 0.00127265 0.00026835 0.00005026
false 0.00172205 0.00257105 0.00094202 0.00022629 0.00007186
true 0.00117619 0.00054554 0.00000983 0.00025545 0.00023035
true 0.00024839 0.00015676 0.00003847 0.00006366 0.00003414
false 0.00045581 0.00068686 0.00019928 0.00002988 0.00023445
false 0.00087060 0.00143934 0.00062429 0.00010663 0.00013869
false 0.00057721 0.00059694 0.00010637 0.00012898 0.00004320
true 0.00036581 0.00097373 0.00061454 0.00025811 0.00007275
false 0.00000160 0.00006738 0.00024693 0.00007348 0.00002933
false 0.00010252 0.00025605 0.00034148 0.00026237 0.00024249
false 0.00045981 0.00029983 0.00011500 0.00025876 0.00049484
false 0.00000165 0.00010861 0.00032309 0.00041009 0.00094332
true 0.00004385 0.00010137 0.00004282 0.00034013 0.00022050

A few spectra overlaid

Plotting the binned features for the first five spectra shows the typical variation across samples:

(def n-overlay 5)
(def overlay-data
  (let [bins (range 6000)]
    (->> (range n-overlay)
         (mapcat (fn [i]
                   (let [row (-> ml-data (tc/select-rows [i]))]
                     (map (fn [b]
                            {:bin b
                             :intensity (first (row (keyword (str "x" b))))
                             :spectrum (str "spectrum-" i)})
                          bins))))
         tc/dataset)))
(-> overlay-data
    (plotly/base {:=x :bin
                  :=y :intensity
                  :=color :spectrum
                  :=title "Binned features — first 5 spectra"
                  :=x-title "Bin index"
                  :=y-title "Summed intensity"})
    (plotly/layer-line {:=mark-opacity 0.5})
    plotly/plot)

Train/test split

As a final check, we split the dataset and verify it is ready for modeling:

(def split-data
  (learning/split ml-data {:seed 1}))
{:train-rows (tc/row-count (:train split-data))
 :test-rows (tc/row-count (:test split-data))
 :train-resistance-rate (-> split-data :train :ri tcc/mean)
 :test-resistance-rate (-> split-data :test :ri tcc/mean)}
{:train-rows 933,
 :test-rows 467,
 :train-resistance-rate 0.16291532690246516,
 :test-resistance-rate 0.18201284796573874}

Resistance distribution in train and test sets:

(-> (concat
     (map (fn [ri] {:split "train" :label (if ri "R/I" "S")}) ((:train split-data) :ri))
     (map (fn [ri] {:split "test" :label (if ri "R/I" "S")}) ((:test split-data) :ri)))
    tc/dataset
    (tc/group-by [:split :label])
    (tc/aggregate {:count tc/row-count})
    (plotly/base {:=x :split
                  :=y :count
                  :=color :label
                  :=title "Resistance distribution — train vs test"
                  :=x-title ""
                  :=y-title "Count"})
    (plotly/layer-bar)
    plotly/plot)

The dataset is ready. Each row is a preprocessed, binned MALDI spectrum with a resistance label — exactly what the classifier needs.

source: notebooks/amr_book/preprocessing_driams.clj