4 Preprocessing DRIAMS Spectra
Before we can train a classifier, each raw MALDI-TOF spectrum must be preprocessed and converted into a fixed-length feature vector.
This notebook walks through every step of that transformation for one spectrum, then shows how the full dataset is assembled for machine learning.
(ns amr-book.preprocessing-driams
(:require
;; AMR ML pipeline (prepare, train, predict, measure):
[scicloj.amr.learning :as learning]
;; AMR data loading utilities:
[scicloj.amr.data.ingestion :as ingestion]
;; Bacterial species definitions and antibiotic lists:
[scicloj.amr.data.bacteria :as bacteria]
;; Ripple MALDI signal processing (https://scicloj.github.io/ripple):
[scicloj.ripple.maldi :as ripple]
;; Table processing (https://scicloj.github.io/tablecloth/):
[tablecloth.api :as tc]
;; Column-level operations:
[tablecloth.column.api :as tcc]
;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
[scicloj.tableplot.v1.plotly :as plotly]
;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
[scicloj.kindly.v4.kind :as kind]))From the paper
Weis et al. describe the preprocessing as follows (preprint, pp 28–29):
The following preprocessing steps are performed using the R package MaldiQuant version 1.19: (1) the measured intensity is transformed with a square-root method to stabilize the variance, (2) smoothing using the Savitzky-Golay algorithm with half-window-size 10 is applied, (3) an estimate of the baseline is removed in 20 iterations of the SNIP algorithm, (4) the intensity is calibrated using the total-ion-current (TIC), and (5) the spectra are trimmed to values in a 2,000 to 20,000 Da range.
After preprocessing, each spectrum is represented by a set of measurements, each of them described by its corresponding mass-to-charge ratio and intensity. However, this representation results in each sample having potentially a different dimensionality […]. Since the machine learning methods used in this manuscript require their input to be a feature vector of fixed dimensionality, intensity measurements are binned using the bin size of 3 Da. […] each sample is represented by a vector containing 6,000 features.
Examining the actual R code reveals two details not mentioned in the text:
The Savitzky-Golay smoothing uses MALDIquant’s default polynomial order of 3 (not specified explicitly in their script).
The SNIP baseline removal is applied twice:
removeBaseline(removeBaseline(spectra, method="SNIP", iterations=20)).
We now walk through each of these steps.
Choosing a scenario
We pick a single species / antibiotic / site / year combination — E. coli vs Cefepime from DRIAMS-A 2018:
(def params
{:site :A
:year 2018
:species bacteria/E-coli
:antibiotic :Cefepime})prepare-raw-data loads the DRIAMS metadata, joins it with available spectrum files, and filters to our scenario. The result has a :path column (spectrum file) and a :ri column (true = resistant or intermediate).
(def raw-data
(learning/prepare-raw-data params))(tc/row-count raw-data)1400(tc/select-columns raw-data [:code :species :ri :path])raw-data [Escherichia coli / Cefepime / A / 2018] [1400 4]:
| :code | :species | :ri | :path |
|---|---|---|---|
| 69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz |
| 2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1.txt.gz |
| 51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/51ef2973-26fd-4558-b7ac-e615fe177a18_MALDI1.txt.gz |
| 11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/11b8a987-04e9-42be-a933-7dbcb3da84aa_MALDI1.txt.gz |
| d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/d8ad07e0-c646-48a6-b901-cda98e8f7e67_MALDI1.txt.gz |
| 99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/99191931-3ed3-4a25-9bb2-e65e3fb6939d_MALDI1.txt.gz |
| b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/b9af2612-ddc8-41c0-90aa-04c8fcf16f8d_MALDI1.txt.gz |
| 51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/51b82a9c-6497-4d69-a498-8ce0d9e649c5_MALDI1.txt.gz |
| ad628afc-fc57-4084-ba92-90c18d938319_MALDI1 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/ad628afc-fc57-4084-ba92-90c18d938319_MALDI1.txt.gz |
| 5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/5553ef00-6b71-4696-95c6-993eb4fe0cdc_MALDI1.txt.gz |
| … | … | … | … |
| 525421d5-7c97-4523-9d82-5be7b312955c_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/525421d5-7c97-4523-9d82-5be7b312955c_MALDI2.txt.gz |
| 1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/1a98fcb2-6b06-4987-aa58-efc0f0da53fc_MALDI2.txt.gz |
| 0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/0fac6eeb-0386-444a-8498-97e0b727bcd7_MALDI2.txt.gz |
| 9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/9251e130-66b6-41ba-a1b2-ae41bce0a200_MALDI2.txt.gz |
| 450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/450fde6c-4db1-4060-b2f9-4bf0cc621f36_MALDI2.txt.gz |
| fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/fdb4014f-c1ae-4e30-bbdd-de6126976728_MALDI2.txt.gz |
| 81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/81ff5b1c-d5b9-4989-a4a3-eb30e735c7b3_MALDI2.txt.gz |
| cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/cb0b127c-dda2-4895-a020-38b3381ce790_MALDI2.txt.gz |
| 013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/013ca7ca-a5f1-4bd4-b24c-3686b6fa3cf3_MALDI2.txt.gz |
| 05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2 | Escherichia coli | false | data/DRIAMS/DRIAMS-A/raw/2018/05332488-dbae-453a-976f-ab4ff980f3b4_MALDI2.txt.gz |
| 6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2 | Escherichia coli | true | data/DRIAMS/DRIAMS-A/raw/2018/6d7a407b-2727-4721-905f-31d5030b3ba4_MALDI2.txt.gz |
Resistance distribution
How many spectra are resistant vs susceptible?
(def resistance-counts
(-> raw-data
(tc/group-by [:ri])
(tc/aggregate {:count tc/row-count})))resistance-counts_unnamed [2 2]:
| :ri | :count |
|---|---|
| false | 1163 |
| true | 237 |
(-> resistance-counts
(tc/map-columns :label [:ri] #(if % "Resistant/Intermediate" "Susceptible"))
(plotly/base {:=x :label
:=y :count
:=title "Resistance distribution — E. coli / Cefepime"
:=x-title ""
:=y-title "Number of spectra"})
(plotly/layer-bar)
plotly/plot)Loading a single raw spectrum
Each spectrum is a gzipped text file with space-separated mass/intensity pairs. Let’s pick the first one:
(def spectrum-path
(-> raw-data :path first))(def raw-spectrum
(ingestion/load-raw-spectrum spectrum-path))raw-spectrumdata/DRIAMS/DRIAMS-A/raw/2018/69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1.txt.gz [20808 2]:
| :mass | :intensity |
|---|---|
| 1960.06193683 | 1348 |
| 1960.47603356 | 1375 |
| 1960.89017416 | 1201 |
| 1961.30435864 | 1070 |
| 1961.71858699 | 1223 |
| 1962.13285920 | 1176 |
| 1962.54717529 | 1128 |
| 1962.96153526 | 1164 |
| 1963.37593909 | 1083 |
| 1963.79038680 | 1187 |
| … | … |
| 20123.31533040 | 0 |
| 20124.65110966 | 4 |
| 20125.98693369 | 5 |
| 20127.32280249 | 1 |
| 20128.65871605 | 0 |
| 20129.99467439 | 6 |
| 20131.33067750 | 2 |
| 20132.66672538 | 0 |
| 20134.00281803 | 3 |
| 20135.33895544 | 4 |
| 20136.67513763 | 1 |
{:points (tc/row-count raw-spectrum)
:mass-min (-> raw-spectrum :mass first)
:mass-max (-> raw-spectrum :mass last)}{:points 20808, :mass-min 1960.06193682729, :mass-max 20136.6751376317}(-> raw-spectrum
(plotly/base {:=x :mass
:=y :intensity
:=title "Raw spectrum"
:=x-title "m/z (Da)"
:=y-title "Intensity (a.u.)"})
(plotly/layer-line)
plotly/plot)Step-by-step preprocessing
The steps below follow the paper’s R code exactly:
- Square root transform (variance stabilization)
- Savitzky-Golay smoothing (half-window-size 10, polynomial order 3)
- SNIP baseline removal (20 iterations, applied twice)
- TIC normalization (total area → 1.0)
We apply each one separately so we can see its effect.
Step 1 — Square root transform
Taking the square root compresses the dynamic range, giving lower-intensity peaks more weight relative to dominant ones.
(def masses (:mass raw-spectrum))(def raw-intensities (:intensity raw-spectrum))(def sqrt-intensities
(ripple/sqrt-transform raw-intensities))(tc/dataset {:mass masses
:raw-intensity raw-intensities
:sqrt-intensity sqrt-intensities})_unnamed [20808 3]:
| :mass | :raw-intensity | :sqrt-intensity |
|---|---|---|
| 1960.06193683 | 1348 | 36.71511950 |
| 1960.47603356 | 1375 | 37.08099244 |
| 1960.89017416 | 1201 | 34.65544690 |
| 1961.30435864 | 1070 | 32.71085447 |
| 1961.71858699 | 1223 | 34.97141690 |
| 1962.13285920 | 1176 | 34.29285640 |
| 1962.54717529 | 1128 | 33.58571125 |
| 1962.96153526 | 1164 | 34.11744422 |
| 1963.37593909 | 1083 | 32.90896534 |
| 1963.79038680 | 1187 | 34.45286635 |
| … | … | … |
| 20123.31533040 | 0 | 0.00000000 |
| 20124.65110966 | 4 | 2.00000000 |
| 20125.98693369 | 5 | 2.23606798 |
| 20127.32280249 | 1 | 1.00000000 |
| 20128.65871605 | 0 | 0.00000000 |
| 20129.99467439 | 6 | 2.44948974 |
| 20131.33067750 | 2 | 1.41421356 |
| 20132.66672538 | 0 | 0.00000000 |
| 20134.00281803 | 3 | 1.73205081 |
| 20135.33895544 | 4 | 2.00000000 |
| 20136.67513763 | 1 | 1.00000000 |
(-> (tc/dataset {:mass masses :intensity sqrt-intensities})
(plotly/base {:=x :mass
:=y :intensity
:=title "After square root transform"
:=x-title "m/z (Da)"
:=y-title "√Intensity"})
(plotly/layer-line)
plotly/plot)Step 2 — Savitzky-Golay smoothing
A polynomial fit in a sliding window removes high-frequency noise while preserving peak shapes. The paper uses half-window-size 10 (= full window of 21 points) with MALDIquant’s default polynomial order of 3.
(def smoothed-intensities
(ripple/savitzky-golay-smooth sqrt-intensities
{:window-size 21
:polynomial-order 3}))(tc/dataset {:mass masses
:sqrt sqrt-intensities
:smoothed smoothed-intensities})_unnamed [20808 3]:
| :mass | :sqrt | :smoothed |
|---|---|---|
| 1960.06193683 | 36.71511950 | 35.91567123 |
| 1960.47603356 | 37.08099244 | 35.64373075 |
| 1960.89017416 | 34.65544690 | 35.38123010 |
| 1961.30435864 | 32.71085447 | 35.15532515 |
| 1961.71858699 | 34.97141690 | 34.75422774 |
| 1962.13285920 | 34.29285640 | 34.42380707 |
| 1962.54717529 | 33.58571125 | 34.01278408 |
| 1962.96153526 | 34.11744422 | 33.67445374 |
| 1963.37593909 | 32.90896534 | 33.39810535 |
| 1963.79038680 | 34.45286635 | 33.01740623 |
| … | … | … |
| 20123.31533040 | 0.00000000 | 1.83800122 |
| 20124.65110966 | 2.00000000 | 1.78236195 |
| 20125.98693369 | 2.23606798 | 1.66330419 |
| 20127.32280249 | 1.00000000 | 1.47755892 |
| 20128.65871605 | 0.00000000 | 1.34945531 |
| 20129.99467439 | 2.44948974 | 1.25414492 |
| 20131.33067750 | 1.41421356 | 1.20270645 |
| 20132.66672538 | 0.00000000 | 1.15549548 |
| 20134.00281803 | 1.73205081 | 1.12848318 |
| 20135.33895544 | 2.00000000 | 1.19463651 |
| 20136.67513763 | 1.00000000 | 1.20338784 |
(-> (tc/dataset {:mass masses
:sqrt sqrt-intensities
:smoothed smoothed-intensities})
(tc/pivot->longer [:sqrt :smoothed]
{:target-columns :step
:value-column-name :intensity})
(plotly/base {:=x :mass
:=y :intensity
:=color :step
:=title "Smoothing effect (zoom in to see detail)"
:=x-title "m/z (Da)"
:=y-title "Intensity"})
(plotly/layer-line)
plotly/plot)Step 3 — SNIP baseline removal (twice)
The SNIP algorithm estimates and subtracts the slowly varying baseline caused by chemical matrix effects.
The paper’s R code applies SNIP twice in succession: removeBaseline(removeBaseline(spectra, method="SNIP", iterations=20)). The second pass catches any residual baseline that the first pass left behind.
(def baseline-corrected-once
(ripple/snip-baseline-removal smoothed-intensities
{:iterations 20}))(def baseline-corrected
(ripple/snip-baseline-removal baseline-corrected-once
{:iterations 20}))(tc/dataset {:mass masses
:smoothed smoothed-intensities
:after-1st-snip baseline-corrected-once
:after-2nd-snip baseline-corrected})_unnamed [20808 4]:
| :mass | :smoothed | :after-1st-snip | :after-2nd-snip |
|---|---|---|---|
| 1960.06193683 | 35.91567123 | 0.00000000 | 0.00000000 |
| 1960.47603356 | 35.64373075 | 0.07315121 | 0.07026769 |
| 1960.89017416 | 35.38123010 | 0.15574225 | 0.14997520 |
| 1961.30435864 | 35.15532515 | 0.27492898 | 0.26627840 |
| 1961.71858699 | 34.75422774 | 0.21892326 | 0.21206534 |
| 1962.13285920 | 34.42380707 | 0.23359428 | 0.22902233 |
| 1962.54717529 | 34.01278408 | 0.19205007 | 0.18976409 |
| 1962.96153526 | 33.67445374 | 0.22419190 | 0.22419190 |
| 1963.37593909 | 33.39810535 | 0.29585815 | 0.29585815 |
| 1963.79038680 | 33.01740623 | 0.26317367 | 0.26317367 |
| … | … | … | … |
| 20123.31533040 | 1.83800122 | 0.69635000 | 0.69000843 |
| 20124.65110966 | 1.78236195 | 0.64235673 | 0.63680786 |
| 20125.98693369 | 1.66330419 | 0.52494499 | 0.52018881 |
| 20127.32280249 | 1.47755892 | 0.34084572 | 0.33688224 |
| 20128.65871605 | 1.34945531 | 0.21438811 | 0.21121732 |
| 20129.99467439 | 1.25414492 | 0.12072373 | 0.11834564 |
| 20131.33067750 | 1.20270645 | 0.07093126 | 0.06934587 |
| 20132.66672538 | 1.15549548 | 0.02536630 | 0.02457360 |
| 20134.00281803 | 1.12848318 | 0.00000000 | 0.00000000 |
| 20135.33895544 | 1.19463651 | 0.02870100 | 0.02870100 |
| 20136.67513763 | 1.20338784 | 0.00000000 | 0.00000000 |
(-> (tc/dataset {:mass masses
:smoothed smoothed-intensities
:after-first-snip baseline-corrected-once
:after-second-snip baseline-corrected})
(tc/pivot->longer [:smoothed :after-first-snip :after-second-snip]
{:target-columns :step
:value-column-name :intensity})
(plotly/base {:=x :mass
:=y :intensity
:=color :step
:=title "Baseline removal — single vs double SNIP"
:=x-title "m/z (Da)"
:=y-title "Intensity"})
(plotly/layer-line)
plotly/plot)Step 4 — TIC normalization
Total Ion Current normalization scales the spectrum so its area (by the trapezoidal rule) equals 1.0. This makes spectra from different measurements comparable.
(def normalized-intensities
(ripple/tic-normalize masses baseline-corrected
{:target-area 1.0}))(tc/dataset {:mass masses
:baseline-corrected baseline-corrected
:normalized normalized-intensities})_unnamed [20808 3]:
| :mass | :baseline-corrected | :normalized |
|---|---|---|
| 1960.06193683 | 0.00000000 | 0.00000000 |
| 1960.47603356 | 0.07026769 | 0.00000295 |
| 1960.89017416 | 0.14997520 | 0.00000630 |
| 1961.30435864 | 0.26627840 | 0.00001119 |
| 1961.71858699 | 0.21206534 | 0.00000891 |
| 1962.13285920 | 0.22902233 | 0.00000962 |
| 1962.54717529 | 0.18976409 | 0.00000797 |
| 1962.96153526 | 0.22419190 | 0.00000942 |
| 1963.37593909 | 0.29585815 | 0.00001243 |
| 1963.79038680 | 0.26317367 | 0.00001106 |
| … | … | … |
| 20123.31533040 | 0.69000843 | 0.00002900 |
| 20124.65110966 | 0.63680786 | 0.00002676 |
| 20125.98693369 | 0.52018881 | 0.00002186 |
| 20127.32280249 | 0.33688224 | 0.00001416 |
| 20128.65871605 | 0.21121732 | 0.00000888 |
| 20129.99467439 | 0.11834564 | 0.00000497 |
| 20131.33067750 | 0.06934587 | 0.00000291 |
| 20132.66672538 | 0.02457360 | 0.00000103 |
| 20134.00281803 | 0.00000000 | 0.00000000 |
| 20135.33895544 | 0.02870100 | 0.00000121 |
| 20136.67513763 | 0.00000000 | 0.00000000 |
(-> (tc/dataset {:mass masses :intensity normalized-intensities})
(plotly/base {:=x :mass
:=y :intensity
:=title "After TIC normalization"
:=x-title "m/z (Da)"
:=y-title "Intensity (normalized)"})
(plotly/layer-line)
plotly/plot)Trimming and binning
For machine learning we need a fixed-length feature vector. The DRIAMS protocol trims to [2000, 20000] Da and bins into 3 Da wide bins, producing 6,000 features.
Trimming
(def preprocessed
(tc/dataset {:mass masses :intensity normalized-intensities}))(def trimmed
(ripple/trim-spectrum preprocessed {:range [2000 20000]})){:points-before (tc/row-count preprocessed)
:points-after (tc/row-count trimmed)}{:points-before 20808, :points-after 20609}(-> trimmed
(plotly/base {:=x :mass
:=y :intensity
:=title "Preprocessed spectrum (trimmed to 2000–20000 Da)"
:=x-title "m/z (Da)"
:=y-title "Intensity (normalized)"})
(plotly/layer-line)
plotly/plot)Binning
The m/z axis is partitioned into 3 Da bins and the intensities falling into each bin are summed, producing a vector of 6,000 features.
(def binned
(ripple/bin-spectrum trimmed {:range [2000 20000] :step 3}))(count binned)6000Here are the first few values:
(vec (take 10 binned))[0.0014051348219647724
0.001138190111135852
6.487409344596183E-4
1.6951863616377594E-4
3.85875544996719E-5
5.060839052549854E-4
0.001299036331936515
7.310534543564925E-4
6.292270612348158E-4
3.4378357407736145E-4]Visualized as a line chart (bin index on the x-axis):
(-> (tc/dataset {:bin (range (count binned))
:intensity (vec binned)})
(plotly/base {:=x :bin
:=y :intensity
:=title "Binned feature vector (6000 features)"
:=x-title "Bin index"
:=y-title "Summed intensity"})
(plotly/layer-line)
plotly/plot)Processing all spectra
learning/prepare-ml-data applies preprocessing and binning to every spectrum in the dataset, producing a table with one row per spectrum and 6,000 feature columns (:x0 through :x5999) plus the :ri target.
(def preprocessing-params
{:smooth-window 21
:smooth-polynomial 3})(def ml-params
{:preprocessing-params preprocessing-params
:binning-params {:range [2000 20000] :step 3}})(def ml-data
(learning/prepare-ml-data raw-data ml-params)){:rows (tc/row-count ml-data)
:columns (count (tc/column-names ml-data))}{:rows 1400, :columns 6001}The first few columns:
(-> ml-data
(tc/select-columns (into [:ri] (map #(keyword (str "x" %)) (range 5)))))ml-data [1400 6]:
| :ri | :x0 | :x1 | :x2 | :x3 | :x4 |
|---|---|---|---|---|---|
| false | 0.00140513 | 0.00113819 | 0.00064874 | 0.00016952 | 0.00003859 |
| false | 0.00056953 | 0.00087902 | 0.00049127 | 0.00003817 | 0.00003095 |
| false | 0.00111533 | 0.00072134 | 0.00032575 | 0.00029950 | 0.00105973 |
| false | 0.00115295 | 0.00092728 | 0.00054034 | 0.00022159 | 0.00115163 |
| true | 0.00097009 | 0.00105452 | 0.00046938 | 0.00013652 | 0.00011332 |
| false | 0.00024568 | 0.00015792 | 0.00013027 | 0.00025076 | 0.00006439 |
| false | 0.00060575 | 0.00084451 | 0.00045604 | 0.00006230 | 0.00017407 |
| false | 0.00035647 | 0.00010136 | 0.00061886 | 0.00050318 | 0.00012080 |
| true | 0.00084311 | 0.00217295 | 0.00127265 | 0.00026835 | 0.00005026 |
| false | 0.00172205 | 0.00257105 | 0.00094202 | 0.00022629 | 0.00007186 |
| … | … | … | … | … | … |
| true | 0.00117619 | 0.00054554 | 0.00000983 | 0.00025545 | 0.00023035 |
| true | 0.00024839 | 0.00015676 | 0.00003847 | 0.00006366 | 0.00003414 |
| false | 0.00045581 | 0.00068686 | 0.00019928 | 0.00002988 | 0.00023445 |
| false | 0.00087060 | 0.00143934 | 0.00062429 | 0.00010663 | 0.00013869 |
| false | 0.00057721 | 0.00059694 | 0.00010637 | 0.00012898 | 0.00004320 |
| true | 0.00036581 | 0.00097373 | 0.00061454 | 0.00025811 | 0.00007275 |
| false | 0.00000160 | 0.00006738 | 0.00024693 | 0.00007348 | 0.00002933 |
| false | 0.00010252 | 0.00025605 | 0.00034148 | 0.00026237 | 0.00024249 |
| false | 0.00045981 | 0.00029983 | 0.00011500 | 0.00025876 | 0.00049484 |
| false | 0.00000165 | 0.00010861 | 0.00032309 | 0.00041009 | 0.00094332 |
| true | 0.00004385 | 0.00010137 | 0.00004282 | 0.00034013 | 0.00022050 |
A few spectra overlaid
Plotting the binned features for the first five spectra shows the typical variation across samples:
(def n-overlay 5)(def overlay-data
(let [bins (range 6000)]
(->> (range n-overlay)
(mapcat (fn [i]
(let [row (-> ml-data (tc/select-rows [i]))]
(map (fn [b]
{:bin b
:intensity (first (row (keyword (str "x" b))))
:spectrum (str "spectrum-" i)})
bins))))
tc/dataset)))(-> overlay-data
(plotly/base {:=x :bin
:=y :intensity
:=color :spectrum
:=title "Binned features — first 5 spectra"
:=x-title "Bin index"
:=y-title "Summed intensity"})
(plotly/layer-line {:=mark-opacity 0.5})
plotly/plot)Train/test split
As a final check, we split the dataset and verify it is ready for modeling:
(def split-data
(learning/split ml-data {:seed 1})){:train-rows (tc/row-count (:train split-data))
:test-rows (tc/row-count (:test split-data))
:train-resistance-rate (-> split-data :train :ri tcc/mean)
:test-resistance-rate (-> split-data :test :ri tcc/mean)}{:train-rows 933,
:test-rows 467,
:train-resistance-rate 0.16291532690246516,
:test-resistance-rate 0.18201284796573874}Resistance distribution in train and test sets:
(-> (concat
(map (fn [ri] {:split "train" :label (if ri "R/I" "S")}) ((:train split-data) :ri))
(map (fn [ri] {:split "test" :label (if ri "R/I" "S")}) ((:test split-data) :ri)))
tc/dataset
(tc/group-by [:split :label])
(tc/aggregate {:count tc/row-count})
(plotly/base {:=x :split
:=y :count
:=color :label
:=title "Resistance distribution — train vs test"
:=x-title ""
:=y-title "Count"})
(plotly/layer-bar)
plotly/plot)The dataset is ready. Each row is a preprocessed, binned MALDI spectrum with a resistance label — exactly what the classifier needs.