3 Preprocessing Explained
This notebook demonstrates the MALDI-TOF preprocessing pipeline using a synthetic spectrum — no real data files are needed.
By constructing a spectrum from known components (peaks, baseline, noise), we can see exactly what each preprocessing step does and verify it works correctly.
The preprocessing follows the Weis et al. protocol, implemented by Ripple:
- Square root transform (variance stabilization)
- Savitzky-Golay smoothing (half-window 10, polynomial order 3)
- SNIP baseline removal (20 iterations, applied twice)
- TIC normalization
- Trim to [2000, 20000] Da
- Bin into 3 Da bins → 6000 features
(ns amr-book.preprocessing-explained
(:require
;; Ripple MALDI signal processing (https://scicloj.github.io/ripple):
[scicloj.ripple.maldi :as ripple]
;; Table processing (https://scicloj.github.io/tablecloth/):
[tablecloth.api :as tc]
;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
[scicloj.tableplot.v1.plotly :as plotly]
;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
[scicloj.kindly.v4.kind :as kind]
;; High-performance array math (https://github.com/cnuernber/dtype-next):
[tech.v3.datatype :as dtype]
[tech.v3.datatype.functional :as dfn]))Building a Synthetic Spectrum
A real MALDI-TOF spectrum has three components:
Peaks — sharp signals from molecules of specific masses
Baseline — a broad, slowly varying background from the chemical matrix used in MALDI
Noise — random fluctuations from the detection process
We build a synthetic spectrum with all three, so we can watch each preprocessing step remove the artifacts while preserving the true peaks.
Mass axis
Real DRIAMS spectra span roughly 1960–20133 Da with ~18000 points. For a manageable visualization, we use 1800–5000 Da at 1 Da spacing.
(def raw-masses
(vec (range 1800.0 5000.0 1.0)))(count raw-masses)3200Gaussian peak helper
(defn gaussian
"A [Gaussian](https://en.wikipedia.org/wiki/Gaussian_function) peak
centered at `center` with width `sigma` and `height`."
[x center sigma height]
(* height (Math/exp (- (/ (Math/pow (- x center) 2)
(* 2 sigma sigma))))))Assembling the components
We place five peaks of varying width and height on top of a complex baseline: an exponential decay (typical of MALDI matrix effects) plus sinusoidal undulations and broad humps that mimic the irregular backgrounds seen in real spectra. Deterministic sinusoidal “noise” is overlaid for reproducibility.
(def raw-intensities
(mapv (fn [m]
(let [;; Complex baseline: exponential decay + sinusoidal undulations + broad bumps
baseline (+ (* 1500 (Math/exp (- (/ (- m 1800) 400.0))))
(* 150 (Math/sin (/ (- m 1800) 300.0)))
(* 80 (Math/cos (/ (- m 2000) 150.0)))
(gaussian m 2600 100.0 500.0)
(gaussian m 3400 80.0 400.0)
(gaussian m 4200 120.0 300.0))
peak1 (gaussian m 2200 8.0 600.0)
peak2 (gaussian m 2800 15.0 350.0)
peak3 (gaussian m 3200 5.0 800.0)
peak4 (gaussian m 3800 20.0 250.0)
peak5 (gaussian m 4400 10.0 450.0)
noise (* 18.0 (+ (Math/sin (* m 0.3))
(* 0.7 (Math/sin (* m 1.1)))
(* 0.3 (Math/sin (* m 2.7)))))]
(max 0.0 (+ baseline peak1 peak2 peak3 peak4 peak5 noise))))
raw-masses))The raw spectrum
(def raw-spectrum
(tc/dataset {:mass raw-masses :intensity raw-intensities}))raw-spectrum_unnamed [3200 2]:
| :mass | :intensity |
|---|---|
| 1800.0 | 1521.82600374 |
| 1801.0 | 1524.53872811 |
| 1802.0 | 1523.89660208 |
| 1803.0 | 1504.43437093 |
| 1804.0 | 1515.53101976 |
| 1805.0 | 1517.64105909 |
| 1806.0 | 1534.08628318 |
| 1807.0 | 1527.67881263 |
| 1808.0 | 1508.80961341 |
| 1809.0 | 1499.53906651 |
| … | … |
| 4989.0 | 0.00000000 |
| 4990.0 | 0.00000000 |
| 4991.0 | 0.00000000 |
| 4992.0 | 0.00000000 |
| 4993.0 | 0.00000000 |
| 4994.0 | 0.00000000 |
| 4995.0 | 0.00000000 |
| 4996.0 | 0.00000000 |
| 4997.0 | 0.00000000 |
| 4998.0 | 0.00000000 |
| 4999.0 | 0.00000000 |
(-> raw-spectrum
(plotly/base {:=x :mass
:=y :intensity
:=title "Raw synthetic spectrum"
:=x-title "m/z (Da)"
:=y-title "Intensity (a.u.)"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)The baseline has an exponential decay, broad humps, and wavy undulations — obscuring the true peak heights. Noise adds spurious wiggles. The region below 2000 Da is dominated by matrix signal.
Step 1 — Square Root Transform
MALDI-TOF measures ion counts, which follow Poisson-like statistics: the variance grows with the mean. A variance-stabilizing square root compresses the dynamic range so that high and low peaks have comparable noise levels.
y’ = √y
(def sqrt-intensities
(ripple/sqrt-transform raw-intensities))(tc/dataset {:mass raw-masses
:raw raw-intensities
:sqrt sqrt-intensities})_unnamed [3200 3]:
| :mass | :raw | :sqrt |
|---|---|---|
| 1800.0 | 1521.82600374 | 39.01058835 |
| 1801.0 | 1524.53872811 | 39.04534195 |
| 1802.0 | 1523.89660208 | 39.03711826 |
| 1803.0 | 1504.43437093 | 38.78703870 |
| 1804.0 | 1515.53101976 | 38.92982173 |
| 1805.0 | 1517.64105909 | 38.95691285 |
| 1806.0 | 1534.08628318 | 39.16741354 |
| 1807.0 | 1527.67881263 | 39.08553201 |
| 1808.0 | 1508.80961341 | 38.84339858 |
| 1809.0 | 1499.53906651 | 38.72388238 |
| … | … | … |
| 4989.0 | 0.00000000 | 0.00000000 |
| 4990.0 | 0.00000000 | 0.00000000 |
| 4991.0 | 0.00000000 | 0.00000000 |
| 4992.0 | 0.00000000 | 0.00000000 |
| 4993.0 | 0.00000000 | 0.00000000 |
| 4994.0 | 0.00000000 | 0.00000000 |
| 4995.0 | 0.00000000 | 0.00000000 |
| 4996.0 | 0.00000000 | 0.00000000 |
| 4997.0 | 0.00000000 | 0.00000000 |
| 4998.0 | 0.00000000 | 0.00000000 |
| 4999.0 | 0.00000000 | 0.00000000 |
Dynamic range before and after:
{:raw-range [(dfn/reduce-min raw-intensities) (dfn/reduce-max raw-intensities)]
:sqrt-range [(dfn/reduce-min sqrt-intensities) (dfn/reduce-max sqrt-intensities)]}{:raw-range [0.0 1534.0862831797983],
:sqrt-range [0.0 39.16741353701822]}(-> (tc/dataset {:mass raw-masses :intensity sqrt-intensities})
(plotly/base {:=x :mass
:=y :intensity
:=title "After square root transform"
:=x-title "m/z (Da)"
:=y-title "√Intensity"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)The tallest peak (raw ~1300) and smallest (raw ~250) are now closer together, giving downstream algorithms equal sensitivity across the spectrum.
Step 2 — Savitzky-Golay Smoothing
The Savitzky-Golay filter fits a local polynomial to each sliding window of points and replaces the center with the fitted value. It removes high-frequency noise while preserving peak shapes.
The Weis et al. protocol uses:
- Half-window size 10 → full window of 21 points
- Polynomial order 3 (MALDIquant default)
(def smoothed-intensities
(ripple/savitzky-golay-smooth sqrt-intensities
{:window-size 21
:polynomial-order 3}))(tc/dataset {:mass raw-masses
:sqrt sqrt-intensities
:smoothed smoothed-intensities})_unnamed [3200 3]:
| :mass | :sqrt | :smoothed |
|---|---|---|
| 1800.0 | 39.01058835 | 39.02784055 |
| 1801.0 | 39.04534195 | 39.01681595 |
| 1802.0 | 39.03711826 | 39.00783244 |
| 1803.0 | 38.78703870 | 38.99572538 |
| 1804.0 | 38.92982173 | 38.99170953 |
| 1805.0 | 38.95691285 | 38.98161818 |
| 1806.0 | 39.16741354 | 38.94525966 |
| 1807.0 | 39.08553201 | 38.89463208 |
| 1808.0 | 38.84339858 | 38.82333151 |
| 1809.0 | 38.72388238 | 38.76328923 |
| … | … | … |
| 4989.0 | 0.00000000 | 0.00000000 |
| 4990.0 | 0.00000000 | 0.00000000 |
| 4991.0 | 0.00000000 | 0.00000000 |
| 4992.0 | 0.00000000 | 0.00000000 |
| 4993.0 | 0.00000000 | 0.00000000 |
| 4994.0 | 0.00000000 | 0.00000000 |
| 4995.0 | 0.00000000 | 0.00000000 |
| 4996.0 | 0.00000000 | 0.00000000 |
| 4997.0 | 0.00000000 | 0.00000000 |
| 4998.0 | 0.00000000 | 0.00000000 |
| 4999.0 | 0.00000000 | 0.00000000 |
Effect of window size
Larger windows smooth more aggressively but can broaden peaks. Let’s compare three window sizes zoomed to the sharpest peak (σ=5, centered at 3200 Da):
(def sg-narrow
(ripple/savitzky-golay-smooth sqrt-intensities
{:window-size 5 :polynomial-order 3}))(def sg-default
smoothed-intensities)(def sg-wide
(ripple/savitzky-golay-smooth sqrt-intensities
{:window-size 41 :polynomial-order 3}))(let [idx-start 1370 idx-end 1430
sub-m (dtype/sub-buffer raw-masses idx-start (- idx-end idx-start))
sub-sqrt (dtype/sub-buffer sqrt-intensities idx-start (- idx-end idx-start))
sub-narrow (dtype/sub-buffer sg-narrow idx-start (- idx-end idx-start))
sub-default (dtype/sub-buffer sg-default idx-start (- idx-end idx-start))
sub-wide (dtype/sub-buffer sg-wide idx-start (- idx-end idx-start))
nm (count sub-m)]
(-> (tc/dataset
{:mass (vec (concat sub-m sub-m sub-m sub-m))
:intensity (vec (concat sub-sqrt sub-narrow sub-default sub-wide))
:series (vec (concat (repeat nm "input (sqrt)")
(repeat nm "window=5")
(repeat nm "window=21 (Weis et al.)")
(repeat nm "window=41")))})
(plotly/base {:=x :mass
:=y :intensity
:=color :series
:=title "Smoothing comparison — peak at 3200 Da"
:=x-title "m/z (Da)"
:=y-title "Intensity"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot))Window=5 leaves visible noise, window=41 broadens the peak. Window=21 (the paper’s choice) balances noise removal and peak preservation.
Full spectrum after smoothing:
(-> (tc/dataset {:mass raw-masses :intensity smoothed-intensities})
(plotly/base {:=x :mass
:=y :intensity
:=title "After Savitzky-Golay smoothing (window=21, order=3)"
:=x-title "m/z (Da)"
:=y-title "Intensity"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)Step 3 — SNIP Baseline Removal (Twice)
The SNIP algorithm (Statistics-sensitive Non-linear Iterative Peak-clipping) estimates the baseline by iteratively “eroding” peaks. At each iteration with window half-width w, each point is compared to the average of its neighbors w steps away:
baseline[i] = min(baseline[i], (baseline[i-w] + baseline[i+w]) / 2)
Points that are part of a peak get clipped down. After all iterations, only the slow-varying baseline remains.
The paper’s actual R code applies SNIP twice: the second pass catches residual baseline that the first pass left behind.
First SNIP pass
(def baseline-corrected-once
(ripple/snip-baseline-removal smoothed-intensities
{:iterations 20}))Second SNIP pass
(def baseline-corrected
(ripple/snip-baseline-removal baseline-corrected-once
{:iterations 20}))(tc/dataset {:mass raw-masses
:smoothed smoothed-intensities
:after-1st-snip baseline-corrected-once
:after-2nd-snip baseline-corrected})_unnamed [3200 4]:
| :mass | :smoothed | :after-1st-snip | :after-2nd-snip |
|---|---|---|---|
| 1800.0 | 39.02784055 | 0.00000000 | 0.00000000 |
| 1801.0 | 39.01681595 | 0.03804590 | 0.03804590 |
| 1802.0 | 39.00783244 | 0.07813291 | 0.07813291 |
| 1803.0 | 38.99572538 | 0.11509636 | 0.11509636 |
| 1804.0 | 38.99170953 | 0.16015103 | 0.16015103 |
| 1805.0 | 38.98161818 | 0.19913018 | 0.19913018 |
| 1806.0 | 38.94525966 | 0.21184217 | 0.21184217 |
| 1807.0 | 38.89463208 | 0.21028511 | 0.21028511 |
| 1808.0 | 38.82333151 | 0.18805504 | 0.18805504 |
| 1809.0 | 38.76328923 | 0.17708327 | 0.17708327 |
| … | … | … | … |
| 4989.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4990.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4991.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4992.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4993.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4994.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4995.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4996.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4997.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4998.0 | 0.00000000 | 0.00000000 | 0.00000000 |
| 4999.0 | 0.00000000 | 0.00000000 | 0.00000000 |
Visualizing the two passes
The estimated baselines (input minus corrected) show how the second pass refines the estimate:
(let [n (count raw-masses)]
(-> (tc/dataset
{:mass (vec (concat raw-masses raw-masses raw-masses))
:intensity (vec (concat smoothed-intensities
baseline-corrected-once
baseline-corrected))
:step (vec (concat (repeat n "before (smoothed)")
(repeat n "after 1st SNIP")
(repeat n "after 2nd SNIP")))})
(plotly/base {:=x :mass
:=y :intensity
:=color :step
:=title "Baseline removal — single vs double SNIP"
:=x-title "m/z (Da)"
:=y-title "Intensity"})
(plotly/layer-line {:=mark-opacity 0.7})
plotly/plot))The first pass removes the bulk of the baseline. The second pass is a subtle refinement — it shaves off residual curvature that the first pass left behind. The effect is small but systematic, bringing inter-peak regions slightly closer to zero.
Verify the background region (away from peaks) is near zero:
(let [region (dtype/sub-buffer baseline-corrected 700 100)
mean-val (/ (dfn/sum region) (count region))]
{:background-mean mean-val
:near-zero? (< (Math/abs mean-val) 1.0)}){:background-mean 0.2850081834166718, :near-zero? true}Step 4 — TIC Normalization
Total Ion Current normalization scales each spectrum so that its area (by the trapezoid rule) equals 1.0. This makes spectra from different measurements comparable, removing differences due to sample amount or instrument sensitivity.
Unlike the other functions, tic-normalize needs both masses and intensities — the m/z spacings are required to compute the trapezoid area correctly.
(def masses raw-masses)(def normalized-intensities
(ripple/tic-normalize masses baseline-corrected
{:target-area 1.0}))(tc/dataset {:mass masses
:baseline-corrected baseline-corrected
:normalized normalized-intensities})_unnamed [3200 3]:
| :mass | :baseline-corrected | :normalized |
|---|---|---|
| 1800.0 | 0.00000000 | 0.00000000 |
| 1801.0 | 0.03804590 | 0.00001205 |
| 1802.0 | 0.07813291 | 0.00002476 |
| 1803.0 | 0.11509636 | 0.00003647 |
| 1804.0 | 0.16015103 | 0.00005074 |
| 1805.0 | 0.19913018 | 0.00006309 |
| 1806.0 | 0.21184217 | 0.00006712 |
| 1807.0 | 0.21028511 | 0.00006663 |
| 1808.0 | 0.18805504 | 0.00005958 |
| 1809.0 | 0.17708327 | 0.00005611 |
| … | … | … |
| 4989.0 | 0.00000000 | 0.00000000 |
| 4990.0 | 0.00000000 | 0.00000000 |
| 4991.0 | 0.00000000 | 0.00000000 |
| 4992.0 | 0.00000000 | 0.00000000 |
| 4993.0 | 0.00000000 | 0.00000000 |
| 4994.0 | 0.00000000 | 0.00000000 |
| 4995.0 | 0.00000000 | 0.00000000 |
| 4996.0 | 0.00000000 | 0.00000000 |
| 4997.0 | 0.00000000 | 0.00000000 |
| 4998.0 | 0.00000000 | 0.00000000 |
| 4999.0 | 0.00000000 | 0.00000000 |
(-> (tc/dataset {:mass masses :intensity normalized-intensities})
(plotly/base {:=x :mass
:=y :intensity
:=title "After TIC normalization"
:=x-title "m/z (Da)"
:=y-title "Intensity (normalized)"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)Why normalization matters
Two measurements of the same sample can differ in overall intensity. TIC normalization removes this difference:
(let [spectrum-a baseline-corrected
spectrum-b (dfn/* baseline-corrected 4.0)
norm-a (ripple/tic-normalize masses spectrum-a {:target-area 1.0})
norm-b (ripple/tic-normalize masses spectrum-b {:target-area 1.0})
max-diff (dfn/reduce-max (dfn/abs (dfn/- norm-a norm-b)))]
{:max-difference-after-normalization max-diff
:identical? (< max-diff 1e-10)}){:max-difference-after-normalization 0.0, :identical? true}After normalization, the two spectra are identical — the 4× intensity difference has been removed.
Single-call Preprocessing
Ripple’s preprocess-spectrum-data applies the square root, smoothing, baseline removal (twice by default), and normalization in one call — matching the paper’s R code.
(def preprocessed
(ripple/preprocess-spectrum-data
raw-spectrum
{:should-sqrt-transform true
:smooth-window 21
:smooth-polynomial 3
:baseline-iterations 20
:should-tic-normalize true}))preprocessed_unnamed [3200 2]:
| :mass | :intensity |
|---|---|
| 1800.0 | 0.00000000 |
| 1801.0 | 0.00001205 |
| 1802.0 | 0.00002476 |
| 1803.0 | 0.00003647 |
| 1804.0 | 0.00005074 |
| 1805.0 | 0.00006309 |
| 1806.0 | 0.00006712 |
| 1807.0 | 0.00006663 |
| 1808.0 | 0.00005958 |
| 1809.0 | 0.00005611 |
| … | … |
| 4989.0 | 0.00000000 |
| 4990.0 | 0.00000000 |
| 4991.0 | 0.00000000 |
| 4992.0 | 0.00000000 |
| 4993.0 | 0.00000000 |
| 4994.0 | 0.00000000 |
| 4995.0 | 0.00000000 |
| 4996.0 | 0.00000000 |
| 4997.0 | 0.00000000 |
| 4998.0 | 0.00000000 |
| 4999.0 | 0.00000000 |
(-> preprocessed
(plotly/base {:=x :mass
:=y :intensity
:=title "After full preprocessing (double SNIP)"
:=x-title "m/z (Da)"
:=y-title "Intensity (normalized)"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)Verify it matches the step-by-step result:
(let [manual-result normalized-intensities
auto-result (:intensity preprocessed)
diff (dfn/abs (dfn/- manual-result auto-result))]
{:max-abs-difference (dfn/reduce-max diff)
:identical? (< (dfn/reduce-max diff) 1e-10)}){:max-abs-difference 0.0, :identical? true}For machine learning, spectra must be converted to fixed-length feature vectors. The DRIAMS protocol:
- Trim to [2000, 20000] Da — removing the low-mass matrix region
- Bin into 3 Da wide bins — producing 6000 features
Trimming
Our synthetic spectrum extends below 2000 Da. Trimming removes those points:
(def normalized-spectrum
(tc/dataset {:mass masses :intensity normalized-intensities}))(def trimmed
(ripple/trim-spectrum normalized-spectrum {:range [2000 5000]})){:points-before (tc/row-count normalized-spectrum)
:points-after (tc/row-count trimmed)}{:points-before 3200, :points-after 3000}(-> trimmed
(plotly/base {:=x :mass
:=y :intensity
:=title "Trimmed to [2000, 5000] Da"
:=x-title "m/z (Da)"
:=y-title "Intensity (normalized)"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)Binning
The m/z axis is partitioned into 3 Da bins and the intensities in each bin are summed. For our range [2000, 5000] this produces 1000 bins. (With the full DRIAMS range [2000, 20000] there would be 6000 bins.)
(def binned
(ripple/bin-spectrum trimmed {:range [2000 5000] :step 3}))(count binned)1000First 10 bin values:
(vec (take 10 binned))[1.574118349507658E-4
1.4107381897474443E-5
3.6967830921445345E-5
1.6116132192937037E-4
3.369944564779574E-4
3.9867612081056103E-4
3.24252594089932E-4
1.5708595420709903E-4
3.082070720091793E-5
4.270847554056874E-5]Comparing continuous vs binned
Zooming to the peak at 3200 Da shows how binning approximates the continuous shape:
(let [bin-min 2000
bin-step 3
n-bins (count binned)
bin-centers (mapv #(+ bin-min (* (+ % 0.5) bin-step)) (range n-bins))
zoom-min 3150 zoom-max 3250
cont-masses (:mass trimmed)
cont-intens (:intensity trimmed)
cont-idx (filterv #(and (>= (cont-masses %) zoom-min)
(<= (cont-masses %) zoom-max))
(range (count cont-masses)))
cm (mapv #(cont-masses %) cont-idx)
ci (mapv #(cont-intens %) cont-idx)
bin-idx (filterv #(and (>= (nth bin-centers %) zoom-min)
(<= (nth bin-centers %) zoom-max))
(range n-bins))
bm (mapv #(nth bin-centers %) bin-idx)
bi (mapv #(aget binned %) bin-idx)
nc (count cm) nb (count bm)]
(-> (tc/dataset
{:mass (vec (concat cm bm))
:intensity (vec (concat ci bi))
:series (vec (concat (repeat nc "continuous (1 Da)")
(repeat nb "binned (3 Da)")))})
(plotly/base {:=x :mass
:=y :intensity
:=color :series
:=title "Continuous vs binned — peak at 3200 Da"
:=x-title "m/z (Da)"
:=y-title "Intensity"})
(plotly/layer-line {:=mark-opacity 0.7})
(plotly/layer-point {:=mark-size 4 :=mark-opacity 0.5})
plotly/plot))The binned version has one value per 3 Da, preserving the peak shape while producing a fixed-length vector suitable for machine learning.
The full binned feature vector
(-> (tc/dataset {:bin (range (count binned))
:intensity (vec binned)})
(plotly/base {:=x :bin
:=y :intensity
:=title "Binned feature vector (1000 features)"
:=x-title "Bin index"
:=y-title "Summed intensity"})
(plotly/layer-line {:=mark-opacity 0.8})
plotly/plot)This is the representation fed to the classifier: a fixed-length vector where each element is the summed intensity in a 3 Da wide bin. In the real DRIAMS pipeline, this would be 6000 features spanning [2000, 20000] Da.
Summary
The preprocessing pipeline transforms a noisy, baseline- contaminated raw spectrum into a clean, normalized, fixed-length feature vector:
| Step | Operation | Effect |
|---|---|---|
| 1 | Square root | Stabilize variance |
| 2 | Savitzky-Golay (window 21, order 3) | Remove noise |
| 3 | SNIP baseline × 2 (20 iterations) | Remove background |
| 4 | TIC normalization | Area → 1.0 |
| 5 | Trim [2000, 20000] Da | Focus on analysis range |
| 6 | Bin 3 Da → 6000 features | Fixed-length vector |
Each step is essential: without variance stabilization, small peaks are invisible; without baseline removal, peak heights are unreliable; without normalization, spectra from different measurements are incomparable; without binning, the ML classifier cannot accept the input.