2 Exploring the DRIAMS Dataset
The DRIAMS dataset is a large collection of MALDI-TOF mass spectra linked to antimicrobial resistance profiles, published by Weis et al. (2022).
It contains spectra from four Swiss hospital sites:
- DRIAMS-A — University Hospital of Basel (2015–2018, ~80K spectra)
- DRIAMS-B — Canton Hospital Basel-Land (2018, ~6K spectra)
- DRIAMS-C — Canton Hospital Aarau (2018, ~22K spectra)
- DRIAMS-D — Viollier AG laboratory (2018, ~76K spectra)
This notebook walks through the data using AMR’s ingestion utilities, then shows how Ripple preprocesses a spectrum for machine learning.
(ns amr-book.exploring-driams
  (:require
   ;; AMR data loading utilities:
   [scicloj.amr.data.ingestion :as ingestion]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]))

Where is the data?
The ingestion/base-dir function resolves the DRIAMS path from the DRIAMS_BASE_DIR environment variable (or amr.edn):
(ingestion/base-dir)

"data/DRIAMS/"

Loading a raw spectrum
Each spectrum is a gzipped text file with space-separated mass/intensity pairs — typically ~18,000 points spanning roughly 2,000–20,000 Da.
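To make the file format concrete, here is a minimal sketch of reading such a file with plain Clojure and Java interop. It is not the implementation behind `ingestion/load-raw-spectrum`; the names `parse-spectrum-line` and `read-spectrum` are hypothetical, and the parser assumes that any line which is not two whitespace-separated numbers (a blank line or a header, for instance) can be skipped.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '(java.util.zip GZIPInputStream GZIPOutputStream))

(defn parse-spectrum-line
  "Parses one 'mass intensity' line into a map. Returns nil for blank
  lines and for lines whose first two fields are not numeric."
  [line]
  (let [[m i] (str/split (str/trim line) #"\s+")]
    (when (and m i)
      (try
        {:mass      (Double/parseDouble m)
         :intensity (Double/parseDouble i)}
        (catch NumberFormatException _ nil)))))

(defn read-spectrum
  "Reads a gzipped, whitespace-separated spectrum file into a vector of maps."
  [path]
  (with-open [rdr (-> path io/input-stream GZIPInputStream. io/reader)]
    ;; `into` is eager, so every line is consumed before the reader closes.
    (into [] (keep parse-spectrum-line) (line-seq rdr))))

;; Round trip on a tiny temporary file holding two illustrative pairs.
(def demo-spectrum
  (let [tmp (java.io.File/createTempFile "spectrum" ".txt.gz")]
    (.deleteOnExit tmp)
    (with-open [w (-> tmp io/output-stream GZIPOutputStream. io/writer)]
      (.write w "1959.98679805 221\n1960.40223979 333\n"))
    (read-spectrum tmp)))
```

This sketch parses intensities as doubles for simplicity; a real loader may well keep them as integers.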
Let’s pick the first file returned by the ingestion utilities:
(def spectrum-path
  (-> (ingestion/find-data-files "txt.gz")
      first))

spectrum-path

"data/DRIAMS/DRIAMS-A/raw/2015/000d2b4a-ca7f-41c6-a9a2-968874ee9ce4.txt.gz"

Load it as a tablecloth dataset:

(def raw-spectrum
  (ingestion/load-raw-spectrum spectrum-path))

raw-spectrum

data/DRIAMS/DRIAMS-A/raw/2015/000d2b4a-ca7f-41c6-a9a2-968874ee9ce4.txt.gz [20745 2]:
| :mass | :intensity |
|---|---|
| 1959.98679805 | 221 |
| 1960.40223979 | 333 |
| 1960.81772567 | 428 |
| 1961.23325571 | 503 |
| 1961.64882989 | 303 |
| 1962.06444823 | 365 |
| 1962.48011071 | 301 |
| 1962.89581735 | 390 |
| 1963.31156813 | 372 |
| 1963.72736306 | 412 |
| … | … |
| 20123.00622194 | 60 |
| 20124.34573575 | 37 |
| 20125.68529456 | 111 |
| 20127.02489835 | 109 |
| 20128.36454714 | 76 |
| 20129.70424091 | 69 |
| 20131.04397968 | 79 |
| 20132.38376344 | 21 |
| 20133.72359219 | 73 |
| 20135.06346593 | 119 |
| 20136.40338467 | 62 |
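The printed rows suggest that the m/z axis is not sampled uniformly: neighbouring points sit about 0.4 Da apart at the low end and more than 1 Da apart at the high end. This is what one would expect from a time-of-flight instrument, where m/z grows roughly with the square of the flight time. The sketch below checks the spacing on the first and last three mass values from the table; `spacings` is a hypothetical helper, not part of the AMR library.

```clojure
(defn spacings
  "Differences between consecutive values of a numeric sequence."
  [xs]
  (mapv - (rest xs) xs))

;; First and last three :mass values from the table above.
(def head-masses [1959.98679805 1960.40223979 1960.81772567])
(def tail-masses [20133.72359219 20135.06346593 20136.40338467])

(spacings head-masses) ; both gaps are roughly 0.415 Da
(spacings tail-masses) ; both gaps are roughly 1.340 Da
```

On the full dataset the same helper could presumably be applied as `(spacings (:mass raw-spectrum))`, since a dataset column can be treated as a sequence. The uneven spacing is one reason spectra are commonly resampled or binned onto a fixed grid before being fed to a model.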
Basic statistics:
{:rows (tc/row-count raw-spectrum)
 :mass-min (-> raw-spectrum :mass first)
 :mass-max (-> raw-spectrum :mass last)}

{:rows 20745, :mass-min 1959.98679804911, :mass-max 20136.4033846666}

Visualizing the raw spectrum
(-> raw-spectrum
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Raw MALDI-TOF spectrum"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (a.u.)"})
    (plotly/layer-line)
    plotly/plot)

Metadata
The id/ directory contains species identification and antimicrobial resistance labels (R/S/I) for each spectrum.
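Because the R/S/I labels are categorical strings, a classifier will usually need them as numbers. The sketch below shows one possible encoding. Treating intermediate ("I") as resistant is a modeling choice assumed here for illustration, and `label->binary` is a hypothetical helper rather than an AMR function; any other value, including a missing one, maps to nil so that it can be filtered out later.

```clojure
(defn label->binary
  "Maps an AMR label to 1 (resistant or intermediate), 0 (susceptible),
  or nil when no usable label is present."
  [label]
  (case label
    ("R" "I") 1
    "S"       0
    nil))

(mapv label->binary ["S" "R" "I" "-" "" nil])
;; => [0 1 1 nil nil nil]
```

Returning nil rather than a third number keeps untested isolates from being mistaken for either class.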
(def metadata
  (ingestion/load-metadata {:site :A :year 2018}))

{:spectra (tc/row-count metadata)
 :columns (count (tc/column-names metadata))}

{:spectra 30069, :columns 87}

A few rows (showing selected antibiotic columns):

(-> metadata
    (tc/select-columns [:code :species :Cefepime :Ciprofloxacin :Ceftriaxone]))

data/DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz [30069 5]:
| :code | :species | :Cefepime | :Ciprofloxacin | :Ceftriaxone |
|---|---|---|---|---|
| 18e02f6b-4b84-4344-9b7a-2a9cc2b5e2b6_MALDI1 | Pseudomonas aeruginosa | S | S | - |
| e9544679-3f9d-43f6-8ce3-aac053980742_MALDI1 | Candida glabrata | - | - | - |
| bfcad108-864f-4b37-83f3-d7dc94265213_MALDI1 | Klebsiella pneumoniae | S | S | S |
| c649f842-5926-4bb3-8aef-d411db4241f4_MALDI1 | Staphylococcus capitis | R | S | R |
| 69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1 | Escherichia coli | S | S | S |
| 2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1 | Escherichia coli | S | S | S |
| 554d747d-77d6-4f66-b24a-dc1132943e54_MALDI1 | Staphylococcus aureus | S | S | S |
| abce6ff4-92ec-4b63-b971-4a3cc06441b0_MALDI1 | Staphylococcus aureus | S | S | S |
| 0a430536-9d43-406c-9fef-0a9ffebb41f0_MALDI1 | Staphylococcus capitis | R | S | R |
| ab900e35-1954-4201-a95e-dd0719a3a3ef_MALDI1 | Gardnerella vaginalis | | | |
| … | … | … | … | … |
| c86ab62e-75f3-43da-ab1c-29152646699b_MALDI2 | Pseudomonas aeruginosa | | | |
| ca258726-2047-4a75-ab65-60a81bcfc960_MALDI2 | Proteus mirabilis | | | |
| d012c864-4676-4439-80f8-4b8dceb5121b_MALDI2 | Staphylococcus epidermidis | | | |
| da8a5356-7edd-4f55-b0c5-843ca65999ce_MALDI2 | Staphylococcus aureus | | | |
| df3fe614-0998-4650-a4b4-fffd354de434_MALDI2 | Staphylococcus aureus | | | |
| e5db7ba3-3f65-47e4-8f2f-f375e3c34d3e_MALDI2 | Escherichia coli | | | |
| f3170944-adfd-4ea7-995b-7c82402ddb79_MALDI2 | Escherichia coli | | | |
| f4364fbe-e053-4227-a543-73d6c633fb7e_MALDI2 | Enterococcus faecalis | | | |
| fc1ef8b3-9012-48a7-9386-f4363ee942f8_MALDI2 | Enterococcus faecalis | | | |
| fcbc835a-1cea-48ad-8e56-82afcb867f31_MALDI2 | Actinomyces turicensis | | | |
| a4468a4e-3e12-4685-aea4-4c8b50e68509_MALDI2 | Escherichia coli | | | |
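The table shows that many cells hold "-" or nothing at all, so it is worth checking how many usable labels an antibiotic column really has before training on it. The sketch below does this on five rows copied from the table, using plain Clojure maps instead of a tablecloth dataset. Blank cells are written as nil, which is an assumption about how they are stored, and `label-summary` is a hypothetical helper.

```clojure
;; Five rows copied from the table above; blank cells are represented as nil.
(def sample-rows
  [{:species "Pseudomonas aeruginosa" :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "-"}
   {:species "Candida glabrata"       :Cefepime "-" :Ciprofloxacin "-" :Ceftriaxone "-"}
   {:species "Klebsiella pneumoniae"  :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "S"}
   {:species "Staphylococcus capitis" :Cefepime "R" :Ciprofloxacin "S" :Ceftriaxone "R"}
   {:species "Gardnerella vaginalis"  :Cefepime nil :Ciprofloxacin nil :Ceftriaxone nil}])

;; Only these three values count as a usable label.
(def usable-label? #{"R" "S" "I"})

(defn label-summary
  "Counts each distinct value in one antibiotic column and how many rows
  carry a usable R/S/I label."
  [rows antibiotic]
  (let [labels (map antibiotic rows)]
    {:counts (frequencies labels)
     :usable (count (filter usable-label? labels))}))

(label-summary sample-rows :Ceftriaxone)
;; two of the five rows carry a usable Ceftriaxone label
```

Running the same summary over every antibiotic column would show which antibiotics have enough labelled spectra to be worth modeling.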
Species distribution
How many distinct species are there, and which are most common?
(def species-counts
  (-> metadata
      (tc/group-by [:species])
      (tc/aggregate {:count tc/row-count})
      (tc/order-by [:count] :desc)))

(tc/select-rows species-counts (range 10))

_unnamed [10 2]:
| :species | :count |
|---|---|
| Staphylococcus epidermidis | 2554 |
| Staphylococcus aureus | 2191 |
| Escherichia coli | 1970 |
| Pseudomonas aeruginosa | 1463 |
| Enterococcus faecalis | 1336 |
| Klebsiella pneumoniae | 1099 |
| Gardnerella vaginalis | 1053 |
| Propionibacterium acnes | 739 |
| Candida albicans | 735 |
| Streptococcus agalactiae | 669 |
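A quick piece of arithmetic on these counts shows how concentrated the dataset is. The sketch below uses only numbers printed in this notebook (the ten counts above and the 30,069 spectra reported for DRIAMS-A 2018) and computes the cumulative share of spectra covered by the most common species.

```clojure
;; Counts copied from the top-10 table above, most frequent first.
(def top-10-counts [2554 2191 1970 1463 1336 1099 1053 739 735 669])

;; Total number of metadata rows reported earlier for DRIAMS-A 2018.
(def total-spectra 30069)

;; Running share of all spectra covered by the top 1, top 2, ... top 10 species.
(def cumulative-share
  (mapv #(/ % (double total-spectra))
        (reductions + top-10-counts)))

(last cumulative-share) ; about 0.459
```

So the ten most common species account for a little under half of the spectra, with the remainder spread over a long tail. Since `species-counts` holds one row per species, `(tc/row-count species-counts)` gives the number of distinct species.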
The top species as a bar chart:
(-> species-counts
    (tc/select-rows (range 15))
    (plotly/base {:=x :count
                  :=y :species
                  :=title "Top 15 species — DRIAMS-A 2018"
                  :=x-title "Number of spectra"
                  :=y-title ""})
    (plotly/layer-bar)
    plotly/plot
    (assoc-in [:data 0 :orientation] :h)
    (assoc-in [:layout :margin :l] 200))

Antibiotics
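The metadata has one column per antibiotic, with values “R” (resistant), “S” (susceptible), or “I” (intermediate). The sample rows above also contain “-” and empty cells, which presumably mark antibiotics with no recorded result for that isolate. Which antibiotics are tested?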
(def antibiotic-columns
  (->> (tc/column-names metadata)
       (remove #{:code :species :laboratory_species
                 :combined_code :column-0 (keyword "Unnamed: 0")})))

(count antibiotic-columns)

82

References
- Weis, C., et al. (2022). Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nature Medicine, 28, 164–174.