2 Exploring the DRIAMS Dataset

The DRIAMS dataset is a large collection of MALDI-TOF mass spectra linked to antimicrobial resistance profiles, published by Weis et al. (2022).

It contains spectra from four Swiss hospital sites, labeled DRIAMS-A through DRIAMS-D.

This notebook walks through the data using AMR’s ingestion utilities, then shows how Ripple preprocesses a spectrum for machine learning.

(ns amr-book.exploring-driams
  (:require
   ;; AMR data loading utilities:
   [scicloj.amr.data.ingestion :as ingestion]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]))

Where is the data?

The ingestion/base-dir function resolves the DRIAMS path from the DRIAMS_BASE_DIR environment variable (or amr.edn):

(ingestion/base-dir)
"data/DRIAMS/"

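This notebook does not show how the lookup order is implemented. The sketch below is one plausible arrangement, not the actual code behind ingestion/base-dir: the names read-config and resolve-base-dir, the :driams-base-dir config key, and the fallback default are all assumptions made for illustration.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn read-config
  "Reads an EDN map from `path`; returns {} when the file is missing or empty."
  [path]
  (let [f (io/file path)]
    (if (.exists f)
      (or (edn/read-string (slurp f)) {})
      {})))

(defn resolve-base-dir
  "Hypothetical lookup order: environment variable first, then the
  (assumed) :driams-base-dir config key, then a hard-coded default."
  [env config]
  (or (get env "DRIAMS_BASE_DIR")
      (:driams-base-dir config)
      "data/DRIAMS/"))

;; Resolve against the real environment and an optional amr.edn file.
(resolve-base-dir (System/getenv) (read-config "amr.edn"))
```

Passing the environment and the config in as arguments keeps the lookup order easy to check with plain maps, without touching the real environment.
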
Loading a raw spectrum

Each spectrum is a gzipped text file with space-separated mass/intensity pairs — typically ~18,000 points spanning roughly 2,000–20,000 Da.
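
To make the file format concrete, here is a minimal sketch of reading such a file with plain Clojure and Java interop. It is not how ingestion/load-raw-spectrum is implemented. The helper names are hypothetical, and skipping any line that is not exactly two numbers (headers, comments, blank lines) is an assumption about the files.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '(java.io File)
        '(java.util.zip GZIPInputStream GZIPOutputStream))

(defn parse-number
  "Parses a decimal string, returning nil when it is not a number."
  [s]
  (try (Double/parseDouble s)
       (catch NumberFormatException _ nil)))

(defn parse-line
  "Turns a \"mass intensity\" line into a map; returns nil for any other line."
  [line]
  (let [fields (str/split (str/trim line) #"\s+")]
    (when (= 2 (count fields))
      (let [[mass intensity] (map parse-number fields)]
        (when (and mass intensity)
          {:mass mass :intensity intensity})))))

(defn read-spectrum
  "Reads a gzipped text spectrum into a vector of {:mass .. :intensity ..} maps."
  [path]
  (with-open [rdr (-> path io/input-stream GZIPInputStream. io/reader)]
    ;; `into` is eager, so all lines are consumed before the reader closes.
    (into [] (keep parse-line) (line-seq rdr))))

;; Demo on a tiny temporary file holding two values from this chapter.
(def demo-file (File/createTempFile "spectrum" ".txt.gz"))

(with-open [w (-> demo-file io/output-stream GZIPOutputStream. io/writer)]
  (.write w "# hypothetical comment line\n1959.98679805 221\n1960.40223979 333\n"))

(def demo-spectrum (read-spectrum demo-file))
```

Here demo-spectrum holds two maps, one per data line, and the comment line is dropped.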

Let’s pick the first file returned by the ingestion utilities:

(def spectrum-path
  (-> (ingestion/find-data-files "txt.gz")
      first))
spectrum-path
"data/DRIAMS/DRIAMS-A/raw/2015/000d2b4a-ca7f-41c6-a9a2-968874ee9ce4.txt.gz"

Load it as a tablecloth dataset:

(def raw-spectrum
  (ingestion/load-raw-spectrum spectrum-path))
raw-spectrum

data/DRIAMS/DRIAMS-A/raw/2015/000d2b4a-ca7f-41c6-a9a2-968874ee9ce4.txt.gz [20745 2]:

         :mass  :intensity
 1959.98679805         221
 1960.40223979         333
 1960.81772567         428
 1961.23325571         503
 1961.64882989         303
 1962.06444823         365
 1962.48011071         301
 1962.89581735         390
 1963.31156813         372
 1963.72736306         412
           ...         ...
20123.00622194          60
20124.34573575          37
20125.68529456         111
20127.02489835         109
20128.36454714          76
20129.70424091          69
20131.04397968          79
20132.38376344          21
20133.72359219          73
20135.06346593         119
20136.40338467          62

Basic statistics:

{:rows (tc/row-count raw-spectrum)
 :mass-min (-> raw-spectrum :mass first)
 :mass-max (-> raw-spectrum :mass last)}
{:rows 20745, :mass-min 1959.98679804911, :mass-max 20136.4033846666}
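
Taking the first and last values as the minimum and maximum relies on the rows being sorted by ascending mass, which holds for the rows printed above. The following sketch checks that assumption on a few masses copied from that table and also compares the sample spacing at both ends. The helper names are illustrative, and only eight of the 20745 values are used.

```clojure
;; Masses copied from the first and last rows of the printed spectrum.
(def head-masses [1959.98679805 1960.40223979 1960.81772567 1961.23325571])
(def tail-masses [20132.38376344 20133.72359219 20135.06346593 20136.40338467])
(def masses (into head-masses tail-masses))

(defn ascending?
  "True when every value is <= the value that follows it."
  [xs]
  (every? (fn [[a b]] (<= a b)) (partition 2 1 xs)))

(defn spacings
  "Differences between consecutive values."
  [xs]
  (mapv (fn [[a b]] (- b a)) (partition 2 1 xs)))

{:ascending? (ascending? masses)
 :mass-min   (apply min masses)             ; order-independent minimum
 :mass-max   (apply max masses)             ; order-independent maximum
 :head-step  (first (spacings head-masses)) ; spacing at the low-mass end
 :tail-step  (last (spacings tail-masses))} ; spacing at the high-mass end
```

On these values the spacing is about 0.42 Da at the low end and about 1.34 Da at the high end, so the mass axis is not uniformly sampled.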

Visualizing the raw spectrum

(-> raw-spectrum
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Raw MALDI-TOF spectrum"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (a.u.)"})
    (plotly/layer-line)
    plotly/plot)

Metadata

The id/ directory contains species identification and antimicrobial resistance labels (R/S/I) for each spectrum.

(def metadata
  (ingestion/load-metadata {:site :A :year 2018}))
{:spectra (tc/row-count metadata)
 :columns (count (tc/column-names metadata))}
{:spectra 30069, :columns 87}

A few rows (showing selected antibiotic columns):

(-> metadata
    (tc/select-columns [:code :species :Cefepime :Ciprofloxacin :Ceftriaxone]))

data/DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz [30069 5]:

:code                                        :species                    :Cefepime  :Ciprofloxacin  :Ceftriaxone
18e02f6b-4b84-4344-9b7a-2a9cc2b5e2b6_MALDI1  Pseudomonas aeruginosa      S          S               -
e9544679-3f9d-43f6-8ce3-aac053980742_MALDI1  Candida glabrata            -          -               -
bfcad108-864f-4b37-83f3-d7dc94265213_MALDI1  Klebsiella pneumoniae       S          S               S
c649f842-5926-4bb3-8aef-d411db4241f4_MALDI1  Staphylococcus capitis      R          S               R
69eca649-ec26-4f9d-9f9a-d42aa5b9ec0f_MALDI1  Escherichia coli            S          S               S
2132c91c-7b62-4ea4-9984-6f7fcdaed7d6_MALDI1  Escherichia coli            S          S               S
554d747d-77d6-4f66-b24a-dc1132943e54_MALDI1  Staphylococcus aureus       S          S               S
abce6ff4-92ec-4b63-b971-4a3cc06441b0_MALDI1  Staphylococcus aureus       S          S               S
0a430536-9d43-406c-9fef-0a9ffebb41f0_MALDI1  Staphylococcus capitis      R          S               R
ab900e35-1954-4201-a95e-dd0719a3a3ef_MALDI1  Gardnerella vaginalis
...
c86ab62e-75f3-43da-ab1c-29152646699b_MALDI2  Pseudomonas aeruginosa
ca258726-2047-4a75-ab65-60a81bcfc960_MALDI2  Proteus mirabilis
d012c864-4676-4439-80f8-4b8dceb5121b_MALDI2  Staphylococcus epidermidis
da8a5356-7edd-4f55-b0c5-843ca65999ce_MALDI2  Staphylococcus aureus
df3fe614-0998-4650-a4b4-fffd354de434_MALDI2  Staphylococcus aureus
e5db7ba3-3f65-47e4-8f2f-f375e3c34d3e_MALDI2  Escherichia coli
f3170944-adfd-4ea7-995b-7c82402ddb79_MALDI2  Escherichia coli
f4364fbe-e053-4227-a543-73d6c633fb7e_MALDI2  Enterococcus faecalis
fc1ef8b3-9012-48a7-9386-f4363ee942f8_MALDI2  Enterococcus faecalis
fcbc835a-1cea-48ad-8e56-82afcb867f31_MALDI2  Actinomyces turicensis
a4468a4e-3e12-4685-aea4-4c8b50e68509_MALDI2  Escherichia coli
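
The label columns can be tallied per antibiotic. The sketch below uses plain Clojure maps standing in for the first ten rows printed above. The helper names are hypothetical, and treating "-" and blank cells as "no label" when computing a resistant fraction is an assumption of this sketch, not something the dataset documentation is quoted on here.

```clojure
;; First ten rows of the printed table; nil stands for a blank cell.
(def sample-rows
  [{:species "Pseudomonas aeruginosa" :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "-"}
   {:species "Candida glabrata"       :Cefepime "-" :Ciprofloxacin "-" :Ceftriaxone "-"}
   {:species "Klebsiella pneumoniae"  :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "S"}
   {:species "Staphylococcus capitis" :Cefepime "R" :Ciprofloxacin "S" :Ceftriaxone "R"}
   {:species "Escherichia coli"       :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "S"}
   {:species "Escherichia coli"       :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "S"}
   {:species "Staphylococcus aureus"  :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "S"}
   {:species "Staphylococcus aureus"  :Cefepime "S" :Ciprofloxacin "S" :Ceftriaxone "S"}
   {:species "Staphylococcus capitis" :Cefepime "R" :Ciprofloxacin "S" :Ceftriaxone "R"}
   {:species "Gardnerella vaginalis"  :Cefepime nil :Ciprofloxacin nil :Ceftriaxone nil}])

(defn label-counts
  "Counts every distinct cell value in one antibiotic column."
  [rows antibiotic]
  (frequencies (map antibiotic rows)))

(defn resistant-fraction
  "Share of \"R\" among rows labeled R, S or I; nil when no row is labeled."
  [rows antibiotic]
  (let [labels (filter #{"R" "S" "I"} (map antibiotic rows))]
    (when (seq labels)
      (/ (count (filter #{"R"} labels)) (count labels)))))

(label-counts sample-rows :Cefepime)
(resistant-fraction sample-rows :Cefepime)
```

For these ten rows, the Cefepime column holds six "S", two "R", one "-" and one blank, so the resistant fraction among labeled rows is 1/4.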

Species distribution

How many distinct species are there, and which are most common?

(def species-counts
  (-> metadata
      (tc/group-by [:species])
      (tc/aggregate {:count tc/row-count})
      (tc/order-by [:count] :desc)))
(tc/select-rows species-counts (range 10))

_unnamed [10 2]:

:species :count
Staphylococcus epidermidis 2554
Staphylococcus aureus 2191
Escherichia coli 1970
Pseudomonas aeruginosa 1463
Enterococcus faecalis 1336
Klebsiella pneumoniae 1099
Gardnerella vaginalis 1053
Propionibacterium acnes 739
Candida albicans 735
Streptococcus agalactiae 669
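
The same counting logic can be written without a dataset library, which also makes the number of distinct species explicit. This is a plain-Clojure sketch on the species of the first ten metadata rows shown earlier, not on the full 30069-row table. The name top-species is illustrative, and ties are broken alphabetically as an arbitrary choice.

```clojure
;; Species of the first ten metadata rows printed earlier in this chapter.
(def species-column
  ["Pseudomonas aeruginosa" "Candida glabrata" "Klebsiella pneumoniae"
   "Staphylococcus capitis" "Escherichia coli" "Escherichia coli"
   "Staphylococcus aureus" "Staphylococcus aureus"
   "Staphylococcus capitis" "Gardnerella vaginalis"])

(defn top-species
  "Returns the n most frequent species as {:species .. :count ..} maps,
  ordered by descending count, then by name."
  [species n]
  (->> (frequencies species)
       (sort-by (juxt (comp - val) key))
       (take n)
       (mapv (fn [[s c]] {:species s :count c}))))

{:distinct (count (distinct species-column))
 :top      (top-species species-column 3)}
```

On this ten-row sample there are 7 distinct species, and the top three (two rows each) are Escherichia coli, Staphylococcus aureus and Staphylococcus capitis.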

The top species as a bar chart:

(-> species-counts
    (tc/select-rows (range 15))
    (plotly/base {:=x :count
                  :=y :species
                  :=title "Top 15 species — DRIAMS-A 2018"
                  :=x-title "Number of spectra"
                  :=y-title ""})
    (plotly/layer-bar)
    plotly/plot
    (assoc-in [:data 0 :orientation] :h)
    (assoc-in [:layout :margin :l] 200))

Antibiotics

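The metadata has one column per antibiotic, with values “R” (resistant), “S” (susceptible), or “I” (intermediate). As the sample rows above show, a cell can also hold “-” or be empty when no R/S/I label is recorded. Which antibiotics are tested?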

(def antibiotic-columns
  (->> (tc/column-names metadata)
       (remove #{:code :species :laboratory_species
                 :combined_code :column-0 (keyword "Unnamed: 0")})))
(count antibiotic-columns)
82
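
A set used as a predicate with remove silently ignores names that do not occur in the input. That would explain why six names are excluded here while the column count only drops from 87 to 82. The sketch below illustrates this on a small, made-up column list. Which of the six names is absent from the real file is not shown in this notebook, so the choice of :column-0 below is purely an assumption.

```clojure
;; Made-up column list for illustration; the real metadata has 87 columns.
(def column-names
  [:code :species :laboratory_species :combined_code (keyword "Unnamed: 0")
   :Cefepime :Ciprofloxacin :Ceftriaxone])

;; The same exclusion set as in the notebook code above.
(def non-antibiotic
  #{:code :species :laboratory_species
    :combined_code :column-0 (keyword "Unnamed: 0")})

;; Columns that survive the filter.
(def antibiotics (remove non-antibiotic column-names))

;; Excluded names that never matched a column.
(def unmatched (remove (set column-names) non-antibiotic))

{:antibiotics antibiotics
 :unmatched   unmatched}
```

Here antibiotics is (:Cefepime :Ciprofloxacin :Ceftriaxone) and unmatched is (:column-0). Running the same unmatched check against the real column names would show which exclusion is unused.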

source: notebooks/amr_book/exploring_driams.clj