What machine learning models work best for spectral classification?

For small datasets (under 500 samples), SVMs and random forests with hand-crafted features consistently outperform deep learning. For larger datasets, 1D convolutional neural networks and 1D-ResNet architectures learn spectral features directly from raw data and achieve state-of-the-art accuracy. The preprocessing pipeline — baseline correction, normalization, and optional dimensionality reduction — often matters more than model choice.

How do you preprocess spectra for machine learning?

The standard pipeline is: baseline correction (rubber band, asymmetric least squares, or polynomial fitting), normalization (Standard Normal Variate or Multiplicative Scatter Correction to remove intensity variations), and optionally dimensionality reduction (PCA or feature selection to remove noise and reduce overfitting). The exact steps depend on the modality — Raman spectra need cosmic ray removal, FTIR spectra need ATR correction, and NIR spectra benefit from derivative preprocessing.

Can AI diagnose diseases from a spectrum?

Yes. Trained models classify spectra with greater than 95% accuracy for many clinical applications, including bacterial identification, cancer tissue typing, and blood biomarker detection. However, deploying these models clinically requires FDA clearance (or equivalent regulatory authorization), clinical validation on independent patient cohorts, and software that meets IEC 62304 and 21 CFR Part 11 requirements.

Spectral Classification with ML: Building Clinical AI Pipelines

A spectrum is a vector of numbers. A clinical result is a decision: positive or negative, malignant or benign, pathogen A or pathogen B, concentration above or below the clinical threshold. The machine learning pipeline that transforms one into the other is the core intellectual property of any spectroscopy-based diagnostic.

This article covers the full pipeline - from raw spectrum to deployed clinical model. It is written for ML engineers and data scientists who know how to train models but have never worked with spectral data. If you are a spectroscopist who has never deployed a model, the second half on validation and deployment will be the most relevant.

We will use Python throughout, with scikit-learn, scipy, and PyTorch. The concepts apply equally to FTIR, Raman, and NIR spectra - see our comparison of spectroscopic modalities for the physics behind each.

The Pipeline

Raw Spectrum
    │
    ├── 1. Preprocessing
    │       Baseline correction, normalization,
    │       smoothing, derivatives
    │
    ├── 2. Feature Extraction
    │       PCA, peak ratios, or learned (CNN)
    │
    ├── 3. Model Training
    │       PLS-DA, SVM, Random Forest, CNN
    │
    ├── 4. Validation
    │       Patient-level splits, cross-instrument,
    │       temporal validation
    │
    ├── 5. Deployment
    │       ONNX export, latency, confidence thresholds
    │
    └── 6. Monitoring
            Drift detection, performance tracking,
            regulatory documentation

Each step has domain-specific considerations that differ from typical tabular or image ML. Let us walk through them.

Step 1: Spectral Preprocessing

Raw spectral data is not ready for modeling. Instrument artifacts, baseline drift, noise, and scaling differences between instruments and sessions must be removed before any model can learn meaningful patterns. For a comprehensive comparison of preprocessing methods, see our spectral preprocessing for clinical ML guide.

Baseline Correction

Spectra often sit on top of a slowly varying baseline caused by scattering, fluorescence (in Raman), or ATR crystal effects (in FTIR). This baseline carries no chemical information and must be removed.

Asymmetric Least Squares (ALS) is the most widely used method for automated baseline correction:

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
 
def als_baseline(y, lam=1e5, p=0.01, niter=10):
    """
    Asymmetric Least Squares baseline estimation.
    
    The key insight: the baseline should be smooth (enforced by
    the smoothness penalty lam) and should lie below the peaks
    (enforced by asymmetric weighting p).
    
    Parameters:
        y: intensity array
        lam: smoothness (1e4 for noisy data, 1e7 for clean data)
        p: asymmetry (0.001-0.01 typical)
        niter: iterations (10 is usually sufficient)
    """
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    w = np.ones(L)
    for _ in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.T)
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y <= z)
    return z
 
# Apply baseline correction
baseline = als_baseline(spectrum, lam=1e5, p=0.01)
corrected = spectrum - baseline

Rubber band correction is simpler but effective for spectra with clear peaks. It stretches a convex hull underneath the spectrum and subtracts it. This works well for FTIR absorbance spectra where peaks point upward from a relatively flat baseline.

Normalization

Different instruments, different measurement days, and different sample thicknesses produce spectra with different overall intensities. Normalization removes this scaling variation so the model focuses on spectral shape, not absolute intensity.

Standard Normal Variate (SNV) is the most common normalization for spectral data:

def snv(spectra: np.ndarray) -> np.ndarray:
    """
    Standard Normal Variate normalization.
    
    Each spectrum is centered (subtract mean) and scaled
    (divide by std). This removes multiplicative and additive
    scatter effects.
    
    Parameters:
        spectra: 2D array (n_samples, n_wavenumbers)
    
    Returns:
        SNV-normalized spectra (same shape)
    """
    mean = np.mean(spectra, axis=1, keepdims=True)
    std = np.std(spectra, axis=1, keepdims=True)
    return (spectra - mean) / std

Vector normalization divides each spectrum by its L2 norm, projecting all spectra onto the unit hypersphere. This is useful when you want to compare spectral shapes regardless of concentration:

def vector_normalize(spectra: np.ndarray) -> np.ndarray:
    """Normalize each spectrum to unit L2 norm."""
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / norms

Min-max normalization scales each spectrum to [0, 1]. Simple but sensitive to outlier peaks. Generally less robust than SNV for spectral data.

Smoothing

Spectral noise - random intensity fluctuations from detector noise, shot noise, and electronic noise - degrades model performance. Smoothing reduces noise while preserving peak shapes.

Savitzky-Golay filtering is the standard for spectral smoothing. It fits a local polynomial to a sliding window, which smooths noise while preserving peak positions and shapes better than a simple moving average:

from scipy.signal import savgol_filter
 
# Typical parameters for spectral data:
# window_length: 7-15 (must be odd)
# polyorder: 2-3 (lower = more smoothing)
smoothed = savgol_filter(spectrum, window_length=11, polyorder=3)

The window length controls the trade-off between noise reduction and peak preservation. Too wide and you broaden peaks and lose resolution. Too narrow and you retain noise. For FTIR data at 4 cm^-1 resolution, a window of 7-11 points works well. For Raman data with coarser resolution, 11-15 is typical.

Derivative Spectra

First and second derivatives of spectra are powerful features - they resolve overlapping peaks that are invisible in the raw spectrum and remove baseline offsets entirely (first derivative) or linear baselines (second derivative).

# Savitzky-Golay derivative - smoothing and differentiation in one step
first_derivative = savgol_filter(spectrum, 
                                  window_length=15, 
                                  polyorder=3, 
                                  deriv=1)
 
second_derivative = savgol_filter(spectrum, 
                                   window_length=15, 
                                   polyorder=3, 
                                   deriv=2)

Second derivatives are particularly useful in biological spectroscopy. The amide I band (1600-1700 cm^-1 in FTIR) appears as a single broad peak in the raw spectrum but resolves into multiple components in the second derivative - alpha-helix (~1656 cm^-1), beta-sheet (~1635 cm^-1), turns (~1672 cm^-1). These sub-peak positions shift in disease states, and the second derivative makes them visible to the model.

Spectral Region Selection

Not all wavenumbers are informative. The fingerprint region (900-1800 cm^-1 for biological FTIR, 400-1800 cm^-1 for Raman) contains the most diagnostically relevant information. The C-H stretching region (2800-3100 cm^-1) carries lipid information. Outside these regions, you are mostly modeling noise and water absorption.

def select_region(wavenumbers, spectrum, 
                  regions=[(900, 1800), (2800, 3100)]):
    """
    Select diagnostically relevant spectral regions.
    
    Parameters:
        wavenumbers: 1D array of wavenumber values
        spectrum: 1D intensity array (or 2D for multiple spectra)
        regions: list of (start, end) wavenumber ranges to keep
    """
    mask = np.zeros(len(wavenumbers), dtype=bool)
    for start, end in regions:
        mask |= (wavenumbers >= start) & (wavenumbers <= end)
    
    if spectrum.ndim == 1:
        return wavenumbers[mask], spectrum[mask]
    else:
        return wavenumbers[mask], spectrum[:, mask]

Putting It Together

A complete preprocessing pipeline for FTIR tissue classification:

from scipy.signal import savgol_filter
 
def preprocess_ftir(wavenumbers, spectrum):
    """
    Standard preprocessing pipeline for FTIR tissue spectra.
    
    Returns preprocessed spectrum ready for model input.
    """
    # 1. Select fingerprint + lipid regions
    wn, spec = select_region(wavenumbers, spectrum,
                              regions=[(900, 1800), (2800, 3100)])
    
    # 2. Baseline correction (ALS)
    baseline = als_baseline(spec, lam=1e5, p=0.01)
    spec = spec - baseline
    
    # 3. Smoothing (Savitzky-Golay)
    spec = savgol_filter(spec, window_length=9, polyorder=3)
    
    # 4. Normalization (SNV)
    mean = np.mean(spec)
    std = np.std(spec)
    spec = (spec - mean) / std
    
    return wn, spec

Step 2: Feature Extraction

After preprocessing, you have a clean spectrum - typically 500 to 2000 data points per sample. You can feed this directly into a model (and CNNs prefer this) or extract features first (required for classical ML methods to avoid the curse of dimensionality).

PCA (Principal Component Analysis)

PCA is the workhorse of chemometrics. It reduces the dimensionality of spectral data while preserving the variance. The first 5-20 principal components typically capture 95%+ of the meaningful variation in a spectral dataset.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
 
# spectra: (n_samples, n_wavenumbers) - already preprocessed
scaler = StandardScaler()
spectra_scaled = scaler.fit_transform(spectra)
 
pca = PCA(n_components=20)
scores = pca.fit_transform(spectra_scaled)
 
# How much variance do we capture?
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"Components for 95% variance: {n_95}")
 
# Loadings tell you which wavenumbers drive each component
# This is critical for interpretability in clinical applications
loadings = pca.components_  # (n_components, n_wavenumbers)

PCA scores make excellent inputs for classical classifiers (PLS-DA, SVM, Random Forest). The loadings provide interpretability - you can examine which spectral regions (and therefore which molecular vibrations) drive the classification. This matters for regulatory submissions where you need to explain why the model works.

Peak-Based Features

Domain experts often know which spectral features are diagnostically relevant. Extracting explicit features - peak positions, peak heights, peak area ratios - produces a small, interpretable feature set:

from scipy.signal import find_peaks
 
def extract_peak_features(wavenumbers, spectrum, 
                          target_peaks=[1545, 1650, 2850, 2920]):
    """
    Extract peak-based features at known positions.
    
    For tissue classification, key peaks include:
    - 1545 cm-1: Amide II (protein)
    - 1650 cm-1: Amide I (protein secondary structure)
    - 2850 cm-1: symmetric CH2 stretch (lipid)
    - 2920 cm-1: asymmetric CH2 stretch (lipid)
    """
    features = {}
    
    for target in target_peaks:
        # Find the closest measured wavenumber
        idx = np.argmin(np.abs(wavenumbers - target))
        
        # Peak height
        features[f"peak_{target}"] = spectrum[idx]
    
    # Ratios are often more informative than absolute heights
    if 1650 in target_peaks and 1545 in target_peaks:
        idx_1650 = np.argmin(np.abs(wavenumbers - 1650))
        idx_1545 = np.argmin(np.abs(wavenumbers - 1545))
        features["amide_ratio"] = (spectrum[idx_1650] / 
                                    spectrum[idx_1545])
    
    if 2920 in target_peaks and 2850 in target_peaks:
        idx_2920 = np.argmin(np.abs(wavenumbers - 2920))
        idx_2850 = np.argmin(np.abs(wavenumbers - 2850))
        features["lipid_ratio"] = (spectrum[idx_2920] / 
                                    spectrum[idx_2850])
    
    return features

Peak-based features are highly interpretable and work well when the diagnostic mechanism is understood. The amide I/II ratio tracks protein secondary structure changes that occur in cancer. The CH2 symmetric/asymmetric stretch ratio tracks lipid chain order in cell membranes. When your model uses these features, you can explain to a clinician (and a regulator) exactly what molecular information drives the classification.

Learned Features (1D CNN)

Convolutional neural networks learn features directly from preprocessed spectra, without requiring manual feature engineering. A 1D CNN treats the spectrum as a 1D signal and learns spectral patterns (analogous to edges and textures in image CNNs) that distinguish classes.

This is increasingly the preferred approach when sufficient training data is available (500+ spectra per class), because the network discovers features that domain experts may not have considered. But it sacrifices interpretability - the features are learned filter responses, not named molecular vibrations.

Step 3: Model Selection

The right model depends on your dataset size, the clinical requirements, and how much interpretability you need for regulatory purposes.

PLS-DA (Partial Least Squares Discriminant Analysis)

PLS-DA is the default starting point in chemometrics. It simultaneously performs dimensionality reduction and classification, finding latent variables that maximize the covariance between spectral data and class labels. It works well with small datasets (30-100 spectra per class) and produces interpretable models.

from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer
 
class PLSDA:
    """
    PLS-DA classifier for spectral data.
    
    PLS regression is used with binarized class labels as the Y
    matrix. Classification is by maximum predicted Y value.
    """
    def __init__(self, n_components=10):
        self.pls = PLSRegression(n_components=n_components)
        self.lb = LabelBinarizer()
    
    def fit(self, X, y):
        Y = self.lb.fit_transform(y)
        if Y.shape[1] == 1:
            # Binary case: LabelBinarizer returns single column
            Y = np.hstack([1 - Y, Y])
        self.pls.fit(X, Y)
        return self
    
    def predict(self, X):
        Y_pred = self.pls.predict(X)
        class_indices = np.argmax(Y_pred, axis=1)
        return self.lb.classes_[class_indices]
    
    def predict_proba(self, X):
        """Pseudo-probabilities from PLS predictions."""
        Y_pred = self.pls.predict(X)
        # Clip to [0, 1] and normalize
        Y_pred = np.clip(Y_pred, 0, None)
        row_sums = Y_pred.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1  # avoid division by zero
        return Y_pred / row_sums
 
# Usage
model = PLSDA(n_components=10)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

When to use PLS-DA: Small datasets (< 200 spectra total), need for interpretability (VIP scores identify important wavenumbers), regulatory settings where model simplicity is valued.

SVM (Support Vector Machine)

SVMs excel at binary classification with moderate dataset sizes. For spectral data, an RBF (radial basis function) kernel handles the nonlinear relationships between spectral features and class labels effectively.

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
 
# SVM with RBF kernel - good for spectral classification
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=30)),
    ('svm', SVC(kernel='rbf', probability=True))
])
 
# Hyperparameter search
param_grid = {
    'pca__n_components': [10, 20, 30, 50],
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.001]
}
 
grid = GridSearchCV(pipeline, param_grid, cv=5, 
                     scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
 
print(f"Best AUC: {grid.best_score_:.3f}")
print(f"Best params: {grid.best_params_}")

When to use SVM: Binary classification (positive/negative), 50-500 spectra, good generalization with limited data.

Random Forest

Random forests are robust baselines that handle class imbalance well and provide feature importance rankings for free. They also require minimal hyperparameter tuning.

from sklearn.ensemble import RandomForestClassifier
 
# Random Forest - robust baseline with built-in feature importance
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_leaf=5,
    class_weight='balanced',  # handles class imbalance
    random_state=42,
    n_jobs=-1
)
 
rf.fit(X_train, y_train)
 
# Feature importance - which wavenumbers matter most?
importances = rf.feature_importances_
 
# Plot importance vs. wavenumber to identify diagnostic regions
# High-importance wavenumbers correspond to molecular vibrations
# that distinguish the classes

When to use Random Forest: Good baseline for any dataset size, multi-class problems, when you want quick feature importance analysis.

1D CNN

When you have sufficient data (500+ spectra per class), a 1D convolutional neural network outperforms classical methods by learning hierarchical spectral features directly from the data.

import torch
import torch.nn as nn
 
class SpectralCNN(nn.Module):
    """
    1D CNN for spectral classification.
    
    Architecture follows the pattern proven effective for 
    spectral data: Conv1D blocks with increasing filters,
    followed by global average pooling and dense layers.
    """
    def __init__(self, input_length, n_classes):
        super().__init__()
        
        self.features = nn.Sequential(
            # Block 1: capture broad spectral features
            nn.Conv1d(1, 32, kernel_size=11, padding=5),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(2),
            
            # Block 2: capture finer spectral patterns
            nn.Conv1d(32, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            
            # Block 3: capture peak-level features
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1)  # global average pooling
        )
        
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, n_classes)
        )
    
    def forward(self, x):
        # x shape: (batch, 1, spectrum_length)
        x = self.features(x)
        x = self.classifier(x)
        return x
 
# Training setup
model = SpectralCNN(input_length=1024, n_classes=3)
criterion = nn.CrossEntropyLoss(
    weight=torch.tensor([1.0, 2.0, 1.5])  # class weights
)
optimizer = torch.optim.AdamW(model.parameters(), 
                               lr=1e-3, 
                               weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100
)

The kernel sizes in each block are deliberately chosen. The first block uses kernel size 11 to capture broad spectral features (baselines, overall shape). The second block (kernel 7) captures medium-scale features (peak groups, band envelopes). The third block (kernel 5) captures fine features (individual peaks, shoulders). This hierarchy mirrors how spectroscopists read spectra - broad shape first, then zoom into specific regions.

When to use CNN: 500+ spectra per class, when maximum accuracy is needed, when you can afford reduced interpretability.

Decision Guide

Dataset Size	Recommended Model	Rationale
< 50 per class	PLS-DA	Handles high-dimensional data with very few samples
50-200 per class	SVM (RBF) or PLS-DA	SVM generalizes well; PLS-DA if interpretability is critical
200-500 per class	Random Forest or SVM	RF gives feature importance; SVM if binary
500+ per class	1D CNN	Sufficient data for deep learning to outperform
Any size, multi-modal	CNN with multi-input	Combine FTIR + Raman in parallel branches

Step 4: Validation

Validation is where most spectral classification projects fail. A model that shows 99% accuracy in development can drop to 70% in clinical deployment. The cause is almost always a validation methodology that leaks information the model will not have in production.

The Cardinal Rule: Patient-Level Splits

Never split your dataset randomly at the spectrum level. If you measure 10 spectra from patient A's tissue sample, those 10 spectra are correlated - they share patient-specific biological characteristics that have nothing to do with the diagnostic target. If 7 end up in training and 3 in testing, your model can achieve high test accuracy by recognizing patient A, not by recognizing the disease.

Leave-one-patient-out (LOPO) cross-validation is the gold standard:

from sklearn.model_selection import LeaveOneGroupOut
 
def lopo_cross_validation(X, y, patient_ids, model_class, 
                           model_params=None):
    """
    Leave-One-Patient-Out cross-validation.
    
    All spectra from a single patient are held out for testing.
    The model is trained on all other patients. This is repeated
    for every patient.
    
    This is computationally expensive but gives the most honest
    estimate of clinical performance.
    """
    logo = LeaveOneGroupOut()
    
    y_true_all = []
    y_pred_all = []
    y_prob_all = []
    patient_results = {}
    
    for train_idx, test_idx in logo.split(X, y, patient_ids):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        patient_id = patient_ids[test_idx[0]]
        
        # Train model on all patients except one
        if model_params:
            model = model_class(**model_params)
        else:
            model = model_class()
        model.fit(X_train, y_train)
        
        # Predict on held-out patient
        y_pred = model.predict(X_test)
        
        y_true_all.extend(y_test)
        y_pred_all.extend(y_pred)
        
        # Per-patient result (majority vote across spectra)
        from collections import Counter
        patient_prediction = Counter(y_pred).most_common(1)[0][0]
        patient_truth = Counter(y_test).most_common(1)[0][0]
        patient_results[patient_id] = {
            'true': patient_truth,
            'predicted': patient_prediction,
            'n_spectra': len(y_test)
        }
    
    return y_true_all, y_pred_all, patient_results

Cross-Instrument Validation

A model trained on spectra from instrument A must work on instrument B. Instrument-to-instrument variability (different detector sensitivities, laser powers, optical alignments) introduces systematic spectral differences that a model can exploit instead of learning genuine chemistry.

Test this explicitly: train on instruments 1 and 2, test on instrument 3. If accuracy drops substantially, your preprocessing is not removing instrument-specific artifacts effectively, or you need to include instrument transfer calibration in your pipeline.

def cross_instrument_validation(X, y, instrument_ids):
    """
    Leave-one-instrument-out validation.
    
    Trains on all instruments except one, tests on the held-out
    instrument. Reveals whether the model generalizes across
    hardware.
    """
    unique_instruments = np.unique(instrument_ids)
    results = {}
    
    for held_out in unique_instruments:
        train_mask = instrument_ids != held_out
        test_mask = instrument_ids == held_out
        
        X_train, y_train = X[train_mask], y[train_mask]
        X_test, y_test = X[test_mask], y[test_mask]
        
        model = SVC(kernel='rbf', probability=True)
        model.fit(X_train, y_train)
        
        accuracy = model.score(X_test, y_test)
        results[held_out] = accuracy
        print(f"Instrument {held_out} held out: "
              f"accuracy = {accuracy:.3f}")
    
    return results

Temporal Validation

Disease prevalence, sample handling procedures, and reagent lots change over time. A model trained on samples from January through March must work on samples from April onward. Temporal validation catches concept drift before it reaches patients.

Split your data chronologically: train on the first 70% of samples (by collection date), test on the last 30%. If accuracy degrades, investigate what changed - it may be a real distribution shift in the patient population, a change in sample preparation, or instrument drift.

Clinical Performance Metrics

Accuracy alone is not sufficient for clinical applications. You need:

Sensitivity (recall): proportion of true positives detected. For a screening test, this must be very high (>95%) - you cannot miss cancer.
Specificity: proportion of true negatives correctly identified. For a confirmatory test, this must be very high (>95%) - you cannot falsely diagnose.
ROC curve and AUC: the operating characteristic across all decision thresholds. AUC > 0.95 is the typical minimum for clinical deployment.
Operating point selection: the threshold on your model's probability output that determines the positive/negative cutoff. This is a clinical decision, not a data science decision - it depends on the relative cost of false positives vs. false negatives for your specific clinical context.

from sklearn.metrics import roc_curve, auc, confusion_matrix
 
def clinical_performance_report(y_true, y_prob, 
                                 target_sensitivity=0.95):
    """
    Generate a clinical performance report.
    
    Instead of picking the threshold that maximizes accuracy,
    find the threshold that achieves the target sensitivity
    and report the corresponding specificity.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    
    # Find threshold for target sensitivity
    idx = np.argmin(np.abs(tpr - target_sensitivity))
    operating_threshold = thresholds[idx]
    operating_sensitivity = tpr[idx]
    operating_specificity = 1 - fpr[idx]
    
    # Apply threshold
    y_pred = (y_prob >= operating_threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    print(f"AUC: {roc_auc:.4f}")
    print(f"Operating threshold: {operating_threshold:.4f}")
    print(f"Sensitivity: {operating_sensitivity:.4f} "
          f"(target: {target_sensitivity})")
    print(f"Specificity: {operating_specificity:.4f}")
    print(f"PPV: {tp/(tp+fp):.4f}")
    print(f"NPV: {tn/(tn+fn):.4f}")
    print(f"Confusion matrix: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
    
    return {
        'auc': roc_auc,
        'threshold': operating_threshold,
        'sensitivity': operating_sensitivity,
        'specificity': operating_specificity,
        'ppv': tp / (tp + fp),
        'npv': tn / (tn + fn)
    }

Step 5: Deployment

A trained model sitting in a Jupyter notebook is not a clinical tool. Deployment means embedding the model in a production system that runs inference in real time, handles edge cases gracefully, and integrates with the clinical workflow.

Model Export

For deployment, export your model to a format that is independent of the training framework:

ONNX - the most portable option. Works with Python, C++, C#, Java, JavaScript. Supported by nearly every inference runtime.
TorchScript - for staying in the PyTorch ecosystem. Good for Python and C++ deployment.
Pickle/joblib - for scikit-learn models. Only works in Python, and version mismatches can break deserialization.

import torch
 
# Export PyTorch CNN to ONNX
model.eval()
dummy_input = torch.randn(1, 1, 1024)  # (batch, channels, length)
 
torch.onnx.export(
    model,
    dummy_input,
    "spectral_classifier_v1.2.onnx",
    input_names=["spectrum"],
    output_names=["class_logits"],
    dynamic_axes={
        "spectrum": {0: "batch_size"},
        "class_logits": {0: "batch_size"}
    },
    opset_version=17
)
 
# Verify ONNX model
import onnxruntime as ort
 
session = ort.InferenceSession("spectral_classifier_v1.2.onnx")
result = session.run(
    None, 
    {"spectrum": dummy_input.numpy()}
)
print(f"ONNX output shape: {result[0].shape}")

Inference Latency

In a clinical workflow, the clinician places a sample on the spectrometer, the instrument acquires a spectrum (1-30 seconds depending on modality and integration time), and the classification result must appear within 1 second of acquisition completing. The clinician is standing there waiting. Two seconds feels slow. Five seconds and they walk away.

Profile your inference pipeline end-to-end:

import time
 
def benchmark_inference(model_path, spectrum, n_runs=100):
    """
    Benchmark inference latency for a deployed model.
    
    Measures the full pipeline: preprocessing + inference.
    Target: < 200ms for the ML portion (excluding acquisition).
    """
    session = ort.InferenceSession(model_path)
    
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        
        # Preprocessing (must be included in latency budget)
        processed = preprocess_ftir(wavenumbers, spectrum)
        input_data = processed.reshape(1, 1, -1).astype(np.float32)
        
        # Inference
        result = session.run(None, {"spectrum": input_data})
        
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    
    times = np.array(times) * 1000  # convert to ms
    print(f"Median: {np.median(times):.1f} ms")
    print(f"P95: {np.percentile(times, 95):.1f} ms")
    print(f"P99: {np.percentile(times, 99):.1f} ms")

For scikit-learn models (PLS-DA, SVM, Random Forest), inference is typically < 5ms. For 1D CNNs, ONNX Runtime on CPU typically achieves < 50ms for the model sizes used in spectral classification. Preprocessing (baseline correction, normalization) usually takes longer than inference - 20-100ms depending on spectrum length and the algorithms used.

Confidence Thresholds and the Indeterminate Category

Not every spectrum produces a confident classification. A model may encounter a spectrum that looks unlike anything in its training set - a contaminated sample, a new sample type, an instrument malfunction. Deploying a model that always outputs a definitive result is dangerous.

Implement a three-category output: positive, negative, and indeterminate. For rigorous uncertainty quantification using conformal prediction and Monte Carlo dropout, see our confidence scoring for spectral classification deep dive.

def classify_with_confidence(model_session, spectrum,
                              positive_threshold=0.85,
                              negative_threshold=0.15):
    """
    Three-category classification with confidence thresholds.
    
    Spectra that fall between the thresholds are flagged as
    indeterminate - the clinician is notified that the result
    requires manual review or repeat testing.
    """
    input_data = spectrum.reshape(1, 1, -1).astype(np.float32)
    logits = model_session.run(None, {"spectrum": input_data})[0]
    
    # Softmax to get probabilities
    probs = np.exp(logits) / np.sum(np.exp(logits), axis=1, 
                                      keepdims=True)
    max_prob = np.max(probs)
    predicted_class = np.argmax(probs)
    
    if max_prob >= positive_threshold:
        confidence = "high"
        result = CLASS_NAMES[predicted_class]
    elif max_prob <= negative_threshold:
        confidence = "high"
        result = CLASS_NAMES[predicted_class]
    else:
        confidence = "low"
        result = "INDETERMINATE"
    
    return {
        'result': result,
        'confidence': confidence,
        'probability': float(max_prob),
        'all_probabilities': {
            CLASS_NAMES[i]: float(p) 
            for i, p in enumerate(probs[0])
        }
    }

The indeterminate rate (percentage of samples that fall between thresholds) is a key operational metric. Too high (>15%) and the test is not useful - clinicians are constantly being told "I don't know." Too low (approaching 0%) and the thresholds are too permissive, meaning low-confidence results are being reported as definitive.

Model Versioning

Clinical models must be versioned with full traceability. Every prediction must be linked to the exact model version that produced it. When a model is updated, both versions run in parallel (shadow mode) before the new version replaces the old one.

Track at minimum:

Model version identifier (semantic versioning: v1.2.0)
Training dataset hash (which spectra were used to train)
Preprocessing pipeline version (which algorithms and parameters)
Validation metrics at time of deployment (sensitivity, specificity, AUC)
Deployment timestamp
Validated instruments (list of instruments the model has been validated on)

Step 6: Monitoring and Drift Detection

A deployed model is not done. Spectral data distributions shift over time - instrument components age, sample preparation protocols evolve, patient populations change. A model that performed at 97% sensitivity in validation can silently degrade to 85% over six months if no one is watching.

Input Drift Detection

Monitor whether incoming spectra look like the training data. If spectra start arriving that are statistically different from the training distribution, something has changed - and the model's predictions may no longer be reliable.

from scipy.spatial.distance import mahalanobis
 
class SpectralDriftDetector:
    """
    Detect when incoming spectra drift from the training 
    distribution using Mahalanobis distance in PCA space.
    """
    def __init__(self, pca_model, training_scores):
        self.pca = pca_model
        self.mean = np.mean(training_scores, axis=0)
        self.cov_inv = np.linalg.inv(
            np.cov(training_scores.T) + 
            1e-6 * np.eye(training_scores.shape[1])
        )
        
        # Establish baseline distances from training data
        self.baseline_distances = np.array([
            mahalanobis(s, self.mean, self.cov_inv)
            for s in training_scores
        ])
        self.threshold = np.percentile(
            self.baseline_distances, 99
        )
    
    def check(self, spectrum):
        """
        Check if a new spectrum is within the training 
        distribution.
        
        Returns:
            is_within: bool - True if spectrum is normal
            distance: float - Mahalanobis distance
        """
        scores = self.pca.transform(
            spectrum.reshape(1, -1)
        )[0]
        distance = mahalanobis(scores, self.mean, self.cov_inv)
        
        return {
            'within_distribution': distance < self.threshold,
            'mahalanobis_distance': distance,
            'threshold': self.threshold,
            'percentile': (np.sum(
                self.baseline_distances < distance
            ) / len(self.baseline_distances) * 100)
        }

When drift is detected, the system should alert the operations team - not silently continue predicting. The appropriate response depends on the cause: recalibrate the instrument, retrain the model on updated data, or investigate the new sample type.

Prediction Drift

Even without input drift, the distribution of predictions can shift. If a model that historically classified 20% of samples as positive suddenly starts classifying 40% as positive, something has changed - either the patient population or the model's behavior. Track prediction distributions over time and alert on significant shifts.

Regulatory Implications

If your ML model contributes to a clinical decision, you are building Software as a Medical Device (SaMD). The model is not just a convenience - it is a regulated medical device component. This means:

IEC 62304 governs the software development lifecycle. Every code change, every model update, every preprocessing parameter change must follow your software development plan.
ISO 14971 requires a risk analysis of your ML pipeline. What happens when the model is wrong? What is the clinical consequence of a false negative?
FDA guidance on AI/ML-based SaMD (2021, updated 2023) introduces the concept of predetermined change control plans - documenting in advance how your model will be updated post-market.

We cover SaMD classification and regulatory strategy in detail in our SaMD classification guide. For the explainability requirements that FDA and the EU AI Act impose on clinical AI models, see our guide to explainable AI for spectroscopy. If your spectral classification model makes or influences a clinical decision, read those articles before going further.

What SpectraDx Provides

Building the ML pipeline is the intellectually engaging part. Maintaining it in a clinical environment - model versioning, drift monitoring, regulatory documentation, instrument transfer validation, A/B testing of model updates - is the operationally hard part. It is also the part that has nothing to do with spectroscopy or machine learning and everything to do with clinical software engineering.

SpectraDx's platform includes the ML inference pipeline as a core component. You bring your trained model (or we help you train one). We handle:

Model deployment and versioning with full audit trail
Preprocessing pipeline configuration per instrument type
Confidence thresholds and indeterminate result handling
Input drift detection and alerting
Prediction monitoring and performance tracking
Regulatory documentation generation (IEC 62304 artifacts)

The result: your spectral classification model runs in a production clinical environment with the infrastructure it needs to be safe, reliable, and maintainable. Explore our solutions to see the full platform.