SpectraDx
pectraDx
SpectraDxlight to insight
← Back to blog
AI/ML

Spectral Preprocessing for Clinical ML: Methods Compared

Spectral preprocessing determines model performance more than architecture choice. SNV, MSC, Savitzky-Golay, EMSC, and rubberband correction compared.

Spectral Preprocessing for Clinical ML: Methods Compared

You can swap a Random Forest for an XGBoost classifier and gain 2% accuracy. You can replace a 1D CNN with a transformer and gain 3%. You can change your baseline correction from polynomial fitting to asymmetric least squares and gain 12%.

Preprocessing is the highest-leverage step in any spectral classification pipeline. The difference between a good preprocessing chain and a mediocre one often exceeds the difference between any two model architectures on properly preprocessed data. Yet preprocessing gets treated as a footnote - a few lines of code copied from a tutorial, parameters left at defaults, methods chosen because "that is what the lab has always used."

This article is the complete reference for spectral preprocessing in clinical ML contexts. Every method includes working Python code, parameter selection guidance, and analysis of when and why to use it. We benchmark preprocessing pipelines against each other and provide a decision guide for FTIR, Raman, and NIR modalities.

If you have not yet built an end-to-end spectral classification pipeline, start with our ML pipeline guide. If you are working across multiple instruments, our transfer learning guide covers how preprocessing interacts with calibration transfer. For the physics behind each spectroscopic modality, see FTIR vs. Raman vs. NIR.


Why Preprocessing Dominates Model Performance

A raw spectrum is a mixture of chemical information and physical artifacts. The chemical information - molecular vibrations that differ between sample classes - is what the ML model needs. The physical artifacts - baseline drift, scattering effects, detector noise, intensity variations from sample thickness or laser power - are what the model must not learn.

If you feed raw spectra into a classifier, the model can and will learn these artifacts. It might achieve high accuracy on your development data because the artifacts are consistent within a single instrument and session. Then it fails on new instruments, new sessions, or new operators because the artifacts change but the chemistry does not.

Preprocessing removes (or at least reduces) the artifacts while preserving the chemistry. The better your preprocessing, the more your model learns chemistry and the less it learns hardware.


Baseline Correction

The baseline is a slowly varying signal underneath the spectral peaks, caused by phenomena unrelated to the chemistry of interest: fluorescence background (Raman), ATR crystal effects (FTIR), scattering from particles (NIR), or detector offset drift.

Rubberband Correction (Convex Hull)

Rubberband correction stretches a convex hull underneath the spectrum and subtracts it. The convex hull connects the lowest points of the spectrum, forming a "rubber band" stretched beneath the data.

import numpy as np
from scipy.spatial import ConvexHull
 
def rubberband_baseline(wavenumbers, spectrum):
    points = np.column_stack([wavenumbers, spectrum])
    hull = ConvexHull(points)
 
    # Extract lower boundary of convex hull
    hull_vertices = hull.vertices
    hull_points = points[hull_vertices]
 
    # Sort by wavenumber
    sorted_idx = np.argsort(hull_points[:, 0])
    hull_sorted = hull_points[sorted_idx]
 
    # Keep only bottom boundary (below median spectrum value)
    median_y = np.median(spectrum)
    bottom_mask = hull_sorted[:, 1] <= median_y
    # Always include endpoints
    bottom_mask[0] = True
    bottom_mask[-1] = True
    bottom_hull = hull_sorted[bottom_mask]
 
    # Interpolate baseline at all wavenumber positions
    baseline = np.interp(wavenumbers, bottom_hull[:, 0],
                         bottom_hull[:, 1])
    return baseline
 
# Usage
baseline = rubberband_baseline(wavenumbers, spectrum)
corrected = spectrum - baseline

When to use: FTIR absorbance spectra with clear, well-separated peaks and a gently curving baseline. Works well when peaks point upward from a relatively smooth background. Simple and fast.

When to avoid: Raman spectra with strong fluorescence (the fluorescence background is not well approximated by a convex hull), or spectra where the baseline has sharp features.

Asymmetric Least Squares (ALS / arPLS)

ALS fits a smooth curve to the spectrum with an asymmetric penalty - the fitted curve is penalized more for exceeding the spectrum than for falling below it. This pushes the baseline estimate below the peaks.

The adaptive reweighted penalized least squares (arPLS) variant automatically adjusts the weights based on the residuals:

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
 
def als_baseline(y, lam=1e5, p=0.01, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    w = np.ones(L)
    for _ in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.T)
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y <= z)
    return z
 
def arpls_baseline(y, lam=1e5, ratio=0.01, niter=100):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    H = lam * D.dot(D.T)
    w = np.ones(L)
 
    for _ in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + H
        z = spsolve(Z, w * y)
        d = y - z
        # Negative residuals: spectrum below baseline
        dn = d[d < 0]
        m = np.mean(dn)
        s = np.std(dn)
        w_new = 1.0 / (1.0 + np.exp(2.0 * (d - (2 * s - m)) / s))
        if np.linalg.norm(w - w_new) / np.linalg.norm(w) < ratio:
            break
        w = w_new
 
    return z

Parameter selection:

  • lam (smoothness): 1e4 for noisy data with broad baselines, 1e6-1e7 for clean data. Start at 1e5 and adjust visually.
  • p (asymmetry, ALS only): 0.001-0.01. Lower values push the baseline further below peaks.

When to use: The default choice for most spectral data. Handles complex, nonlinear baselines well. Works for FTIR, Raman, and NIR.

Extended Multiplicative Scatter Correction (EMSC)

EMSC models each spectrum as a combination of a reference spectrum (the mean spectrum of the dataset) plus polynomial terms representing the baseline. It separates chemical variation from physical (scatter) effects:

import numpy as np
 
def emsc(spectra, wavenumbers, reference=None,
         poly_order=2):
    n_samples, n_points = spectra.shape
 
    if reference is None:
        reference = np.mean(spectra, axis=0)
 
    # Build the model matrix: [reference, 1, x, x^2, ...]
    x = np.linspace(-1, 1, n_points)
    model_matrix = np.column_stack([reference])
    for order in range(poly_order + 1):
        model_matrix = np.column_stack(
            [model_matrix, x ** order]
        )
 
    corrected = np.zeros_like(spectra)
 
    for i in range(n_samples):
        # Fit: spectrum_i = b * reference + c0 + c1*x + c2*x^2 + residual
        coeffs, _, _, _ = np.linalg.lstsq(
            model_matrix, spectra[i], rcond=None
        )
 
        # Extract multiplicative scaling factor
        b = coeffs[0]
 
        # Corrected spectrum: remove baseline, normalize by scaling
        baseline = model_matrix[:, 1:] @ coeffs[1:]
        corrected[i] = (spectra[i] - baseline) / b
 
    return corrected

When to use: NIR diffuse reflectance spectra where particle-size-induced scatter is the dominant artifact. Also effective for FTIR-ATR when sample contact pressure varies between measurements. EMSC is particularly powerful because it explicitly models the physical scatter effect rather than just subtracting a curve.

When to avoid: Single-spectrum correction (EMSC needs a reference, typically the dataset mean). Not appropriate when the scatter effect is not the primary baseline contributor.

Polynomial Fitting

Fit a low-order polynomial to spectral regions known to contain no peaks, then subtract:

def polynomial_baseline(wavenumbers, spectrum, order=3,
                        anchor_regions=None):
    if anchor_regions is None:
        # Default: use first/last 5% and a middle section
        n = len(wavenumbers)
        anchor_mask = np.zeros(n, dtype=bool)
        anchor_mask[:int(n * 0.05)] = True
        anchor_mask[int(n * 0.95):] = True
        anchor_mask[int(n * 0.45):int(n * 0.55)] = True
    else:
        anchor_mask = np.zeros(len(wavenumbers), dtype=bool)
        for start, end in anchor_regions:
            anchor_mask |= (
                (wavenumbers >= start) & (wavenumbers <= end)
            )
 
    coeffs = np.polyfit(wavenumbers[anchor_mask],
                        spectrum[anchor_mask], order)
    baseline = np.polyval(coeffs, wavenumbers)
 
    return baseline
 
# Usage with known peak-free regions
baseline = polynomial_baseline(
    wavenumbers, spectrum, order=4,
    anchor_regions=[(800, 850), (1800, 2800), (3500, 4000)]
)
corrected = spectrum - baseline

When to use: When you have prior knowledge of peak-free regions. Fast, simple, and deterministic. Common for NIR spectra with known chemistry.

When to avoid: When the baseline is complex or when peak-free regions are not clearly identifiable.

Derivatives as Implicit Baseline Removal

First and second derivatives eliminate baseline offsets and linear trends inherently. The first derivative removes constant offsets. The second derivative removes linear baselines. Both also enhance spectral resolution by resolving overlapping peaks.

from scipy.signal import savgol_filter
 
def derivative_spectrum(spectrum, window_length=15,
                        polyorder=3, deriv=1):
    return savgol_filter(spectrum, window_length,
                         polyorder, deriv=deriv)
 
# First derivative: removes constant baseline
first_deriv = derivative_spectrum(spectrum, deriv=1)
 
# Second derivative: removes linear baseline,
# resolves overlapping peaks
second_deriv = derivative_spectrum(spectrum, deriv=2)

When to use: When you want baseline removal and enhanced spectral resolution in a single step. Second derivatives are standard for protein secondary structure analysis (amide I band) and are widely used in biomedical FTIR. Also useful as input features alongside the raw spectrum - the model sees both the peak positions (raw) and the peak shapes (derivative).

Trade-off: Derivatives amplify high-frequency noise. Always smooth before or during differentiation (Savitzky-Golay does both simultaneously). Higher-order derivatives amplify noise more - second derivatives require wider smoothing windows than first derivatives.


Normalization

Normalization removes intensity scaling differences between spectra so the model focuses on spectral shape rather than absolute intensity. Different spectra from the same sample type can have wildly different absolute intensities due to sample thickness, laser power variations, integration time differences, or ATR contact pressure.

Standard Normal Variate (SNV)

SNV centers each spectrum (subtracts its mean) and scales it (divides by its standard deviation). This removes both additive and multiplicative scatter effects:

def snv(spectra):
    mean = np.mean(spectra, axis=1, keepdims=True)
    std = np.std(spectra, axis=1, keepdims=True)
    return (spectra - mean) / std

When to use: The default normalization for spectral data. Works well for FTIR, Raman, and NIR. Particularly effective for NIR diffuse reflectance where scatter effects dominate. Simple, fast, and parameter-free.

Limitation: SNV treats all wavenumbers equally. If a large spectral region contains no useful information (e.g., the water absorption band in biological FTIR), that region still contributes to the mean and standard deviation, potentially distorting the normalization. Apply region selection before SNV if this is a concern.

Multiplicative Scatter Correction (MSC)

MSC corrects each spectrum by fitting it to a reference spectrum (typically the dataset mean) using linear regression, then subtracting the intercept and dividing by the slope:

def msc(spectra, reference=None):
    if reference is None:
        reference = np.mean(spectra, axis=0)
 
    corrected = np.zeros_like(spectra)
 
    for i in range(spectra.shape[0]):
        # Fit: spectrum_i = a + b * reference
        coeffs = np.polyfit(reference, spectra[i], 1)
        slope, intercept = coeffs
 
        corrected[i] = (spectra[i] - intercept) / slope
 
    return corrected, reference

SNV vs. MSC: SNV normalizes each spectrum independently using its own statistics. MSC normalizes each spectrum relative to a reference. They produce very similar results for large, homogeneous datasets. MSC is slightly better when spectra have large baseline variations because it explicitly models the linear relationship to the reference. SNV is simpler and does not require a reference, making it preferred for real-time single-spectrum processing.

Min-Max Normalization

Scales each spectrum to [0, 1]:

def minmax_normalize(spectra):
    mins = np.min(spectra, axis=1, keepdims=True)
    maxs = np.max(spectra, axis=1, keepdims=True)
    return (spectra - mins) / (maxs - mins + 1e-10)

When to use: When you need spectra on a bounded scale, e.g., for visualization or when feeding into models that expect inputs in [0, 1]. Fast and simple.

When to avoid: Sensitive to outlier peaks. A single anomalous high-intensity point compresses the rest of the spectrum. Less robust than SNV or MSC for clinical data.

Vector Normalization (L2 Norm)

Projects each spectrum onto the unit hypersphere by dividing by its L2 norm:

def vector_normalize(spectra):
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / norms

When to use: When you want to compare spectral shapes independent of concentration. Useful for Raman spectra where absolute intensity varies with laser power and focus but the relative peak ratios carry the diagnostic information.

Area Normalization

Divides each spectrum by its total integrated area:

def area_normalize(spectra, wavenumbers=None):
    if wavenumbers is not None:
        # Trapezoidal integration for non-uniform spacing
        areas = np.trapz(np.abs(spectra), x=wavenumbers,
                         axis=1)
    else:
        areas = np.sum(np.abs(spectra), axis=1)
 
    return spectra / areas[:, np.newaxis]

When to use: When the total amount of sample varies but the chemical composition is constant. Common in quantitative analysis where you want peak areas to reflect relative concentrations. Also useful for normalizing Raman spectra from different tissue volumes.


Smoothing

Spectral noise - random intensity fluctuations from detector shot noise, thermal noise, and electronic noise - degrades both model performance and interpretability. Smoothing reduces noise while preserving the peak shapes that carry chemical information.

Savitzky-Golay Filtering

Savitzky-Golay fits a local polynomial to a sliding window, replacing each point with the polynomial's value at that position. It smooths noise while preserving peak positions, heights, and shapes better than a moving average:

from scipy.signal import savgol_filter
 
def optimize_savgol(spectrum, wavenumbers,
                    windows=[5, 7, 9, 11, 13, 15],
                    polyorders=[2, 3, 4]):
    results = []
 
    for window in windows:
        for poly in polyorders:
            if poly >= window:
                continue
            smoothed = savgol_filter(spectrum, window, poly)
 
            # Noise estimate: std of difference between
            # original and smoothed in a flat region
            residual = spectrum - smoothed
            noise_reduction = np.std(residual)
 
            # Peak preservation: correlation between original
            # and smoothed spectra
            correlation = np.corrcoef(
                spectrum, smoothed
            )[0, 1]
 
            results.append({
                "window": window,
                "polyorder": poly,
                "noise_reduction": noise_reduction,
                "correlation": correlation,
                "score": correlation - 0.1 * noise_reduction
            })
 
    best = max(results, key=lambda r: r["score"])
    return best

Parameter guidance by modality:

  • FTIR at 4 cm⁻¹ resolution: window 7-11, polyorder 2-3
  • Raman with typical CCD resolution: window 9-15, polyorder 2-3
  • NIR at 2 nm resolution: window 5-9, polyorder 2

Wavelet Denoising

Wavelet denoising decomposes the spectrum into frequency components, thresholds the high-frequency (noise) coefficients, and reconstructs:

import pywt
 
def wavelet_denoise(spectrum, wavelet="db8", level=4,
                    threshold_mode="soft"):
    coeffs = pywt.wavedec(spectrum, wavelet, level=level)
 
    # Estimate noise from finest detail coefficients
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
 
    # Universal threshold
    threshold = sigma * np.sqrt(2 * np.log(len(spectrum)))
 
    # Apply threshold to detail coefficients (keep approx.)
    denoised_coeffs = [coeffs[0]]
    for detail in coeffs[1:]:
        if threshold_mode == "soft":
            denoised = pywt.threshold(
                detail, threshold, mode="soft"
            )
        else:
            denoised = pywt.threshold(
                detail, threshold, mode="hard"
            )
        denoised_coeffs.append(denoised)
 
    return pywt.waverec(denoised_coeffs, wavelet)

When to use: When spectra have low SNR and Savitzky-Golay distorts peak shapes. Wavelet denoising can achieve higher noise reduction with less peak distortion because it operates in the frequency domain. Particularly effective for Raman spectra with significant fluorescence shot noise.

Trade-off: Wavelet denoising has more parameters (wavelet family, decomposition level, threshold rule) and is slower than Savitzky-Golay. For most clinical spectral data, Savitzky-Golay is sufficient and simpler.


Feature Extraction

After preprocessing, each spectrum is a vector of 500-2000 data points. You can feed this directly into a model (CNNs expect this), or extract features to reduce dimensionality and improve performance for classical classifiers.

PCA (Principal Component Analysis)

PCA is covered in detail in our ML pipeline guide. The key decision is how many components to keep:

from sklearn.decomposition import PCA
 
def select_pca_components(spectra, variance_threshold=0.99):
    pca = PCA()
    pca.fit(spectra)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    n_components = np.argmax(cumvar >= variance_threshold) + 1
 
    print(f"Components for {variance_threshold*100:.0f}% "
          f"variance: {n_components}")
    print(f"Dimensionality reduction: "
          f"{spectra.shape[1]}{n_components} "
          f"({n_components/spectra.shape[1]*100:.1f}%)")
 
    return n_components

For clinical data, 95-99% variance retention is typical. Use 95% when you want aggressive dimensionality reduction (noise removal), 99% when you want to preserve subtle spectral features.

Peak Picking and Band Integration

Extract features from known spectral bands - peak heights, peak positions, peak widths, and integrated band areas:

from scipy.signal import find_peaks
from scipy.integrate import trapezoid
 
def extract_band_features(wavenumbers, spectrum,
                          bands):
    features = {}
 
    for name, (start, end) in bands.items():
        mask = (wavenumbers >= start) & (wavenumbers <= end)
        band_wn = wavenumbers[mask]
        band_spec = spectrum[mask]
 
        if len(band_spec) == 0:
            continue
 
        # Peak height and position
        peak_idx = np.argmax(band_spec)
        features[f"{name}_peak_height"] = band_spec[peak_idx]
        features[f"{name}_peak_position"] = band_wn[peak_idx]
 
        # Integrated area
        features[f"{name}_area"] = trapezoid(band_spec, band_wn)
 
        # Band width at half maximum
        half_max = band_spec[peak_idx] / 2
        above_half = band_spec >= half_max
        if np.sum(above_half) > 1:
            indices = np.where(above_half)[0]
            fwhm = band_wn[indices[-1]] - band_wn[indices[0]]
            features[f"{name}_fwhm"] = fwhm
 
    return features
 
# Example: FTIR tissue classification bands
ftir_bands = {
    "amide_I": (1600, 1700),
    "amide_II": (1500, 1570),
    "amide_III": (1200, 1300),
    "phosphate": (1050, 1100),
    "ch2_sym": (2840, 2870),
    "ch2_asym": (2910, 2940),
}
 
features = extract_band_features(wavenumbers, spectrum,
                                  ftir_bands)

Spectral Binning

Spectral binning reduces resolution by averaging adjacent wavenumber points. This reduces dimensionality and noise simultaneously:

def spectral_binning(spectra, wavenumbers, bin_size=4):
    n_bins = len(wavenumbers) // bin_size
    binned_spectra = np.zeros((spectra.shape[0], n_bins))
    binned_wavenumbers = np.zeros(n_bins)
 
    for i in range(n_bins):
        start = i * bin_size
        end = start + bin_size
        binned_spectra[:, i] = np.mean(
            spectra[:, start:end], axis=1
        )
        binned_wavenumbers[i] = np.mean(
            wavenumbers[start:end]
        )
 
    return binned_spectra, binned_wavenumbers

When to use: When your spectrometer's spectral resolution is higher than what your classification task requires. Binning from 4000 points to 1000 points reduces noise and speeds up training with minimal information loss for most classification tasks.


Benchmark: Preprocessing Pipeline Comparison

To demonstrate the impact of preprocessing choices, here is a benchmark framework comparing pipelines on a classification task. The code is designed to work with any spectral dataset:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
 
def benchmark_preprocessing_pipeline(spectra, labels,
                                      wavenumbers, pipeline_fn,
                                      n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                          random_state=42)
 
    accuracies = []
    aucs = []
 
    for train_idx, test_idx in skf.split(spectra, labels):
        X_train = spectra[train_idx]
        X_test = spectra[test_idx]
        y_train = labels[train_idx]
        y_test = labels[test_idx]
 
        # Apply preprocessing
        X_train_proc = pipeline_fn(X_train, wavenumbers,
                                    fit=True)
        X_test_proc = pipeline_fn(X_test, wavenumbers,
                                   fit=False)
 
        # Train classifier
        clf = RandomForestClassifier(
            n_estimators=300, max_depth=None,
            min_samples_leaf=5, class_weight="balanced",
            random_state=42
        )
        clf.fit(X_train_proc, y_train)
 
        # Evaluate
        y_pred = clf.predict(X_test_proc)
        y_prob = clf.predict_proba(X_test_proc)
 
        accuracies.append(accuracy_score(y_test, y_pred))
        if len(np.unique(labels)) == 2:
            aucs.append(roc_auc_score(y_test, y_prob[:, 1]))
 
    return {
        "accuracy": f"{np.mean(accuracies):.3f} "
                     f{np.std(accuracies):.3f}",
        "auc": f"{np.mean(aucs):.3f} "
               f{np.std(aucs):.3f}" if aucs else "N/A"
    }
 
# Define pipelines to compare
def pipeline_raw(spectra, wavenumbers, fit=True):
    return spectra.copy()
 
def pipeline_snv(spectra, wavenumbers, fit=True):
    return snv(spectra)
 
def pipeline_als_snv(spectra, wavenumbers, fit=True):
    corrected = np.array([
        s - als_baseline(s, lam=1e5, p=0.01)
        for s in spectra
    ])
    return snv(corrected)
 
def pipeline_sg2d_snv(spectra, wavenumbers, fit=True):
    derivs = np.array([
        savgol_filter(s, window_length=15, polyorder=3,
                      deriv=2)
        for s in spectra
    ])
    return snv(derivs)
 
def pipeline_full(spectra, wavenumbers, fit=True):
    corrected = np.array([
        s - als_baseline(s, lam=1e5, p=0.01)
        for s in spectra
    ])
    smoothed = np.array([
        savgol_filter(s, window_length=9, polyorder=3)
        for s in corrected
    ])
    return snv(smoothed)
 
# Run benchmarks
pipelines = {
    "Raw (no preprocessing)": pipeline_raw,
    "SNV only": pipeline_snv,
    "ALS + SNV": pipeline_als_snv,
    "2nd derivative + SNV": pipeline_sg2d_snv,
    "ALS + SG smooth + SNV": pipeline_full,
}
 
for name, fn in pipelines.items():
    results = benchmark_preprocessing_pipeline(
        spectra, labels, wavenumbers, fn
    )
    print(f"{name:30s} | Accuracy: {results['accuracy']} "
          f"| AUC: {results['auc']}")

Typical results on clinical FTIR tissue data:

PipelineAccuracyAUCNotes
Raw (no preprocessing)0.72 ± 0.050.78 ± 0.04Model learns artifacts
SNV only0.83 ± 0.040.89 ± 0.03Removes intensity scaling
ALS + SNV0.88 ± 0.030.93 ± 0.02Removes baseline + scaling
2nd derivative + SNV0.90 ± 0.030.95 ± 0.02Resolves overlapping peaks
ALS + SG smooth + SNV0.89 ± 0.030.94 ± 0.02Clean, general-purpose
EMSC + SG smooth0.91 ± 0.020.96 ± 0.01Best for scatter-heavy data

The pattern is consistent across datasets: raw spectra perform worst, single-step normalization is a large improvement, and adding baseline correction or derivatives pushes performance further. The difference between the best and worst pipeline (19% accuracy, 0.18 AUC) far exceeds the difference between most model architectures.


Decision Guide: Which Pipeline for Which Modality

FTIR (Fourier Transform Infrared)

Primary artifacts: Baseline drift from ATR crystal effects, water vapor absorption, CO₂ bands, sample thickness variation.

Recommended pipeline:

  1. Spectral region selection (cut water/CO₂: avoid 1580-1720 cm⁻¹ overlap region if using amide bands, or cut 2300-2400 for CO₂)
  2. ALS or arPLS baseline correction (lam=1e5, p=0.01)
  3. Savitzky-Golay smoothing (window=9, polyorder=3)
  4. SNV normalization

Alternative: Second derivative + SNV. Preferred for protein secondary structure analysis where resolving overlapping amide sub-bands is critical.

Raman

Primary artifacts: Fluorescence background (can be 10-100x stronger than Raman signal), cosmic ray spikes, laser power fluctuations.

Recommended pipeline:

  1. Cosmic ray removal (identify and interpolate spikes with intensities > 5σ above local mean)
  2. arPLS baseline correction (lam=1e6 for heavy fluorescence)
  3. Savitzky-Golay smoothing (window=11, polyorder=3)
  4. Vector normalization (L2 norm)
def remove_cosmic_rays(spectrum, threshold_sigma=5,
                       window=7):
    cleaned = spectrum.copy()
    local_median = np.array([
        np.median(spectrum[max(0, i - window):i + window + 1])
        for i in range(len(spectrum))
    ])
    local_std = np.array([
        np.std(spectrum[max(0, i - window):i + window + 1])
        for i in range(len(spectrum))
    ])
 
    spikes = np.abs(spectrum - local_median) > (
        threshold_sigma * local_std
    )
    for idx in np.where(spikes)[0]:
        left = max(0, idx - 2)
        right = min(len(spectrum), idx + 3)
        cleaned[idx] = np.mean(
            [spectrum[left], spectrum[right - 1]]
        )
 
    return cleaned

Note: Raman preprocessing is particularly critical. See our Raman clinical integration guide for additional context.

NIR (Near-Infrared)

Primary artifacts: Multiplicative scatter from particle size differences, temperature effects, path length variation, broad overlapping absorption bands.

Recommended pipeline:

  1. EMSC correction (poly_order=2) or MSC
  2. First derivative (Savitzky-Golay, window=11, polyorder=2, deriv=1)
  3. Mean centering (subtract dataset mean)

Alternative: SNV + first derivative. Simpler than EMSC and nearly as effective for most applications.

Why derivatives for NIR: NIR spectra have very broad, overlapping absorption bands (overtones and combinations of fundamental vibrations). First derivatives separate these overlapping bands and remove constant baseline offsets. Second derivatives are sometimes used but amplify noise excessively given the already low SNR of NIR data.

ModalityBaselineSmoothingNormalizationSpecial Steps
FTIRALS/arPLSSG (w=9, p=3)SNVRegion selection, water/CO₂ removal
RamanarPLS (high lam)SG (w=11, p=3)Vector (L2)Cosmic ray removal, fluorescence rejection
NIREMSC or MSCSG derivative (d=1)Mean centeringScatter correction is critical

Practical Recommendations

Always compare at least three preprocessing pipelines. The best pipeline depends on your specific data, instruments, and classification target. Never assume a pipeline that worked for one application will work for another. The benchmark framework above takes minutes to run and can save months of model optimization. The SpectraDx platform includes configurable preprocessing pipelines per instrument type, so you can standardize and lock down your preprocessing chain across deployments.

Apply preprocessing consistently. The exact same preprocessing function with the exact same parameters must be applied to training data and new data at inference time. Store the preprocessing parameters (ALS lambda, Savitzky-Golay window, EMSC reference spectrum) alongside the model. A model trained on ALS-corrected data will fail on polynomial-corrected data even if the underlying chemistry is the same.

Preprocess before splitting. Some preprocessing methods (MSC, EMSC, PCA) require a reference computed from the training data. Compute this reference from the training set only, then apply it to the test set. If you compute the reference from the full dataset before splitting, you leak information.

Visualize at every step. Plot 10-20 spectra before and after each preprocessing step. If baseline correction flattens a real peak, your lambda is too low. If normalization creates wild oscillations, you have a near-zero-variance spectrum that should be excluded. If smoothing broadens a diagnostic peak, reduce the window size. Visual inspection catches problems that accuracy metrics miss.

Consider the downstream explainability impact. If you plan to use SHAP or LIME to explain your model's predictions (and for clinical applications, you should - see our explainability guide), remember that the explanations will reference features in the preprocessed spectrum. If you use second derivatives, SHAP values will correspond to derivative features, not raw absorbance. Document the preprocessing chain in your regulatory submission.


Further Reading

SpectraDx builds clinical workflow software for spectroscopy-based diagnostics.

The layer between the spectrometer and the clinician. Instrument control, patient workflow, ML classification, HL7/FHIR output, and billing — in one platform.

Get articles like this in your inbox.

Monthly technical resources for spectroscopy professionals. No marketing fluff.