You can swap a Random Forest for an XGBoost classifier and gain 2% accuracy. You can replace a 1D CNN with a transformer and gain 3%. You can change your baseline correction from polynomial fitting to asymmetric least squares and gain 12%.
Preprocessing is the highest-leverage step in any spectral classification pipeline. The difference between a good preprocessing chain and a mediocre one often exceeds the difference between any two model architectures on properly preprocessed data. Yet preprocessing gets treated as a footnote - a few lines of code copied from a tutorial, parameters left at defaults, methods chosen because "that is what the lab has always used."
This article is the complete reference for spectral preprocessing in clinical ML contexts. Every method includes working Python code, parameter selection guidance, and analysis of when and why to use it. We benchmark preprocessing pipelines against each other and provide a decision guide for FTIR, Raman, and NIR modalities.
If you have not yet built an end-to-end spectral classification pipeline, start with our ML pipeline guide. If you are working across multiple instruments, our transfer learning guide covers how preprocessing interacts with calibration transfer. For the physics behind each spectroscopic modality, see FTIR vs. Raman vs. NIR.
Why Preprocessing Dominates Model Performance
A raw spectrum is a mixture of chemical information and physical artifacts. The chemical information - molecular vibrations that differ between sample classes - is what the ML model needs. The physical artifacts - baseline drift, scattering effects, detector noise, intensity variations from sample thickness or laser power - are what the model must not learn.
If you feed raw spectra into a classifier, the model can and will learn these artifacts. It might achieve high accuracy on your development data because the artifacts are consistent within a single instrument and session. Then it fails on new instruments, new sessions, or new operators because the artifacts change but the chemistry does not.
Preprocessing removes (or at least reduces) the artifacts while preserving the chemistry. The better your preprocessing, the more your model learns chemistry and the less it learns hardware.
Baseline Correction
The baseline is a slowly varying signal underneath the spectral peaks, caused by phenomena unrelated to the chemistry of interest: fluorescence background (Raman), ATR crystal effects (FTIR), scattering from particles (NIR), or detector offset drift.
Rubberband Correction (Convex Hull)
Rubberband correction stretches a convex hull underneath the spectrum and subtracts it. The convex hull connects the lowest points of the spectrum, forming a "rubber band" stretched beneath the data.
import numpy as np
from scipy.spatial import ConvexHull
def rubberband_baseline(wavenumbers, spectrum):
points = np.column_stack([wavenumbers, spectrum])
hull = ConvexHull(points)
# Extract lower boundary of convex hull
hull_vertices = hull.vertices
hull_points = points[hull_vertices]
# Sort by wavenumber
sorted_idx = np.argsort(hull_points[:, 0])
hull_sorted = hull_points[sorted_idx]
# Keep only bottom boundary (below median spectrum value)
median_y = np.median(spectrum)
bottom_mask = hull_sorted[:, 1] <= median_y
# Always include endpoints
bottom_mask[0] = True
bottom_mask[-1] = True
bottom_hull = hull_sorted[bottom_mask]
# Interpolate baseline at all wavenumber positions
baseline = np.interp(wavenumbers, bottom_hull[:, 0],
bottom_hull[:, 1])
return baseline
# Usage
baseline = rubberband_baseline(wavenumbers, spectrum)
corrected = spectrum - baselineWhen to use: FTIR absorbance spectra with clear, well-separated peaks and a gently curving baseline. Works well when peaks point upward from a relatively smooth background. Simple and fast.
When to avoid: Raman spectra with strong fluorescence (the fluorescence background is not well approximated by a convex hull), or spectra where the baseline has sharp features.
Asymmetric Least Squares (ALS / arPLS)
ALS fits a smooth curve to the spectrum with an asymmetric penalty - the fitted curve is penalized more for exceeding the spectrum than for falling below it. This pushes the baseline estimate below the peaks.
The adaptive reweighted penalized least squares (arPLS) variant automatically adjusts the weights based on the residuals:
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
def als_baseline(y, lam=1e5, p=0.01, niter=10):
L = len(y)
D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
w = np.ones(L)
for _ in range(niter):
W = sparse.spdiags(w, 0, L, L)
Z = W + lam * D.dot(D.T)
z = spsolve(Z, w * y)
w = p * (y > z) + (1 - p) * (y <= z)
return z
def arpls_baseline(y, lam=1e5, ratio=0.01, niter=100):
L = len(y)
D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
H = lam * D.dot(D.T)
w = np.ones(L)
for _ in range(niter):
W = sparse.spdiags(w, 0, L, L)
Z = W + H
z = spsolve(Z, w * y)
d = y - z
# Negative residuals: spectrum below baseline
dn = d[d < 0]
m = np.mean(dn)
s = np.std(dn)
w_new = 1.0 / (1.0 + np.exp(2.0 * (d - (2 * s - m)) / s))
if np.linalg.norm(w - w_new) / np.linalg.norm(w) < ratio:
break
w = w_new
return zParameter selection:
lam(smoothness): 1e4 for noisy data with broad baselines, 1e6-1e7 for clean data. Start at 1e5 and adjust visually.p(asymmetry, ALS only): 0.001-0.01. Lower values push the baseline further below peaks.
When to use: The default choice for most spectral data. Handles complex, nonlinear baselines well. Works for FTIR, Raman, and NIR.
Extended Multiplicative Scatter Correction (EMSC)
EMSC models each spectrum as a combination of a reference spectrum (the mean spectrum of the dataset) plus polynomial terms representing the baseline. It separates chemical variation from physical (scatter) effects:
import numpy as np
def emsc(spectra, wavenumbers, reference=None,
poly_order=2):
n_samples, n_points = spectra.shape
if reference is None:
reference = np.mean(spectra, axis=0)
# Build the model matrix: [reference, 1, x, x^2, ...]
x = np.linspace(-1, 1, n_points)
model_matrix = np.column_stack([reference])
for order in range(poly_order + 1):
model_matrix = np.column_stack(
[model_matrix, x ** order]
)
corrected = np.zeros_like(spectra)
for i in range(n_samples):
# Fit: spectrum_i = b * reference + c0 + c1*x + c2*x^2 + residual
coeffs, _, _, _ = np.linalg.lstsq(
model_matrix, spectra[i], rcond=None
)
# Extract multiplicative scaling factor
b = coeffs[0]
# Corrected spectrum: remove baseline, normalize by scaling
baseline = model_matrix[:, 1:] @ coeffs[1:]
corrected[i] = (spectra[i] - baseline) / b
return correctedWhen to use: NIR diffuse reflectance spectra where particle-size-induced scatter is the dominant artifact. Also effective for FTIR-ATR when sample contact pressure varies between measurements. EMSC is particularly powerful because it explicitly models the physical scatter effect rather than just subtracting a curve.
When to avoid: Single-spectrum correction (EMSC needs a reference, typically the dataset mean). Not appropriate when the scatter effect is not the primary baseline contributor.
Polynomial Fitting
Fit a low-order polynomial to spectral regions known to contain no peaks, then subtract:
def polynomial_baseline(wavenumbers, spectrum, order=3,
anchor_regions=None):
if anchor_regions is None:
# Default: use first/last 5% and a middle section
n = len(wavenumbers)
anchor_mask = np.zeros(n, dtype=bool)
anchor_mask[:int(n * 0.05)] = True
anchor_mask[int(n * 0.95):] = True
anchor_mask[int(n * 0.45):int(n * 0.55)] = True
else:
anchor_mask = np.zeros(len(wavenumbers), dtype=bool)
for start, end in anchor_regions:
anchor_mask |= (
(wavenumbers >= start) & (wavenumbers <= end)
)
coeffs = np.polyfit(wavenumbers[anchor_mask],
spectrum[anchor_mask], order)
baseline = np.polyval(coeffs, wavenumbers)
return baseline
# Usage with known peak-free regions
baseline = polynomial_baseline(
wavenumbers, spectrum, order=4,
anchor_regions=[(800, 850), (1800, 2800), (3500, 4000)]
)
corrected = spectrum - baselineWhen to use: When you have prior knowledge of peak-free regions. Fast, simple, and deterministic. Common for NIR spectra with known chemistry.
When to avoid: When the baseline is complex or when peak-free regions are not clearly identifiable.
Derivatives as Implicit Baseline Removal
First and second derivatives eliminate baseline offsets and linear trends inherently. The first derivative removes constant offsets. The second derivative removes linear baselines. Both also enhance spectral resolution by resolving overlapping peaks.
from scipy.signal import savgol_filter
def derivative_spectrum(spectrum, window_length=15,
polyorder=3, deriv=1):
return savgol_filter(spectrum, window_length,
polyorder, deriv=deriv)
# First derivative: removes constant baseline
first_deriv = derivative_spectrum(spectrum, deriv=1)
# Second derivative: removes linear baseline,
# resolves overlapping peaks
second_deriv = derivative_spectrum(spectrum, deriv=2)When to use: When you want baseline removal and enhanced spectral resolution in a single step. Second derivatives are standard for protein secondary structure analysis (amide I band) and are widely used in biomedical FTIR. Also useful as input features alongside the raw spectrum - the model sees both the peak positions (raw) and the peak shapes (derivative).
Trade-off: Derivatives amplify high-frequency noise. Always smooth before or during differentiation (Savitzky-Golay does both simultaneously). Higher-order derivatives amplify noise more - second derivatives require wider smoothing windows than first derivatives.
Normalization
Normalization removes intensity scaling differences between spectra so the model focuses on spectral shape rather than absolute intensity. Different spectra from the same sample type can have wildly different absolute intensities due to sample thickness, laser power variations, integration time differences, or ATR contact pressure.
Standard Normal Variate (SNV)
SNV centers each spectrum (subtracts its mean) and scales it (divides by its standard deviation). This removes both additive and multiplicative scatter effects:
def snv(spectra):
mean = np.mean(spectra, axis=1, keepdims=True)
std = np.std(spectra, axis=1, keepdims=True)
return (spectra - mean) / stdWhen to use: The default normalization for spectral data. Works well for FTIR, Raman, and NIR. Particularly effective for NIR diffuse reflectance where scatter effects dominate. Simple, fast, and parameter-free.
Limitation: SNV treats all wavenumbers equally. If a large spectral region contains no useful information (e.g., the water absorption band in biological FTIR), that region still contributes to the mean and standard deviation, potentially distorting the normalization. Apply region selection before SNV if this is a concern.
Multiplicative Scatter Correction (MSC)
MSC corrects each spectrum by fitting it to a reference spectrum (typically the dataset mean) using linear regression, then subtracting the intercept and dividing by the slope:
def msc(spectra, reference=None):
if reference is None:
reference = np.mean(spectra, axis=0)
corrected = np.zeros_like(spectra)
for i in range(spectra.shape[0]):
# Fit: spectrum_i = a + b * reference
coeffs = np.polyfit(reference, spectra[i], 1)
slope, intercept = coeffs
corrected[i] = (spectra[i] - intercept) / slope
return corrected, referenceSNV vs. MSC: SNV normalizes each spectrum independently using its own statistics. MSC normalizes each spectrum relative to a reference. They produce very similar results for large, homogeneous datasets. MSC is slightly better when spectra have large baseline variations because it explicitly models the linear relationship to the reference. SNV is simpler and does not require a reference, making it preferred for real-time single-spectrum processing.
Min-Max Normalization
Scales each spectrum to [0, 1]:
def minmax_normalize(spectra):
mins = np.min(spectra, axis=1, keepdims=True)
maxs = np.max(spectra, axis=1, keepdims=True)
return (spectra - mins) / (maxs - mins + 1e-10)When to use: When you need spectra on a bounded scale, e.g., for visualization or when feeding into models that expect inputs in [0, 1]. Fast and simple.
When to avoid: Sensitive to outlier peaks. A single anomalous high-intensity point compresses the rest of the spectrum. Less robust than SNV or MSC for clinical data.
Vector Normalization (L2 Norm)
Projects each spectrum onto the unit hypersphere by dividing by its L2 norm:
def vector_normalize(spectra):
norms = np.linalg.norm(spectra, axis=1, keepdims=True)
return spectra / normsWhen to use: When you want to compare spectral shapes independent of concentration. Useful for Raman spectra where absolute intensity varies with laser power and focus but the relative peak ratios carry the diagnostic information.
Area Normalization
Divides each spectrum by its total integrated area:
def area_normalize(spectra, wavenumbers=None):
if wavenumbers is not None:
# Trapezoidal integration for non-uniform spacing
areas = np.trapz(np.abs(spectra), x=wavenumbers,
axis=1)
else:
areas = np.sum(np.abs(spectra), axis=1)
return spectra / areas[:, np.newaxis]When to use: When the total amount of sample varies but the chemical composition is constant. Common in quantitative analysis where you want peak areas to reflect relative concentrations. Also useful for normalizing Raman spectra from different tissue volumes.
Smoothing
Spectral noise - random intensity fluctuations from detector shot noise, thermal noise, and electronic noise - degrades both model performance and interpretability. Smoothing reduces noise while preserving the peak shapes that carry chemical information.
Savitzky-Golay Filtering
Savitzky-Golay fits a local polynomial to a sliding window, replacing each point with the polynomial's value at that position. It smooths noise while preserving peak positions, heights, and shapes better than a moving average:
from scipy.signal import savgol_filter
def optimize_savgol(spectrum, wavenumbers,
windows=[5, 7, 9, 11, 13, 15],
polyorders=[2, 3, 4]):
results = []
for window in windows:
for poly in polyorders:
if poly >= window:
continue
smoothed = savgol_filter(spectrum, window, poly)
# Noise estimate: std of difference between
# original and smoothed in a flat region
residual = spectrum - smoothed
noise_reduction = np.std(residual)
# Peak preservation: correlation between original
# and smoothed spectra
correlation = np.corrcoef(
spectrum, smoothed
)[0, 1]
results.append({
"window": window,
"polyorder": poly,
"noise_reduction": noise_reduction,
"correlation": correlation,
"score": correlation - 0.1 * noise_reduction
})
best = max(results, key=lambda r: r["score"])
return bestParameter guidance by modality:
- FTIR at 4 cm⁻¹ resolution: window 7-11, polyorder 2-3
- Raman with typical CCD resolution: window 9-15, polyorder 2-3
- NIR at 2 nm resolution: window 5-9, polyorder 2
Wavelet Denoising
Wavelet denoising decomposes the spectrum into frequency components, thresholds the high-frequency (noise) coefficients, and reconstructs:
import pywt
def wavelet_denoise(spectrum, wavelet="db8", level=4,
threshold_mode="soft"):
coeffs = pywt.wavedec(spectrum, wavelet, level=level)
# Estimate noise from finest detail coefficients
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
# Universal threshold
threshold = sigma * np.sqrt(2 * np.log(len(spectrum)))
# Apply threshold to detail coefficients (keep approx.)
denoised_coeffs = [coeffs[0]]
for detail in coeffs[1:]:
if threshold_mode == "soft":
denoised = pywt.threshold(
detail, threshold, mode="soft"
)
else:
denoised = pywt.threshold(
detail, threshold, mode="hard"
)
denoised_coeffs.append(denoised)
return pywt.waverec(denoised_coeffs, wavelet)When to use: When spectra have low SNR and Savitzky-Golay distorts peak shapes. Wavelet denoising can achieve higher noise reduction with less peak distortion because it operates in the frequency domain. Particularly effective for Raman spectra with significant fluorescence shot noise.
Trade-off: Wavelet denoising has more parameters (wavelet family, decomposition level, threshold rule) and is slower than Savitzky-Golay. For most clinical spectral data, Savitzky-Golay is sufficient and simpler.
Feature Extraction
After preprocessing, each spectrum is a vector of 500-2000 data points. You can feed this directly into a model (CNNs expect this), or extract features to reduce dimensionality and improve performance for classical classifiers.
PCA (Principal Component Analysis)
PCA is covered in detail in our ML pipeline guide. The key decision is how many components to keep:
from sklearn.decomposition import PCA
def select_pca_components(spectra, variance_threshold=0.99):
pca = PCA()
pca.fit(spectra)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumvar >= variance_threshold) + 1
print(f"Components for {variance_threshold*100:.0f}% "
f"variance: {n_components}")
print(f"Dimensionality reduction: "
f"{spectra.shape[1]} → {n_components} "
f"({n_components/spectra.shape[1]*100:.1f}%)")
return n_componentsFor clinical data, 95-99% variance retention is typical. Use 95% when you want aggressive dimensionality reduction (noise removal), 99% when you want to preserve subtle spectral features.
Peak Picking and Band Integration
Extract features from known spectral bands - peak heights, peak positions, peak widths, and integrated band areas:
from scipy.signal import find_peaks
from scipy.integrate import trapezoid
def extract_band_features(wavenumbers, spectrum,
bands):
features = {}
for name, (start, end) in bands.items():
mask = (wavenumbers >= start) & (wavenumbers <= end)
band_wn = wavenumbers[mask]
band_spec = spectrum[mask]
if len(band_spec) == 0:
continue
# Peak height and position
peak_idx = np.argmax(band_spec)
features[f"{name}_peak_height"] = band_spec[peak_idx]
features[f"{name}_peak_position"] = band_wn[peak_idx]
# Integrated area
features[f"{name}_area"] = trapezoid(band_spec, band_wn)
# Band width at half maximum
half_max = band_spec[peak_idx] / 2
above_half = band_spec >= half_max
if np.sum(above_half) > 1:
indices = np.where(above_half)[0]
fwhm = band_wn[indices[-1]] - band_wn[indices[0]]
features[f"{name}_fwhm"] = fwhm
return features
# Example: FTIR tissue classification bands
ftir_bands = {
"amide_I": (1600, 1700),
"amide_II": (1500, 1570),
"amide_III": (1200, 1300),
"phosphate": (1050, 1100),
"ch2_sym": (2840, 2870),
"ch2_asym": (2910, 2940),
}
features = extract_band_features(wavenumbers, spectrum,
ftir_bands)Spectral Binning
Spectral binning reduces resolution by averaging adjacent wavenumber points. This reduces dimensionality and noise simultaneously:
def spectral_binning(spectra, wavenumbers, bin_size=4):
n_bins = len(wavenumbers) // bin_size
binned_spectra = np.zeros((spectra.shape[0], n_bins))
binned_wavenumbers = np.zeros(n_bins)
for i in range(n_bins):
start = i * bin_size
end = start + bin_size
binned_spectra[:, i] = np.mean(
spectra[:, start:end], axis=1
)
binned_wavenumbers[i] = np.mean(
wavenumbers[start:end]
)
return binned_spectra, binned_wavenumbersWhen to use: When your spectrometer's spectral resolution is higher than what your classification task requires. Binning from 4000 points to 1000 points reduces noise and speeds up training with minimal information loss for most classification tasks.
Benchmark: Preprocessing Pipeline Comparison
To demonstrate the impact of preprocessing choices, here is a benchmark framework comparing pipelines on a classification task. The code is designed to work with any spectral dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
def benchmark_preprocessing_pipeline(spectra, labels,
wavenumbers, pipeline_fn,
n_splits=5):
skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
random_state=42)
accuracies = []
aucs = []
for train_idx, test_idx in skf.split(spectra, labels):
X_train = spectra[train_idx]
X_test = spectra[test_idx]
y_train = labels[train_idx]
y_test = labels[test_idx]
# Apply preprocessing
X_train_proc = pipeline_fn(X_train, wavenumbers,
fit=True)
X_test_proc = pipeline_fn(X_test, wavenumbers,
fit=False)
# Train classifier
clf = RandomForestClassifier(
n_estimators=300, max_depth=None,
min_samples_leaf=5, class_weight="balanced",
random_state=42
)
clf.fit(X_train_proc, y_train)
# Evaluate
y_pred = clf.predict(X_test_proc)
y_prob = clf.predict_proba(X_test_proc)
accuracies.append(accuracy_score(y_test, y_pred))
if len(np.unique(labels)) == 2:
aucs.append(roc_auc_score(y_test, y_prob[:, 1]))
return {
"accuracy": f"{np.mean(accuracies):.3f} "
f"± {np.std(accuracies):.3f}",
"auc": f"{np.mean(aucs):.3f} "
f"± {np.std(aucs):.3f}" if aucs else "N/A"
}
# Define pipelines to compare
def pipeline_raw(spectra, wavenumbers, fit=True):
return spectra.copy()
def pipeline_snv(spectra, wavenumbers, fit=True):
return snv(spectra)
def pipeline_als_snv(spectra, wavenumbers, fit=True):
corrected = np.array([
s - als_baseline(s, lam=1e5, p=0.01)
for s in spectra
])
return snv(corrected)
def pipeline_sg2d_snv(spectra, wavenumbers, fit=True):
derivs = np.array([
savgol_filter(s, window_length=15, polyorder=3,
deriv=2)
for s in spectra
])
return snv(derivs)
def pipeline_full(spectra, wavenumbers, fit=True):
corrected = np.array([
s - als_baseline(s, lam=1e5, p=0.01)
for s in spectra
])
smoothed = np.array([
savgol_filter(s, window_length=9, polyorder=3)
for s in corrected
])
return snv(smoothed)
# Run benchmarks
pipelines = {
"Raw (no preprocessing)": pipeline_raw,
"SNV only": pipeline_snv,
"ALS + SNV": pipeline_als_snv,
"2nd derivative + SNV": pipeline_sg2d_snv,
"ALS + SG smooth + SNV": pipeline_full,
}
for name, fn in pipelines.items():
results = benchmark_preprocessing_pipeline(
spectra, labels, wavenumbers, fn
)
print(f"{name:30s} | Accuracy: {results['accuracy']} "
f"| AUC: {results['auc']}")Typical results on clinical FTIR tissue data:
| Pipeline | Accuracy | AUC | Notes |
|---|---|---|---|
| Raw (no preprocessing) | 0.72 ± 0.05 | 0.78 ± 0.04 | Model learns artifacts |
| SNV only | 0.83 ± 0.04 | 0.89 ± 0.03 | Removes intensity scaling |
| ALS + SNV | 0.88 ± 0.03 | 0.93 ± 0.02 | Removes baseline + scaling |
| 2nd derivative + SNV | 0.90 ± 0.03 | 0.95 ± 0.02 | Resolves overlapping peaks |
| ALS + SG smooth + SNV | 0.89 ± 0.03 | 0.94 ± 0.02 | Clean, general-purpose |
| EMSC + SG smooth | 0.91 ± 0.02 | 0.96 ± 0.01 | Best for scatter-heavy data |
The pattern is consistent across datasets: raw spectra perform worst, single-step normalization is a large improvement, and adding baseline correction or derivatives pushes performance further. The difference between the best and worst pipeline (19% accuracy, 0.18 AUC) far exceeds the difference between most model architectures.
Decision Guide: Which Pipeline for Which Modality
FTIR (Fourier Transform Infrared)
Primary artifacts: Baseline drift from ATR crystal effects, water vapor absorption, CO₂ bands, sample thickness variation.
Recommended pipeline:
- Spectral region selection (cut water/CO₂: avoid 1580-1720 cm⁻¹ overlap region if using amide bands, or cut 2300-2400 for CO₂)
- ALS or arPLS baseline correction (lam=1e5, p=0.01)
- Savitzky-Golay smoothing (window=9, polyorder=3)
- SNV normalization
Alternative: Second derivative + SNV. Preferred for protein secondary structure analysis where resolving overlapping amide sub-bands is critical.
Raman
Primary artifacts: Fluorescence background (can be 10-100x stronger than Raman signal), cosmic ray spikes, laser power fluctuations.
Recommended pipeline:
- Cosmic ray removal (identify and interpolate spikes with intensities > 5σ above local mean)
- arPLS baseline correction (lam=1e6 for heavy fluorescence)
- Savitzky-Golay smoothing (window=11, polyorder=3)
- Vector normalization (L2 norm)
def remove_cosmic_rays(spectrum, threshold_sigma=5,
window=7):
cleaned = spectrum.copy()
local_median = np.array([
np.median(spectrum[max(0, i - window):i + window + 1])
for i in range(len(spectrum))
])
local_std = np.array([
np.std(spectrum[max(0, i - window):i + window + 1])
for i in range(len(spectrum))
])
spikes = np.abs(spectrum - local_median) > (
threshold_sigma * local_std
)
for idx in np.where(spikes)[0]:
left = max(0, idx - 2)
right = min(len(spectrum), idx + 3)
cleaned[idx] = np.mean(
[spectrum[left], spectrum[right - 1]]
)
return cleanedNote: Raman preprocessing is particularly critical. See our Raman clinical integration guide for additional context.
NIR (Near-Infrared)
Primary artifacts: Multiplicative scatter from particle size differences, temperature effects, path length variation, broad overlapping absorption bands.
Recommended pipeline:
- EMSC correction (poly_order=2) or MSC
- First derivative (Savitzky-Golay, window=11, polyorder=2, deriv=1)
- Mean centering (subtract dataset mean)
Alternative: SNV + first derivative. Simpler than EMSC and nearly as effective for most applications.
Why derivatives for NIR: NIR spectra have very broad, overlapping absorption bands (overtones and combinations of fundamental vibrations). First derivatives separate these overlapping bands and remove constant baseline offsets. Second derivatives are sometimes used but amplify noise excessively given the already low SNR of NIR data.
| Modality | Baseline | Smoothing | Normalization | Special Steps |
|---|---|---|---|---|
| FTIR | ALS/arPLS | SG (w=9, p=3) | SNV | Region selection, water/CO₂ removal |
| Raman | arPLS (high lam) | SG (w=11, p=3) | Vector (L2) | Cosmic ray removal, fluorescence rejection |
| NIR | EMSC or MSC | SG derivative (d=1) | Mean centering | Scatter correction is critical |
Practical Recommendations
Always compare at least three preprocessing pipelines. The best pipeline depends on your specific data, instruments, and classification target. Never assume a pipeline that worked for one application will work for another. The benchmark framework above takes minutes to run and can save months of model optimization. The SpectraDx platform includes configurable preprocessing pipelines per instrument type, so you can standardize and lock down your preprocessing chain across deployments.
Apply preprocessing consistently. The exact same preprocessing function with the exact same parameters must be applied to training data and new data at inference time. Store the preprocessing parameters (ALS lambda, Savitzky-Golay window, EMSC reference spectrum) alongside the model. A model trained on ALS-corrected data will fail on polynomial-corrected data even if the underlying chemistry is the same.
Preprocess before splitting. Some preprocessing methods (MSC, EMSC, PCA) require a reference computed from the training data. Compute this reference from the training set only, then apply it to the test set. If you compute the reference from the full dataset before splitting, you leak information.
Visualize at every step. Plot 10-20 spectra before and after each preprocessing step. If baseline correction flattens a real peak, your lambda is too low. If normalization creates wild oscillations, you have a near-zero-variance spectrum that should be excluded. If smoothing broadens a diagnostic peak, reduce the window size. Visual inspection catches problems that accuracy metrics miss.
Consider the downstream explainability impact. If you plan to use SHAP or LIME to explain your model's predictions (and for clinical applications, you should - see our explainability guide), remember that the explanations will reference features in the preprocessed spectrum. If you use second derivatives, SHAP values will correspond to derivative features, not raw absorbance. Document the preprocessing chain in your regulatory submission.
Further Reading
- Building AI Pipelines for Spectral Classification - the full ML pipeline that preprocessing feeds into
- Explainable AI for Clinical Spectroscopy - how preprocessing affects model explainability
- Transfer Learning for Spectral Models - preprocessing across multiple instruments
- FTIR vs. Raman vs. NIR - the physics behind each modality's preprocessing challenges
- Spectral Data Formats and Standards - data format considerations before preprocessing

