A spectral classifier that outputs "malignant" is making a claim. A spectral classifier that outputs "malignant, 0.97 confidence, prediction set {malignant}" is making a claim you can act on. The difference between these two outputs is the difference between a research prototype and a deployable clinical system.
This article covers the engineering of that difference. We start with why raw neural network outputs are unreliable as probabilities, then work through four methods for producing calibrated confidence estimates:
- Temperature scaling and Platt scaling (post-hoc calibration)
- Conformal prediction (guaranteed-coverage prediction sets)
- Monte Carlo dropout (Bayesian uncertainty approximation)
- Deep ensembles (multi-model disagreement)
We then cover out-of-distribution detection for spectra that should never reach the classifier at all, and finish with clinical threshold design and regulatory expectations.
Throughout, the code is written for the spectral classification pipeline described in Building AI Pipelines for Spectral Classification. That article covers the SpectralCNN architecture, preprocessing, validation methodology, and the classify_with_confidence() function that implements a basic three-zone threshold scheme. This article replaces that basic scheme with rigorous uncertainty quantification.
All code uses PyTorch, scikit-learn, and NumPy. Variable names assume spectral data: spectra, wavenumbers, n_spectra, n_wavenumbers.
The Clinical Problem
Consider two patients. Patient A's FTIR tissue spectrum runs through a classifier and returns "malignant" with a softmax output of 0.52. Patient B's spectrum returns "malignant" with a softmax output of 0.998. The clinical action should be radically different - yet a system that reports only the label treats them identically.
This is not a theoretical concern. In clinical spectroscopy, several factors conspire to produce ambiguous classifications:
- Borderline pathology. Not every tissue sample is clearly normal or clearly malignant. Dysplastic tissue, early-stage lesions, and mixed samples produce spectra that sit between class distributions.
- Sample quality variation. Tissue hydration, fixation protocols, section thickness, and contamination (blood, paraffin residue, adhesive) introduce spectral artifacts that degrade classifier certainty.
- Instrument variability. The same tissue measured on two different spectrometers produces slightly different spectra. A model trained on instrument A encounters systematic shifts on instrument B.
- Novel sample types. A classifier trained on normal and malignant breast tissue receives a benign fibroadenoma it has never seen. It must output something - and whatever it outputs will be wrong.
A clinician needs three things from a spectral classification system:
- The predicted label
- A calibrated probability that the label is correct
- A flag when the system is operating outside its competence
The rest of this article is about producing those three outputs reliably.
Why Raw Softmax Outputs Are Not Probabilities
The softmax function converts a vector of logits into values that sum to 1. This superficially resembles a probability distribution, and many deployed systems treat softmax outputs as calibrated probabilities. They are not.
Modern deep neural networks - including the 1D CNNs used for spectral classification - tend to be systematically overconfident. A network that achieves 90% accuracy on a test set will routinely assign softmax probabilities above 0.99 to its predictions. When such a network reports 0.95 confidence, its empirical accuracy at that confidence level is often well below 95%.
This miscalibration was characterized systematically by Guo et al. (2017) and has been confirmed across architectures and domains. The cause is a combination of overparameterization, batch normalization, and training with negative log-likelihood loss - all standard practices that improve accuracy while degrading calibration.
Measuring calibration requires a reliability diagram. Bin predictions by their confidence level and compare the mean confidence in each bin to the actual accuracy:
```python
import numpy as np
from sklearn.metrics import accuracy_score


def reliability_diagram(y_true, y_prob, n_bins=10):
    """Bin predictions by confidence and compare confidence to accuracy."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_accs = []
    bin_confs = []
    bin_counts = []

    max_probs = np.max(y_prob, axis=1)   # confidence of the predicted class
    y_pred = np.argmax(y_prob, axis=1)

    for i in range(n_bins):
        # Half-open bins (lower, upper]; empty bins are skipped.
        mask = (max_probs > bin_edges[i]) & (max_probs <= bin_edges[i + 1])
        if np.sum(mask) == 0:
            continue
        bin_acc = accuracy_score(y_true[mask], y_pred[mask])
        bin_conf = np.mean(max_probs[mask])
        bin_accs.append(bin_acc)
        bin_confs.append(bin_conf)
        bin_counts.append(np.sum(mask))

    # ECE: |accuracy - confidence| per bin, weighted by the bin's share of samples.
    total = sum(bin_counts)
    ece = np.sum([
        (count / total) * abs(acc - conf)
        for acc, conf, count in zip(bin_accs, bin_confs, bin_counts)
    ])
    return bin_accs, bin_confs, bin_counts, ece
```

The Expected Calibration Error (ECE) is the weighted average gap between accuracy and confidence across bins. A perfectly calibrated model has ECE = 0. Uncalibrated spectral CNNs typically show ECE between 0.05 and 0.15 - meaning the average confidence-accuracy gap is 5-15 percentage points. That gap is clinically dangerous.
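To make the ECE numbers concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the compact `expected_calibration_error` helper, the uniform "true" correctness probabilities, and the power transform used to simulate overconfidence. It is not output from a real spectral classifier.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # mask.mean() is the bin's share of samples
    return float(ece)


rng = np.random.default_rng(0)
n = 20_000

# Synthetic probability that each prediction is correct (illustrative only).
true_p = rng.uniform(0.6, 1.0, size=n)
correct = (rng.random(n) < true_p).astype(float)

calibrated_conf = true_p             # reports exactly the true probability
overconfident_conf = true_p ** 0.25  # pushes every confidence toward 1.0

ece_calibrated = expected_calibration_error(calibrated_conf, correct)
ece_overconfident = expected_calibration_error(overconfident_conf, correct)
print(f"ECE, calibrated:    {ece_calibrated:.3f}")
print(f"ECE, overconfident: {ece_overconfident:.3f}")
```

The accuracy is identical in both cases; only the reported confidence differs, which is exactly the failure mode that accuracy metrics alone do not reveal.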
Post-Hoc Calibration: Temperature Scaling and Platt Scaling
The simplest fix for miscalibrated softmax outputs is post-hoc calibration - learning a transformation that maps raw model outputs to calibrated probabilities without modifying the model itself.
Temperature Scaling
Temperature scaling divides logits by a single learned parameter T before applying softmax. When T > 1, the distribution becomes softer (less confident). When T < 1, it becomes sharper (more confident). The parameter is optimized on a held-out calibration set by minimizing negative log-likelihood.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        # Start above 1.0 because uncalibrated networks are usually overconfident.
        self.temperature = nn.Parameter(torch.ones(1) * 1.5)

    def forward(self, logits):
        return logits / self.temperature

    def fit(self, val_loader, model, device, lr=0.01, max_iter=200):
        # Collect logits once; the classifier's weights are never modified.
        model.eval()
        logits_list, labels_list = [], []
        with torch.no_grad():
            for spectra_batch, labels in val_loader:
                spectra_batch = spectra_batch.to(device)
                logits = model(spectra_batch)
                logits_list.append(logits.cpu())
                labels_list.append(labels)
        logits_all = torch.cat(logits_list)
        labels_all = torch.cat(labels_list)

        # Optimize the single temperature parameter by minimizing NLL.
        nll = nn.CrossEntropyLoss()
        optimizer = torch.optim.LBFGS([self.temperature], lr=lr, max_iter=max_iter)

        def closure():
            optimizer.zero_grad()
            scaled = self.forward(logits_all)
            loss = nll(scaled, labels_all)
            loss.backward()
            return loss

        optimizer.step(closure)
        return self.temperature.item()
```

Usage after training:
```python
scaler = TemperatureScaler()
optimal_T = scaler.fit(val_loader, model, device)

model.eval()
with torch.no_grad():
    logits = model(spectrum_tensor)
    calibrated_logits = logits / optimal_T
    calibrated_probs = torch.softmax(calibrated_logits, dim=1)
```

Typical optimal temperatures for spectral CNNs range from 1.3 to 2.5. An optimal T of 2.0 means the network's logits were roughly twice as large as well-calibrated probabilities warrant. Temperature scaling preserves prediction accuracy (dividing all logits by the same positive constant does not change the argmax) while fixing the probability estimates.
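A quick numeric check of those two properties, as a NumPy sketch. The logit values are made up for illustration and the `softmax` helper is defined locally; nothing here depends on the SpectralCNN.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for one spectrum and three classes.
logits = np.array([4.0, 1.0, 0.5])

probs_raw = softmax(logits)       # T = 1 (uncalibrated)
probs_T2 = softmax(logits / 2.0)  # T = 2 (softer distribution)

print("raw:", np.round(probs_raw, 3))
print("T=2:", np.round(probs_T2, 3))
print("same predicted class:", probs_raw.argmax() == probs_T2.argmax())
```

The top-class probability shrinks under T = 2 while the ranking of classes is unchanged, which is why temperature scaling can repair confidence without touching accuracy.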
Platt Scaling
Platt scaling fits a logistic regression model on the logits, learning a linear map and a per-class shift rather than a single global scale (in the multiclass setting this is often called matrix scaling). This is more flexible than temperature scaling and can correct asymmetric miscalibration (e.g., the model is well-calibrated for class 0 but overconfident for class 1). Unlike temperature scaling, it can change the predicted class for some samples:

```python
from sklearn.linear_model import LogisticRegression


def platt_calibrate(logits_val, y_val, logits_test):
    # Very large C makes the L2 penalty negligible (effectively unregularized).
    # The lbfgs solver fits a multinomial model for multiclass targets by default.
    calibrator = LogisticRegression(C=1e10, solver='lbfgs', max_iter=1000)
    calibrator.fit(logits_val, y_val)
    calibrated_probs = calibrator.predict_proba(logits_test)
    return calibrated_probs
```

When to use which. Temperature scaling is preferred when you have limited calibration data (50-200 spectra) because it learns a single parameter. Platt scaling is preferred when you have more calibration data (500+) and observe class-specific miscalibration. Both are simple to implement, add negligible inference latency (a single division or linear transform), and should be the default first step for any deployed spectral classifier.
Neither method addresses a deeper problem: they assume the model's ranking of predictions is correct and merely adjust the scale. If the model is fundamentally uncertain about a sample, post-hoc calibration cannot recover that uncertainty. For that, we need methods that quantify uncertainty structurally.
Conformal Prediction
Conformal prediction is the single most important method in this article. Unlike calibration methods that adjust point estimates, conformal prediction produces prediction sets - sets of classes that are guaranteed to contain the true class with a user-specified probability (e.g., 90%, 95%, 99%). The guarantee is distribution-free: it holds regardless of the model architecture, the data distribution, or the quality of the model, provided only that calibration and test data are exchangeable.
For clinical spectroscopy, this is transformative. Instead of reporting "malignant, probability 0.87," the system reports "prediction set: {malignant}, coverage 95%." If the model is uncertain, the prediction set grows to include every class that remains plausible at the same 95% coverage. The clinician sees exactly how many classes remain plausible, and the coverage guarantee is mathematically proven.
How Split Conformal Prediction Works
Split conformal prediction requires three components:
- A trained model that produces softmax probabilities (or any scores)
- A calibration set of labeled spectra that were not used for training
- A target coverage level (1 - alpha), e.g., 0.95
The algorithm:
- Run the model on each calibration spectrum. For each, compute a nonconformity score - a measure of how "surprising" the true label is. The standard choice is 1 - softmax_probability_of_true_class.
- Sort the calibration scores and find the ceil((n + 1)(1 - alpha)) / n empirical quantile, where n is the calibration set size. Call this threshold q_hat.
- At test time, include class k in the prediction set if its softmax probability is at least 1 - q_hat.
The coverage guarantee follows from a quantile argument on exchangeable data. No assumptions about the model or data distribution are required.
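The three steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions: `synthetic_probs` is a hypothetical stand-in for a trained model's softmax output, and the class count, sample sizes, and logit offset are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05
n_classes = 3


def synthetic_probs(n):
    """Hypothetical model output: softmax of noisy logits favouring the true class."""
    labels = rng.integers(0, n_classes, size=n)
    logits = rng.normal(0.0, 1.0, size=(n, n_classes))
    logits[np.arange(n), labels] += 2.0
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True), labels


cal_probs, cal_labels = synthetic_probs(500)
test_probs, test_labels = synthetic_probs(5000)

# Step 1: nonconformity score = 1 - probability assigned to the true class.
n = len(cal_labels)
cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]

# Step 2: finite-sample-corrected quantile of the calibration scores.
level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
q_hat = np.quantile(cal_scores, level, method="higher")

# Step 3: a class enters the prediction set if its probability is >= 1 - q_hat.
in_set = test_probs >= 1.0 - q_hat
coverage = in_set[np.arange(len(test_labels)), test_labels].mean()
mean_set_size = in_set.sum(axis=1).mean()

print(f"q_hat = {q_hat:.3f}")
print(f"empirical coverage = {coverage:.3f} (target {1 - alpha:.2f})")
print(f"mean set size = {mean_set_size:.2f}")
```

Because calibration and test data come from the same generator here, exchangeability holds by construction and the empirical coverage should land near the target; on real spectra that assumption has to be checked rather than assumed.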
Full Implementation
```python
import numpy as np
from typing import List, Dict


class SpectralConformalPredictor:
    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha      # target miscoverage rate
        self.q_hat = None       # score threshold, set by calibrate()
        self.n_cal = 0

    def calibrate(self, cal_probs: np.ndarray, cal_labels: np.ndarray):
        n = len(cal_labels)
        self.n_cal = n
        # Nonconformity score: 1 - probability assigned to the true class.
        scores = 1.0 - cal_probs[np.arange(n), cal_labels]
        # Finite-sample correction: ceil((n + 1)(1 - alpha)) / n, capped at 1.
        quantile_level = np.ceil((1 - self.alpha) * (n + 1)) / n
        quantile_level = min(quantile_level, 1.0)
        self.q_hat = np.quantile(scores, quantile_level, method='higher')

    def predict(self, test_probs: np.ndarray) -> List[List[int]]:
        threshold = 1.0 - self.q_hat
        prediction_sets = []
        for probs in test_probs:
            pset = np.where(probs >= threshold)[0].tolist()
            # Never return an empty set: fall back to the top class.
            if len(pset) == 0:
                pset = [int(np.argmax(probs))]
            prediction_sets.append(pset)
        return prediction_sets

    def evaluate(self, test_probs: np.ndarray, test_labels: np.ndarray) -> Dict:
        psets = self.predict(test_probs)
        covered = sum(
            1 for pset, y in zip(psets, test_labels) if y in pset
        )
        coverage = covered / len(test_labels)
        sizes = [len(ps) for ps in psets]
        return {
            'empirical_coverage': coverage,
            'target_coverage': 1 - self.alpha,
            'mean_set_size': np.mean(sizes),
            'median_set_size': np.median(sizes),
            'singleton_fraction': np.mean([s == 1 for s in sizes]),
            'empty_fraction': 0.0,  # always zero because of the argmax fallback
            'q_hat': self.q_hat,
            'n_calibration': self.n_cal
        }
```

Using Conformal Prediction in a Spectral Pipeline
```python
from sklearn.model_selection import train_test_split

# cal_* and test_* are assumed to be disjoint held-out splits,
# neither of which was used for training.
model.eval()
with torch.no_grad():
    cal_logits = model(cal_spectra_tensor)
    cal_probs = torch.softmax(cal_logits / optimal_T, dim=1).cpu().numpy()
    test_logits = model(test_spectra_tensor)
    test_probs = torch.softmax(test_logits / optimal_T, dim=1).cpu().numpy()

cp = SpectralConformalPredictor(alpha=0.05)
cp.calibrate(cal_probs, cal_labels)
results = cp.evaluate(test_probs, test_labels)

print(f"Coverage: {results['empirical_coverage']:.3f} "
      f"(target: {results['target_coverage']:.3f})")
print(f"Mean set size: {results['mean_set_size']:.2f}")
print(f"Singleton fraction: {results['singleton_fraction']:.3f}")
```

Notice that we apply temperature scaling before conformal prediction. This is deliberate and recommended by Angelopoulos and Bates (2023). Conformal prediction's coverage guarantee holds regardless of calibration, but better-calibrated softmax scores produce smaller prediction sets - which means more clinically useful results.
Interpreting Prediction Sets Clinically
The prediction set size is the key clinical signal:
| Set Size | Interpretation | Clinical Action |
|---|---|---|
| 1 | Model is confident in a single class | Report result with confidence level |
| 2 | Model cannot distinguish between two classes | Flag for pathologist review or repeat measurement |
| 3+ | Model has high uncertainty | Do not report; require confirmatory testing |
| = total classes | Model knows nothing about this sample | Likely out-of-distribution; investigate sample quality |
A well-performing spectral classifier at α = 0.05 should produce singleton prediction sets for 85-95% of samples. If the singleton fraction drops below 80%, the model lacks discriminative power for the clinical task, or the calibration set is too small.
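The table translates directly into a routing rule. The sketch below is one possible encoding; the function name `triage_prediction_set` and the action strings are hypothetical, and checking the full-set case first is a cautious design choice rather than something the table prescribes.

```python
def triage_prediction_set(pset, n_classes):
    """Map a conformal prediction set to an action label (hypothetical names)."""
    size = len(pset)
    if size == 0:
        raise ValueError("prediction set must contain at least one class")
    # Full set: every class is still plausible, so the sample is suspect.
    # Checked first, so a two-class problem with both classes lands here.
    if size == n_classes and n_classes > 1:
        return "investigate_sample"
    if size == 1:
        return "report"
    if size == 2:
        return "pathologist_review"
    return "confirmatory_testing"


# Example with a hypothetical five-class problem.
for pset in ([2], [0, 2], [0, 1, 2], [0, 1, 2, 3, 4]):
    print(pset, "->", triage_prediction_set(pset, n_classes=5))
```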
Adaptive Conformal Prediction (ACI)
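Standard split conformal prediction provides marginal coverage - the guarantee holds on average across all test spectra, and only while calibration and test data remain exchangeable. It does not guarantee coverage for specific subpopulations (e.g., spectra from a particular instrument or patient demographic), and it can lose coverage when the data distribution drifts. Adaptive Conformal Inference (ACI) addresses drift by adjusting the working miscoverage level online, based on whether recent prediction sets covered the true label:

```python
import numpy as np


class AdaptiveConformalPredictor:
    def __init__(self, alpha: float = 0.05, gamma: float = 0.01):
        self.alpha = alpha      # long-run target miscoverage
        self.gamma = gamma      # step size of the online update
        self.alpha_t = alpha    # working miscoverage level

    def update(self, covered: bool):
        # ACI update: alpha_{t+1} = alpha_t + gamma * (alpha - err_t),
        # where err_t = 1 when the true label was missed, else 0.
        if covered:
            # Covered: raise alpha_t slightly, which tightens future sets.
            self.alpha_t = self.alpha_t + self.gamma * self.alpha
        else:
            # Missed: lower alpha_t sharply, which widens future sets.
            self.alpha_t = self.alpha_t - self.gamma * (1 - self.alpha)
        self.alpha_t = np.clip(self.alpha_t, 0.001, 0.999)

    def get_threshold(self, cal_scores: np.ndarray) -> float:
        quantile_level = min(
            np.ceil((1 - self.alpha_t) * (len(cal_scores) + 1))
            / len(cal_scores),
            1.0
        )
        return np.quantile(cal_scores, quantile_level, method='higher')
```

ACI is valuable in clinical deployments where data distribution shifts over time - new instruments come online, patient populations change, sample preparation protocols evolve. The threshold adapts automatically to maintain long-run coverage despite these shifts. Note that the update needs the true label for each past prediction, so it applies only where confirmed diagnoses are fed back to the system.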
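To see why the online update matters, the following sketch simulates a stream of nonconformity scores whose distribution shifts halfway through. It is a toy under stated assumptions: the Beta-distributed scores and the shift are invented for illustration, the update rule is written out as alpha_{t+1} = alpha_t + gamma * (alpha - err_t), and the working level is left unclipped so that an out-of-range alpha_t maps to an all-inclusive or empty set.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, gamma = 0.05, 0.01
n_steps = 5000

# Calibration scores drawn from the pre-shift distribution (synthetic).
cal_scores = rng.beta(2, 8, size=500)


def threshold(alpha_t):
    """Score threshold at miscoverage level alpha_t."""
    if alpha_t <= 0.0:
        return np.inf    # include every class
    if alpha_t >= 1.0:
        return -np.inf   # include nothing
    return np.quantile(cal_scores, 1.0 - alpha_t, method="higher")


alpha_t = alpha
q_static = threshold(alpha)   # fixed threshold, never updated
misses_adaptive = 0
misses_static = 0

for t in range(n_steps):
    # True-label score; the distribution drifts upward halfway through.
    score = rng.beta(2, 8) if t < n_steps // 2 else rng.beta(3, 6)
    miss = bool(score > threshold(alpha_t))
    misses_adaptive += miss
    misses_static += bool(score > q_static)
    # Online update of the working miscoverage level.
    alpha_t += gamma * (alpha - float(miss))

adaptive_miscoverage = misses_adaptive / n_steps
static_miscoverage = misses_static / n_steps
print(f"static threshold miscoverage: {static_miscoverage:.3f}")
print(f"adaptive (ACI) miscoverage:   {adaptive_miscoverage:.3f}")
```

The static threshold keeps its coverage only until the shift; the adaptive level trades wider sets after the shift for a long-run miscoverage rate that stays close to alpha.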
Calibration Set Sizing
The calibration set must be large enough for the quantile estimate to be tight. As a practical guideline:
| Target α | Minimum calibration size | Notes |
|---|---|---|
| 0.10 | 100 | Loose coverage, small sets |
| 0.05 | 200 | Standard clinical target |
| 0.01 | 500 | Stringent coverage, larger sets |
For spectral classification with 3-5 classes, 200-300 calibration spectra (balanced across classes) provide stable quantile estimates. These spectra must come from the same distribution as test data - same instruments, same sample types, same preprocessing pipeline.
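One way to reason about these sizes: for split conformal prediction with exchangeable data and no ties among scores, the coverage achieved by a particular calibration set follows a Beta(n + 1 - l, l) distribution, where l = floor((n + 1) * alpha). The sketch below uses that result to show how the spread of realized coverage narrows as n grows. The helper name `coverage_interval` and the choice of a 90% central interval are assumptions for illustration.

```python
import numpy as np
from scipy.stats import beta


def coverage_interval(n_cal, alpha, mass=0.90):
    """Central interval for realized coverage given n_cal calibration spectra."""
    l = int(np.floor((n_cal + 1) * alpha))
    if l < 1:
        raise ValueError("calibration set too small for this alpha")
    dist = beta(n_cal + 1 - l, l)
    lo, hi = dist.interval(mass)
    return float(lo), float(hi)


alpha = 0.05
widths = []
for n_cal in (100, 200, 500, 2000):
    lo, hi = coverage_interval(n_cal, alpha)
    widths.append(hi - lo)
    print(f"n = {n_cal:5d}: 90% of calibration sets give coverage "
          f"in [{lo:.3f}, {hi:.3f}]")
```

The guarantee holds on average at any n; what a larger calibration set buys is less variation in the coverage a single deployment actually experiences.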
Monte Carlo Dropout
Monte Carlo (MC) dropout provides a Bayesian approximation to model uncertainty by running multiple stochastic forward passes with dropout enabled at inference time. Each pass produces a different prediction because different neurons are randomly dropped. The variance across passes estimates epistemic uncertainty - uncertainty that arises from limited training data, as opposed to irreducible noise in the measurement.
This is particularly useful for spectral classification because it distinguishes between "this spectrum is ambiguous" (high aleatoric uncertainty - the spectrum genuinely sits between classes) and "this spectrum is unlike the training data" (high epistemic uncertainty - the model has not learned this part of the spectral space).
Implementation
The SpectralCNN from the AI classification pipeline article already includes nn.Dropout(0.5) in the classifier head. MC dropout keeps this dropout active during inference:
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def enable_mc_dropout(model):
    """Eval mode everywhere except dropout layers, which stay stochastic."""
    # Calling model.train() would also switch BatchNorm to batch statistics
    # and update its running averages, which corrupts single-spectrum inference.
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()


def mc_dropout_predict(model, spectrum_tensor, n_forward=50):
    enable_mc_dropout(model)
    predictions = []
    for _ in range(n_forward):
        with torch.no_grad():
            logits = model(spectrum_tensor)
            probs = F.softmax(logits, dim=1)
            predictions.append(probs.cpu().numpy())
    model.eval()  # restore deterministic inference

    predictions = np.array(predictions)  # (n_forward, batch, n_classes)
    mean_probs = predictions.mean(axis=0)
    std_probs = predictions.std(axis=0)

    # Entropy of the averaged prediction: total uncertainty.
    predictive_entropy = -np.sum(
        mean_probs * np.log(mean_probs + 1e-10), axis=1
    )
    # Average entropy of the individual passes: aleatoric part.
    individual_entropies = -np.sum(
        predictions * np.log(predictions + 1e-10), axis=2
    )
    mean_entropy = individual_entropies.mean(axis=0)
    # The difference is the mutual information: epistemic part.
    mutual_information = predictive_entropy - mean_entropy

    return {
        'mean_probs': mean_probs,
        'std_probs': std_probs,
        'predictive_entropy': predictive_entropy,
        'mutual_information': mutual_information,
        'all_predictions': predictions
    }
```

Predictive entropy captures total uncertainty (aleatoric + epistemic). Mutual information between the model parameters and the prediction isolates epistemic uncertainty. High mutual information means the model's predictions change substantially across dropout masks - it is uncertain because it lacks data in this region, not because the spectrum is inherently ambiguous.
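A toy calculation makes the distinction tangible. The two sets of forward-pass outputs below are invented for illustration: in the first, every pass returns the same 50/50 split (an ambiguous spectrum); in the second, passes flip between confident but contradictory answers (an unfamiliar spectrum). The `decompose` helper is a local sketch for a single spectrum.

```python
import numpy as np


def decompose(predictions, eps=1e-10):
    """predictions: (n_forward, n_classes) softmax outputs for one spectrum."""
    mean_probs = predictions.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + eps))
    per_pass = -np.sum(predictions * np.log(predictions + eps), axis=1)
    mutual_info = total - per_pass.mean()
    return float(total), float(mutual_info)


# Every pass agrees the spectrum sits between two classes.
ambiguous = np.tile([0.5, 0.5], (50, 1))
# Passes disagree: half are confident in class 0, half in class 1.
unfamiliar = np.array([[0.99, 0.01], [0.01, 0.99]] * 25)

total_amb, mi_amb = decompose(ambiguous)
total_unf, mi_unf = decompose(unfamiliar)
print(f"ambiguous:  total entropy {total_amb:.3f}, mutual information {mi_amb:.3f}")
print(f"unfamiliar: total entropy {total_unf:.3f}, mutual information {mi_unf:.3f}")
```

Both cases have the same averaged prediction and therefore the same predictive entropy; only the mutual information separates them.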
Practical Considerations
- Number of forward passes. 30-50 passes provide stable uncertainty estimates for spectral classifiers with 3-5 output classes. Beyond 50, the marginal improvement in estimate stability is negligible. At 50 passes with a typical spectral CNN (< 1M parameters), total inference time is 50-200ms on CPU - within the clinical latency budget.
- Dropout rate matters. The standard 0.5 dropout rate used during training is often too aggressive for MC dropout inference. A rate of 0.1-0.3 produces tighter uncertainty estimates while still capturing meaningful epistemic variation. You can modify the dropout rate at inference time without retraining, although the resulting estimates should be re-validated because the rate no longer matches the one the model was trained with:

```python
import torch.nn as nn


def set_mc_dropout_rate(model, rate=0.2):
    # Overwrite the drop probability of every nn.Dropout layer in place.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = rate
```

- Where to place dropout. For spectral CNNs, dropout in the classifier head (after the global average pooling) captures uncertainty about the final classification decision. Adding dropout between convolutional blocks captures uncertainty about learned spectral features. Both are useful; the classifier-head-only approach is simpler and sufficient for most clinical applications.
Combining MC Dropout with Conformal Prediction
MC dropout uncertainty and conformal prediction address different needs. Conformal prediction gives you prediction sets with coverage guarantees. MC dropout gives you a continuous uncertainty score that can be thresholded. Combining them produces a system that is both statistically rigorous and operationally useful:
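```python
import numpy as np


def combined_inference(model, spectrum_tensor, conformal_predictor,
                       scaler_T, n_forward=50):
    mc_result = mc_dropout_predict(model, spectrum_tensor, n_forward)
    mean_probs = mc_result['mean_probs']

    # Temperature is applied to the log of the averaged probabilities, which
    # approximates (but is not identical to) scaling each pass's logits.
    calibrated_probs = np.exp(np.log(mean_probs + 1e-10) / scaler_T)
    calibrated_probs /= calibrated_probs.sum(axis=1, keepdims=True)

    # The conformal predictor must have been calibrated on scores produced
    # by this same pipeline for its coverage guarantee to apply.
    psets = conformal_predictor.predict(calibrated_probs)

    return {
        'prediction_sets': psets,
        'mean_probs': mean_probs,
        'epistemic_uncertainty': mc_result['mutual_information'],
        'total_uncertainty': mc_result['predictive_entropy'],
        'std_probs': mc_result['std_probs']
    }
```

Deep Ensembles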
Deep ensembles train M independent models (typically M = 5) with different random initializations and use their disagreement as an uncertainty measure. Introduced by Lakshminarayanan et al. (2017), deep ensembles frequently match or outperform other uncertainty estimation methods in empirical benchmarks, including MC dropout and variational inference.
The intuition is straightforward: if five independently trained models all agree that a spectrum is malignant, the prediction is robust. If three say malignant and two say dysplastic, the epistemic uncertainty is high.
Implementation
```python
import numpy as np
import torch
import torch.nn as nn


class SpectralEnsemble:
    def __init__(self, base_model_class, model_kwargs, n_models=5):
        # Each member gets its own random initialization.
        self.models = [
            base_model_class(**model_kwargs) for _ in range(n_models)
        ]
        self.n_models = n_models

    def train_ensemble(self, train_dataset, val_dataset, n_epochs=100,
                       lr=1e-3, device='cpu'):
        # val_dataset is not used in this minimal loop; add early stopping
        # or model selection on it as needed.
        for i, model in enumerate(self.models):
            model.to(device)
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                          weight_decay=1e-4)
            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                optimizer, T_max=n_epochs
            )
            # A different shuffle seed per member adds diversity.
            train_loader = torch.utils.data.DataLoader(
                train_dataset, batch_size=64, shuffle=True,
                generator=torch.Generator().manual_seed(i * 1000)
            )
            for epoch in range(n_epochs):
                model.train()
                for spectra_batch, labels in train_loader:
                    spectra_batch = spectra_batch.to(device)
                    labels = labels.to(device)
                    optimizer.zero_grad()
                    loss = nn.CrossEntropyLoss()(model(spectra_batch), labels)
                    loss.backward()
                    optimizer.step()
                scheduler.step()

    def predict(self, spectrum_tensor, device='cpu'):
        all_probs = []
        for model in self.models:
            model.eval()
            model.to(device)
            with torch.no_grad():
                logits = model(spectrum_tensor.to(device))
                probs = torch.softmax(logits, dim=1).cpu().numpy()
                all_probs.append(probs)

        all_probs = np.array(all_probs)  # (n_models, batch, n_classes)
        mean_probs = all_probs.mean(axis=0)
        std_probs = all_probs.std(axis=0)

        # Agreement: fraction of members voting for the most common class.
        predictions = all_probs.argmax(axis=2)  # (n_models, batch)
        agreement = np.array([
            np.max(np.bincount(predictions[:, i], minlength=mean_probs.shape[1]))
            / self.n_models
            for i in range(predictions.shape[1])
        ])

        return {
            'mean_probs': mean_probs,
            'std_probs': std_probs,
            'agreement': agreement,
            'all_probs': all_probs
        }
```

Computational Cost
The primary drawback of deep ensembles is computational cost. Training 5 models takes 5x the training time. Inference takes 5x the single-model latency. For spectral CNNs with fewer than 1M parameters, this is manageable - 5 forward passes through a spectral CNN take ~20ms on CPU. For larger models or real-time applications, consider:
- Snapshot ensembles. Save checkpoints from different phases of a single training run (using cyclic learning rates). You get ensemble diversity without training separate models, at the cost of reduced diversity.
- Batch ensemble. Share the majority of weights across ensemble members and learn only rank-1 perturbations per member. Reduces storage and inference cost to approximately 1.2x a single model.
- Pruned ensembles. Train the full ensemble, then prune each member to 50-70% sparsity. Maintains diversity while reducing inference cost.
For clinical spectroscopy, the full deep ensemble (M = 5) is usually feasible. Spectral CNNs are small. Training takes minutes to hours, not days. Storage for 5 models is a few megabytes. Inference at 20ms is well within latency constraints. Use the full approach unless you are deploying on an embedded device at the spectrometer.
Spectral Novelty Detection
All methods discussed so far assume the test spectrum belongs to one of the known classes. But in clinical deployment, spectra arrive from outside the training distribution: contaminated samples, sample types the model has never seen, instrument malfunctions that produce corrupted data. A model forced to classify a completely novel spectrum will produce a confident but meaningless result.
Novelty detection - also called out-of-distribution (OOD) detection - sits upstream of the classifier and gates whether the spectrum should be classified at all. The AI classification pipeline article introduces a SpectralDriftDetector using Mahalanobis distance in PCA space. Here we extend this with two complementary approaches.
Mahalanobis Distance in Feature Space
The Mahalanobis distance approach from the existing pipeline operates in PCA space on preprocessed spectra. For deep learning models, operating in the penultimate layer feature space (the 128-dimensional representation before the final classification layer) produces substantially better OOD detection:
```python
import torch
import torch.nn as nn
import numpy as np
from scipy.spatial.distance import mahalanobis


class DeepSpectralNoveltyDetector:
    def __init__(self, model, device='cpu'):
        self.model = model
        self.device = device
        self.class_means = {}
        self.precision = None
        self.hook_output = None
        # Capture the output of the first nn.Flatten module as the feature
        # vector. This assumes the model contains such a layer just before
        # its classifier head.
        for name, module in model.named_modules():
            if isinstance(module, nn.Flatten):
                module.register_forward_hook(self._hook)
                break

    def _hook(self, module, input, output):
        self.hook_output = output

    def fit(self, train_loader, n_classes):
        self.model.eval()
        features_by_class = {c: [] for c in range(n_classes)}
        with torch.no_grad():
            for spectra_batch, labels in train_loader:
                spectra_batch = spectra_batch.to(self.device)
                _ = self.model(spectra_batch)
                feats = self.hook_output.cpu().numpy()
                for feat, label in zip(feats, labels.numpy()):
                    features_by_class[int(label)].append(feat)

        # Per-class means with a single covariance estimated over all features.
        all_features = []
        for c in range(n_classes):
            class_feats = np.array(features_by_class[c])
            self.class_means[c] = class_feats.mean(axis=0)
            all_features.append(class_feats)
        all_features = np.vstack(all_features)
        cov = np.cov(all_features.T)
        cov += 1e-6 * np.eye(cov.shape[0])  # regularize before inversion
        self.precision = np.linalg.inv(cov)

    def score(self, spectrum_tensor):
        # Expects a single spectrum (batch size 1).
        self.model.eval()
        with torch.no_grad():
            _ = self.model(spectrum_tensor.to(self.device))
            feat = self.hook_output.cpu().numpy().squeeze()

        distances = {
            c: mahalanobis(feat, mean, self.precision)
            for c, mean in self.class_means.items()
        }
        min_distance = min(distances.values())
        closest_class = min(distances, key=distances.get)
        return {
            'min_mahalanobis': min_distance,
            'closest_class': closest_class,
            'all_distances': distances
        }
```

Set the OOD threshold at the 99th percentile of Mahalanobis distances computed on the training set. Spectra exceeding this threshold are flagged as novel and excluded from classification.
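The percentile rule can be illustrated without a trained network. In this sketch the "features" are synthetic correlated Gaussians standing in for penultimate-layer activations, with a single class for brevity; the dimensionality, mixing matrix, and size of the shift are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
# Mixing matrix that introduces correlations between feature dimensions.
mixing = np.eye(dim) + 0.3 * rng.normal(size=(dim, dim))


def mahalanobis_distances(feats, mean, precision):
    """Row-wise Mahalanobis distance to a single mean."""
    diff = feats - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))


# Synthetic stand-ins for training, held-out, and out-of-distribution features.
train_feats = rng.normal(size=(2000, dim)) @ mixing
heldout_feats = rng.normal(size=(1000, dim)) @ mixing
ood_feats = (rng.normal(size=(1000, dim)) + 4.0) @ mixing

mean = train_feats.mean(axis=0)
cov = np.cov(train_feats.T) + 1e-6 * np.eye(dim)
precision = np.linalg.inv(cov)

# Threshold at the 99th percentile of training distances.
train_dist = mahalanobis_distances(train_feats, mean, precision)
ood_threshold = np.percentile(train_dist, 99)

heldout_flag_rate = (mahalanobis_distances(heldout_feats, mean, precision)
                     > ood_threshold).mean()
ood_flag_rate = (mahalanobis_distances(ood_feats, mean, precision)
                 > ood_threshold).mean()
print(f"threshold: {ood_threshold:.2f}")
print(f"held-out in-distribution flagged: {heldout_flag_rate:.3f}")
print(f"shifted samples flagged:          {ood_flag_rate:.3f}")
```

By construction about 1% of in-distribution spectra exceed a 99th-percentile threshold, so the percentile is also the false-rejection rate the workflow has to absorb.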
Autoencoder Reconstruction Error
An autoencoder trained on in-distribution spectra learns to compress and reconstruct normal spectral patterns. Out-of-distribution spectra - spectra unlike anything in training - cannot be reconstructed well, producing high reconstruction error:
```python
import numpy as np
import torch
import torch.nn as nn
from scipy.spatial.distance import mahalanobis


class SpectralAutoencoder(nn.Module):
    def __init__(self, n_wavenumbers, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_wavenumbers, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, n_wavenumbers)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


class AutoencoderNoveltyDetector:
    def __init__(self, n_wavenumbers, latent_dim=32, device='cpu'):
        self.model = SpectralAutoencoder(n_wavenumbers, latent_dim)
        self.device = device
        self.threshold = None          # reconstruction-error threshold
        self.latent_threshold = None   # latent Mahalanobis threshold
        self.latent_mean = None
        self.latent_precision = None

    def fit(self, train_spectra, n_epochs=200, lr=1e-3):
        self.model.to(self.device)
        dataset = torch.utils.data.TensorDataset(
            torch.FloatTensor(train_spectra)
        )
        loader = torch.utils.data.DataLoader(dataset, batch_size=64,
                                             shuffle=True)
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)

        for epoch in range(n_epochs):
            for (batch,) in loader:
                batch = batch.to(self.device)
                recon, z = self.model(batch)
                loss = nn.MSELoss()(recon, batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # 99th percentile of training reconstruction error.
        recon_errors = self._compute_errors(train_spectra)
        self.threshold = np.percentile(recon_errors, 99)

        # Gaussian model of the training latent codes.
        self.model.eval()
        with torch.no_grad():
            _, z_all = self.model(torch.FloatTensor(train_spectra).to(self.device))
        z_np = z_all.cpu().numpy()
        self.latent_mean = z_np.mean(axis=0)
        cov = np.cov(z_np.T) + 1e-6 * np.eye(z_np.shape[1])
        self.latent_precision = np.linalg.inv(cov)
        # 99th percentile of training latent distances, mirroring the
        # reconstruction-error threshold.
        latent_dists = [
            mahalanobis(z, self.latent_mean, self.latent_precision)
            for z in z_np
        ]
        self.latent_threshold = np.percentile(latent_dists, 99)

    def _compute_errors(self, spectra):
        self.model.eval()
        with torch.no_grad():
            tensor = torch.FloatTensor(spectra).to(self.device)
            recon, _ = self.model(tensor)
            errors = ((recon.cpu().numpy() - spectra) ** 2).mean(axis=1)
        return errors

    def score(self, spectrum):
        # spectrum: 1D NumPy array of length n_wavenumbers
        self.model.eval()
        with torch.no_grad():
            tensor = torch.FloatTensor(spectrum).unsqueeze(0).to(self.device)
            recon, z = self.model(tensor)
            recon_error = float(((recon.cpu().numpy() - spectrum) ** 2).mean())
            z_np = z.cpu().numpy().squeeze()
            latent_dist = mahalanobis(z_np, self.latent_mean,
                                      self.latent_precision)

        # Each signal is normalized by its own training-set threshold.
        combined = (recon_error / self.threshold
                    + latent_dist / self.latent_threshold)
        return {
            'reconstruction_error': recon_error,
            'latent_mahalanobis': latent_dist,
            'combined_score': combined,
            # Novel if either signal exceeds its threshold.
            'is_novel': bool(recon_error > self.threshold
                             or latent_dist > self.latent_threshold),
            'threshold': self.threshold,
            'latent_threshold': self.latent_threshold
        }
```

The combined approach - reconstruction error plus Mahalanobis distance in latent space - catches two distinct failure modes. Pure reconstruction error misses OOD samples that happen to land on the latent manifold. Pure latent distance misses OOD samples that are close to the training distribution in latent space but have unusual spectral features. The combination catches both.
Which Novelty Detector to Use
| Scenario | Recommended approach |
|---|---|
| CNN classifier already deployed | Deep feature Mahalanobis (uses existing model) |
| Classical ML pipeline (PLS-DA, SVM) | PCA-space Mahalanobis (lightweight) |
| High-stakes application requiring redundancy | Autoencoder + Mahalanobis combined |
| Need to detect subtle instrument drift | Autoencoder (sensitive to systematic shifts) |
Setting Clinical Decision Thresholds
Confidence scores and prediction sets must be translated into clinically actionable categories. The standard approach is a three-zone decision system - positive, negative, and indeterminate - but the zone boundaries must be set using clinical performance requirements, not arbitrary probability cutoffs.
Threshold Optimization Framework
```python
import numpy as np
from sklearn.metrics import confusion_matrix


def optimize_clinical_thresholds(y_true, y_prob, uncertainty_scores,
                                 min_sensitivity=0.95,
                                 min_specificity=0.90,
                                 max_indeterminate=0.15):
    # Assumes a binary task: label 0 = negative, label 1 = positive.
    results = []
    for conf_thresh in np.arange(0.50, 0.99, 0.01):
        for unc_thresh in np.arange(0.01, 0.50, 0.01):
            # A sample is determinate only if it is both confident
            # and low in uncertainty.
            determinate_mask = (
                (np.max(y_prob, axis=1) >= conf_thresh) &
                (uncertainty_scores <= unc_thresh)
            )
            indeterminate_rate = 1.0 - determinate_mask.mean()
            if indeterminate_rate > max_indeterminate:
                continue

            y_true_det = y_true[determinate_mask]
            y_pred_det = np.argmax(y_prob[determinate_mask], axis=1)
            if len(y_true_det) == 0:
                continue

            tn, fp, fn, tp = confusion_matrix(
                y_true_det, y_pred_det, labels=[0, 1]
            ).ravel()
            sens = tp / (tp + fn) if (tp + fn) > 0 else 0
            spec = tn / (tn + fp) if (tn + fp) > 0 else 0

            if sens >= min_sensitivity and spec >= min_specificity:
                results.append({
                    'confidence_threshold': float(round(conf_thresh, 2)),
                    'uncertainty_threshold': float(round(unc_thresh, 2)),
                    'sensitivity': sens,
                    'specificity': spec,
                    'indeterminate_rate': indeterminate_rate,
                    'n_determinate': int(determinate_mask.sum())
                })

    if not results:
        return None
    # Among feasible threshold pairs, prefer the lowest indeterminate rate.
    results.sort(key=lambda r: r['indeterminate_rate'])
    return results[0]
```

The framework searches a grid of confidence and uncertainty thresholds for pairs that achieve the target sensitivity and specificity on the determinate samples while keeping the indeterminate rate under its cap. The indeterminate rate is capped because a system that says "I don't know" for 50% of samples is not clinically viable. Among the feasible pairs, the function returns the one with the lowest indeterminate rate, and it returns None when no pair meets all three requirements. As written, it assumes a binary task with labels 0 and 1.
Clinical Decision Zones
The final three-zone system integrates confidence, uncertainty, and novelty detection:
def clinical_decision(spectrum_tensor, model, conformal_predictor,
                      novelty_detector, mc_n_forward=50,
                      confidence_threshold=0.85,
                      uncertainty_threshold=0.15,
                      class_names=None):
    # Gate 1: reject out-of-distribution spectra before classification
    novelty = novelty_detector.score(spectrum_tensor.squeeze().cpu().numpy())
    if novelty['is_novel']:
        return {
            'decision': 'REJECTED',
            'reason': 'Out-of-distribution spectrum',
            'novelty_score': novelty['combined_score']
        }
    # MC dropout: mean probabilities plus epistemic uncertainty
    mc_result = mc_dropout_predict(model, spectrum_tensor, mc_n_forward)
    # Conformal prediction set for this single spectrum
    psets = conformal_predictor.predict(mc_result['mean_probs'])
    pset = psets[0]
    max_prob = float(np.max(mc_result['mean_probs']))
    epistemic = float(mc_result['mutual_information'][0])
    predicted_class = int(np.argmax(mc_result['mean_probs']))
    # Determinate only if all three signals agree
    is_determinate = (
        len(pset) == 1
        and max_prob >= confidence_threshold
        and epistemic <= uncertainty_threshold
    )
    if is_determinate:
        decision = 'POSITIVE' if predicted_class > 0 else 'NEGATIVE'
        zone = 'determinate'
    else:
        decision = 'INDETERMINATE'
        zone = 'review'
    label = class_names[predicted_class] if class_names else str(predicted_class)
    return {
        'decision': decision,
        'zone': zone,
        'predicted_class': label,
        'confidence': max_prob,
        'epistemic_uncertainty': epistemic,
        'prediction_set': [class_names[c] if class_names else str(c)
                           for c in pset],
        'prediction_set_size': len(pset),
        'conformal_coverage': 1 - conformal_predictor.alpha
    }

The layered design is deliberate. Novelty detection runs first and rejects spectra that should never reach the classifier. MC dropout produces uncertainty estimates. Conformal prediction produces prediction sets. The thresholds combine all three signals to determine the clinical decision zone. Only spectra that pass all checks receive a determinate result.
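One way to test the zone rule without a model, a conformal predictor, or a novelty detector is to isolate it as a pure function. The sketch below does that; `assign_zone` is a hypothetical helper introduced here for illustration, not part of the pipeline above, and it assumes the same default thresholds.

```python
def assign_zone(prediction_set, predicted_class, max_prob, epistemic,
                confidence_threshold=0.85, uncertainty_threshold=0.15):
    """Hypothetical helper: the zone rule from clinical_decision, isolated."""
    determinate = (
        len(prediction_set) == 1
        and max_prob >= confidence_threshold
        and epistemic <= uncertainty_threshold
    )
    if not determinate:
        return 'INDETERMINATE', 'review'
    decision = 'POSITIVE' if predicted_class > 0 else 'NEGATIVE'
    return decision, 'determinate'


# Each case fails (or passes) for a different reason.
confident_positive = assign_zone([1], 1, 0.97, 0.03)
confident_negative = assign_zone([0], 0, 0.95, 0.05)
ambiguous_set = assign_zone([0, 1], 1, 0.97, 0.03)    # two plausible classes
low_confidence = assign_zone([1], 1, 0.80, 0.03)      # below confidence cutoff
high_uncertainty = assign_zone([1], 1, 0.97, 0.40)    # above uncertainty cutoff
empty_set = assign_zone([], 1, 0.97, 0.03)            # no class in the set
```

Note that an empty prediction set also lands in the review zone, because the rule requires exactly one class in the set.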
Regulatory Perspective
The FDA's evolving framework for AI/ML-based Software as a Medical Device (SaMD) has direct implications for how confidence scores are implemented and documented. As of 2026, the key regulatory touchpoints are outlined below.
Predetermined Change Control Plans (PCCP)
The FDA's December 2024 final guidance on PCCPs allows manufacturers to pre-specify how an AI/ML model will be updated post-market without requiring a new 510(k) or PMA for each change. For confidence scoring, this means your threshold parameters (confidence cutoffs, indeterminate zone boundaries) can be part of a PCCP - but only if you document in advance:
- The conditions under which thresholds will be updated
- The validation protocol for any threshold change
- The performance boundaries that trigger a threshold review
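The three items above can be kept next to the thresholds themselves as a versioned record. The sketch below is one possible shape, assuming the same sensitivity, specificity, and indeterminate-rate boundaries used in threshold optimization; the class name, field names, and numeric values are hypothetical and are not drawn from any FDA template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdChangePlan:
    """Hypothetical record of thresholds plus the boundaries that trigger review."""
    confidence_threshold: float
    uncertainty_threshold: float
    min_sensitivity: float
    min_specificity: float
    max_indeterminate: float

    def review_triggers(self, sensitivity, specificity, indeterminate_rate):
        """Return the names of the boundaries that observed performance violates."""
        triggers = []
        if sensitivity < self.min_sensitivity:
            triggers.append('sensitivity')
        if specificity < self.min_specificity:
            triggers.append('specificity')
        if indeterminate_rate > self.max_indeterminate:
            triggers.append('indeterminate_rate')
        return triggers


# Illustrative values only.
plan = ThresholdChangePlan(
    confidence_threshold=0.85,
    uncertainty_threshold=0.15,
    min_sensitivity=0.95,
    min_specificity=0.90,
    max_indeterminate=0.15,
)
triggers = plan.review_triggers(
    sensitivity=0.96, specificity=0.88, indeterminate_rate=0.12
)
```

A non-empty trigger list would start the pre-specified validation protocol; it does not by itself change any threshold.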
Clinical Decision Support Transparency
The FDA's 2026 guidance on Clinical Decision Support (CDS) software clarifies that AI-assisted diagnostic results must allow the healthcare professional to understand the basis of the recommendation. For spectral classifiers, this means:
- Confidence scores must be presented - not just the classification label. A bare "positive" result without an associated confidence metric does not meet CDS transparency expectations.
- Uncertainty must be communicated in a way that is interpretable by the clinician. Prediction sets (from conformal prediction) are preferable to raw probability values because they directly communicate the set of plausible diagnoses.
- Limitations must be explicit. The system must communicate when it is operating outside validated conditions - which is what novelty detection provides.
Good Machine Learning Practice (GMLP)
The IMDRF's 2025 finalized GMLP principles, endorsed jointly by the FDA and EMA, include specific expectations relevant to confidence scoring:
- Training data must be representative of the intended use population
- Models must be tested on independent datasets
- Model limitations must be documented and communicated
- Post-market performance monitoring must be in place
For a spectral classification system, satisfying these principles requires the infrastructure described in this article: calibrated confidence scores (not raw softmax outputs), prediction sets with documented coverage guarantees, out-of-distribution detection with documented thresholds, and clinical performance monitoring that tracks calibration and coverage over time.
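Tracking calibration and coverage over time needs two numbers per monitoring window: expected calibration error (ECE) and the empirical coverage of the prediction sets. The sketch below is a minimal version assuming equal-width confidence bins and confirmed ground-truth labels for each case in the window; the function names and toy inputs are illustrative.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return float(ece)


def empirical_coverage(prediction_sets, y_true):
    """Fraction of cases whose true label falls inside the prediction set."""
    hits = [int(label in pset) for pset, label in zip(prediction_sets, y_true)]
    return sum(hits) / len(hits)


# Toy monitoring window (made-up values).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)

prediction_sets = [[1], [0, 1], [0], [1]]
y_true = [1, 0, 1, 1]
coverage = empirical_coverage(prediction_sets, y_true)
```

In a monitoring job, a rising ECE or a coverage that drifts below the nominal 1 - alpha level would be a signal to recalibrate.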
What to Include in Your Regulatory Submission
| Component | What to document |
|---|---|
| Calibration method | Temperature scaling or Platt scaling, with ECE before and after |
| Conformal prediction | Coverage guarantee, calibration set size and composition, empirical coverage on validation set |
| Uncertainty method | MC dropout or ensemble, number of passes/members, mutual information thresholds |
| Novelty detection | OOD detection approach, threshold derivation, false positive/negative rates on known OOD samples |
| Decision zones | Threshold values, optimization methodology, clinical rationale for zone boundaries |
| Performance metrics | Sensitivity, specificity, PPV, NPV per decision zone, indeterminate rate |
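The performance metrics row can be produced by a small helper that restricts the confusion counts to determinate results. The sketch below assumes a binary problem with 1 as the positive class; the function name and the toy labels are illustrative.

```python
import numpy as np


def determinate_zone_metrics(y_true, y_pred, determinate_mask):
    """Metrics over determinate results only, plus the indeterminate rate."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mask = np.asarray(determinate_mask, dtype=bool)
    yt, yp = y_true[mask], y_pred[mask]

    tp = int(np.sum((yt == 1) & (yp == 1)))
    tn = int(np.sum((yt == 0) & (yp == 0)))
    fp = int(np.sum((yt == 0) & (yp == 1)))
    fn = int(np.sum((yt == 1) & (yp == 0)))

    def ratio(num, den):
        # NaN rather than 0 when a metric is undefined for this sample
        return num / den if den > 0 else float('nan')

    return {
        'sensitivity': ratio(tp, tp + fn),
        'specificity': ratio(tn, tn + fp),
        'ppv': ratio(tp, tp + fp),
        'npv': ratio(tn, tn + fn),
        'indeterminate_rate': float(1.0 - mask.mean()),
        'n_determinate': int(mask.sum()),
    }


# Toy example: eight cases, the last two deferred to review.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
determinate = [True, True, True, True, True, True, False, False]
metrics = determinate_zone_metrics(y_true, y_pred, determinate)
```

Reporting these alongside the indeterminate rate makes it clear that the headline sensitivity and specificity apply only to the cases the system was willing to call.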
Method Comparison
| Method | Impact on accuracy | Computational cost | Regulatory acceptance | Implementation difficulty | Outputs |
|---|---|---|---|---|---|
| Temperature scaling | None (preserves ranking) | Negligible | High - well-understood | Low - single parameter | Calibrated probabilities |
| Platt scaling | None (preserves ranking) | Negligible | High - well-understood | Low - logistic regression | Calibrated probabilities |
| Conformal prediction | None (post-hoc) | Negligible at inference | Very high - formal guarantees | Medium - requires calibration set | Prediction sets with coverage |
| MC dropout | Slight decrease (stochastic) | 30-50x single pass | Medium - growing acceptance | Medium - modify inference loop | Mean/variance, entropy, MI |
| Deep ensembles | Slight increase (averaging) | 5x training, 5x inference | Medium - growing acceptance | High - train multiple models | Mean/variance, agreement |
| Mahalanobis OOD | None (upstream gate) | Negligible | High - interpretable | Medium - fit distribution | Distance score, in/out flag |
| Autoencoder OOD | None (upstream gate) | Train autoencoder + negligible inference | Medium - less interpretable | High - separate model | Reconstruction error, distance |
For a production spectral classification system, the recommended combination is: temperature scaling + conformal prediction + MC dropout + Mahalanobis OOD detection. This provides calibrated probabilities, guaranteed-coverage prediction sets, continuous uncertainty estimates, and novelty detection - covering the clinical and regulatory requirements discussed above at manageable computational cost.
Deep ensembles are the gold standard for uncertainty quality but are usually reserved for offline validation or high-stakes applications where the 5x cost is justified.
Putting It All Together
The complete inference pipeline for a clinical spectral classifier with uncertainty quantification:
Incoming Spectrum
│
├── 1. Preprocessing (baseline, normalization, region selection)
│
├── 2. Novelty Detection (Mahalanobis in feature space)
│ → REJECT if OOD
│
├── 3. MC Dropout Inference (50 forward passes)
│ → Mean probabilities, epistemic uncertainty
│
├── 4. Temperature Scaling (calibrate mean probabilities)
│
├── 5. Conformal Prediction (generate prediction set)
│
├── 6. Clinical Decision Logic
│ → Combine confidence, uncertainty, set size
│ → Assign to POSITIVE / NEGATIVE / INDETERMINATE
│
└── 7. Report Generation
→ Classification, confidence, prediction set,
uncertainty flag, audit trail
Every component is independently testable and independently validatable. The novelty detector can be validated with known OOD spectra. The calibration can be assessed with reliability diagrams. The conformal predictor can be validated with held-out coverage experiments. The decision thresholds can be optimized with clinical performance simulations. This modularity is not just good engineering - it also aligns with IEC 62304, which expects medical device software to be decomposed into units that are verified individually.
The SpectraDx platform implements this full confidence scoring pipeline out of the box, including calibrated probabilities, conformal prediction sets, and novelty detection.
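As an example of validating one component in isolation, the sketch below runs a held-out coverage experiment for split conformal prediction on synthetic data. It assumes the common nonconformity score of 1 minus the true-class probability, which may differ from the score used by the conformal_predictor in this pipeline; the simulated probabilities stand in for calibrated model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n, n_classes=3):
    """Synthetic softmax outputs with labels drawn from those probabilities."""
    logits = 2.0 * rng.normal(size=(n, n_classes))
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(n_classes, p=p) for p in probs])
    return probs, labels


alpha = 0.10  # target miscoverage, so nominal coverage is 0.90
cal_probs, cal_labels = simulate(1000)
test_probs, test_labels = simulate(2000)

# Nonconformity score on the calibration set: 1 - p(true class)
cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

# Finite-sample-corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest score
n = len(cal_scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
q_hat = float(np.sort(cal_scores)[k - 1]) if k <= n else 1.0

# Prediction set: every class whose score is within the threshold
pred_sets = (1.0 - test_probs) <= q_hat
coverage = float(pred_sets[np.arange(len(test_labels)), test_labels].mean())
avg_set_size = float(pred_sets.sum(axis=1).mean())

print(f"empirical coverage: {coverage:.3f}, mean set size: {avg_set_size:.2f}")
```

Empirical coverage on the held-out set should land near the nominal 0.90, with some sampling variation in either direction. A large shortfall on real data usually points to a calibration set that is not exchangeable with the deployment population.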
For the data infrastructure that feeds this pipeline, see spectral data pipeline architecture. For the explainability layer, which explains why the model made a given prediction and is a natural companion to confidence scoring (which explains how certain the prediction is), see explainable AI for spectroscopy. And for the data format layer that ensures spectra arrive in a consistent, parseable format regardless of instrument vendor, see spectral data formats.
Further Reading
- Building AI Pipelines for Spectral Classification - the foundational ML pipeline this article extends
- Clinical Workflow Architecture - system architecture for clinical spectroscopy software
- Spectral Data Pipeline Architecture - data infrastructure for spectral classification systems
- Explainable AI for Spectroscopy - model interpretability, complementary to uncertainty quantification
- Spectral Data Formats - data interchange formats for multi-instrument pipelines
- Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017) - the definitive study on neural network miscalibration
- Angelopoulos and Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification" (2023) - the best tutorial on conformal prediction
- Lakshminarayanan et al., "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles" (NeurIPS 2017) - the deep ensembles paper
Part of the SpectraDx technical blog.

