A spectral classifier that outputs "malignant" is making a claim. A spectral classifier that outputs "malignant, 0.97 confidence, prediction set {malignant}" is making a claim you can act on. The difference between these two outputs is the difference between a research prototype and a deployable clinical system.

This article covers the engineering of that difference. We start with why raw neural network outputs are unreliable as probabilities, then work through four methods for producing calibrated confidence estimates:

Temperature scaling and Platt scaling (post-hoc calibration)
Conformal prediction (guaranteed-coverage prediction sets)
Monte Carlo dropout (Bayesian uncertainty approximation)
Deep ensembles (multi-model disagreement)

We then cover out-of-distribution detection for spectra that should never reach the classifier at all, and finish with clinical threshold design and regulatory expectations.

Throughout, the code is written for the spectral classification pipeline described in Building AI Pipelines for Spectral Classification. That article covers the SpectralCNN architecture, preprocessing, validation methodology, and the classify_with_confidence() function that implements a basic three-zone threshold scheme. This article replaces that basic scheme with rigorous uncertainty quantification.

All code uses PyTorch, scikit-learn, and NumPy. Variable names assume spectral data: spectra, wavenumbers, n_spectra, n_wavenumbers.

The Clinical Problem

Consider two patients. Patient A's FTIR tissue spectrum runs through a classifier and returns "malignant" with a softmax output of 0.52. Patient B's spectrum returns "malignant" with a softmax output of 0.998. The clinical action should be radically different - yet a system that reports only the label treats them identically.

This is not a theoretical concern. In clinical spectroscopy, several factors conspire to produce ambiguous classifications:

Borderline pathology. Not every tissue sample is clearly normal or clearly malignant. Dysplastic tissue, early-stage lesions, and mixed samples produce spectra that sit between class distributions.
Sample quality variation. Tissue hydration, fixation protocols, section thickness, and contamination (blood, paraffin residue, adhesive) introduce spectral artifacts that degrade classifier certainty.
Instrument variability. The same tissue measured on two different spectrometers produces slightly different spectra. A model trained on instrument A encounters systematic shifts on instrument B.
Novel sample types. A classifier trained on normal and malignant breast tissue receives a benign fibroadenoma it has never seen. It must output something - and whatever it outputs will be wrong.

A clinician needs three things from a spectral classification system:

The predicted label
A calibrated probability that the label is correct
A flag when the system is operating outside its competence

The rest of this article is about producing those three outputs reliably.

Why Raw Softmax Outputs Are Not Probabilities

The softmax function converts a vector of logits into values that sum to 1. This superficially resembles a probability distribution, and many deployed systems treat softmax outputs as calibrated probabilities. They are not.

Modern deep neural networks - including the 1D CNNs used for spectral classification - are systematically overconfident. A network that achieves 90% accuracy on a test set will routinely assign softmax probabilities above 0.99 to its predictions. When the network predicts with 0.95 confidence, it is correct far less than 95% of the time.

This miscalibration was characterized systematically by Guo et al. (2017) and has been confirmed across architectures and domains. The cause is a combination of overparameterization, batch normalization, and training with negative log-likelihood loss - all standard practices that improve accuracy while degrading calibration.

Measuring calibration requires a reliability diagram. Bin predictions by their confidence level and compare the mean confidence in each bin to the actual accuracy:

import numpy as np
from sklearn.metrics import accuracy_score
 
def reliability_diagram(y_true, y_prob, n_bins=10):
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_accs = []
    bin_confs = []
    bin_counts = []
 
    max_probs = np.max(y_prob, axis=1)
    y_pred = np.argmax(y_prob, axis=1)
 
    for i in range(n_bins):
        mask = (max_probs > bin_edges[i]) & (max_probs <= bin_edges[i + 1])
        if np.sum(mask) == 0:
            continue
        bin_acc = accuracy_score(y_true[mask], y_pred[mask])
        bin_conf = np.mean(max_probs[mask])
        bin_accs.append(bin_acc)
        bin_confs.append(bin_conf)
        bin_counts.append(np.sum(mask))
 
    ece = np.sum([
        (c / sum(bin_counts)) * abs(a - f)
        for a, f, c in zip(bin_accs, bin_confs, bin_counts)
    ])
    return bin_accs, bin_confs, bin_counts, ece

The Expected Calibration Error (ECE) is the weighted average gap between accuracy and confidence across bins. A perfectly calibrated model has ECE = 0. Uncalibrated spectral CNNs typically show ECE between 0.05 and 0.15 - meaning the average confidence-accuracy gap is 5-15 percentage points. That gap is clinically dangerous.
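To make the ECE definition concrete, here is a minimal sketch on a deliberately overconfident toy classifier. The function name expected_calibration_error and the toy data are illustrative assumptions, not part of the pipeline. Every prediction is made at 0.9 confidence but only 7 of 10 are correct, so the single occupied bin contributes a gap of 0.2.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    conf = y_prob.max(axis=1)
    pred = y_prob.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = np.mean(pred[mask] == y_true[mask])
            # mask.mean() is the fraction of samples that fall in this bin
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece

# Hypothetical toy data: always predicts class 0 at 0.9 confidence,
# but the true label is class 0 for only 7 of the 10 samples.
y_prob = np.tile([0.9, 0.1], (10, 1))
y_true = np.array([0] * 7 + [1] * 3)

ece = expected_calibration_error(y_true, y_prob)
print(f"ECE = {ece:.2f}")
```

A model like this would look acceptable on accuracy alone (70%) while overstating its certainty by 20 percentage points on every sample.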

Post-Hoc Calibration: Temperature Scaling and Platt Scaling

The simplest fix for miscalibrated softmax outputs is post-hoc calibration - learning a transformation that maps raw model outputs to calibrated probabilities without modifying the model itself.

Temperature Scaling

Temperature scaling divides logits by a single learned parameter T before applying softmax. When T > 1, the distribution becomes softer (less confident). When T < 1, it becomes sharper (more confident). The parameter is optimized on a held-out calibration set by minimizing negative log-likelihood.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
 
class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1) * 1.5)
 
    def forward(self, logits):
        return logits / self.temperature
 
    def fit(self, val_loader, model, device, lr=0.01, max_iter=200):
        model.eval()
        logits_list, labels_list = [], []
 
        with torch.no_grad():
            for spectra_batch, labels in val_loader:
                spectra_batch = spectra_batch.to(device)
                logits = model(spectra_batch)
                logits_list.append(logits.cpu())
                labels_list.append(labels)
 
        logits_all = torch.cat(logits_list)
        labels_all = torch.cat(labels_list)
        nll = nn.CrossEntropyLoss()
        optimizer = torch.optim.LBFGS([self.temperature], lr=lr, max_iter=max_iter)
 
        def closure():
            optimizer.zero_grad()
            scaled = self.forward(logits_all)
            loss = nll(scaled, labels_all)
            loss.backward()
            return loss
 
        optimizer.step(closure)
        return self.temperature.item()

Usage after training:

scaler = TemperatureScaler()
optimal_T = scaler.fit(val_loader, model, device)
 
model.eval()
with torch.no_grad():
    logits = model(spectrum_tensor)
    calibrated_logits = logits / optimal_T
    calibrated_probs = torch.softmax(calibrated_logits, dim=1)

Typical optimal temperatures for spectral CNNs range from 1.3 to 2.5. An optimal T of 2.0 means the raw logits were roughly twice as large as the calibration data supports, so every gap between logits is halved before softmax. Because dividing all logits by the same positive T does not change their order, temperature scaling preserves prediction accuracy (the argmax does not change) while fixing the probability estimates.
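The effect of T on a single prediction can be checked by hand. The sketch below uses hypothetical logits for a three-class problem and a plain NumPy softmax; none of these values come from a real model. With T = 2.0 the top probability drops from about 0.94 to about 0.74 while the predicted class stays the same.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])   # hypothetical logits, 3 classes

p_raw = softmax(logits, T=1.0)       # uncalibrated
p_cal = softmax(logits, T=2.0)       # temperature-scaled

print("raw:       ", p_raw.round(3))
print("calibrated:", p_cal.round(3))
print("same argmax:", p_raw.argmax() == p_cal.argmax())
```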

Platt Scaling

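Platt scaling fits a logistic regression model on the logits. In its original binary form it learns one scale and one shift for the classifier score. The multiclass version below fits a multinomial logistic regression on the full logit vector (the variant Guo et al. call matrix scaling), so it learns a weight matrix plus a per-class bias. This is more flexible than temperature scaling and can correct asymmetric miscalibration (e.g., the model is well-calibrated for class 0 but overconfident for class 1):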

from sklearn.linear_model import LogisticRegression
 
def platt_calibrate(logits_val, y_val, logits_test):
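    # C=1e10 effectively disables regularization. With the lbfgs solver,
    # scikit-learn fits a multinomial model for multiclass targets by default;
    # the explicit multi_class argument is deprecated in recent releases.
    calibrator = LogisticRegression(
        C=1e10, solver='lbfgs', max_iter=1000
    )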
    calibrator.fit(logits_val, y_val)
    calibrated_probs = calibrator.predict_proba(logits_test)
    return calibrated_probs

When to use which. Temperature scaling is preferred when you have limited calibration data (50-200 spectra) because it learns a single parameter. Platt scaling is preferred when you have more calibration data (500+) and observe class-specific miscalibration. Both are simple to implement, add negligible inference latency (a single division or linear transform), and should be the default first step for any deployed spectral classifier.

Neither method addresses a deeper problem: they assume the model's ranking of predictions is correct and merely adjust the scale. If the model is fundamentally uncertain about a sample, post-hoc calibration cannot recover that uncertainty. For that, we need methods that quantify uncertainty structurally.

Conformal Prediction

Conformal prediction is the single most important method in this article. Unlike calibration methods that adjust point estimates, conformal prediction produces prediction sets - sets of classes that are guaranteed to contain the true class with a user-specified probability (e.g., 90%, 95%, 99%). The guarantee is distribution-free: it holds regardless of the model architecture, the data distribution, or the quality of the model, provided only that calibration and test data are exchangeable.

For clinical spectroscopy, this is transformative. Instead of reporting "malignant, probability 0.87," the system reports "prediction set: {malignant}, coverage 95%." If the model is uncertain, the prediction set grows to include additional classes at the same 95% coverage level. The clinician sees exactly how many classes remain plausible, and the coverage guarantee is mathematically proven.

How Split Conformal Prediction Works

Split conformal prediction requires three components:

A trained model that produces softmax probabilities (or any scores)
A calibration set of labeled spectra that were not used for training
A target coverage level (1 - alpha), e.g., 0.95

The algorithm:

Run the model on each calibration spectrum. For each, compute a nonconformity score - a measure of how "surprising" the true label is. The standard choice is 1 - softmax_probability_of_true_class.
Sort the calibration scores and find the (1 - alpha)(1 + 1/n) quantile, where n is the calibration set size. Call this threshold q_hat.
At test time, include class k in the prediction set if its softmax probability is at least 1 - q_hat (equivalently, if its nonconformity score is at most q_hat).

The coverage guarantee follows from a quantile argument on exchangeable data. No assumptions about the model or data distribution are required.
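The three steps can be traced by hand on a tiny example. The sketch below uses seven made-up calibration scores and alpha = 0.25, chosen only so the arithmetic is easy to follow; a real calibration set would be far larger. It picks the quantile by explicit sorting rather than np.quantile so that the index is visible.

```python
import math
import numpy as np

alpha = 0.25   # target coverage 1 - alpha = 75% (illustrative only)

# Step 1: nonconformity scores (1 - softmax of true class) on calibration data.
cal_scores = np.array([0.02, 0.05, 0.10, 0.20, 0.40, 0.70, 0.90])  # made up
n = len(cal_scores)

# Step 2: q_hat is the k-th smallest score, k = ceil((n + 1) * (1 - alpha)).
k = math.ceil((n + 1) * (1 - alpha))          # ceil(8 * 0.75) = 6
# If k > n the calibration set is too small for this alpha: include everything.
q_hat = np.inf if k > n else np.sort(cal_scores)[k - 1]

# Step 3: keep every class whose score 1 - p_k is at most q_hat.
def prediction_set(probs, q_hat):
    scores = 1.0 - np.asarray(probs, dtype=float)
    return [int(c) for c in np.where(scores <= q_hat)[0]]

confident = prediction_set([0.95, 0.03, 0.02], q_hat)
ambiguous = prediction_set([0.50, 0.45, 0.05], q_hat)

print("q_hat:", q_hat)
print("confident spectrum ->", confident)
print("ambiguous spectrum ->", ambiguous)
```

The confident spectrum yields a singleton set, while the ambiguous one yields two classes: the set size itself carries the uncertainty.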

Full Implementation

import numpy as np
from typing import List, Dict
 
class SpectralConformalPredictor:
    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.q_hat = None
        self.n_cal = 0
 
    def calibrate(self, cal_probs: np.ndarray, cal_labels: np.ndarray):
        n = len(cal_labels)
        self.n_cal = n
        scores = 1.0 - cal_probs[np.arange(n), cal_labels]
        quantile_level = np.ceil((1 - self.alpha) * (n + 1)) / n
        quantile_level = min(quantile_level, 1.0)
        self.q_hat = np.quantile(scores, quantile_level, method='higher')
 
    def predict(self, test_probs: np.ndarray) -> List[List[int]]:
        threshold = 1.0 - self.q_hat
        prediction_sets = []
        for probs in test_probs:
            pset = np.where(probs >= threshold)[0].tolist()
            if len(pset) == 0:
                pset = [int(np.argmax(probs))]
            prediction_sets.append(pset)
        return prediction_sets
 
    def evaluate(self, test_probs: np.ndarray, test_labels: np.ndarray) -> Dict:
        psets = self.predict(test_probs)
        covered = sum(
            1 for pset, y in zip(psets, test_labels) if y in pset
        )
        coverage = covered / len(test_labels)
        sizes = [len(ps) for ps in psets]
        return {
            'empirical_coverage': coverage,
            'target_coverage': 1 - self.alpha,
            'mean_set_size': np.mean(sizes),
            'median_set_size': np.median(sizes),
            'singleton_fraction': np.mean([s == 1 for s in sizes]),
            'empty_fraction': 0.0,
            'q_hat': self.q_hat,
            'n_calibration': self.n_cal
        }

Using Conformal Prediction in a Spectral Pipeline

# cal_spectra_tensor and test_spectra_tensor hold labeled spectra that were
# not used for training; optimal_T comes from the temperature scaling step.
model.eval()
with torch.no_grad():
    cal_logits = model(cal_spectra_tensor)
    cal_probs = torch.softmax(cal_logits / optimal_T, dim=1).cpu().numpy()
 
    test_logits = model(test_spectra_tensor)
    test_probs = torch.softmax(test_logits / optimal_T, dim=1).cpu().numpy()
 
cp = SpectralConformalPredictor(alpha=0.05)
cp.calibrate(cal_probs, cal_labels)
results = cp.evaluate(test_probs, test_labels)
 
print(f"Coverage: {results['empirical_coverage']:.3f} "
      f"(target: {results['target_coverage']:.3f})")
print(f"Mean set size: {results['mean_set_size']:.2f}")
print(f"Singleton fraction: {results['singleton_fraction']:.3f}")

Notice that we apply temperature scaling before conformal prediction. This is deliberate and recommended by Angelopoulos and Bates (2023). Conformal prediction's coverage guarantee holds regardless of calibration, but better-calibrated softmax scores produce smaller prediction sets - which means more clinically useful results.

Interpreting Prediction Sets Clinically

The prediction set size is the key clinical signal:

Set Size	Interpretation	Clinical Action
1	Model is confident in a single class	Report result with confidence level
2	Model cannot distinguish between two classes	Flag for pathologist review or repeat measurement
3+	Model has high uncertainty	Do not report; require confirmatory testing
= total classes	Model knows nothing about this sample	Likely out-of-distribution; investigate sample quality

A well-performing spectral classifier at α = 0.05 should produce singleton prediction sets for 85-95% of samples. If the singleton fraction drops below 80%, the model lacks discriminative power for the clinical task, or the calibration set is too small.
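The table can be turned into a small routing function. This is a sketch only: the function name triage_by_set_size and the action labels are hypothetical, and checking the "all classes" row first (so that a full set is always treated as likely out-of-distribution) is an assumption about how overlapping rows should be resolved.

```python
def triage_by_set_size(pset, n_classes):
    """Map a conformal prediction set to a coarse clinical action label."""
    size = len(pset)
    if size == 0:
        raise ValueError("prediction set must contain at least one class")
    if size >= n_classes:
        return "investigate"   # every class plausible: likely out-of-distribution
    if size == 1:
        return "report"        # single class: report with coverage level
    if size == 2:
        return "review"        # two classes: pathologist review or re-measure
    return "confirm"           # three or more: require confirmatory testing

# Example with a hypothetical 4-class problem
for pset in ([2], [0, 1], [0, 1, 2], [0, 1, 2, 3]):
    print(pset, "->", triage_by_set_size(pset, n_classes=4))
```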

Adaptive Conformal Inference (ACI)

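Standard split conformal prediction provides marginal coverage - the guarantee holds on average across all test spectra, and only while calibration and test data remain exchangeable. It does not guarantee coverage for specific subpopulations (e.g., spectra from a particular instrument or patient demographic), and it can lose coverage when the data distribution drifts after calibration. Adaptive Conformal Inference (ACI) targets the drift case by adjusting the working miscoverage level alpha_t, and with it the threshold, based on recent prediction performance: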

class AdaptiveConformalPredictor:
    def __init__(self, alpha: float = 0.05, gamma: float = 0.01):
        self.alpha = alpha
        self.gamma = gamma
        self.alpha_t = alpha
 
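    def update(self, covered: bool):
        # ACI rule (Gibbs and Candes, 2021):
        #   alpha_{t+1} = alpha_t + gamma * (alpha - err_t), err_t = 1 on a miss.
        if covered:
            # Hit: raise alpha_t slightly, which tightens prediction sets.
            self.alpha_t = self.alpha_t + self.gamma * self.alpha
        else:
            # Miss: lower alpha_t sharply, which widens prediction sets.
            self.alpha_t = self.alpha_t - self.gamma * (1 - self.alpha)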
        self.alpha_t = np.clip(self.alpha_t, 0.001, 0.999)
 
    def get_threshold(self, cal_scores: np.ndarray) -> float:
        quantile_level = min(
            np.ceil((1 - self.alpha_t) * (len(cal_scores) + 1))
            / len(cal_scores),
            1.0
        )
        return np.quantile(cal_scores, quantile_level, method='higher')

ACI is valuable in clinical deployments where data distribution shifts over time - new instruments come online, patient populations change, sample preparation protocols evolve. The threshold adapts automatically to maintain coverage despite these shifts.
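A short simulation shows the adaptation at work. The sketch below is a toy under stated assumptions: nonconformity scores are drawn uniformly, test scores drift upward over time to mimic a shift, and the helper names aci_update and threshold are hypothetical. It uses the update alpha_{t+1} = alpha_t + gamma * (alpha - err_t) and treats alpha_t outside (0, 1) as "include everything" or "include nothing" instead of clipping.

```python
import numpy as np

def aci_update(alpha_t, covered, alpha=0.05, gamma=0.01):
    """One ACI step: a miss (err_t = 1) lowers alpha_t, a hit raises it."""
    err_t = 0.0 if covered else 1.0
    return alpha_t + gamma * (alpha - err_t)

def threshold(cal_scores, alpha_t):
    """Score threshold for the current working miscoverage level."""
    if alpha_t <= 0.0:
        return np.inf        # cover everything
    if alpha_t >= 1.0:
        return -np.inf       # cover nothing
    return np.quantile(cal_scores, 1.0 - alpha_t)

rng = np.random.default_rng(0)
cal_scores = rng.uniform(0.0, 1.0, size=500)   # fixed calibration scores (toy)

alpha, gamma, n_steps = 0.05, 0.01, 5000
alpha_t, errors = alpha, []
for t in range(n_steps):
    drift = 0.3 * t / n_steps                  # test scores slowly grow
    score = rng.uniform(0.0, 1.0) + drift
    covered = score <= threshold(cal_scores, alpha_t)
    errors.append(0.0 if covered else 1.0)
    alpha_t = aci_update(alpha_t, covered, alpha, gamma)

miscoverage = float(np.mean(errors))
print(f"long-run miscoverage: {miscoverage:.3f} (target {alpha})")
```

Despite the drift, the long-run miscoverage stays close to the target, because each miss pushes the threshold toward wider sets.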

Calibration Set Sizing

The calibration set must be large enough for the quantile estimate to be tight. As a practical guideline:

Target α	Minimum calibration size	Notes
0.10	100	Loose coverage, small sets
0.05	200	Standard clinical target
0.01	500	Stringent coverage, larger sets

For spectral classification with 3-5 classes, 200-300 calibration spectra (balanced across classes) provide stable quantile estimates. These spectra must come from the same distribution as test data - same instruments, same sample types, same preprocessing pipeline.
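One way to see why calibration size matters: for split conformal prediction with continuous scores, the coverage achieved by a particular calibration set follows a Beta(n + 1 - l, l) distribution with l = floor((n + 1) * alpha), as described by Angelopoulos and Bates. The sketch below samples from that distribution to show how the spread narrows as n grows. The function name coverage_spread and the 5th-95th percentile band are illustrative choices.

```python
import numpy as np

def coverage_spread(n_cal, alpha, n_draws=100_000, seed=0):
    """5th and 95th percentile of realized coverage for a calibration set of size n_cal."""
    l = int(np.floor((n_cal + 1) * alpha))
    if l < 1:
        raise ValueError("calibration set too small for this alpha")
    rng = np.random.default_rng(seed)
    draws = rng.beta(n_cal + 1 - l, l, size=n_draws)
    lo, hi = np.percentile(draws, [5, 95])
    return float(lo), float(hi)

for n_cal in (100, 200, 500, 1000):
    lo, hi = coverage_spread(n_cal, alpha=0.05)
    print(f"n = {n_cal:4d}: coverage mostly in [{lo:.3f}, {hi:.3f}]")
```

The band is centered near the 95% target in every case; what the larger calibration set buys is a tighter band around it.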

Monte Carlo Dropout

Monte Carlo (MC) dropout provides a Bayesian approximation to model uncertainty by running multiple stochastic forward passes with dropout enabled at inference time. Each pass produces a different prediction because different neurons are randomly dropped. The variance across passes estimates epistemic uncertainty - uncertainty that arises from limited training data, as opposed to irreducible noise in the measurement.

This is particularly useful for spectral classification because it distinguishes between "this spectrum is ambiguous" (high aleatoric uncertainty - the spectrum genuinely sits between classes) and "this spectrum is unlike the training data" (high epistemic uncertainty - the model has not learned this part of the spectral space).

Implementation

The SpectralCNN from the AI classification pipeline article already includes nn.Dropout(0.5) in the classifier head. MC dropout keeps this dropout active during inference:

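import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
 
def mc_dropout_predict(model, spectrum_tensor, n_forward=50):
    # Enable only the dropout layers. Calling model.train() would also switch
    # BatchNorm layers to batch statistics and update their running averages.
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
 
    predictions = []
    with torch.no_grad():
        for _ in range(n_forward):
            logits = model(spectrum_tensor)
            probs = F.softmax(logits, dim=1)
            predictions.append(probs.cpu().numpy())
 
    model.eval()  # restore deterministic inference for later calls
 
    predictions = np.array(predictions)  # (n_forward, batch, n_classes)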
    mean_probs = predictions.mean(axis=0)
    std_probs = predictions.std(axis=0)
 
    predictive_entropy = -np.sum(
        mean_probs * np.log(mean_probs + 1e-10), axis=1
    )
 
    individual_entropies = -np.sum(
        predictions * np.log(predictions + 1e-10), axis=2
    )
    mean_entropy = individual_entropies.mean(axis=0)
    mutual_information = predictive_entropy - mean_entropy
 
    return {
        'mean_probs': mean_probs,
        'std_probs': std_probs,
        'predictive_entropy': predictive_entropy,
        'mutual_information': mutual_information,
        'all_predictions': predictions
    }

Predictive entropy captures total uncertainty (aleatoric + epistemic). Mutual information between the model parameters and the prediction isolates epistemic uncertainty. High mutual information means the model's predictions change substantially across dropout masks - it is uncertain because it lacks data in this region, not because the spectrum is inherently ambiguous.
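The distinction is easiest to see on two hand-built cases with the same averaged prediction. In the sketch below (toy numbers, hypothetical function name uncertainty_decomposition), the "ambiguous" spectrum gets 50/50 on every pass, while the "unfamiliar" spectrum flips between confident but contradictory outputs. Both have the same predictive entropy, but only the second has high mutual information.

```python
import numpy as np

def uncertainty_decomposition(predictions, eps=1e-10):
    """predictions: (n_forward, n_classes) softmax outputs for one spectrum.

    Returns (predictive_entropy, mutual_information) in nats.
    """
    mean_probs = predictions.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + eps))
    per_pass = -np.sum(predictions * np.log(predictions + eps), axis=1)
    return float(total), float(total - per_pass.mean())

# Every pass agrees the spectrum sits between two classes (aleatoric).
ambiguous = np.tile([0.5, 0.5], (50, 1))

# Passes disagree confidently with each other (epistemic).
unfamiliar = np.array([[0.99, 0.01]] * 25 + [[0.01, 0.99]] * 25)

total_a, mi_a = uncertainty_decomposition(ambiguous)
total_u, mi_u = uncertainty_decomposition(unfamiliar)

print(f"ambiguous:  entropy={total_a:.3f}, mutual information={mi_a:.3f}")
print(f"unfamiliar: entropy={total_u:.3f}, mutual information={mi_u:.3f}")
```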

Practical Considerations

Number of forward passes. 30-50 passes provide stable uncertainty estimates for spectral classifiers with 3-5 output classes. Beyond 50, the marginal improvement in estimate stability is negligible. At 50 passes with a typical spectral CNN (< 1M parameters), total inference time is 50-200ms on CPU - within the clinical latency budget.
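Dropout rate matters. The standard 0.5 dropout rate used during training is often too aggressive for MC dropout inference. A rate of 0.1-0.3 produces tighter uncertainty estimates while still capturing meaningful epistemic variation. You can modify the dropout rate at inference time without retraining, although a rate that differs from the training rate departs from the conditions the network was trained under, so validate the resulting uncertainty estimates on held-out spectra: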

def set_mc_dropout_rate(model, rate=0.2):
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = rate

Where to place dropout. For spectral CNNs, dropout in the classifier head (after the global average pooling) captures uncertainty about the final classification decision. Adding dropout between convolutional blocks captures uncertainty about learned spectral features. Both are useful; the classifier-head-only approach is simpler and sufficient for most clinical applications.

Combining MC Dropout with Conformal Prediction

MC dropout uncertainty and conformal prediction address different needs. Conformal prediction gives you prediction sets with coverage guarantees. MC dropout gives you a continuous uncertainty score that can be thresholded. Combining them produces a system that is both statistically rigorous and operationally useful:

def combined_inference(model, spectrum_tensor, conformal_predictor,
                       scaler_T, n_forward=50):
    mc_result = mc_dropout_predict(model, spectrum_tensor, n_forward)
    mean_probs = mc_result['mean_probs']
 
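    # Approximate temperature scaling on the averaged probabilities: per-pass
    # logits are not kept, so log(mean_probs) stands in for the logits.
    calibrated_probs = np.exp(np.log(mean_probs + 1e-10) / scaler_T)
    calibrated_probs /= calibrated_probs.sum(axis=1, keepdims=True)
 
    # For the coverage guarantee to carry over, conformal_predictor should be
    # calibrated on probabilities produced by this same MC-averaged pipeline.
    psets = conformal_predictor.predict(calibrated_probs)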
 
    return {
        'prediction_sets': psets,
        'mean_probs': mean_probs,
        'epistemic_uncertainty': mc_result['mutual_information'],
        'total_uncertainty': mc_result['predictive_entropy'],
        'std_probs': mc_result['std_probs']
    }

Deep Ensembles

Deep ensembles train M independent models (typically M = 5) with different random initializations and use their disagreement as an uncertainty measure. Introduced by Lakshminarayanan et al. (2017), deep ensembles consistently outperform other uncertainty estimation methods in empirical benchmarks, including MC dropout and variational inference.

The intuition is straightforward: if five independently trained models all agree that a spectrum is malignant, the prediction is robust. If three say malignant and two say dysplastic, the epistemic uncertainty is high.
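The three-versus-two split described above can be written out directly. The member probabilities and the class order in this sketch are made up for illustration; the point is only how the vote fraction and the averaged probabilities are derived from per-member outputs.

```python
import numpy as np

class_names = ["normal", "dysplastic", "malignant"]   # hypothetical label order

# Softmax outputs of five ensemble members for one spectrum (made-up values).
member_probs = np.array([
    [0.05, 0.25, 0.70],
    [0.10, 0.30, 0.60],
    [0.05, 0.40, 0.55],
    [0.10, 0.60, 0.30],
    [0.05, 0.65, 0.30],
])

votes = member_probs.argmax(axis=1)                       # class picked by each member
counts = np.bincount(votes, minlength=len(class_names))   # votes per class
agreement = counts.max() / len(member_probs)              # fraction backing the majority
majority = class_names[int(counts.argmax())]
mean_probs = member_probs.mean(axis=0)                    # ensemble prediction

print("votes per class:", dict(zip(class_names, counts.tolist())))
print(f"majority: {majority}, agreement: {agreement:.2f}")
print("mean probabilities:", mean_probs.round(2))
```

An agreement of 0.6 together with nearly tied mean probabilities for two classes is exactly the situation that should be routed to review, not reported.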

Implementation

import numpy as np
import torch
import torch.nn as nn
 
class SpectralEnsemble:
    def __init__(self, base_model_class, model_kwargs, n_models=5):
        self.models = [
            base_model_class(**model_kwargs) for _ in range(n_models)
        ]
        self.n_models = n_models
 
    def train_ensemble(self, train_dataset, val_dataset, n_epochs=100,
                       lr=1e-3, device='cpu'):
        for i, model in enumerate(self.models):
            model.to(device)
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                          weight_decay=1e-4)
            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                optimizer, T_max=n_epochs
            )
            train_loader = torch.utils.data.DataLoader(
                train_dataset, batch_size=64, shuffle=True,
                generator=torch.Generator().manual_seed(i * 1000)
            )
 
            for epoch in range(n_epochs):
                model.train()
                for spectra_batch, labels in train_loader:
                    spectra_batch = spectra_batch.to(device)
                    labels = labels.to(device)
                    optimizer.zero_grad()
                    loss = nn.CrossEntropyLoss()(model(spectra_batch), labels)
                    loss.backward()
                    optimizer.step()
                scheduler.step()
 
    def predict(self, spectrum_tensor, device='cpu'):
        all_probs = []
        for model in self.models:
            model.eval()
            model.to(device)
            with torch.no_grad():
                logits = model(spectrum_tensor.to(device))
                probs = torch.softmax(logits, dim=1).cpu().numpy()
                all_probs.append(probs)
 
        all_probs = np.array(all_probs)  # (n_models, batch, n_classes)
        mean_probs = all_probs.mean(axis=0)
        std_probs = all_probs.std(axis=0)
 
        predictions = all_probs.argmax(axis=2)  # (n_models, batch)
        agreement = np.array([
            np.max(np.bincount(predictions[:, i], minlength=mean_probs.shape[1]))
            / self.n_models
            for i in range(predictions.shape[1])
        ])
 
        return {
            'mean_probs': mean_probs,
            'std_probs': std_probs,
            'agreement': agreement,
            'all_probs': all_probs
        }

Computational Cost

The primary drawback of deep ensembles is computational cost. Training 5 models takes 5x the training time. Inference takes 5x the single-model latency. For spectral CNNs with fewer than 1M parameters, this is manageable - 5 forward passes through a spectral CNN take ~20ms on CPU. For larger models or real-time applications, consider:

Snapshot ensembles. Save checkpoints from different phases of a single training run (using cyclic learning rates). You get ensemble diversity without training separate models, at the cost of reduced diversity.
Batch ensemble. Share the majority of weights across ensemble members and learn only rank-1 perturbations per member. Reduces storage and inference cost to approximately 1.2x a single model.
Pruned ensembles. Train the full ensemble, then prune each member to 50-70% sparsity. Maintains diversity while reducing inference cost.

For clinical spectroscopy, the full deep ensemble (M = 5) is usually feasible. Spectral CNNs are small. Training takes minutes to hours, not days. Storage for 5 models is a few megabytes. Inference at 20ms is well within latency constraints. Use the full approach unless you are deploying on an embedded device at the spectrometer.

Spectral Novelty Detection

All methods discussed so far assume the test spectrum belongs to one of the known classes. But in clinical deployment, spectra arrive from outside the training distribution: contaminated samples, sample types the model has never seen, instrument malfunctions that produce corrupted data. A model forced to classify a completely novel spectrum will produce a confident but meaningless result.

Novelty detection - also called out-of-distribution (OOD) detection - sits upstream of the classifier and gates whether the spectrum should be classified at all. The AI classification pipeline article introduces a SpectralDriftDetector using Mahalanobis distance in PCA space. Here we extend this with two complementary approaches.

Mahalanobis Distance in Feature Space

The Mahalanobis distance approach from the existing pipeline operates in PCA space on preprocessed spectra. For deep learning models, operating in the penultimate layer feature space (the 128-dimensional representation before the final classification layer) produces substantially better OOD detection:

import numpy as np
import torch
import torch.nn as nn
from scipy.spatial.distance import mahalanobis
 
class DeepSpectralNoveltyDetector:
    def __init__(self, model, device='cpu'):
        self.model = model
        self.device = device
        self.class_means = {}
        self.precision = None
        self.hook_output = None
 
        for name, module in model.named_modules():
            if isinstance(module, nn.Flatten):
                module.register_forward_hook(self._hook)
                break
 
    def _hook(self, module, input, output):
        self.hook_output = output
 
    def fit(self, train_loader, n_classes):
        self.model.eval()
        features_by_class = {c: [] for c in range(n_classes)}
 
        with torch.no_grad():
            for spectra_batch, labels in train_loader:
                spectra_batch = spectra_batch.to(self.device)
                _ = self.model(spectra_batch)
                feats = self.hook_output.cpu().numpy()
                for feat, label in zip(feats, labels.numpy()):
                    features_by_class[int(label)].append(feat)
 
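        # Tied covariance: center each class on its own mean before pooling,
        # so between-class separation does not inflate the covariance estimate.
        centered = []
        for c in range(n_classes):
            class_feats = np.array(features_by_class[c])
            self.class_means[c] = class_feats.mean(axis=0)
            centered.append(class_feats - self.class_means[c])
 
        centered = np.vstack(centered)
        cov = centered.T @ centered / len(centered)
        cov += 1e-6 * np.eye(cov.shape[0])
        self.precision = np.linalg.inv(cov)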
 
    def score(self, spectrum_tensor):
        self.model.eval()
        with torch.no_grad():
            _ = self.model(spectrum_tensor.to(self.device))
            feat = self.hook_output.cpu().numpy().squeeze()
 
        distances = {
            c: mahalanobis(feat, mean, self.precision)
            for c, mean in self.class_means.items()
        }
        min_distance = min(distances.values())
        closest_class = min(distances, key=distances.get)
        return {
            'min_mahalanobis': min_distance,
            'closest_class': closest_class,
            'all_distances': distances
        }

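Set the OOD threshold at the 99th percentile of the minimum Mahalanobis distances computed on in-distribution spectra, ideally a held-out validation set, since distances on the training set itself tend to be optimistically small. Spectra exceeding this threshold are flagged as novel and excluded from classification.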
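The thresholding step itself is a one-liner once reference distances are available. The sketch below uses synthetic stand-in distances instead of real detector output, and the helper names fit_ood_threshold and is_novel are hypothetical. By construction, roughly 1% of in-distribution spectra will exceed a 99th-percentile threshold, which is the false-alarm rate this choice accepts.

```python
import numpy as np

def fit_ood_threshold(reference_distances, percentile=99.0):
    """Threshold from minimum Mahalanobis distances of in-distribution spectra."""
    return float(np.percentile(reference_distances, percentile))

def is_novel(distance, threshold):
    """Flag a spectrum whose minimum class distance exceeds the threshold."""
    return bool(distance > threshold)

# Synthetic stand-in for in-distribution distances (not real detector output).
rng = np.random.default_rng(0)
reference = np.sqrt(rng.chisquare(df=8, size=2000))

thr = fit_ood_threshold(reference, percentile=99.0)
flag_rate = float(np.mean(reference > thr))

print(f"threshold: {thr:.2f}")
print(f"in-distribution flag rate: {flag_rate:.3f}")
print("far-away spectrum flagged:", is_novel(3.0 * thr, thr))
```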

Autoencoder Reconstruction Error

An autoencoder trained on in-distribution spectra learns to compress and reconstruct normal spectral patterns. Out-of-distribution spectra - spectra unlike anything in training - cannot be reconstructed well, producing high reconstruction error:

class SpectralAutoencoder(nn.Module):
    def __init__(self, n_wavenumbers, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_wavenumbers, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, n_wavenumbers)
        )
 
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z
 
 
class AutoencoderNoveltyDetector:
    def __init__(self, n_wavenumbers, latent_dim=32, device='cpu'):
        self.model = SpectralAutoencoder(n_wavenumbers, latent_dim)
        self.device = device
        self.threshold = None
        self.latent_mean = None
        self.latent_precision = None
 
    def fit(self, train_spectra, n_epochs=200, lr=1e-3):
        self.model.to(self.device)
        dataset = torch.utils.data.TensorDataset(
            torch.FloatTensor(train_spectra)
        )
        loader = torch.utils.data.DataLoader(dataset, batch_size=64,
                                              shuffle=True)
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
 
        for epoch in range(n_epochs):
            for (batch,) in loader:
                batch = batch.to(self.device)
                recon, z = self.model(batch)
                loss = nn.MSELoss()(recon, batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
 
        recon_errors = self._compute_errors(train_spectra)
        self.threshold = np.percentile(recon_errors, 99)
 
        self.model.eval()
        with torch.no_grad():
            _, z_all = self.model(torch.FloatTensor(train_spectra).to(self.device))
            z_np = z_all.cpu().numpy()
            self.latent_mean = z_np.mean(axis=0)
            cov = np.cov(z_np.T) + 1e-6 * np.eye(z_np.shape[1])
            self.latent_precision = np.linalg.inv(cov)
 
    def _compute_errors(self, spectra):
        self.model.eval()
        with torch.no_grad():
            tensor = torch.FloatTensor(spectra).to(self.device)
            recon, _ = self.model(tensor)
            errors = ((recon.cpu().numpy() - spectra) ** 2).mean(axis=1)
        return errors
 
    def score(self, spectrum):
        self.model.eval()
        with torch.no_grad():
            tensor = torch.FloatTensor(spectrum).unsqueeze(0).to(self.device)
            recon, z = self.model(tensor)
 
        recon_error = float(((recon.cpu().numpy() - spectrum) ** 2).mean())
        z_np = z.cpu().numpy().squeeze()
        latent_dist = mahalanobis(z_np, self.latent_mean,
                                  self.latent_precision)
 
        combined = recon_error / self.threshold + latent_dist / 10.0
 
        return {
            'reconstruction_error': recon_error,
            'latent_mahalanobis': latent_dist,
            'combined_score': combined,
            'is_novel': recon_error > self.threshold,
            'threshold': self.threshold
        }

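The combined approach - reconstruction error plus Mahalanobis distance in latent space - catches two distinct failure modes. Pure reconstruction error misses OOD samples that happen to land on the latent manifold. Pure latent distance misses OOD samples that are close to the training distribution in latent space but have unusual spectral features. The combination catches both. Note that is_novel in the code above thresholds reconstruction error alone, and the 10.0 divisor in combined_score is an ad hoc scale; to act on both signals, set a threshold for combined_score (or for the latent distance) on held-out in-distribution spectra.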
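One simple way to act on both signals is to threshold each separately and flag a spectrum if either fires. This is a sketch of that design choice, not the scheme used by the detector class above; the function name combined_novelty_flag and the threshold values in the example are hypothetical, and in practice both thresholds would be percentiles estimated on held-out in-distribution spectra.

```python
def combined_novelty_flag(recon_error, latent_dist,
                          recon_threshold, latent_threshold):
    """Flag a spectrum as novel if either OOD signal exceeds its own threshold."""
    high_recon = recon_error > recon_threshold     # poorly reconstructed
    high_latent = latent_dist > latent_threshold   # far from training latents
    return {
        "is_novel": bool(high_recon or high_latent),
        "high_reconstruction_error": bool(high_recon),
        "high_latent_distance": bool(high_latent),
    }

# Hypothetical thresholds and scores for three spectra
typical = combined_novelty_flag(0.002, 3.1, recon_threshold=0.01, latent_threshold=8.0)
odd_shape = combined_novelty_flag(0.050, 3.5, recon_threshold=0.01, latent_threshold=8.0)
odd_latent = combined_novelty_flag(0.004, 12.0, recon_threshold=0.01, latent_threshold=8.0)
print(typical, odd_shape, odd_latent, sep="\n")
```

Keeping the two flags separate in the output also records which failure mode triggered the rejection, which is useful when investigating sample quality.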

Which Novelty Detector to Use

Scenario	Recommended approach
CNN classifier already deployed	Deep feature Mahalanobis (uses existing model)
Classical ML pipeline (PLS-DA, SVM)	PCA-space Mahalanobis (lightweight)
High-stakes application requiring redundancy	Autoencoder + Mahalanobis combined
Need to detect subtle instrument drift	Autoencoder (sensitive to systematic shifts)

Setting Clinical Decision Thresholds

Confidence scores and prediction sets must be translated into clinically actionable categories. The standard approach is a three-zone decision system - positive, negative, and indeterminate - but the zone boundaries must be set using clinical performance requirements, not arbitrary probability cutoffs.

Threshold Optimization Framework

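import numpy as np
from sklearn.metrics import confusion_matrix
 
def optimize_clinical_thresholds(y_true, y_prob, uncertainty_scores,
                                 min_sensitivity=0.95,
                                 min_specificity=0.90,
                                 max_indeterminate=0.15):
    """Grid-search confidence/uncertainty cutoffs for a binary classifier.
 
    Returns the qualifying threshold pair with the lowest indeterminate
    rate, or None if no pair meets the sensitivity/specificity targets.
    """
    results = []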
    for conf_thresh in np.arange(0.50, 0.99, 0.01):
        for unc_thresh in np.arange(0.01, 0.50, 0.01):
            determinate_mask = (
                (np.max(y_prob, axis=1) >= conf_thresh) &
                (uncertainty_scores <= unc_thresh)
            )
            indeterminate_rate = 1.0 - determinate_mask.mean()
            if indeterminate_rate > max_indeterminate:
                continue
 
            y_true_det = y_true[determinate_mask]
            y_pred_det = np.argmax(y_prob[determinate_mask], axis=1)
 
            if len(y_true_det) == 0:
                continue
 
            tn, fp, fn, tp = confusion_matrix(
                y_true_det, y_pred_det, labels=[0, 1]
            ).ravel()
 
            sens = tp / (tp + fn) if (tp + fn) > 0 else 0
            spec = tn / (tn + fp) if (tn + fp) > 0 else 0
 
            if sens >= min_sensitivity and spec >= min_specificity:
                results.append({
                    'confidence_threshold': conf_thresh,
                    'uncertainty_threshold': unc_thresh,
                    'sensitivity': sens,
                    'specificity': spec,
                    'indeterminate_rate': indeterminate_rate,
                    'n_determinate': int(determinate_mask.sum())
                })
 
    if not results:
        return None
 
    results.sort(key=lambda r: r['indeterminate_rate'])
    return results[0]

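The framework grid-searches confidence and uncertainty thresholds and keeps only the pairs that meet the target sensitivity and specificity on determinate samples while holding the indeterminate rate under its cap. The cap matters: a system that says "I don't know" for 50% of samples is not clinically viable. Among the qualifying pairs, the function returns the one with the lowest indeterminate rate, so the result favors throughput over extra margin on sensitivity and specificity. As written, the confusion-matrix step assumes a binary problem with labels 0 and 1, and the reported sensitivity and specificity are empirical values on the data used for the search, not guarantees.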
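
Because the thresholds are selected on the same data used to measure them, the resulting sensitivity and specificity are optimistically biased. A minimal sketch of re-checking a chosen threshold pair on a separate held-out set follows. The function name evaluate_on_holdout is hypothetical, and like the search itself it assumes binary labels 0 and 1.

```python
import numpy as np

def evaluate_on_holdout(y_true, y_prob, uncertainty_scores,
                        conf_thresh, unc_thresh):
    """Apply fixed thresholds to held-out binary data and report metrics."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    uncertainty_scores = np.asarray(uncertainty_scores)

    # Same determinate rule as the grid search, with thresholds frozen
    determinate = ((y_prob.max(axis=1) >= conf_thresh) &
                   (uncertainty_scores <= unc_thresh))
    yt = y_true[determinate]
    yp = y_prob[determinate].argmax(axis=1)

    tp = int(np.sum((yt == 1) & (yp == 1)))
    tn = int(np.sum((yt == 0) & (yp == 0)))
    fp = int(np.sum((yt == 0) & (yp == 1)))
    fn = int(np.sum((yt == 1) & (yp == 0)))

    def ratio(num, den):
        # NaN rather than 0 when a metric is undefined for this sample
        return num / den if den > 0 else float('nan')

    return {
        'sensitivity': ratio(tp, tp + fn),
        'specificity': ratio(tn, tn + fp),
        'ppv': ratio(tp, tp + fp),
        'npv': ratio(tn, tn + fn),
        'indeterminate_rate': 1.0 - float(determinate.mean()),
        'n_determinate': int(determinate.sum()),
    }
```

If the held-out sensitivity or specificity falls below the targets, the thresholds were likely overfit to the tuning set and the search should be repeated with more data or a coarser grid.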

Clinical Decision Zones

The final three-zone system integrates confidence, uncertainty, and novelty detection:

def clinical_decision(spectrum_tensor, model, conformal_predictor,
                      novelty_detector, mc_n_forward=50,
                      confidence_threshold=0.85,
                      uncertainty_threshold=0.15,
                      class_names=None):
 
    novelty = novelty_detector.score(spectrum_tensor.squeeze().cpu().numpy())
    if novelty['is_novel']:
        return {
            'decision': 'REJECTED',
            'reason': 'Out-of-distribution spectrum',
            'novelty_score': novelty['combined_score']
        }
 
    mc_result = mc_dropout_predict(model, spectrum_tensor, mc_n_forward)
    psets = conformal_predictor.predict(mc_result['mean_probs'])
    pset = psets[0]
 
    max_prob = float(np.max(mc_result['mean_probs']))
    epistemic = float(mc_result['mutual_information'][0])
    predicted_class = int(np.argmax(mc_result['mean_probs']))
 
    if len(pset) == 1 and max_prob >= confidence_threshold \
            and epistemic <= uncertainty_threshold:
        decision = 'POSITIVE' if predicted_class > 0 else 'NEGATIVE'
        zone = 'determinate'
    else:
        decision = 'INDETERMINATE'
        zone = 'review'
 
    label = class_names[predicted_class] if class_names else str(predicted_class)
 
    return {
        'decision': decision,
        'zone': zone,
        'predicted_class': label,
        'confidence': max_prob,
        'epistemic_uncertainty': epistemic,
        'prediction_set': [class_names[c] if class_names else str(c)
                           for c in pset],
        'prediction_set_size': len(pset),
        'conformal_coverage': 1 - conformal_predictor.alpha
    }

The layered design is deliberate. Novelty detection runs first and rejects spectra that should never reach the classifier. MC dropout produces uncertainty estimates. Conformal prediction produces prediction sets. The thresholds combine all three signals to determine the clinical decision zone. Only spectra that pass all checks receive a determinate result.

Regulatory Perspective

The FDA's evolving framework for AI/ML-based Software as a Medical Device (SaMD) has direct implications for how confidence scores are implemented and documented. As of 2026, the key regulatory touchpoints are outlined below.

Predetermined Change Control Plans (PCCP)

The FDA's December 2024 final guidance on PCCPs allows manufacturers to pre-specify how an AI/ML model will be updated post-market without requiring a new 510(k) or PMA for each change. For confidence scoring, this means your threshold parameters (confidence cutoffs, indeterminate zone boundaries) can be part of a PCCP - but only if you document in advance:

The conditions under which thresholds will be updated
The validation protocol for any threshold change
The performance boundaries that trigger a threshold review
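
The third item, performance boundaries that trigger a review, can be expressed as an explicit check against monitored metrics. The sketch below is purely illustrative: PerformanceBounds and review_triggers are hypothetical names, the default bounds simply mirror the defaults of optimize_clinical_thresholds, and nothing here is a regulatory template.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PerformanceBounds:
    """Pre-specified limits (hypothetical values) documented in advance."""
    min_sensitivity: float = 0.95
    min_specificity: float = 0.90
    max_indeterminate_rate: float = 0.15

def review_triggers(metrics, bounds=PerformanceBounds()):
    """Return the list of reasons a threshold review is triggered, if any."""
    reasons = []
    if metrics['sensitivity'] < bounds.min_sensitivity:
        reasons.append('sensitivity below bound')
    if metrics['specificity'] < bounds.min_specificity:
        reasons.append('specificity below bound')
    if metrics['indeterminate_rate'] > bounds.max_indeterminate_rate:
        reasons.append('indeterminate rate above bound')
    return reasons
```

Keeping the bounds in a frozen, versioned object makes it easier to show that the trigger conditions were fixed before any post-market data was seen.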

Clinical Decision Support Transparency

The FDA's 2026 guidance on Clinical Decision Support (CDS) software clarifies that AI-assisted diagnostic results must allow the healthcare professional to understand the basis of the recommendation. For spectral classifiers, this means:

Confidence scores must be presented - not just the classification label. A bare "positive" result without an associated confidence metric does not meet CDS transparency expectations.
Uncertainty must be communicated in a way that is interpretable by the clinician. Prediction sets (from conformal prediction) are preferable to raw probability values because they directly communicate the set of plausible diagnoses.
Limitations must be explicit. The system must communicate when it is operating outside validated conditions - which is what novelty detection provides.

Good Machine Learning Practice (GMLP)

The IMDRF's 2025 finalized GMLP principles, endorsed jointly by the FDA and EMA, include specific expectations relevant to confidence scoring:

Training data must be representative of the intended use population
Models must be tested on independent datasets
Model limitations must be documented and communicated
Post-market performance monitoring must be in place

For a spectral classification system, satisfying these principles requires the infrastructure described in this article: calibrated confidence scores (not raw softmax outputs), prediction sets with documented coverage guarantees, out-of-distribution detection with documented thresholds, and clinical performance monitoring that tracks calibration and coverage over time.
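
Tracking coverage over time can be as simple as recomputing empirical coverage on each batch of cases whose ground truth has since been confirmed. A minimal sketch, assuming prediction sets are stored as lists of class indices; the function name coverage_report and the tolerance parameter are hypothetical.

```python
import numpy as np

def coverage_report(y_true, prediction_sets, target_coverage, tolerance=0.02):
    """Summarize conformal coverage for one monitoring window."""
    # A case is covered when the confirmed label is inside its prediction set
    covered = np.array([int(y) in set(s)
                        for y, s in zip(y_true, prediction_sets)])
    sizes = np.array([len(s) for s in prediction_sets])
    coverage = float(covered.mean())
    return {
        'empirical_coverage': coverage,
        'mean_set_size': float(sizes.mean()),
        'below_target': coverage < target_coverage - tolerance,
    }

# Four confirmed cases with their stored prediction sets
report = coverage_report(
    y_true=[0, 1, 1, 0],
    prediction_sets=[[0], [0, 1], [0], [0]],
    target_coverage=0.90,
)
```

Empirical coverage fluctuates on small windows, so the tolerance should be chosen with the window size in mind. A rising mean set size is worth watching too, since it indicates the model is becoming less decisive even when coverage holds.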

What to Include in Your Regulatory Submission

Component	What to document
Calibration method	Temperature scaling or Platt scaling, with ECE before and after
Conformal prediction	Coverage guarantee, calibration set size and composition, empirical coverage on validation set
Uncertainty method	MC dropout or ensemble, number of passes/members, mutual information thresholds
Novelty detection	OOD detection approach, threshold derivation, false positive/negative rates on known OOD samples
Decision zones	Threshold values, optimization methodology, clinical rationale for zone boundaries
Performance metrics	Sensitivity, specificity, PPV, NPV per decision zone, indeterminate rate

Method Comparison

Method	Impact on accuracy	Computational cost	Regulatory acceptance	Implementation difficulty	Outputs
Temperature scaling	None (preserves ranking)	Negligible	High - well-understood	Low - single parameter	Calibrated probabilities
Platt scaling	None (preserves ranking)	Negligible	High - well-understood	Low - logistic regression	Calibrated probabilities
Conformal prediction	None (post-hoc)	Negligible at inference	Very high - formal guarantees	Medium - requires calibration set	Prediction sets with coverage
MC dropout	Slight decrease (stochastic)	30-50x single pass	Medium - growing acceptance	Medium - modify inference loop	Mean/variance, entropy, MI
Deep ensembles	Slight increase (averaging)	5x training, 5x inference	Medium - growing acceptance	High - train multiple models	Mean/variance, agreement
Mahalanobis OOD	None (upstream gate)	Negligible	High - interpretable	Medium - fit distribution	Distance score, in/out flag
Autoencoder OOD	None (upstream gate)	Train autoencoder + negligible inference	Medium - less interpretable	High - separate model	Reconstruction error, distance

For a production spectral classification system, the recommended combination is: temperature scaling + conformal prediction + MC dropout + Mahalanobis OOD detection. This provides calibrated probabilities, guaranteed-coverage prediction sets, continuous uncertainty estimates, and novelty detection - addressing the clinical and regulatory expectations discussed above at manageable computational cost.

Deep ensembles are the gold standard for uncertainty quality but are usually reserved for offline validation or high-stakes applications where the 5x cost is justified.

Putting It All Together

The complete inference pipeline for a clinical spectral classifier with uncertainty quantification:

Incoming Spectrum
    │
    ├── 1. Preprocessing (baseline, normalization, region selection)
    │
    ├── 2. Novelty Detection (Mahalanobis in feature space)
    │       → REJECT if OOD
    │
    ├── 3. MC Dropout Inference (50 forward passes)
    │       → Mean probabilities, epistemic uncertainty
    │
    ├── 4. Temperature Scaling (calibrate mean probabilities)
    │
    ├── 5. Conformal Prediction (generate prediction set)
    │
    ├── 6. Clinical Decision Logic
    │       → Combine confidence, uncertainty, set size
    │       → Assign to POSITIVE / NEGATIVE / INDETERMINATE
    │
    └── 7. Report Generation
            → Classification, confidence, prediction set,
              uncertainty flag, audit trail
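
Step 4 applies temperature scaling to probabilities that MC dropout has already averaged, whereas temperature is normally applied to logits. One possible reading, shown here as an assumption rather than the pipeline's confirmed implementation, is to treat the log of the mean probabilities as surrogate logits. For a single softmax output this is exactly equivalent to scaling the logits; for an averaged output it is an approximation. The function name temperature_scale_probs is hypothetical.

```python
import numpy as np

def temperature_scale_probs(probs, temperature, eps=1e-12):
    """Rescale probabilities as softmax(log(p) / T), row by row."""
    # log(p) equals the logits up to a per-row constant, which softmax ignores
    log_p = np.log(np.clip(probs, eps, 1.0)) / temperature
    log_p = log_p - log_p.max(axis=-1, keepdims=True)  # numerical stability
    scaled = np.exp(log_p)
    return scaled / scaled.sum(axis=-1, keepdims=True)

# Example: soften an overconfident two-class output with T = 2
mean_probs = np.array([[0.88, 0.12]])
calibrated = temperature_scale_probs(mean_probs, temperature=2.0)
```

An alternative is to apply the temperature inside each stochastic forward pass and average afterwards. The two orderings generally give different results, so whichever is used at calibration time must also be used at inference.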

Every component is independently testable and independently validatable. The novelty detector can be validated with known OOD spectra. The calibration can be assessed with reliability diagrams. The conformal predictor can be validated with held-out coverage experiments. The decision thresholds can be optimized with clinical performance simulations. This modularity is not just good engineering - it is a regulatory requirement under IEC 62304, which mandates that software components are independently verified and validated. The SpectraDx platform implements this full confidence scoring pipeline out of the box, including calibrated probabilities, conformal prediction sets, and novelty detection.

For the data infrastructure that feeds this pipeline, see spectral data pipeline architecture. The explainability layer, which explains why the model made a given prediction, is a natural companion to confidence scoring, which explains how certain the prediction is. And for the data format layer that ensures spectra arrive in a consistent, parseable format regardless of instrument vendor, see spectral data formats.


SpectraDx builds clinical workflow software for spectroscopy-based diagnostics.
