A recent study published in the Proceedings of Machine Learning Research investigated methods for calibrating probabilistic predictions from machine learning models. The ability to produce well-calibrated probability estimates is crucial for enabling trust in predictive models, especially for high-stakes applications like medical diagnosis or autonomous vehicles.
The study evaluated several standard machine learning algorithms available in the popular Python library scikit-learn, including decision trees, AdaBoost, gradient boosting, k-nearest neighbors (kNN), logistic regression, naive Bayes, and random forests. The algorithms were tested on 22 public benchmark datasets spanning domains such as healthcare, finance, and engineering.
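For concreteness, the sketch below shows how this set of classifiers might be instantiated in scikit-learn. It assumes default hyperparameters; the exact settings used in the study are not reproduced here.

```python
# Sketch of the scikit-learn classifiers evaluated in the study
# (default hyperparameters assumed; the study's exact settings may differ).
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "gradient boosting": GradientBoostingClassifier(),
    "kNN": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(),
}
```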
The key finding was that most algorithms were poorly calibrated “out-of-the-box”: their predicted probabilities did not match the true underlying probabilities. Decision trees, for instance, were overconfident by more than 8 percentage points on average, meaning their predicted probabilities exceeded their actual accuracy by that margin. The only exception was logistic regression, whose probabilities were well calibrated without any post-processing.
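As an illustration, here is a minimal sketch of how this kind of miscalibration can be inspected with scikit-learn's calibration_curve, using a synthetic dataset rather than the study's 22 benchmarks.

```python
# Minimal sketch: inspect calibration of a decision tree on synthetic data
# (make_classification stands in for the study's benchmark datasets).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# A well-calibrated model has the observed fraction of positives close to the
# mean predicted probability in every bin; large gaps indicate miscalibration.
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f}  observed {f:.2f}")
```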
The study then showed that applying calibration techniques such as Platt scaling, isotonic regression, and Venn-Abers significantly improved the probability estimates for every algorithm except logistic regression. These post-processing methods fit a secondary model, typically on held-out data, that maps the classifier’s raw scores to better probability estimates. Overall, Venn-Abers and Platt scaling worked best.
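As a rough sketch (not the study's exact protocol), Platt scaling and isotonic regression are both available through scikit-learn's CalibratedClassifierCV; Venn-Abers is not part of scikit-learn and is omitted here. The Brier score is used below as one common measure of probability quality.

```python
# Sketch of post-hoc calibration with scikit-learn's CalibratedClassifierCV.
# method="sigmoid" corresponds to Platt scaling, method="isotonic" to isotonic
# regression; Venn-Abers requires a separate implementation and is not shown.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_train, y_train)

calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=5
).fit(X_train, y_train)

# The Brier score (lower is better) rewards well-calibrated probabilities.
print("raw       ", brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1]))
print("calibrated", brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```

Switching method="sigmoid" to method="isotonic" applies isotonic regression instead, which the study found also works well when enough data is available.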
This suggests that practitioners using scikit-learn should calibrate their models’ probabilities as a matter of course, since doing so adds little computational cost. The study focused on binary classification, but calibration should help for multiclass problems too. An interesting direction for future work is calibrating modern neural networks, which are also often miscalibrated.
Properly calibrated models could increase adoption of machine learning where trust in predictions is critical. For example, a doctor would be more inclined to rely on a calibrated model predicting a patient’s risk of heart disease. The uncertainty information provided by calibrated probabilities is indispensable for many real-world applications. This study demonstrates effective calibration is attainable using simple methods available in standard libraries.