TabPFN
| TabPFN | |
| --- | --- |
| Developer(s) | Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, Frank Hutter, Leo Grinsztajn, Klemens Flöge, Oscar Key & Sauraj Gambhir[1] |
| Initial release | September 16, 2023[2][3] |
| Written in | Python[3] |
| Operating system | Linux, macOS, Microsoft Windows[3] |
| Type | Machine learning |
| License | Apache License 2.0 |
| Website | GitHub |
TabPFN (Tabular Prior-data Fitted Network) is a machine learning model that uses a transformer architecture for supervised classification and regression tasks on small to medium-sized tabular datasets, typically up to 10,000 samples.[1] Unlike traditional models requiring extensive tuning, TabPFN is pre-trained on synthetic datasets, allowing it to predict outcomes on new tabular data in seconds without dataset-specific adjustments.[4] Developed by researchers now affiliated with Prior Labs, TabPFN was detailed in a 2025 Nature article.[1]
Overview
TabPFN addresses challenges in modeling tabular data, where traditional methods like Gradient-Boosted Decision Trees (e.g., XGBoost, CatBoost) require time-intensive tuning and may struggle with small datasets.[5][6] Large Language Models, effective for text, are less suited for tabular data’s structured format.[1] TabPFN uses a transformer pre-trained on synthetic tabular datasets to provide fast, accurate predictions.[2]
Technical details
TabPFN employs a transformer-based architecture and the Prior-Data Fitted Network (PFN) approach.[7] It is pre-trained once on around 130 million synthetic datasets generated using Structural Causal Models or Bayesian Neural Networks, simulating real-world data characteristics like missing values or noise.[1] This enables TabPFN to process new datasets in a single forward pass, adapting to the input without retraining.[2]
The model’s transformer encoder processes features and labels by alternating attention across rows and columns, capturing relationships within the data.[8] TabPFN v2, an updated version, handles numerical and categorical features, missing values, and supports tasks like regression and synthetic data generation.[1]
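The alternating-attention scheme can be illustrated with a minimal sketch. This is a conceptual simplification, not TabPFN's actual implementation: it uses a single attention head, identity query/key/value projections, and toy dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, axis):
    """Single-head self-attention along one axis of a (rows, cols, d) tensor.
    Projection matrices are omitted (identity Q, K, V) to keep the sketch minimal."""
    x = np.moveaxis(x, axis, -2)                      # bring the attended axis next to d
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x                # weighted mix along the chosen axis
    return np.moveaxis(out, -2, axis)

rng = np.random.default_rng(0)
table = rng.normal(size=(8, 4, 16))                   # 8 samples, 4 features, 16-dim embeddings

h = attend(table, axis=1)                             # attention across features, within each row
h = attend(h, axis=0)                                 # attention across samples, within each column
print(h.shape)                                        # (8, 4, 16)
```

Alternating the two attention directions lets every cell aggregate information both from the other features of its row and from the same feature in other rows, which is how relationships across the table are captured.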
Key features
- Speed: Delivers predictions in seconds without hyperparameter tuning.[2]
- Data Efficiency: Performs well on small datasets (up to 1,000 samples for v1, 10,000 for v2).[1]
- Versatility: Handles classification, regression, and generative tasks; supports missing values and outliers.[1]
- Time Series: An extension, TabPFN-TS, supports time series forecasting.[9]
Model training
TabPFN's pre-training exclusively uses synthetically generated datasets, avoiding benchmark contamination and the costs of curating real-world data.[2] TabPFN v2 was pre-trained on approximately 130 million such datasets, each serving as a "meta-datapoint".[1]
The synthetic datasets are primarily drawn from a prior distribution embodying causal reasoning principles, using Structural Causal Models (SCMs) or Bayesian Neural Networks (BNNs). Random inputs are passed through these models to generate outputs, with a bias towards simpler causal structures. The process generates diverse datasets that simulate real-world imperfections like missing values, imbalanced data and noise. During pre-training, TabPFN predicts the masked target values of new data points given training data points and their known targets, effectively learning a generic learning algorithm that is executed by running a neural network forward pass.[1]
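As a rough illustration of this kind of prior, one synthetic dataset can be sampled from a random structural causal model: draw a sparse random DAG, propagate noise through nonlinear node functions, and designate one node as the target. The DAG sparsity, tanh nonlinearity, noise scale, and missingness rate below are arbitrary choices for the sketch, not TabPFN's actual generator.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_scm_dataset(n_samples=256, n_nodes=6, missing_rate=0.05):
    """Draw one synthetic tabular dataset from a random structural causal model:
    each node is a nonlinear function of its parents plus noise."""
    # Random DAG: node j may depend only on earlier nodes i < j (lower-triangular weights).
    weights = np.tril(rng.normal(size=(n_nodes, n_nodes)), k=-1)
    weights *= rng.random((n_nodes, n_nodes)) < 0.5     # sparsify: bias toward simple structures
    data = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):
        parents = data @ weights[j]                     # contribution of parent nodes
        data[:, j] = np.tanh(parents) + 0.1 * rng.normal(size=n_samples)
    target_col = rng.integers(n_nodes)                  # one node becomes the label
    y = (data[:, target_col] > np.median(data[:, target_col])).astype(int)
    X = np.delete(data, target_col, axis=1)
    X[rng.random(X.shape) < missing_rate] = np.nan      # simulate missing values
    return X, y

X, y = sample_scm_dataset()
print(X.shape)    # (256, 5)
```

During pre-training, millions of such datasets would each be split into context points (features and labels shown to the model) and query points (labels masked), and the network trained to predict the masked labels in one forward pass.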
Performance
TabPFN v2 often outperforms tuned tree-based models like XGBoost or CatBoost on small tabular datasets in terms of accuracy (e.g., ROC AUC) and speed.[2] It can match CatBoost’s accuracy with half the training data.[10] However, for larger datasets, traditional models may be more efficient.[8]
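ROC AUC, the metric cited above, equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch of that rank-based definition (pairwise and O(n²); library implementations use a sorting-based method instead):

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic: the fraction of
    positive/negative pairs ranked correctly, counting ties as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # tied scores count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes, which is why ROC AUC is a common headline metric for tabular classifiers.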
Limitations
- Scalability: Best suited for small to medium datasets due to transformer complexity.[8]
- Class Limits: TabPFN v1 has constraints on multi-class classification; v2 improves this.[2]
- Ongoing Research: The inner workings of TabPFN are not yet fully understood, and extensions remain an active area of research.[11]
Applications and use cases
TabPFN and its variants have been demonstrated or proposed in a range of applications:
- Time Series Forecasting: the TabPFN-TS extension[9] supports applications such as financial forecasting and demand planning.
- Chemoproteomics[12]
- Health insurance classification[13]
- Fault classification in machinery[14]
- Early detection of stillbirth[15]
- Prostate cancer diagnosis[16]
- Outcomes after anterior cervical corpectomy[17]
- Predicting River Algal Blooms[18]
- Metagenomics[19]
- Immunotherapy predictions for patients with cancer[20]
- Predicting dementia in Parkinson's disease[21]
- Pricing models in actuarial science[22]
- Diagnostic prediction of minimal change disease[23]
- Predicting wildfire propagation[24]
- Glucose monitoring[10]
- Classifying lunar meteorite minerals[25]
- Prognosis of distal medium vessel occlusion[26]
History
TabPFN builds on Prior-Data Fitted Networks research.[7] Introduced in a 2022 pre-print and presented at ICLR 2023, TabPFN v1 focused on small tabular classification.[2] TabPFN v2, published in Nature in 2025, expanded its capabilities.[1] Prior Labs, founded in 2024 by key contributors, aims to commercialize the model.[11]
References
1. Hollmann, N.; Müller, S.; Purucker, L.; et al. (2025). "Accurate predictions on small data with a tabular foundation model". Nature. 637 (8045): 319–326. Bibcode:2025Natur.637..319H. doi:10.1038/s41586-024-08328-6. PMC 11711098. PMID 39780007.
2. Hollmann, Noah; et al. (2023). "TabPFN: A transformer that solves small tabular classification problems in a second". International Conference on Learning Representations (ICLR).
3. "tabpfn". Python Package Index (PyPI). https://pypi.org/project/tabpfn/
4. McElfresh, Duncan C. (1 January 2025). "The AI tool that can interpret any spreadsheet instantly". Nature.
5. Shwartz-Ziv, Ravid; Armon, Amitai (2022). "Tabular data: Deep learning is not all you need". Information Fusion. 81: 84–90. arXiv:2106.03253. doi:10.1016/j.inffus.2021.11.011.
6. Grinsztajn, Léo; Oyallon, Edouard; Varoquaux, Gaël (2022). "Why do tree-based models still outperform deep learning on typical tabular data?". Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). pp. 507–520.
7. Müller, Samuel; et al. (2022). "Transformers can do Bayesian inference". International Conference on Learning Representations (ICLR).
8. McElfresh, Duncan C. (8 January 2025). "The AI tool that can interpret any spreadsheet instantly". Nature. 637 (8045): 274–275. Bibcode:2025Natur.637..274M. doi:10.1038/d41586-024-03852-x. PMID 39780000.
9. "TabPFN Time Series". GitHub.
10. Bender, C.; Vestergaard, P.; Cichosz, S. L. (2025). "The History, Evolution and Future of Continuous Glucose Monitoring (CGM)". Diabetology. 6 (3): 17. doi:10.3390/diabetology6030017.
11. Kahn, Jeremy (5 February 2025). "AI has struggled to analyze tables and spreadsheets. This German startup thinks its breakthrough is about to change that". Fortune.
12. Offensperger, Fabian; et al. "Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells". Science. 384. https://www.science.org/doi/abs/10.1126/science.adk5864
13. Chu, J. Z. K.; Than, J. C. M.; Jo, H. S. (2024). "Deep Learning for Cross-Selling Health Insurance Classification". 2024 International Conference on Green Energy, Computing and Sustainable Technology (GECOST), Miri Sarawak, Malaysia. https://ieeexplore.ieee.org/document/10475046
14. Magadán, L.; Roldán-Gómez, J.; Granda, J. C.; Suárez, F. J. (2023). "Early Fault Classification in Rotating Machinery With Limited Data Using TabPFN". IEEE Sensors Journal. 23 (24): 30960–30970. https://ieeexplore.ieee.org/abstract/document/10318062
15. Alzakari, Sarah A.; Aldrees, Asma; Umer, Muhammad; Cascone, Lucia; Innab, Nisreen; Ashraf, Imran (2024). "Artificial intelligence-driven predictive framework for early detection of still birth". SLAS Technology. 29 (6): 100203. doi:10.1016/j.slast.2024.100203.
16. El-Melegy, M.; Mamdouh, A.; Ali, S.; Badawy, M.; El-Ghar, M. A.; Alghamdi, N. S.; El-Baz, A. (2024). "Prostate Cancer Diagnosis via Visual Representation of Tabular Data and Deep Transfer Learning". Bioengineering. 11 (7): 635. doi:10.3390/bioengineering11070635.
17. Karabacak, M.; Schupper, A.; Carr, M.; Margetis, K. (2024). "A machine learning-based approach for individualized prediction of short-term outcomes after anterior cervical corpectomy". Asian Spine Journal. 18 (4): 541–549. doi:10.31616/asj.2024.0048. PMC 11366553. PMID 39113482.
18. Yang, H.; Park, J. (2024). "Comparing the Performance of a Deep Learning Model (TabPFN) for Predicting River Algal Blooms with Varying Data Composition". Journal of Wetlands Research. 26 (3): 197–203. doi:10.17663/JWR.2024.26.3.197.
19. Perciballi, G.; Granese, F.; Fall, A.; Zehraoui, F.; Prifti, E.; Zucker, J.-D. (2024). "Adapting TabPFN for Zero-Inflated Metagenomic Data". NeurIPS 2024 Workshop TRL. https://openreview.net/pdf?id=3I0bVvUj25
20. Dyikanov, D.; Zaitsev, A.; Vasileva, T.; Luginbuhl, A. J.; Ataullakhanov, R. I.; Goldberg, M. F. (2024). "Comprehensive peripheral blood immunoprofiling reveals five immunotypes with immunotherapy response characteristics in patients with cancer". Cancer Cell. 42 (5): 759–779.e12. doi:10.1016/j.ccell.2024.04.009.
21. Tran, V. Q.; Byeon, H. (2024). "Predicting dementia in Parkinson's disease on a small tabular dataset using hybrid LightGBM–TabPFN and SHAP". Digital Health. 10. https://journals.sagepub.com/doi/full/10.1177/20552076241272585
22. Brauer, A. (2024). "Enhancing actuarial non-life pricing models via transformers". European Actuarial Journal. 14: 991–1012. doi:10.1007/s13385-024-00388-2.
23. Noda, R.; Ichikawa, D.; Shibagaki, Y. (2024). "Machine learning-based diagnostic prediction of minimal change disease: model development study". Scientific Reports. 14: 23460. doi:10.1038/s41598-024-73898-4.
24. Khanmohammadi, Sadegh; Cruz, Miguel G.; Perrakis, Daniel D. B.; Alexander, Martin E.; Arashpour, Mehrdad (2024). "Using AutoML and generative AI to predict the type of wildfire propagation in Canadian conifer forests". Ecological Informatics. 82: 102711. doi:10.1016/j.ecoinf.2024.102711.
25. Peña-Asensio, Eloy; Trigo-Rodríguez, Josep M.; Sort, Jordi; Ibáñez-Insa, Jordi; Rimola, Albert (2024). "Machine learning applications on lunar meteorite minerals: From classification to mechanical properties prediction". International Journal of Mining Science and Technology. 34 (9): 1283–1292. doi:10.1016/j.ijmst.2024.08.001.
26. Karabacak, Mert; Ozkara, Burak Berksu; Faizy, Tobias D.; et al. (2024). "Data-Driven Prognostication in Distal Medium Vessel Occlusions Using Explainable Machine Learning". American Journal of Neuroradiology. ajnr.A8547. https://www.ajnr.org/content/early/2024/10/28/ajnr.A8547