scikit-learn

From Wikipedia, the free encyclopedia
Original author(s): David Cournapeau
Developer(s): Google Summer of Code project
Initial release: June 2007
Stable release: 1.7.0[1] / 6 June 2025
Repository
Written in: Python, Cython, C and C++[2]
Operating system: Linux, macOS, Windows
Type: Library for machine learning
License: New BSD License
Website: scikit-learn.org

scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language.[3] It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.[4]

Overview

The scikit-learn project started as scikits.learn, a Google Summer of Code project by French data scientist David Cournapeau. The name of the project derives from its role as a "scientific toolkit for machine learning", originally developed and distributed as a third-party extension to SciPy.[5] The original codebase was later rewritten by other developers. In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the French Institute for Research in Computer Science and Automation (Inria) in Saclay, France, took leadership of the project and released the first public version of the library on February 1, 2010.[6] In November 2012, scikit-learn and scikit-image were described as two of the "well-maintained and popular" scikits libraries.[7] In 2019, scikit-learn was noted as one of the most popular machine learning libraries on GitHub.[8]

Features

  • Large catalogue of well-established machine learning algorithms and data pre-processing methods (such as those used for feature engineering)
  • Utility methods for common data-science tasks, such as splitting data into train and test sets, cross-validation and grid search
  • Consistent interface for training and applying machine learning models (estimator.fit() and estimator.predict()), which third-party libraries can also implement
  • Declarative way of structuring a data science process (the Pipeline), including data pre-processing and model fitting
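
The features listed above can be combined in a few lines. The following is a minimal sketch, not an official recipe: the choice of the built-in iris dataset, the step names "scale" and "model", and the candidate values for C are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset (150 samples, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)

# Split the data into train and test sets, holding out a quarter for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Declare pre-processing and model fitting as a single estimator.
# The step names "scale" and "model" are arbitrary labels chosen here.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation of the whole pipeline on the training data.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Grid search over the regularisation strength C of the "model" step.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Accuracy of the best pipeline found, measured on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(cv_scores.mean(), search.best_params_, test_accuracy)
```

Because the scaler is part of the pipeline, it is refitted on each training fold, so no information from the validation folds leaks into pre-processing.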

Examples

Fitting a random forest classifier:

>>> from sklearn.ensemble import RandomForestClassifier
>>> classifier = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> classifier.fit(X, y)
RandomForestClassifier(random_state=0)
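
Once fitted, the same estimator can be used to predict classes for new data. The sketch below repeats the fit from the example above so that it runs on its own; the two new samples are made-up values chosen for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy data as above: 2 samples, 3 features, one class per sample.
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]
classifier = RandomForestClassifier(random_state=0).fit(X, y)

# Predict classes of the training data and of new, unseen samples.
train_pred = classifier.predict(X)
new_pred = classifier.predict([[4, 5, 6], [14, 15, 16]])

# Class probabilities: one row per sample, one column per class.
proba = classifier.predict_proba([[4, 5, 6]])
print(train_pred, new_pred, proba)
```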

Implementation

scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in Cython to improve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR. In such cases, extending these methods with Python may not be possible.
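
From the user's side, the wrapped libraries are reached through ordinary estimator classes. The following sketch assumes a synthetic dataset from make_classification and default hyperparameters. Note that LogisticRegression only uses LIBLINEAR when solver="liblinear" is requested explicitly, since its default solver is a different one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Synthetic binary classification problem (illustrative data only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

estimators = {
    # Kernel support vector machine, backed by LIBSVM.
    "SVC": SVC(kernel="rbf"),
    # Linear support vector machine, backed by LIBLINEAR.
    "LinearSVC": LinearSVC(),
    # Logistic regression with the LIBLINEAR solver selected explicitly.
    "LogisticRegression": LogisticRegression(solver="liblinear"),
}

# All three expose the same fit/score interface despite the compiled back ends.
scores = {name: est.fit(X, y).score(X, y) for name, est in estimators.items()}
print(scores)
```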

scikit-learn integrates well with many other Python libraries, such as Matplotlib and Plotly for plotting, NumPy for array vectorization, pandas DataFrames, SciPy, and many more.
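
As an illustration of this interoperability, many estimators accept both NumPy arrays and SciPy sparse matrices as input. The sketch below is limited to NumPy and SciPy; the random data, the coefficient vector and the choice of Ridge regression are assumptions made for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Random dense data as a NumPy array, with a known linear target.
rng = np.random.default_rng(0)
X_dense = rng.normal(size=(50, 4))
y = X_dense @ np.array([1.0, -2.0, 0.5, 0.0])

# The same data stored as a SciPy compressed sparse row matrix.
X_sparse = sparse.csr_matrix(X_dense)

# The estimator is fitted the same way on either container.
dense_model = Ridge(alpha=1.0).fit(X_dense, y)
sparse_model = Ridge(alpha=1.0).fit(X_sparse, y)

# Predictions come back as NumPy arrays in both cases.
dense_pred = dense_model.predict(X_dense)
sparse_pred = sparse_model.predict(X_sparse)
print(dense_model.coef_, sparse_model.coef_)
```

The two models may use different internal solvers, so their coefficients should be close but are not guaranteed to be bit-for-bit identical.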

History

scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as part of his thesis work. In 2010, INRIA, the French Institute for Research in Computer Science and Automation, became involved, and the first public release (v0.1 beta) was published in late January 2010.

Applications

Scikit-learn is widely used across industries for a variety of machine learning tasks such as classification, regression, clustering, and model selection. The following are real-world applications of the library:

Finance and Insurance

  • AXA uses scikit-learn to speed up the compensation process for car accidents and to detect insurance fraud.[9]
  • Zopa, a peer-to-peer lending platform, employs scikit-learn for credit risk modelling, fraud detection, marketing segmentation, and loan pricing.[9]
  • BNP Paribas Cardif uses scikit-learn to improve the dispatching of incoming mail and manage internal model risk governance through pipelines that reduce operational and overfitting risks.[9]
  • J.P. Morgan reports broad usage of scikit-learn across the bank for classification tasks and predictive analytics in financial decision-making.[9]

Retail and E-Commerce

  • Booking.com uses scikit-learn for hotel and destination recommendation systems, fraudulent reservation detection, and workforce scheduling for customer support agents.[9]
  • HowAboutWe uses it to predict user engagement and preferences on a dating platform.[9]
  • Lovely leverages the library to understand user behaviour and detect fraudulent activity on its platform.[9]
  • Data Publica uses it for customer segmentation based on the success of past partnerships.[9]
  • Otto Group integrates scikit-learn throughout its data science stack, particularly in logistics optimization and product recommendations.[9]

Media, Marketing, and Social Platforms

  • Spotify applies scikit-learn in its recommendation systems.[9]
  • Betaworks uses the library for both recommendation systems (e.g., for Digg) and dynamic subspace clustering applied to weather forecasting data.[9]
  • PeerIndex used scikit-learn for missing data imputation, tweet classification, and community clustering in social media analytics.[9]
  • Bestofmedia Group employs it for spam detection and ad click prediction.[9]
  • Machinalis utilizes scikit-learn for click-through rate prediction and relational information extraction for content classification and advertising optimization.[9]
  • Change.org applies scikit-learn for targeted email outreach based on user behaviour.[9]

Technology

  • AWeber uses scikit-learn to extract features from emails and build pipelines for managing large-scale email campaigns.[9]
  • Solido applies it to semiconductor design tasks such as rare-event estimation and worst-case verification using statistical learning.[9]
  • Evernote, Dataiku, and other tech companies employ scikit-learn in prototyping and production workflows due to its consistent API and integration with the Python ecosystem.[9]

Academia

  • Télécom ParisTech integrates scikit-learn in hands-on coursework and assignments as part of its machine learning curriculum.[9]

Awards

  • 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize[10]
  • 2022 Open Science Award for Open Source Research Software[11]

References

  1. ^ "Release 1.7.0". 6 June 2025. Retrieved 16 June 2025.
  2. ^ "The scikit-learn Open Source Project on Open Hub: Languages Page". Open Hub. Retrieved 14 July 2018.
  3. ^ Fabian Pedregosa; Gaël Varoquaux; Alexandre Gramfort; Vincent Michel; Bertrand Thirion; Olivier Grisel; Mathieu Blondel; Peter Prettenhofer; Ron Weiss; Vincent Dubourg; Jake Vanderplas; Alexandre Passos; David Cournapeau; Matthieu Perrot; Édouard Duchesnay (2011). "scikit-learn: Machine Learning in Python". Journal of Machine Learning Research. 12: 2825–2830. arXiv:1201.0490. Bibcode:2011JMLR...12.2825P.
  4. ^ "NumFOCUS Sponsored Projects". NumFOCUS. Retrieved 2021-10-25.
  5. ^ Dreijer, Janto. "scikit-learn".
  6. ^ "About us — scikit-learn 0.20.1 documentation". scikit-learn.org.
  7. ^ Eli Bressert (2012). SciPy and NumPy: an overview for developers. O'Reilly. p. 43. ISBN 978-1-4493-6162-4.
  8. ^ "The State of the Octoverse: machine learning". The GitHub Blog. GitHub. 2019-01-24. Retrieved 2019-10-17.
  9. ^ "Testimonials". scikit-learn.org. Retrieved 2025-08-06.
  10. ^ "The 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize : scikit-learn , a success story for machine learning free software | Inria". www.inria.fr. Retrieved 2025-03-19.
  11. ^ Badolato, Anne-Marie (2022-02-07). "Open Science Awards for Open Source Research Software". Ouvrir la Science. Retrieved 2025-03-19.