Repositories of data used to test/validate machine learning algorithms.
Sites 14
20 Newsgroups for text categorization. Widely used dataset.
A random sample of 10,000 companies worldwide, drawn from aiHit. All data in this database is extracted and updated automatically from the web using AI and machine learning.
ArrayExpress is a database of functional genomics experiments that can be queried and downloaded. It includes gene expression data from microarray and high-throughput sequencing studies.
Datgen is a computer program that generates data to systematically test programs that consume data. These synthetic datasets can be used to validate learning algorithms.
A standardized environment designed to evaluate the performance of methods that learn relationships based primarily on empirical data. Delve makes it possible for users to compare their learning methods with other methods on many datasets.
A dataset of face images for face recognition algorithms.
A collection of datasets, each represented in first-order logic. Maintained at the University of Dortmund, Germany.
Machine Learning and Data Mining - Datasets (USPS digits, faces, links to various datasets prepared for Matlab)
Provides access to a wide variety of astrophysics, space physics, solar physics, lunar and planetary data from NASA space flight missions, in addition to selected other data and some models and software.
This NIST database of fingerprint images contains 2000 8-bit grayscale fingerprint image pairs. NIST charges $90 plus $30 shipping for the data.
Archive of experimentally determined 3-D structures of biological macromolecules from the Brookhaven National Laboratory.
A classic benchmark for text categorization algorithms.
Text datasets used in information retrieval and learning in text domains.
Web pages partitioned into classes, with hyperlink data. The dataset has been used for text categorization and learning to extract symbolic knowledge from the World Wide Web.

Last update: October 30, 2023 at 5:15:15 UTC