25 Public Datasets for Machine Learning

Datasets are a critical part of machine learning. Any company of sufficient size will have unique domain-specific data from which it can create private datasets. The right data helps build effective AI models that can improve efficiency and productivity, reduce operational costs, enhance customer service, and ultimately create a competitive advantage. However, preparing the ideal datasets can be a time-consuming process. That’s where public datasets come in handy.

An AI strategy shouldn’t stall while the perfect datasets are being prepared. Public datasets provide organizations with data that can be used to build and test AI models. Thousands of public datasets are available, covering areas like weather, disease, Twitter, animals, facial recognition, aerial imagery, self-driving, object detection, banking, the stock market, and much more. Some of the well-known datasets, like ImageNet and MNIST, are well suited to proofs of concept (POCs) that test frameworks, algorithms, libraries, hardware (DIY), and more.

In the matrix below, the first six rows are collections of datasets. For example, Google hosts more than 100 datasets; some are public data developed by various government entities, and others were created by Google. The same applies to AWS, Microsoft, Kaggle, and so on. The global AI community is coming together and publishing datasets regularly, in the hope that companies will use them not only to further their business models but also to tackle problems like heart disease, diabetes, droughts, and poverty. At the same time, some datasets might raise concerns, such as those in the area of facial recognition. Below are twenty-five sources of publicly available datasets.

Name	Type	Creator	Size	Description
AWS	Collection	Many	Vary	Hosted collection of public datasets
Google	Collection	Many	Vary	Hosted collection of public datasets
Kaggle	Collection	Kaggle	Vary	Hosted collection of public datasets
Microsoft	Collection	Many	Vary	Hosted collection of public datasets
Notre Dame	Collection	Univ. of Notre Dame	Vary	Face and 3D Face datasets
VisualData.io	Collection	VisualData.io	Vary	Collection of computer vision datasets

ACS	Demographic	US Census	3.5M households	Detailed US demographics data
ApolloScape	Autonomous	Baidu	74k video frames, 147k depth images, and 147k scenes	Research project on autonomous driving
Berkeley DeepDrive	Self-driving	UC Berkeley	100k HD video sequences	Video dataset
Data USA	Government	Deloitte & others	36k locations & more	Visualize US issues like jobs, skills, etc.
Diabetes	Disease	UCI	100k instances	Data on outcomes pertaining to diabetes patients
El Nino Dataset	Weather	UCI	178k instances	Oceanographic and meteorological readings
FERET	Facial Recognition	DoD/NIST	14k images	Develop capabilities to help with security, intelligence, and law enforcement
HAR Dataset	Human Motion	UCI	43M	Human activity recognition - sitting, biking, standing, walking, etc.
Heart Disease	Disease	UCI	303 instances	Individual data - age, sex, cholesterol,...
ImageNet	Visual database	Stanford University	14M	Image database
MovieLens	Recommendation	GroupLens	20M ratings on 27k movies from 138k users	Movie Ratings
Million Song	Recommendation	Kaggle	1M music tracks	Music
Netflix Prize	Recommendation	Netflix	17k files	Movie Ratings
Open Images	Images	Google	9M images	Various images and their relationships
Overhead Imagery Research Data Set	Aerial	ORID	1k images	Overhead Imagery
SAT-4 Airborne Dataset	Aerial	ASU	330k scenes	Pics of different landscapes
Serre Lab	Human Motion	Brown University	7k clips	Human actions - smile, laugh, chew, talk, smoke, eat...
SIFT10M Data Set	Object Detection	UCI	11M instances	Nearest neighbor search algorithm method
SpaceNet	Satellite Imagery	SpaceNet	600k+ labeled buildings	Precision-labeled, high-resolution satellite images
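
With twenty-five entries, it can help to treat the matrix as data and filter it by type when shortlisting candidates for a POC. The following is a minimal sketch, assuming the matrix has been saved as tab-separated text with the header row shown above. The names `MATRIX_TSV`, `load_matrix`, and `names_by_type` are hypothetical helpers introduced for illustration, and only a few rows are copied in to keep the example short.

```python
import csv
import io

# A few rows copied from the matrix above, tab-separated as in the article.
# (Assumption: the full matrix would be stored the same way, e.g. in a .tsv file.)
MATRIX_TSV = (
    "Name\tType\tCreator\tSize\tDescription\n"
    "Kaggle\tCollection\tKaggle\tVary\tHosted collection of public datasets\n"
    "Diabetes\tDisease\tUCI\t100k instances\tData on outcomes pertaining to diabetes patients\n"
    "Heart Disease\tDisease\tUCI\t303 instances\tIndividual data - age, sex, cholesterol,...\n"
    "Netflix Prize\tRecommendation\tNetflix\t17k files\tMovie Ratings\n"
)


def load_matrix(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def names_by_type(rows, dataset_type):
    """Return the names of entries whose Type matches, ignoring case."""
    wanted = dataset_type.strip().lower()
    return [row["Name"] for row in rows if row["Type"].strip().lower() == wanted]


rows = load_matrix(MATRIX_TSV)
print(names_by_type(rows, "disease"))
```

The same filter works for any value in the Type column, such as "Recommendation" or "Collection".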