Datasets are a critical part of machine learning. Any company of sufficient size has unique, domain-specific data from which it can create private datasets. The right data helps build effective AI models that improve efficiency and productivity, reduce operational costs, enhance customer service, and ultimately create a competitive advantage. However, preparing the ideal datasets can be a time-consuming process. That’s where public datasets come in handy.
An AI strategy shouldn’t stall while the perfect datasets are being prepared. Public datasets give organizations data they can use to build and test AI models. Thousands of public datasets are available, in areas such as weather, disease, Twitter, animals, facial recognition, aerial imagery, self-driving, object detection, banking, the stock market, and much more. Well-known datasets such as ImageNet and MNIST are great for proofs of concept (POCs) that test frameworks, algorithms, libraries, hardware (DIY), and more.
In the matrix below, the first six rows are collections of datasets. For example, Google hosts more than 100 datasets; some are public datasets developed by various government entities, and others were created by Google. The same applies to AWS, Microsoft, Kaggle, and so on. The global AI community is coming together and publishing datasets regularly, in the hope that companies will use them not only to further their business models but also to tackle problems like heart disease, diabetes, droughts, and poverty. At the same time, some datasets might raise concerns, such as those in the area of facial recognition. Below are twenty-five sources of publicly available datasets.
Name | Type | Creator | Size | Description |
---|---|---|---|---|
AWS | Collection | Many | Vary | Hosted collection of public datasets |
Google | Collection | Many | Vary | Hosted collection of public datasets |
Kaggle | Collection | Kaggle | Vary | Hosted collection of public datasets |
Microsoft | Collection | Many | Vary | Hosted collection of public datasets |
Notre Dame | Collection | Univ. of Notre Dame | Vary | Face and 3D Face datasets |
VisualData.io | Collection | VisualData.io | Vary | Collection of computer vision datasets |
ACS | Demographic | US Census | 3.5M households | Detailed US demographics data |
ApolloScape | Autonomous | Baidu | 74k video frames, 147k depth images, and 147k scenes | Research project on autonomous driving |
Berkeley DeepDrive | Self-driving | UC Berkeley | 100k HD video sequences | Video dataset |
Data USA | Government | Deloitte & others | 36k locations & more | Visualize US issues like jobs, skills, etc. |
Diabetes | Disease | UCI | 100k instances | Data on outcomes pertaining to diabetes patients |
El Nino Dataset | Weather | UCI | 178k instances | Oceanographic and meteorological readings |
FERET | Facial Recognition | DoD/NIST | 14k images | Develop capabilities to help with security, intelligence, and law enforcement |
HAR Dataset | Human Motion | UCI | 43M | Human activity recognition - sitting, biking, standing, walking, etc. |
Heart Disease | Disease | UCI | 303 instances | Individual data - age, sex, cholesterol,... |
ImageNet | Visual database | Stanford University | 14M images | Image database |
MovieLens | Recommendation | GroupLens | 20M ratings on 27k movies from 138k users | Movie ratings |
Million Song | Recommendation | Kaggle | 1M music tracks | Music |
Netflix Prize | Recommendation | Netflix | 17k files | Movie Ratings |
Open Images | Images | Google | 9M images | Various images and their relationships |
Overhead Imagery Research Data Set | Aerial | ORID | 1k images | Overhead Imagery |
SAT-4 Airborne Dataset | Aerial | ASU | 330k scenes | Pics of different landscapes |
Serre Lab | Human Motion | Brown University | 7k clips | Human actions: smile, laugh, chew, talk, smoke, eat... |
SIFT10M Data Set | Object Detection | UCI | 11M instances | Nearest neighbor search algorithm method |
SpaceNet | Satellite Imagery | SpaceNet | 600k+ labeled buildings | Precision-labeled, high-resolution satellite images |
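
When shortlisting candidates for a POC, it can help to keep a catalog like the one above in a machine-readable form. The following is a minimal sketch, not a definitive tool: the `DatasetEntry` class, the `CATALOG` list, and the `find_by_type` helper are hypothetical names introduced here for illustration, and the entries are a handful of rows copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the table (hypothetical structure, for illustration only)."""
    name: str
    kind: str      # the "Type" column
    creator: str
    size: str


# A few rows copied from the table above; add the remaining rows as needed.
CATALOG = [
    DatasetEntry("ACS", "Demographic", "US Census", "3.5M households"),
    DatasetEntry("Diabetes", "Disease", "UCI", "100k instances"),
    DatasetEntry("El Nino Dataset", "Weather", "UCI", "178k instances"),
    DatasetEntry("Heart Disease", "Disease", "UCI", "303 instances"),
    DatasetEntry("SAT-4 Airborne Dataset", "Aerial", "ASU", "330k scenes"),
]


def find_by_type(catalog, dataset_type):
    """Return the entries whose type matches dataset_type, ignoring case."""
    wanted = dataset_type.strip().lower()
    return [entry for entry in catalog if entry.kind.lower() == wanted]


if __name__ == "__main__":
    # List every disease-related dataset in the sample catalog.
    for entry in find_by_type(CATALOG, "disease"):
        print(f"{entry.name} ({entry.creator}, {entry.size})")
```

A team exploring healthcare use cases, for instance, could call `find_by_type(CATALOG, "Disease")` to pull the relevant rows before deciding which dataset to download first.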