25 Public Datasets for Machine Learning

Datasets are a critical part of machine learning. Any company of sufficient size has unique, domain-specific data from which it can create private datasets. The right data helps build effective AI models that can improve efficiency and productivity, reduce operational cost, enhance customer service, and ultimately create a competitive advantage. However, preparing the ideal datasets can be a time-consuming process. That’s where public datasets come in handy.

An AI strategy shouldn’t stall while the perfect datasets are being prepared. Public datasets provide organizations with data that can be used to build and test AI models. Thousands of public datasets are available, covering areas such as weather, disease, Twitter, animals, facial recognition, aerial imagery, self-driving, object detection, banking, the stock market, and much more. Some well-known datasets, such as ImageNet and MNIST, are well suited to proofs of concept (POCs) that test frameworks, algorithms, libraries, hardware (DIY), and more.

In the table below, the first six rows are collections of datasets. For example, Google hosts more than 100 datasets; some are public datasets developed by various government entities, and others were created by Google. The same applies to AWS, Microsoft, Kaggle, and so on. The global AI community is coming together and publishing datasets regularly, in the hope that companies will use them not only to further their business models but also to tackle problems like heart disease, diabetes, droughts, and poverty. At the same time, some datasets might raise concerns, such as those in the area of facial recognition. Below are twenty-five sources of publicly available datasets.

Name | Type | Source | Size | Description
AWS | Collection | Many | Varies | Hosted collection of public datasets
Google | Collection | Many | Varies | Hosted collection of public datasets
Kaggle | Collection | Kaggle | Varies | Hosted collection of public datasets
Microsoft | Collection | Many | Varies | Hosted collection of public datasets
Notre Dame | Collection | Univ. of Notre Dame | Varies | Face and 3D face datasets
VisualData.io | Collection | VisualData.io | Varies | Collection of computer vision datasets
ACS | Demographic | US Census | 3.5M households | Detailed US demographics data
ApolloScape | Autonomous | Baidu | 74k video frames, 147k depth images, and 147k scenes | Research project on autonomous driving
Berkeley DeepDrive | Self-driving | UC Berkeley | 100k HD video sequences | Video dataset
Data USA | Government | Deloitte & others | 36k locations & more | Visualize US issues like jobs, skills, etc.
Diabetes | Disease | UCI | 100k instances | Data on outcomes pertaining to diabetes patients
El Nino Dataset | Weather | UCI | 178k instances | Oceanographic and meteorological readings
FERET | Facial Recognition | DoD/NIST | 14k images | Develop capabilities to help with security, intelligence, and law enforcement
HAR Dataset | Human Motion | UCI | 43M | Human activity recognition: sitting, biking, standing, walking, etc.
Heart Disease | Disease | UCI | 303 instances | Individual data: age, sex, cholesterol, ...
ImageNet | Visual database | Stanford University | 14M | Image database
MovieLens | Recommendation | GroupLens | 20M ratings on 27k movies from 138k users | Movie ratings
Million Song | Recommendation | Kaggle | 1M music tracks | Music
Netflix Prize | Recommendation | Netflix | 17k files | Movie ratings
Open Images | Images | Google | 9M images | Various images and their relationships
Overhead Imagery Research Data Set | Aerial | ORID | 1k images | Overhead imagery
SAT-4 Airborne Dataset | Aerial | ASU | 330k scenes | Pictures of different landscapes
Serre Lab | Human Motion | Brown University | 7k clips | Human actions: smile, laugh, chew, talk, smoke, eat, ...
SIFT10M Data Set | Object Detection | UCI | 11M instances | Nearest neighbor search algorithm method
SpaceNet | Satellite Imagery | SpaceNet | 600k+ labeled buildings | Precision-labeled, high-resolution satellite images
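
When shortlisting datasets for a POC, it can help to keep a catalog like the one above as structured data and filter it by type. The snippet below is a minimal sketch of that idea, not part of any dataset provider's tooling. The `Dataset` record, the `CATALOG` list, and the `find_by_kind` helper are hypothetical names introduced for illustration, and the entries are a handful of rows copied from the table.

```python
from typing import List, NamedTuple


class Dataset(NamedTuple):
    """One row of the catalog (hypothetical structure for illustration)."""
    name: str
    kind: str    # the "Type" column
    source: str
    size: str


# A few rows copied from the table above; extend as needed.
CATALOG: List[Dataset] = [
    Dataset("Diabetes", "Disease", "UCI", "100k instances"),
    Dataset("El Nino Dataset", "Weather", "UCI", "178k instances"),
    Dataset("Heart Disease", "Disease", "UCI", "303 instances"),
    Dataset("Million Song", "Recommendation", "Kaggle", "1M music tracks"),
    Dataset("Netflix Prize", "Recommendation", "Netflix", "17k files"),
    Dataset("SAT-4 Airborne Dataset", "Aerial", "ASU", "330k scenes"),
]


def find_by_kind(catalog: List[Dataset], wanted: str) -> List[Dataset]:
    """Return catalog entries whose type matches `wanted`, ignoring case."""
    key = wanted.strip().lower()
    return [d for d in catalog if d.kind.lower() == key]


if __name__ == "__main__":
    for d in find_by_kind(CATALOG, "disease"):
        print(f"{d.name} ({d.source}): {d.size}")
```

Matching on type ignores case and surrounding whitespace so that a query like "disease" finds rows labeled "Disease". The same pattern extends to filtering by source or any other column.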