Wednesday, May 23, 2018

Generic datasets for machine learning

Simple & Generic datasets to get you started

  • data.gov – This is the home of the U.S. Government’s open data. The site contains more than 190,000 data points at time of publishing. These datasets vary from data about climate, education, energy, Finance and many more areas.
screen-shot-2016-11-24-at-6-07-19-am
  • data.gov.inscreen-shot-2016-11-24-at-6-16-38-am – This is the home of the Indian Government’s open data. Find data by various industries, climate, health care etc. You can check out a few visualizations for inspiration here. Depending on your country of residence, you can also follow similar websites from a few other websites – check them out.

  • World Bank – The open data from the World bank. The platform provides several tools like Open Data Catalog, world development indices, education indices etc.

  • RBI – Data available from the Reserve Bank of India. This includes several metrics on money market operations, balance of payments, use of banking and several products. A must go to site, if you come from BFSI domain in India.

  • Five Thirty Eight Datasets – Here is a link to datasets used by Five Thirty Eight in their stories. Each dataset includes the data, a dictionary explaining the data and the link to the story carried out by Five Thirty Eight. If you want to learn how to create data stories, it can’t get better than this.

Huge Datasets – things are getting serious now!

  • Amazon Web Services (AWS) datasets – awsAmazon provides a few big datasets, which can be used on their platform or on your local computers. You can also analyze the data in the cloud using EC2and Hadoop via EMR. Popular datasets on Amazon include full Enron email dataset, Google Books n-grams, NASA NEX datasets, Million Songs dataset and many more. More information can be found here.

  • Google datasets – Google provides a few datasets as part of its Big Query tool. This includes baby names, data from GitHub public repositories, all stories & comments from Hacker News etc.

  • Youtube labeled Video Dataset
    screen-shot-2016-11-24-at-9-29-30-amA few months back, Google Research Group released YouTube labeled dataset, which consists of 8 million YouTube video IDs and associated labels from 4800 visual entities. It comes with pre-computed, state-of-the-art vision features from billions of frames.

Datasets for predictive modeling & machine learning:

  • UCI Machine Learning Repository – screen-shot-2016-11-24-at-9-41-58-amUCI Machine Learning Repository is clearly the most famous data repository. It is usually the first place to go, if you are looking for datasets related to machine learning repositories. The datasets include a diverse range of datasets from popular datasets like Iris and Titanic survival to recent contributions like that of Air Quality and GPS trajectories. The repository contains more than 350 datasets with labels like domain, purpose of the problem (Classification / Regression). You can use these filters to identify good datasets for your need.

  • Kaggle site-logoKaggle has come up with a platform, where people can donate datasets and other community members can vote and run Kernel / scripts on them. They have more than 350 datasets in total – with more than 200 as Featured datasets. While some of the initial datasets were usually present at other places, I have seen a few interesting datasets on the platform, not present at other places. Along with new datasets, another benefit of the interface is that you can see scripts and questions from community members on the same interface.

  • Analytics Vidhya logoYou can participate and download datasets from our practice problems and hackathon problems. The problem datasets are based on real-life industry problems and are relatively smaller as they are meant for 2 – 7 days hackathons. While practice problems are available to people always, the hackathon problems become unavailable after the hackathons. So, you need to participate on the hackathon to get access to the datasets.

  • Quandl screen-shot-2016-11-24-at-10-16-33-amQuandl provides financial, economic and alternative data from various sources through their website / API or direct integration with a few tools. Their datasets are classified as Open or Premium. You can access all the open datasets for Free, but you need to pay for the premium datasets. If you search, you still get good datasets on the platform. Eg. Stock Exchange data from India is available for free.

  • Past KDD Cups KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining. Archives includes datasets and instructions. Winners are available for most years.

  • Driven Data screen-shot-2016-11-24-at-10-24-19-amDriven Data finds real-world challenges where data science can be used to create a positive social impact. They then run online modeling competitions for data scientists to develop the best models to solve them. If you are interested in use of data science for social good – this is the place to be.

Image classification datasets

  • The MNIST Database – The most popular dataset for image recognition using hand-written digits. It includes 60,000 train examples and a test set of 10,000 examples. This serves as typically the first dataset to practice image recognition.

  • Chars74K – Here is the next level of evolution, if you have passed hand written digits. This dataset includes character recognition in natural images. The dataset contains 74,000 images and hence the name of the dataset.

  • Frontal Face Images If you have worked on previous 2 projects and are able to identify digits and characters, here is the next level of challenge in Image recognition – Frontal Face images. The images were collected by CMU & MIT and are arranged in four folders.

  • ImageNet Time to build something generic now. Image database organised according to the WordNet hierarchy (currently only the nouns). Each node of the hierarchy is depicted by hundreds of images. Currently, the collection has an average of over five hundred images per node (and increasing).

Text Classification datasets

  • Spam – Non Spam An interesting problem with 1324 SMSs (Span and non-spam). You need to build a classifier classifying the SMS as span or non-spam.


  • Movie Review Data This site provides collections of movie-review documents labeled on their overall sentiment polarity (positive or negative) or subjective rating (e.g., “two and a half stars”) and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity

Datasets for Recommendation Engine

  • MovieLens MovieLens is a web site that helps people find movies to watch. It has hundreds of thousands of registered users. They conduct online field experiments in MovieLens in the areas of automated content recommendation, recommendation interfaces, tagging-based recommenders. These datasets are available for download and can be used to create your own recommender systems.

  • Jester Datasets about online joke recommender system

Websites which Curate list of datasets from various sources:

  • KDNuggets – The dataset page on KDNuggets has long been a reference point for people looking for datasets out there. A really comprehensive list, however some of the sources no longer provide the datasets. So, you will need to apply your own prudence on the datasets and the sources.

  • Awesome Public Datasets A GitHub repository with a comprehensive list of datasets categorized by domain. Datasets are classified neatly in various domains, which is very helpful. However, there is no description about the datasets on the repository itself – which could have made it very useful.

  • Reddit Datasets Subreddit Since this is a community driven forum, it might come across a bit messy (compared to previous 2 sources). However, you can sort datasets by popularity / votes to see the most popular ones. Also, it has some interesting datasets and discussions.

No comments:

Post a Comment