This is the fifth post in a series on how to build a data science portfolio. You can find links to the other posts in the series at the bottom of this post.
If you've ever worked on a personal data science project, you've probably spent a lot of time browsing the internet looking for interesting data sets to analyze. It can be fun to sift through dozens of data sets to find the perfect one, but it can also be frustrating to download and import several CSV files, only to realize that the data isn't that interesting after all. Luckily, there are online repositories that curate data sets and (mostly) remove the uninteresting ones.
In this post, we'll walk through several types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find data sets for each. Whether you want to strengthen your data science portfolio by showing that you can visualize data well, or you have a spare few hours and want to practice your machine learning skills, we've got you covered.
Data sets for Data Visualization Projects
A typical data visualization project might be something along the lines of "I want to make an infographic about how income varies across the different states in the US". There are a few considerations to keep in mind when looking for a good data set for a data visualization project:
It shouldn't be messy, because you don't want to spend a lot of time cleaning data.
It should be nuanced and interesting enough to make charts about.
Ideally, each column should be well-explained, so the visualization is accurate.
The data set shouldn't have too many rows or columns, so it's easy to work with.
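Two of these considerations, size and messiness, are easy to check before committing to a data set. The snippet below is a minimal sketch using only Python's standard library: it counts rows and columns and reports the share of missing cells in each column. The function name `profile_csv`, the list of tokens treated as missing, and the sample values are all assumptions made for illustration; adjust them to the file you are inspecting.

```python
import csv
import io


def profile_csv(text_stream, missing_tokens=("", "NA", "N/A", "null")):
    """Return row count, column count, and the share of missing cells per column."""
    reader = csv.DictReader(text_stream)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None; also count the assumed "missing" tokens.
            if value is None or value.strip() in missing_tokens:
                missing[name] += 1
    share = {name: (missing[name] / rows if rows else 0.0) for name in columns}
    return {"rows": rows, "columns": len(columns), "missing_share": share}


# Tiny made-up sample, used only to show the shape of the report.
sample = io.StringIO("state,median_income\nOhio,54000\nUtah,\nIowa,58000\nTexas,NA\n")
report = profile_csv(sample)
print(report["rows"], report["columns"])
print(report["missing_share"])
```

For a real file, pass an open file handle instead of the in-memory sample, for example `profile_csv(open("data.csv", newline=""))`. A column with a high missing share is a hint that the data set may need more cleaning than a visualization project warrants.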
News sites that release their data publicly are a good place to find data sets for data visualization projects. They typically clean the data for you, and they have already made charts that you can replicate or improve.
1. FiveThirtyEight
FiveThirtyEight is an incredibly popular interactive news and sports site started by Nate Silver. They write interesting data-driven articles, like "Don't blame a skills gap for lack of hiring in manufacturing" and "2016 NFL Predictions".
FiveThirtyEight makes the data sets used in its articles available on GitHub.
View the FiveThirtyEight Data sets
Here are some examples:
Airline Safety — contains information on accidents from each airline.
US Weather History — historical weather data for the US.
Study Drugs — data on who's taking Adderall in the US.
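Because these data sets live in a public GitHub repository, one way to peek at a file without cloning the whole repository is to read it through GitHub's raw-content URL. The following is a rough sketch using only the standard library. The repository name, branch, and file path in the example are assumptions and may not match the current layout, so check the repository before relying on them; `raw_github_url` and `fetch_csv_rows` are hypothetical helper names.

```python
import csv
import io
import urllib.request


def raw_github_url(owner, repo, branch, path):
    """Build the raw-content URL for a file in a public GitHub repository."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path.lstrip('/')}"


def fetch_csv_rows(url, limit=5):
    """Download a CSV and return its header plus the first `limit` data rows."""
    with urllib.request.urlopen(url, timeout=30) as response:
        text = response.read().decode("utf-8")
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    rows = [row for _, row in zip(range(limit), reader)]
    return header, rows


# Assumed location; verify the branch and file path in the repository itself.
url = raw_github_url("fivethirtyeight", "data", "master", "airline-safety/airline-safety.csv")
print(url)
# header, rows = fetch_csv_rows(url)  # uncomment to download (needs network access)
```

Looking at the header and a handful of rows is usually enough to decide whether a data set is worth downloading in full.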
2. BuzzFeed
BuzzFeed started as a purveyor of low-quality articles, but has since evolved and now writes some investigative pieces, like "The court that rules the world" and "The short life of Deonte Hoard".
BuzzFeed makes the data sets used in its articles available on GitHub.
Here are some examples:
Federal Surveillance Planes — contains data on planes used for domestic surveillance.
Zika Virus — data about the geography of the Zika virus outbreak.
Firearm background checks — data on background checks of people attempting to buy firearms.