50 Amusing Data Science Data Sets, Part 2

By arvind | Sep 15, 2018

Part 1 covered 50 amusing data sets for data scientists. Part 2 covers 50 more.

Here are the next 50 data sets.

  1. The smallest data set on this list: the survival rates of men and women on the Titanic. Female passengers were 4 times more likely to survive than male passengers.

  2. Want a super specific breakdown of the contents of your food? You're in luck. 

  3. There's a similar database of the metabolites in the human body. I'm not sure what you could do with it, but it might come in handy in some sort of dystopian future where humans are raised like cattle for their nutrients. 

  4. Invented a new image compression algorithm and need data to test it on? Look no further than MIT CSAIL's tiny image data set.

  5. Or maybe tiny images are too tiny. In that case, try the ImageNet database, which is structured around the WordNet hierarchy. So if you want to teach an algorithm what a narwhal looks like, this would be a good place to start. 

  6. Still not enough? How about all the Wikipedia images?

  7. Let's say you're building the next generation of a book reader, and you want to automatically associate phrases with the relevant Wikipedia article. How? Stanford, in association with Google Research, has you covered with their English-phrase-to-associated-Wikipedia-article database. The research paper can be downloaded here.

  8. Yandex, the Russian search engine, has made a bunch of search data available. Namely, if someone searches for something, what do they click on? Downsides: It's a Russian search engine with Russian search results.

  9. Just what kind of edits do people usually make on Wikipedia? I don't know, but you can figure it out with this data set.

  10. Did you know that Google has a search engine for data sets?

  11. Pew Research has many free data sets, including their "Global Attitudes Project" archive. Questions this data could answer: Is the world becoming more progressive over time? How have attitudes towards religion shifted over time?

  12. Speaking of public attitudes over time, you can download General Social Survey data from 1972 until about 2012, which should answer both of those questions.

  13. There's a fun math problem called the celebrity problem, which asks you to find the person who everyone knows, but who knows nobody. But what about the real-life celebrity problem? Try Yahoo's collection of celebrity faces.

  14. Need a billion web pages from February 2009? Maybe to train a never-ending language learner named NELL? Yup, it's available.

  15. If you need economic census data on any industry, check out census.gov's industry statistics portal. If finance is really evil, you ought to be able to find something damning in the data.

  16. For those unfamiliar with Usenet, it's sort of like a huge, text-only forum. It was much more popular before the rise of the World Wide Web. Anyways, you can download a huge data set of postings to Usenet here. It might be pretty good for some kind of textual analysis project or for training a machine learning algorithm (maybe a spellchecker?). You could use the data to build out a Google Groups competitor, too.

  17. Nick Bostrom has a very interesting paper called "Existential Risk Prevention as Global Priority." The basic intuition is that preventing even small risks of human extinction is worthwhile if we consider all the human generations it would save. One way to start saving all those future lives might be by digging into this data set of every recorded meteor impact on Earth from 2500 BCE to 2012.

  18. How do gender and mental illness affect crime? This data set was collected explicitly with that question in mind.

  19. Speaking of mental health, if you're interested in how it affects minorities specifically, try this.

  20. There are a lot of lonely men and women out there, and some of those lonely men and women have excellent analytical skills. For those lonely people, I suggest using this data set, which "surveyed how Americans met their spouses and romantic partners and compared traditional to non-traditional couples" to determine the best way to meet that special someone.   

  21. Tons of data on what is called "adolescent health" are available here, but the set actually covers more than that, including a bunch of related data and biomarkers.

  22. Here's a question: Are modern jobs worse than those of the past? My grandparents built tires at Firestone. Today, people rarely have that level of control and visceral experience of the finished product of their work. This set of five surveys regarding how different groups experience employment could answer that question. I can see the article now - "Is everything getting slightly worse? We found out."

  23. Stanford has 35 million Amazon reviews available for download. Lots of stuff you could do with this: use it to improve recommendation algorithms, or figure out whether or not there's a follow-the-leader effect with reviews.

  24. Based on some of my research prior to writing this, the Google keyword "data sets on serial killers" is 1) really specific and 2) weirdly popular, but I guess there's no accounting for taste. And, of course, we've got data for that, thanks to the Serial Killer Information Center.

  25. In this gruesome vein, the University of Maryland has a "Global Terrorism Database," which is a set of more than 113,000 terror incidents. You can download it after filling out a form. Ideas for use: visualization of terror incidents by location over time, predicting and preventing terror attacks, and creating early alert systems for vulnerable areas.

  26. The MNIST Database is a classic in the field of machine learning. It's a set of labeled handwritten digits, which are useful for training OCR algorithms. Today, some algorithms are actually more accurate than human judges! This would have been nice to have back when I was in grade school. I distinctly recall once arguing with a teacher over missing a question because she insisted that I had written the letter when it was clearly a. In the future, we'll let the machines decide.

  27. UCI has a poker hand data set available. My poker-fu is fairly weak, but I'm sure there's some interesting analysis to be done there. I've heard secondhand that humans still maintain some advantage over machines when it comes to poker, but I'm unable to verify that via Google. Machines have won at least one tournament.

  28. Another data set from UCI: images labeled as either advertisements or non-advertisements. This is good for building classification algorithms that decide whether a new image is an ad, which might be good for, say, automatic adblocking or spam detection. Or maybe a Google Glass application that filters out real-life advertisements. That'd be cool. Look at a billboard and instead see a virtual extension of the natural landscape.

  29. Remember the whole Star Wars Kid debacle? Wikipedia informs me that Attack of the Show rated it the number 1 viral video of all time. Andy Baio, one of the guys who was in on it before it was cool and who coined the phrase "Star Wars Kid," has made his server logs from the time publicly available. Someone could take this data and produce a map-based visualization of who saw it when, along with annotations of where the traffic was coming from.

  30. Who's linking to who (and what) on WordPress? (Tidbit: most of the links to this site come from WordPress blogs.) With this WordPress crawl, you can find out. Visualizing the network might be sorta cool, but it'd be cooler still to uncover some information about "supernodes" that either are linked to often or put out a lot of links (or maybe both). Or maybe to cluster people by interest.

  31. Is Obama in bed with big oil? Or extremist environmentalists? Or the corn lobbies? And who was backing that Herman Cain dude, anyway? The 2012 Presidential Campaign Finance data is available for download. It would be neat to see an analysis of what industries prefer what candidates.

  32. Which private colleges are the best value?

  33. Which public colleges are the best value?

  34. Cigarette data by state. Kentucky smokes the most, with West Virginia as a close second. Given the massive social harm of tobacco, a good analysis could very well save a lot of lives.    

  35. Want to build a Reddit recommendation engine? (Or, better yet, how about just a filter for the stupid-but-popular opinions?) Well, here's the data a Redditor is using to do just that. The recommendation engine, I mean.

  36. Global health data. This would be great for identifying high-impact ways to improve world health, like the Schistosomiasis Control Initiative, which is one of GiveWell's top-rated charities.

  37. United States crime from 1960 to 2012. I'd like to see a graph of rape per capita over time (which, from a brief peek at the data, is dropping). And then add the data for prison rape, which is morally repugnant but apparently a-okay to joke about on television.

  38. How about launching a Yelp-for-bathrooms?

  39. Did you know that the best-selling item in Canadian grocery stores is Kraft Dinner? I wonder how it sells in Belgium or Taiwan. Here's some supermarket data from there.

  40. Data on usage of the Firefox web browser. It records things like the number of tabs used, time active, and the number of private tabs opened. While that last point might allow for some titillating finds, it might be neat to see how accurately self-reports of time on the internet compare to the actual data.

  41. This one is super cool: Mozilla has put together a data set of the more than 200,000 bugs found in Mozilla and Eclipse. I would love to see a breakdown of what bugs are the most common and how they can be prevented. Software solutions would be worth a lot of money. Programming languages could be designed around them.

  42. If you're interested in the design of scheduling algorithms (I am!), Google has released a data set of the sort of jobs that they're running on their clusters. Developing algorithms against this data set might help future-proof your discoveries. After all, tomorrow's desktop might look a lot like today's data center.

  43. TechCrunch released a dataset with more than 400,000 company, investor, and entrepreneur profiles, along with an additional 45,000 investment rounds. This might be a good way to reverse engineer what the market's looking for and what investors are funding.

  44. 1.25 million delicious.com bookmarks.

  45. Where are the United States' major military installations located?

  46. Who receives H-1B visas? It might be interesting to know if some countries are more likely to get into the program or which companies "consume" the majority of the visas.

  47. The Twitter users most likely to be followed by users of Hacker News.

  48. Here are all the earthquakes between 1000 and 1903. Feeding them to a neural net and seeing what kind of predictions you get out might be neat.

  49. I've often wondered if the people who take personality tests online are more neurotic than the population at large. There's a lot of data from a series of online personality tests available here, so you could compare their answers to those from the population at large, find out, and then send me an email.

  50. And, finally, something I would have loved as a kid: the list to end all lists of naughty words.
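
A side note on item 13: the celebrity problem mentioned there has a neat linear-time solution. The sketch below is only an illustration, and it assumes the "who knows whom" relation is given as a hypothetical boolean matrix `knows[i][j]` (true when person i knows person j); none of the data sets above come in that form. One question eliminates one person at a time, and a final pass confirms the survivor.

```python
from typing import List, Optional


def find_celebrity(knows: List[List[bool]]) -> Optional[int]:
    """Return the index of the person everyone knows but who knows nobody.

    `knows` is a hypothetical square matrix invented for this example:
    knows[i][j] is True when person i knows person j.
    Returns None when no such person exists.
    """
    n = len(knows)
    if n == 0:
        return None

    # Elimination pass: if the candidate knows someone, the candidate
    # cannot be the celebrity, so that someone becomes the new candidate.
    candidate = 0
    for person in range(1, n):
        if knows[candidate][person]:
            candidate = person

    # Verification pass: the survivor must know nobody and be known by all.
    for other in range(n):
        if other == candidate:
            continue
        if knows[candidate][other] or not knows[other][candidate]:
            return None
    return candidate


# Tiny made-up example: person 2 is known by 0 and 1 and knows nobody.
example = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
print(find_celebrity(example))
```

The verification pass matters: without it, a group with no celebrity at all would still hand back whoever happened to survive the elimination.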

Source: HOB