Deep Inside Data Science


Big data is a big deal. That’s why so many companies are working to deploy analytical systems that can track and collect the data they need, if they haven’t already. With that data, you can learn a lot about customer behaviours, habits and tendencies, your own products and services, and much more. It can also provide insights into the future, like how to tailor specific marketing campaigns or what new products you should launch.

More data was created between 2011 and 2013 than in the entire history of the human race before then. And that was years ago. The volume has exploded even more since. By the year 2020, there will be an estimated 44 trillion gigabytes of data.

As valuable as this information can be, not everyone has the capacity to collect, access or even analyse this much of it. What’s the solution if you don’t have a system that can gather the data for you? What if you don’t have access to data banks or databases? Where can you go? Where can small businesses get the information they need?

Believe it or not, there are many websites on the internet you can use to reference and collect data. These online resources are readily available to anyone and include a plethora of information.

Where to Find the Data You Need

Depending on the type of data you need, there are different places you can go. To make this easier, we separated the resources by category.

General Data Sources

1. Carnegie Mellon Data and Story Library: The website may be dated and plain in terms of design, but there’s a lot of information available here. The library includes a variety of data files and stories that help explain basic statistical methods and similar concepts.

2. This massive resource is the home of the United States government’s open data collection. You can find research, tools and data on a huge list of topics, from agriculture to public safety.

3. UC Berkeley Data Lab: Offered as part of the University of California, Berkeley library system, this treasure trove of information includes a huge selection of resources. You can find research and data on economics, politics, science, business, health and much more.

4. UCLA Statistics Data Sets: This data haven includes a variety of resources and information that UCLA employs during statistics labs and assignments.

5. Comprehensive contact lists for consumers based on a variety of factors, such as demographics, family income, location and more.

6. Internet UPC Database: This is a massive database of UPCs (Universal Product Codes). It stays useful thanks to people uploading new data.

7. Amazon Public Data Sets: Access massive blocks of data hosted by Amazon. The data is publicly available, and anyone can access it.

9. Programmable Web: A lot of sites and data providers make information available via APIs. Twitter, Google and Yahoo are great examples of this. Programmable Web is a detailed catalog where you can find these APIs.

10. New York Public Library e-Journals: Includes a huge, searchable database of educational journals hosted through the New York Public Library. There’s a lot of good information here if you take the time to look.

Geographic Data

11. USGS Earthquake Catalog: Everything you need to know about past or present earthquakes can be found in this online catalog. You can narrow down your search for an event by world region, date and even magnitude.

12. GeoCommons: A massive collection of geographic data stored in the GeoJSON format. The archive can be searched and browsed, and you can preview, download and even open the data in ArcGIS — an online mapping tool.

13. OpenStreetMap: A Google Maps-type data and map service, created with help from an active community. It is crowdsourced, which means people like you or me are free to update and use data from the platform.

14. TIGER: TIGER stands for Topologically Integrated Geographic Encoding and Referencing and includes data from the United States Census Bureau. The data covers roads, railroads, ZIP codes, rivers and more.

15. Flickr Shapefiles: Pictures uploaded to Flickr can be geotagged with location data. Flickr uses those geotags to build “shapes” or boundaries, which provide a more accurate contour of a specific location.
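
To make the geographic sources above more concrete, here is a minimal Python sketch of the two ideas they share: narrowing a search by date, magnitude and region, and reading data stored in the GeoJSON format. It assumes the USGS catalog's public query service accepts the parameter names shown (starttime, endtime, minmagnitude and the latitude/longitude bounds), so check the official documentation before relying on them. The function names and the sample record are hypothetical, made up purely for illustration, and nothing here touches the network.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the USGS earthquake catalog query service.
USGS_QUERY_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def build_query_url(start, end, min_magnitude, bbox=None):
    """Build a catalog search URL filtered by date range, magnitude and region.

    bbox is an optional (min_lat, max_lat, min_lon, max_lon) tuple.
    """
    params = {
        "format": "geojson",
        "starttime": start,
        "endtime": end,
        "minmagnitude": min_magnitude,
    }
    if bbox is not None:
        min_lat, max_lat, min_lon, max_lon = bbox
        params.update({
            "minlatitude": min_lat,
            "maxlatitude": max_lat,
            "minlongitude": min_lon,
            "maxlongitude": max_lon,
        })
    return USGS_QUERY_URL + "?" + urlencode(params)


def summarize_features(geojson_text):
    """Return (place, magnitude, longitude, latitude) for each point feature."""
    collection = json.loads(geojson_text)
    rows = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue  # this sketch only handles point locations
        # GeoJSON stores coordinates as longitude first, then latitude.
        lon, lat = geometry["coordinates"][:2]
        props = feature.get("properties") or {}
        rows.append((props.get("place"), props.get("mag"), lon, lat))
    return rows


# Hypothetical sample shaped like a GeoJSON FeatureCollection (not real data).
SAMPLE = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "properties": {"place": "Example Town", "mag": 5.1},
   "geometry": {"type": "Point", "coordinates": [-122.4, 37.8, 10.0]}}
]}
"""

if __name__ == "__main__":
    print(build_query_url("2014-01-01", "2014-01-02", 5))
    print(summarize_features(SAMPLE))
```

To work with real data, you would download the text returned by the generated URL (or any GeoJSON file from one of the archives above) and pass it to summarize_features in place of the sample.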

News, World and Health Data Sources

16. The New York Times: The complete NYT archive, which includes over 13 million articles, can be searched and viewed online. It dates all the way back to 1851 and runs up to the present day. You do need to have an active subscription to access the content, however.

17. Wall Street Journal: Like the NYT, the Wall Street Journal has an article archive for content it has published over the years.

18. The Guardian: No matter how you feel about the publication, the Guardian offers a lot of freely available data you can export as Google spreadsheets. It could be an incredibly useful resource depending on what you’re looking for.

19. World Health Organization: Anything you could ever want to know about global health or world news can be found right here, in the World Health Organization’s database.

20. DataSF: The city of San Francisco launched a database full of information you can use to your advantage. Hopefully other cities will do the same.

Thank you, Kayla Matthews, for sharing this much-needed information with aspiring data scientists.


