Finding Data on the Internet

RevoJoe

The following list of data sources was last modified on 3/18/14. Most of the data sets listed below are free; however, some are not.

If an (R) appears after a source, the data are already in R format or there are R commands for importing the data directly from within R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing CSV files. The list is organized into categories that are not mutually exclusive but that reflect what's out there.
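As a rough illustration of the two import routes mentioned above, here is a minimal R sketch. The inline CSV text, its column names, and its values are made up for the example and stand in for a file downloaded from one of the sources below. The commented-out quantmod call assumes that package is installed and that the FRED series DGS10 is still served; treat it as a starting point rather than a tested recipe.

```r
# Route 1: a plain CSV file. The inline text below is hypothetical sample
# data standing in for a downloaded file.
csv_text <- "date,series,value
2014-01-01,A,1.5
2014-02-01,A,1.7
2014-03-01,B,2.1"

# read.csv() also accepts a local file path or a URL in place of `text =`.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Dates arrive as character strings, so convert them explicitly.
df$date <- as.Date(df$date)

str(df)          # inspect column types
mean(df$value)   # quick sanity check on the numeric column

# Route 2: a source marked (R). quantmod can fetch a series directly.
# This needs the quantmod package and network access, so it is left
# commented out here.
# library(quantmod)
# getSymbols("DGS10", src = "FRED")  # creates an xts object named DGS10
```

For a real download, replacing `text = csv_text` with the path or URL of the CSV file is usually all that changes, though some sources need extra arguments such as `skip` or `na.strings` to handle header notes and missing-value codes.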

Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
Gapminder: http://www.gapminder.org/data/
UMD: http://inforumweb.umd.edu/econdata/econdata.html
World Bank: http://data.worldbank.org/indicator

Data Science Practice

This section contains data sets used in the book "Doing Data Science" by Rachel Schutt and Cathy O'Neil (O'Reilly 2014).
Datasets on the book site: https://github.com/oreillymedia/doing_data_science
Enron Email Dataset: http://www.cs.cmu.edu/~enron/
GetGlue (time-stamped events: users rating TV shows): http://bit.ly/1aL8XS0
Titanic Survival Data Set: http://bit.ly/1kJ4pkF
Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/

Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance (R)
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
St. Louis Fed: http://research.stlouisfed.org/fred2/ (R)
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/ (R)
Quandl: http://www.quandl.com/
Yahoo Finance: http://finance.yahoo.com/ (R)

Government

Archived national government statistics: http://www.archive-it.org/
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
DataMarket: http://datamarket.com/
FDA: https://open.fda.gov/index.html
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London, U.K. data: http://data.london.gov.uk/catalogue
New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by...
NYC data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Data sets: http://datasf.org/
U.K. Government Data: http://data.gov.uk/data
United Nations: http://data.un.org/
U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
U.S. Federal Government Agencies: http://www.data.gov/metric
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
The World Bank: http://wdronline.worldbank.org/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/

Health Care

Gapminder: http://www.gapminder.org/data/

Machine Learning

Amazon Web Services Data: http://aws.amazon.com/datasets
Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
AppliedPredictiveModeling (R package): http://bit.ly/16wyvkG
Australian Weather: http://www.bom.gov.au/climate/dwo/
Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
Edge data for US domestic flights 1990 to 2009: http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
Infochimps (Tag = Bigdata): http://www.infochimps.com/tags/bigdata?page=1
Kaggle competition data: http://www.kaggle.com/
KDNuggets competition site: http://www.kdnuggets.com/datasets/
The Koblenz Network Collection: http://konect.uni-koblenz.de/
Machine Learning Data Set Repository: http://mldata.org/
Medicare Data File: http://go.cms.gov/19xxPN4
Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More song datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets: http://datahub.io/dataset/movielens
RDataMining.com R and Data Mining ebook data: http://www.rdatamining.com/data
The Revolution Analytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
53.5 billion clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset

Networks

Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/

Public Domain Collections

Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Factual: http://www.factual.com/topics/browse
Freebase: http://www.freebase.com/
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
Numbrary: http://numbrary.com/
Quora: http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-pu...
RS Collection 100+: http://rs.io/2014/05/29/list-of-data-sets.html
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html (R)
SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
StatSci.org: http://www.statsci.org/datasets.html
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Stats4Stem.org: R data sets: http://www.stats4stem.org/data-sets.html (R)
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html

Science

Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu//

Social Sciences

General Social Survey: http://www3.norc.org/GSS+Website/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
Pew Research: http://www.pewinternet.org/datasets/pages/2/
SNAP: http://snap.stanford.edu/data/index.html
UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
Upjohn Institute: http://www.upjohn.org/erdc/erdc.html

Time Series

Time Series Data Library: http://robjhyndman.com/TSDL/

Universities

Carnegie Mellon University Enron email: http://www.cs.cmu.edu/~enron/
Carnegie Mellon University StatLab: http://lib.stat.cmu.edu/datasets/
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Carnegie Mellon University JASA data archive: http://lib.stat.cmu.edu/jasadata/
Ohio State University Financial data: http://fisher.osu.edu/fin/osudata.htm
UC Berkeley: http://ucdata.berkeley.edu/
UCLA: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html

Comments

robjhyndman

You could add my Time Series Data Library: http://robjhyndman.com/TSDL/

RevoJoe

Thank you Rob!

muenchen.bob

Joe,

Thanks for this totally awesome list! The best part is that wonderful little (R). That should save me a bunch of time when preparing teaching examples.

Cheers,
Bob Muenchen

RevoJoe

Thank you Bob. I am especially keen on adding more of those little (R)'s myself.

jporzak

Joe,

Thanks for doing this!

This Ancestry.com forum archive was just released this morning:
http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
It looks like a great data set for both text mining and social network analysis.

-Jim

RevoJoe

Thank you Jim.
The Ancestry set is on the list.
Joe

dnorton1

Hey Joe,

Here is some data for large networks.
http://snap.stanford.edu/data/index.html

Cheers,
Derek

Sergey

Great list!
My 2 cents: there are tons of data from different providers aggregated at Hans Rosling's website: http://www.gapminder.org/data
The files are in Excel format, which is not a problem to import into R.
- Sergey

ches

There is a sizable thread on this subject on Quora with many, many sources:

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

There are also links there to a Reddit thread and other similar collections.

azhang

Thanks for posting. I just noticed, though, that the link to Google Finance actually points to Yahoo Finance.

kinneybowles

finding data on the internet requires enterprise search