To me, Data Science is data analysis in the age of the internet.
We now have the ability to easily transmit (i) software and (ii) data. While we have always shared software and data, this newfound ease of transmission makes possible things like GitHub and API data requests. In both software and data, the speed and ease of transmission allow for a richer web of dependencies and interactions that entirely transforms the way we do many types of data analysis.
The following are loose definitions that help me navigate the exciting features of Data Science, and these questions are useful starting points for discussion: As “statistical generalists” (like me and, hopefully, you), where can we help most quickly? Where should we ask for help? Where must we learn a lot more before trying to help?
A data playground (dp) is an identifier that is (1) meaningful and relatable and (2) widely used by disparate groups/organizations that share data.
Exercise: Search Google for the terms 20500 or 1861440687. What do those numbers mean? What is the dp that makes them meaningful? Why is 1861440687 particularly interesting in the year 2016?
A common feature of data playgrounds: you often first encounter a playground when it appears as a factor variable with many levels (20+). If you see a lot of levels, ask whether there is more information about each of those levels. If so, dig in!
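A minimal sketch of this habit in R, assuming your data lives in a data frame (the data frame `df` and its columns here are made up for illustration):

```r
# Hypothetical data frame; in practice this would be your own data.
df <- data.frame(
  zip    = c("20500", "10027", "60637", "20500"),
  clicks = c(12, 8, 5, 20)
)

# Count distinct values in each column; columns with many levels
# are candidates for a data playground worth digging into.
sapply(df, function(col) length(unique(col)))
```

A column with 20+ distinct values that recur across other data sets (ZIP codes, ISBNs, ticker symbols) is exactly the kind of identifier worth investigating.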
Another common feature: there is code specialized around these playgrounds.
Web-contextualized data (wcd) is a data set (or data point) that exists in a broad constellation of information. “Web” does not necessarily refer to the www, but rather to the fact that the data is linked to other data sources by at least one data playground. Via the playground we can contextualize the data we already have. That is, we can find more information to further study or understand our current data.
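Contextualizing via a playground is, in code, usually just a join on the shared identifier. A small hedged sketch in R (both tables and their contents are hypothetical; `inner_join` is from the dplyr package):

```r
library(dplyr)

# Our data: ZIP codes with some measurement.
our_data <- tibble(zip = c("20500", "10027"), clicks = c(120, 85))

# Someone else's data, keyed by the same playground identifier.
zip_info <- tibble(zip = c("20500", "10027"),
                   city = c("Washington", "New York"))

# The shared identifier links the two sources:
contextualized <- inner_join(our_data, zip_info, by = "zip")
```

The playground (here, the ZIP code) is what makes the join possible at all; without a shared identifier, the two tables could not talk to each other.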
Common (neither necessary nor sufficient!) features of wcd:
Here are two definitions of “medium” from the New Oxford American Dictionary (via the mac app “dictionary”):
“a particular form of storage for digitized information, such as magnetic tape or discs.”
Another is:
“the intervening substance through which impressions are conveyed to the senses or a force acts on objects at a distance: radio communication needs no physical medium between the two stations | the medium between the cylinders is a vacuum.”
A statistical medium (sm) is a highly recurring data structure that has an ecosystem of algorithms, software (e.g. packages), and statistical techniques that are focused on the specific idiosyncrasies of the medium.
If you find yourself working on a popular sm, then you should absolutely find the most popular packages in R. People spend their lives developing these packages, and they help you do very basic things that would otherwise require lots of (error-prone!) coding. Whenever you can, “steal”. Also, remember to share (so that others can steal from you).
A pipeline is the final product for a data scientist. It is code that entirely reproduces the final analysis:
The most interesting pipelines start by importing interesting data. For example,
While these example pipelines might be “autonomous”, they still “communicate”: they purchase stocks or post tweets! Even in “autonomous” settings, we must be good statisticians. As such, we should necessarily create diagnostics (for our own eyes) that tell us how the pipeline is performing. So even an “autonomous” pipeline will include step 4, and the pipeline should communicate with “two audiences”.
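A skeleton of such a pipeline in R might look as follows. This is a hedged sketch, not a recipe: the URL, column names, and model are all made up, and the exact steps will differ by project. The point is that one script carries the analysis from raw import to outputs for both audiences.

```r
library(readr)
library(dplyr)

# 1. import: pull the raw data (URL is hypothetical)
raw <- read_csv("https://example.com/data.csv")

# 2. clean: drop rows the analysis cannot use
clean <- raw %>% filter(!is.na(value))

# 3. analyze: fit a simple (illustrative) model
fit <- lm(value ~ predictor, data = clean)

# 4. communicate with two audiences:
#    diagnostics for our own eyes...
png("diagnostics.png")
plot(fit, which = 1)   # residuals vs. fitted
dev.off()

#    ...and a clean summary for the outside audience
write_csv(broom::tidy(fit), "results.csv")
```

Because the whole analysis lives in one script, rerunning it reproduces everything, which is exactly what makes the pipeline the final product.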
As we develop our pipeline, we will necessarily iterate forward and backward through it, developing the separate pieces in non-consecutive order. In order to iterate quickly, we need to develop the ability to think and code in concise syntax. The base R syntax is excessively broad; the tidyverse (which we will learn) and higher levels of programming more generally aim to streamline 80% of the concepts into short syntax. (story: for loop, match, inner_join). With agile syntax, it will be easy to update code, incorporate new pieces, etc.
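The “for loop, match, inner_join” story can be made concrete: the same table lookup, written three ways, from verbose to concise. All data here is hypothetical.

```r
prices   <- data.frame(ticker = c("AAPL", "GOOG"), price = c(170, 140))
holdings <- data.frame(ticker = c("GOOG", "AAPL"), shares = c(5, 10))

# (1) a base R for loop: many lines, easy to get wrong
out1 <- holdings
out1$price <- NA
for (i in seq_len(nrow(out1))) {
  out1$price[i] <- prices$price[prices$ticker == out1$ticker[i]]
}

# (2) base R match(): one line, but cryptic
out2 <- holdings
out2$price <- prices$price[match(out2$ticker, prices$ticker)]

# (3) tidyverse inner_join(): says what it means
library(dplyr)
out3 <- inner_join(holdings, prices, by = "ticker")
```

All three produce the same merged table; the third reads closest to the idea (“join these tables on ticker”), which is what makes it easiest to update and extend.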
You will develop:
Because 80% of problems are very similar, we will focus on doing these with agility. Zen tip: our aim is “agility” (not “speed”) because the aim is efficiency to aid concentration (not efficiency to aid speed).
We will behave as generalists who have not yet specialized, but it is to your advantage if you already have some “specialized knowledge”; it will help you to figure out what is interesting.
What do you care about? Why do you think it is interesting? What kind of data is there on this topic? Is there a playground?