email: full name @ stat dot wisc dot edu
TuTh 9:30AM - 10:45AM
Pre-COVID office: 1239 MSC
Lecture Notes
This is a graph. In this class, we will study graphs that are data (e.g., social networks).
In this course, we will use the tools of {data science with graphs} to analyze a massive corpus of roughly 220 million academic papers (vertices/nodes) that includes the papers in most major journals. The goal is to understand a little bit about {how researchers in other academic disciplines use statistics}.
This data is 100GB+ and gives the citations (edges) between the articles, as well as each paper's title, abstract, journal, year, authors, and a hyperlink to the paper's PDF. My lectures and your assignments will be project-based. Through this investigation, we will learn how to do data science with graphs.
Class projects are open-ended but are encouraged to focus on the massive corpus. There will not be exams or weekly homework assignments. All classes and office hours will be online due to COVID-19. If you are not a Statistics graduate student but are interested in this course, you should join us! While a few lectures will be more technical, you will not need to solve math problems in homework or exams. If you are not currently a student at UW-Madison but would like to be a part of this course, please contact me.
How do researchers in other academic disciplines use statistics?
Data Science with graphs
Studying the broad use of statistics
Possible Questions
Course outline
Assignments
When statistics is used in academic research, the analysis is typically not performed by individuals in Statistics Departments, and it rarely appears in "Statistics Journals." Through the massive corpus, we aim to study:
Target Corpus = {all papers in mainstream academic journals that use data analysis / statistics / quantitative techniques}.
The aim is to describe the use of statistics, not to judge whether any particular use is correct or incorrect. This is an area ripe for discovery; several possible questions are listed below. To investigate, we will navigate the massive corpus via the connections among the articles. In addition to being connected by citations, the articles are also connected by co-authorship and by shared vocabulary (the words their titles and abstracts have in common).
The tools of data science with graphs allow us to study these connections and to chart a course through the corpus.
We are connected by friendships. Webpages are connected by hyperlinks. Text documents are connected by the words they have in common. Financial assets are connected to underlying factors and to each other. Proteins are connected by chemical interactions. And journal articles are connected by citations and co-authorship. The language of graphs and networks allows us to find common ground among these and various other domains; we seek to understand not merely the elements (the person, the webpage, the protein, the paper), but the broad structure of their relationships. We aim to understand more clearly the underlying process that created this broad structure. This is modern multivariate statistics. This is data science with graphs.
Over the past five years, this field has started to solidify into four essential primitives. A full analysis is typically a composition of these four techniques:
In this course, if you find an additional primitive, then let's add it to this list.
The first part of the course will focus on learning how to use these tools to enable course projects. While you are working on your course project, we will discuss why these tools work so well; this material will be more technical in nature.
We will construct our Target Corpus by taking a targeted sample of the massive corpus.
Our first task will be to identify which of the 220 million papers fall within "mainstream science" (very broadly defined). One way to do this is to aggregate the paper-paper citations into journal-journal citations, then take a targeted sample of the journal graph to identify the mainstream journals. In this step, we can also cluster the journal graph to identify disciplines.
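As a concrete illustration, here is a minimal sketch of that aggregation in R. The toy data frames (papers, citations) are hypothetical stand-ins for the real corpus: with a sparse paper-paper citation matrix A and a paper-journal membership matrix M, the journal-journal counts are t(M) %*% A %*% M.

```r
library(Matrix)

# Hypothetical toy data: a paper -> journal map and paper-level citations.
papers    <- data.frame(paper_id = 1:6,
                        journal  = c("A", "A", "B", "B", "C", "C"))
citations <- data.frame(from_paper = c(1, 2, 3, 5, 6),
                        to_paper   = c(3, 4, 1, 2, 4))

n        <- nrow(papers)
journals <- sort(unique(papers$journal))

# A: sparse paper-paper citation adjacency matrix.
A <- sparseMatrix(i = citations$from_paper,
                  j = citations$to_paper,
                  x = 1, dims = c(n, n))

# M: paper-journal membership matrix (one 1 per paper).
M <- sparseMatrix(i = papers$paper_id,
                  j = match(papers$journal, journals),
                  x = 1, dims = c(n, length(journals)))

# J[a, b] = number of citations from papers in journal a to journal b.
J <- t(M) %*% A %*% M
dimnames(J) <- list(journals, journals)
J
```

The journal graph J is small enough to sample and cluster directly, even when the paper graph is not.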
Then, we want to identify the articles that use data analysis / statistics / quantitative techniques. Identifying these papers is another targeted-sampling task, but no longer on the citation graph. Instead, we will use the vocabulary graph (which abstracts contain which words). If we have a set of inclusion words, then any abstract connected to one of these words is included. This is "one-hop" snowball sampling (a type of targeted sampling). But how do we construct the list of inclusion terms? And what happens if we go more than one hop?
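Here is a minimal sketch of that one-hop step in base R, assuming we already have a small set of inclusion words; the abstracts and words below are made up for illustration.

```r
# Hypothetical toy abstracts and an assumed list of inclusion words.
abstracts <- c("We fit a regression model to the survey data.",
               "A history of the medieval guild system.",
               "Bayesian estimation of treatment effects.")
inclusion_words <- c("regression", "bayesian", "p-value")

# Tokenize each abstract into lowercase words (abstract -> word edges).
tokens <- strsplit(tolower(abstracts), "[^a-z-]+")

# One hop: keep any abstract adjacent to at least one inclusion word.
included <- vapply(tokens, function(w) any(w %in% inclusion_words), logical(1))
abstracts[included]

# A second hop would grow the word list from the included abstracts,
# then repeat the step above.
new_words <- setdiff(unique(unlist(tokens[included])), inclusion_words)
```

Going more than one hop trades precision for recall: the word list grows quickly, and with it the risk of including off-target abstracts.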
Targeted sampling in the word graph is one approach. Alternatively, if we have a rough guess at our target corpus (e.g., does the abstract contain any numbers, grepl("\\d",...)?), then we can contextualize abstracts with numbers vs. abstracts without numbers via the words they use. Said another way, "what words are more likely to appear in abstracts that also contain numbers?" What might be some other ways?
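One way to make that comparison precise is a smoothed log-odds score for each word. The sketch below uses base R and made-up abstracts; a real analysis would use the full corpus and more careful tokenization.

```r
# Hypothetical toy abstracts; which words co-occur with numbers?
abstracts <- c("We surveyed 1200 households about income.",
               "The narrator of the novel is unreliable.",
               "Mean blood pressure fell by 12 mmHg.",
               "A close reading of the sonnet's imagery.")

has_number <- grepl("\\d", abstracts)
tokens <- strsplit(tolower(gsub("[[:punct:]]", " ", abstracts)), "\\s+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

vocab <- unique(unlist(tokens))
count_in <- function(group) {
  as.numeric(table(factor(unlist(tokens[group]), levels = vocab)))
}
f1 <- count_in(has_number)   # word counts, abstracts with numbers
f0 <- count_in(!has_number)  # word counts, abstracts without numbers

# Smoothed log-odds: positive scores favor the "contains numbers" group.
log_odds <- log((f1 + 1) / sum(f1 + 1)) - log((f0 + 1) / sum(f0 + 1))
names(log_odds) <- vocab
head(sort(log_odds, decreasing = TRUE))
```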
Once we construct our target corpus, we will study how the papers in it appear to use statistics. Possible questions include:
Introduction (1 week)
The tools of data science with graphs.
This focuses on “how to use the basic tools” and should be sufficient to start imagining course projects.
The remaining three components will not necessarily happen in this order:
Project proposals (2 weeks, interspersed with "why stuff works")
Why stuff works (4 weeks, various topics). This material will be more technical.
Project presentations (2 weeks)
There are five assignments in the course.
Blog post: Pick a technique that is related to class. Illustrate how to use this technique in R. Ideally, this would be loading a couple of key packages, loading some data, and showing how to apply the key functions to the data. Illustrate any diagnostics. Make visualizations. Describe how to interpret the results. Each blog post should be a unique combination of (technique, data type). You may work in groups of 1, 2, or 3. Post on GitHub and/or on your personal webspace. This can be a warm-up to your course project, or it can be entirely different.
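For example, the core of a post might be as small as the sketch below; the technique (community detection with igraph) and the data (igraph's built-in Zachary karate club graph) are one possible pairing, not a requirement.

```r
library(igraph)

# Load some data: a classic small social network.
g <- make_graph("Zachary")

# Apply a key function: greedy modularity-based clustering.
cl <- cluster_fast_greedy(g)

# Diagnostics: how many clusters, how big, and how well do they fit?
sizes(cl)
modularity(cl)

# Visualization: color vertices by cluster membership.
plot(cl, g, vertex.label = NA)
```

A full post would wrap this in prose: what the clusters mean for the data and how to read the diagnostics.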
Project proposal: A 15 minute presentation in the middle of the semester. As soon as possible, schedule this with Karl. This should be done by the first week of November.
Project report: A written report that answers a single empirical question using the tools from class. Abstract, definitions, figures, etc.
Project website: This could be as simple as a GitHub repo and a static R Markdown site. Ideally it would include some interactive analysis/visualizations. You might imagine your project report as akin to documentation for this website.
Project Presentation: Present your project in class.
If you would like to pursue any of these things, that would be great!