Stat 992: Modern Multivariate Statistics (aka Data science with graphs)

UW-Madison

Fall 2020

Karl Rohe

email: full name @ stat dot wisc dot edu
TuTh 9:30AM - 10:45AM
Pre-COVID office: 1239 MSC
Lecture Notes


This is a graph:

In this class, we will study graphs that are data (e.g., social networks).


In this course, we will use the tools of {data science with graphs} to analyze a massive corpus of ~220 million academic papers (vertices/nodes) that includes the papers from most major journals. The goal is to understand a little bit about {how researchers in other academic disciplines use statistics}.

This data set is 100GB+ and gives the citations (edges) between the articles, in addition to each paper’s title, abstract, journal, year, authors, and a hyperlink to the paper’s PDF. My lectures and your assignments will be project based. Through this investigation, we will learn how to do data science with graphs.

Class projects are open ended but are encouraged to focus on the massive corpus. There will not be exams or weekly homework assignments. All classes and office hours will be online due to COVID-19. If you are not a Statistics graduate student but are interested in this course, you should join us! While a few lectures will be more technical, you won’t need to do math problems in homework or exams. If you are not currently a student at UW-Madison but would like to be a part of this course, please contact me.

How do researchers in other academic disciplines use statistics?
Data Science with graphs
Studying the broad use of statistics
Possible Questions
Course outline
Assignments

How do researchers in other academic disciplines use statistics?

When statistics is used in academic research, it is typically not performed by individuals in Statistics Departments. It rarely appears in “Statistics Journals.” Through the massive corpus, we aim to study:

Target Corpus = {all papers in mainstream academic journals that use data analysis / statistics / quantitative techniques}.

The aim is to describe the use of statistics, not judge whether the use is correct or incorrect. This is an area that is ripe for discovery. Multiple Possible Questions are listed below. To investigate, we will navigate the massive corpus via the connections among the articles. In addition to being connected by citations, they are also connected by

  • the ideas they discuss, a shared vocabulary,
  • authors in common,
  • shared journal and year, and
  • academic discipline.

The tools of data science with graphs allow us to study these connections and to chart a course through the corpus.

Data Science with graphs

We are connected by friendships. Webpages are connected by hyperlinks. Text documents are connected by the words they have in common. Financial assets are connected to underlying factors and to each other. Proteins are connected by chemical interactions. And journal articles are connected by citations and co-authorship. The language of graphs and networks allows us to find common ground among these and various other domains; we seek to understand not merely the elements (the person, the webpage, the protein, the paper), but the broad structure of their relationships and the underlying processes that created it. This is modern multivariate statistics. This is data science with graphs.

Over the past five years, this field has started to solidify into four essential primitives. A full analysis is typically a composition of these four techniques:

  1. sampling: using the graph to “target” a set of nodes, or using the graph to find/reach other nodes.
  2. aggregating/blocking: combining vertices.
  3. clustering/factoring: finding groups or latent structure.
  4. contextualizing: using external features to understand how one group is different.

In this course, if you find an additional primitive, then let’s add it to this list.

The first part of the course will focus on learning how to use these tools, to enable course projects. While you are working on your course project, we will discuss why these tools work so well; this will be more technical in nature.
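As a toy illustration of how the primitives compose, here is a minimal R sketch on a small simulated graph using the igraph package. The seed node, cluster method, and two-hop sample are illustrative choices, not the course’s pipeline.

```r
library(igraph)
set.seed(1)

# a small two-block stochastic block model as toy data
g <- sample_sbm(60, pref.matrix = matrix(c(.2, .02, .02, .2), 2),
                block.sizes = c(30, 30))

# 1. sampling: take the two-hop neighborhood of a seed node
seed_node <- 1
nbhd <- unlist(ego(g, order = 2, nodes = seed_node))
samp <- induced_subgraph(g, nbhd)

# 3. clustering/factoring: find groups in the sampled graph
cl <- cluster_leading_eigen(samp)

# 2. aggregating/blocking: contract vertices within each cluster
blocked <- contract(samp, membership(cl))

# 4. contextualizing would compare external node features across the clusters
sizes(cl)
```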

Studying the broad use of statistics

We will construct our Target Corpus by taking a targeted sample of the massive corpus.

Our first task will be to identify which of the 220 million papers fall within “mainstream science” (very broadly defined). One approach is to aggregate the paper-paper citations into journal-journal citations, then take a targeted sample of the journal graph to identify the mainstream journals. In this step, we can also cluster the journal graph to identify disciplines.
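The aggregation step can be written as one sparse-matrix product. A minimal sketch with toy data, assuming a paper-paper citation matrix A and a paper-to-journal membership matrix Z (both objects here are illustrative):

```r
library(Matrix)

# toy data: 5 papers, 2 journals
# A[i, j] = 1 if paper i cites paper j
A <- sparseMatrix(i = c(1, 2, 3, 4), j = c(2, 3, 5, 1), x = 1, dims = c(5, 5))
# Z[i, k] = 1 if paper i appeared in journal k
Z <- sparseMatrix(i = 1:5, j = c(1, 1, 2, 2, 2), x = 1, dims = c(5, 2))

# journal-journal citation counts:
# J[a, b] = number of citations from papers in journal a to papers in journal b
J <- t(Z) %*% A %*% Z
J
```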

Then, we want to identify those articles that use data analysis / statistics / quantitative techniques. Identifying these papers is another targeted sampling task, but no longer with the citation graph. Instead, we will use the vocabulary graph (which abstracts contain which words). If we have a set of inclusion words, then any abstract connected to one of these words is included. This is “one-hop” snowball sampling (a type of targeted sampling). But how do we construct the list of inclusion terms? What happens if we go more than one hop?
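One-hop snowball sampling in the vocabulary graph can be sketched in a few lines of base R. The toy abstracts and inclusion words below are illustrative:

```r
abstracts <- c("we fit a regression model",
               "a qualitative study of narratives",
               "bayesian regression with shrinkage")
inclusion_words <- c("regression", "bayesian")

# an abstract is included if it is connected to any inclusion word
pattern  <- paste(inclusion_words, collapse = "|")
included <- grepl(pattern, abstracts)
which(included)  # abstracts 1 and 3
```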

Targeted sampling in the word graph is one approach. Alternatively, if we have a rough guess of our target corpus (e.g., does the abstract contain any numbers: grepl("\\d", ...)), then we can contextualize abstracts with numbers vs. no numbers via the words they use. Said another way, “what words are more likely to appear in abstracts that also contain numbers?” What might be some other ways?
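A minimal base-R sketch of this contextualization, on toy abstracts; in the real corpus one would use the full document-term matrix:

```r
abstracts <- c("we surveyed 120 students",
               "a history of the department",
               "the regression model fit 45 trials",
               "an essay on scientific norms")

# rough guess at the target corpus: abstracts containing a number
has_number <- grepl("\\d", abstracts)

# cross-tabulate words against the numbers/no-numbers split
words <- strsplit(tolower(abstracts), "\\W+")
tab <- table(word = unlist(words),
             numeric_abstract = rep(has_number, lengths(words)))
tab
```

Words with high counts in the `TRUE` column relative to the `FALSE` column are candidates for the inclusion list.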

Possible Questions

Once we construct our target corpus, we will study how its papers appear to use statistics. Possible questions include:

  • What techniques are most often mentioned in the abstracts?
  • How does this depend on discipline? How does it change over time?
  • Can we identify styles-of-analysis (techniques that are often used together)?
  • Are these styles specific to discipline? How do they evolve over time?
  • Is “bayes” increasing? How about “machine learning”? In what disciplines?
  • What is the structure of their citation patterns in our target corpus? Do they cite papers in “Statistics Journals” directly? What is the role of “methodology articles” published outside of mainstream statistics journals? Can we find those methodology articles? How?
  • What is the time lag between new techniques developed in “Statistics Journals” and broader use of those techniques? Does this actually ever happen? Lasso, FDR are recent positive examples. Sparse PCA is a negative example. How does this diffusion happen?

Course outline

Introduction (1 week)

  • themes of data science
  • graph notation
  • examples
  • empirical regularities (network science)
  • overview of topics we will cover

The tools of data science with graphs.
This focuses on “how to use the basic tools” and should be sufficient to start imagining course projects.

  • factoring, aggregating (3 weeks)
  • sampling (1 week)
  • contextualizing (1 class)

The remaining three components will not necessarily occur in this order:

Project proposals (2 weeks, interspersed in why stuff works)
Why stuff works (4 weeks, various topics). This material will be more technical.
Project presentations (2 weeks)

Assignments

There are five assignments in the course.

  1. Blog post: Pick a technique that is related to class. Illustrate how to use this technique in R. Ideally, this would be loading a couple key packages, loading some data, and showing how to apply the key functions to the data. Illustrate any diagnostics. Make visualizations. Describe how to interpret the results. Each blog post should be a unique combination of (technique, data type). You may work in groups of 1, 2, or 3. Post on GitHub and/or on your personal webspace. This can be a warm-up to your course project, or it can be entirely different.

  2. Project proposal: A 15 minute presentation in the middle of the semester. As soon as possible, schedule this with Karl. This should be done by the first week of November.

    • you should have a group of 3-4,
    • the data set should be loaded/accessed already,
    • define the data available,
    • ask an empirical question about this data source, and
    • propose a path to answering this question.
  3. Project report: A written report that answers a single empirical question using the tools from class. Abstract, definitions, figures, etc.

  4. Project website: This could be as simple as a github repo and a static Rmarkdown site. Ideally it would include some interactive analysis/visualizations. You might imagine your project report as akin to documentation for this website.

  5. Project Presentation: Present your project in class.

Data sets

  1. a massive citation network of academic publications (220M papers, 100GB+ with metadata). It has an API, but does not (yet!) have an R wrapper for the API.
  2. Twitter following graph (infinite data, slowly accessible via API)
  3. Wikipedia (17GB compressed, 58GB uncompressed); we want the file pages-articles-multistream.xml.bz2. Also, page count data is available, pagecounts-ez. And so is edge-count data. These last two describe the actions of Wikipedia users (not merely the presence of hyperlinks).

Things that would be good to do:

If you would like to pursue any of these things, that would be great!

  1. How good is the coverage of papers? Which disciplines are missing? What happens to papers behind a paywall? What do 100 random papers look like? How wild is it?
  2. How good is the edge extraction? How does it compare to Jin and Jin’s graph?
  3. Build an R API wrapper for the Semantic Scholar API to make an abstract_graph for aPPR to do quick targeted sampling.
  4. Author disambiguation! Turns out, lots of researchers have the exact same name. However, if we assume that authors in mainstream science do not want disambiguation problems, then they might add initials/names, etc., to ensure this does not happen. But they will only do this within their research cluster. We can use this insight to disambiguate authors with the same name. Can we automate this? Doing it by hand for a couple of researchers would likely be easy. Pull the citation graph around all papers authored by one name. For the papers cited and the papers citing, aggregate them into disciplines (via the clustering in the journal-journal graph). Now the hard question: How many different people are there? There could be 1, or 2, or …. What are the risks of dividing up interdisciplinary researchers into “different researchers”?
  5. Combining journal names. “Annals of Statistics” and “The Annals of Statistics” are the same journal. So is “annals of stat”. How often does this happen? Can we use the journal-journal graph to find these occurrences?
  6. DOI and PMID are two unique identifiers of papers that are given in our data. What happens if we only study papers that have a DOI or PMID? My guess is that their data will be much cleaner. How many papers would we lose? What disciplines?
  7. More ambitious: Can we match the authors in the data to the Twitter following graph? Start with one discipline. Take a targeted sample of that discipline on Twitter. Fuzzy string matching on user names should get a lot of matches. If you want to go further than this, then please let me know. There are a host of privacy issues that come up. Moreover, it is a much more involved project. One could start here and we would contact the authors of that paper.
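Item 5 above can be started with edit distances in base R. A minimal sketch on toy journal names; the normalization and threshold are illustrative, and in practice one could combine this with the journal-journal citation graph:

```r
journals <- c("Annals of Statistics", "The Annals of Statistics",
              "annals of stat", "Journal of Ecology")

# normalize: lowercase and drop a leading "the"
norm <- gsub("^the\\s+", "", tolower(journals))

# pairwise edit distance on normalized names (base utils::adist)
d <- adist(norm)

# flag pairs within a small edit distance as merge candidates
which(d <= 6 & upper.tri(d), arr.ind = TRUE)
```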

Class 1

  1. Review course description, outline, and projects.
  2. Themes of data science.

Homework:

  1. Make sure your version of R is up to date.
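A quick way to check, from the R console (updating all packages may take a while):

```r
# print the installed R version
R.version.string

# refresh installed packages without prompting
update.packages(ask = FALSE)
```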