GitHub Contributions Graph: Analyzing PageRank & Proving The 6 Handshakes Theory

Vadim Markovtsev, source{d}

Read this on your device

About me

Lead Engineer, Machine Learning @ source{d}
Education: compiler technologies
Switched to ML/DS in 2013
Python, C/C++, Go, HTML5

About source{d}

< 20 people in Madrid, Spain
Developed src-d/go-git, Git implementation in Go
Fetch all repositories from GitHub/BitBucket/etc.
Do crazy ML and DS on top of them
Project Babelfish extracts Abstract Syntax Trees from them

About this talk

Based on the blog post
3 parts
Math inside
Python code inside
Data inside

Part I

GitHub in numbers octoverse.github.com

5.8M+

active users

Like @acabunoc, Lead Developer for Open Source Engagement at the Mozilla Foundation who is developing new ways to engage contributors on open source projects.

331k+

active organizations

Like the Dat project who are working openly to develop a peer to peer tool for distributing datasets in science, journalism and government.

19.4M+

active repositories

Like the Federal Source Code Policy on GitHub released by The White House, who this year solicited community feedback on GitHub.

10.7M+

active issues

Like this one opened in the Go programming language repo that used open source data on Google’s BiqQuery to help decide whether to add new functionality to a Go standard library.

Where to take the data?

Commit datasets:

GHTorrent (145M)
commits table in Google BigQuery - equivalent to GHTorrent
GitHub Archive (stream, unknown)
source{d}'s 452M dataset with de-fuzzy-forking

BigQuery will charge you money.

Problem

There are no GitHub user <-> email associations in both datasets.

We need to do identity matching.

            mcuadros@gmail.com
            maximo@sourced.tech
            mcuadros@maximos-macbook.local
            mcuadros@maximos-mac-pro.local
            root@localhost

How to build the graph?

Binary undirected edges
Weighted directed edges
- Implies contribution volume estimation

At large scale, it is useless to distinguish developers and repositories.

Binary undirected edges

Gephi

gephi.org
Open source, Java
Most popular graph visualization software, has GUI
And very... special
Does not handle millions of nodes

Armin Ronacher,
2 levels

Armin is in the center.
8k nodes, 11k edges.

3 levels

200k nodes, 300k edges.

Heatmap

Color depends on how many hops the dot is distant from the root.

Sequence of 4 circles with increasing redness

Rob Pike, 3 levels, zoomed

Oops

Actually Rob Pike did not contribute to cmars/oo and cmars/tools.

Author forgery is perfectly possible on GitHub.

Weighted directed edges

Adjacency matrix

C_ij is the number of commits done by developer i to repository j

C_ij = C_ji

Non-zero elements: <10^-7

Adjacency matrix

We normalize values in every column.

$$\sum_i C_{ij}=1$$

The matrix is no longer symmetric.

Developer: measure of attention
Repository: measure if impact

LargeVis

lferry007/LargeVis
C++, Python
Handles millions of nodes
Unmaintained

LargeVis

Write every non-zero matrix element (edge) to a text file
Apply LargeVis command-line application
23M nodes, 28M edges, 32 cores: <64GB RAM, 2 days
Plot the resulting node coordinates

Resulting node coordinates - 264MB.

Problem

How to plot 22,000,000 dots?

Brute-force scatter plot
Brute-force scatter plot with semi-transparent dots
Heatmap density estimatiion
- numpy.histogram2d + matplotlib.pyplot.imshow
- matplotlib.pyplot.hexbin

Jupyter notebook.

Brute-force scatter plot

& transparency (0.1)

Heatmap (log)

Part II

Statement

Given that two developers "know" each other if they contributed to the same project, how far are they from each other on average, if there is a connection?

Plan

Find all the connected components of the graph.
Pick the core component, or simply “the core”. Calculate the size of the representative sample of the pairs of nodes.
Calculate the distances between sampled node pairs.
Plot the histogram, draw the conclusion.

Connected components

We run a series of graph traversals

Any type works: DFS, BFS, RFS

Python, 22M nodes - 5 min

Result: 10200912, 10229, ...

THE CORE

Core statistics

The size is 10200912 / 22000000 < 50% of the whole graph

33% of the developers; this is 11% of the whole GitHub userbase

Distances

There are \[\frac{N_{core}(N_{core} + 1)}{2}\] possible distance pairs

That is, 2.3 trillions (10¹²)

Do we need to evaluate them all? No!

Representative sample

Statistics for the win! Let's define our random variable as the distance between two random developers.

For sufficiently large N_sample, the distribution of \[{\displaystyle {\hat {p_d}}} = \frac{X_d}{N_{sample}}\] will be closely approximated by a normal distribution.

Sample size on Wikipedia.

Representative sample

Margin of error: 1% (lower is better)
Confidence level: 95% (higher is better)
Result: 10,000 evaluations

Distances calculation

Which algorithm to choose?

Floyd–Warshall works in O(V³) time and eats O(V²) memory
Dijkstra is O(E + V log V) time and O(V) space
▶ Bidirectional search - similar to Dijkstra, but optimized for pairs ◀

We need to avoid the "generational explosion" as much as possible.

Result

Part III

Statement

What developers and repositories are the most important?

Importance

"Importance" is defined recursively: we gain ego if we contribute to repositories to which other important developers contribute.

\[x_ {AR} = \frac{w_ 1 x_ 1 + w_ 2 x_ 2 + w_ 3 x_ 3 + ...}{\lambda} = \frac{1}{\lambda} \sum w_ i x_ i\]

We do not distinguish developers and repositories here.

Return of the adjacency matrix

$$\sum_i C_{ij}=1$$$$C_{ij} \neq C_{ji}$$

Perron–Frobenius theorem

If $C_{ij} > 0$ then $\vec{x}$ is the first eigenvector of C and $x_i$ is positive.

This is not our case: only 10^-7 elements are positive, the rest are zeros.

Google in 1998

Larry and Sergey had exactly the same objective and problems with WWW.

C_ij is 1 if web page i links to j and 0 otherwise.

They invented the trick to make C have all elements positive.

They called it PageRank.

PageRank

$$ G = \beta C + \frac{1-\beta}{N}\left( \begin{array}{cccc} 1 & 1 & ... & 1 \\ 1 & 1 & ... & 1 \\ ... & ... & \ddots & ... \\ 1 & 1 & ... & 1 \end{array} \right) $$

β is the dampening or random walk factor, usually 0.85.

Our matrix becomes acyclic and irreducible, aka the Google matrix.

Power iteration

An efficient algorithm to calculate the first eigenvector.

$$x_{i+1} = G\cdot x_i$$

Sparse friendly.

22M nodes need less than 50 iterations.

GitHub specific: there are no "dead ends".

Power iteration

        def page_rank(m, beta=0.85, niter=50):
    N = m.shape[0]
    x = numpy.ones(N, dtype=numpy.float32) / N
    for i in range(niter):
        x_next = m.dot(x) * beta + (1 - beta) / N
        xdiff = numpy.linalg.norm(x - x_next, ord=1)
        x = x_next
    return x

Convergence

Who is the most important on GitHub?

Ryan Baumann aka ryanfb (at the time the research was conducted - Jan 2017)

Who is the most important on GitHub?

Ryan created more than 1,000 repositories, and each repository gives a small amount of importance.

Link farm failure.

Identity matching is not perfect, so the leaderboard is polluted with such anomalies.

PageRank works to compare specific people.

Who is the most important on GitHub?

Person	PageRank, 10^-5
Linus Torvalds	2.7
Armin Ronacher	0.8
Maximo Cuadros	0.26
Vadim Markovtsev	0.16

What is the most important on GitHub?

Conclusion

GitHub contributions graph can be visualized and has a structure
11% of GitHub users form the (strongly) connected core
85% of the core is ≤6 hops from each other
PageRank allows for measuring the socially validated impact

Materials

Dataset on data.world, pickled graph
Code: visualization, the rest in the blog post
CS246
Papers: 1 (130k nodes, CSMR 2013), 2 (? nodes, 2016)
This presentation - shwr

Contacts

We are hiring!