GitHub Contributions Graph: Analyzing PageRank & Proving The 6 Handshakes Theory

Vadim Markovtsev, source{d}.

GitHub Contributions Graph: Analyzing PageRank & Proving The 6 Handshakes Theory

Vadim Markovtsev, source{d}

GitHub contributions graph - 22M nodes

Read this on your device

About me

About source{d}

About this talk

Part I

GitHub in numbers octoverse.github.com

5.8M+
active users

Like @acabunoc, Lead Developer for Open Source Engagement at the Mozilla Foundation who is developing new ways to engage contributors on open source projects.

331k+
active organizations

Like the Dat project who are working openly to develop a peer to peer tool for distributing datasets in science, journalism and government.

19.4M+
active repositories

Like the Federal Source Code Policy on GitHub released by The White House, who this year solicited community feedback on GitHub.

10.7M+
active issues

Like this one opened in the Go programming language repo that used open source data on Google’s BiqQuery to help decide whether to add new functionality to a Go standard library.

Where to take the data?

Commit datasets:

  1. GHTorrent (145M)
  2. commits table in Google BigQuery - equivalent to GHTorrent
  3. GitHub Archive (stream, unknown)
  4. source{d}'s 452M dataset with de-fuzzy-forking

BigQuery will charge you money.

Problem

There are no GitHub user <-> email associations in both datasets.

We need to do identity matching.

            mcuadros@gmail.com
            maximo@sourced.tech
            mcuadros@maximos-macbook.local
            mcuadros@maximos-mac-pro.local
            root@localhost
        

How to build the graph?

At large scale, it is useless to distinguish developers and repositories.

Binary undirected edges

Gephi

Armin Ronacher,
2 levels

Armin is in the center.
8k nodes, 11k edges.

3 levels

200k nodes, 300k edges.

Heatmap

Color depends on how many hops the dot is distant from the root.

Sequence of 4 circles with increasing redness

Rob Pike, 3 levels, zoomed

Oops

Actually Rob Pike did not contribute to cmars/oo and cmars/tools.

Author forgery is perfectly possible on GitHub.

Weighted directed edges

Adjacency matrix

Cij is the number of commits done by developer i to repository j

Cij = Cji

Non-zero elements: <10-7

Adjacency matrix

We normalize values in every column.

$$\sum_i C_{ij}=1$$

The matrix is no longer symmetric.

LargeVis

LargeVis

Resulting node coordinates - 264MB.

Problem

How to plot 22,000,000 dots?

Jupyter notebook.

Brute-force scatter plot

& transparency (0.1)

Heatmap (log)

Part II

Statement

Given that two developers "know" each other if they contributed to the same project, how far are they from each other on average, if there is a connection?

Plan

  1. Find all the connected components of the graph.
  2. Pick the core component, or simply “the core”. Calculate the size of the representative sample of the pairs of nodes.
  3. Calculate the distances between sampled node pairs.
  4. Plot the histogram, draw the conclusion.

Connected components

We run a series of graph traversals

Any type works: DFS, BFS, RFS

Python, 22M nodes - 5 min

Result: 10200912, 10229, ...

THE CORE

Core statistics

The size is 10200912 / 22000000 < 50% of the whole graph

33% of the developers; this is 11% of the whole GitHub userbase

Distances

There are \[\frac{N_{core}(N_{core} + 1)}{2}\] possible distance pairs

That is, 2.3 trillions (1012)

Do we need to evaluate them all? No!

Representative sample

Statistics for the win! Let's define our random variable as the distance between two random developers.

For sufficiently large Nsample, the distribution of \[{\displaystyle {\hat {p_d}}} = \frac{X_d}{N_{sample}}\] will be closely approximated by a normal distribution.

Sample size on Wikipedia.

Representative sample

Distances calculation

Which algorithm to choose?

We need to avoid the "generational explosion" as much as possible.

Result

Part III

Statement

What developers and repositories are the most important?

Importance

"Importance" is defined recursively: we gain ego if we contribute to repositories to which other important developers contribute.

\[x_ {AR} = \frac{w_ 1 x_ 1 + w_ 2 x_ 2 + w_ 3 x_ 3 + ...}{\lambda} = \frac{1}{\lambda} \sum w_ i x_ i\]

We do not distinguish developers and repositories here.

Return of the adjacency matrix

$$\sum_i C_{ij}=1$$$$C_{ij} \neq C_{ji}$$

Perron–Frobenius theorem

If \(C_{ij} > 0\) then \(\vec{x}\) is the first eigenvector of C and \(x_i\) is positive.

This is not our case: only 10-7 elements are positive, the rest are zeros.

Google in 1998

Larry and Sergey had exactly the same objective and problems with WWW.

Cij is 1 if web page i links to j and 0 otherwise.

They invented the trick to make C have all elements positive.

They called it PageRank.

PageRank

$$ G = \beta C + \frac{1-\beta}{N}\left( \begin{array}{cccc} 1 & 1 & ... & 1 \\ 1 & 1 & ... & 1 \\ ... & ... & \ddots & ... \\ 1 & 1 & ... & 1 \end{array} \right) $$

β is the dampening or random walk factor, usually 0.85.

Our matrix becomes acyclic and irreducible, aka the Google matrix.

Power iteration

An efficient algorithm to calculate the first eigenvector.

$$x_{i+1} = G\cdot x_i$$

Sparse friendly.

22M nodes need less than 50 iterations.

GitHub specific: there are no "dead ends".

Power iteration

        def page_rank(m, beta=0.85, niter=50):
    N = m.shape[0]
    x = numpy.ones(N, dtype=numpy.float32) / N
    for i in range(niter):
        x_next = m.dot(x) * beta + (1 - beta) / N
        xdiff = numpy.linalg.norm(x - x_next, ord=1)
        x = x_next
    return x
        

Convergence

Who is the most important on GitHub?

Ryan Baumann aka ryanfb (at the time the research was conducted - Jan 2017)

Who is the most important on GitHub?

Ryan created more than 1,000 repositories, and each repository gives a small amount of importance.

Link farm failure.

Identity matching is not perfect, so the leaderboard is polluted with such anomalies.

PageRank works to compare specific people.

Who is the most important on GitHub?

Person PageRank, 10-5
Linus Torvalds 2.7
Armin Ronacher 0.8
Maximo Cuadros 0.26
Vadim Markovtsev 0.16

What is the most important on GitHub?

  1. nguyendtu/patchwork
  2. coeligena/homebrew-customized-copy
  3. KenStanley/reflections
  4. torvalds/linux
  5. enkidevs/commit
  6. gentoo/wikiclone
  7. Homebrew/homebrew-core
  8. renoirb/test
  9. SopraConsulting/CocoapodsSpecs
  10. openSUSE/salt
  11. caskroom/homebrew-cask
  12. borisyankov/DefinitelyTyped
  13. openstack/openstack
  14. TheOdinProject/curriculum

Conclusion

Materials

Contacts

We are hiring!