Vadim Markovtsev, source{d}.

Vadim Markovtsev, source{d}

- Lead Engineer, Machine Learning @ source{d}
- Education: compiler technologies
- Switched to ML/DS in 2013
- Python, C/C++, Go, HTML5

- < 20 people in Madrid, Spain
- Developed src-d/go-git, Git implementation in Go
- Fetch all repositories from GitHub/BitBucket/etc.
- Do crazy ML and DS on top of them
- Project Babelfish extracts Abstract Syntax Trees from them

- Based on the blog post
- 3 parts
- Math inside
- Python code inside
- Data inside

5.8M+

active users

Like @acabunoc, Lead Developer for Open Source Engagement at the Mozilla Foundation who is developing new ways to engage contributors on open source projects.

331k+

active organizations

Like the Dat project who are working openly to develop a peer to peer tool for distributing datasets in science, journalism and government.

19.4M+

active repositories

Like the Federal Source Code Policy on GitHub released by The White House, who this year solicited community feedback on GitHub.

10.7M+

active issues

Like this one opened in the Go programming language repo that used open source data on Google’s BiqQuery to help decide whether to add new functionality to a Go standard library.

Commit datasets:

- GHTorrent (145M)
- commits table in Google BigQuery - equivalent to GHTorrent
- GitHub Archive (stream, unknown)
- source{d}'s 452M dataset with de-fuzzy-forking

BigQuery will charge you money.

There are no GitHub user <-> email associations in both datasets.

We need to do identity matching.

`mcuadros@gmail.com`

`maximo@sourced.tech`

`mcuadros@maximos-macbook.local`

`mcuadros@maximos-mac-pro.local`

`root@localhost`

- Binary undirected edges
- Weighted directed edges
- Implies contribution volume estimation

At large scale, it is useless to distinguish developers and repositories.

Binary undirected edges

- gephi.org
- Open source, Java
- Most popular graph visualization software, has GUI
- And very... special
- Does not handle millions of nodes

2 levels

Armin is in the center.

8k nodes, 11k edges.

200k nodes, 300k edges.

Color depends on how many hops the dot is distant from the root.

Actually Rob Pike did not contribute to cmars/oo and cmars/tools.

Author forgery is perfectly possible on GitHub.

Weighted directed edges

C_{ij} is the number of commits done by developer i to repository j

C_{ij} = C_{ji}

Non-zero elements: <10^{-7}

We normalize values in every column.

$$\sum_i C_{ij}=1$$

The matrix is no longer symmetric.

- Developer: measure of attention
- Repository: measure if impact

- lferry007/LargeVis
- C++, Python
- Handles millions of nodes
- Unmaintained

- Write every non-zero matrix element (edge) to a text file
- Apply LargeVis command-line application
- 23M nodes, 28M edges, 32 cores: <64GB RAM, 2 days
- Plot the resulting node coordinates

How to plot 22,000,000 dots?

- Brute-force scatter plot
- Brute-force scatter plot with semi-transparent dots
- Heatmap density estimatiion
- numpy.histogram2d + matplotlib.pyplot.imshow
- matplotlib.pyplot.hexbin

Given that two developers "know" each other if they contributed to the same project, how far are they from each other on average, if there is a connection?

- Find all the connected components of the graph.
- Pick the core component, or simply “the core”. Calculate the size of the representative sample of the pairs of nodes.
- Calculate the distances between sampled node pairs.
- Plot the histogram, draw the conclusion.

We run a series of graph traversals

Any type works: DFS, BFS, RFS

Python, 22M nodes - 5 min

Result: 10200912, 10229, ...

The size is 10200912 / 22000000 < 50% of the whole graph

33% of the developers; this is 11% of the whole GitHub userbase

There are \[\frac{N_{core}(N_{core} + 1)}{2}\] possible distance pairs

That is, 2.3 trillions (10^{12})

Do we need to evaluate them all? No!

Statistics for the win! Let's define our random variable as the distance between two random developers.

For sufficiently large N_{sample}, the distribution of \[{\displaystyle {\hat {p_d}}} = \frac{X_d}{N_{sample}}\] will be closely approximated by a normal distribution.

- Margin of error: 1% (lower is better)
- Confidence level: 95% (higher is better)
- Result: 10,000 evaluations

Which algorithm to choose?

- Floyd–Warshall works in O(V
^{3}) time and eats O(V^{2}) memory - Dijkstra is O(E + V log V) time and O(V) space
- ▶ Bidirectional search - similar to Dijkstra, but optimized for pairs ◀

We need to avoid the "generational explosion" as much as possible.

What developers and repositories are the most important?

"Importance" is defined recursively: we gain ego if we contribute to repositories to which other important developers contribute.

\[x_ {AR} = \frac{w_ 1 x_ 1 + w_ 2 x_ 2 + w_ 3 x_ 3 + ...}{\lambda} = \frac{1}{\lambda} \sum w_ i x_ i\]

We do not distinguish developers and repositories here.

$$\sum_i C_{ij}=1$$$$C_{ij} \neq C_{ji}$$

If \(C_{ij} > 0\) then \(\vec{x}\) is the first eigenvector of C and \(x_i\) is positive.

This is not our case: only 10^{-7} elements are positive, the rest are zeros.

Larry and Sergey had exactly the same objective and problems with WWW.

C_{ij} is 1 if web page i links to j and 0 otherwise.

They invented the trick to make C have all elements positive.

They called it PageRank.

$$ G = \beta C + \frac{1-\beta}{N}\left( \begin{array}{cccc} 1 & 1 & ... & 1 \\ 1 & 1 & ... & 1 \\ ... & ... & \ddots & ... \\ 1 & 1 & ... & 1 \end{array} \right) $$

β is the dampening or random walk factor, usually 0.85.

Our matrix becomes acyclic and irreducible, aka the Google matrix.

An efficient algorithm to calculate the first eigenvector.

$$x_{i+1} = G\cdot x_i$$

Sparse friendly.

22M nodes need less than 50 iterations.

GitHub specific: there are no "dead ends".

`def page_rank(m, beta=0.85, niter=50):`

`N = m.shape[0]`

`x = numpy.ones(N, dtype=numpy.float32) / N`

`for i in range(niter):`

`x_next = m.dot(x) * beta + (1 - beta) / N`

`xdiff = numpy.linalg.norm(x - x_next, ord=1)`

`x = x_next`

`return x`

Ryan Baumann aka ryanfb (at the time the research was conducted - Jan 2017)

Ryan created more than 1,000 repositories, and each repository gives a small amount of importance.

Link farm failure.

Identity matching is not perfect, so the leaderboard is polluted with such anomalies.

PageRank works to compare specific people.

Person | PageRank, 10^{-5} |
---|---|

Linus Torvalds | 2.7 |

Armin Ronacher | 0.8 |

Maximo Cuadros | 0.26 |

Vadim Markovtsev | 0.16 |

- nguyendtu/patchwork
- coeligena/homebrew-customized-copy
- KenStanley/reflections
- torvalds/linux
- enkidevs/commit
- gentoo/wikiclone
- Homebrew/homebrew-core
- renoirb/test
- SopraConsulting/CocoapodsSpecs
- openSUSE/salt
- caskroom/homebrew-cask
- borisyankov/DefinitelyTyped
- openstack/openstack
- TheOdinProject/curriculum

- GitHub contributions graph can be visualized and has a structure
- 11% of GitHub users form the (strongly) connected core
- 85% of the core is ≤6 hops from each other
- PageRank allows for measuring the socially validated impact

- Dataset on data.world, pickled graph
- Code: visualization, the rest in the blog post
- CS246
- Papers: 1 (130k nodes, CSMR 2013), 2 (? nodes, 2016)
- This presentation - shwr

We are hiring!