Mining software development history: approaches and challenges

Vadim Markovtsev

Plan

Intro ➙ Couples ➙ Lines ➙ Time series ➙ Tools

Introduction

Git

Source code

a = b * 2

Graphs

graphs everywhere

So you have got graphs...

Measures, e.g. PageRank
- SourceCred
Node embeddings
- node2vec
Clustering, community detection
Gated Graph Neural Networks
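
As an illustration of a graph measure, here is a minimal power-iteration PageRank sketch in plain Python. The edge list, the node names and the damping factor of 0.85 are assumptions made for the example; this is not how SourceCred computes its scores.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # Dangling nodes (no outgoing edges) spread their rank evenly.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical graph: developers point to files, files point to their imports.
edges = [
    ("alice", "core.go"),
    ("bob", "core.go"),
    ("bob", "util.go"),
    ("core.go", "util.go"),
]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:8s} {score:.3f}")
```

The scores sum to 1, and the node with the most incoming "attention" ranks highest.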

Diffs

By line: O(N×D)
By tree node: O(N×D)
By tree node which you really want*: NP-hard, active research

*with "move" operation
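
A line diff can be tried out with Python's standard `difflib`. This is only a sketch on two hypothetical revisions of a Go function; note that `difflib` uses its own matching heuristic, not necessarily the O(N×D) algorithm mentioned above.

```python
import difflib

# Two hypothetical revisions of the same file, split into lines.
old = ['func foo() {', '  println("bar")', '}']
new = ['func foo() {', '  println("waldo")', '}']

# lineterm="" because the input lines carry no trailing newlines.
diff = list(difflib.unified_diff(
    old, new, fromfile="a/foo.go", tofile="b/foo.go", lineterm=""))
print("\n".join(diff))
```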

AST-annotated line diffs

Couples

Idea

If developers often change the same files, there must be something in common.
If files often appear in the same commits, there must be something in common.
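
The second idea can be sketched as a co-occurrence count over commits. The commit contents below are invented for the example, and `file_couples` is a hypothetical helper, not the hercules implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical history: each commit is the list of files it touched.
commits = [
    ["a.go", "b.go"],
    ["a.go", "b.go", "c.go"],
    ["c.go", "d.go"],
]


def file_couples(commits):
    """Count how often each unordered pair of files shares a commit."""
    couples = Counter()
    for files in commits:
        # sorted(set(...)) gives every pair one canonical key.
        for left, right in combinations(sorted(set(files)), 2):
            couples[(left, right)] += 1
    return couples


couples = file_couples(commits)
for pair, count in couples.most_common():
    print(pair, count)
```

The same counting works for developers if each "commit" is replaced with the set of people who changed a given file.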

word2vec

Context: sentences
Embedding: optimize Pointwise Mutual Information
There are algorithms for explicit co-occurrence matrix
- GloVe
- Swivel

Let's optimize

Pointwise Mutual Information
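
For reference, the usual definition is

$$ \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} $$

A minimal sketch of computing it from pair counts follows. It assumes a symmetric co-occurrence matrix without a diagonal, stored as one count per unordered pair; the counts themselves are made up.

```python
import math
from collections import Counter


def pmi(couples):
    """PMI for {(a, b): count}, each unordered pair stored once."""
    totals = Counter()
    for (a, b), n in couples.items():
        totals[a] += n
        totals[b] += n
    # Sum of all entries of the symmetric matrix (each pair counted twice).
    grand_total = sum(totals.values())
    return {
        (a, b): math.log(n * grand_total / (totals[a] * totals[b]))
        for (a, b), n in couples.items()
    }


# Hypothetical co-occurrence counts.
pair_counts = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 1}
scores = pmi(pair_counts)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Rare items that always appear together score higher than frequent items that meet occasionally, which is the point of normalizing by the marginals.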

Problems

Optimal vector size?
Convergence is poor with fewer than 10k items
Convergence is poor with fewer than 50 dimensions
GPU is useless with fewer than 100 items

t-SNE over the embeddings

Lines

Line burndown

git blame foo.go

●2014

func foo() {
  println("bar")
}

●2015

func foo() {
  println("bar")
}
func qux() {
  println("baz")
}

●2016

func foo() {
  println("waldo")
}
const X = 10
func spam() {
  println("baz")
}
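
The three snapshots above can be turned into burndown numbers with a small sketch. It assumes that line ownership is propagated by matching unchanged lines between consecutive snapshots with `difflib`; real blame works on commit diffs and may attribute lines differently.

```python
import difflib
from collections import Counter


def propagate_birth_years(old_lines, old_years, new_lines, year):
    """Lines matched to the previous snapshot keep their birth year."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    new_years = [year] * len(new_lines)  # unmatched lines are born now
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            new_years[block.b + offset] = old_years[block.a + offset]
    return new_years


snapshots = {
    2014: ['func foo() {', '  println("bar")', '}'],
    2015: ['func foo() {', '  println("bar")', '}',
           'func qux() {', '  println("baz")', '}'],
    2016: ['func foo() {', '  println("waldo")', '}', 'const X = 10',
           'func spam() {', '  println("baz")', '}'],
}

lines, years, burndown = [], [], {}
for year, new_lines in sorted(snapshots.items()):
    years = propagate_birth_years(lines, years, new_lines, year)
    lines = new_lines
    # How many surviving lines were born in each year.
    burndown[year] = dict(Counter(years))

for year, bands in burndown.items():
    print(year, bands)
```

Each row is one vertical slice of the burndown plot: the bands are the counts of surviving lines grouped by the year they were written.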

Line burndown

Linux

Kaplan-Meier, Erik Bernhardsson
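
A minimal sketch of the Kaplan-Meier estimator applied to line lifetimes. It assumes each line has a lifetime in years and a flag telling whether its deletion was actually observed; lines still alive at the end of the history are censored. The sample data is invented.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for every time with at least one observed death."""
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = leaving = 0
        # Group all lines whose lifetime ends (or is censored) at time t.
        while i < len(events) and events[i][0] == t:
            deaths += int(events[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored lines leave the risk set silently
    return curve


# Hypothetical lifetimes in years; False means the line is still alive.
lifetimes = [1, 2, 2, 3]
deleted = [True, True, False, True]
curve = kaplan_meier(lifetimes, deleted)
for t, s in curve:
    print(t, round(s, 3))
```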

Ownership

Measuring ownership is hard

Overwrites

To a man with a hammer...

Swivel on the similarity matrix
t-SNE to visualize the embeddings

Lines importance

import sys
import smart_foo
 
def foo(x: Any) -> int:
    log("called foo %s", x)
    # now the hardcore part
    if smart_foo.complex_cond(x) < 50:
        return 50
    return 0

Diff 🡒 commit classification

Time series

CPython contributors

*some are hidden, e.g. Guido

CPython contributors

Use cases

How many active contributors are there today?
How did contributors come and go?
(enterprise) Who worked with whom?

Plan

Define the distance
Order by distance (1D)
Cluster by distance (2D)
Query nearest neighbors

Euclidean distance

d₁ = d₂

Dynamic Time Warping

Fast Dynamic Time Warping

Original complexity: O(n²)

Approximation's complexity: O(n)

Python package: fastdtw
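
For comparison with the approximation, here is a sketch of the original O(n²) dynamic program in plain Python. It assumes scalar series and the absolute difference as the per-step cost; it does not use the fastdtw package. The last lines show why the result is a "distance" only in quotes.

```python
def dtw_distance(x, y):
    """Classic dynamic time warping with |a - b| as the local cost."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] is the best alignment cost of x[:i] and y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # x advances, y element is repeated
                cost[i][j - 1],      # y advances, x element is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same shape, shifted in time: DTW sees no difference.
shifted = dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])

# Triangle inequality check on three tiny hypothetical series.
a, b, c = [0], [0, 1], [1, 1, 1]
d_ab, d_bc, d_ac = dtw_distance(a, b), dtw_distance(b, c), dtw_distance(a, c)
print(shifted, d_ab, d_bc, d_ac)
```

Here d(a, c) exceeds d(a, b) + d(b, c), so algorithms that rely on a true metric need care.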

"Distance" matrix: O(n²)

Agglomerative clustering?

`seaborn.clustermap?`

Make Patterns Pop Out of Heatmaps with Seriation
by Nicolas Kruchten

Let's get scientific

$$ \min \sum_{i} d_{i,i+1} $$

Seriation

Ideal for heatmaps and 1D ordering of series
Python package: seriate
Is not always better than agglomerative clustering
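
The objective of minimizing the sum of distances between neighbors can be illustrated by brute force on a tiny matrix. This sketch tries every permutation, so it is only feasible for a handful of items; it is not how the seriate package solves the problem. The points are hypothetical.

```python
from itertools import permutations


def seriate_brute_force(dist):
    """Find the order that minimizes the sum of adjacent distances."""
    n = len(dist)

    def path_cost(order):
        return sum(dist[order[i]][order[i + 1]] for i in range(n - 1))

    best = min(permutations(range(n)), key=path_cost)
    return list(best), path_cost(best)


# Hypothetical items lying on a line, given in shuffled order.
positions = [0, 3, 1, 2]
dist = [[abs(p - q) for q in positions] for p in positions]
order, cost = seriate_brute_force(dist)
print(order, cost)
```

The recovered order walks the points from one end of the line to the other, which is what makes the block structure of a heatmap visible.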

HDBSCAN

aka I could not be bothered more by clustering

Beyond 1D

Both t-SNE and UMAP work with an explicit distance matrix
t-SNE works better for 2D (kNN+rank loss)
UMAP works better for 3+D
Best embedding dimensionality: 14 (85% 20NN)

Tools

Code Lake

github.com/src-d/gitbase

SELECT refs.repository_id
FROM refs
NATURAL JOIN commits
WHERE commits.commit_author_name = 'Johnny Bravo'
    AND refs.ref_name = 'HEAD';

Parsing code

Imports for each commit

SELECT repository_id, commit_hash, file_path,
    uast_extract(uast(blob_content,
                      language(file_path),
                      '//uast:Import/Path'),
                 "Value") AS imports
FROM commit_files NATURAL JOIN blobs
WHERE language(file_path) = 'Go'
    AND array_length(imports) > 0;

github.com/src-d/hercules

Command-line playground for trying new ideas
Data processing: Go (go-git)
Data analysis: Python

Examples

# Lines evolution
hercules --burndown | labours -m burndown-project

# Couples
hercules --couples | labours -m couples

# Commit time series
hercules --devs | labours -m devs

Datasets

Why?

What to clone?
Cloning many repositories is hard
We don't always need source code

GDPR ★ ★ ★ ★ ★ ★ ★ ★ ★ ★ ★ ★

Software Heritage Graph Dataset

85 million repositories, 1.1 billion unique commits
PostgreSQL, Parquet
zenodo.org/record/2583978

Public Git Archive

260k repositories
Git packfiles
≥50🟊 on GitHub (early 2019)
pga.sourced.tech

Summary

Embed similarity matrices with Swivel
Embed distance matrices with t-SNE (2-3D) and UMAP
Compare time series with Fast Dynamic Time Warping
- But remember that DTW does not satisfy the △ inequality
Order sequences and heatmaps with Seriation
Mining software development history is fun

Mining Software Repositories

msrconf.org

source{d} Community Edition

sourced.tech/products/community-edition/

Thank you

bit.ly/2JwO2f7
