Mining software development history: approaches and challenges

Vadim Markovtsev, source{d}.

Plan

Intro    Couples    Lines    Time series    Tools

Introduction

Git

Source code

a = b * 2

Graphs

graphs everywhere

So you have got graphs...

Diffs

*with "move" operation

AST-annotated line diffs

Couples

Idea

word2vec

Let's optimize

Pointwise Mutual Information
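
As a rough illustration of pointwise mutual information, PMI(a, b) = log(p(a, b) / (p(a) p(b))), the sketch below scores pairs of files by how often they change in the same commit. It assumes the coupled items are files and that probabilities are estimated as fractions of commits; the function name `pmi_from_commits` and the toy history are hypothetical and not taken from the talk.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_from_commits(commits):
    """Return {(a, b): PMI} for every pair of files that co-occur in a commit."""
    n = len(commits)
    single = Counter()  # number of commits touching each file
    pair = Counter()    # number of commits touching both files of a pair
    for files in commits:
        unique = sorted(set(files))
        single.update(unique)
        pair.update(combinations(unique, 2))
    result = {}
    for (a, b), n_ab in pair.items():
        # p(a,b) / (p(a) p(b)) = (n_ab / n) / ((n_a / n) (n_b / n))
        result[(a, b)] = math.log(n_ab * n / (single[a] * single[b]))
    return result


# Hypothetical toy history: each commit is the list of files it changed.
commits = [
    ["api.go", "api_test.go"],
    ["api.go", "api_test.go"],
    ["README.md"],
    ["api.go", "README.md"],
]
scores = pmi_from_commits(commits)
for files, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(files, round(score, 3))
```

A positive score means the two files change together more often than independent edits would predict; a negative score means less often.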

Problems

t-SNE over the embeddings

Lines

Line burndown

git blame foo.go

2014
func foo() {
  println("bar")
}
2015
func foo() {
  println("bar")
}
func qux() {
  println("baz")
}
2016
func foo() {
  println("waldo")
}
const X = 10
func spam() {
  println("baz")
}
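
The three snapshots above can be turned into burndown counts with a toy blame. This is only a sketch: it uses Python's `difflib` as a stand-in for a real diff and blame implementation, and the helper name `blame_years` is hypothetical. Each line carries the year in which it first appeared, and lines matched between consecutive snapshots inherit that year.

```python
from collections import Counter
from difflib import SequenceMatcher


def blame_years(snapshots):
    """snapshots: list of (year, lines). Yields (year, origin year of each line)."""
    prev_lines, prev_years = [], []
    for year, lines in snapshots:
        years = [year] * len(lines)  # by default a line is new in this snapshot
        matcher = SequenceMatcher(a=prev_lines, b=lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # unchanged lines keep the year they were introduced in
                years[j1:j2] = prev_years[i1:i2]
        yield year, years
        prev_lines, prev_years = lines, years


snapshots = [
    (2014, ['func foo() {', '  println("bar")', '}']),
    (2015, ['func foo() {', '  println("bar")', '}',
            'func qux() {', '  println("baz")', '}']),
    (2016, ['func foo() {', '  println("waldo")', '}', 'const X = 10',
            'func spam() {', '  println("baz")', '}']),
]
burndown = {year: Counter(years) for year, years in blame_years(snapshots)}
for year, counts in burndown.items():
    print(year, dict(sorted(counts.items())))
```

Each row is one vertical slice of a burndown plot: how many of the lines alive in that year were written in each earlier year.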

Linux

Kaplan-Meier, Erik Bernhardsson
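
For reference, the Kaplan-Meier estimator is S(t) = product over t_i <= t of (1 - d_i / n_i), where d_i is the number of deaths at time t_i and n_i is the number still at risk just before it. Below is a minimal sketch applied to line lifetimes. It assumes a "death" is a line being deleted or overwritten and that lines still alive at the end of the history are censored; the lifetimes are made-up numbers, not results from the talk.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at every time with at least one observed death.

    durations: lifetime of each line (for example in days)
    observed:  True if the line was deleted, False if still alive (censored)
    """
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    leaving = Counter(durations)  # deaths plus censored lines exiting at t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t]:
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical lifetimes of six lines; two of them are still alive (censored).
durations = [30, 30, 90, 200, 365, 365]
observed = [True, True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
```

Censored lines never count as deaths, but they do shrink the at-risk set, which is what separates this estimate from a plain fraction of deleted lines.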

Ownership

Measuring ownership is hard

Overwrites

To a man with a hammer...

  1. Swivel on the similarity matrix
  2. t-SNE to visualize the embeddings

Line importance

import sys
import smart_foo
 
def foo(x: Any) -> int:
    log("called foo %s", x)
    # now the hardcore part
    if smart_foo.complex_cond(x) < 50:
        return 50
    return 0

Diff 🡒 commit classification

Time series

CPython contributors

*some are hidden, e.g. Guido

Use cases

Plan

  1. Define the distance
  2. Order by distance (1D)
  3. Cluster by distance (2D)
  4. Query nearest neighbors

Euclidean distance

d1 = d2

Dynamic Time Warping
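
A minimal sketch of why warping helps when comparing activity time series: Euclidean distance compares points index by index, while dynamic time warping may stretch the time axis before comparing. This is the textbook O(n^2) dynamic program, not the fastdtw package, and the three series are invented for illustration.

```python
import math


def euclidean(a, b):
    """Point-by-point distance between two series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic dynamic time warping cost with |x - y| as the local distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: repeat a[i-1], repeat b[j-1], advance both
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical commit-count series: the same burst of activity, two time steps apart.
spike = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 3, 1, 0]
flat = [0] * 8

print(euclidean(spike, shifted), euclidean(spike, flat))
print(dtw(spike, shifted), dtw(spike, flat))
```

On this toy data Euclidean distance rates the spike as closer to the flat series than to its own shifted copy, while DTW assigns the shifted copy a cost of zero.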

Fast Dynamic Time Warping

Original complexity: O(n^2)

Approximation's complexity: O(n)

Python package:  fastdtw

"Distance" matrix: O(n^2)

Agglomerative clustering?

seaborn.clustermap?

Make Patterns Pop Out of Heatmaps with Seriation
by Nicolas Kruchten

Let's get scientific

$$ \min \sum_{i} d_{i,i+1} $$
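
Minimizing the sum of distances between neighbors in an ordering amounts to finding the shortest path that visits every item once. The sketch below shows two simple ways to do that on a tiny distance matrix: exhaustive search, which is only feasible for a handful of items, and a greedy nearest-neighbor heuristic. Both are illustrative stand-ins rather than the solver used in the talk, and the function names and data are hypothetical.

```python
from itertools import permutations


def path_length(order, dist):
    """Sum of distances between consecutive items: the objective to minimize."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def seriate_bruteforce(dist):
    """Exact ordering by trying every permutation (O(n!), tiny inputs only)."""
    n = len(dist)
    return min(permutations(range(n)), key=lambda order: path_length(order, dist))


def seriate_greedy(dist, start=0):
    """Heuristic: repeatedly append the unvisited item closest to the last one."""
    order = [start]
    remaining = set(range(len(dist))) - {start}
    while remaining:
        nxt = min(remaining, key=lambda j: dist[order[-1]][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical items placed on a line, so the ideal order is simply sorted.
positions = [0, 5, 1, 4, 2]
dist = [[abs(p - q) for q in positions] for p in positions]

best = seriate_bruteforce(dist)
greedy = seriate_greedy(dist)
print(best, path_length(best, dist))
print(greedy, path_length(greedy, dist))
```

The greedy result depends on the starting item and is not optimal in general; for larger inputs a dedicated traveling-salesman solver is the usual choice.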

Seriation

HDBSCAN

aka I could not be bothered more by clustering

Beyond 1D

Tools

Code Lake

github.com/src-d/gitbase

SELECT refs.repository_id
FROM refs
NATURAL JOIN commits
WHERE commits.commit_author_name = 'Johnny Bravo'
    AND refs.ref_name = 'HEAD';

Parsing code

Imports for each commit

SELECT repository_id, commit_hash, file_path,
    uast_extract(uast(blob_content,
                      language(file_path),
                      '//uast:Import/Path'),
                 "Value") AS imports
FROM commit_files NATURAL JOIN blobs
WHERE language(file_path) = 'Go'
    AND array_length(imports) > 0;

github.com/src-d/hercules

Examples

# Lines evolution
hercules --burndown | labours -m burndown-project
# Couples
hercules --couples | labours -m couples
# Commit time series
hercules --devs | labours -m devs

Datasets

Why?

GDPR

Software Heritage Graph Dataset

Public Git Archive

Summary

Mining Software Repositories

msrconf.org

source{d} Community Edition

sourced.tech/products/community-edition/

Thank you

bit.ly/2JwO2f7