Mining software development history
Vadim Markovtsev
Plan
Intro ➙ Couples ➙ Lines ➙ Time series ➙ Tools
Git
Source code
a = b * 2
So you have got graphs...
Diffs
*with "move" operation
Idea
- If developers often change the same files, there must be something in common.
- If files often appear in the same commits, there must be something in common.
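A minimal sketch of the second idea: counting how often two files land in the same commit. The input format (one list of file paths per commit) and the name `file_cooccurrences` are assumptions made for illustration, not part of any tool mentioned in this talk.

```python
from collections import Counter
from itertools import combinations


def file_cooccurrences(commits):
    """Count how often each pair of files is changed in the same commit.

    `commits` is an iterable of file-path collections, one per commit
    (a hypothetical input format chosen for this sketch).
    """
    pairs = Counter()
    for files in commits:
        # Sort and deduplicate so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy history: three commits touching overlapping sets of files.
history = [
    ["core.go", "core_test.go"],
    ["core.go", "core_test.go", "README.md"],
    ["README.md"],
]
couples = file_cooccurrences(history)
```

The same counting works for developers instead of commits: group the touched files by author and feed those groups in.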
word2vec
- Context: sentences
- Embedding: optimize Pointwise Mutual Information
- There are algorithms that work on an explicit co-occurrence matrix
Let's optimize
Pointwise Mutual Information
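For reference, a small sketch of the textbook PMI definition, PMI(i, j) = log(p(i, j) / (p(i) p(j))), estimated from raw co-occurrence counts. This is only the quantity being approximated; it is not the training loss of any particular embedding algorithm, and `pmi_matrix` is a name invented for this example.

```python
import math


def pmi_matrix(cooc):
    """Pointwise mutual information for a square co-occurrence matrix.

    Probabilities are estimated from raw counts.
    Pairs that never co-occur get -inf.
    """
    n = len(cooc)
    total = sum(sum(row) for row in cooc)
    row_sums = [sum(row) for row in cooc]
    pmi = [[float("-inf")] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if cooc[i][j] > 0:
                p_ij = cooc[i][j] / total
                p_i = row_sums[i] / total
                p_j = row_sums[j] / total
                pmi[i][j] = math.log(p_ij / (p_i * p_j))
    return pmi


# Toy symmetric co-occurrence counts for three files.
cooc = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
pmi = pmi_matrix(cooc)
```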
Problems
- Optimal vector size?
- Convergence is poor with fewer than 10k items
- Convergence is poor below 50 dimensions
- A GPU does not help with fewer than 100 items
t-SNE over the embeddings
Line burndown
git blame foo.go
● 2014
func foo() {
    println("bar")
}
● 2015
func foo() {
    println("bar")
}
func qux() {
    println("baz")
}
● 2016
func foo() {
    println("waldo")
}
const X = 10
func spam() {
    println("baz")
}
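A rough sketch of how such blame snapshots turn into burndown data: for every snapshot, count the surviving lines by the year they were last written. The per-line years below are an assumed reading of the three snapshots above (for example, that only the changed `println` in `foo` is attributed to 2016), and `burndown` is a hypothetical helper, not the hercules implementation.

```python
from collections import Counter


def burndown(snapshots):
    """Map each snapshot year to {origin year: number of surviving lines}."""
    return {when: dict(Counter(blame)) for when, blame in snapshots.items()}


# One entry per line of the file, holding the year that line was last written.
snapshots = {
    2014: [2014, 2014, 2014],
    2015: [2014, 2014, 2014, 2015, 2015, 2015],
    2016: [2014, 2016, 2014, 2016, 2016, 2015, 2015],
}
matrix = burndown(snapshots)
```

Stacking the per-year counts over time gives the burndown plot: each band shrinks as its lines get overwritten.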
Line burndown
Linux
Measuring ownership is hard
To a man with a hammer...
- Swivel on the similarity matrix
- t-SNE to visualize the embeddings
import sys
import smart_foo

def foo(x: Any) -> int:
    log("called foo %s", x)
    # now the hardcore part
    if smart_foo.complex_cond(x) < 50:
        return 50
    return 0
Diff 🡒 commit classification
CPython contributors
*some are hidden, e.g. Guido
Use cases
- How many active contributors are there today?
- How did contributors come and go?
- (enterprise) Who worked with whom?
Plan
- Define the distance
- Order by distance (1D)
- Cluster by distance (2D)
- Query nearest neighbors
Dynamic Time Warping
Fast Dynamic Time Warping
Original complexity: O(n²)
Approximation's complexity: O(n)
Python package: fastdtw
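For intuition, a sketch of the exact quadratic algorithm with an absolute-difference cost. This is the plain dynamic program, not the linear-time approximation, and it does not use the fastdtw package. The three toy series also show that DTW is not a true metric: it can violate the triangle inequality.

```python
def dtw(a, b):
    """Exact dynamic time warping distance, O(len(a) * len(b)) time."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy series chosen so that dtw(x, z) > dtw(x, y) + dtw(y, z).
x, y, z = [0], [1, 2], [2, 2, 2]
d_xy, d_yz, d_xz = dtw(x, y), dtw(y, z), dtw(x, z)
```

Here the direct distance from x to z is 6, while the detour through y costs 3 + 1 = 4, so anything that assumes a metric space should be used with care.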
Agglomerative clustering?
seaborn.clustermap?
Seriation
- Ideal for heatmaps and 1D ordering of series
- Python package: seriate
- Is not always better than agglomerative clustering
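To illustrate the goal of seriation (an ordering in which neighbors are close), here is a sketch that swaps in a simple greedy nearest-neighbor heuristic. It is not how the seriate package works and gives no optimality guarantee; `greedy_order` and `path_length` are names made up for this example.

```python
def greedy_order(dist, start=0):
    """Order items by repeatedly jumping to the closest unvisited item."""
    order = [start]
    remaining = set(range(len(dist))) - {start}
    while remaining:
        last = order[-1]
        # Break distance ties by index to keep the result deterministic.
        nxt = min(remaining, key=lambda k: (dist[last][k], k))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_length(dist, order):
    """Sum of distances between consecutive items in the ordering."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


# Toy 1D data: the distance matrix is the absolute difference of the values.
points = [0, 10, 1, 11, 2]
dist = [[abs(p - q) for q in points] for p in points]
order = greedy_order(dist)
```

On this toy input the total distance between neighbors drops from 38 in the original order to 11 after reordering, which is what makes heatmap rows look smooth.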
HDBSCAN
aka I could not be bothered with clustering any further
Beyond 1D
- Both t-SNE and UMAP work with an explicit distance matrix
- t-SNE works better for 2D (kNN+rank loss)
- UMAP works better for 3+D
- Best embedding dimensionality: 14 (85% 20NN)
github.com/src-d/gitbase
SELECT refs.repository_id
FROM refs
NATURAL JOIN commits
WHERE commits.commit_author_name = 'Johnny Bravo'
AND refs.ref_name = 'HEAD';
Imports for each commit
SELECT repository_id, commit_hash, file_path,
uast_extract(uast(blob_content,
language(file_path),
'//uast:Import/Path'),
"Value") AS imports
FROM commit_files NATURAL JOIN blobs
WHERE language(file_path) = 'Go'
AND array_length(imports) > 0;
github.com/src-d/hercules
- Command-line playground for trying new ideas
- Data processing: Go (go-git)
- Data analysis: Python
Examples
# Lines evolution
hercules --burndown | labours -m burndown-project
# Couples
hercules --couples | labours -m couples
# Commit time series
hercules --devs | labours -m devs
Why?
- What to clone?
- Cloning many repositories is hard
- We don't always need source code
GDPR
Software Heritage Graph Dataset
Public Git Archive
Summary
- Embed similarity matrices with Swivel
- Embed distance matrices with t-SNE (2-3D) and UMAP
- Compare time series with Fast Dynamic Time Warping
- But remember the triangle inequality
- Order sequences and heatmaps with Seriation
- Mining software development history is fun
source{d} Community Edition