Code as Data and Machine Learning on Source Code

Vadim Markovtsev, source{d}.

#CodeAsData
#MLonCode

Vadim Markovtsev

@vadimlearning

Plan

Code as Data    MLonCode    Tools

Code as Data

Source code

a = b * 2

Git

Graphs

graphs everywhere

So you have got graphs...

Huge codebases

LoC, 10⁶
Chrome 20
Windows 10 50
Facebook 70
Eclipse Foundation 160
Apache Foundation 190
Google 2,000
GitHub >56,000

Code Lake

github.com/src-d/gitbase

SELECT refs.repository_id
FROM refs
NATURAL JOIN commits
WHERE commits.commit_author_name = 'Johnny Bravo'
    AND refs.ref_name = 'HEAD';

Parsing code

Imports for each commit

SELECT repository_id, commit_hash, file_path,
    uast_extract(uast(blob_content,
                      language(file_path),
                      '//uast:Import/Path'),
                 "Value") AS imports
FROM commit_files NATURAL JOIN blobs
WHERE language(file_path) = 'Go'
    AND array_length(imports) > 0;

Diffs

*with "move" operation

AST-annotated line diffs

Flexible, interoperable, fast: choose any two

Flexible Interoperable Fast
Spark
Gitbase
PostgreSQL 😐
ClickHouse
* graph DB ?

1 GB Git -> 1.5 TB SQL

Flattened AST

SELECT r.repo, meta.stars FROM (
    SELECT DISTINCT repo FROM uasts WHERE lang='go'
    AND value LIKE 'gopkg.in/src-d/%'
    AND type='Import' AND repo NOT LIKE 'src-d/%'
) r JOIN meta ON r.repo=meta.repo;

sourced.tech/products/community-edition/

MLonCode

Couples

Idea

word2vec

Let's optimize

Pointwise Mutual Information
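
A minimal sketch of pointwise mutual information, PMI(a, b) = log(p(a, b) / (p(a) p(b))), computed from co-occurrence counts. It assumes that "couples" means files changed in the same commit; the commit list and file names below are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi(commits):
    """PMI for every pair of files that were changed together at least once."""
    single = Counter()  # number of commits touching each file
    pair = Counter()    # number of commits touching both files of a pair
    for files in commits:
        files = sorted(set(files))
        single.update(files)
        pair.update(combinations(files, 2))
    n = len(commits)
    # p(a, b) = pair / n and p(a) = single / n, hence the ratio below
    return {
        (a, b): math.log(count * n / (single[a] * single[b]))
        for (a, b), count in pair.items()
    }

# Hypothetical history: each inner list is one commit
commits = [
    ["a.go", "a_test.go"],
    ["a.go", "a_test.go"],
    ["b.go"],
    ["a.go", "b.go"],
]
scores = pmi(commits)
```

A positive score means the two files change together more often than independent changes would predict; a negative score means less often.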

t-SNE over the embeddings

Lines

Line burndown

git blame foo.go

2014
func foo() {
  println("bar")
}
2015
func foo() {
  println("bar")
}
func qux() {
  println("baz")
}
2016
func foo() {
  println("waldo")
}
const X = 10
func spam() {
  println("baz")
}
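
The three snapshots above can be turned into burndown numbers with a toy blame. This is only a sketch built on Python's `difflib`, not how `git blame` or hercules work internally: a line keeps its birth year if it survives unchanged between snapshots, otherwise it is attributed to the year of the snapshot.

```python
from collections import Counter
from difflib import SequenceMatcher

def burndown(snapshots):
    """snapshots: [(year, [line, ...]), ...] in chronological order.

    Returns [(year, Counter{birth_year: surviving_lines}), ...].
    """
    year, lines = snapshots[0]
    births = [year] * len(lines)
    history = [(year, Counter(births))]
    for year, new_lines in snapshots[1:]:
        new_births = [year] * len(new_lines)  # new until matched to an old line
        matcher = SequenceMatcher(a=lines, b=new_lines, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_births[block.b + k] = births[block.a + k]
        lines, births = new_lines, new_births
        history.append((year, Counter(births)))
    return history

snapshots = [
    (2014, ['func foo() {', '  println("bar")', '}']),
    (2015, ['func foo() {', '  println("bar")', '}',
            'func qux() {', '  println("baz")', '}']),
    (2016, ['func foo() {', '  println("waldo")', '}', 'const X = 10',
            'func spam() {', '  println("baz")', '}']),
]
history = burndown(snapshots)
```

Each entry of `history` is one column of the burndown chart: how many lines written in each year are still alive at that snapshot.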

Line burndown

Linux

Kaplan-Meier, Erik Bernhardsson
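
For reference, a minimal sketch of the Kaplan-Meier estimator, S(t) = Π (1 - d_i / n_i) over event times t_i ≤ t, applied to line lifetimes. The lifetimes below are invented; a line that is still alive at the end of the history is treated as censored (observed = 0).

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability), ...] at each time with a death.

    durations: lifetime of each line; observed: 1 if the line was deleted,
    0 if it was still alive when the observation ended (censored).
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = removed = 0
        # Group all lines that leave the study at the same time t
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical lifetimes in years; the third and fifth lines are still alive
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored lines lower the number at risk without counting as deaths, which is why the estimator suits code that has not been deleted yet.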

Ownership

Overwrites

To a man with a hammer...

  1. Swivel on the similarity matrix
  2. t-SNE to visualize the embeddings

Time series

CPython contributors

*some are hidden, e.g. Guido

Use cases

Plan

  1. Define the distance
  2. Order by distance (1D)
  3. Cluster by distance (2D)
  4. Query nearest neighbors

Euclidean distance

d1 = d2

Dynamic Time Warping
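
A minimal sketch of classic dynamic-programming DTW next to the Euclidean distance. The two series are made up: the same activity bump shifted by one step. The local cost |x - y| is an assumption; other costs work the same way.

```python
import math

def euclidean(a, b):
    """Point-by-point distance; only meaningful for equal lengths."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Quadratic-time DTW with |x - y| as the local cost."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[-1][-1]

early = [0, 1, 2, 1, 0, 0]
late = [0, 0, 1, 2, 1, 0]
print(euclidean(early, late))  # 2.0
print(dtw(early, late))        # 0.0
```

Euclidean distance penalizes the shift, while DTW aligns the two bumps and reports no difference in shape.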

Fast Dynamic Time Warping

Original complexity: O(n²)

Approximation's complexity: O(n)

Python package:  fastdtw

"Distance" matrix: O(n²)

Agglomerative clustering?

seaborn.clustermap?

Make Patterns Pop Out of Heatmaps with Seriation
by Nicolas Kruchten

Let's get scientific

$$ \min \sum_{i} d_{i,i+1} $$

Seriation
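
A minimal sketch of seriation as an ordering problem: choose the order of items that minimizes the sum of distances between neighbours. The greedy nearest-neighbour heuristic below is a stand-in chosen for brevity, not necessarily the solver used in the talk, and it does not guarantee the optimum in general.

```python
def path_cost(order, dist):
    """Sum of distances between neighbours in the ordering."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def greedy_seriation(dist):
    """Try every start item, always step to the nearest unvisited item,
    and keep the cheapest ordering found."""
    n = len(dist)
    best = None
    for start in range(n):
        order = [start]
        left = set(range(n)) - {start}
        while left:
            # Ties are broken by index to keep the result deterministic
            nxt = min(left, key=lambda j: (dist[order[-1]][j], j))
            order.append(nxt)
            left.remove(nxt)
        if best is None or path_cost(order, dist) < path_cost(best, dist):
            best = order
    return best

# Toy data: five items described by one number each
values = [0, 10, 1, 11, 2]
dist = [[abs(x - y) for y in values] for x in values]
order = greedy_seriation(dist)
print([values[i] for i in order])  # [0, 1, 2, 10, 11]
```

Reordering the rows and columns of the distance matrix by `order` places similar items next to each other, which is what makes the heatmap patterns visible.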

HDBSCAN

aka I could not be bothered more by clustering

Beyond 1D

Identifier embeddings

Code naturalness

class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
        

Splitting and normalization

            _tcp_socket_connect -> [tcp, socket, connect]
            AuthenticationError -> [authentication, error]
            authentication, authenticate -> authenticate
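
The splitting step can be sketched with one regular expression. This is an illustrative heuristic, not the TokenParser implementation, and it leaves out the normalization step (authentication -> authenticate), which needs a stemmer.

```python
import re

# An acronym run (HTTP in HTTPServer), a capitalized or lowercase word, or digits
TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_identifier(name):
    """Split snake_case and CamelCase identifiers into lowercase subtokens."""
    return [part.lower() for part in TOKEN_RE.findall(name)]

print(split_identifier("_tcp_socket_connect"))  # ['tcp', 'socket', 'connect']
print(split_identifier("AuthenticationError"))  # ['authentication', 'error']
```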
        
class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
        

Subtokens per scope (×2 = occurs twice):

class Database (whole class) -> database, connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close
def connect (whole method) -> connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close
def connect(self, user, password, host, port) -> connect, user, password, host, port
self._tcp_socket_connect(host, port) -> tcp, socket, connect, host, port
try ... except (whole statement) -> authenticate ×2, user, password, error, socket, close
self._authenticate(user, password) -> authenticate, user, password
except AuthenticationError (whole clause) -> authenticate, error, socket, close
except AuthenticationError as e -> authenticate, error
self.socket.close() -> socket, close
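
One way to turn such bags of subtokens into training data for embeddings is to count how often two subtokens share a scope. The sketch below assumes exactly that definition of co-occurrence; the bags are copied from the statement-level examples above.

```python
from collections import Counter
from itertools import combinations

# Subtoken bags of individual statements in Database.connect
scopes = [
    ["connect", "user", "password", "host", "port"],
    ["tcp", "socket", "connect", "host", "port"],
    ["authenticate", "user", "password"],
    ["authenticate", "error"],
    ["socket", "close"],
]

def cooccurrences(scopes):
    """Count unordered pairs of distinct subtokens that share a scope."""
    counts = Counter()
    for bag in scopes:
        for a, b in combinations(sorted(set(bag)), 2):
            counts[(a, b)] += 1
    return counts

counts = cooccurrences(scopes)
print(counts[("host", "port")])      # 2
print(counts[("password", "user")])  # 2
print(counts[("close", "socket")])   # 1
```

The resulting sparse, symmetric matrix of counts is the kind of input a co-occurrence based embedding model consumes.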

Swivel one more time

Nearest to “foo”

Analogies

“bug” - “test” + “expect” = “suppress”

“database” - “query” + “tune” = “settings”

“send” - “receive” + “pop” = “push”
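
Analogies like these are usually answered with vector arithmetic: take a - b + c and return the nearest remaining word by cosine similarity. The sketch below uses hand-made two-dimensional toy vectors, not real identifier embeddings, only to show the mechanics.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(vectors, a, b, c):
    """Word closest to vectors[a] - vectors[b] + vectors[c], excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embeddings: axis 0 ~ network vs. stack, axis 1 ~ put vs. take
toy = {
    "send": (1.0, 1.0),
    "receive": (1.0, -1.0),
    "push": (-1.0, 1.0),
    "pop": (-1.0, -1.0),
    "close": (0.2, -0.9),
}
print(analogy(toy, "send", "receive", "pop"))  # push
```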

Typos

source{d} Enterprise Edition

Tools

github.com/src-d/hercules

github.com/src-d/ml-core

github.com/src-d/ml-mining

>>> from sourced.ml.core.algorithms.token_parser import TokenParser
>>> parser = TokenParser(use_nn=True)
>>> list(parser.split("progressbar"))
['progress', 'bar']
>>> list(parser.split("bigLITTLE"))
['big', 'LITTLE']
>>> list(parser.split("PreSet"))
['preset']

Examples

# Lines evolution
hercules --burndown | labours -m burndown-project
# Couples
hercules --couples | labours -m couples
# Commit time series
hercules --devs | labours -m devs

Datasets

Why?

GDPR

Software Heritage Graph Dataset

Public Git Archive

Summary

Mining Software Repositories

msrconf.org

Thank you

bit.ly/3402LH5