Personal code review assistants and Machine Learning on Source Code

Vadim Markovtsev
@vadimlearning

source{d}
#MLonCode

Plan

Origins ➙ Lookout ➙ SDK ➙ #MLonCode

Origins

Why does reviewing code suck?

Need to understand somebody else's code.
💩

Need to have experience.
🤓

Need to have empathy.
🤬

Boring.
😴

Why boring?

Same faults again and again.
Minor, pedantic faults.
No immediate feedback.

Σ≈4 million

Solution: automate what we can.

We do not want this

Better to suggest less, but make no mistakes.

Lookout

Architecture

Push event

PR event

docs.sourced.tech/lookout

SDK

src-d/lookout-sdk

Single source of gRPC definitions
Low-level API: Go, Python
Low-level examples

src-d/lookout-sdk-ml

High-level Python API
Stateful analyzers
Integrated with source{d} ML ecosystem

Rule of 👍

High-level API

class MyAnalyzer(Analyzer):
    @classmethod
    def train(cls, ...) -> AnalyzerModel:
        # ...

    def analyze(self, ...) -> [Comment]:
        # do something with self.model
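A minimal, self-contained sketch of the train/analyze split. The `Analyzer`, `AnalyzerModel` and `Comment` classes below are hypothetical stand-ins, not the real lookout-sdk-ml classes, and the method arguments (`files`, `changed`) are assumptions made for illustration; the slide elides the real signatures.

```python
from typing import Dict, List, NamedTuple


class Comment(NamedTuple):
    """Hypothetical stand-in for a review comment."""
    file: str
    line: int
    text: str


class AnalyzerModel:
    """Hypothetical stand-in for a trained model that can be stored."""
    def __init__(self, max_line_length: int):
        self.max_line_length = max_line_length


class Analyzer:
    """Hypothetical stand-in for the SDK base class."""
    def __init__(self, model: AnalyzerModel):
        self.model = model


class LineLengthAnalyzer(Analyzer):
    @classmethod
    def train(cls, files: Dict[str, str]) -> AnalyzerModel:
        # "Learn" the longest line that the existing code already tolerates.
        longest = max((len(line) for text in files.values()
                       for line in text.splitlines()), default=0)
        return AnalyzerModel(max_line_length=longest)

    def analyze(self, changed: Dict[str, str]) -> List[Comment]:
        # Flag every changed line that is longer than anything seen in training.
        comments = []
        for name, text in changed.items():
            for number, line in enumerate(text.splitlines(), start=1):
                if len(line) > self.model.max_line_length:
                    comments.append(Comment(
                        name, number, "longer than any line in the existing code"))
        return comments


model = LineLengthAnalyzer.train({"a.py": "x = 1\nprint(x)\n"})
comments = LineLengthAnalyzer(model).analyze({"b.py": "y = 2\nprint(y, y, y)\n"})
```

The model is learned once from existing code and then reused for every analysis, which is what makes the analyzer stateful.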

Behind the scenes

gRPC servers and clients
Pooling and threading
Database of trained models
Caches
Deferred training on first use
Retraining
Logging, metrics
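A rough sketch of how deferred training and retraining could sit on top of a model database. `ModelStore`, `train_fn` and the repository name are hypothetical; the dictionary stands in for whatever storage the SDK really uses.

```python
class ModelStore:
    """Hypothetical in-memory stand-in for the database of trained models."""

    def __init__(self, train_fn):
        self._train_fn = train_fn
        self._models = {}
        self.train_calls = 0

    def get(self, repo):
        # Deferred training: train only when a repository is requested
        # for the first time, then serve the stored model.
        if repo not in self._models:
            self.retrain(repo)
        return self._models[repo]

    def retrain(self, repo):
        # Retraining replaces the stored model with a fresh one.
        self.train_calls += 1
        self._models[repo] = self._train_fn(repo)
        return self._models[repo]


store = ModelStore(train_fn=lambda repo: {"repo": repo})
first = store.get("github.com/example/project")
second = store.get("github.com/example/project")
```

The second `get` call returns the stored model without training again.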

MLonCode

Naturalness hypothesis

Because coding is an act of communication, one might expect large code corpora to have rich patterns, similar to natural language, thus allowing software engineering tools to exploit probabilistic ML models.

Allamanis et al.

id2vec bit.ly/2UHo1B0

Send is to receive as push is to pop.

Markovtsev et al.
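The analogy can be reproduced with vector arithmetic on identifier embeddings. The 2-D vectors below are made up for illustration only; real embeddings are learned from code and have many more dimensions.

```python
import math

# Made-up toy vectors, NOT trained embeddings.
vectors = {
    "send": [1.0, 1.0],
    "receive": [1.0, -1.0],
    "push": [3.0, 1.0],
    "pop": [3.0, -1.0],
    "open": [-2.0, 1.0],
    "close": [-2.0, -1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("send", "receive", "push")
```

With these toy vectors `answer` is `"pop"`: the offset from `send` to `receive` is the same as the offset from `push` to `pop`.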

AST GGNN bit.ly/2V1B8wo

code2vec code2vec.org

Code LSH bit.ly/2v7tKAS
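For readers new to locality-sensitive hashing, here is a sketch of plain MinHash with banding over token sets. This is a textbook scheme picked for illustration; the exact hashing scheme and features used in the linked work may differ, and all function names are hypothetical.

```python
import hashlib


def _hash(token, seed):
    # One member of a family of hash functions, selected by the seed.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens, num_hashes=64):
    """Signature: the minimum hash of the token set under each hash function."""
    unique = set(tokens)
    return [min(_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    # The fraction of equal positions estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def bands(signature, rows=4):
    return [tuple(signature[i:i + rows])
            for i in range(0, len(signature), rows)]


def are_candidates(sig_a, sig_b, rows=4):
    # LSH step: two snippets land in the same bucket if any band matches.
    return any(x == y for x, y in zip(bands(sig_a, rows), bands(sig_b, rows)))


sig_a = minhash("def add ( a , b ) : return a + b".split())
sig_b = minhash("def add ( x , y ) : return x + y".split())
sig_c = minhash("class Foo extends Bar { }".split())
```

Snippets that share most tokens get similar signatures, so near-duplicates can be found by comparing bucket keys instead of comparing every pair of files.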

Automatic format repair

Augmented token stream

a = b * 2
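A sketch of one way to build such a stream, assuming "augmented" means that formatting characters are kept as explicit tokens next to the code tokens. The regex and the token kinds (`space`, `newline`, `word`, `op`) are hypothetical simplifications, not the representation used by the actual analyzer.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<space>[ \t]+)|(?P<newline>\n)|(?P<word>\w+)|(?P<op>[^\w\s])")


def augment(code):
    """Split code into (kind, text) pairs without dropping any whitespace."""
    return [(match.lastgroup, match.group())
            for match in TOKEN_RE.finditer(code)]


stream = augment("a = b * 2")
# The stream alternates code tokens and explicit formatting tokens:
# ("word", "a"), ("space", " "), ("op", "="), ("space", " "), ...
```

Because the whitespace tokens are part of the stream, a model can be trained to predict them, and a mismatch between the prediction and the actual formatting becomes a review suggestion.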

Machine Learning

Rules

a≤5 ∧ b≤1 ∧ c ⇒ α
a≤5 ∧ 1<b<4 ⇒ β
5<a<10 ∧ c ⇒ γ
a>5 ∧ c ∧ b>2 ⇒ α
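The four rules above can be written down directly as predicates over a feature dictionary. Returning the label of the first matching rule, and `None` when nothing matches, is an assumption of this sketch; the slide does not say how overlapping rules are resolved.

```python
RULES = [
    (lambda f: f["a"] <= 5 and f["b"] <= 1 and f["c"], "α"),
    (lambda f: f["a"] <= 5 and 1 < f["b"] < 4, "β"),
    (lambda f: 5 < f["a"] < 10 and f["c"], "γ"),
    (lambda f: f["a"] > 5 and f["c"] and f["b"] > 2, "α"),
]


def predict(features):
    """Return the label of the first rule whose condition holds."""
    for condition, label in RULES:
        if condition(features):
            return label
    return None  # no rule fires: make no suggestion


label = predict({"a": 3, "b": 0, "c": True})
```

Returning nothing when no rule fires matches the goal of suggesting less rather than making mistakes.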

Rules optimization

a>5 ∧ c ∧ b>2 ∧ d ∧ a>10 ⇒ α (merge)

a>10 ∧ c ∧ b>2 ∧ d ⇒ α (redundant)

a>10 ∧ c ∧ d ⇒ α (feature exclusion)

a>10 ∧ c ⇒ α (confidence threshold)
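The redundancy step is the easiest one to show in code: when a merged rule bounds the same feature twice, only the tighter bound matters. The sketch below covers just that step; the `(feature, operator, value)` encoding is a hypothetical choice, and feature exclusion and the confidence threshold are not modeled.

```python
def drop_redundant(conditions):
    """Keep only the tightest bound per (feature, operator) pair."""
    tightest = {}
    for feature, op, value in conditions:
        key = (feature, op)
        if key not in tightest:
            tightest[key] = value
        elif op == ">":
            tightest[key] = max(tightest[key], value)   # a>5 and a>10 -> a>10
        elif op == "<=":
            tightest[key] = min(tightest[key], value)   # a<=5 and a<=3 -> a<=3
    return [(feature, op, value)
            for (feature, op), value in tightest.items()]


merged = [("a", ">", 5), ("c", "is", True), ("b", ">", 2),
          ("d", "is", True), ("a", ">", 10)]
simplified = drop_redundant(merged)
```

Here `merged` encodes the first rule on the slide and `simplified` encodes the second one.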

Result

40% to 50% fewer rules @93%

Typo correction

Idea #1

We trust existing code.

Idea #2

Let's use the whole world's open source code.

Idea #3

Piece-wise invariance: FooBar = bar_foo
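One reading of this idea, sketched below: an identifier is treated as the collection of its parts, so naming style and part order do not matter. The regex splitter is a simple heuristic used here for illustration, not the learned splitter from the talk, and the function names are hypothetical.

```python
import re

# Uppercase runs (acronyms), capitalized or lowercase words, digit runs.
PART_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(name):
    """Split camelCase, PascalCase and snake_case names into lowercase parts."""
    return [part.lower() for part in PART_RE.findall(name)]


def same_pieces(x, y):
    # Compare identifiers as unordered collections of parts.
    return sorted(split_identifier(x)) == sorted(split_identifier(y))


parts = split_identifier("FooBar")
equivalent = same_pieces("FooBar", "bar_foo")
```

Working on parts rather than whole identifiers keeps the vocabulary small and lets a typo in one part be fixed independently of the others.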

Typo correction plan

Extract all identifiers in the open source world
Split each token into parts with a char-biLSTM
Build the vocabulary
Train xgboost
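A toy end-to-end sketch of the vocabulary step and candidate lookup. It deliberately swaps in simpler techniques: candidates come from single-character edits, and ranking by vocabulary frequency replaces the trained xgboost model. The identifier parts in the example are made up.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct(part, vocabulary):
    """Return the most frequent known part within one edit, else the input."""
    if part in vocabulary:
        return part  # we trust existing code
    candidates = [w for w in edits1(part) if w in vocabulary]
    return max(candidates, key=vocabulary.__getitem__, default=part)


# Vocabulary: frequencies of identifier parts (made-up sample).
vocabulary = Counter(["value", "value", "index", "count", "values"])
fixed = correct("vlaue", vocabulary)
```

A learned ranker can use more signals than frequency alone, such as the surrounding parts of the identifier, which this sketch ignores.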

Summary

Assisted code review is fun
Assisted code review + Lookout = ♥
#MLonCode is dope
ML can make code reviews less boring

Thank you

bit.ly/2Iv8nBC
