Personal code review assistants and Machine Learning on Source Code

Vadim Markovtsev, source{d}.

@vadimlearning

source{d}
#MLonCode

Plan

Origins    Lookout    SDK    #MLonCode

Origins

Why does reviewing code suck?

Need to understand somebody else's code.
💩

Need to have experience.
🤓

Need to have empathy.
🤬

Boring.
😴

Why boring?

Σ≈4 million

Solution: automate what we can.

We do *not* want this

Source.

Better to suggest less, but make no mistakes.

Lookout

Architecture

Push event

PR event

docs.sourced.tech/lookout

SDK

src-d/lookout-sdk

src-d/lookout-sdk-ml

Rule of 👍

High-level API

class MyAnalyzer(Analyzer):
    @classmethod
    def train(cls, ...) -> AnalyzerModel:
        # ...

    def analyze(self, ...) -> [Comment]:
        # do something with self.model

Behind the scenes

MLonCode

Naturalness hypothesis

Because coding is an act of communication, one might expect large code corpora to have rich patterns, similar to natural language, thus allowing software engineering tools to exploit probabilistic ML models.

Allamanis et al.

id2vec bit.ly/2UHo1B0

Send is to receive as push is to pop.
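The analogy can be illustrated with vector arithmetic on identifier embeddings. The sketch below uses hand-picked 2-D toy vectors chosen so that the analogy holds; they are not trained id2vec embeddings, and the `analogy` helper is a name made up for this example.

```python
import math

# Toy 2-D embeddings, hand-picked for illustration (not trained id2vec vectors).
vectors = {
    "send": (1.0, 1.0),
    "receive": (1.0, -1.0),
    "push": (3.0, 1.0),
    "pop": (3.0, -1.0),
    "open": (-2.0, 1.0),
    "close": (-2.0, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    # The target point is b - a + c; the answer is its nearest neighbour.
    target = tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("send", "receive", "push")
```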

Markovtsev et al.

AST GGNN bit.ly/2V1B8wo

code2vec code2vec.org

Code LSH bit.ly/2v7tKAS

Automatic format repair

Augmented token stream

a = b * 2
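One way to read "augmented token stream": the whitespace between code tokens is kept as explicit tokens, so that formatting becomes something a model can learn. The sketch below is an assumption about the idea, not the actual tokenizer; the marker names such as `<space>` are invented for this example.

```python
import re

# Whitespace runs, words, and single punctuation characters are all kept.
TOKEN = re.compile(r"\s+|\w+|[^\w\s]")

# Hypothetical marker names for formatting characters.
MARKERS = {" ": "<space>", "\n": "<newline>", "\t": "<tab>"}


def augmented_tokens(code):
    """Split code into tokens, replacing whitespace with explicit markers."""
    stream = []
    for match in TOKEN.finditer(code):
        text = match.group()
        if text.isspace():
            # Emit one marker per whitespace character.
            stream.extend(MARKERS.get(ch, "<ws>") for ch in text)
        else:
            stream.append(text)
    return stream


tokens = augmented_tokens("a = b * 2")
```

For the slide's example this yields the five code tokens interleaved with four `<space>` markers.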

Machine Learning

Rules

Rules optimization

a>5 ∧ c ∧ b>2 ∧ d ∧ a>10 ⇒ α(merge)
a>10 ∧ c ∧ b>2 ∧ d ⇒ α(redundant)
a>10 ∧ c ∧ d ⇒ α(feature exclusion)
a>10 ∧ c ⇒ α(confidence threshold)
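The first simplification, from the first rule to the second, can be sketched as merging repeated lower bounds on the same feature inside one conjunction. The tuple encoding of conditions and the `merge_thresholds` name are assumptions made for this example; the later steps depend on the training data and are not shown.

```python
def merge_thresholds(conditions):
    """Collapse repeated `feature > value` bounds into the tightest one.

    Each condition is ("gt", feature, value) for `feature > value`
    or ("flag", feature) for a boolean feature.
    """
    # First pass: find the largest lower bound per feature.
    tightest = {}
    for cond in conditions:
        if cond[0] == "gt":
            _, feature, value = cond
            tightest[feature] = max(value, tightest.get(feature, value))

    # Second pass: keep one bound per feature, at its first position.
    merged, seen = [], set()
    for cond in conditions:
        if cond[0] == "gt":
            feature = cond[1]
            if feature in seen:
                continue
            seen.add(feature)
            merged.append(("gt", feature, tightest[feature]))
        else:
            merged.append(cond)
    return merged


rule = [("gt", "a", 5), ("flag", "c"), ("gt", "b", 2), ("flag", "d"), ("gt", "a", 10)]
merged = merge_thresholds(rule)
```

Within a conjunction, `a>5 ∧ a>10` is equivalent to `a>10`, so the merge changes the rule's length but not its meaning.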

Result

Typo correction

Idea #1

We trust existing code.

Idea #2

Let's use the whole world's open source code.

Idea #3

Piece-wise invariance: FooBar = bar_foo
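One reading of this idea is that identifiers are compared by their parts, independent of naming style and order. The sketch below splits with regular expressions, which is a simple heuristic swapped in for illustration; the function names are made up for this example.

```python
import re


def split_identifier(name):
    """Split an identifier into lowercase parts on underscores and case changes."""
    parts = []
    for chunk in re.split(r"[_\W]+", name):
        # Uppercase runs (acronyms), capitalized or lowercase words, and digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def same_parts(left, right):
    """True when two identifiers consist of the same parts in any order."""
    return sorted(split_identifier(left)) == sorted(split_identifier(right))


equal = same_parts("FooBar", "bar_foo")
```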

Typo correction plan

  1. Extract all identifiers in the open source world
  2. Split each token into parts with a char-biLSTM
  3. Build the vocabulary
  4. Train xgboost
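A rough sketch of steps 1 to 3 on a tiny corpus, under loud assumptions: a regex splitter stands in for the char-biLSTM, and `difflib.get_close_matches` from the standard library stands in for the trained xgboost ranker. The corpus and helper names are invented for this example.

```python
import difflib
import re
from collections import Counter


def split_identifier(name):
    # Heuristic splitter standing in for the char-biLSTM from step 2.
    return [p.lower() for p in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)]


def build_vocabulary(identifiers):
    """Step 3: count how often each identifier part occurs."""
    return Counter(part for name in identifiers for part in split_identifier(name))


def suggest(part, vocabulary, limit=3):
    """Propose known parts similar to `part`, most frequent first."""
    if part in vocabulary:
        return []  # trusted: the part already occurs in existing code
    close = difflib.get_close_matches(part, list(vocabulary), n=limit, cutoff=0.7)
    return sorted(close, key=lambda p: -vocabulary[p])


# Step 1 stand-in: a handful of identifiers instead of all open source code.
corpus = ["getValue", "set_value", "valueCount", "read_file", "FileReader"]
vocab = build_vocabulary(corpus)
candidates = suggest("vlaue", vocab)
```

Parts that already occur in the vocabulary are never flagged, which matches the "we trust existing code" idea.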

Summary

Thank you

bit.ly/2Iv8nBC