Machine Learning on Source Code

Vadim Markovtsev, source{d}.

#MLonCode

Origins

Projects similar to MariaDB/server?

Similar code detection

provides us

Another example: DeepCode.ai.
class foobar:
    def connecttoserver(self):
        myserverhost = globalconfig.server.host
        
class FooBar:
    def connect_to_server(self):
        myServerHost = globalConfig.server.host
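
The two snippets above are the same code under different naming conventions. As a minimal sketch, and only an assumption about how one might compare them, identifiers can be reduced to a form that ignores case and underscores. The helper `normalize_identifier` is hypothetical and is not part of any source{d} tool.

```python
def normalize_identifier(name: str) -> str:
    """Drop underscores and letter case so naming styles compare equal."""
    return name.replace("_", "").lower()


# Identifiers taken from the two snippets, in matching order.
first = ["foobar", "connecttoserver", "myserverhost", "globalconfig"]
second = ["FooBar", "connect_to_server", "myServerHost", "globalConfig"]

same = [normalize_identifier(a) == normalize_identifier(b)
        for a, b in zip(first, second)]
print(all(same))  # prints True
```

This throws away the word boundaries that `connect_to_server` and `myServerHost` carry, so it is only useful for equality checks, not for recovering readable names.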
        

Your code is a crime scene

Tools for MLonCode

Datasets

PGA

Details in the paper.

PGA index

source{d} engine

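>>> from pyspark.sql import SparkSession
>>> from sourced.engine import Engine
>>> # Assumes a Spark session that already has the source{d} engine package available
>>> spark = SparkSession.builder.getOrCreate()
>>> engine = Engine(spark, "/path/to/siva/files", "siva")
>>> # Show every file at HEAD together with its detected language
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .select("blob_id", "path", "lang") \
...     .show()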

How to parse

Universal AST

dashboard.bblf.sh

Python client

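# Install the client and parse a file from the command line
pip3 install bblfsh
python3 -m bblfsh -f /path/to/file

# The same from Python; assumes a Babelfish server listening on port 9432
import bblfsh
client = bblfsh.BblfshClient("0.0.0.0:9432")
client.parse("/file").uast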
        
bblfsh/client-python

Integration

>>> # Identifier tokens of every Python file at HEAD, extracted from its UAST
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .filter('lang = "Python"') \
...     .extract_uasts() \
...     .query_uast('//*[@roleIdentifier]') \
...     .extract_tokens("result", "tokens") \
...     .select("blob_id", "path", "tokens")
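
The query above yields a list of identifier tokens per file. As a rough illustration of one possible next step, and not a description of what sourced-ml implements, the identifiers can be split into lowercase sub-tokens and counted into a bag of words. The names `split_identifier` and `bag_of_tokens` and the sample token list are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches, in order of preference: an acronym run (HTTP in HTTPServer),
# a capitalized or lowercase word, or a run of digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier: str) -> List[str]:
    """Split a camelCase or snake_case identifier into lowercase parts."""
    return [part.lower() for part in _PART.findall(identifier)]


def bag_of_tokens(identifiers: Iterable[str]) -> Counter:
    """Count sub-token frequencies over a collection of identifiers."""
    bag = Counter()
    for identifier in identifiers:
        bag.update(split_identifier(identifier))
    return bag


# Sample identifiers standing in for one row of the "tokens" column.
tokens = ["myServerHost", "connect_to_server", "globalConfig"]
bag = bag_of_tokens(tokens)
print(bag.most_common(1))  # prints [('server', 2)]
```

Splitting before counting lets `myServerHost` and `connect_to_server` share the sub-token `server`, which whole-identifier counts would miss.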
        

Powerful

sourced-ml

Requirements

Modelforge

GitHub

Summary

Thank you

Contacts: