Machine Learning on Source Code

Vadim Markovtsev, dotAI 2018

#MLonCode

Vadim Markovtsev

@vadimlearning

📜 Code assistants can be smarter

How?

🚀 Machine Learning

GitHub

Machine Learning on Source Code

MLonCode

Code naturalness

class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
class Database:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...

class Foo:
    def bar(self, qux):
        # ...
    def baz(self, waldo):
        # ...
    def do(self, really):
        # ...

Identifier embeddings

\( \begin{split} V_1 \Leftrightarrow & \,\texttt{"foo"} \\ \\ V_2 \Leftrightarrow & \,\texttt{"bar"} \\ \\ V_3 \Leftrightarrow & \,\texttt{"integrate"} \end{split} \)

\( distance(V_1, V_2) < distance(V_1, V_3) \)

\( distance(V_i, V_j) = \arccos \frac{V_i \cdot V_j}{\left\lVert V_i \right\rVert \left\lVert V_j \right\rVert} \)

Scalar product
Norm
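The distance formula can be tried out directly. A minimal sketch, assuming tiny made-up 2-D vectors for "foo", "bar" and "integrate" (real embeddings are learned and have many more dimensions):

```python
import math

def angular_distance(u, v):
    """Angle in radians between u and v: arccos of the cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))          # scalar product
    norm_u = math.sqrt(sum(a * a for a in u))       # norm of u
    norm_v = math.sqrt(sum(b * b for b in v))       # norm of v
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))

# Hypothetical embeddings, chosen by hand for illustration only.
v_foo = [1.0, 0.1]
v_bar = [0.9, 0.2]
v_integrate = [0.1, 1.0]

print(angular_distance(v_foo, v_bar) < angular_distance(v_foo, v_integrate))
```

With these toy vectors "foo" lands closer to "bar" than to "integrate", which is the inequality stated above.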

How to estimate \(V_i \cdot V_j\) ?

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
        

Splitting and normalization

            _tcp_socket_connect -> [tcp, socket, connect]
            AuthenticationError -> [authentication, error]
            authentication, authenticate -> authenticate
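
The splitting step can be sketched with two regular expressions. This is an illustration, not the talk's actual pipeline: `split_identifier` and the `STEMS` table are hypothetical names, and the one-entry table stands in for a real stemmer.

```python
import re

# Hypothetical normalization table; a real pipeline would use a stemmer.
STEMS = {"authentication": "authenticate"}

def split_identifier(name):
    """Split snake_case / CamelCase identifiers into normalized tokens."""
    parts = []
    # First cut on anything that is not a letter (underscores, digits, dots).
    for chunk in re.split(r"[^A-Za-z]+", name):
        # Then cut on case changes: "HTTPServer" -> ["HTTP", "Server"].
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [STEMS.get(p.lower(), p.lower()) for p in parts]

print(split_identifier("_tcp_socket_connect"))
print(split_identifier("AuthenticationError"))
```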
        
class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
        

database, connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close

connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close

connect, user, password, host, port

tcp, socket, connect, host, port

authenticate ×2, user, password, error, socket, close

authenticate, user, password

authenticate, error, socket, close

authenticate, error

socket, close

Incidence matrix \(C_{ij}\)
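
One way to fill \(C_{ij}\) from the scopes listed above is to count how often two tokens share a scope. The sketch below makes two assumptions that the slides do not spell out: only the five line-level scopes are used as contexts, and a token counts once per scope.

```python
from collections import Counter
from itertools import combinations

# Line-level scopes from the Database.connect example.
scopes = [
    ["connect", "user", "password", "host", "port"],
    ["tcp", "socket", "connect", "host", "port"],
    ["authenticate", "user", "password"],
    ["authenticate", "error"],
    ["socket", "close"],
]

def cooccurrences(scopes):
    """Symmetric sparse matrix: counts[(a, b)] = number of scopes holding both."""
    counts = Counter()
    for tokens in scopes:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

C = cooccurrences(scopes)
print(C[("host", "port")], C[("socket", "close")], C[("tcp", "error")])
```

A `Counter` returns 0 for pairs that never co-occur, so the matrix stays sparse.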

Pointwise Mutual Information (PMI)

$$ V_i \cdot V_j = PMI_{ij} = \log\frac{C_{ij} \sum C}{\sum_{k = 1}^N C_{ik}\sum_{k = 1}^N C_{jk}} $$
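
The PMI formula translates almost literally into code. A minimal sketch on a hand-made symmetric 3×3 matrix; returning minus infinity for pairs that never co-occur is a choice made here, not something the slide prescribes.

```python
import math

def pmi_matrix(C):
    """PMI_ij = log(C_ij * sum(C) / (row_sum_i * row_sum_j)) for a square C."""
    n = len(C)
    total = sum(sum(row) for row in C)      # sum over the whole matrix
    row_sums = [sum(row) for row in C]      # sum_k C_ik
    pmi = [[float("-inf")] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if C[i][j] > 0:                 # log(0) is undefined
                pmi[i][j] = math.log(C[i][j] * total / (row_sums[i] * row_sums[j]))
    return pmi

# Toy co-occurrence counts, for illustration only.
C = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
P = pmi_matrix(C)
print(P[0][1], P[1][2])
```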

Representation Learning on Explicit Matrix

Stochastic Gradient Descent

Swivel
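
To show the idea of learning vectors whose scalar products reproduce an explicit matrix, here is a deliberately simplified sketch: plain squared-error factorization trained with stochastic gradient descent. It is not Swivel, whose loss and training setup are more elaborate; the function names, the toy target matrix and the hyperparameters are all assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(V, M):
    """Sum of squared errors between V_i . V_j and M_ij over off-diagonal pairs."""
    n = len(M)
    return sum((dot(V[i], V[j]) - M[i][j]) ** 2
               for i in range(n) for j in range(n) if i != j)

def train_embeddings(M, dim=2, lr=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    n = len(M)
    V = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for _ in range(epochs):
        rng.shuffle(pairs)                   # visit pairs in random order
        for i, j in pairs:
            err = dot(V[i], V[j]) - M[i][j]
            vi, vj = V[i][:], V[j][:]        # copies, so both updates use old values
            for k in range(dim):
                V[i][k] -= lr * err * vj[k]  # gradient step for V_i
                V[j][k] -= lr * err * vi[k]  # gradient step for V_j
    return V

# Toy symmetric target (think: PMI values); the diagonal is ignored.
M = [
    [0.0, 1.0, -0.5],
    [1.0, 0.0, 0.5],
    [-0.5, 0.5, 0.0],
]
V = train_embeddings(M)
print(loss(V, M))
```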

Dataset Extraction

Public Git Archive
pga.sourced.tech

Results

Nearest to “foo”

Analogies

“bug” - “test” + “expect” = “suppress”

“database” - “query” + “tune” = “settings”

“send” - “receive” + “pop” = “push”
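
Analogies like these come from vector arithmetic followed by a nearest-neighbour lookup. A minimal sketch with hand-picked 2-D vectors (hypothetical, not the learned embeddings from the talk), arranged so that the third analogy works out:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Word whose vector is most similar to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors for illustration only.
toy = {
    "send":    [1.0, 1.0],
    "receive": [1.0, -1.0],
    "pop":     [-1.0, -1.0],
    "push":    [-1.0, 1.0],
    "close":   [0.0, -1.0],
}
print(analogy(toy, "send", "receive", "pop"))
```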

Typos

$$ \arg\min_X \left( distance(X, \text{"settings"}) + distance(X, \text{"get"}) + distance(X, \text{"key"}) \right) $$

set

save

step

sneak

sum

ssl
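
The ranking criterion can be sketched as follows, assuming the words above are candidate corrections and using made-up 2-D vectors in place of the learned embeddings. The helper name `best_candidate` is hypothetical.

```python
import math

def angular_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def best_candidate(vectors, candidates, context):
    """Candidate with the smallest summed distance to the context tokens."""
    return min(
        candidates,
        key=lambda x: sum(angular_distance(vectors[x], vectors[c]) for c in context),
    )

# Toy vectors for illustration only.
toy = {
    "settings": [1.0, 0.2],
    "get":      [0.8, 0.6],
    "key":      [0.9, 0.4],
    "set":      [1.0, 0.4],
    "sum":      [0.1, 1.0],
    "ssl":      [-0.5, 1.0],
}
print(best_candidate(toy, ["set", "sum", "ssl"], ["settings", "get", "key"]))
```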

Conclusion

#MLonCode is fun🤓

Thank you

bit.ly/2jWassK