Hercules and His Labours

Vadim Markovtsev, source{d}.

Hercules and His Labours

Vadim Markovtsev, source{d}

Hercules Killing the Lernean Hydra

Read this on your device

Objective

git blame foo.py

2014
class Foo:
  def bar(self):
    print("!")
2015
class Foo:
  def bar(self):
    print("?")
 
  def baz(self):
    print("!")
2016
class FooBarBaz:
  def bar(self):
    """..."""
    print("yo")
 
  def baz(self):
    print("!")

Objective

History

Problems

Diff route

How to blame by Alberto Cortes.
  1. Trace the commits chain backwards starting from HEAD.
    <root> D 🠔 E 🠔 F 🠔 G 🠔 H 🠔 I HEAD
  2. Diff starting from the root along the path from (1).
    <root> D 🠖 E 🠖 F 🠖 G 🠖 H 🠖 I HEAD

Diff route

Where are my branches?

			      A---B---C
     /         \
D---E---F---G---H---I HEAD
     ----------> time
 way <----------
		

Git branches are pointers, and the history is lost. Merges are ordered.

Diff route

Which way to go?

			      A---B---C topic
     /         \
D---E---F---G---H---I master
     ----------> time
 way <----------
		

Blaming using both branches is hard and even impossible sometimes.

Diff route

There can be "octopus" merges.

			      A---B---C topic1
     /         \
D---E---F---G---H---I master
         \     /
          J---K topic2
		

Diff route

Working solution: treat merges as single commits.

                C
D---E---F---G---H---I master
                K
		

We lose some information.

Performance

Complexity matters:

git checkout <commit> && git blame <file>

takes O(n) steps where n is the number of commits before <commit>.

Thus the whole procedure takes O(n2).

We can reduce it to O(n) using the incremental algorithm!

Performance

Incremental blame is not possible using the libgit2 and cgit API.

Performance

go-git was designed for blame analytics from scratch.

It allows to cache the intermediate results.

diff-tree implementation is a state-of-the-art.

Performance

How to store the line intervals?

class FooBarBaz:
  def bar(self):
    """..."""
    print("yo")
 
  def baz(self):
    print("!")

Performance

How to store the line intervals?

class FooBarBaz:
  def bar(self):
    """..."""
    print("yo")
 
  def baz(self):
    print("!")

Performance

How to store the line intervals?

Performance

apache/spark@7026ee23

Performance

In such cases the binary tree has O(1) update complexity.

We have chosen the binary tree.

Anyway, the profile shows that the diff is the bottleneck.

Performance

Git stores commits as snapshots, we need to diff ourselves.

Diff algorithms:

Myers algorithm is optimal for analytics.

We tried to accelerate it with NVIDIA CUDA but failed.

Trying Hercules

  1. Install Go language
  2. go get gopkg.in/src-d/hercules.v1/cmd/hercules
  3. bin/hercules https://github.com/tensorflow/tensorflow | python3 src/gopkg.in/src-d/hercules.v1/labours.py --resample=month

Trying Hercules

... labours.py --resample=month --relative

Trying Hercules

hercules https://github.com/torvalds/linux

2 hours, 130GB RSS

Future

Thank you

My contacts: