Vadim Markovtsev, source{d}.
Vadim Markovtsev, source{d}
git blame foo.py
class Foo:
def bar(self):
print("!")
class Foo:
def bar(self):
print("?")
def baz(self):
print("!")
class FooBarBaz:
def bar(self):
"""..."""
print("yo")
def baz(self):
print("!")
<root> D 🠔 E 🠔 F 🠔 G 🠔 H 🠔 I HEAD
<root> D 🠖 E 🠖 F 🠖 G 🠖 H 🠖 I HEAD
Where are my branches?
A---B---C
/ \
D---E---F---G---H---I HEAD
----------> time
way <----------
Git branches are pointers, and the history is lost. Merges are ordered.
Which way to go?
A---B---C topic
/ \
D---E---F---G---H---I master
----------> time
way <----------
Blaming using both branches is hard and even impossible sometimes.
There can be "octopus" merges.
A---B---C topic1
/ \
D---E---F---G---H---I master
\ /
J---K topic2
Working solution: treat merges as single commits.
C
D---E---F---G---H---I master
K
We lose some information.
Complexity matters:
git checkout <commit> && git blame <file>
takes O(n) steps where n is the number of commits before
<commit>
.
Thus the whole procedure takes O(n2).
We can reduce it to O(n) using the incremental algorithm!
Incremental blame is not possible using the libgit2 and cgit API.
go-git was designed for blame analytics from scratch.
It allows to cache the intermediate results.
diff-tree implementation is a state-of-the-art.
How to store the line intervals?
class FooBarBaz:
def bar(self):
"""..."""
print("yo")
def baz(self):
print("!")
How to store the line intervals?
class FooBarBaz:
def bar(self):
"""..."""
print("yo")
def baz(self):
print("!")
How to store the line intervals?
In such cases the binary tree has O(1) update complexity.
We have chosen the binary tree.
Anyway, the profile shows that the diff is the bottleneck.
Git stores commits as snapshots, we need to diff ourselves.
Diff algorithms:
Myers algorithm is optimal for analytics.
We tried to accelerate it with NVIDIA CUDA but failed.
go get gopkg.in/src-d/hercules.v1/cmd/hercules
bin/hercules https://github.com/tensorflow/tensorflow | python3 src/gopkg.in/src-d/hercules.v1/labours.py --resample=month
... labours.py --resample=month --relative
hercules https://github.com/torvalds/linux
2 hours, 130GB RSS
My contacts: