Full software development cycle analytics in near real-time
Vadim Markovtsev
Athenian
Software development lifecycle
- Plan coding.
- Start coding.
- Open the pull request.
- Pass the review.
- Pass the CI.
- Merge the pull request.
- Release the pull request.
- Deploy the pull request.
Money estimates?
- Bullshit in 99% of the cases.
- Require a great cost model specific to the client.
- Require knowing the salaries.
Confidence intervals
- The distributions are neither normal, log-normal, nor any other well-known shape.
- Some are "semi-discrete", ≥ 0.
- We use the bootstrap method.
- Very fast after optimizations.
- Cached randomness.
- Control the number of iterations.
- 95%: ±50%
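The bootstrap with cached randomness can be sketched roughly like this; the function name and parameters are illustrative, not Athenian's actual API:

```python
import numpy as np

def bootstrap_ci(samples, n_iter=1000, confidence=0.95, seed=7):
    """Percentile-bootstrap CI of the mean.

    The seed is fixed so the resampling indices are reproducible; in a
    real service the index matrix can be generated once per array length
    and cached (the "cached randomness" above).
    """
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # n_iter resamples with replacement, drawn as one index matrix
    idx = rng.integers(0, len(samples), size=(n_iter, len(samples)))
    means = samples[idx].mean(axis=1)
    alpha = 1 - confidence
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_ci([1, 2, 3, 4, 100])
```

Controlling `n_iter` trades precision of the interval for speed, which is why the iteration count is worth exposing as a knob.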
Focus
- Support percentiles everywhere.
- Average vs. median.
- Average on 95th percentile.
- Stale garbage.
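The difference between the average, the median, and an "average on the 95th percentile" (a trimmed mean that drops stale outliers) can be shown on a tiny skewed sample; the numbers are made up:

```python
import numpy as np

# one stale outlier (200) dominates the plain mean
values = np.array([1, 2, 2, 3, 3, 4, 5, 200], dtype=float)

mean = values.mean()                       # pulled up by the outlier
median = np.median(values)                 # robust, but hides the tail
p95 = np.quantile(values, 0.95)            # cutoff for "stale garbage"
robust_mean = values[values <= p95].mean() # mean within the 95th percentile
```

The trimmed mean keeps more information than the median while still ignoring the pathological tail.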
GitHub API
- REST
- Best suited for casual or specific analysis.
- GraphQL
- Best suited for mirroring GitHub DB.
- Rate limits.
Node ID
from base64 import b64decode
print(b64decode(
    "MDY6Q29tbWl0MjI1OTI0NDI3OjA1MTkyMDUwOWJmMTk1N2Y"
    "4NDIzMDdmMzdkZGI0NDFiZmQyNTMwMGY=").decode())
06:Commit225924427:051920509bf1957f842307f37ddb441bfd25300f
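The decoded string appears to follow the pattern `0<len>:<Type><database id>:<suffix>`; this parse is inferred from the example above, not documented GitHub behavior:

```python
import re

decoded = "06:Commit225924427:051920509bf1957f842307f37ddb441bfd25300f"
# guessed structure: "0<len>:<Type><numeric db id>:<suffix>"
m = re.fullmatch(r"0\d+:([A-Za-z]+)(\d+):(.+)", decoded)
entity, db_id, suffix = m.group(1), int(m.group(2)), m.group(3)
```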
Two kinds of data retrieval
- Cold start.
- Event stream.
- Must register a GitHub application.
- Must re-scan periodically.
Gotcha!
- Node ID format changes.
- Merges happen after closures.
- PRs close before opening.
- HTTP 500.
- Incomplete.
- Eventual consistency.
Eventual consistency
Incoming events:
- Somebody committed aabbccd as a child of dccbbaa in repo XXX.
- Commit aabbccd has message "Fix".
- Commit aabbccd belongs to pull request PPP.
- New pull request PPP in repo XXX titled "merge me".
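One way to survive such out-of-order delivery is to buffer events that reference an entity we have not seen yet and replay them once it arrives. This is a hypothetical sketch, not Athenian's implementation:

```python
from collections import defaultdict

known_prs = set()
pending = defaultdict(list)   # pr id -> events waiting for that PR
processed = []                # commits successfully attached to a PR

def process(event):
    processed.append(event["commit"])

def handle(event):
    if event["type"] == "new_pr":
        known_prs.add(event["pr"])
        # replay events that arrived before their PR existed
        for buffered in pending.pop(event["pr"], []):
            handle(buffered)
    elif event["type"] == "commit_in_pr":
        if event["pr"] in known_prs:
            process(event)
        else:
            pending[event["pr"]].append(event)  # too early: wait

# the commit event arrives before its pull request exists
handle({"type": "commit_in_pr", "pr": "PPP", "commit": "aabbccd"})
handle({"type": "new_pr", "pr": "PPP"})
```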
Releases
- What is a release?
- A tag.
- A branch merge.
- A hash sent by the client.
- Which PRs were released?
- GitHub does not show this for a reason.
git push --force
Including git rebase
- Changes the hashes.
- May change the diffs.
- May change the messages.
git push --force
Including git rebase
- Changes the hashes.
- Usually does not change the diffs.
- Usually does not change the messages.
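Because rebases usually preserve commit messages while rewriting hashes, old and new commits can often be re-linked by message. A hypothetical heuristic with made-up hashes:

```python
# commits known before and after a force push (hashes differ)
old_commits = {"aaa111": "Fix bug", "bbb222": "Add test"}
new_commits = {"ccc333": "Fix bug", "ddd444": "Add test"}

# index the new commits by message and map old hash -> new hash
by_message = {msg: sha for sha, msg in new_commits.items()}
rebase_map = {sha: by_message.get(msg) for sha, msg in old_commits.items()}
```

Duplicate messages obviously break this, so a real matcher would need tie-breakers (author, timestamp order, diff similarity).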
Continuous Integration
- Check suites point at commits.
- Same commit may belong to several pull requests.
- Check suites may launch after the PR is closed.
- Check suites may launch before the PR is opened.
- Check suites may glitch.
- E.g., stay enqueued forever.
- No finish timestamp for legacy API check runs.
JIRA API
- No event stream. Pull vs. push.
- Next gen vs. classic. Migrations.
- So many fields may be custom.
- Frequent re-opens.
- Super strict privacy.
- Only user name by default.
- No emails.
JIRA identity matching
Client X
- 200 developers.
- Growing x2 in one year.
- Strong advocate of Athenian.
- >$1,000 MRR.
What we did
Dedicated custom analysis of their data showed:
- They are growing very fast.
- The PR frequency stays the same.
- Changed lines stay the same.
When to use Pandas
- Have requirements on response time?
- Work with many small dataframes?
- Have memory constraints?
How to use Pandas
Learn how to iterate rows fast.
for a, b in zip(df["a"].values, df["b"].values):
print(a, b)
How to use Pandas
Learn how to select rows fast.
df[df["a"] > 10]
# versus
df.take(np.flatnonzero(df["a"].values > 10))
How to use Pandas
Learn how to groupby fast.
for key, sub_df in df.groupby("a", sort=False):
    ...  # process each group
# versus
df.groupby("a", sort=False).grouper.groups
# dict key -> row indexes
How to use Pandas
Ensure your dtype is not object.
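Object dtype falls back to per-element Python calls; converting once to a native dtype makes everything downstream vectorized. A minimal sketch:

```python
import numpy as np

# object dtype: every operation loops in Python
bad = np.array(["1", "2", "3"], dtype=object)

# convert once to a native dtype, then all ops are vectorized
good = bad.astype(np.int64)
total = good.sum()
```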
How to live without Pandas
95% of ops can be rewritten in pure numpy.
- datetime64
- indexing
- sort_values
- groupby
How to live without Pandas
>>> sort_values("a")
order = np.argsort(array_a)
index = order
array_a = array_a[order]
array_b = array_b[order]
How to live without Pandas
>>> groupby("a")
keys, indexes, counts = np.unique(
array_a, return_inverse=True, return_counts=True)
order = np.argsort(indexes)
borders = np.cumsum(counts[:-1])
groups_a = np.split(array_a[order], borders)
groups_b = np.split(array_b[order], borders)
How to live without Pandas
>>> groupby("a").mean()
keys, indexes, counts = np.unique(
array_a, return_inverse=True, return_counts=True)
masks = np.zeros((len(keys), len(array_a)), bool)
arr_y = np.repeat(np.arange(len(keys)), counts)
arr_x = np.argsort(indexes)
masks[arr_y, arr_x] = True
mean_a = np.mean(array_a, where=masks, axis=1)
Tips for Python
- Use a profiler, e.g. py-spy.
- Vectorize everything.
- Cython.
- Precompute everything.
- Cache everything.
- Using a DB? Prepare to optimize queries and suffer.
- Choose incremental algorithms.
- KISS.
I can do the same with my script
- It will cover 80% of what you need.
- It will not be correct.
- Write-only.
- Charts.
The book
Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations.
Nicole Forsgren, Jez Humble, and Gene Kim.