Hacker News new | past | comments | ask | show | jobs | submit login
How short can Git abbreviate? (2013) (cuviper.com)
55 points by pncnmnp 10 days ago | hide | past | favorite | 11 comments





Today's histogram of commit abbreviation lengths:

  $ git rev-list --all --abbrev=0 --abbrev-commit | awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
  5 84
  6 692222
  7 527029
  8 43802
  9 2791
  10 181
  11 8

Just FYI: showing hash prefix collision counts is a built-in feature of the Fossil SCM. e.g. the collisions of one particularly well-known repo can be seen at:

<https://sqlite.org/src/hash-collisions>

and fossil's own can be seen at:

<https://fossil-scm.org/home/hash-collisions>


For grins, I recently wrote a brute force program to inject a nonce into a shell script that also was the crc32 hash of a resulting modified script containing itself. One script had 2 hash collisions that satisfied this property in the entire search domain.

oh my goodness, TIL nonce has a technical meaning. It has a very different meaning in Australia.

https://en.wikipedia.org/wiki/Cryptographic_nonce

In general, it's a made up one time value. In this case, it satisfies an additional property.



I myself improve my crypto code with rock spiders all the time!

Despite finding out about "nonce" as a technical term when I was reading around PHP like 3 years ago, I still can't not react to it every time lmao

I realize the author probably won't see this since the post is ancient, but it looks like there's a bug on the output: The "how many total objects are ambiguous at that abbreviation [length]" isn't quite right. It's actually "how many total possible words of that length can reference 2 or more existing objects, (and therefore are insufficient to disambiguate the matching objects of that repo)"

Reminds me of this project: https://github.com/trichner/gitc0ffee which is used to take a commit, append some header to it to force the hash to collide with some prefix, such as `c0ffee`

>6 character prefix: less than a second

>8 character prefix: in the order of one or more minutes

No affiliation with this, but I tested it, and it was fast!


I've actually used this as an interview question for several years (and has been my first/only "interview-GPT" question), as it's a problem I ran into at work, can be solved by someone new to programming, and has lots of headroom and alternates for efficiency, optimization, and alternative implementation discussions.

The use case I ran into was an initial + updated "dictionary of values" use case. Imagine: `{ "foo:bar:baz": 1.23, "bleep:bloop:blah": 4.56, ...etc.. }`.

I wanted to send the first batch as a "full dictionary", and then send updated batches as: `{ "aa": 1.23, "bb": 4.56 }` whereby the client should already be able to reverse: `aa` => `foo:bar:baz`, and `bb`: `bleep:bloop:blah` since they could md5 the original key, and then match that to the "compressed" unique key.

For the interview question, asking "what's the minimum unique prefix length", and exactly in the context of "how short can you abbreviate a git commit" was an excellent, low-friction, easily understandable problem.

I was shocked when the initial GPT-3 nailed the naive implementation, and in retrospect, not-surprised when its superficially-correct "optimized" solution was only superficially correct. ;-) Better than 50% of the interviewees I'd asked this question of.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: