DIY_GIT_in_Python/how_to/Change_04.md

- hash-object: Save object

Let's create the first non-trivial command. This command will take a file and
store it in our '.ugit' directory for later retrieval. In Git's lingo, this
feature is called "the object database". It allows us to store and retrieve
arbitrary blobs, which are called "objects". As far as the object database is
concerned, the content of an object doesn't have any meaning (just like a
filesystem doesn't care about the internal structure of a file).

Because this command needs the '.ugit' directory, it must be run from the same
directory where you ran 'ugit init'.

Note that this is a very low-level Git building block. We're not talking yet
about versions or commits or any other things that you might have heard about;
we're just talking about an interface for storing some raw bytes.

So we can store an object, but how would we refer to it later? We could ask the
user to provide a name along with the object and retrieve the object later using
the name, but there is a nicer way: We can refer to the object using its hash.

If you haven't heard about hashes and hash functions, I suggest that you pause
and do some reading on them. In summary, a hash function takes a blob of
arbitrary length and produces a small "fingerprint" with a fixed length.
Cryptographic hash functions such as SHA-1 are designed so that different blobs
are overwhelmingly likely to produce different fingerprints (so likely that Git
treats it as guaranteed). Let's try some strings to see an example:

```
$ echo -n this is cool | sha1sum
60f51187e76a9de0ff3df31f051bde04da2da891

$ echo -n this is cooler | sha1sum
f3c953b792f9ab39d1be0bdab7ab5f8350593004
```

You can see that hashing the phrases "this is cool" and "this is cooler" gives
completely different hashes even though the difference between the phrases is
small.
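
The same experiment can be done from Python, which is the language ugit is
written in. This is only a small sketch: the helper name `sha1_hex` is made up
for illustration. Since `echo -n` omits the trailing newline, the bytes hashed
here are the same as in the shell commands, so the fingerprints should match.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 fingerprint of `text` as 40 hexadecimal characters.

    `sha1_hex` is a hypothetical helper name used only for this example.
    """
    return hashlib.sha1(text.encode()).hexdigest()


cool = sha1_hex("this is cool")
cooler = sha1_hex("this is cooler")

print(cool)
print(cooler)
# The fingerprint length is fixed, no matter how long the input is.
print(len(cool), len(cooler))
```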

We're going to use the hash as the name of the object (we'll call this name an
"OID" - object ID).

So the flow of the command hash-object is:

  + Get the path of the file to store.
  + Read the file.
  + Hash the content of the file using SHA-1.
  + Store the file under ".ugit/objects/{the SHA-1 hash}".
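
The steps above can be sketched in Python roughly as follows. This is a minimal
sketch, not necessarily how ugit ends up structuring it: the function name
`hash_object`, the `GIT_DIR` constant and the `git_dir` parameter are
assumptions made for this example.

```python
import hashlib
import os

GIT_DIR = ".ugit"  # assumed: the directory created by 'ugit init'


def hash_object(path, git_dir=GIT_DIR):
    """Store the file at `path` in the object database and return its OID."""
    # Read the file as raw bytes; the object database doesn't interpret them.
    with open(path, "rb") as f:
        data = f.read()

    # The OID is the SHA-1 hash of the content.
    oid = hashlib.sha1(data).hexdigest()

    # Store the content under "<git_dir>/objects/<oid>".
    objects_dir = os.path.join(git_dir, "objects")
    os.makedirs(objects_dir, exist_ok=True)
    with open(os.path.join(objects_dir, oid), "wb") as out:
        out.write(data)

    return oid
```

Calling `hash_object("some_file.txt")` from the directory where '.ugit' lives
would return the OID, which is all that's needed to find the object again later.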

This type of storage is called content-addressable storage because the "address"
that we use to find a blob is based on the content of the blob itself. (This is
in contrast to name-addressable storage, such as a typical filesystem, where you
address a particular file by its name, regardless of its content.)
Content-addressable storage has nice properties when synchronizing data between
different computers: if two repositories have an object with the same OID, we
can be sure that they are the same object. Also, since two different objects are
practically guaranteed to have different OIDs, we can't have naming clashes
between objects.
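
To make these properties concrete, here is a toy in-memory store. The
`ObjectStore` class is hypothetical and exists only for this illustration (ugit
itself stores objects as files); it shows that two independent "repositories"
agree on the OID of identical content and that storing the same content twice
doesn't create a second object.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store kept in a dict (illustration only)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        oid = hashlib.sha1(data).hexdigest()
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

    def __len__(self):
        return len(self._objects)


repo_a = ObjectStore()
repo_b = ObjectStore()

oid_a = repo_a.put(b"hello")
oid_b = repo_b.put(b"hello")
print(oid_a == oid_b)  # True: same content, same OID in both stores

repo_a.put(b"hello")   # storing identical content again
print(len(repo_a))     # 1: no duplicate object was created

other = repo_a.put(b"hello!")
print(other != oid_a)  # True: different content gets a different OID
```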

When real Git stores objects it does a few extra things: it writes the size of
the object to the file as well, compresses the objects, and divides them into
256 directories. The split into directories avoids having directories with a
huge number of files, which can hurt performance. For simplicity, we're not
going to do any of this in ugit.
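
For the curious, here is a rough sketch of what those extra steps look like for
a file's content, based on my understanding of how Git stores loose "blob"
objects: a header with the type and size is prepended, the hash is computed
over header plus content, the result is zlib-compressed, and the first two hex
characters of the OID pick one of the 256 directories. The function names and
the `objects_dir` parameter are made up for this example, and ugit will not use
this layout.

```python
import hashlib
import os
import zlib


def git_style_oid(data: bytes) -> str:
    """Hash `data` the way Git hashes a blob: header + content."""
    header = b"blob %d\0" % len(data)  # object type, size, then a NUL byte
    return hashlib.sha1(header + data).hexdigest()


def git_style_store(data: bytes, objects_dir: str) -> str:
    """Write `data` as a compressed object under objects_dir/xx/yyyy..."""
    full = b"blob %d\0" % len(data) + data
    oid = hashlib.sha1(full).hexdigest()

    # Two hex characters give 16 * 16 = 256 possible subdirectories.
    subdir = os.path.join(objects_dir, oid[:2])
    os.makedirs(subdir, exist_ok=True)
    with open(os.path.join(subdir, oid[2:]), "wb") as out:
        out.write(zlib.compress(full))

    return oid
```

Because of the header, the OID that real Git computes for a file differs from
the plain SHA-1 of its content, which is what ugit uses at this stage.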
Add change 04 instructions 2024-02-12 19:33:37 +01:00