Add change 04 instructions

2024-02-12 19:33:37 +01:00
parent 1f7354666b
commit c647f99e5c
1 changed files with 63 additions and 0 deletions
--- a/how_to/Change_04.md
+++ b/how_to/Change_04.md
@@ -0,0 +1,63 @@
+- hash-object: Save object
+
+Let's create the first non-trivial command. This command will take a file and
+store it in our '.ugit' directory for later retrieval. In Git's lingo, this
+feature is called "the object database". It allows us to store and retrieve
+arbitrary blobs, which are called "objects". As far as the Object Database is
+concerned, the content of the object doesn't have any meaning (just like a
+filesystem doesn't care about the internal structure of a file).
+
+Because this command needs the '.ugit' directory, it must be run from the same
+directory where you did 'ugit init'.
+
+Note that this is a very low-level Git building block. We're not yet talking
+about versions, commits, or any of the other things you might have heard about;
+we're just talking about an interface for storing some raw bytes.
+
+So we can store an object, but how would we refer to it later? We could ask the
+user to provide a name along with the object and retrieve the object later using
+the name, but there is a nicer way: We can refer to the object using its hash.
+
+If you haven't heard about hashes and hash functions, I suggest that you pause
+and do some reading on them. In summary, a hash function takes a blob of
+arbitrary length and produces a small "fingerprint" with a fixed length. With a
+cryptographic hash function such as SHA-1, different blobs are very very very
+likely to produce different fingerprints (so likely that Git treats it as
+guaranteed). Let's try some strings to see an example:
+
+```
+$ echo -n this is cool | sha1sum
+60f51187e76a9de0ff3df31f051bde04da2da891
+
+$ echo -n this is cooler | sha1sum
+f3c953b792f9ab39d1be0bdab7ab5f8350593004
+```
+
+You can see that hashing the phrases "this is cool" and "this is cooler" gives
+completely different hashes even though the difference between the phrases is
+small.
+
+We're going to use the hash as the name of the object (we'll call this name an
+"OID", short for object ID).
+
+So the flow of the command hash-object is:
+
+  + Get the path of the file to store.
+  + Read the file.
+  + Hash the content of the file using SHA-1.
+  + Store the file under ".ugit/objects/{the SHA-1 hash}".
+
+This type of storage is called content-addressable storage because the "address"
+that we use to find a blob is based on the content of the blob itself. (This is
+in contrast to name-addressable storage, such as a typical filesystem, where you
+address a particular file by its name, regardless of its content.)
+Content-addressable storage has nice properties when synchronizing data between
+different computers - if two repositories have an object with the same OID we
+can be sure that they are the same object. Also since two different objects are
+practically guaranteed to have different OIDs, we can't have naming clashes
+between objects.
+
+When real Git stores objects it does a few extra things, such as writing the
+size of the object to the file as well, compressing the result and dividing the
+objects into 256 directories (named after the first two hex characters of the
+OID). The division into directories avoids having directories with a huge
+number of files, which can hurt performance. For simplicity, we're not going
+to do any of this in ugit.
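
For the curious, here is a rough sketch of those extra steps for a file
("blob") in a SHA-1 Git repository: Git hashes a small header of the form
`blob <size>\0` followed by the content, compresses header and content with
zlib, and uses the first two hex characters of the OID as the directory name.
The function name `git_blob_encoding` is made up for illustration, and the
sketch only computes values without touching any repository.

```python
import hashlib
import zlib


def git_blob_encoding(data: bytes):
    """Return (oid, relative path, compressed bytes) for a Git blob."""
    # Header: object type, a space, the size in decimal, then a NUL byte.
    header = b"blob " + str(len(data)).encode() + b"\0"
    store = header + data

    # The OID is the SHA-1 of header plus content, not of the content alone.
    oid = hashlib.sha1(store).hexdigest()

    # Two hex characters give 16 * 16 = 256 possible directories.
    rel_path = f"objects/{oid[:2]}/{oid[2:]}"

    return oid, rel_path, zlib.compress(store)
```

Because of the header, the OID Git reports for a file differs from the plain
SHA-1 of its content. An empty file, for example, gets the OID
`e69de29bb2d1d6434b8b29ae775ad8c2e48c5391` in Git.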