Save hash object

Add change 04 instructions
2024-02-12 19:35:14 +01:00 · 2024-02-12 19:33:37 +01:00
3 changed files with 82 additions and 0 deletions
--- a/how_to/Change_04.md
+++ b/how_to/Change_04.md
@@ -0,0 +1,63 @@
+- hash-object: Save object
+
+Let's create the first non-trivial command. This command will take a file and
+store it in our '.ugit' directory for later retrieval. In Git's lingo, this
+feature is called "the object database". It allows us to store and retrieve
+arbitrary blobs, which are called "objects". As far as the object database is
+concerned, the content of the object doesn't have any meaning (just like a
+filesystem doesn't care about the internal structure of a file).
+
+Because this command needs the '.ugit' directory, it must be run from the same
+directory where you did 'ugit init'.
+
+Note that this is a very low-level Git building block. We're not talking yet
+about versions, commits, or any of the other things you might have heard about;
+we're just talking about an interface for storing some raw bytes.
+
+So we can store an object, but how would we refer to it later? We could ask the
+user to provide a name along with the object and retrieve the object later using
+the name, but there is a nicer way: We can refer to the object using its hash.
+
+If you haven't heard about hashes and hash functions, I suggest that you pause
+and do some reading on them. In summary, a hash function takes a blob of
+arbitrary length and produces a small, fixed-length "fingerprint". With hash
+functions such as SHA-1, different blobs are overwhelmingly likely to produce
+different fingerprints (so likely that Git assumes it's guaranteed). Let's try
+some strings to see an example:
+
+```
+$ echo -n this is cool | sha1sum
+60f51187e76a9de0ff3df31f051bde04da2da891
+
+$ echo -n this is cooler | sha1sum
+f3c953b792f9ab39d1be0bdab7ab5f8350593004
+```
+
+You can see that hashing the phrases "this is cool" and "this is cooler" gives
+completely different hashes even though the difference between the phrases is
+small.
+
+We're going to use the hash as the name of the object (we'll call this name an
+"OID", short for object ID).
+
+So the flow of the command hash-object is:
+
+  + Get the path of the file to store.
+  + Read the file.
+  + Hash the content of the file using SHA-1.
+  + Store the file under ".ugit/objects/{the SHA-1 hash}".
+
+This type of storage is called content-addressable storage because the "address"
+that we use to find a blob is based on the content of the blob itself. (Contrast
+this with name-addressable storage, such as a typical filesystem, where you
+address a particular file by its name, regardless of its content.)
+Content-addressable storage has nice properties when synchronizing data between
+different computers: if two repositories have an object with the same OID, we
+can be sure that they are the same object. Also, since two different objects are
+practically guaranteed to have different OIDs, we can't have naming clashes
+between objects.
+
+When real Git stores objects it does a few extra things: it writes the size of
+the object to the file, compresses the content, and divides the objects into 256
+directories. The split avoids directories with a huge number of files, which
+can hurt performance. For simplicity, we're not going to do any of this in ugit.
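
For comparison, here is a rough sketch of the extra steps real Git performs when it stores a blob, as listed at the end of Change_04.md: a header carrying the type and size, zlib compression, and a fan-out into 256 subdirectories keyed by the first two hex digits of the OID. The helper name `git_style_store` and the `objects_dir` parameter are made up for this illustration and are not part of ugit.

```python
import hashlib
import zlib
from pathlib import Path


def git_style_store(content: bytes, objects_dir: Path) -> str:
    """Store content roughly the way real Git stores a blob (hypothetical helper)."""
    # Git prepends a header with the object type and size, ended by a NUL byte.
    header = f"blob {len(content)}".encode() + b"\x00"
    payload = header + content

    # The OID is the SHA-1 of header + content, not of the content alone.
    oid = hashlib.sha1(payload).hexdigest()

    # The first two hex digits pick one of 256 subdirectories;
    # the remaining 38 digits become the file name.
    target = objects_dir / oid[:2] / oid[2:]
    target.parent.mkdir(parents=True, exist_ok=True)

    # On disk, the payload is zlib-compressed.
    target.write_bytes(zlib.compress(payload))
    return oid
```

Because the header is hashed together with the content, the resulting OID differs from what `sha1sum` prints for the same file; ugit hashes the raw content only.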
--- a/ugit/cli.py
+++ b/ugit/cli.py
@@ -18,9 +18,18 @@ def parse_args():
     init_parser = commands.add_parser("init")
     init_parser.set_defaults(func=init)

+    hash_object_parser = commands.add_parser("hash-object")
+    hash_object_parser.set_defaults(func=hash_object)
+    hash_object_parser.add_argument("file")
+
     return parser.parse_args()


 def init(args):
     data.init()
     print(f"Initialized empty ugit repository in {Path.cwd()}/{data.GIT_DIR}")
+
+
+def hash_object(args):
+    with open(args.file, "rb") as f:
+        print(data.hash_object(f.read()))
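
The hunk above shows only part of `parse_args`, so the subcommand dispatch may be easier to follow in a standalone form. This is a minimal sketch, assuming the parser is built with `argparse` subparsers; the `argv` parameter, the `dest="command"` setting, and the placeholder handler are assumptions added for illustration, since the top of the function is not visible in the diff.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()

    # Each subcommand gets its own parser; `dest` records which one was chosen.
    commands = parser.add_subparsers(dest="command")
    commands.required = True

    hash_object_parser = commands.add_parser("hash-object")
    # set_defaults attaches the handler, so the caller can run args.func(args).
    hash_object_parser.set_defaults(func=hash_object)
    hash_object_parser.add_argument("file")

    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


def hash_object(args):
    # Placeholder handler for this sketch; the handler in the diff reads the
    # file in binary mode and prints the OID returned by data.hash_object.
    return f"would hash {args.file}"
```

With this wiring, the entry point does not need to know which subcommand was selected: it parses the arguments and calls `args.func(args)`.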
--- a/ugit/data.py
+++ b/ugit/data.py
@@ -1,7 +1,17 @@
 from pathlib import Path

+import hashlib
+
 GIT_DIR = ".ugit"


 def init():
     Path(GIT_DIR).mkdir()
+    Path(f"{GIT_DIR}/objects").mkdir()
+
+
+def hash_object(data):
+    oid = hashlib.sha1(data).hexdigest()
+    with open(f"{GIT_DIR}/objects/{oid}", "wb") as out:
+        out.write(data)
+    return oid
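
To see the content-addressable behaviour described in Change_04.md, the following sketch runs the same hashing logic against a temporary directory. The `objects_dir` parameter is an assumption added so the example does not depend on an initialized `.ugit` directory, and `get_object` is a hypothetical read counterpart that this change does not include.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_object(data: bytes, objects_dir: Path) -> str:
    # Same logic as the diff, with the target directory passed in explicitly.
    oid = hashlib.sha1(data).hexdigest()
    (objects_dir / oid).write_bytes(data)
    return oid


def get_object(oid: str, objects_dir: Path) -> bytes:
    # Hypothetical read counterpart: look the blob up by its OID.
    return (objects_dir / oid).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp)
    first = hash_object(b"this is cool", objects_dir)
    second = hash_object(b"this is cool", objects_dir)
    other = hash_object(b"this is cooler", objects_dir)

    same_oid = first == second   # identical content yields the identical OID
    distinct = first != other    # different content yields a different OID
    # Storing the same content twice rewrites one file instead of adding another.
    stored_files = len(list(objects_dir.iterdir()))
    round_trip = get_object(first, objects_dir)
```

Storing identical content twice leaves a single file on disk, which is the deduplication property that makes synchronizing repositories by OID straightforward.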
Author	SHA1	Message	Date
daviddoji	71abdf3454	Save hash object	2024-02-12 19:35:14 +01:00
daviddoji	c647f99e5c	Add change 04 instructions	2024-02-12 19:33:37 +01:00