Save hash object

Add change 04 instructions
2024-02-12 19:35:14 +01:00 · 2024-02-12 19:33:37 +01:00
3 changed files with 82 additions and 0 deletions
--- a/how_to/Change_04.md
+++ b/how_to/Change_04.md
@@ -0,0 +1,63 @@
 - hash-object: Save object
 Let's create the first non-trivial command. This command will take a file and
 store it in our '.ugit' directory for later retrieval. In Git's lingo, this
 feature is called "the object database". It allows us to store and retrieve
 arbitrary blobs, which are called "objects". As far as the Object Database is
 concerned, the content of the object doesn't have any meaning (just like a
 filesystem doesn't care about the internal structure of a file).
 Because this command needs the '.ugit' directory, it must be run from the same
 directory where you did 'ugit init'.
 Note that this is a very low-level Git building block. We're not talking yet
 about versions, commits, or any of the other things you might have heard about;
 we're just talking about an interface for storing some raw bytes.
 So we can store an object, but how would we refer to it later? We could ask the
 user to provide a name along with the object and retrieve the object later using
 the name, but there is a nicer way: We can refer to the object using its hash.
 If you haven't heard about hashes and hash functions, I suggest that you pause
 and do some reading on them. In summary, a hash function takes a blob of
 arbitrary length and produces a small "fingerprint" with a fixed length.
 Cryptographic hash functions such as SHA-1 are designed so that different blobs
 are overwhelmingly likely to produce different fingerprints (so likely that Git
 treats it as guaranteed). Let's try some strings to see an example:
 ```
 $ echo -n this is cool | sha1sum
 60f51187e76a9de0ff3df31f051bde04da2da891
 $ echo -n this is cooler | sha1sum
 f3c953b792f9ab39d1be0bdab7ab5f8350593004
 ```
 You can see that hashing the phrases "this is cool" and "this is cooler" gives
 completely different hashes even though the difference between the phrases is
 small.
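
The same experiment can be reproduced from Python with the standard library's `hashlib` module. This is a minimal sketch; the helper name `fingerprint` is made up for illustration and is not part of ugit.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the SHA-1 digest of `text` (UTF-8 encoded) as 40 hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


cool = fingerprint("this is cool")
cooler = fingerprint("this is cooler")

print(cool)
print(cooler)
# The digest length is fixed regardless of the input length...
print(len(cool), len(cooler))
# ...and hashing the same input again gives the same digest.
print(fingerprint("this is cool") == cool)
```

Note that `echo -n` in the shell example suppresses the trailing newline, which is why the Python strings have no `\n` at the end; a single extra byte would change the whole digest.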
 We're going to use the hash as the name of the object (we'll call this name an
 "OID", short for object ID).
 So the flow of the command hash-object is:
  + Get the path of the file to store.
  + Read the file.
  + Hash the content of the file using SHA-1.
  + Store the file under ".ugit/objects/{the SHA-1 hash}".
 This type of storage is called content-addressable storage because the "address"
 that we use to find a blob is based on the content of the blob itself. This is in
 contrast to name-addressable storage, such as a typical filesystem, where you
 address a particular file by its name, regardless of its content.
 Content-addressable storage has nice properties when synchronizing data between
 different computers - if two repositories have an object with the same OID, we
 can be sure that it is the same object. Also, since two different objects are
 practically guaranteed to have different OIDs, we can't have naming clashes
 between objects.
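
To make these properties concrete, here is a small in-memory sketch of a content-addressable store. The `ObjectStore` class and its methods are hypothetical and only illustrate the idea; ugit itself stores objects as files under '.ugit/objects'.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: a dict keyed by the SHA-1 of each value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()
        # Storing the same content twice rewrites the entry with identical
        # bytes, so duplicates never take up a second slot.
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

    def oids(self) -> set:
        return set(self._objects)


local, remote = ObjectStore(), ObjectStore()
oid_a = local.put(b"first blob")
oid_b = local.put(b"second blob")
remote.put(b"first blob")

# The same content maps to the same OID in either store.
same_object = remote.get(oid_a) == local.get(oid_a)
# Synchronizing only needs the objects whose OIDs the other side lacks.
missing = local.oids() - remote.oids()
print(same_object, missing == {oid_b})
```

Comparing sets of OIDs is enough to decide what to transfer here; the content itself never has to be compared.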
 When real Git stores objects it does a few extra things: it prepends a small
 header containing the object's type and size, compresses the result with zlib,
 and spreads the objects across up to 256 subdirectories named after the first
 two hex characters of the OID. The subdirectories avoid having a single
 directory with a huge number of files, which can hurt performance. For
 simplicity, we're not going to do any of this in ugit.
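
For comparison, the sketch below shows roughly how real Git names and lays out a loose blob: it hashes a `blob <size>` header, a NUL byte, and then the content; it compresses those same bytes with zlib; and it uses the first two hex characters of the OID as a subdirectory. The function names are made up for this illustration, and ugit does none of this.

```python
import hashlib
import zlib
from pathlib import Path


def git_blob_oid(data: bytes) -> str:
    """Hash `data` the way Git hashes a blob: header + NUL byte + content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def git_blob_path(oid: str, git_dir: str = ".git") -> Path:
    """Loose-object path: first 2 hex chars name the directory, the rest the file."""
    return Path(git_dir) / "objects" / oid[:2] / oid[2:]


def git_blob_bytes(data: bytes) -> bytes:
    """Bytes written to disk: the zlib-compressed header and content."""
    return zlib.compress(b"blob %d\x00" % len(data) + data)


oid = git_blob_oid(b"")
print(oid)
print(git_blob_path(oid))
```

Because of the header, Git's OID for a file differs from the plain SHA-1 of its content, so the OIDs printed by ugit will not match those of `git hash-object` for the same file.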
--- a/ugit/cli.py
+++ b/ugit/cli.py
@@ -18,9 +18,18 @@ def parse_args():
    init_parser = commands.add_parser("init")
    init_parser.set_defaults(func=init)
    hash_object_parser = commands.add_parser("hash-object")
    hash_object_parser.set_defaults(func=hash_object)
    hash_object_parser.add_argument("file")
    return parser.parse_args()
 def init(args):
    data.init()
    print(f"Initialized empty ugit repository in {Path.cwd()}/{data.GIT_DIR}")
 def hash_object(args):
    with open(args.file, "rb") as f:
        print(data.hash_object(f.read()))
--- a/ugit/data.py
+++ b/ugit/data.py
@@ -1,7 +1,17 @@
 from pathlib import Path
 import hashlib
 GIT_DIR = ".ugit"
 def init():
    Path(GIT_DIR).mkdir()
    Path(GIT_DIR, "objects").mkdir()
 def hash_object(data):
    oid = hashlib.sha1(data).hexdigest()
    with open(f"{GIT_DIR}/objects/{oid}", "wb") as out:
        out.write(data)
    return oid
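
A quick way to exercise `data.hash_object` is to store a blob in a scratch directory and read it back. The sketch below redefines the function with an explicit `git_dir` parameter (an assumption made so that it runs on its own, outside a ugit repository) and checks that the file name matches the SHA-1 of the file's content.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_object(data: bytes, git_dir: Path) -> str:
    """Same logic as data.hash_object, but with the repository path passed in."""
    oid = hashlib.sha1(data).hexdigest()
    (git_dir / "objects" / oid).write_bytes(data)
    return oid


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".ugit"
    (git_dir / "objects").mkdir(parents=True)

    content = b"some raw bytes"
    oid = hash_object(content, git_dir)

    stored = (git_dir / "objects" / oid).read_bytes()
    round_trip_ok = stored == content
    # Re-hashing the stored bytes verifies that the object is intact.
    name_matches = hashlib.sha1(stored).hexdigest() == oid

print(round_trip_ok, name_matches)
```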
Author	SHA1	Message	Date
daviddoji	71abdf3454	Save hash object	2024-02-12 19:35:14 +01:00
daviddoji	c647f99e5c	Add change 04 instructions	2024-02-12 19:33:37 +01:00