Compare commits

2 Commits

Author SHA1 Message Date
David Doblas Jiménez 71abdf3454 Save hash object 2024-02-12 19:35:14 +01:00
David Doblas Jiménez c647f99e5c Add change 04 instructions 2024-02-12 19:33:37 +01:00
3 changed files with 82 additions and 0 deletions

how_to/Change_04.md Normal file
@@ -0,0 +1,63 @@
- hash-object: Save object
Let's create the first non-trivial command. This command will take a file and
store it in our '.ugit' directory for later retrieval. In Git's lingo, this
feature is called "the object database". It allows us to store and retrieve
arbitrary blobs, which are called "objects". As far as the object database is
concerned, the content of the object doesn't have any meaning (just like a
filesystem doesn't care about the internal structure of a file).
Because this command needs the '.ugit' directory, it must be run from the same
directory where you ran 'ugit init'.
Note that this is a very low-level Git building block. We're not talking yet
about versions or commits or any of the other things that you might have heard
about; we're just talking about an interface for storing some raw bytes.
So we can store an object, but how would we refer to it later? We could ask the
user to provide a name along with the object and retrieve the object later using
the name, but there is a nicer way: We can refer to the object using its hash.
If you haven't heard about hashes and hash functions, I suggest that you pause
and do some reading on them. In summary, a hash function takes a blob of
arbitrary length and produces a small "fingerprint" with a fixed length.
Cryptographic hash functions such as SHA-1 are designed so that different blobs
are overwhelmingly likely to produce different fingerprints (so likely that Git
treats it as guaranteed). Let's try some strings to see an example:
```
$ echo -n this is cool | sha1sum
60f51187e76a9de0ff3df31f051bde04da2da891
$ echo -n this is cooler | sha1sum
f3c953b792f9ab39d1be0bdab7ab5f8350593004
```
You can see that hashing the phrases "this is cool" and "this is cooler" gives
completely different hashes even though the difference between the phrases is
small.
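The same experiment can be reproduced in Python with the standard `hashlib` module. This is only a minimal sketch; the helper name `fingerprint` is made up for illustration and is not part of ugit.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the SHA-1 hex digest of the UTF-8 encoded text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


cool = fingerprint("this is cool")
cooler = fingerprint("this is cooler")

print(cool)
print(cooler)

# A SHA-1 digest is always 160 bits (40 hex characters),
# no matter how long the input is.
print(len(cool), len(cooler), len(fingerprint("x" * 10_000)))
```

The last line prints the same length three times, which is the "fixed length" property described above.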
We're going to use the hash as the name of the object (we'll call this name an
"OID", short for object ID).
So the flow of the command hash-object is:
+ Get the path of the file to store.
+ Read the file.
+ Hash the content of the file using SHA-1.
+ Store the file under ".ugit/objects/{the SHA-1 hash}".
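
The four steps above can be sketched as a single function. This is an illustrative sketch, not the ugit code itself: the name `store_file` and the temporary directory layout are assumptions made so the example can run on its own.

```python
import hashlib
import tempfile
from pathlib import Path


def store_file(path: Path, objects_dir: Path) -> str:
    """Store the file at `path` under its SHA-1 hash and return the OID."""
    content = path.read_bytes()               # read the file
    oid = hashlib.sha1(content).hexdigest()   # hash its content
    (objects_dir / oid).write_bytes(content)  # store it under the OID
    return oid


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical stand-in for a repository created by 'ugit init'.
    objects_dir = root / ".ugit" / "objects"
    objects_dir.mkdir(parents=True)

    source = root / "hello.txt"
    source.write_bytes(b"this is cool")

    oid = store_file(source, objects_dir)
    stored = (objects_dir / oid).read_bytes()
    print(oid, stored == b"this is cool")
```

Reading the object back only needs the OID, which is why the command prints it.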
This type of storage is called content-addressable storage because the "address"
that we use to find a blob is based on the content of the blob itself. This is
in contrast to name-addressable storage, such as a typical filesystem, where you
address a particular file by its name, regardless of its content.
Content-addressable storage has nice properties when synchronizing data between
different computers - if two repositories have an object with the same OID, we
can be sure that they are the same object. Also, since two different objects are
practically guaranteed to have different OIDs, we can't have naming clashes
between objects.
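
To make these properties concrete, here is a toy in-memory store. It is a sketch under stated assumptions: `ObjectStore`, `put` and `missing_from` are hypothetical names used only for illustration, and a dictionary stands in for the '.ugit/objects' directory.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self.objects = {}  # maps OID (str) -> content (bytes)

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data  # same content always lands on the same key
        return oid

    def missing_from(self, other: "ObjectStore") -> set:
        """Return the OIDs that `other` has and this store lacks."""
        return set(other.objects) - set(self.objects)


repo_a, repo_b = ObjectStore(), ObjectStore()
shared = repo_a.put(b"shared content")
same_oid = repo_b.put(b"shared content") == shared  # same bytes, same OID
only_in_b = repo_b.put(b"only in b")

# Synchronizing only needs to copy the objects whose OIDs are missing.
to_fetch = repo_a.missing_from(repo_b)
for oid in to_fetch:
    repo_a.objects[oid] = repo_b.objects[oid]

print(same_oid, to_fetch == {only_in_b}, repo_a.objects == repo_b.objects)
```

Because the key is derived from the content, comparing two repositories reduces to comparing sets of OIDs.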
When real Git stores objects it does a few extra things, such as writing the
size of the object to the file as well, compressing the objects, and dividing
them into 256 directories. The directory split avoids having directories with
a huge number of files, which can hurt performance. For simplicity, we're not
going to do any of this in ugit.
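
For the curious, the extra steps can be sketched roughly as follows. This is not part of ugit and is only an approximation of Git's loose-object layout: the function name `git_style_store` is hypothetical, and the sketch assumes a `blob <size>` header followed by a NUL byte, zlib compression, and a subdirectory named after the first two hex characters of the OID.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def git_style_store(content: bytes, objects_dir: Path) -> str:
    """Store `content` roughly the way Git stores a blob (sketch only)."""
    # Prepend a header with the object type and the size in bytes.
    payload = b"blob " + str(len(content)).encode() + b"\x00" + content
    oid = hashlib.sha1(payload).hexdigest()

    # The first two hex characters select one of 256 subdirectories.
    target = objects_dir / oid[:2] / oid[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(zlib.compress(payload))
    return oid


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    oid = git_style_store(b"this is cool", objects_dir)
    stored_path = objects_dir / oid[:2] / oid[2:]
    restored = zlib.decompress(stored_path.read_bytes())
    print(oid[:2], oid[2:], restored)
```

Note that in this scheme the hash covers the header as well as the content, so the OIDs differ from the plain SHA-1 of the file that ugit uses.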

@@ -18,9 +18,18 @@ def parse_args():
    init_parser = commands.add_parser("init")
    init_parser.set_defaults(func=init)

    hash_object_parser = commands.add_parser("hash-object")
    hash_object_parser.set_defaults(func=hash_object)
    hash_object_parser.add_argument("file")

    return parser.parse_args()


def init(args):
    data.init()
    print(f"Initialized empty ugit repository in {Path.cwd()}/{data.GIT_DIR}")


def hash_object(args):
    with open(args.file, "rb") as f:
        print(data.hash_object(f.read()))
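
The hunk above shows only part of `parse_args()`. The following self-contained sketch shows how the sub-command wiring dispatches to a handler. It rests on assumptions: the way `parser` and `commands` are created is not visible in the diff, so the `ArgumentParser` and `add_subparsers` calls are a guess, and the two handlers are stand-ins that return strings instead of touching the repository.

```python
import argparse


def init(args):
    # Stand-in handler; the real one calls data.init().
    return "init"


def hash_object(args):
    # Stand-in handler; the real one reads the file and stores it.
    return f"hash-object {args.file}"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="ugit")
    # Assumed setup: one sub-parser per command, and a command is required.
    commands = parser.add_subparsers(dest="command", required=True)

    init_parser = commands.add_parser("init")
    init_parser.set_defaults(func=init)

    hash_object_parser = commands.add_parser("hash-object")
    hash_object_parser.set_defaults(func=hash_object)
    hash_object_parser.add_argument("file")

    return parser.parse_args(argv)


# Dispatch: call whichever handler the chosen sub-command registered.
args = parse_args(["hash-object", "README.md"])
result = args.func(args)
print(result)
```

Each sub-parser stores its handler in `func`, so adding a new command means adding a parser and a function without changing the dispatch line.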

@@ -1,7 +1,17 @@
from pathlib import Path
import hashlib

GIT_DIR = ".ugit"


def init():
    # mkdir() is an instance method, so build Path objects first.
    Path(GIT_DIR).mkdir()
    Path(f"{GIT_DIR}/objects").mkdir()


def hash_object(data):
    # The OID is the SHA-1 hex digest of the raw bytes.
    oid = hashlib.sha1(data).hexdigest()
    with open(f"{GIT_DIR}/objects/{oid}", "wb") as out:
        out.write(data)
    return oid