Compare commits
2 Commits
1f7354666b
...
71abdf3454
| Author | SHA1 | Date |
|---|---|---|
| | 71abdf3454 | |
| | c647f99e5c | |
63
how_to/Change_04.md
Normal file
@@ -0,0 +1,63 @@
- hash-object: Save object

Let's create the first non-trivial command. This command will take a file and
store it in our '.ugit' directory for later retrieval. In Git's lingo, this
feature is called "the object database". It allows us to store and retrieve
arbitrary blobs, which are called "objects". As far as the object database is
concerned, the content of an object doesn't have any meaning (just like a
filesystem doesn't care about the internal structure of a file).

Because this command needs the '.ugit' directory, it must be run from the same
directory where you ran 'ugit init'.

Note that this is a very low-level Git building block. We're not talking yet
about versions or commits or any of the other things that you might have heard
about; we're just talking about an interface for storing some raw bytes.

So we can store an object, but how would we refer to it later? We could ask the
user to provide a name along with the object and retrieve the object later using
that name, but there is a nicer way: we can refer to the object using its hash.

If you haven't heard about hashes and hash functions, I suggest that you pause
and do some reading on them. In summary, a hash function takes a blob of
arbitrary length and produces a small "fingerprint" with a fixed length.
Cryptographic hash functions such as SHA-1 are designed so that different blobs
are very, very likely to produce different fingerprints (so likely that Git
treats it as guaranteed). Let's try some strings to see an example:

```
$ echo -n this is cool | sha1sum
60f51187e76a9de0ff3df31f051bde04da2da891

$ echo -n this is cooler | sha1sum
f3c953b792f9ab39d1be0bdab7ab5f8350593004
```

You can see that hashing the phrases "this is cool" and "this is cooler" gives
completely different hashes even though the difference between the phrases is
small.

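The same experiment can be run from Python with the standard `hashlib` module,
which is also what ugit's `data.py` uses. This is only a minimal sketch; the
`fingerprint` helper is a name made up for this illustration.

```python
import hashlib


def fingerprint(text):
    """Return the SHA-1 hex digest of the UTF-8 encoding of text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


cool = fingerprint("this is cool")
cooler = fingerprint("this is cooler")

print(cool)
print(cooler)

# A SHA-1 digest is always 160 bits (40 hex characters), whatever the input size.
print(len(cool), len(cooler), len(fingerprint("x" * 100_000)))
```

Note that the text is encoded to bytes first: hash functions operate on raw
bytes, which is why ugit reads files in binary mode.
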
We're going to use the hash as the name of the object (we'll call this name an
"OID", short for object ID).

So the flow of the hash-object command is:

+ Get the path of the file to store.
+ Read the file.
+ Hash the content of the file using SHA-1.
+ Store the file under ".ugit/objects/{the SHA-1 hash}".

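The steps above can be sketched as a small round trip: store some bytes, get
an OID back, and use that OID to read the bytes again. This is an illustration
under assumptions rather than ugit's actual code: it works in a throwaway
temporary directory, takes the objects directory as a parameter, and
`get_object` is a hypothetical helper that this change does not add.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_object(objects_dir, data):
    """Store raw bytes under their SHA-1 hash and return that hash (the OID)."""
    oid = hashlib.sha1(data).hexdigest()
    (objects_dir / oid).write_bytes(data)
    return oid


def get_object(objects_dir, oid):
    """Hypothetical counterpart: read back the bytes stored under an OID."""
    return (objects_dir / oid).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ".ugit/objects", created inside a throwaway directory.
    objects_dir = Path(tmp) / ".ugit" / "objects"
    objects_dir.mkdir(parents=True)

    # Stand-in for the file whose path the user passes on the command line.
    source = Path(tmp) / "example.txt"
    source.write_bytes(b"this is cool")

    oid = hash_object(objects_dir, source.read_bytes())
    restored = get_object(objects_dir, oid)
    print(oid, restored == b"this is cool")
```
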
This type of storage is called content-addressable storage because the "address"
that we use to find a blob is based on the content of the blob itself. (This is
in contrast to name-addressable storage, such as a typical filesystem, where you
address a particular file by its name, regardless of its content.)
Content-addressable storage has nice properties when synchronizing data between
different computers: if two repositories have an object with the same OID, we
can be sure that they are the same object. Also, since two different objects are
practically guaranteed to have different OIDs, we can't have naming clashes
between objects.

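A tiny in-memory model makes these properties concrete. It is a sketch only:
the dictionary stands in for the objects directory, and `put` and `is_intact`
are names invented for this example.

```python
import hashlib

store = {}  # Maps OID -> content; a stand-in for the objects directory.


def put(content):
    """Store content under its SHA-1 hash and return the OID."""
    oid = hashlib.sha1(content).hexdigest()
    store[oid] = content
    return oid


def is_intact(oid):
    """Check that the stored content still hashes to the OID it is filed under."""
    return hashlib.sha1(store[oid]).hexdigest() == oid


first = put(b"same bytes")
second = put(b"same bytes")
other = put(b"different bytes")

print(first == second)   # True: identical content always gets the same OID
print(first == other)    # False: different content gets a different OID
print(len(store))        # 2: the duplicate was stored only once
print(is_intact(first))  # True: anyone can re-hash the content to verify it
```

The last check is what makes synchronization convenient: a receiver does not
have to trust the name it was given, because it can recompute the name from the
content.
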
When real Git stores objects it does a few extra things, such as writing the
size of the object to the file as well, compressing the objects, and dividing
them into 256 directories. The directories exist to avoid having a single
directory with a huge number of files, which can hurt performance. For
simplicity, we're not going to do any of this in ugit.

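For the curious, here is a rough sketch of those extra steps. It assumes Git's
loose-object format for blobs: a `blob <size>` header followed by a NUL byte,
SHA-1 computed over the header plus the content, zlib compression, and the
first two hex characters of the OID used as the directory name. The function
names are made up for illustration, and nothing is written to disk.

```python
import hashlib
import zlib


def git_blob_oid(data):
    """Compute the OID Git would assign to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def git_loose_object(data):
    """Return (relative path, compressed payload) for a blob, without writing it."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    oid = hashlib.sha1(header + data).hexdigest()
    # Two hex characters give 16 * 16 = 256 possible directory names.
    path = f".git/objects/{oid[:2]}/{oid[2:]}"
    return path, zlib.compress(header + data)


# The empty blob has a well-known OID in Git.
print(git_blob_oid(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

path, payload = git_loose_object(b"")
print(path)  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because of the header, ugit's OIDs will differ from Git's for the same file
content; ugit hashes the raw bytes only.
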
@@ -18,9 +18,18 @@ def parse_args():
    init_parser = commands.add_parser("init")
    init_parser.set_defaults(func=init)

    hash_object_parser = commands.add_parser("hash-object")
    hash_object_parser.set_defaults(func=hash_object)
    hash_object_parser.add_argument("file")

    return parser.parse_args()


def init(args):
    data.init()
    print(f"Initialized empty ugit repository in {Path.cwd()}/{data.GIT_DIR}")


def hash_object(args):
    with open(args.file, "rb") as f:
        print(data.hash_object(f.read()))

10
ugit/data.py
@@ -1,7 +1,17 @@
from pathlib import Path

import hashlib

GIT_DIR = ".ugit"


def init():
    Path(GIT_DIR).mkdir()
    Path(f"{GIT_DIR}/objects").mkdir()


def hash_object(data):
    oid = hashlib.sha1(data).hexdigest()
    with open(f"{GIT_DIR}/objects/{oid}", "wb") as out:
        out.write(data)
    return oid