Compare commits: 1f7354666b...71abdf3454
2 Commits: 71abdf3454, c647f99e5c
63
how_to/Change_04.md
Normal file
@@ -0,0 +1,63 @@
- hash-object: Save object

Let's create the first non-trivial command. This command will take a file and
store it in our '.ugit' directory for later retrieval. In Git's lingo, this
feature is called "the object database". It allows us to store and retrieve
arbitrary blobs, which are called "objects". As far as the object database is
concerned, the content of an object doesn't have any meaning (just like a
filesystem doesn't care about the internal structure of a file).

Because this command needs the '.ugit' directory, it must be run from the same
directory where you ran 'ugit init'.

Note that this is a very low-level Git building block. We're not talking yet
about versions or commits or any other things that you might have heard about;
we're just talking about an interface for storing some raw bytes.

So we can store an object, but how would we refer to it later? We could ask the
user to provide a name along with the object and retrieve the object later using
the name, but there is a nicer way: we can refer to the object using its hash.

If you haven't heard about hashes and hash functions, I suggest that you pause
and do some reading on them. In summary, a hash function takes a blob of
arbitrary length and produces a small "fingerprint" with a fixed length. With
some hash functions, such as SHA-1, different blobs are extremely likely to
produce different fingerprints (so likely that Git assumes it's guaranteed).
Let's try some strings to see an example:

```
$ echo -n this is cool | sha1sum
60f51187e76a9de0ff3df31f051bde04da2da891

$ echo -n this is cooler | sha1sum
f3c953b792f9ab39d1be0bdab7ab5f8350593004
```
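
The same experiment can be tried from Python, which is what ugit itself uses for hashing. This is a minimal sketch: the helper name `sha1_hex` is made up for illustration, and only the standard `hashlib` module is assumed.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 fingerprint of data as a hex string."""
    return hashlib.sha1(data).hexdigest()


cool = sha1_hex(b"this is cool")
cooler = sha1_hex(b"this is cooler")

print(cool)
print(cooler)

# The fingerprint length is fixed no matter how long the input is,
# and the two similar phrases do not share a fingerprint.
print(len(cool), len(cooler), cool != cooler)
```

Note that the input has to be bytes, not a string, which is why the file is later opened in binary mode.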

You can see that hashing the phrases "this is cool" and "this is cooler" gives
completely different hashes even though the difference between the phrases is
small.

We're going to use the hash as the name of the object (we'll call this name an
"OID", short for object ID).

So the flow of the command hash-object is:

+ Get the path of the file to store.
+ Read the file.
+ Hash the content of the file using SHA-1.
+ Store the file under ".ugit/objects/{the SHA-1 hash}".

This type of storage is called content-addressable storage because the "address"
that we use to find a blob is based on the content of the blob itself. This is
in contrast to name-addressable storage, such as a typical filesystem, where you
address a particular file by its name, regardless of its content.
Content-addressable storage has nice properties when synchronizing data between
different computers: if two repositories have an object with the same OID, we
can be sure that they are the same object. Also, since two different objects are
practically guaranteed to have different OIDs, we can't have naming clashes
between objects.
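
To make the idea concrete, here is a small sketch of a content-addressable store that lives in a temporary directory rather than in '.ugit'. The `store` and `load` helpers are hypothetical names chosen for this illustration; they are not part of ugit.

```python
import hashlib
import tempfile
from pathlib import Path


def store(objects_dir: Path, data: bytes) -> str:
    """Write data to a file named after its SHA-1 hash and return that OID."""
    oid = hashlib.sha1(data).hexdigest()
    (objects_dir / oid).write_bytes(data)
    return oid


def load(objects_dir: Path, oid: str) -> bytes:
    """Read back the bytes stored under the given OID."""
    return (objects_dir / oid).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp)
    first = store(objects_dir, b"this is cool")
    second = store(objects_dir, b"this is cool")   # same content again
    other = store(objects_dir, b"this is cooler")  # different content

    same_oid = first == second
    file_count = len(list(objects_dir.iterdir()))
    round_trip = load(objects_dir, first)

print(same_oid, file_count, round_trip)
```

Storing the same bytes twice yields the same OID and overwrites the same file, so only two files end up on disk for the three calls.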

When real Git stores objects it does a few extra things, such as writing the
size of the object to the file as well, compressing the objects and dividing
them into 256 directories. This is done to avoid having directories with a huge
number of files, which can hurt performance. We're not going to do this in ugit,
for simplicity.
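
For the curious, the extra steps look roughly like the sketch below. It assumes the layout real Git uses for loose blob objects: a header of the form "blob {size}" followed by a NUL byte, a SHA-1 hash over header plus content, zlib compression, and a directory named after the first two hex characters of the OID. The function name `git_style_blob` is made up for this illustration, and nothing here is part of ugit.

```python
import hashlib
import zlib


def git_style_blob(data: bytes):
    """Return (oid, relative_path, compressed_bytes) the way Git lays out a blob."""
    # Git prepends the object type and size, separated from the content by NUL.
    header = f"blob {len(data)}\0".encode()
    full = header + data

    # The OID is the hash of header + content, not of the content alone.
    oid = hashlib.sha1(full).hexdigest()

    # Two hex characters give 16 * 16 = 256 possible directories.
    relative_path = f"{oid[:2]}/{oid[2:]}"

    # The bytes written to disk are zlib-compressed.
    compressed = zlib.compress(full)
    return oid, relative_path, compressed


oid, relative_path, compressed = git_style_blob(b"")
print(oid)
print(relative_path)
```

Because of the header, an OID produced this way differs from the plain SHA-1 of the file, which is why ugit OIDs will not match the ones that Git reports for the same content.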
@@ -18,9 +18,18 @@ def parse_args():
     init_parser = commands.add_parser("init")
     init_parser.set_defaults(func=init)
 
+    hash_object_parser = commands.add_parser("hash-object")
+    hash_object_parser.set_defaults(func=hash_object)
+    hash_object_parser.add_argument("file")
+
     return parser.parse_args()
 
 
 def init(args):
     data.init()
     print(f"Initialized empty ugit repository in {Path.cwd()}/{data.GIT_DIR}")
+
+
+def hash_object(args):
+    with open(args.file, "rb") as f:
+        print(data.hash_object(f.read()))
10
ugit/data.py
@@ -1,7 +1,17 @@
 from pathlib import Path
 
+import hashlib
+
 GIT_DIR = ".ugit"
 
 
 def init():
     Path.mkdir(GIT_DIR)
+    Path.mkdir(f"{GIT_DIR}/objects")
+
+
+def hash_object(data):
+    oid = hashlib.sha1(data).hexdigest()
+    with open(f"{GIT_DIR}/objects/{oid}", "wb") as out:
+        out.write(data)
+    return oid
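
The diff calls `Path.mkdir` on the class and passes it a plain string, which depends on how the installed Python version implements that method and may not work everywhere. A more conventional variant is sketched below. It is only an illustration, not the code from the commit: the `base` parameter is a hypothetical addition so that the example can run inside a temporary directory instead of the current one.

```python
import hashlib
import tempfile
from pathlib import Path

GIT_DIR = ".ugit"


def init(base="."):
    # parents=True creates GIT_DIR itself along with the objects directory.
    (Path(base) / GIT_DIR / "objects").mkdir(parents=True)


def hash_object(data, base="."):
    # Name the stored file after the SHA-1 hash of its content.
    oid = hashlib.sha1(data).hexdigest()
    (Path(base) / GIT_DIR / "objects" / oid).write_bytes(data)
    return oid


with tempfile.TemporaryDirectory() as tmp:
    init(tmp)
    oid = hash_object(b"this is cool", tmp)
    stored = (Path(tmp) / GIT_DIR / "objects" / oid).read_bytes()

print(oid, stored == b"this is cool")
```

As in the diff, running `init` a second time in the same place raises an error because the directory already exists.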