- hash-object: Save object

Let's create the first non-trivial command. This command will take a file and
store it in our '.ugit' directory for later retrieval. In Git's lingo, this
feature is called "the object database". It allows us to store and retrieve
arbitrary blobs, which are called "objects". As far as the object database is
concerned, the content of an object doesn't have any meaning (just like a
filesystem doesn't care about the internal structure of a file).

Because this command needs the '.ugit' directory, it must be run from the same
directory where you ran 'ugit init'.

Note that this is a very low-level Git building block. We're not talking yet
about versions or commits or any of the other things that you might have heard
about; we're just talking about an interface for storing some raw bytes.

So we can store an object, but how would we refer to it later? We could ask the
user to provide a name along with the object and retrieve the object later using
the name, but there is a nicer way: we can refer to the object using its hash.

If you haven't heard about hashes and hash functions, I suggest that you pause
and do some reading on them. In summary, a hash function takes a blob of
arbitrary length and produces a small "fingerprint" with a fixed length.
Cryptographic hash functions such as SHA-1 are designed so that different blobs
are extremely likely to produce different fingerprints (so likely that Git
assumes it's guaranteed). Let's try some strings to see an example:

```
$ echo -n this is cool | sha1sum
60f51187e76a9de0ff3df31f051bde04da2da891

$ echo -n this is cooler | sha1sum
f3c953b792f9ab39d1be0bdab7ab5f8350593004
```

You can see that hashing the phrases "this is cool" and "this is cooler" gives
completely different hashes even though the difference between the phrases is
small.
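
The same experiment can be repeated from Python, which is handy because ugit
itself will need to compute hashes in code. This is only a small sketch using
the standard `hashlib` module; the helper name `fingerprint` is made up for
this example. Assuming the strings are encoded as UTF-8 with no trailing
newline (which is what `echo -n` sends to `sha1sum`), it should print the same
fingerprints as the commands above.

```python
import hashlib


def fingerprint(text):
    """Return the SHA-1 hex digest of the UTF-8 bytes of `text`.

    `fingerprint` is a hypothetical helper name used only in this sketch.
    """
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two phrases that differ only slightly still get unrelated digests.
    for phrase in ("this is cool", "this is cooler"):
        print(phrase, "->", fingerprint(phrase))
```

Whatever the input length, the digest is always 40 hexadecimal characters
(160 bits).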

We're going to use the hash as the name of the object (we'll call this name an
"OID", short for object ID).

So the flow of the hash-object command is:

+ Get the path of the file to store.
+ Read the file.
+ Hash the content of the file using SHA-1.
+ Store the file under ".ugit/objects/{the SHA-1 hash}".
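
The steps above could be sketched in Python roughly as follows. This is a
minimal illustration rather than the actual ugit code: the names `GIT_DIR`,
`hash_object` and `hash_file` are assumptions made for this example, and the
sketch creates the `objects` directory on demand as a convenience, whereas the
text assumes that 'ugit init' has already been run in the current directory.

```python
import hashlib
import os

# Assumed to match the directory that 'ugit init' creates.
GIT_DIR = ".ugit"


def hash_object(data):
    """Store the bytes `data` under their SHA-1 hash and return the OID."""
    oid = hashlib.sha1(data).hexdigest()
    objects_dir = os.path.join(GIT_DIR, "objects")
    # Convenience for this sketch: make sure the target directory exists.
    os.makedirs(objects_dir, exist_ok=True)
    with open(os.path.join(objects_dir, oid), "wb") as out:
        out.write(data)
    return oid


def hash_file(path):
    """Read the file at `path` as raw bytes and store it as an object."""
    with open(path, "rb") as f:
        return hash_object(f.read())
```

Calling `hash_file` on some file from the repository directory would return
the OID and leave a byte-for-byte copy of the file under '.ugit/objects/'. The
file is opened in binary mode because the object database treats content as
raw bytes.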

This type of storage is called content-addressable storage because the "address"
that we use to find a blob is based on the content of the blob itself. (This is
in contrast to name-addressable storage, such as a typical filesystem, where you
address a particular file by its name, regardless of its content.)
Content-addressable storage has nice properties when synchronizing data between
different computers: if two repositories have an object with the same OID, we
can be sure that they are the same object. Also, since two different objects are
practically guaranteed to have different OIDs, we can't have naming clashes
between objects.
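
To make these properties concrete, here is a toy in-memory store. It is a
hypothetical illustration only (the class `ContentStore` and its methods are
invented for this example and are not part of ugit), but it shows that two
independent stores assign the same OID to the same content, and that storing
the same content twice does not create a second object.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, not ugit code)."""

    def __init__(self):
        # Maps OID (SHA-1 hex digest) to the stored bytes.
        self._objects = {}

    def put(self, data):
        """Store `data` and return its OID, derived from the content alone."""
        oid = hashlib.sha1(data).hexdigest()
        self._objects[oid] = data  # identical content always maps to one key
        return oid

    def get(self, oid):
        """Return the bytes previously stored under `oid`."""
        return self._objects[oid]

    def __len__(self):
        return len(self._objects)


if __name__ == "__main__":
    repo_a, repo_b = ContentStore(), ContentStore()
    oid_a = repo_a.put(b"hello")
    oid_b = repo_b.put(b"hello")
    print(oid_a == oid_b)  # True: same content, same OID in both stores
    repo_a.put(b"hello")
    print(len(repo_a))     # 1: storing the same content again adds nothing
```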

When real Git stores objects it does a few extra things, such as writing the
size of the object to the file as well, compressing the objects and dividing
them into 256 directories. The directory split avoids having directories with a
huge number of files, which can hurt performance. We're not going to do any of
this in ugit, for simplicity.
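
For the curious, the extra steps that real Git performs for a file's content
can be sketched as below. This is not something ugit does, and the function
name `git_style_blob` is invented for this example. The sketch relies on
Git's loose-object format for blobs: a header of the form `blob <size>`
followed by a NUL byte is prepended, the hash is computed over header plus
content, the result is compressed with zlib, and the first two hex characters
of the OID name the subdirectory (hence 256 possible directories).

```python
import hashlib
import zlib


def git_style_blob(data):
    """Return (oid, relative_path, compressed_bytes) for a Git-style blob.

    Hypothetical helper for illustration; ugit skips all of these steps.
    """
    header = b"blob %d\x00" % len(data)   # object type and size, then NUL
    full = header + data
    oid = hashlib.sha1(full).hexdigest()  # the hash covers the header too
    # Two hex characters give 16 * 16 = 256 possible directory names.
    relative_path = "%s/%s" % (oid[:2], oid[2:])
    return oid, relative_path, zlib.compress(full)
```

Because the header is part of the hashed bytes, the OID that Git reports for
a file differs from the plain SHA-1 of the file's content, which is what ugit
uses.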