Git Knowledge
Kip Landergren
My Git knowledge base explaining key concepts, git internals, and a step-by-step demonstration of operation.
Overview
As a Distributed Version Control System
Git is about:
- being truly distributed:
- no central repository required
- the ability to safely exchange content and history between a local repository and a remote repository
- allowing easy collaboration with others
- a local user’s responsibility to integrate with a remote repository’s content
- verifiable, immutable snapshots—via commits—of a repository’s state over time
As Software
Git seeks to be:
- fast
- lightweight
- plumbable
As a Source Code Manager
Git is about:
- creating a repository containing:
- all the contents of a project’s top-level directory
- a special .git/ directory for Git usage
- transferring the repository to a user’s local machine and configuring it for use, via cloning
- checking out a specific point in history to start work from
- allowing a user to make project file changes with whatever tool they wish
- staging those changed files that, together, represent a body of work
- creating a commit that:
- points to the project’s new state of top-level directory contents (created by taking the parent commit’s top-level directory and applying the staged files, including any removals or moves)
- points to any parent commits from which this commit descended
- records the author of the new state
- records the committer of the commit
- includes a message by the author describing the changes from any parent commit’s state
- forming chains of these commits, through their parent commit lineage, into branches
- joining branches via merge commits, where multiple parent commits are pointed to
- chronicling the project’s history, via these linked commits, as a directed, acyclic graph
- exchanging this local history with a remote repository that other users may access
Core Idea
Key Concepts
Dependent concepts:
- Distributed Version Control
- Cryptographic Hash Functions
- Graph Theory
- Content-Addressable Filesystem
Three conceptual areas to keep in mind while using Git:
- the working tree, or working directory, containing the files and directories on disk
- the index, or staging area, where Git stores intermediate data related to the operations being performed
- the object database where Git stores finalized data related to repository snapshots and history
Hashing
The cryptographic hash function SHA-1 is used to hash the contents of all Git objects (newer versions of Git can also create repositories that use SHA-256). This hash, or object name, is then used as the name of a file in Git’s object database containing the bytes of the actual object.
The reasons behind this choice are:
- fast and sufficiently unique key generation for object content lookup and storage
- integrity checking becomes easy: if the content has changed the hash will not match
- ability to digitally sign objects to verify their authenticity
- short (40 hex digit) and reliable string to refer to objects via out of band communication (e.g. bug report system including the hash of the last commit)
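As a minimal sketch of how an object name can be derived, the Python snippet below hashes a small header followed by the raw content. The function name git_object_name is hypothetical; the "<type> <size>\0" header layout is the one Git uses for SHA-1 object names.

```python
import hashlib


def git_object_name(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object name for `content` stored as `obj_type`."""
    # Git hashes "<type> <size in bytes>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The name depends only on the type and the bytes, never on a filename.
    print(git_object_name("blob", b"hello world\n"))
    # Any change to the content yields a different name (integrity checking).
    print(git_object_name("blob", b"hello world!\n"))
```

Because the name is a pure function of the content, two files with identical bytes share one object, and a corrupted object no longer matches its own name.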
Staging
Git is all about recording interconnected snapshots—commits—of the working tree’s contents. The specific content that you want Git to record as a commit can be put into an intermediate, or staging, area while work is being done or as decisions of what to include fluctuate. This process is called “staging”.
The intermediate area has a specific name: the index. The index is used by many Git operations, but in the context of staging it is usually referred to in phrases such as “adding a file to the index” or “removing a file from the index”.
It is important to note that changes (diffs) themselves are not being staged; entire changed files are being staged.
Staging a file includes:
- adding
- modifying
- moving (renaming)
- deleting
Committing
Changed files in the working tree are staged in the index and, on commit, converted into immutable, verifiable, and retrievable blob and tree objects that comprise the snapshot of the repository at that point in time. Commits also include references to any parent commits, which were the snapshots before the changes were made.
The network of commits, through their parental lineage, forms the history of the repository from its root commit to its tip:
          ---time-->
o---o---o---o---o---o---x---o
↑    \                 /    ↑
root  o---o---o---o---o     tip

o - normal commit
x - merge commit
The addition of this parent commit reference, and the date associated with the author and committer, mean that even if a repository returns to some previous state—e.g. the same tree as a previous commit—the history will reflect that this is a new point in the repository’s history.
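The parent links can be modeled as a small graph. The following Python sketch uses hypothetical commit names (not real object names) shaped like the diagram above and walks parent links from a tip to collect its history.

```python
# Toy commit graph: each commit name maps to the names of its parents.
# The names are hypothetical stand-ins for real object names.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],            # mainline continues from "a"
    "s1": ["a"],           # side branch also starts from "a"
    "s2": ["s1"],
    "merge": ["b", "s2"],  # merge commit: more than one parent
    "tip": ["merge"],
}


def reachable(tip, parents):
    """Return the set of commits reachable from `tip` via parent links."""
    seen = set()
    stack = [tip]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(parents[name])
    return seen


def is_merge(name, parents):
    """A merge commit is simply a commit with multiple parents."""
    return len(parents[name]) > 1


if __name__ == "__main__":
    print(sorted(reachable("tip", PARENTS)))
    print(is_merge("merge", PARENTS))
```

Links only ever point from a commit to its parents, so the graph is directed and cannot contain cycles.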
Important: the contents of the working tree are not committed, only what is staged in the index.
This means that you could stage a file, delete it from the working tree, and still have Git include it in a commit operation. Any staged change needs to be reverted to be excluded from commit.
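To make the distinction concrete, here is a toy Python model of the working tree, the index, and commits. ToyRepo and its methods are hypothetical names for illustration only; real Git stores hashes of blobs in the index rather than file contents.

```python
class ToyRepo:
    """A deliberately simplified model: plain dicts instead of Git objects."""

    def __init__(self):
        self.working_tree = {}  # path -> contents currently on disk
        self.index = {}         # path -> contents as staged
        self.commits = []       # list of snapshots (path -> contents)

    def add(self, path):
        # Staging records the whole file as it is right now, not a diff.
        self.index[path] = self.working_tree[path]

    def rm_cached(self, path):
        # Unstage: remove the entry from the index, leave the disk alone.
        self.index.pop(path)

    def commit(self):
        # A commit snapshots the index; the working tree is not consulted.
        snapshot = dict(self.index)
        self.commits.append(snapshot)
        return snapshot


if __name__ == "__main__":
    repo = ToyRepo()
    repo.working_tree["README.md"] = "# experiment\n"
    repo.add("README.md")
    del repo.working_tree["README.md"]  # deleted on disk after staging
    print(repo.commit())                # the staged file is still committed
```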
Branching
A branch is a linear path—a sequence of commits—through the repository history, with the most recent commit of that branch being known as the head or tip. A repository history may have commits existing more recently than the commit a branch head points to.
Remotes and Tracking
When a repository is cloned, the cloned-from location is considered the remote repository, and by default is referred to as origin. Multiple remotes can be configured per repository.
A remote-tracking branch is a special local reference that is set to the value of the head of the branch on the remote. It cannot be checked out directly for modification, but you can configure a different local branch to track it.
So for a branch foo on origin, its corresponding remote-tracking branch would be origin/foo and be stored in .git/refs/remotes/origin/foo. If you wanted to work on a local branch named foo that would fetch and merge from origin/foo, you would configure foo to track origin/foo:
git checkout -b foo --track origin/foo
This tracking configuration is stored in .git/config for use in local development and is not transferred to the remote repository.
Keep in mind: your new local branch foo “tracks” the “remote-tracking branch” origin/foo which itself “tracks” the remote origin’s branch foo. But because of the way Git’s tooling works, the fact that there is a local origin/foo is obscured, and it feels like you are working directly with the remote’s foo branch.
More info on this process is available in git-branch(1), git-checkout(1), and git-fetch(1).
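As a rough illustration of where the tracking configuration lives, the Python sketch below reads a sample config and reports which remote-tracking branch a local branch follows. The URL is a placeholder and upstream_of is a hypothetical helper. Python's configparser only approximates Git's config syntax, so treat this as a sketch and not a general parser.

```python
import configparser

# Sample of what .git/config may contain after setting up tracking for "foo".
# The URL is a placeholder; the section and key names follow Git's layout.
SAMPLE_CONFIG = (
    '[remote "origin"]\n'
    "\turl = https://example.com/project.git\n"
    "\tfetch = +refs/heads/*:refs/remotes/origin/*\n"
    '[branch "foo"]\n'
    "\tremote = origin\n"
    "\tmerge = refs/heads/foo\n"
)


def upstream_of(branch, config_text):
    """Return '<remote>/<branch>' for a tracked branch, or None if untracked."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = f'branch "{branch}"'
    if not parser.has_section(section):
        return None
    remote = parser[section]["remote"]
    merge_ref = parser[section]["merge"]
    prefix = "refs/heads/"
    short = merge_ref[len(prefix):] if merge_ref.startswith(prefix) else merge_ref
    return f"{remote}/{short}"


if __name__ == "__main__":
    print(upstream_of("foo", SAMPLE_CONFIG))
    print(upstream_of("bar", SAMPLE_CONFIG))
```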
Merging
Merging is the process of joining two or more branches of a repository’s history into a single commit. Git takes specific care to maximize the speed and reliability of merging file contents and applies multiple strategies, like a three-way merge, to accomplish this.
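The core decision rule of a three-way merge can be sketched at the level of whole files. The Python below is a simplified illustration, not Git's implementation: it compares each path in a common ancestor ("base") with the two branch tips and never merges within a file. All names are hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots (path -> contents) against their common ancestor.

    Returns (merged, conflicts). A missing path means the file does not exist.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o          # both sides agree (or both deleted)
        elif o == b:
            result = t          # only "theirs" changed the file
        elif t == b:
            result = o          # only "ours" changed the file
        else:
            conflicts.append(path)  # both changed it differently
            continue
        if result is not None:      # None means the file was deleted
            merged[path] = result
    return merged, conflicts


if __name__ == "__main__":
    base = {"a": "1", "b": "1", "c": "1"}
    ours = {"a": "2", "b": "1", "c": "2"}
    theirs = {"a": "1", "b": "3", "c": "3", "d": "9"}
    print(three_way_merge(base, ours, theirs))
```

The common ancestor is what lets the merge tell "changed on one side" apart from "changed on both sides".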
Internals
Objects
Objects are stored in the object database and referable by the SHA-1 hash of their contents.
                            *modify bar.rb!*
  commit e9d... <---parent--- commit da3...
   /                                   \
tree f43...                          tree 9ae...
blob 13b... foo.rb -------------┐    blob 7fc... bar.rb -┐
blob a6b... bar.rb -┐           | ┌- blob 13b... foo.rb  |
                    |           | |                      |
                    v           v v                      v
               [BYTES a6b]  [BYTES 13b]             [BYTES 7fc]
Blobs
A blob contains file contents as raw bytes.
Trees
A tree contains information about a directory’s:
- subdirectories, via references to other tree objects
- contained files:
- filemode
- path name
- the blob containing file contents
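A tree can be sketched as a serialized list of entries. The Python below assumes the commonly documented layout of "<mode> <name>\0" followed by the 20 raw bytes of the entry's SHA-1 name, and sorts entries by name only (a simplification of Git's ordering rules for directories). The function names are hypothetical.

```python
import hashlib


def serialize_tree(entries):
    """Serialize (mode, name, hex object name) entries into a tree body."""
    body = b""
    # Simplification: sort by name only; Git treats directory names specially.
    for mode, name, hex_name in sorted(entries, key=lambda entry: entry[1]):
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_name)
    return body


def tree_object_name(entries):
    """Hash the serialized body with a 'tree <size>' header in front."""
    body = serialize_tree(entries)
    header = f"tree {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    entries = [("100644", "README.md", "2617c87dce8b25f1c67acd220677749e0e3b3f81")]
    print(len(serialize_tree(entries)))  # mode + space + name + NUL + 20 bytes
    print(tree_object_name(entries))
```

Since a tree refers to blobs and other trees by name, changing any file changes the name of every tree above it.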
Commits
A commit includes:
- a tree object reference, representing the directory and file contents of the project at commit
- zero or more parent commit references:
- no parent for an initial (root) commit
- a single parent commit for a normal commit
- multiple parent commits for a merge commit
- an author, who is responsible for the changes
- a committer, who made the commit
- a message, representing the changes from the previous commit(s)
Tags
Tags come in two forms:
- a lightweight tag, which is a reference to a specific commit; literally a file created under .git/refs/tags
- an annotated tag, which is a full Git object containing a tagger and message, stored in the object database
Lightweight tags are useful in development to mark referable points in the history that you may want to switch back and forth to.
References
References are anything stored under .git/refs/. These include:
- local branches, under .git/refs/heads/
- remote-tracking branches, under .git/refs/remotes/
- tags, under .git/refs/tags/
By Demonstration
Note: the following goes through the files backing a Git repo, and does not attempt to bootstrap understanding. A comprehensive overview of files and directories is available in gitrepository-layout(5). A similar walkthrough of Git internals is available in Chapter 10 of Pro Git.
A Tour of the Initial Repository
Fresh project, without any version control:
$ tree -a --noreport git-experiment.jha/
git-experiment.jha/
└── README.md
Initialize the repository:
$ git init ./git-experiment.jha
Initialized empty Git repository in /path/to/git-experiment.jha/.git/
Let’s look at what Git created:
$ tree -a -F --noreport git-experiment.jha/
git-experiment.jha/
├── .git
│   ├── HEAD
│   ├── config
│   ├── description
│   ├── hooks/
│   │   ├── applypatch-msg.sample*
│   │   ├── commit-msg.sample*
│   │   ├── fsmonitor-watchman.sample*
│   │   ├── post-update.sample*
│   │   ├── pre-applypatch.sample*
│   │   ├── pre-commit.sample*
│   │   ├── pre-merge-commit.sample*
│   │   ├── pre-push.sample*
│   │   ├── pre-rebase.sample*
│   │   ├── pre-receive.sample*
│   │   ├── prepare-commit-msg.sample*
│   │   ├── push-to-checkout.sample*
│   │   └── update.sample*
│   ├── info/
│   │   └── exclude
│   ├── objects/
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       └── tags/
└── README.md
Some terms:
- the working tree is everything in git-experiment.jha/, sans .git/
- the .git/ directory contains Git’s administrative and control files
- the repository is considered the combination of:
- the working tree
- the .git/ administrative and control directory
Let’s break down the files we see.
.git/HEAD stores the value of HEAD, a symbolic reference that points to the head of the current checkout:
git-experiment.jha $ cat .git/HEAD
ref: refs/heads/main
In this case it points to refs/heads/main, which does not exist yet.
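Following HEAD to a commit can be sketched in a few lines of Python. The helper resolve_head is hypothetical and only reads loose ref files; it ignores other storage such as packed refs. The demo builds a throwaway directory instead of touching a real repository.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Return the object name HEAD resolves to, or None if the branch is unborn."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return head  # HEAD holds an object name directly ("detached")
    ref_file = git_dir / head[len("ref: "):]
    if not ref_file.exists():
        return None  # the branch exists in name only: no commits yet
    return ref_file.read_text().strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fake_git_dir = Path(tmp)
        (fake_git_dir / "HEAD").write_text("ref: refs/heads/main\n")
        print(resolve_head(fake_git_dir))  # no refs/heads/main file yet
```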
.git/config stores the local repository config values:
git-experiment.jha $ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	ignorecase = true
	precomposeunicode = true
.git/description is used by gitweb, the web frontend that ships with Git, and unless you are using it I believe it may be ignored:
git-experiment.jha $ cat .git/description
Unnamed repository; edit this file 'description' to name the repository.
.git/hooks/ contains sample hooks; see githooks(5) for more.
git-experiment.jha $ ls -F1 .git/hooks/
applypatch-msg.sample*
commit-msg.sample*
fsmonitor-watchman.sample*
post-update.sample*
pre-applypatch.sample*
pre-commit.sample*
pre-merge-commit.sample*
pre-push.sample*
pre-rebase.sample*
pre-receive.sample*
prepare-commit-msg.sample*
push-to-checkout.sample*
update.sample*
.git/info/exclude stores patterns for excluding files specific to the local repository. For patterns that should be applied to every clone of a repository, look into gitignore(5).
git-experiment.jha $ cat .git/info/exclude
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~
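To see what the two sample patterns would match if uncommented, the Python sketch below uses shell-style matching from the standard library. This is only an approximation: the full pattern rules in gitignore(5) (negation, directory-only patterns, path anchoring) are richer than fnmatch, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def load_patterns(text):
    """Keep non-blank lines that are not comments (lines starting with '#')."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_excluded(filename, patterns):
    """Return True if any pattern matches the bare filename."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    patterns = load_patterns("# build output\n*.[oa]\n\n# editor backups\n*~\n")
    for name in ["main.o", "libfoo.a", "main.c", "notes.txt~"]:
        print(name, is_excluded(name, patterns))
```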
.git/objects/ contains data and information related to the object store. Right now the store is empty, but as we perform Git operations we will see objects being created here. Its current subdirectories are:
- .git/objects/info/ which will contain additional information about the store
- .git/objects/pack/ which is used when compressing objects
.git/refs/ contains references, which are files that name objects (and those objects are stored in the object store). We have not made any objects yet so no references exist. The directories created by default are:
- .git/refs/heads, which stores a file per branch that names the commit serving as the tip or head of that branch
- .git/refs/tags, which stores a file per tag that names a commit of interest, like a specific release version
Basic Operations
Ask Git what its understanding is of our repository:
git-experiment.jha $ git status
On branch main
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
README.md
nothing added to commit but untracked files present (use "git add" to track)
This tells us a few things:
- we are on a branch named main. Note: this is the default branch name I configured via init.defaultBranch for new repositories. Further reading in git-config(1).
- there have been no commits
- git sees one untracked file, README.md
- nothing has been added (“staged”) for commit
Staging
Let’s add README.md to the index, or stage README.md, for commit:
git-experiment.jha $ git add README.md
git-experiment.jha $ git status
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: README.md
Git is now telling us:
- still on branch main
- still no commits
- there are no longer any untracked files
- README.md is recognized as a new file, and is considered a change to be committed
Git recommends using git rm --cached <file> to unstage; what does this refer to?
First: “cached” refers to both the older name of the index, which is “cache”, and the actual process by which git stores data in the index, effectively caching it. Second, staging or unstaging a change means to add or remove file contents from the index.
Let’s inspect how our repository has changed on disk:
$ tree -a --noreport -I 'config|description|hooks|exclude' git-experiment.jha/
git-experiment.jha/
├── .git
│   ├── HEAD
│   ├── index
│   ├── info/
│   ├── objects/
│   │   ├── 26/
│   │   │   └── 17c87dce8b25f1c67acd220677749e0e3b3f81
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       └── tags/
└── README.md
We have a few new files:
- .git/index which acts as a cache for data Git uses; among those uses is storing information about staged data for the next commit
- .git/objects/ now contains .git/objects/26/17c87dce8b25f1c67acd220677749e0e3b3f81
That latter directory and file pair is our first Git object! Its name is the hexadecimal digest of the README.md contents we staged for commit (prefixed with a small header), and the file itself holds those contents in zlib-compressed form.
Digging Deeper Into Objects
The two-character directory name, 26, is one of 256 possible options (2 hexadecimal digits → 16 × 16 = 256), a strategy Git uses to keep objects spread roughly evenly across directories as the store grows. As we create more Git objects through our actions, new directories and files will appear.
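The directory fan-out and on-disk storage can be sketched as follows. This Python example writes and reads a loose object in a temporary directory, assuming the usual layout of a zlib-compressed "<type> <size>\0<content>" payload stored under the first two hex digits of its name. The function names are hypothetical.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def loose_object_path(object_name):
    """Split a 40-digit name into objects/<first 2 digits>/<remaining 38>."""
    return Path("objects") / object_name[:2] / object_name[2:]


def write_loose_object(git_dir, obj_type, content):
    """Store content as a loose object and return its object name."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    name = hashlib.sha1(store).hexdigest()
    path = Path(git_dir) / loose_object_path(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return name


def read_loose_object(git_dir, object_name):
    """Return (type, content) for a loose object written by the function above."""
    raw = zlib.decompress((Path(git_dir) / loose_object_path(object_name)).read_bytes())
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        name = write_loose_object(tmp, "blob", b"hello world\n")
        print(loose_object_path(name).as_posix())
        print(read_loose_object(tmp, name))
```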
Let’s ask Git what type of object is named by that digest:
git-experiment.jha $ git cat-file -t 2617c87dce8b25f1c67acd220677749e0e3b3f81
blob
A blob object represents an untyped sequence of bytes, typically file contents.
Looking inside that blob object:
git-experiment.jha $ git cat-file blob 2617c87dce8b25f1c67acd220677749e0e3b3f81
# experiment
And confirm that it makes sense:
git-experiment.jha $ cat README.md
# experiment
git-experiment.jha $ cat README.md | git hash-object --stdin
2617c87dce8b25f1c67acd220677749e0e3b3f81
It does: we get the same 2617c87dce8b25f1c67acd220677749e0e3b3f81 digest back that is stored in the object database.
Abbreviated Object Names
In the above example we can shorten our object name to 2617c87, or even 2617, provided that the shortened version does not collide with another object name. I have not had success using fewer than 4 digits, so I assume there is a minimum.
7 hexadecimal digits is the default abbreviated form of an object name, and Git will automatically increase this during object display to ensure unique addressing.
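Prefix lookup and automatic lengthening can be sketched as below. This is an illustrative Python model, not Git's algorithm: the function names are hypothetical, the minimum of 4 digits mirrors the observation above, and the colliding names in the demo are made up.

```python
def resolve_abbreviation(prefix, object_names, minimum=4):
    """Return the single full object name that starts with `prefix`."""
    if len(prefix) < minimum:
        raise ValueError(f"prefix must be at least {minimum} hex digits")
    matches = [name for name in object_names if name.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix!r} is ambiguous")
    return matches[0]


def shortest_unique(name, object_names, start=7):
    """Lengthen the abbreviation from `start` digits until it is unique."""
    length = start
    while length < len(name) and any(
        other != name and other.startswith(name[:length]) for other in object_names
    ):
        length += 1
    return name[:length]


if __name__ == "__main__":
    objects = [
        "2617c87dce8b25f1c67acd220677749e0e3b3f81",
        "59b6bc826b7d4af749f6059e159145fefb840f4c",
        "d3466f9fe2e0db4bca597823cd5602e401ed1337",
    ]
    print(resolve_abbreviation("2617", objects))
    print(shortest_unique(objects[2], objects))
```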
Committing
Let’s now assume we are satisfied that the state of the index is worth demarcating as an important snapshot in the history of our development.
To do this, we commit the snapshot with a message “initial commit” via:
git-experiment.jha $ git commit -m 'initial commit'
[main (root-commit) d3466f9] initial commit
1 file changed, 1 insertion(+)
create mode 100644 README.md
This output is Git telling us:
- the branch main now contains a root-commit—the first commit in a history—with abbreviated object name d3466f9, and message initial commit
- 1 file was “changed” through the creation of README.md
- 1 line was inserted (“# experiment”)
- the filemode is 100644, which refers to a normal file
Let’s inspect how our repository has changed on disk:
$ tree -a -F --noreport -I 'config|description|hooks|exclude' git-experiment.jha/
git-experiment.jha/
├── .git/
│   ├── COMMIT_EDITMSG
│   ├── HEAD
│   ├── index
│   ├── info/
│   ├── logs/
│   │   ├── HEAD
│   │   └── refs/
│   │       └── heads/
│   │           └── main
│   ├── objects/
│   │   ├── 26/
│   │   │   └── 17c87dce8b25f1c67acd220677749e0e3b3f81
│   │   ├── 59/
│   │   │   └── b6bc826b7d4af749f6059e159145fefb840f4c
│   │   ├── d3/
│   │   │   └── 466f9fe2e0db4bca597823cd5602e401ed1337
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       │   └── main
│       └── tags/
└── README.md
.git/COMMIT_EDITMSG is a file used by Git (and hooks) to manipulate the commit message. We passed our message via the -m flag and therefore did not encounter a case where having access to .git/COMMIT_EDITMSG would be useful. Git does show it with our message though:
git-experiment.jha $ cat .git/COMMIT_EDITMSG
initial commit
Going in order within .git/objects/, let’s inspect the new objects.
First is a tree object that lists our original blob we inspected during staging:
git-experiment.jha $ git cat-file -t 59b6b
tree
git-experiment.jha $ git ls-tree 59b6b
100644 blob 2617c87dce8b25f1c67acd220677749e0e3b3f81 README.md
Second is a commit object, representing our snapshot:
git-experiment.jha $ git cat-file -t d3466
commit
git-experiment.jha $ git cat-file commit d3466
tree 59b6bc826b7d4af749f6059e159145fefb840f4c
author Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800
committer Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800
initial commit
This commit object is telling us:
- Git’s understanding of the project working tree—what we staged, not necessarily the working tree on disk (remember: we can have unstaged changes and untracked files)—is stored in 59b6bc8, which we inspected above, and that tree, in turn, refers to the README.md content blob 2617c87
- the author is me, followed by the timestamp
- the committer is me, followed by the timestamp
- it has our message “initial commit”
- that’s it!
Note: the values for author and committer were specified by me previously; more info on how to do this in git-config(1).
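The text form of a commit object is simple enough to parse by hand. The Python sketch below assumes the layout of header lines, a blank line, then the message, and uses the commit shown above as sample input. parse_commit is a hypothetical helper that ignores headers it does not know.

```python
SAMPLE_COMMIT = (
    "tree 59b6bc826b7d4af749f6059e159145fefb840f4c\n"
    "author Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800\n"
    "committer Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800\n"
    "\n"
    "initial commit\n"
)


def parse_commit(text):
    """Split a commit object's text into its fields."""
    header_text, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None, "committer": None}
    for line in header_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # zero, one, or many parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    commit["message"] = message.strip()
    return commit


if __name__ == "__main__":
    parsed = parse_commit(SAMPLE_COMMIT)
    print(parsed["tree"])
    print(parsed["parents"])  # empty list: a root commit has no parent lines
    print(parsed["message"])
```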
And finally, let’s inspect .git/refs/heads/main:
git-experiment.jha $ cat .git/refs/heads/main
d3466f9fe2e0db4bca597823cd5602e401ed1337
This tells us that the head of branch main (the most recent commit of the branch, though not necessarily the most recent commit in the project’s history) is the object named d3466f9, which we saw above was the initial commit!
The inspection of the main branch head tells us important information:
- branches are just pointers to commits
- this, in turn, shows that branches are cheap to create, modify, and manipulate: no data is copied when branching
- the history of the project, demarcated by commits, is a set of pointers that themselves point to immutable objects (trees and blobs)
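A toy Python model shows why branching is cheap: a branch is one small entry naming a commit, and creating or moving it copies no objects. The dictionary and helper names are hypothetical, and the second object name in the demo is made up for illustration.

```python
# Reference name -> object name of the commit it points to.
refs = {"refs/heads/main": "d3466f9fe2e0db4bca597823cd5602e401ed1337"}


def create_branch(refs, name, start_point):
    """Creating a branch only records which commit it starts at."""
    refs[f"refs/heads/{name}"] = refs[f"refs/heads/{start_point}"]


def advance(refs, name, new_commit):
    """Committing on a branch moves its pointer; other branches are untouched."""
    refs[f"refs/heads/{name}"] = new_commit


if __name__ == "__main__":
    create_branch(refs, "feature", "main")
    # Made-up object name standing in for a new commit on "feature".
    advance(refs, "feature", "0123456789abcdef0123456789abcdef01234567")
    for ref_name, commit_name in sorted(refs.items()):
        print(ref_name, commit_name[:7])
```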