Git Knowledge

Kip Landergren

My Git knowledge base explaining key concepts, git internals, and a step-by-step demonstration of operation.

Overview

As a Distributed Version Control System

Git is about:

As Software

Git seeks to be:

As a Source Code Manager

Git is about:

Core Idea

“...Git is fundamentally a content-addressable filesystem with a VCS [(Version Control System)] user interface written on top of it.”

Key Concepts

Dependent concepts:

Three conceptual areas to keep in mind while using Git:

Hashing

The cryptographic hash function SHA-1 is used to hash the contents of all Git objects (newer Git versions can also create repositories that use SHA-256 instead). This hash, or object name, is then used as the name of a file in Git’s object database containing the bytes of the actual object.
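A minimal sketch of what content addressing means in practice, assuming a throwaway repository created with mktemp; the file names are made up for illustration. Two files with identical bytes should produce one object name, and different bytes a different one:

```shell
# Throwaway repository; the path and file names are illustrative only.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Two files with different names but identical bytes, plus one that differs.
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt
printf 'goodbye\n' > c.txt

# -w writes the object into the object database and prints its object name.
name_a=$(git hash-object -w a.txt)
name_b=$(git hash-object -w b.txt)
name_c=$(git hash-object -w c.txt)

echo "a.txt -> $name_a"
echo "b.txt -> $name_b"
echo "c.txt -> $name_c"

# Reports how many loose objects are now stored.
git count-objects
```

The file name plays no part in the object name; only the content does.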

The reasons behind this choice are:

Staging

Git is all about recording interconnected snapshots (commits) of the working tree’s contents. The specific content that you want Git to record as a commit can be put into an intermediate, or staging, area while work is being done or as decisions of what to include fluctuate. This process is called “staging”.

The intermediate area has a specific name: the index. The index is used by many Git operations but is often specifically referred to for staging as “adding a file to the index” or “removing a file from the index”.

It is important to note that changes (diffs) themselves are not being staged; entire changed files are being staged.
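One way to see this for yourself, as a rough sketch in a throwaway repository (notes.txt is a hypothetical file name): stage a file, append a line, stage it again, and compare what the index names each time.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

printf 'line 1\n' > notes.txt
git add notes.txt
# ls-files --stage prints "<mode> <object name> <stage>" then a tab and the path.
staged_first=$(git ls-files --stage notes.txt | cut -d' ' -f2)

# Append a line and stage again: the index entry now names a different blob
# that holds the complete file, not a delta against the earlier one.
printf 'line 2\n' >> notes.txt
git add notes.txt
staged_second=$(git ls-files --stage notes.txt | cut -d' ' -f2)

# Print the full contents of the blob currently staged.
git cat-file -p "$staged_second"
```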

Staging a file includes:

Committing

Changed files in the working tree are staged in the index and, on commit, converted into immutable, verifiable, and retrievable blob and tree objects that comprise the snapshot of the repository at that point in time. Commits also include references to any parent commits, which were the snapshots before the changes were made.

The network of commits, through their parental lineage, forms the history of the repository from its root commit to its tip:

          ---time-->
 o---o---o---o---o---o---x---o
 ↑    \                 /    ↑
root   o---o---o---o---o    tip

o - normal commit
x - merge commit

The addition of this parent commit reference, and the date associated with the author and committer, means that even if a repository returns to some previous state (e.g. the same tree as a previous commit), the history will reflect that this is a new point in the repository’s history.
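A small sketch of that last point, assuming a throwaway repository and a placeholder identity (the name, email, and file name are invented for the example). The first and third commits should share a tree while remaining distinct commits:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf 'v1\n' > file.txt
git add file.txt && git commit -q -m 'first'
first_commit=$(git rev-parse HEAD)
first_tree=$(git rev-parse 'HEAD^{tree}')

printf 'v2\n' > file.txt
git add file.txt && git commit -q -m 'second'

# Restore the original contents and commit again.
printf 'v1\n' > file.txt
git add file.txt && git commit -q -m 'back to v1'
third_commit=$(git rev-parse HEAD)
third_tree=$(git rev-parse 'HEAD^{tree}')

echo "trees:   $first_tree $third_tree"
echo "commits: $first_commit $third_commit"
```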

Important: the contents of the working tree are not committed, only what is staged in the index.

This means that you could stage a file, delete it from the working tree, and still have Git include it in a commit operation. Any staged change needs to be unstaged to be excluded from the commit.
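That scenario can be sketched as follows, again in a throwaway repository with a placeholder identity and a hypothetical file name:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf 'keep me\n' > staged.txt
git add staged.txt
rm staged.txt    # gone from the working tree, still recorded in the index

git commit -q -m 'commit from the index'

# The commit's tree lists the file even though it no longer exists on disk.
committed=$(git ls-tree --name-only HEAD)
echo "$committed"
```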

Branching

A branch is a linear path—a sequence of commits—through the repository history, with the most recent commit of that branch being known as the head or tip. A repository history may have commits existing more recently than the commit a branch head points to.
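A minimal sketch of a branch head falling behind newer history, assuming a throwaway repository; the branch name topic and the identity are placeholders:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main            # use "main" regardless of local defaults
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf 'one\n' > file.txt
git add file.txt && git commit -q -m 'one'

# Create a branch at the current commit but stay on main.
git branch topic

printf 'two\n' > file.txt
git add file.txt && git commit -q -m 'two'

main_head=$(git rev-parse main)
topic_head=$(git rev-parse topic)
echo "main:  $main_head"
echo "topic: $topic_head"
```

Here topic still points at the first commit, while main has moved on to a more recent one.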

Remotes and Tracking

When a repository is cloned, the cloned-from location is considered the remote repository, and by default is referred to as origin. Multiple remotes can be configured per repository.

A remote-tracking branch is a special local reference that is set to the value of the head of the branch on the remote. It cannot be checked out directly for modification, but you can configure a different local branch to track it.

So for a branch foo on origin, its corresponding remote-tracking branch would be origin/foo and be stored in .git/refs/remotes/origin/foo. If you wanted to work on a local branch named foo that would fetch and merge from origin/foo, you would configure foo to track origin/foo:

git checkout -b foo --track origin/foo

This tracking configuration is stored in .git/config for use in local development and is not transferred to the remote repository.

Keep in mind: your new local branch foo “tracks” the “remote-tracking branch” origin/foo which itself “tracks” the remote origin’s branch foo. But because of the way Git’s tooling works, the fact that there is a local origin/foo is obscured, and it feels like you are working directly with the remote’s foo branch.
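To see where the tracking configuration ends up, here is a rough offline sketch. It assumes a local directory named upstream standing in for the remote and a placeholder identity; both are invented for the example.

```shell
base=$(mktemp -d)
cd "$base" || exit 1

# "upstream" stands in for a remote; a local path keeps the sketch offline.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name 'Example Author'            # placeholder identity
git -C upstream config user.email 'author@example.invalid'
printf 'hi\n' > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m 'initial'
git -C upstream branch foo

git clone -q upstream clone
cd clone || exit 1

# Create local branch foo and configure it to track origin/foo.
git checkout -q -b foo --track origin/foo

# The tracking settings live in the clone's .git/config.
tracked_remote=$(git config branch.foo.remote)
tracked_merge=$(git config branch.foo.merge)
echo "remote=$tracked_remote merge=$tracked_merge"
```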

More info on this process is available in git-branch(1), git-checkout(1), and git-fetch(1).

Merging

Merging is the process of joining two or more branches of a repository’s history into a single commit. Git takes specific care to maximize the speed and reliability of merging file contents and applies multiple strategies, such as a three-way merge, to accomplish this.
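A minimal sketch of a merge joining two diverged branches, assuming a throwaway repository; the branch name topic, the file names, and the identity are placeholders. The resulting merge commit should record two parents:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf 'base\n' > base.txt
git add base.txt && git commit -q -m 'base'

git checkout -q -b topic
printf 'topic\n' > topic.txt
git add topic.txt && git commit -q -m 'topic work'

git checkout -q main
printf 'main\n' > main.txt
git add main.txt && git commit -q -m 'main work'

# --no-edit accepts the default merge message without opening an editor.
git merge -q --no-edit topic

# rev-list --parents prints the commit followed by each of its parents.
parents=$(git rev-list --parents -n 1 HEAD)
parent_count=$(( $(echo "$parents" | wc -w) - 1 ))
echo "merge commit has $parent_count parents"
```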

Internals

Objects

Objects are stored in the object database and can be referred to by the SHA-1 hash of their contents.

                  *modify bar.rb!*

    commit e9d...  <---parent---  commit da3...
        /                                  \
  tree f43...                          tree 9ae...
blob 13b... foo.rb -------------┐    blob 7fc... bar.rb -┐
blob a6b... bar.rb -┐           | ┌- blob 13b... foo.rb  |
                    |           | |                      |
                    v           v v                      v
               [BYTES a6b] [BYTES 13b]          [BYTES 7fc]

Blobs

A blob contains file contents as raw bytes.

Trees

A tree contains information about a directory’s:

Commits

A commit includes:

Tags

Tags come in two forms:

Lightweight tags are useful in development to mark points in the history that you may want to refer to and switch back and forth between.
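As a hedged illustration of the difference between a lightweight tag and the other form, an annotated tag, the sketch below creates one of each in a throwaway repository. The tag names and identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf 'x\n' > file.txt
git add file.txt && git commit -q -m 'initial'

git tag mark-lightweight                        # only a reference to the commit
git tag -a mark-annotated -m 'annotated tag'    # a tag object that points at the commit

light_type=$(git cat-file -t mark-lightweight)
annot_type=$(git cat-file -t mark-annotated)
echo "lightweight names a $light_type, annotated names a $annot_type"
```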

References

Anything under .git/refs/. These include:

By Demonstration

Note: the following goes through the files backing a Git repo, and does not attempt to bootstrap understanding. A comprehensive overview of files and directories is available in gitrepository-layout(5). A similar walkthrough of Git internals is available in Chapter 10 of Pro Git.

A Tour of the Initial Repository

Fresh project, without any version control:

$ tree -a --noreport git-experiment.jha/
git-experiment.jha/
└── README.md

Initialize the repository:

$ git init ./git-experiment.jha
Initialized empty Git repository in /path/to/git-experiment.jha/.git/

Let’s look at what Git created:

$ tree -a -F --noreport git-experiment.jha/
git-experiment.jha/
├── .git
│   ├── HEAD
│   ├── config
│   ├── description
│   ├── hooks/
│   │   ├── applypatch-msg.sample*
│   │   ├── commit-msg.sample*
│   │   ├── fsmonitor-watchman.sample*
│   │   ├── post-update.sample*
│   │   ├── pre-applypatch.sample*
│   │   ├── pre-commit.sample*
│   │   ├── pre-merge-commit.sample*
│   │   ├── pre-push.sample*
│   │   ├── pre-rebase.sample*
│   │   ├── pre-receive.sample*
│   │   ├── prepare-commit-msg.sample*
│   │   ├── push-to-checkout.sample*
│   │   └── update.sample*
│   ├── info/
│   │   └── exclude
│   ├── objects/
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       └── tags/
└── README.md

Some terms:

Let’s break down the files we see.

.git/HEAD stores the value of HEAD, a symbolic reference, that always points to the head of the current checkout:

git-experiment.jha $ cat .git/HEAD
ref: refs/heads/main

In this case it points to refs/heads/main, which does not exist yet.
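A small sketch of this "points to something that does not exist yet" state, assuming a freshly initialized throwaway repository:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main   # force the branch name used in this walkthrough

# Read the symbolic reference without touching .git/HEAD directly.
head_target=$(git symbolic-ref HEAD)

# rev-parse --verify fails here because refs/heads/main names no commit yet.
if git rev-parse --verify -q HEAD >/dev/null; then
  head_state='resolves'
else
  head_state='unborn'
fi
echo "$head_target is $head_state"
```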

.git/config stores the local repository config values:

git-experiment.jha $ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	ignorecase = true
	precomposeunicode = true

.git/description is used by gitweb, the web frontend that ships with Git, and unless you are using it I believe it may be ignored:

git-experiment.jha $ cat .git/description
Unnamed repository; edit this file 'description' to name the repository.

.git/hooks/ contains sample hooks; see githooks(5) for more.

git-experiment.jha $ ls -F1 .git/hooks/
applypatch-msg.sample*
commit-msg.sample*
fsmonitor-watchman.sample*
post-update.sample*
pre-applypatch.sample*
pre-commit.sample*
pre-merge-commit.sample*
pre-push.sample*
pre-rebase.sample*
pre-receive.sample*
prepare-commit-msg.sample*
push-to-checkout.sample*
update.sample*

.git/info/exclude stores patterns for excluding files specific to the local repository. For patterns that should be applied to every clone of a repository, look into gitignore(5).

git-experiment.jha $ cat .git/info/exclude
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~
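A rough sketch of a local-only exclude in action, assuming a throwaway repository; the *.bak pattern and the file names are hypothetical choices for the example:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Ignore editor backup files in this repository only.
mkdir -p .git/info
printf '*.bak\n' >> .git/info/exclude

touch notes.bak notes.txt

# check-ignore exits 0 when a path matches an ignore pattern.
if git check-ignore -q notes.bak; then bak_ignored=yes; else bak_ignored=no; fi
if git check-ignore -q notes.txt; then txt_ignored=yes; else txt_ignored=no; fi
echo "notes.bak ignored: $bak_ignored, notes.txt ignored: $txt_ignored"

git status --short
```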

.git/objects/ contains data and information related to the object store. Right now the store is empty, but as we perform Git operations we will see objects being created here. Its current subdirectories are:

.git/refs/ contains references, which are files that name objects (and those objects are stored in the object store). We have not made any objects yet so no references exist. The directories created by default are:

Basic Operations

Ask Git what its understanding is of our repository:

git-experiment.jha $ git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	README.md

nothing added to commit but untracked files present (use "git add" to track)

This tells us a few things:

Staging

Let’s add README.md to the index, or stage README.md, for commit:

git-experiment.jha $ git add README.md
git-experiment.jha $ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   README.md

Git is now telling us:

Git recommends using git rm --cached <file> to unstage; what does this refer to?

First, “cached” refers to both the older name of the index, which is “cache”, and the actual process by which Git stores data in the index, effectively caching it. Second, staging or unstaging a change means adding or removing file contents from the index.
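A minimal sketch of unstaging with git rm --cached, assuming a throwaway repository with no commits yet:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

printf '# experiment\n' > README.md
git add README.md
staged_before=$(git ls-files)    # paths currently recorded in the index

# Remove the entry from the index only; the working tree file is left alone.
git rm -q --cached README.md
staged_after=$(git ls-files)

echo "before: '$staged_before' after: '$staged_after'"
```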

Let’s inspect how our repository has changed on disk:

$ tree -a --noreport -I 'config|description|hooks|exclude' git-experiment.jha/
git-experiment.jha/
├── .git
│   ├── HEAD
│   ├── index
│   ├── info/
│   ├── objects/
│   │   ├── 26/
│   │   │   └── 17c87dce8b25f1c67acd220677749e0e3b3f81
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       └── tags/
└── README.md

We have a few new files:

That latter directory and file pair is our first Git object! Its name, the directory name followed by the file name, is the hexadecimal digest that Git computed from the README.md contents we staged for commit.

Digging Deeper Into Objects

The two-character directory name, 26, is one of 256 possible options (2 hexadecimal digits → 16 × 16 = 256) and a strategy Git uses to ensure that these object directories are uniformly distributed while growing. As we create more Git objects through our actions, new directories and files will appear.
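The mapping from object name to file path can be sketched like this, assuming a throwaway repository that stores the object loose (unpacked), as a fresh one does:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

printf '# experiment\n' > README.md
name=$(git hash-object -w README.md)   # write the blob and capture its object name

dir=$(echo "$name" | cut -c1-2)        # first two hexadecimal digits
file=$(echo "$name" | cut -c3-)        # the remaining digits
path=".git/objects/$dir/$file"

ls -l "$path"
```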

Let’s ask Git what type of object is named by that digest:

git-experiment.jha $ git cat-file -t 2617c87dce8b25f1c67acd220677749e0e3b3f81
blob

A blob object represents an untyped sequence of bytes, typically file contents.

Looking inside that blob object:

git-experiment.jha $ git cat-file blob 2617c87dce8b25f1c67acd220677749e0e3b3f81
# experiment

And confirm that it makes sense:

git-experiment.jha $ cat README.md
# experiment
git-experiment.jha $ cat README.md | git hash-object --stdin
2617c87dce8b25f1c67acd220677749e0e3b3f81

It does: we get the same 2617c87dce8b25f1c67acd220677749e0e3b3f81 digest back that is stored in the object database.
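For the curious, the digest can also be reproduced without Git. The sketch below assumes the default SHA-1 object format and that the coreutils sha1sum tool is installed (on macOS, shasum would be the likely substitute). Git hashes a short header, the word blob, the size in bytes, and a NUL byte, followed by the file contents:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

printf '# experiment\n' > README.md

via_git=$(git hash-object README.md)

# Header: "blob <size in bytes>" plus a NUL byte, then the raw content.
size=$(wc -c < README.md | tr -d ' ')
manual=$( { printf 'blob %s\0' "$size"; cat README.md; } | sha1sum | cut -d' ' -f1 )

echo "git:    $via_git"
echo "manual: $manual"
```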

Abbreviated Object Names

In the above example we can shorten our object name to 2617c87, or even 2617, provided that the shortened version does not collide with another object name. Git does not accept fewer than 4 hexadecimal digits; that is the minimum abbreviation length.

7 hexadecimal digits is the default abbreviated form of an object name, and Git will automatically increase this during object display to ensure unique addressing.
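A small sketch of asking Git for abbreviations rather than truncating by hand, assuming a throwaway repository with a placeholder identity:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf 'x\n' > file.txt
git add file.txt && git commit -q -m 'initial'

full=$(git rev-parse HEAD)
short_default=$(git rev-parse --short HEAD)      # default abbreviation length
short_min=$(git rev-parse --short=4 HEAD)        # shortest unique form, at least 4 digits

# An unambiguous prefix resolves back to the full object name.
resolved=$(git rev-parse "$short_min")

echo "$full"
echo "$short_default"
echo "$short_min"
```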

Committing

Let’s now assume we are satisfied that the state of the index is worth demarcating as an important snapshot in the history of our development.

To do this, we commit the snapshot with a message “initial commit” via:

git-experiment.jha $ git commit -m 'initial commit'
[main (root-commit) d3466f9] initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

This output is Git telling us:

Let’s inspect how our repository has changed on disk:

$ tree -a -F --noreport -I 'config|description|hooks|exclude' git-experiment.jha/
git-experiment.jha/
├── .git/
│   ├── COMMIT_EDITMSG
│   ├── HEAD
│   ├── index
│   ├── info/
│   ├── logs/
│   │   ├── HEAD
│   │   └── refs/
│   │       └── heads/
│   │           └── main
│   ├── objects/
│   │   ├── 26/
│   │   │   └── 17c87dce8b25f1c67acd220677749e0e3b3f81
│   │   ├── 59/
│   │   │   └── b6bc826b7d4af749f6059e159145fefb840f4c
│   │   ├── d3/
│   │   │   └── 466f9fe2e0db4bca597823cd5602e401ed1337
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       │   └── main
│       └── tags/
└── README.md

.git/COMMIT_EDITMSG is a file used by Git (and hooks) to manipulate the commit message. We passed our message via the -m flag and therefore did not encounter a case where having access to .git/COMMIT_EDITMSG would be useful. Git does show it with our message though:

git-experiment.jha $ cat .git/COMMIT_EDITMSG
initial commit

Going in order within .git/objects/, let’s inspect the new objects.

First is a tree object that lists our original blob we inspected during staging:

git-experiment.jha $ git cat-file -t 59b6b
tree
git-experiment.jha $ git ls-tree 59b6b
100644 blob 2617c87dce8b25f1c67acd220677749e0e3b3f81	README.md

Second is a commit object, representing our snapshot:

git-experiment.jha $ git cat-file -t d3466
commit
git-experiment.jha $ git cat-file commit d3466
tree 59b6bc826b7d4af749f6059e159145fefb840f4c
author Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800
committer Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800

initial commit

This commit object is telling us:

Note: the values for author and committer were specified by me previously; more info on how to do this in git-config(1).
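The chain from commit to tree to blob can also be walked with revision syntax instead of copying digests by hand. A minimal sketch, assuming a throwaway repository and a placeholder identity set through git config:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf '# experiment\n' > README.md
git add README.md && git commit -q -m 'initial commit'

commit=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')       # the tree named on the commit's "tree" line
blob=$(git rev-parse 'HEAD:README.md')    # the blob that tree lists for README.md

# -p pretty-prints an object according to its type.
git cat-file -p "$commit"
git cat-file -p "$tree"
git cat-file -p "$blob"
```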

And finally, let’s inspect .git/refs/heads/main:

git-experiment.jha $ cat .git/refs/heads/main
d3466f9fe2e0db4bca597823cd5602e401ed1337

This tells us that the head of branch main, meaning the most recent commit of the branch (but not necessarily the most recent commit in the project’s history), is the object named d3466f9, which we saw above was the initial commit!
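Reading the file directly works here, but Git's own tooling gives the same answer and keeps working if the reference is stored some other way (packed, for example). A rough sketch, assuming a throwaway repository and a placeholder identity:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name 'Example Author'            # placeholder identity
git config user.email 'author@example.invalid'

printf '# experiment\n' > README.md
git add README.md && git commit -q -m 'initial commit'

via_plumbing=$(git rev-parse refs/heads/main)
via_head=$(git rev-parse HEAD)

# In a fresh repository the reference is usually a plain text file.
if [ -f .git/refs/heads/main ]; then
  via_file=$(cat .git/refs/heads/main)
else
  via_file=$via_plumbing    # stored elsewhere; fall back to Git's answer
fi

echo "file:      $via_file"
echo "rev-parse: $via_plumbing"
```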

The inspection of the main branch head tells us important information: