Git Knowledge

Kip Landergren

July 05, 2020 (Updated: October 19, 2024)

My Git knowledge base explaining key concepts, git internals, and a step-by-step demonstration of operation.

Overview
Core Idea
Key Concepts
Internals
- Objects
  - Blobs
  - Trees
  - Commits
  - Tags
- References
By Demonstration
- A Tour of the Initial Repository
- Basic Operations
  - Staging
    - Digging Deeper Into Objects
    - Abbreviated Object Names
  - Committing

Overview

As a Distributed Version Control System

Git is about:

being truly distributed:
- no central repository required
- the ability to safely exchange content and history between a local repository and a remote repository
- allowing easy collaboration with others
a local user’s responsibility to integrate with a remote repository’s content
verifiable, immutable snapshots—via commits—of a repository’s state over time

As Software

Git seeks to be:

fast
lightweight
plumbable

As a Source Code Manager

Git is about:

creating a repository containing:
- all the contents of a project’s top-level directory
- a special .git/ directory for Git usage
transferring the repository to a user’s local machine and configuring it for use, via cloning
checking out a specific point in history to start work from
allowing a user to make project file changes with whatever tool they wish
staging those changed files that, together, represent a body of work
creating a commit that:
- points to the project’s new state of top-level directory contents (created by taking the parent commit’s top-level directory and applying the staged files, including any removals or moves)
- points to any parent commits from which this commit descended
- records the author of the new state
- records the committer of the commit
- includes a message by the author describing the changes from any parent commit’s state
forming chains of these commits, through their parent commit lineage, into branches
joining branches via merge commits, where multiple parent commits are pointed to
chronicling the project’s history, via these linked commits, as a directed, acyclic graph
exchanging this local history with a remote repository that other users may access

Core Idea

“...Git is fundamentally a content-addressable filesystem with a VCS [(Version Control System)] user interface written on top of it.”

Key Concepts

Dependent concepts:

Distributed Version Control
Cryptographic Hash Functions
Graph Theory
Content-Addressable Filesystem

Three conceptual areas to keep in mind while using Git:

the working tree, or working directory, containing the files and directories on disk
the index, or staging area, where Git stores intermediate data related to the operations being performed
the object database where Git stores finalized data related to repository snapshots and history

Hashing

The cryptographic hash function SHA-1 is used by default to hash the contents of all Git objects; newer versions of Git can also create repositories that use SHA-256 instead. This hash, or object name, is then used to write out a file in Git’s object database containing the bytes of the actual object.

The reasons behind this choice are:

fast and sufficiently unique key generation for object content lookup and storage
integrity checking becomes easy: if the content has changed the hash will not match
ability to digitally sign objects to verify their authenticity
short (40 hex digit) and reliable string to refer to objects via out of band communication (e.g. bug report system including the hash of the last commit)
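
As a rough illustration of how an object name is derived, the sketch below hashes a file the way Git hashes a blob: SHA-1 over the header "blob <size>", a NUL byte, and then the raw content. The helper name git_blob_name and the sample file are made up for this example, and it assumes the default SHA-1 object format and a system that provides sha1sum (on macOS, shasum would be the likely substitute).

```shell
# Hypothetical helper: compute a file's blob object name without using git.
git_blob_name() {
  size=$(wc -c < "$1" | tr -d ' ')   # content length in bytes
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

sample=$(mktemp)                     # throwaway file for the example
printf 'test content\n' > "$sample"
blob_name=$(git_blob_name "$sample")
echo "$blob_name"
```

If Git is installed, git hash-object "$sample" should print the same 40 hex digits.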

Staging

Git is all about recording interconnected snapshots—commits—of the working tree’s contents. The specific content that you want Git to record as a commit can be put into an intermediate, or staging, area while work is being done or as decisions of what to include fluctuate. This process is called “staging”.

The intermediate area has a specific name: the index. The index is used by many Git operations but is often specifically referred to for staging as “adding a file to the index” or “removing a file from the index”.

It is important to note that changes (diffs) themselves are not being staged; entire changed files are being staged.

Staging a file includes:

adding
modifying
moving (renaming)
deleting

Committing

Changed files in the working tree are staged in the index and, on commit, converted into immutable, verifiable, and retrievable blob and tree objects that comprise the snapshot of the repository at that point in time. Commits also include references to any parent commits which were the snapshots before changes were made.

The network of commits, through their parental lineage, forms the history of the repository from its root commit to its tip:

          ---time-->
 o---o---o---o---o---o---x---o
 ↑    \                 /    ↑
root   o---o---o---o---o    tip

o - normal commit
x - merge commit

The addition of this parent commit reference, along with the dates associated with the author and committer, means that even if a repository returns to some previous state (e.g. the same tree as a previous commit) the history will reflect that this is a new point in the repository’s history.
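
A small sketch of that point, run in a scratch repository: the third commit restores the first commit's content, so both should share one tree object while remaining distinct commits. The file name, messages, and identity values are placeholders chosen for this example.

```shell
# Placeholder identity so the example does not depend on local Git config.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
repo=$(mktemp -d) && cd "$repo" && git init -q .

echo 'original' > file.txt && git add file.txt && git commit -q -m 'first'
echo 'something else' > file.txt && git add file.txt && git commit -q -m 'second'
echo 'original' > file.txt && git add file.txt && git commit -q -m 'back to first state'

first_tree=$(git rev-parse 'HEAD~2^{tree}')
last_tree=$(git rev-parse 'HEAD^{tree}')
first_commit=$(git rev-parse HEAD~2)
last_commit=$(git rev-parse HEAD)
echo "trees:   $first_tree $last_tree"       # the same tree name twice
echo "commits: $first_commit $last_commit"   # two different commit names
```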

Important: the contents of the working tree are not committed, only what is staged in the index.

This means that you could stage a file, delete it from the working tree, and still have Git include it in a commit operation. Any staged change needs to be reverted to be excluded from commit.
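
This can be checked in a scratch repository. In the sketch below (file name and identity values are placeholders), the file is staged, deleted from the working tree, and then committed anyway:

```shell
# Placeholder identity so the example does not depend on local Git config.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
repo=$(mktemp -d) && cd "$repo" && git init -q .

echo 'hello' > note.txt
git add note.txt                  # the content is now recorded in the index
rm note.txt                       # gone from the working tree, still staged
git commit -q -m 'commit the staged file'

committed=$(git ls-tree --name-only HEAD)
echo "$committed"                 # the commit's tree still lists note.txt
```

Afterwards git status should report note.txt as deleted in the working tree, since the commit contains it and the disk does not.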

Branching

A branch is a linear path—a sequence of commits—through the repository history, with the most recent commit of that branch being known as the head or tip. A repository history may have commits existing more recently than the commit a branch head points to.

Remotes and Tracking

When a repository is cloned, the cloned-from location is considered the remote repository, and by default is referred to as origin. Multiple remotes can be configured per repository.

A remote-tracking branch is a special local reference that is set to the value of the head of the branch on the remote. It cannot be checked out directly for modification, but you can configure a different local branch to track it.

So for a branch foo on origin, its corresponding remote-tracking branch would be origin/foo and be stored in .git/refs/remotes/origin/foo. If you wanted to work on a local branch named foo that would fetch and merge from origin/foo, you would configure foo to track origin/foo:

git checkout -b foo --track origin/foo

This tracking configuration is stored in .git/config for use in local development and is not transferred to the remote repository.

Keep in mind: your new local branch foo “tracks” the “remote-tracking branch” origin/foo which itself “tracks” the remote origin’s branch foo. But because of the way Git’s tooling works, the fact that there is a local origin/foo is obscured, and it feels like you are working directly with the remote’s foo branch.
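
To see the three pieces side by side, the sketch below builds a throwaway "remote" on the local filesystem, clones it, and sets up the tracking branch. The temporary paths and identity values are placeholders for this example; the branch and remote names follow the foo and origin names used above.

```shell
# Placeholder identity so the example does not depend on local Git config.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'

# A throwaway repository standing in for the remote, with branches main and foo.
origin=$(mktemp -d)
git init -q "$origin"
git -C "$origin" symbolic-ref HEAD refs/heads/main
git -C "$origin" commit -q --allow-empty -m 'root'
git -C "$origin" branch foo

work="$(mktemp -d)/work"
git clone -q "$origin" "$work" && cd "$work"
git checkout -q -b foo --track origin/foo

tracked_remote=$(git config branch.foo.remote)     # which remote foo follows
tracked_ref=$(git config branch.foo.merge)         # the branch name on that remote
remote_tracking=$(git rev-parse --symbolic-full-name 'foo@{upstream}')
echo "$tracked_remote $tracked_ref $remote_tracking"
```

The first two values are what ends up in .git/config; the third is the local remote-tracking reference that sits in between.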

More info on this process is available in git-branch(1), git-checkout(1), and git-fetch(1).

Merging

Merging is the process of joining two or more branches of a repository’s history into a single commit. Git takes specific care to maximize the speed and reliability of merging file contents and applies multiple strategies, like a three-way merge, to accomplish this.
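
A minimal sketch of a merge between two diverged branches in a scratch repository. The branch names, file names, and identity values are placeholders, and it assumes no local configuration that forbids merge commits (for example merge.ff=only).

```shell
# Placeholder identity so the example does not depend on local Git config.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
repo=$(mktemp -d) && cd "$repo" && git init -q .
git symbolic-ref HEAD refs/heads/main      # fix the branch name for the example

echo 'base' > base.txt && git add base.txt && git commit -q -m 'base'
git checkout -q -b topic
echo 'topic' > topic.txt && git add topic.txt && git commit -q -m 'topic work'
git checkout -q main
echo 'main' > main.txt && git add main.txt && git commit -q -m 'main work'

git merge -q --no-edit topic               # histories diverged, so a merge commit is made
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
echo "$parent_count"                       # number of parent lines in the merge commit
```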

Internals

Objects

Objects are stored in the object database and referable by the SHA-1 hash of their contents.

                  *modify bar.rb!*

    commit e9d...  <---parent---  commit da3...
        /                                  \
  tree f43...                          tree 9ae...
blob 13b... foo.rb -------------┐    blob 7fc... bar.rb -┐
blob a6b... bar.rb -┐           | ┌- blob 13b... foo.rb  |
                    |           | |                      |
                    v           v v                      v
               [BYTES a6b] [BYTES 13b]          [BYTES 7fc]

Blobs

A blob contains file contents as raw bytes.

Trees

A tree contains information about a directory’s:

subdirectories, via references to other tree objects
contained files:
- filemode
- path name
- the blob containing file contents
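
A quick way to look at this structure is to write the index out as tree objects with the plumbing command git write-tree and list the result. The sketch below uses a scratch repository with made-up file names; no commit is needed.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q .
mkdir lib
echo 'top' > README.md
echo 'nested' > lib/util.txt
git add README.md lib/util.txt

root_tree=$(git write-tree)           # turn the index into tree objects
entries=$(git ls-tree "$root_tree")   # one line per entry: mode, type, name, path
echo "$entries"
```

The listing should show README.md as a 100644 blob and lib as a 040000 tree, which in turn can be listed with git ls-tree to reach lib/util.txt.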

Commits

A commit includes:

a tree object reference, representing the directory and file contents of the project at commit
zero or more parent commit references:
- no parent for an initial (root) commit
- a single parent commit for a normal commit
- multiple parent commits for a merge commit
an author, who is responsible for the changes
a committer, who made the commit
a message, representing the changes from the previous commit(s)

References

Anything under .git/refs/. These include:

local branches, under .git/refs/heads/
remote-tracking branches, under .git/refs/remotes/
tags, under .git/refs/tags/

By Demonstration

Note: the following goes through the files backing a Git repo, and does not attempt to bootstrap understanding. A comprehensive overview of files and directories is available in gitrepository-layout(5). A similar walkthrough of Git internals is available in Chapter 10 of Pro Git.

A Tour of the Initial Repository

Fresh project, without any version control:

$ tree -a --noreport git-experiment.jha/
git-experiment.jha/
└── README.md

Initialize the repository:

$ git init ./git-experiment.jha
Initialized empty Git repository in /path/to/git-experiment.jha/.git/

Let’s look at what Git created:

$ tree -a -F --noreport git-experiment.jha/
git-experiment.jha/
├── .git
│   ├── HEAD
│   ├── config
│   ├── description
│   ├── hooks/
│   │   ├── applypatch-msg.sample*
│   │   ├── commit-msg.sample*
│   │   ├── fsmonitor-watchman.sample*
│   │   ├── post-update.sample*
│   │   ├── pre-applypatch.sample*
│   │   ├── pre-commit.sample*
│   │   ├── pre-merge-commit.sample*
│   │   ├── pre-push.sample*
│   │   ├── pre-rebase.sample*
│   │   ├── pre-receive.sample*
│   │   ├── prepare-commit-msg.sample*
│   │   ├── push-to-checkout.sample*
│   │   └── update.sample*
│   ├── info/
│   │   └── exclude
│   ├── objects/
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       └── tags/
└── README.md

Some terms:

the working tree is everything in git-experiment.jha/, sans .git/
the .git/ directory contains Git’s administrative and control files
the repository is considered the combination of:
- the working tree
- the .git/ administrative and control directory

Let’s break down the files we see.

.git/HEAD stores the value of HEAD, a symbolic reference, that always points to the head of the current checkout:

git-experiment.jha $ cat .git/HEAD
ref: refs/heads/main

In this case it points to refs/heads/main, which does not exist yet.
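
The same thing can be asked of Git directly rather than by reading the file. A small sketch in a fresh scratch repository (the branch name printed depends on the local init.defaultBranch setting):

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q .

head_target=$(git symbolic-ref HEAD)    # the ref HEAD points at, e.g. refs/heads/main
echo "$head_target"

# With no commits yet, the branch HEAD names has not been created.
if git rev-parse --verify -q "$head_target" >/dev/null; then
  state='exists'
else
  state='not created yet'
fi
echo "$state"
```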

.git/config stores the local repository config values:

git-experiment.jha $ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	ignorecase = true
	precomposeunicode = true

.git/description is used by gitweb, the web frontend that ships with Git, and unless you are using it I believe it may be ignored:

git-experiment.jha $ cat .git/description
Unnamed repository; edit this file 'description' to name the repository.

.git/hooks/ contains sample hooks; see githooks(5) for more.

git-experiment.jha $ ls -F1 .git/hooks/
applypatch-msg.sample*
commit-msg.sample*
fsmonitor-watchman.sample*
post-update.sample*
pre-applypatch.sample*
pre-commit.sample*
pre-merge-commit.sample*
pre-push.sample*
pre-rebase.sample*
pre-receive.sample*
prepare-commit-msg.sample*
push-to-checkout.sample*
update.sample*

.git/info/exclude stores patterns for excluding files specific to the local repository. For patterns that should be applied to every clone of a repository, look into gitignore(5).

git-experiment.jha $ cat .git/info/exclude
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~

.git/objects/ contains data and information related to the object store. Right now the store is empty, but as we perform Git operations we will see objects being created here. Its current subdirectories are:

.git/objects/info/ which will contain additional information about the store
.git/objects/pack/ which is used when compressing objects

.git/refs/ contains references, which are files that name objects (and those objects are stored in the object store). We have not made any objects yet so no references exist. The directories created by default are:

.git/refs/heads/, which stores one file per branch naming the commit that serves as the tip or head of that branch
.git/refs/tags/, which stores one file per tag naming a commit of interest, like a specific release version

Basic Operations

Ask Git what its understanding is of our repository:

git-experiment.jha $ git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	README.md

nothing added to commit but untracked files present (use "git add" to track)

This tells us a few things:

we are on a branch named main. Note: this is the default branch name I configured via init.defaultBranch for new repositories. Further reading in git-config(1).
there have been no commits
git sees one untracked file, README.md
nothing has been added (“staged”) for commit

Staging

Let’s add README.md to the index, or stage README.md, for commit:

git-experiment.jha $ git add README.md
git-experiment.jha $ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   README.md

Git is now telling us:

still on branch main
still no commits
there are no longer any untracked files
README.md is recognized as a new file, and is considered a change to be committed

Git recommends using git rm --cached <file> to unstage; what does this refer to?

First: “cached” refers to both the older name of the index, which is “cache”, and the actual process by which git stores data in the index, effectively caching it. Second, staging or unstaging a change means to add or remove file contents from the index.

Let’s inspect how our repository has changed on disk:

$ tree -a --noreport -I 'config|description|hooks|exclude' git-experiment.jha/
git-experiment.jha/
├── .git
│   ├── HEAD
│   ├── index
│   ├── info/
│   ├── objects/
│   │   ├── 26/
│   │   │   └── 17c87dce8b25f1c67acd220677749e0e3b3f81
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       └── tags/
└── README.md

We have a few new files:

.git/index which acts as a cache for data Git uses; among those uses is storing information about staged data for the next commit
.git/objects/ now contains .git/objects/26/17c87dce8b25f1c67acd220677749e0e3b3f81

That latter directory and file pair is our first Git object! Its name, formed by joining the directory and file names, is the hexadecimal digest Git computed for the README.md contents we staged for commit.
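
The link between the index entry and the object file can be traced with git ls-files --stage. The sketch below does this in a scratch repository with a made-up file, and assumes the object is stored loose (not yet packed), as it is right after git add:

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q .
printf 'test content\n' > sample.txt
git add sample.txt

staged=$(git ls-files --stage)    # <mode> <object name> <stage number> <path>
echo "$staged"

blob=$(echo "$staged" | cut -d' ' -f2)
# First two hex digits name the directory, the rest name the file.
object_path=".git/objects/$(echo "$blob" | cut -c1-2)/$(echo "$blob" | cut -c3-)"
ls "$object_path"
```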

Digging Deeper Into Objects

The two-character directory name, 26, is one of 256 possible options (2 hexadecimal digits → 16 × 16 = 256) and a strategy Git uses to ensure that these object directories are uniformly distributed while growing. As we create more Git objects through our actions, new directories and files will appear.

Let’s ask Git what type of object is named by that digest:

git-experiment.jha $ git cat-file -t 2617c87dce8b25f1c67acd220677749e0e3b3f81
blob

A blob object represents an untyped sequence of bytes, typically file contents.

Looking inside that blob object:

git-experiment.jha $ git cat-file blob 2617c87dce8b25f1c67acd220677749e0e3b3f81
# experiment

And confirm that it makes sense:

git-experiment.jha $ cat README.md
# experiment
git-experiment.jha $ cat README.md | git hash-object --stdin
2617c87dce8b25f1c67acd220677749e0e3b3f81

It does: we get the same 2617c87dce8b25f1c67acd220677749e0e3b3f81 digest back that is stored in the object database.

Abbreviated Object Names

In the above example we can shorten our object name to 2617c87, or even 2617, provided that we do not have collisions using this shortened version. Fewer than 4 digits does not work: 4 hexadecimal digits is the minimum abbreviation length Git accepts.

7 hexadecimal digits is the default abbreviated form of an object name, and Git will automatically increase this during object display to ensure unique addressing.
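
A short sketch of abbreviation in a scratch repository (identity values are placeholders). It asks Git for the shortest abbreviation it will give, resolves that back to the full name, and then tries a 3-digit prefix:

```shell
# Placeholder identity so the example does not depend on local Git config.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
repo=$(mktemp -d) && cd "$repo" && git init -q .
git commit -q --allow-empty -m 'only commit'

full=$(git rev-parse HEAD)
short=$(git rev-parse --short=4 HEAD)        # at least 4 digits, more if needed for uniqueness
resolved=$(git rev-parse --verify "$short")  # abbreviation back to the full name
echo "$full $short $resolved"

# A 3-digit prefix is below the minimum and is not accepted as an object name.
if git rev-parse --verify -q "$(echo "$full" | cut -c1-3)" >/dev/null 2>&1; then
  three_digits='accepted'
else
  three_digits='rejected'
fi
echo "$three_digits"
```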

Committing

Let’s now assume we are satisfied that the state of the index is worth demarcating as an important snapshot in the history of our development.

To do this, we commit the snapshot with a message “initial commit” via:

git-experiment.jha $ git commit -m 'initial commit'
[main (root-commit) d3466f9] initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

This output is Git telling us:

the branch main now contains a root-commit—the first commit in a history—with abbreviated object name d3466f9, and message initial commit
1 file was “changed” through the creation of README.md
1 line was inserted (“# experiment”)
the filemode is 100644, which refers to a normal file

Let’s inspect how our repository has changed on disk:

$ tree -a -F --noreport -I 'config|description|hooks|exclude' git-experiment.jha/
git-experiment.jha/
├── .git/
│   ├── COMMIT_EDITMSG
│   ├── HEAD
│   ├── index
│   ├── info/
│   ├── logs/
│   │   ├── HEAD
│   │   └── refs/
│   │       └── heads/
│   │           └── main
│   ├── objects/
│   │   ├── 26/
│   │   │   └── 17c87dce8b25f1c67acd220677749e0e3b3f81
│   │   ├── 59/
│   │   │   └── b6bc826b7d4af749f6059e159145fefb840f4c
│   │   ├── d3/
│   │   │   └── 466f9fe2e0db4bca597823cd5602e401ed1337
│   │   ├── info/
│   │   └── pack/
│   └── refs/
│       ├── heads/
│       │   └── main
│       └── tags/
└── README.md

.git/COMMIT_EDITMSG is a file used by Git (and hooks) to manipulate the commit message. We passed our message via the -m flag and therefore did not encounter a case where having access to .git/COMMIT_EDITMSG would be useful. Git does show it with our message though:

git-experiment.jha $ cat .git/COMMIT_EDITMSG
initial commit

Going in order within .git/objects/, let’s inspect the new objects.

First is a tree object that lists our original blob we inspected during staging:

git-experiment.jha $ git cat-file -t 59b6b
tree
git-experiment.jha $ git ls-tree 59b6b
100644 blob 2617c87dce8b25f1c67acd220677749e0e3b3f81	README.md

Second is a commit object, representing our snapshot:

git-experiment.jha $ git cat-file -t d3466
commit
git-experiment.jha $ git cat-file commit d3466
tree 59b6bc826b7d4af749f6059e159145fefb840f4c
author Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800
committer Kip Landergren <klandergren@users.noreply.github.com> 1636743452 -0800

initial commit

This commit object is telling us:

Git’s understanding of the project working tree—what we staged, not necessarily the working tree on disk (remember: we can have unstaged changes and untracked files)—is stored in 59b6bc8, which we inspected above, and that tree, in turn, refers to the README.md content blob 2617c87
the author is me, followed by the timestamp
the committer is me, followed by the timestamp
it has our message “initial commit”
that’s it!

Note: the values for author and committer were specified by me previously; more info on how to do this in git-config(1).

And finally, let’s inspect .git/refs/heads/main:

git-experiment.jha $ cat .git/refs/heads/main
d3466f9fe2e0db4bca597823cd5602e401ed1337

This tells us that the head of branch main, the most recent commit of the branch (but not necessarily the most recent commit in the project’s history), is the object named d3466f9, which we saw above was the initial commit!

The inspection of the main branch head tells us important information:

branches are just pointers to commits
this, in turn, shows that branches are cheap to create, modify, and manipulate: no data is copied when branching
the history of the project, demarcated by commits, is a set of pointers that themselves point to immutable objects (trees and blobs)
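
One way to check the "no data is copied" claim is to count objects before and after creating a branch. A sketch in a scratch repository, with a placeholder branch name and identity values:

```shell
# Placeholder identity so the example does not depend on local Git config.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
repo=$(mktemp -d) && cd "$repo" && git init -q .
git commit -q --allow-empty -m 'only commit'

before=$(git count-objects)
git branch topic                    # new branch at the current commit
after=$(git count-objects)

head_commit=$(git rev-parse HEAD)
topic_commit=$(git rev-parse topic)
echo "$before / $after"             # object count and size are unchanged
echo "$head_commit $topic_commit"   # both names resolve to the same commit
```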
