Introduction to git-annex (Port Of The Week)

Written by Solène, on 12 May 2021.
Tags: #git #versioning #openbsd

1. Introduction §

Now that git-annex is available as a package on OpenBSD, I can use it again. I relied on it a few years ago, but compiling it was really complicated for me and I gave up. I really missed it, so now that I'm back to it, I think it's time to write about this wonderful piece of software.

git-annex is meant to help you manage your data the way you would manage books in a library: you have a database telling you where the books are so you can find them on the shelves, or at least know who borrowed them. The analogy doesn't fully work because we are dealing with digital files that can be copied, but the idea holds: you may want to put some of your data (but not everything) on an external hard drive, and you may want to keep some data on multiple devices for safety reasons. git-annex automates this.

It works very well for files that don't change much. I call them "static files": music, videos, pictures, documents. You don't really want to use git-annex for files you edit every day, because the editing process is a bit tedious.

git-annex may not be easy to understand at first, so I suggest you try it locally to grasp its purpose.

git-annex official website

what git-annex is not

2. Cheat sheet §

Let's create a cheat sheet first. Most git-annex commands have a dedicated man page, and you can also get a shorter help with "git annex help somecommand".

2.1. Create the repository §

The first step is to create a regular git repository, then we tell git-annex to initialize it too.

mkdir ~/MyDataLibrary && cd ~/MyDataLibrary
git init
git annex init "my-computer"

2.2. Add a file §

When you want to register a file in git-annex, you need to use "git annex add" to add it and then "git commit" to make it permanent. The file content is not stored as git objects; the git repository only contains metadata.

git annex add Something
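git commit -m "I added something"

Here is what it looks like with a real file: after "git annex add", the file becomes a symbolic link to its content stored under .git/annex/objects.
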
$ echo "hello there" > hello
$ ls -l hello
-rw-r--r--  1 solene  wheel  12 May 12 18:38 hello
$ git annex add hello
add hello
(recording state in git...)
$ ls -l hello
lrwxr-xr-x  1 solene  wheel  180 May 12 18:38 hello -> .git/annex/objects/qj/g5/SHA256E-s12--aadc1955c030f723e9d89ed9d486b4eef5b0d1c6945be0dd6b7b340d42928ec9/SHA256E-s12--aadc1955c030f723e9d89ed9d486b4eef5b0d1c6945be0dd6b7b340d42928ec9
$ git status hello
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   hello
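
The symlink target above is named after the content of the file. As a minimal sketch (not git-annex's actual code), the key name used by the SHA256E backend can be rebuilt by hand for a file without an extension: the size in bytes, then the SHA-256 checksum. This assumes the "sha256sum" tool is available (on OpenBSD the equivalent is "sha256 -q"); the annex_key function name is made up for this illustration.

```shell
# Hypothetical helper: print the SHA256E key name for a file without extension.
annex_key() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')           # size in bytes, padding removed
    sum=$(sha256sum < "$file" | cut -d ' ' -f 1)  # keep only the hex checksum
    printf 'SHA256E-s%s--%s\n' "$size" "$sum"
}

# Same content as the "hello" file used above, in a throwaway directory.
workdir=$(mktemp -d)
echo "hello there" > "$workdir/hello"
annex_key "$workdir/hello"
```

Comparing this value with the name in the symlink target is, roughly, what an integrity check of the content boils down to.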

2.3. Make changes to a file §

If you want to make changes to a file, you first need to "unlock" it in git-annex, which means the symbolic link is replaced by a writable copy of the file. After your changes, you need to add it to git-annex again and commit.

git annex unlock file
vi file
git annex add file
git commit -m "I changed something" file

2.4. Add a remote encrypted repository §

If you want to store data (for duplication) on a remote server over ssh, you can use a special remote of type "rsync" and encrypt the data in several ways (GPG in hybrid mode is the best). This allows you to store data on untrusted remote devices.

git annex initremote my-remote-server type=rsync rsyncurl=remote-server.com:/home/solene/git-annex-data keyid=my-gpg@address encryption=hybrid

After this command, I can send files to my-remote-server.

git-annex website about encryption

git-annex website about special remotes

2.5. Manage data from multiple computers (with ssh) §

**This is a way to have a central git repository for many computers; it is not the best way to store data on remote servers.**

If you want to use a remote server through ssh, there are two ways: mounting the remote file system with sshfs, or using plain ssh. With sshfs, the remote behaves like a standard local file system, such as an external USB drive. With plain ssh, it's different.

You need key-based authentication on the remote ssh server, and git-annex must be installed there too. It's important to use a bare git repository. On the server:

cd /home/data/
git init --bare
git annex init "remote-server"

On your computer:

git remote add remote-server ssh://hostname:/home/data/
git fetch remote-server

You can now use the git-annex commands that work with remote repositories!

2.6. List files and where they are stored §

You can use the "git annex list" command to list where your files are physically stored.

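In the following example, you can see which files are on my computer and which are available on my remote server called "network"; "web" and "bittorrent" are special remotes. Each of the four columns stands for one repository, and an X means the file content is present there.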

X___ Documentation/Nim/Dominik Picheta - Nim in Action-Manning Publications (2017).pdf
X___ Documentation/ada/Ada-Distilled-24-January-2011-Ada-2005-Version.pdf
X___ Documentation/ada/courseada1.pdf
X___ Documentation/ada/courseada2.pdf
X___ Documentation/ada/courseada3.pdf
X___ Documentation/scheme/artanis.pdf
X___ Documentation/scheme/guix.pdf
X___ Documentation/scheme/manual_guix.pdf
X___ Documentation/skribilo/skribilo.pdf
X___ Documentation/uck2ep1.pdf
X___ Documentation/uck2ep2.pdf
X___ Documentation/usingckermit3e.pdf
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/01 - Daftendirekt.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/02 - Wdpk 83.7 fm.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/03 - Revolution 909.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/04 - Da Funk.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/05 - Phoenix.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/01 - Alan Walker - Intro.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/02 - Alan Walker, Sorana - Lost Control.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/03 - Alan Walker, Julie Bergan - I Don_t Wanna Go.flac
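
Because every line starts with one marker per repository, the listing is easy to post-process with standard tools. Here is a minimal sketch, assuming the output format shown above (one X or _ per repository, a space, then the path). The single_copy filter is a made-up name: it keeps only the files present in exactly one repository, which are the ones a single disk failure would lose.

```shell
# Hypothetical filter: read "git annex list" output on stdin and print the
# paths whose content is present in exactly one repository.
single_copy() {
    awk '
        /^[X_]+ / {
            markers = $1
            if (gsub(/X/, "X", markers) == 1) {   # gsub returns the number of X found
                sub(/^[X_]+ /, "")                 # strip the markers, keep the path
                print
            }
        }'
}

# Sample input copied from the listing above.
single_copy <<'EOF'
X___ Documentation/ada/courseada1.pdf
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/04 - Da Funk.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/01 - Alan Walker - Intro.flac
EOF
```

In a real repository, the filter would be fed with "git annex list | single_copy".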

2.7. List files locally available §

If you want to list only the files whose content is available locally, use the "list" command from git-annex restricted to "here", which stands for your local repository.

git annex list --in here

3. Work with a remote repository §

3.1. Delete a repository §

Simply mark it as "dead": git-annex will then stop expecting any content from it.

git annex dead $repo_name

3.2. Adding a remote repository GPG encrypted §

git annex initremote $name type=rsync rsyncurl=remote-server:/home/solene/mydirectory keyid=your@email encryption=hybrid

3.3. Copy files to a remote §

If you want to duplicate files between repositories to have multiple copies, you can use "git annex copy".

git annex copy Music -t remote-server

3.4. Move files to a remote §

If you want to move files from one repository to another (removing the content from the origin), you can use "git annex move", which copies to the destination and then removes from the origin.

git annex move Music -t remote-server

3.5. Get a file content §

If you don't have a file locally, you can fetch it from a remote to get the content.

git annex get Music/Queen

3.6. Forget a file locally §

If you don't want to keep a file locally, because you lack disk space or simply don't need it, you can use the "drop" command. Note that "drop" is safe: git-annex won't let you drop a file when it can't verify that another copy exists (unless you use --force, of course).

git annex drop Music/Queen

Real life example: I have a huge music library but my laptop SSD is too small, so I get the music I want and drop the files I don't want to listen to for a while.
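
To try "get" and "drop" without a second machine, here is a minimal sketch meant to be saved as a script. It builds two repositories in a temporary directory, named laptop and disk (made-up names standing in for two devices), and uses a plain local clone instead of ssh. The git identity set in each repository is a placeholder.

```shell
#!/bin/sh
# Throwaway sandbox: two local repositories standing in for two devices.
set -eu

if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex is not installed, nothing to try"
    exit 0
fi

sandbox=$(mktemp -d)

# First repository, holding the only copy of a file.
mkdir "$sandbox/laptop"
cd "$sandbox/laptop"
git init -q
git config user.name "Sandbox"
git config user.email "sandbox@example.invalid"
git annex init "laptop"
echo "hello there" > hello
git annex add hello
git commit -q -m "I added something"

# No other copy exists yet, so drop should refuse to remove the content.
if git annex drop hello; then
    echo "unexpected: the only copy was dropped"
else
    echo "drop refused: this is the only copy"
fi

# Second repository, cloned from the first one (its remote is "origin").
cd "$sandbox"
git clone -q laptop disk
cd disk
git config user.name "Sandbox"
git config user.email "sandbox@example.invalid"
git annex init "disk"

git annex get hello     # fetch the content from the laptop repository
git annex drop hello    # allowed: git-annex can verify the copy on the laptop
git annex get hello     # fetch it again
cat hello
```

Everything lives under the temporary directory, so it can be deleted once you are done experimenting.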

3.7. Use mincopies to enforce multi repository data duplication §

The numcopies and mincopies settings tell git-annex how many copies of each file you want: numcopies is the number of copies it tries to keep, and mincopies is the hard minimum it refuses to go below. This protects you from accidental deletions and also helps uploading files to other repositories to meet the requirements.

3.7.1. Enable per directory recursively §

echo "* annex.mincopies=2" > .gitattributes
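
The rule above applies to the whole repository. Here is a sketch of the per-directory variant, assuming hypothetical documents, pictures and music directories: a .gitattributes file placed inside a directory applies to everything below it, and plain "git check-attr" shows which value a path would get. The file names are made up and don't need to exist.

```shell
# Throwaway repository to experiment in; git-annex is not needed for this check.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir documents pictures music

# Require two copies only below documents/ and pictures/.
echo "* annex.mincopies=2" > documents/.gitattributes
echo "* annex.mincopies=2" > pictures/.gitattributes

# Ask git which value applies to a few (hypothetical) paths.
git check-attr annex.mincopies -- documents/2021/payslip.pdf pictures/family.jpg music/song.flac
```

The paths under documents and pictures should report the value 2, and the one under music should be reported as unspecified.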

3.7.2. Only upload files not matching the num copies §

If you have multiple repositories and some files don't meet the copies requirement, you can use the following command to push only the files that are missing copies.

git annex copy --auto -t remote-server

Real life example: I want my salary PDFs to be really safe, so I ask for 2 copies of them and then run the copy to the remote server, which uploads only the files that still have a single copy.

3.8. Verifying integrity and requirements §

The "git annex fsck" command checks the integrity of every file in the local repository and reports whether they are sane (or not). It also tells you which files don't meet the copies requirements.

git annex fsck

4. Reversibility §

If for some reason you want to give up git-annex, you can easily get all your files back as on a normal file system by running "git annex unlock ." in the top directory of your repository: every locally present file is replaced by a real copy instead of the symlink. Reversibility is very important when you deal with your data because it means you are not stuck forever with a tool in case it's broken or if you want to switch to another process.
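
Since locked files are ordinary symlinks into .git/annex/objects (as seen in the "ls -l" output earlier), standard tools can also copy the content out without git-annex at all. The sketch below uses a simulated layout with a fake object name ("demo-key") rather than a real git-annex repository, and shows "cp -RL" following the symlinks to produce regular files. This only works for files whose content is present locally.

```shell
# Simulated layout, NOT created by git-annex: one object and one symlink to it.
work=$(mktemp -d)
cd "$work"
mkdir -p repo/.git/annex/objects/demo-key repo/docs
echo "hello there" > repo/.git/annex/objects/demo-key/demo-key
ln -s ../.git/annex/objects/demo-key/demo-key repo/docs/hello

# -R copies the directory, -L follows the symlinks and copies the real content.
mkdir export
cp -RL repo/docs export/
ls -l export/docs/hello
```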

5. My workflow §

I have a ~/DATA/ directory with the sub directories {documents,documentation,pictures,videos,music,images}. Documents are papers or legal papers, documentation is mostly PDFs, pictures are family pictures, and images are wallpapers or stupid images I want to keep.

I've set mincopies to 2 for documents and pictures. My music is not on my computer but on a remote: I get the music files I want to listen to when I'm on the local network with the computer holding the files, and I drop them locally when I'm bored of them.

6. Conclusion §

git-annex separates content from indexing. It can be used in many ways, but it implies an archivist philosophy: redundancy, safety, immutability (sort of). It is not meant for backup: you can back up a directory managed by git-annex, but it will only save the data you have locally, and you will have to back up your other data as well.

I love that tool, it's a very nice piece of software. It's unique: I didn't find any other program that achieves this.

6.1. More resources §

git-annex official walkthrough

git-annex special remotes (S3, webdav, bittorrent, etc.)

git-annex encryption