Git with ipython notebooks

Interactive tutorials live in notebooks, and a full “experiment” in the lab is contained in a single notebook. Notebooks are meant to change often and to be played with. They are graphical. They are also essential to track.

Problem 1

  • Diff-ing your work against someone else’s is impossible
  • Changes to binary outputs take up a huge amount of space, even if nothing significant actually changed

Jupyter notebooks have two sections: inputs (code, markdown) and outputs (stdout, plots, images). Notebook files (.ipynb) embed the rendered outputs right next to the inputs. This is good if you want to restart a kernel but still see the output, or if you close the file, etc. It is also the cause of both problems above.
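To make the input/output split concrete, here is a hand-written, minimal notebook in the nbformat 4 layout, built as a Python dict. It is only an illustration (the cell contents are made up), but it shows that outputs sit inside the same JSON file as the code that produced them. Images are stored the same way, as long base64 strings.

```python
import json

# A hand-written, minimal notebook (illustrative only, not from a real experiment).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Gather data"],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(3 ** 2)"],
            # The output is saved in the file, next to the input that made it.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["9\n"]}
            ],
        },
    ],
}


def summarize(nb):
    """Return (cell_type, number_of_outputs) for every cell in a notebook dict."""
    return [(cell["cell_type"], len(cell.get("outputs", []))) for cell in nb["cells"]]


print(summarize(notebook))  # → [('markdown', 0), ('code', 1)]
print(json.dumps(notebook, indent=1)[:60])  # the start of what lands on disk
```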

Solution 1 and Problem 2: The nbstripout filter

nbstripout is a Git filter that “hides” the output and some metadata in .ipynb files from Git so that they do not get committed. That way Git tracks only the actual input cells. It is installed via the requirements.txt, but there is also some interesting discussion and documentation.
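The idea behind stripping can be sketched in a few lines of Python. This is not nbstripout's actual code, just a simplified illustration that assumes the nbformat 4 layout and only clears outputs and execution counts (the real tool also handles some metadata).

```python
import copy


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    stripped = copy.deepcopy(nb)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# A tiny made-up notebook with one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        }
    ],
}

clean = strip_outputs(example)
print(clean["cells"][0]["source"], clean["cells"][0]["outputs"])
```

The source of each cell survives and everything that was generated by running it is dropped, which is why the diffs become readable.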

There are four downsides:

  1. What if you liked keeping those outputs without rerunning every commit?
  2. It has to strip everything, including all those high-quality graphics, every single time you run git status.
  3. It crashes your essential commands. It is very easy to get into a chicken-and-egg hole where you can’t diff anything because something isn’t JSON, which causes a crash, but you can’t figure out what isn’t JSON because you can’t see which files just changed.
  4. It can corrupt files. That’s why we made cleannbline.
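One way out of the chicken-and-egg hole in item 3 is to check for broken notebooks without going through Git at all. The sketch below is a hypothetical helper (the function name and the notebooks directory are assumptions) that tries to parse every .ipynb file under a directory and reports the ones that fail.

```python
import json
from pathlib import Path


def find_invalid_notebooks(root):
    """Return (path, error message) for every .ipynb under root that is not valid JSON."""
    problems = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as err:
            # Either the text is not JSON or the bytes are not valid UTF-8.
            problems.append((str(path), str(err)))
    return problems


if __name__ == "__main__":
    # "notebooks" is assumed to be the directory holding the .ipynb files.
    for path, message in find_invalid_notebooks("notebooks"):
        print(f"{path}: {message}")
```

The error message includes the line and column where parsing stopped, which is usually enough to find the problem in an editor.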

Solution 2: Deactivate the nbstripout filter

source venv/bin/activate
nbstripout --uninstall

Never think about it again… until you have to merge.

Best practice

Ultimately, some of the work in notebooks will be lost. This is desirable in the case where two people made slightly different versions of the same figure. However, it is impossible to tell if something important changed in a source cell.

Use semi-libraries for long and complex code segments. These are regular python files in the same directory as the notebook. They can be diffed easily.

> notebooks/myFolder
| gatherData.ipynb
| libStuff.py

In “libStuff.py”:

def squareIt(x):
    return x ** 2

In “gatherData.ipynb”:

from libStuff import squareIt
y = squareIt(3)

The merge scenario

You have branches development and cool-feature, and you want to merge cool-feature into development. Both have lots of notebooks with outputs, possibly with corrupted first lines.

Preliminaries

nbstripout is in your venv, so activate the venv. Later, when we install the filter, it expects a clean attributes file.

source venv/bin/activate
rm .git/info/attributes  # you don't have to do this every time

You should have a good file editor (Sublime) ready for lots of conflicts happening within unreadable (in multiple senses) .ipynb files. You will need some kind of “Find All Within Project.” Run it on your local machine, with the repository mounted over SSHFS.

Be aware of the cleannbline script. Sometimes non-JSON and non-unicode characters get into the first line of a notebook, making the file unreadable for everything. This script cleans them out.
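cleannbline itself is not reproduced here. The following is only a hypothetical sketch of the idea, under the assumption that the damage is limited to stray bytes sitting before the opening { of the notebook's JSON. The function name is made up, and the real script may work differently.

```python
def drop_leading_garbage(raw: bytes) -> bytes:
    """Discard any bytes before the first '{', where a notebook's JSON object begins."""
    start = raw.find(b"{")
    if start == -1:
        raise ValueError("no '{' found; this does not look like a notebook")
    return raw[start:]


# Made-up example: two invalid bytes and some junk in front of a tiny notebook.
damaged = b'\xff\xfe junk{"cells": [], "nbformat": 4}'
print(drop_leading_garbage(damaged))
```

It works on bytes rather than text so that non-unicode characters cannot trip it up. If you try something like this on a real file, write the result to a copy first.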

Process

Create a test branch for merge

git checkout -b test_merge_cool-feature-into-development

Activate your filter

nbstripout --install
cat .git/info/attributes

should produce an output that looks like this

*.ipynb filter=nbstripout
*.ipynb diff=ipynb

Strip the notebooks on test branch

Run

git status

It takes some time. If it ends in an error, that means some of the notebooks are not valid JSON and cannot be parsed by the nbstripout filter.

The crash log should point to a certain file, let’s say notebooks/Test.ipynb. First, clean it with

./cleannbline notebooks/Test.ipynb

Then, open that file in Sublime and search for <<<<. Sometimes conflict markers left over from a stash get hidden in a way that does not show up in Jupyter, but they still make nbstripout crash. You can find them in Sublime.
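If you would rather not search by hand, the same hunt for conflict markers can be scripted. This is a minimal sketch with a made-up function name; it assumes the notebooks live under a notebooks directory and simply reports every line containing <<<<<<< or >>>>>>>.

```python
from pathlib import Path


def find_conflict_markers(root, pattern="*.ipynb"):
    """Return (path, line number, start of line) for each line with a conflict marker."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        # errors="replace" keeps the scan going even if a file has bad bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if "<<<<<<<" in line or ">>>>>>>" in line:
                hits.append((str(path), number, line.strip()[:60]))
    return hits


if __name__ == "__main__":
    # "notebooks" is assumed to be the directory holding the .ipynb files.
    for path, number, snippet in find_conflict_markers("notebooks"):
        print(f"{path}:{number}: {snippet}")
```

Passing a different pattern, for example "*.py", runs the same check over regular code files.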

Return to running git status until it completes without error. It should show a ton of modifications: those are the effects of stripping. Add those and commit

git add .
git commit -m "stripped notebooks for merge"

Strip the notebooks on cool-feature branch

Your filter is currently active, so when you try

git checkout cool-feature

it will automatically crash. As above though, it will point to a file. Keep going until git status completes. Add those and commit.

Side note: even though git status shows a ton of modifications, you should get a clean git diff (although sometimes it will just crash, NBD). Both commands are applying the .ipynb filter… in some way.

Do the merge

git checkout test_merge_cool-feature-into-development
git merge cool-feature

You will get conflicts in two categories: notebooks and other. Since there are <<<< conflict markers everywhere, your git diff will crash while you’re in the merge. It also doesn’t point you to an offending file. Here is where you’ll really appreciate Sublime.

Make sure Sublime opens the entire notebooks directory. That way Find All will search all the files.

  1. Pick one file, let’s say notebooks/Test2.ipynb
  2. You might have to ./cleannbline notebooks/Test2.ipynb
  3. In Sublime, fix all instances of <<<<, which are usually
    • Minor version changes or metadata stuff
    • Legitimate conflicts
  4. When you are satisfied, go back and git add notebooks/Test2.ipynb

Repeat for all the notebooks. Then do the same for all the regular code files. When you run git status and everything is green, you are done. End the merge with

git commit

If for some reason you want to abandon the merge while keeping the test_merge branch stripped, you can run git reset --hard (git merge --abort also works while the merge is still in progress).

Finalize

Double check that everything went well (i.e. open some notebooks in Jupyter). If something screwed up in your merging or stripping, you can just delete the test_merge branch and start over.

Now we’re going to make changes to the real development branch.

git checkout development

This will take a while. If it causes crashes, repeat the cleaning steps above (cleannbline, then search for <<<<) until all notebooks are valid JSON and you get a successful git status. Then make a commit on the real branch and merge in the test branch

git add .
git commit -m "stripped notebooks from target branch"
git merge test_merge_cool-feature-into-development

This should succeed without conflict.

Cleanup

Remove the test branch

git branch -d test_merge_cool-feature-into-development

Then you must deactivate the filter

nbstripout --uninstall

Now you can move around the unclean branches without triggering crashes left and right.

While you’re at it, leave the venv

deactivate

Some additional notes on the filter:

When you have the filter active and check out a normal branch, it will check out AND strip the outputs in git’s mind (not the HEAD version though… confusing)

When you have the filter active and leave a branch that has outputs, it will generate changes, thereby not allowing you to checkout without committing changes

You can turn it on and off with the nbstripout --install and nbstripout --uninstall commands, as long as the attributes file has nothing else in it. This is the easiest way to check: cat .git/info/attributes
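The same check can be done from Python. This sketch uses a made-up function name and only looks at the attributes file. The two expected lines are the ones shown earlier in this document, so an nbstripout version that writes different lines would show up under other_lines.

```python
from pathlib import Path

# The two lines this document expects after nbstripout --install.
EXPECTED = {"*.ipynb filter=nbstripout", "*.ipynb diff=ipynb"}


def inspect_attributes(attributes_path=".git/info/attributes"):
    """Report whether the nbstripout filter is listed and what else is in the file."""
    path = Path(attributes_path)
    if not path.exists():
        return {"filter_active": False, "other_lines": []}
    lines = [line.strip() for line in path.read_text(encoding="utf-8").splitlines()]
    # Ignore blank lines and comments.
    lines = [line for line in lines if line and not line.startswith("#")]
    return {
        "filter_active": any("filter=nbstripout" in line for line in lines),
        "other_lines": [line for line in lines if line not in EXPECTED],
    }


if __name__ == "__main__":
    print(inspect_attributes())
```

An empty other_lines list means the file has nothing else in it, which is the situation where turning the filter on and off is safe.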