Git with ipython notebooks¶
Interactive tutorials are in notebooks. A full “experiment” in the lab is contained in a notebook. Notebooks are supposed to change a lot and meant to be played with. They are graphical. They are also essential to track.
Problem 1¶
- Diff-ing your work against someone else’s is impossible
- Changes to binary outputs take up a huge amount of space, even if nothing significant actually changed
Jupyter notebooks hold two kinds of content: inputs (code, markdown) and outputs (stdout, plots, images). The .ipynb file embeds the rendered outputs alongside the inputs. This is convenient if you want to restart a kernel, or close and reopen the file, and still see the output, but it is also what causes both problems above.
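As a rough illustration of the input/output split, the sketch below builds a tiny notebook as a Python dict in the nbformat 4 JSON layout and measures how much of the serialized file is output. The cell contents and the stand-in image payload are made up for the example.

```python
import json

# A hand-built notebook in the nbformat 4 JSON layout (contents are made up).
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["plot(data)"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    # Stand-in for a base64-encoded PNG; real plots are far larger.
                    "data": {"image/png": "iVBORw0KGgo" * 500},
                }
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}


def output_fraction(nb):
    """Return the fraction of the serialized notebook taken up by cell outputs."""
    total = len(json.dumps(nb))
    outputs = sum(
        len(json.dumps(cell.get("outputs", [])))
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    )
    return outputs / total


print(f"{output_fraction(notebook):.0%} of this notebook is output")
```

Even with a single small fake image, nearly all of the file is output, which is why re-running a notebook produces large changes even when no input changed.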
Solution 1 and Problem 2: The nbstripout filter¶
nbstripout is a Git filter that “hides” the output and some metadata in .ipynb files from Git so that they do not get committed. This way, Git tracks only the actual input cells. It is installed via the requirements.txt, but there is also some interesting discussion and documentation.
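To make the idea concrete, here is a minimal sketch of the kind of transformation such a filter applies, written against the nbformat 4 JSON layout. It is not nbstripout's actual implementation, which handles more metadata and options; strip_outputs is a hypothetical helper for illustration only.

```python
import copy
import json


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(nb)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# A made-up one-cell notebook, serialized the way it would sit on disk.
raw = json.dumps({
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
})

clean = strip_outputs(json.loads(raw))
print(json.dumps(clean, indent=1))
```

The source of the cell survives untouched; only the parts that change on every run are removed.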
There are four downsides:¶
- What if you liked keeping those outputs without rerunning every commit?
- It has to strip evvverything, including all those high-quality graphics, every single time you git status.
- It crashes your essential commands. Very easy to get into a chicken-and-egg hole where you can’t diff anything because something isn’t JSON (causing a crash), but you can’t figure out what isn’t JSON because you can’t see which files just changed.
- It can corrupt files. That’s why we made cleannbline.
Solution 2. Deactivate the nbstripout filter¶
source venv/bin/activate
nbstripout --uninstall
Never think about it again… until you have to merge.
Best practice¶
Ultimately, some of the work in notebooks will be lost. This is desirable in the case where two people made slightly different versions of the same figure. However, it is impossible to tell from a notebook diff whether something important changed in a source cell.
Use semi-libraries for long and complex code segments. These are regular python files in the same directory as the notebook. They can be diffed easily.
notebooks/myFolder/
|-- gatherData.ipynb
|-- libStuff.py
In “libStuff.py”:
def squareIt(x):
    """Return x squared."""
    return x ** 2
In “gatherData.ipynb”:
from libStuff import squareIt
y = squareIt(3)
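Because a semi-library is plain text, a change to it shows up as a short, readable diff. The sketch below uses Python's difflib to mimic roughly what git diff would show if someone changed squareIt; the "after" version is invented for the example.

```python
import difflib

before = """def squareIt(x):
    return x ** 2
"""

# Hypothetical edit by a collaborator.
after = """def squareIt(x):
    return x * x
"""

diff = list(
    difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="a/libStuff.py",
        tofile="b/libStuff.py",
    )
)
print("".join(diff))
```

The same one-line change inside a notebook would be buried among JSON escaping, execution counts, and regenerated outputs.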
The merge scenario¶
You have branches development and cool-feature, and you want to merge cool-feature into development. Both have lots of notebooks with outputs, possibly with corrupted first lines.
Preliminaries¶
nbstripout is in your venv, so activate the venv. Later, when we install the filter, it expects a clean attributes file.
source venv/bin/activate
rm .git/info/attributes  # you don't have to do this every time
You should have a good file editor (Sublime) ready for lots of conflicts happening within unreadable (in multiple senses) .ipynb files. You will need some kind of “Find All Within Project.” Have it running on your local machine over an SSHFS mount.
Be aware of the cleannbline script. Sometimes non-JSON and non-unicode characters get into the first line of a notebook, making the file unreadable for everything. This script cleans them.
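Before reaching for cleannbline, it can help to see what is actually sitting at the start of a file. The sketch below is not the cleannbline script (its implementation is not shown here); it is a hypothetical diagnostic that reports any bytes preceding the opening { of the notebook's JSON, and the sample byte strings are made up.

```python
def leading_junk(raw: bytes) -> bytes:
    """Return the bytes that come before the first '{' (empty if the file starts cleanly)."""
    start = raw.find(b"{")
    if start == -1:
        return raw  # no JSON object found at all
    return raw[:start]


# Made-up examples: a clean notebook start and one with stray bytes in front.
clean_start = b'{\n "cells": [],\n "metadata": {}\n}'
dirty_start = b'\xff\xfe junk{\n "cells": [],\n "metadata": {}\n}'

print(repr(leading_junk(clean_start)))
print(repr(leading_junk(dirty_start)))
```

For a real file, read it in binary mode, for example with open(path, "rb"), and pass the bytes to the function.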
Process¶
Create a test branch for merge¶
git checkout -b test_merge_cool-feature-into-development
Activate your filter¶
nbstripout --install
cat .git/info/attributes
This should produce output that looks like this:
*.ipynb filter=nbstripout
*.ipynb diff=ipynb
Strip the notebooks on test branch¶
Run
git status
It takes some time. What is that error? It means that some of the notebooks are not valid JSON and cannot be parsed by the nbstripout filter.
The crash log should point to a specific file, let’s say notebooks/Test.ipynb.
First, clean it with
./cleannbline notebooks/Test.ipynb
Then, open that file in Sublime and search for <<<<. Sometimes conflicts from your stash can get hidden in a way that does not show up in Jupyter, but nbstripout will still crash on them. You can find them in Sublime.
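If the crash log is hard to read, the same validity check can be done outside Git. This sketch walks a directory and tries to parse every .ipynb with the standard json module, reporting the files that fail. find_invalid_notebooks is a hypothetical helper, and the demo runs against a temporary directory rather than your repository.

```python
import json
import tempfile
from pathlib import Path


def find_invalid_notebooks(root):
    """Return (path, error message) pairs for .ipynb files under root that fail to parse as JSON."""
    problems = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as err:
            problems.append((path, str(err)))
    return problems


# Demo on throwaway files: one valid notebook, one with a leftover conflict marker.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "Good.ipynb"
    bad = Path(tmp) / "Test.ipynb"
    good.write_text(
        '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}',
        encoding="utf-8",
    )
    bad.write_text('<<<<<<< Updated upstream\n{"cells": []}', encoding="utf-8")
    problems = find_invalid_notebooks(tmp)
    bad_names = [path.name for path, _ in problems]
    for path, message in problems:
        print(f"{path.name}: {message}")
```

Pointing it at the notebooks directory lists every offender at once, instead of one file per crash.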
Return to running git status until it completes without error. It should show a ton of modifications: those are the effects of stripping. Add those and commit:
git add .
git commit -m "stripped notebooks for merge"
Strip the notebooks on cool-feature branch¶
Your filter is currently active, so when you try
git checkout cool-feature
it will automatically crash. As above though, it will point to a file. Clean that file and keep going until git status completes. Add the resulting modifications and commit.
Side note: even though git status shows a ton of modifications, you should get a clean git diff (although sometimes it will just crash, NBD). Both commands are applying the .ipynb filter… in some way.
Do the merge¶
git checkout test_merge_cool-feature-into-development
git merge cool-feature
You will get conflicts in two categories: notebooks and other. Since there are <<<< conflict markers everywhere, your git diff will crash while you’re in the merge. It also doesn’t point you to an offending file. Here is where you’ll really appreciate Sublime.
Make sure Sublime opens the entire notebooks directory. That way Find All will search all the files.
- Pick one file, let’s say notebooks/Test2.ipynb
- You might have to ./cleannbline notebooks/Test2.ipynb
- In Sublime, fix all instances of <<<<, which are usually
  - Minor version changes or metadata stuff
  - Legitimate conflicts
- When you are satisfied, go back and git add notebooks/Test2.ipynb
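If Find All is not handy, a small script can list where the conflict markers are. This is a sketch, not part of the lab tooling: find_conflict_markers is a hypothetical helper that looks for lines starting with <<<<<<<, ======= or >>>>>>>, which are the markers Git writes into conflicted files. The demo file is made up.

```python
import tempfile
from pathlib import Path

MARKERS = ("<<<<<<<", "=======", ">>>>>>>")


def find_conflict_markers(root, pattern="*.ipynb"):
    """Return (file name, line number, marker) for each line that starts with a conflict marker."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        # errors="replace" keeps the scan going even if a file has bad bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if line.startswith(MARKERS):
                hits.append((path.name, number, line[:7]))
    return hits


# Demo on a throwaway file with one fake conflict.
with tempfile.TemporaryDirectory() as tmp:
    conflicted = Path(tmp) / "Test2.ipynb"
    conflicted.write_text(
        '{\n'
        '<<<<<<< HEAD\n'
        ' "nbformat_minor": 2\n'
        '=======\n'
        ' "nbformat_minor": 4\n'
        '>>>>>>> cool-feature\n'
        '}\n',
        encoding="utf-8",
    )
    hits = find_conflict_markers(tmp)
    for name, number, marker in hits:
        print(f"{name}:{number}: {marker}")
```

Passing a different pattern, such as "*.py", covers the regular code files, though a line of ======= can be legitimate there and should be checked by eye.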
Repeat for all the notebooks. Then do the same for all the regular code files. When you run git status and everything is green, you are done. End the merge with
git commit
If, for some reason, you want to abandon the merge while keeping the test_merge branch stripped, you can run git reset --hard.
Finalize¶
Double check that everything went well (i.e. open some notebooks in Jupyter). If something screwed up in your merging or stripping, you can just delete the test_merge branch and start over.
Now we’re going to make changes to the real development branch.
git checkout development
This will take a while. If it causes crashes, repeat the cleaning steps above to make sure all notebooks are valid JSON until you get a successful git status. Then make a commit on the real branch and merge the test branch into it:
git add .
git commit -m "stripped notebooks from target branch"
git merge test_merge_cool-feature-into-development
This should succeed without conflict.
Cleanup¶
Remove the test branch
git branch -d test_merge_cool-feature-into-development
Then you must deactivate the filter
nbstripout --uninstall
Now you can move around the unclean branches without triggering crashes left and right.
While you’re at it, leave the venv
deactivate
Some additional notes on the filter:¶
When you have the filter active and check out a normal branch, Git will check it out AND strip the outputs in Git’s mind (not the HEAD version though… confusing).
When you have the filter active and leave a branch that has outputs, it will generate changes, thereby not allowing you to check out another branch without committing those changes.
You can turn it on and off with the nbstripout --install and nbstripout --uninstall commands, as long as the attributes file has nothing else in it.
This is the easiest way to check: cat .git/info/attributes
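The same check can be scripted. The sketch below parses the text of an attributes file and reports whether an nbstripout filter line is present and whether anything else is in the file. The helper name is illustrative, and the sample text copies the two lines shown in the "Activate your filter" step.

```python
def inspect_attributes(text):
    """Split attributes-file text into nbstripout-related lines and everything else."""
    ours, other = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "*.ipynb" and (
            "filter=nbstripout" in fields or "diff=ipynb" in fields
        ):
            ours.append(line)
        else:
            other.append(line)
    return ours, other


# Sample contents: the two lines the install step is expected to write.
sample = "*.ipynb filter=nbstripout\n*.ipynb diff=ipynb\n"
ours, other = inspect_attributes(sample)
print("filter active:", any("filter=nbstripout" in line for line in ours))
print("other entries:", other)
```

For a real repository, read .git/info/attributes into a string and pass it in; a non-empty second list means the file has entries that install and uninstall may not handle cleanly.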