git
?
Git is a version control system used extensively for software
development. Unlike the ‘file-based’ versioning you may be familiar
with from Dropbox or Google Docs, git provides line-by-line
(character-by-character) versioning for text files. It is especially
powerful for collaborative projects, allowing easy merging of files
changed by multiple authors. While it may seem a bit daunting at first,
like any other professional tool, a small investment in learning to use
this system pays dividends down the road.
See https://git-scm.com/book/en/v2/Getting-Started-Installing-Git for instructions for your operating system.
Getting started with git is relatively easy using a graphical user
interface (GUI), like the one built into RStudio. However, to really
get going with git, I recommend GitKraken, an extremely powerful and
user-friendly git GUI.
There are several excellent resources to help you get started with
git, GitHub, and GitKraken:
Set up a personal access token (PAT) following these instructions,
then edit ~/.Renviron to include the following:
GITHUB_PAT=xxxxxxxxxxxxxxxx
Set your name and email for commits:
git config --global user.name "YOURNAME"
git config --global user.email "YOUREMAIL@EMAIL.COM"
Set the default editor for commit messages, etc. to nano instead of vi:
git config --global core.editor "nano"
Use ssh instead of https with GitHub:
git config --global url.ssh://git@github.com/.insteadOf https://github.com/
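To preview what this rewrite rule does before adding it to your global configuration, the sketch below applies the same rule to a single command with `git -c` and asks git to print the URL it would use. `USERNAME/REPO` is a placeholder, and `--get-url` only expands the URL without contacting GitHub.

```shell
# Apply the insteadOf rule for this one command only (global config untouched),
# then print the URL git would actually use for an https GitHub address.
rewritten="$(git -c url."ssh://git@github.com/".insteadOf="https://github.com/" \
  ls-remote --get-url "https://github.com/USERNAME/REPO")"
echo "$rewritten"
```

If the rule is picked up, the printed URL should start with `ssh://git@github.com/` rather than `https://github.com/`.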
Git workflows are branch-based. The main branch is the primary branch
from which others are derived, and contains the code of the latest
release. The development branch contains the latest contributions and
other code that will appear in the next release. Other branches can be
created as needed to implement features, fix bugs, or try out new
algorithms, before being merged into development (and eventually into
main).
Before merging branches, it is useful to create a pull request (PR) via GitHub, to allow for code review as well as trigger any automated tests and code checks.
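As a minimal sketch of this branching model, the commands below build a throwaway repository in a temporary directory and walk through main, development, and one feature branch. The branch name `feature/new-algorithm`, the file names, and the commit identity are hypothetical. The pull request itself happens on GitHub and is not shown; the local merge stands in for it.

```shell
set -e
repo="$(mktemp -d)"                        # throwaway sandbox, not a real project
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch 'main'
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

# main holds the code of the latest release
echo "release code" > README.md
git add README.md
git commit -q -m "Initial release"

# development is derived from main; feature branches are derived from development
git checkout -q -b development
git checkout -q -b feature/new-algorithm
echo "experimental" > algorithm.R
git add algorithm.R
git commit -q -m "Try out a new algorithm"

# merge the finished feature back into development (normally done via a reviewed PR)
git checkout -q development
git merge -q --no-ff -m "Merge feature/new-algorithm into development" feature/new-algorithm
git --no-pager log --oneline --graph --all
```

The `--no-ff` flag is a design choice for the sketch: it records an explicit merge commit so the feature branch stays visible in the history graph.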
Another vignette discussed how to manage large SpaDES projects, and suggested the following project directory structure:
myProject/            # a version controlled git repo
|_ .git/
|_ cache/             # should be .gitignore'd
|_ inputs/            # should be .gitignore'd (selectively)
|_ manuscripts/
|_ modules/
   |_ module1/        # can be a git submodule
   |_ module2/        # can be a git submodule
   |_ module3/        # can be a git submodule
   |_ module4/        # can be a git submodule
   |_ module5/        # can be a git submodule
|_ outputs/           # should be .gitignore'd
...
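The `.gitignore` annotations in the layout above could be expressed roughly as follows. This is a sketch in a temporary directory. Keeping `inputs/README.md` is only one assumed reading of "selectively", and the data file names are hypothetical.

```shell
set -e
proj="$(mktemp -d)/myProject"              # throwaway stand-in for the project directory
mkdir -p "$proj/cache" "$proj/inputs" "$proj/outputs"
cd "$proj"
git init -q

# Ignore cache/ and outputs/ entirely; ignore everything in inputs/
# except one small file we choose to keep under version control.
cat > .gitignore <<'EOF'
cache/
outputs/
inputs/*
!inputs/README.md
EOF

# Hypothetical files standing in for simulation data
touch cache/a.rds outputs/b.tif inputs/big.tif inputs/README.md

git add -A
tracked="$(git ls-files)"                  # only non-ignored files get staged
echo "$tracked"
```

Note that `inputs/*` (rather than `inputs/`) is used so that the negated pattern on the following line can re-include a file inside that directory.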
The layout of a project directory is somewhat flexible, but this approach works especially well if you’re a module developer using git submodules for each of your module subdirectories. And each module really should be its own git repository:
SpaDES module repositories.
However, note that you cannot nest a git repository inside another
git repository. So if you are using git for your project directory, you
cannot use SpaDES modules as repos inside that project directory (this
is what git submodules are for). If git submodules aren’t your thing,
then you will need to keep your project repo separate from your module
repos!
modules/ # use this for your simulation modulePath
|_ module1/
|_ module2/
|_ module3/
|_ module4/
|_ module5/
myProject/
|_ cache/ # use this for your simulation cachePath
|_ inputs/ # use this for your simulation inputPath
|_ manuscripts/
|_ outputs/ # use this for your simulation outputPath
|_ packages/
...
Alternatively, your myProject/ directory could be a subdirectory of
modules/.
modules/              # use this for your simulation modulePath
|_ module1/
|_ module2/
|_ module3/
|_ module4/
|_ module5/
|_ myProject/
   |_ cache/          # use this for your simulation cachePath
   |_ inputs/         # use this for your simulation inputPath
   |_ manuscripts/
   |_ outputs/        # use this for your simulation outputPath
   |_ packages/
   ...
These layouts allow you to have each module and project be a git
repository, and if you’re worried about storage space they ensure you
only keep one copy of a module no matter how many projects it’s used
with. However, there can be several drawbacks to this approach. First
off, it is inconsistent with the way RStudio projects work, because not
all project-related files are in the same directory. This means you
need to take extra care to ensure that you set your module path using a
relative file path (e.g., ../modules), and you’ll need to take even
more care to update this path if you move the modules/ directory or are
sharing your project code (because your collaborator may store their
modules in a different location). Second, if you are working with
multiple projects and each one uses the same module(s) but different
versions, it’s going to be extremely inconvenient to have to manually
reset them when switching projects. As with package libraries, it’s
best practice to keep projects’ modules isolated (i.e., standalone) as
much as possible.
In the end, which approach you use will depend on your level of git-savviness (and that of your collaborators), and how comfortable you are using git submodules.
git submodule add https://github.com/USERNAME/REPO <path/to/submodule>
Within a project repository, git tracks specific submodule commits,
not their branches. So switching to a submodule directory and running
git pull will likely warn you that you are in a detached HEAD state.
Before making changes to code in a submodule directory, be sure to
switch to the branch you want to use with git checkout <branch-name>.
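The detached HEAD behaviour can be reproduced locally with the sketch below, which uses throwaway repositories in a temporary directory instead of GitHub. The `new_repo` helper, the repository names, and the commit identity are hypothetical. The `protocol.file.allow=always` setting is only there because recent git versions restrict submodule clones from local paths; it should not be needed for https or ssh URLs.

```shell
set -e
work="$(mktemp -d)"

new_repo() {  # hypothetical helper: empty repo with a 'main' branch and a local identity
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/main
  git -C "$1" config user.name "Example User"
  git -C "$1" config user.email "user@example.com"
}

# A module repository with one commit
new_repo "$work/module1"
echo "# module1" > "$work/module1/README.md"
git -C "$work/module1" add README.md
git -C "$work/module1" commit -q -m "Initial module commit"

# A project repository that records module1 as a submodule
new_repo "$work/myProject"
git -C "$work/myProject" -c protocol.file.allow=always \
  submodule add "$work/module1" modules/module1
git -C "$work/myProject" commit -q -m "Add module1 as a submodule"

# A fresh clone checks the submodule out at the recorded commit, not on a branch
git -c protocol.file.allow=always clone -q --recurse-submodules \
  "$work/myProject" "$work/clone"
cd "$work/clone/modules/module1"
state_before="$(git rev-parse --abbrev-ref HEAD)"   # 'HEAD' means detached
echo "before checkout: $state_before"

# Switch to a real branch before editing module code
git checkout -q main
state_after="$(git rev-parse --abbrev-ref HEAD)"
echo "after checkout:  $state_after"
```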
To get your latest updates on another machine, you need to update the project repo and the submodules:
git pull ## updates the project repo
git submodule update ## updates submodules based on project repo changes
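The two-step update can be rehearsed end to end with local stand-ins for the two machines. This is a sketch under stated assumptions: the helpers `new_repo` and `commit_file`, the directory names, and the commit identity are all hypothetical, and `protocol.file.allow=always` is only needed because the "remotes" here are local paths.

```shell
set -e
work="$(mktemp -d)"

new_repo() {     # hypothetical helper: empty repo with a 'main' branch and a local identity
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/main
  git -C "$1" config user.name "Example User"
  git -C "$1" config user.email "user@example.com"
}
commit_file() {  # hypothetical helper: commit_file <repo> <file> <message>
  echo "$3" > "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" commit -q -m "$3"
}

# Module repo, plus a project repo that uses it as a submodule
new_repo "$work/module1"
commit_file "$work/module1" README.md "Initial module commit"
new_repo "$work/myProject"
git -C "$work/myProject" -c protocol.file.allow=always \
  submodule add "$work/module1" modules/module1
git -C "$work/myProject" commit -q -m "Add module1 as a submodule"

# "Another machine": a second clone of the project
git -c protocol.file.allow=always clone -q --recurse-submodules \
  "$work/myProject" "$work/other"

# First machine: the module gains a commit and the project records the new commit
commit_file "$work/module1" NEWS.md "Add NEWS"
git -C "$work/myProject/modules/module1" pull -q
git -C "$work/myProject" add modules/module1
git -C "$work/myProject" commit -q -m "Update module1"

# Other machine: update the project repo, then bring submodules in line with it
cd "$work/other"
git -c protocol.file.allow=always pull -q
git -c protocol.file.allow=always submodule update
git -C modules/module1 --no-pager log --oneline -1
```

After the last step the submodule in the second clone should sit at the same commit that the project repo recorded on the first machine.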