Managing large binary files with Git

Question

I am looking for opinions of how to handle large binary files on which my source code (web application) is dependent. What are your experiences/thoughts regarding this?

I personally have run into synchronisation failures with Git with some of my cloud hosts once my web application's binary data notched above the 3 GB mark. I considered BFG Repo-Cleaner at the time, but it felt like a hack. Since then I've begun to just keep files outside of Git's purview, instead leveraging purpose-built tools such as Amazon S3 for managing files, versioning and back-up.

Does anybody have experience with multiple Git repositories and managing them in one project?

Yes. Hugo themes are primarily managed this way. It's a little kludgy, but it gets the job done.

My suggestion is to choose the right tool for the job. If it's for a company and you're managing your codeline on GitHub, pay the money and use Git LFS. Otherwise you could explore more creative options such as decentralized, encrypted file storage using blockchain.

Additional options to consider include Minio and s3cmd.

User · Answer

If the program won't work without the files, it seems like splitting them into a separate repo is a bad idea. We have large test suites that we break into a separate repo, but those are truly "auxiliary" files.

However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. So you'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.

Here's a good introduction to submodules from the Git Book.
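A minimal sketch of the submodule arrangement described above, using two throwaway local repositories. The repository names (images, app), the file names and the assets/ path are all hypothetical, and the protocol.file.allow override is only there because the demo clones from a local path rather than a real remote.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)

# Hypothetical "images" repository holding the binary assets.
git init -q "$work/images"
cd "$work/images"
printf 'fake image bytes\n' > logo.png
git add logo.png
git commit -q -m "Add logo"
images_rev=$(git rev-parse HEAD)

# Hypothetical application repository that depends on those assets.
git init -q "$work/app"
cd "$work/app"
printf 'console.log("hello");\n' > app.js
git add app.js
git commit -q -m "Add application code"

# Pin the images repository at its current commit under assets/.
# protocol.file.allow=always is needed only because this demo uses a local
# path as the submodule URL; a hosted remote would not need it.
git -c protocol.file.allow=always submodule --quiet add "$work/images" assets
git commit -q -m "Pin images submodule"

# The superproject records exactly one commit of the images repository.
pinned_rev=$(git rev-parse HEAD:assets)
echo "pinned images revision: $pinned_rev"
```

The application history stores only the pinned commit ID of the images repository, so each revision of the code names the revision of the images it expects.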

User · Answer

git clone --filter from Git 2.19 + shallow clones

This new option might eventually become the final solution to the binary file problem, if the Git and GitHub devs make it user-friendly enough (which they arguably still haven't achieved for submodules, for example).

It lets you fetch only the files and directories that you actually want from the server, and was introduced together with a remote protocol extension.

With this, we could first do a shallow clone, and then automate which blobs to fetch with the build system for each type of build.

There is even already a --filter=blob:limit=<size> which allows limiting the maximum blob size to fetch.

I have provided a minimal detailed example of what the feature looks like at "How do I clone a subdirectory only of a Git repository?"
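As a rough illustration of the blob size filter, the sketch below builds a throwaway local "server" repository and clones it with a 1k limit. The directory and file names are hypothetical. The uploadpack.allowFilter setting and the file:// URL are there because a local server only honours filters when told to, and a plain path clone bypasses the transport entirely.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)

# Hypothetical server-side repository with one small and one large file.
git init -q "$work/server"
cd "$work/server"
git config uploadpack.allowFilter true   # let clients request filtered packs
printf 'small\n' > small.txt
head -c 100000 /dev/urandom > big.bin
git add small.txt big.bin
git commit -q -m "Small and big file"

# Partial clone: skip every blob of 1k or more. --no-checkout keeps Git from
# fetching the big blob on demand to populate the working tree.
cd "$work"
git clone -q --no-checkout --filter=blob:limit=1k "file://$work/server" client

# Objects that were left on the server are printed with a leading "?".
cd "$work/client"
missing=$(git rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "blobs left on the server: $missing"
```

A build system could then fetch only the blobs a given build needs, for example by checking out specific paths.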

User · Answer

The solution I'd like to propose is based on orphan branches and a slight abuse of the tag mechanism, henceforth referred to as Orphan Tags Binary Storage (OTABS).

TL;DR (12-01-2017): If you can use GitHub's LFS or some other third party, by all means you should. If you can't, then read on. Be warned: this solution is a hack and should be treated as such.

Desirable properties of OTABS

- It is a pure Git solution: it gets the job done without any third-party software (like git-annex) or third-party infrastructure (like GitHub's LFS).
- It stores the binary files efficiently, i.e. it doesn't bloat the history of your repository.
- git pull and git fetch, including git fetch --all, are still bandwidth efficient, i.e. not all large binaries are pulled from the remote by default.
- It works on Windows.
- It stores everything in a single Git repository.
- It allows for deletion of outdated binaries (unlike bup).

Undesirable properties of OTABS

- It makes git clone potentially inefficient (but not necessarily, depending on your usage). If you deploy this solution you might have to advise your colleagues to use git clone -b master --single-branch <url> instead of git clone. This is because git clone by default literally clones the entire repository, including things you wouldn't normally want to waste your bandwidth on, like unreferenced commits. Taken from SO 4811434.
- It makes git fetch <remote> --tags bandwidth inefficient, but not necessarily storage inefficient. You can always advise your colleagues not to use it.
- You'll have to periodically use a git gc trick to clean your repository of any files you don't want any more.
- It is not as efficient as bup or git-bigfiles. But it's respectively more suitable for what you're trying to do and more off-the-shelf. You are likely to run into trouble with hundreds of thousands of small files or with files in the range of gigabytes, but read on for workarounds.

Adding the Binary Files

Before you start, make sure that you've committed all your changes, your working tree is up to date and your index doesn't contain any uncommitted changes. It might be a good idea to push all your local branches to your remote (GitHub etc.) in case any disaster should happen.

1. Create a new orphan branch. git checkout --orphan binaryStuff will do the trick. This produces a branch that is entirely disconnected from any other branch, and the first commit you make in this branch will have no parent, which will make it a root commit.
2. Clean your index using git rm --cached * .gitignore.
3. Take a deep breath and delete the entire working tree using rm -fr * .gitignore. The internal .git directory will stay untouched, because the * wildcard doesn't match it.
4. Copy in your VeryBigBinary.exe, or your VeryHeavyDirectory/.
5. Add it && commit it.
6. Now it becomes tricky: if you push it into the remote as a branch, all your developers will download it the next time they invoke git fetch, clogging their connection. You can avoid this by pushing a tag instead of a branch. This can still impact your colleagues' bandwidth and filesystem storage if they have a habit of typing git fetch <remote> --tags, but read on for a workaround. Go ahead and git tag 1.0.0bin.
7. Push your orphan tag: git push <remote> 1.0.0bin.
8. Just so you never push your binary branch by accident, you can delete it: git branch -D binaryStuff. Your commit will not be marked for garbage collection, because the orphan tag 1.0.0bin pointing at it is enough to keep it alive.

Checking out the Binary File

1. How do I (or my colleagues) get VeryBigBinary.exe checked out into the current working tree? If your current working branch is, for example, master, you can simply git checkout 1.0.0bin -- VeryBigBinary.exe.
2. This will fail if you don't have the orphan tag 1.0.0bin downloaded, in which case you'll have to git fetch <remote> 1.0.0bin beforehand.
3. You can add VeryBigBinary.exe to your master's .gitignore, so that no one on your team will pollute the main history of the project with the binary by accident.

Completely Deleting the Binary File

If you decide to completely purge VeryBigBinary.exe from your local repository, your remote repository and your colleagues' repositories, you can just:

1. Delete the orphan tag on the remote: git push <remote> :refs/tags/1.0.0bin
2. Delete the orphan tag locally (this deletes all other unreferenced tags too): git tag -l | xargs git tag -d && git fetch --tags. Taken from SO 1841341 with slight modification.
3. Use a git gc trick to delete your now unreferenced commit locally:

    git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc

   It will also delete all other unreferenced commits. Taken from SO 1904860.
4. If possible, repeat the git gc trick on the remote. It is possible if you're self-hosting your repository and might not be possible with some Git providers, like GitHub, or in some corporate environments. If you're hosting with a provider that doesn't give you ssh access to the remote, just let it be. It is possible that your provider's infrastructure will clean up your unreferenced commit in its own sweet time. If you're in a corporate environment you can advise your IT to run a cron job garbage collecting your remote once per week or so. Whether they do or don't will not have any impact on your team in terms of bandwidth and storage, as long as you advise your colleagues to always git clone -b master --single-branch <url> instead of git clone.
5. All your colleagues who want to get rid of outdated orphan tags need only apply steps 2-3.
6. You can then repeat steps 1-8 of Adding the Binary Files to create a new orphan tag 2.0.0bin. If you're worried about your colleagues typing git fetch <remote> --tags you can actually name it 1.0.0bin again. This will make sure that the next time they fetch all the tags the old 1.0.0bin will be unreferenced and marked for subsequent garbage collection (using step 3). When you try to overwrite a tag on the remote you have to use -f like this: git push -f <remote> <tagname>

Afterword

- OTABS doesn't touch your master or any other source code development branches. The commit hashes, all of the history, and the small size of these branches are unaffected. If you've already bloated your source code history with binary files you'll have to clean it up as a separate piece of work. This script might be useful.
- Confirmed to work on Windows with git-bash.
- It is a good idea to apply a set of standard tricks to make storage of binary files more efficient. Frequent running of git gc (without any additional arguments) makes Git optimise the underlying storage of your files by using binary deltas. However, if your files are unlikely to stay similar from commit to commit you can switch off binary deltas altogether. Additionally, because it makes no sense to compress already compressed or encrypted files, like .zip, .jpg or .crypt, Git allows you to switch off compression of the underlying storage. Unfortunately it's an all-or-nothing setting affecting your source code as well.
- You might want to script up parts of OTABS to allow for quicker usage. In particular, scripting steps 2-3 from Completely Deleting the Binary File into an update git hook could give a compelling but perhaps dangerous semantics to git fetch ("fetch and delete everything that is out of date").
- You might want to skip step 4 of Completely Deleting the Binary File to keep a full history of all binary changes on the remote at the cost of central repository bloat. Local repositories will stay lean over time.
- In the Java world it is possible to combine this solution with maven --offline to create a reproducible offline build stored entirely in your version control (it's easier with Maven than with Gradle). In the Golang world it is feasible to build on this solution to manage your GOPATH instead of go get. In the Python world it is possible to combine this with virtualenv to produce a self-contained development environment without relying on PyPI servers for every build from scratch.
- If your binary files change very often, like build artifacts, it might be a good idea to script a solution which stores the 5 most recent versions of the artifacts in the orphan tags monday_bin, tuesday_bin, ..., friday_bin, and also an orphan tag for each release: 1.7.8bin, 2.0.0bin, etc. You can rotate the weekday_bin tags and delete old binaries daily. This way you get the best of both worlds: you keep the entire history of your source code but only the relevant history of your binary dependencies. It is also very easy to get the binary files for a given tag without getting the entire source code with all its history: git init && git remote add <name> <url> && git fetch <name> <tag> should do it for you.
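The whole OTABS cycle can be tried out safely against a throwaway local remote. The sketch below is only an illustration under a few assumptions: the repository layout and app.py are hypothetical, and git rm -r --cached . stands in for the git rm --cached * .gitignore step because the demo has no .gitignore. The fetch uses the "tag <name>" form so that a local tag ref is created, and the purge uses the common git reflog expire plus git gc --prune=now recipe in place of the longer -c gc... command quoted in the answer.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q --bare "$work/remote.git"

# Developer repository with ordinary source history on master.
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/master   # name the unborn branch portably
git remote add origin "file://$work/remote.git"
printf 'print("hi")\n' > app.py
git add app.py
git commit -q -m "Source code"
git push -q origin master

# Adding the Binary Files, steps 1-5: an orphan branch holding only the binary.
git checkout -q --orphan binaryStuff
git rm -q -r --cached .
rm -f app.py
head -c 50000 /dev/urandom > VeryBigBinary.exe
git add VeryBigBinary.exe
git commit -q -m "Binary 1.0.0"
bin_commit=$(git rev-parse HEAD)

# Steps 6-8: publish it as a tag only, then drop the local branch.
git tag 1.0.0bin
git push -q origin 1.0.0bin
git checkout -q master
git branch -q -D binaryStuff

# A colleague's single-branch clone does not download the orphan tag.
git clone -q -b master --single-branch "file://$work/remote.git" "$work/colleague"
cd "$work/colleague"
tags_after_clone=$(git tag -l | wc -l)

# Fetch the tag on demand and check out just the binary.
git fetch -q origin tag 1.0.0bin
git checkout 1.0.0bin -- VeryBigBinary.exe

# Completely deleting the binary, as seen from the developer repository.
cd "$work/dev"
git push -q origin :refs/tags/1.0.0bin
git tag -d 1.0.0bin >/dev/null
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$bin_commit" 2>/dev/null; then purged=no; else purged=yes; fi
echo "tags after clone: $tags_after_clone, binary purged locally: $purged"
```

Note that git checkout <tag> -- <path> also stages the file, so listing the binary in master's .gitignore does not by itself stop it from being committed; it may need to be unstaged afterwards.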

User · Answer

Have a look at camlistore. It is not really Git-based, but I find it more appropriate for what you have to do.

User · Answer

I would use submodules (as Pat Notz suggests) or two distinct repositories. If you modify your binary files too often, then I would try to minimize the impact of the huge repository by cleaning the history.

I had a very similar problem several months ago: 21 GB of MP3 files, unclassified (bad names, bad ID3 tags, don't know if I like that MP3 file or not...), and replicated on three computers.

I used an external hard disk drive with the main Git repository, and I cloned it onto each computer. Then I started to classify them in the habitual way (pushing, pulling, merging... deleting and renaming many times).

At the end, I had only 6 GB of MP3 files and 83 GB in the .git directory. I used git-write-tree and git-commit-tree to create a new commit without commit ancestors, and started a new branch pointing to that commit. The "git log" for that branch only showed one commit.

Then I deleted the old branch, kept only the new branch, deleted the reflogs, and ran "git prune". After that, my .git folders weighed only 6 GB.

You could "purge" the huge repository from time to time in the same way. Your "git clone"s will be faster.
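A small sketch of that purge on a throwaway repository, assuming a hypothetical track.mp3 that is replaced in every commit. It uses git write-tree and git commit-tree as described, then git gc --prune=now (which also prunes) in place of a bare git prune.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q "$work/music"
cd "$work/music"

# Build some throwaway history: each commit replaces the same binary file.
for i in 1 2 3 4; do
    head -c 20000 /dev/urandom > track.mp3
    git add track.mp3
    git commit -q -m "Reshuffle $i"
done
old_branch=$(git symbolic-ref --short HEAD)
commits_before=$(git rev-list --count HEAD)

# Create a parentless commit from the current index and branch off it.
tree=$(git write-tree)
snapshot=$(git commit-tree "$tree" -m "Snapshot of current state")
git checkout -q -b trimmed "$snapshot"

# Drop the old branch and everything only it (or the reflog) kept alive.
git branch -q -D "$old_branch"
git reflog expire --expire=now --all
git gc -q --prune=now

commits_after=$(git rev-list --count HEAD)
echo "commits before: $commits_before, after: $commits_after"
```

This rewrites history, so every other clone has to be re-cloned or reset onto the new branch afterwards.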

User · Answer

SVN seems to handle binary deltas more efficiently than Git.

I had to decide on a versioning system for documentation (JPEG files, PDF files, and .odt files). I just tested adding a JPEG file and rotating it 90 degrees four times (to check the effectiveness of binary deltas). Git's repository grew 400%. SVN's repository grew by only 11%.

So it looks like SVN is much more efficient with binary files.

So my choice is Git for source code and SVN for binary files like documentation.

User · Answer

Another solution, since April 2015, is Git Large File Storage (LFS), by GitHub.

It uses git-lfs (see git-lfs.github.com) and has been tested with a server supporting it: lfs-test-server. You can store only metadata in the Git repo, and the large file elsewhere.
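In practice, the usual setup is to run git lfs install once and then git lfs track with a pattern, which appends a rule to .gitattributes. The sketch below writes that rule by hand so it runs without git-lfs installed, and simply asks Git which filter would apply. The *.psd pattern and the design.psd path are hypothetical.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/site"
cd "$work/site"

# The rule that `git lfs track "*.psd"` appends to .gitattributes.
printf '*.psd filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes

# Ask Git which filter applies to a (hypothetical) matching path.
attr=$(git check-attr filter -- design.psd)
echo "$attr"
```

With git-lfs installed, files matching the pattern are replaced in the repository by small pointer files, and the content goes to the LFS server on push.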

User · Answer

Have a look at git bup, which is a Git extension to smartly store large binaries in a Git repository.

You'd want to have it as a submodule, but you won't have to worry about the repository getting hard to handle. One of their sample use cases is storing VM images in Git.

I haven't actually seen better compression rates, but my repositories don't have really large binaries in them.

Your mileage may vary.

User · Answer

You can also use git-fat. I like that it only depends on stock Python and rsync. It also supports the usual Git workflow, with the following self-explanatory commands:

    git fat init
    git fat push
    git fat pull

In addition, you need to check a .gitfat file into your repository and modify your .gitattributes to specify the file extensions you want git fat to manage.

You add a binary using the normal git add, which in turn invokes git fat based on your .gitattributes rules.

Finally, it has the advantage that the location where your binaries are actually stored can be shared across repositories and users, and it supports anything rsync does.

UPDATE: Do not use git-fat if you're using a Git-SVN bridge. It will end up removing the binary files from your Subversion repository. However, if you're using a pure Git repository, it works beautifully.

User · Answer

I discovered git-annex recently, which I find awesome. It was designed for managing large files efficiently. I use it for my photo/music (etc.) collections. The development of git-annex is very active. The content of the files can be removed from the Git repository; only the tree hierarchy is tracked by Git (through symlinks). However, to get the content of a file, a second step is necessary after pulling/pushing, e.g.:

    git annex add mybigfile
    git commit -m 'add mybigfile'
    git push myremote
    git annex copy --to myremote mybigfile    # copies the actual content to myremote
    git annex drop mybigfile                  # removes the content from the local repo
    ...
    git annex get mybigfile                   # retrieves the content
    # or, to specify the remote from which to get it:
    git annex copy --from myremote mybigfile

There are many commands available, and there is great documentation on the website. A package is available on Debian.

User · Answer

In my opinion, if you're likely to often modify those large files, or if you intend to do a lot of git clone or git checkout, then you should seriously consider using another Git repository (or maybe another way to access those files).

But if you work like we do, and if your binary files are not often modified, then the first clone/checkout will be long, but after that it should be as fast as you want (provided your users keep using the first cloned repository they had).
