[git] How to remove/delete a large file from commit history in Git repository?

I accidentally dropped a DVD-rip into a website project, then carelessly git commit -a -m ..., and, zap, the repo was bloated by 2.2 gigs. Next time I made some edits, deleted the video file, and committed everything, but the compressed file is still there in the repository, in history.

I know I can start branches from those commits and rebase one branch onto another. But what should I do to merge together the 2 commits so that the big file doesn't show in the history and is cleaned in garbage collection procedure?

This question is related to git version-control git-rebase git-rewrite-history

The answer is


If you know your commit was recent instead of going through the entire tree do the following: git filter-branch --tree-filter 'rm LARGE_FILE.zip' HEAD~10..HEAD


According to GitHub Documentation, just follow these steps:

  1. Get rid of the large file

Option 1: You don't want to keep the large file:

rm path/to/your/large/file        # delete the large file

Option 2: You want to keep the large file into an untracked directory

mkdir large_files                       # create directory large_files
touch .gitignore                        # create .gitignore file if needed
'/large_files/' >> .gitignore           # untrack directory large_files
mv path/to/your/large/file large_files/ # move the large file into the untracked directory
  1. Save your changes
git add path/to/your/large/file   # add the deletion to the index
git commit -m 'delete large file' # commit the deletion
  1. Remove the large file from all commits
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch path/to/your/large/file" \
  --prune-empty --tag-name-filter cat -- --all
git push <remote> <branch>

When you run into this problem, git rm will not suffice, as git remembers that the file existed once in our history, and thus will keep a reference to it.

To make things worse, rebasing is not easy either, because any references to the blob will prevent git garbage collector from cleaning up the space. This includes remote references and reflog references.

I put together git forget-blob, a little script that tries removing all these references, and then uses git filter-branch to rewrite every commit in the branch.

Once your blob is completely unreferenced, git gc will get rid of it

The usage is pretty simple git forget-blob file-to-forget. You can get more info here

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

I put this together thanks to the answers from Stack Overflow and some blog entries. Credits to them!


These commands worked in my case:

git filter-branch --force --index-filter 'git rm --cached -r --ignore-unmatch oops.iso' --prune-empty --tag-name-filter cat -- --all
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now

It is little different from the above versions.

For those who need to push this to github/bitbucket (I only tested this with bitbucket):

# WARNING!!!
# this will rewrite completely your bitbucket refs
# will delete all branches that you didn't have in your local

git push --all --prune --force

# Once you pushed, all your teammates need to clone repository again
# git pull will not work

This will remove it from your history

git filter-branch --force --index-filter 'git rm -r --cached --ignore-unmatch bigfile.txt' --prune-empty --tag-name-filter cat -- --all

Other than git filter-branch (slow but pure git solution) and BFG (easier and very performant), there is also another tool to filter with good performance:

https://github.com/xoofx/git-rocket-filter

From its description:

The purpose of git-rocket-filter is similar to the command git-filter-branch while providing the following unique features:

  • Fast rewriting of commits and trees (by an order of x10 to x100).
  • Built-in support for both white-listing with --keep (keeps files or directories) and black-listing with --remove options.
  • Use of .gitignore like pattern for tree-filtering
  • Fast and easy C# Scripting for both commit filtering and tree filtering
  • Support for scripting in tree-filtering per file/directory pattern
  • Automatically prune empty/unchanged commit, including merge commits

100 times faster than git filter-branch and simpler

There are very good answers in this thread, but meanwhile many of them are outdated. Using git-filter-branch is no longer recommended, because it is difficult to use and awfully slow on big repositories.

git-filter-repo is much faster and simpler to use.

git-filter-repo is a Python script, available at github: https://github.com/newren/git-filter-repo . When installed it looks like a regular git command and can be called by git filter-repo.

You need only one file: the Python3 script git-filter-repo. Copy it to a path that is included in the PATH variable. On Windows you may have to change the first line of the script (refer INSTALL.md). You need Python3 installed installed on your system, but this is not a big deal.

First you can run

git filter-repo --analyze

This helps you to determine what to do next.

You can delete your DVD-rip file everywhere:

git filter-repo --invert-paths --path-match DVD-rip
 

Filter-repo is really fast. A task that took around 9 hours on my computer by filter-branch, was completed in 4 minutes by filter-repo. You can do many more nice things with filter-repo. Refer to the documentation for that.

Warning: Do this on a copy of your repository. Many actions of filter-repo cannot be undone. filter-repo will change the commit hashes of all modified commits (of course) and all their descendants down to the last commits!


Just note that this commands can be very destructive. If more people are working on the repo they'll all have to pull the new tree. The three middle commands are not necessary if your goal is NOT to reduce the size. Because the filter branch creates a backup of the removed file and it can stay there for a long time.

$ git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch YOURFILENAME" HEAD
$ rm -rf .git/refs/original/ 
$ git reflog expire --all 
$ git gc --aggressive --prune
$ git push origin master --force

(The best answer I've seen to this problem is: https://stackoverflow.com/a/42544963/714112 , copied here since this thread appears high in Google search rankings but that other one doesn't)

A blazingly fast shell one-liner

This shell script displays all blob objects in the repository, sorted from smallest to largest.

For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5,622,155 objects in just over a minute.

The Base Script

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort --numeric-sort --key=2 \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

When you run above code, you will get nice human-readable output like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

Fast File Removal

Suppose you then want to remove the files a and b from every commit reachable from HEAD, you can use this command:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch a b' HEAD

You can do this using the branch filter command:

git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD


After trying virtually every answer in SO, I finally found this gem that quickly removed and deleted the large files in my repository and allowed me to sync again: http://www.zyxware.com/articles/4027/how-to-delete-files-permanently-from-your-local-and-remote-git-repositories

CD to your local working folder and run the following command:

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch FOLDERNAME" -- --all

replace FOLDERNAME with the file or folder you wish to remove from the given git repository.

Once this is done run the following commands to clean up the local repository:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now

Now push all the changes to the remote repository:

git push --all --force

This will clean up the remote repository.


git filter-branch --tree-filter 'rm -f path/to/file' HEAD worked pretty well for me, although I ran into the same problem as described here, which I solved by following this suggestion.

The pro-git book has an entire chapter on rewriting history - have a look at the filter-branch/Removing a File from Every Commit section.


I basically did what was on this answer: https://stackoverflow.com/a/11032521/1286423

(for history, I'll copy-paste it here)

$ git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch YOURFILENAME" HEAD
$ rm -rf .git/refs/original/ 
$ git reflog expire --all 
$ git gc --aggressive --prune
$ git push origin master --force

It didn't work, because I like to rename and move things a lot. So some big file were in folders that have been renamed, and I think the gc couldn't delete the reference to those files because of reference in tree objects pointing to those file. My ultimate solution to really kill it was to:

# First, apply what's in the answer linked in the front
# and before doing the gc --prune --aggressive, do:

# Go back at the origin of the repository
git checkout -b newinit <sha1 of first commit>
# Create a parallel initial commit
git commit --amend
# go back on the master branch that has big file
# still referenced in history, even though 
# we thought we removed them.
git checkout master
# rebase on the newinit created earlier. By reapply patches,
# it will really forget about the references to hidden big files.
git rebase newinit

# Do the previous part (checkout + rebase) for each branch
# still connected to the original initial commit, 
# so we remove all the references.

# Remove the .git/logs folder, also containing references
# to commits that could make git gc not remove them.
rm -rf .git/logs/

# Then you can do a garbage collection,
# and the hidden files really will get gc'ed
git gc --prune --aggressive

My repo (the .git) changed from 32MB to 388KB, that even filter-branch couldn't clean.


I ran into this with a bitbucket account, where I had accidentally stored ginormous *.jpa backups of my site.

git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch MY-BIG-DIRECTORY-OR-FILE' --tag-name-filter cat -- --all

Relpace MY-BIG-DIRECTORY with the folder in question to completely rewrite your history (including tags).

source: https://web.archive.org/web/20170727144429/http://naleid.com:80/blog/2012/01/17/finding-and-purging-big-files-from-git-history/


This works perfectly for me : in git extensions :

right click on the selected commit :

reset current branch to here :

hard reset ;

It's surprising nobody else is able to give this simple answer.

reset current branch to here

hard reset


Use Git Extensions, it's a UI tool. It has a plugin named "Find large files" which finds lage files in repositories and allow removing them permenently.

Don't use 'git filter-branch' before using this tool, since it won't be able to find files removed by 'filter-branch' (Altough 'filter-branch' does not remove files completely from the repository pack files).


Why not use this simple but powerful command?

git filter-branch --tree-filter 'rm -f DVD-rip' HEAD

The --tree-filter option runs the specified command after each checkout of the project and then recommits the results. In this case, you remove a file called DVD-rip from every snapshot, whether it exists or not.

If you know which commit introduced the huge file (say 35dsa2), you can replace HEAD with 35dsa2..HEAD to avoid rewriting too much history, thus avoiding diverging commits if you haven't pushed yet. This comment courtesy of @alpha_989 seems too important to leave out here.

See this link.


git filter-branch is a powerful command which you can use it to delete a huge file from the commits history. The file will stay for a while and Git will remove it in the next garbage collection. Below is the full process from deleteing files from commit history. For safety, below process runs the commands on a new branch first. If the result is what you needed, then reset it back to the branch you actually want to change.

# Do it in a new testing branch
$ git checkout -b test

# Remove file-name from every commit on the new branch
# --index-filter, rewrite index without checking out
# --cached, remove it from index but not include working tree
# --ignore-unmatch, ignore if files to be removed are absent in a commit
# HEAD, execute the specified command for each commit reached from HEAD by parent link
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch file-name' HEAD

# The output is OK, reset it to the prior branch master
$ git checkout master
$ git reset --soft test

# Remove test branch
$ git branch -d test

# Push it with force
$ git push --force origin master

What you want to do is highly disruptive if you have published history to other developers. See “Recovering From Upstream Rebase” in the git rebase documentation for the necessary steps after repairing your history.

You have at least two options: git filter-branch and an interactive rebase, both explained below.

Using git filter-branch

I had a similar problem with bulky binary test data from a Subversion import and wrote about removing data from a git repository.

Say your git history is:

$ git lola --name-status
* f772d66 (HEAD, master) Login page
| A     login.html
* cb14efd Remove DVD-rip
| D     oops.iso
* ce36c98 Careless
| A     oops.iso
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

Note that git lola is a non-standard but highly useful alias. With the --name-status switch, we can see tree modifications associated with each commit.

In the “Careless” commit (whose SHA1 object name is ce36c98) the file oops.iso is the DVD-rip added by accident and removed in the next commit, cb14efd. Using the technique described in the aforementioned blog post, the command to execute is:

git filter-branch --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \
  --tag-name-filter cat -- --all

Options:

  • --prune-empty removes commits that become empty (i.e., do not change the tree) as a result of the filter operation. In the typical case, this option produces a cleaner history.
  • -d names a temporary directory that does not yet exist to use for building the filtered history. If you are running on a modern Linux distribution, specifying a tree in /dev/shm will result in faster execution.
  • --index-filter is the main event and runs against the index at each step in the history. You want to remove oops.iso wherever it is found, but it isn’t present in all commits. The command git rm --cached -f --ignore-unmatch oops.iso deletes the DVD-rip when it is present and does not fail otherwise.
  • --tag-name-filter describes how to rewrite tag names. A filter of cat is the identity operation. Your repository, like the sample above, may not have any tags, but I included this option for full generality.
  • -- specifies the end of options to git filter-branch
  • --all following -- is shorthand for all refs. Your repository, like the sample above, may have only one ref (master), but I included this option for full generality.

After some churning, the history is now:

$ git lola --name-status
* 8e0a11c (HEAD, master) Login page
| A     login.html
* e45ac59 Careless
| A     other.html
|
| * f772d66 (refs/original/refs/heads/master) Login page
| | A   login.html
| * cb14efd Remove DVD-rip
| | D   oops.iso
| * ce36c98 Careless
|/  A   oops.iso
|   A   other.html
|
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

Notice that the new “Careless” commit adds only other.html and that the “Remove DVD-rip” commit is no longer on the master branch. The branch labeled refs/original/refs/heads/master contains your original commits in case you made a mistake. To remove it, follow the steps in “Checklist for Shrinking a Repository.”

$ git update-ref -d refs/original/refs/heads/master
$ git reflog expire --expire=now --all
$ git gc --prune=now

For a simpler alternative, clone the repository to discard the unwanted bits.

$ cd ~/src
$ mv repo repo.old
$ git clone file:///home/user/src/repo.old repo

Using a file:///... clone URL copies objects rather than creating hardlinks only.

Now your history is:

$ git lola --name-status
* 8e0a11c (HEAD, master) Login page
| A     login.html
* e45ac59 Careless
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

The SHA1 object names for the first two commits (“Index” and “Admin page”) stayed the same because the filter operation did not modify those commits. “Careless” lost oops.iso and “Login page” got a new parent, so their SHA1s did change.

Interactive rebase

With a history of:

$ git lola --name-status
* f772d66 (HEAD, master) Login page
| A     login.html
* cb14efd Remove DVD-rip
| D     oops.iso
* ce36c98 Careless
| A     oops.iso
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

you want to remove oops.iso from “Careless” as though you never added it, and then “Remove DVD-rip” is useless to you. Thus, our plan going into an interactive rebase is to keep “Admin page,” edit “Careless,” and discard “Remove DVD-rip.”

Running $ git rebase -i 5af4522 starts an editor with the following contents.

pick ce36c98 Careless
pick cb14efd Remove DVD-rip
pick f772d66 Login page

# Rebase 5af4522..f772d66 onto 5af4522
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# If you remove a line here THAT COMMIT WILL BE LOST.
# However, if you remove everything, the rebase will be aborted.
#

Executing our plan, we modify it to

edit ce36c98 Careless
pick f772d66 Login page

# Rebase 5af4522..f772d66 onto 5af4522
# ...

That is, we delete the line with “Remove DVD-rip” and change the operation on “Careless” to be edit rather than pick.

Save-quitting the editor drops us at a command prompt with the following message.

Stopped at ce36c98... Careless
You can amend the commit now, with

        git commit --amend

Once you are satisfied with your changes, run

        git rebase --continue

As the message tells us, we are on the “Careless” commit we want to edit, so we run two commands.

$ git rm --cached oops.iso
$ git commit --amend -C HEAD
$ git rebase --continue

The first removes the offending file from the index. The second modifies or amends “Careless” to be the updated index and -C HEAD instructs git to reuse the old commit message. Finally, git rebase --continue goes ahead with the rest of the rebase operation.

This gives a history of:

$ git lola --name-status
* 93174be (HEAD, master) Login page
| A     login.html
* a570198 Careless
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

which is what you want.


Examples related to git

Does the target directory for a git clone have to match the repo name? Git fatal: protocol 'https' is not supported Git is not working after macOS Update (xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools) git clone: Authentication failed for <URL> destination path already exists and is not an empty directory SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443 GitLab remote: HTTP Basic: Access denied and fatal Authentication How can I switch to another branch in git? VS 2017 Git Local Commit DB.lock error on every commit How to remove an unpushed outgoing commit in Visual Studio?

Examples related to version-control

How can I switch to another branch in git? Do I commit the package-lock.json file created by npm 5? Project vs Repository in GitHub Remove a modified file from pull request Git push: "fatal 'origin' does not appear to be a git repository - fatal Could not read from remote repository." Git: How to squash all commits on branch git: updates were rejected because the remote contains work that you do not have locally Sourcetree - undo unpushed commits Cannot checkout, file is unmerged Git diff between current branch and master but not including unmerged master commits

Examples related to git-rebase

rebase in progress. Cannot commit. How to proceed or stop (abort)? git remove merge commit from history Choose Git merge strategy for specific files ("ours", "mine", "theirs") What's the difference between 'git merge' and 'git rebase'? Your branch is ahead of 'origin/master' by 3 commits Rebase feature branch onto another feature branch git rebase merge conflict Remove folder and its contents from git/GitHub's history How to get "their" changes in the middle of conflicting Git rebase? How to rebase local branch onto remote master

Examples related to git-rewrite-history

How to amend older Git commit? Rebasing a Git merge commit How to remove/delete a large file from commit history in Git repository? How to squash all git commits into one? How to modify a specified commit? Remove sensitive files and their commits from Git history How to change the author and committer name and e-mail of multiple commits in Git? How can one change the timestamp of an old commit in Git? How do you fix a bad merge, and replay your good commits onto a fixed merge? How to modify existing, unpushed commit messages?