Thomas Winged

Cleaning Unreal Engine git repo from unused assets


Problem

I was working on an Unreal Engine project that used the git version control system. In the beginning, I uploaded all kinds of assets, but in the end, only half of them were used. As a result, the git repository contained a lot of unused stuff inside the Content directory. And so, I raised a question:

"How can I reduce the size of my project and git repository while keeping the commit history intact?"

Solution

There are many approaches to solving this problem, but for the sake of jotting down this solution for my future self, this is the one I took:

First, I created a new Unreal Engine project and migrated the main map into it, which pulled in all the assets, mechanics, etc. that the map references. Right away, the project size decreased by half. I manually copied the source code that was still in use, since the migration process does not copy it. Then I launched the project to check that everything still worked, and it did.


Migrating the main map with all important assets

Next, using WinMerge, I compared the old project directory with the new one. It displayed all the differences between the two directories.


Comparing project directories

Then I generated a report file. Its format did not fit my needs, as I needed a simple list of paths that were NOT in the new directory. The raw version of the report looked like this:

report.txt - raw version of the report generated by WinMerge
Compare D:\git\OldProject\Content with D:\git\NewProject\Content\Content
10.01.2023 18:05:23
Filename,Folder,Comparison result,Left Date,Right Date,Extension
ThirdPerson,__ExternalActors__,Left only: D:\git\OldProject\Content\__ExternalActors__,* 10.01.2023 14:49:21,  ,
Maps,__ExternalActors__\ThirdPerson,Left only: D:\git\OldProject\Content\__ExternalActors__\ThirdPerson,* 05.01.2023 10:50:52,  ,
ThirdPersonMap,__ExternalActors__\ThirdPerson\Maps,Left only: D:\git\OldProject\Content\__ExternalActors__\ThirdPerson\Maps,* 10.01.2023 14:49:43,  ,
MI_QuestMapProDemo_WoodPanel-Large.uasset,QuestMap\Demo\Materials,Left only: D:\git\OldProject\Content\QuestMap\Demo\Materials,* 05.01.2023 10:52:32,  ,uasset
MI_QuestMapProDemo_WoodPanel-Small.uasset,QuestMap\Demo\Materials,Left only: D:\git\OldProject\Content\QuestMap\Demo\Materials,* 05.01.2023 10:52:32,  ,uasset
Str_QuestMapPro_DirectionData.uasset,QuestMap\Structures,Binary files are different,  05.01.2023 10:52:32,* 10.01.2023 16:26:35,uasset
Str_QuestMapPro_LandmarkData.uasset,QuestMap\Structures,Binary files are different,  05.01.2023 10:52:32,* 10.01.2023 16:26:36,uasset
Str_QuestMapPro_MapVisibilityFlags.uasset,QuestMap\Structures,Binary files are different,  05.01.2023 10:52:32,* 10.01.2023 16:26:36,uasset
...

And so I wrote a short Python script that turns this report into a plain list of file paths:

Script processing raw report of directory differences
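import csv

# Read WinMerge's CSV report and write one repo-relative path per unwanted file.
with open("report.txt", "r", newline="") as report, \
        open("files_i_dont_want.txt", "w") as output:
    for row in csv.reader(report):
        if len(row) < 6:
            continue  # skip the "Compare ..." and timestamp lines at the top
        filename, folder = row[0], row[1]
        ext = row[5].strip()  # only files have an extension here; directories do not
        not_in_new_project = row[4].strip() == ""  # no right-hand date: "Left only"
        if ext and not_in_new_project:
            parts = ["Content", folder, filename] if folder else ["Content", filename]
            # git-filter-repo expects forward slashes, so do not rely on os.path.join
            output.write("/".join(parts).replace("\\", "/") + "\n")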

files_i_dont_want.txt - list of files it generated
QuestMapPro.png
ToImport/Button.fbx
Platforms/HoloLens/Config/HoloLensEngine.ini
Content/__ExternalActors__/ThirdPerson/Maps/ThirdPersonMap/D/SY/Z4CTQ4LG3YV10EKPD0UE8Q.uasset
Content/__ExternalActors__/ThirdPerson/ThirdPersonMap/0/49/2CCD6BLWT4JNP9PXKWWIY7.uasset
Content/__ExternalActors__/ThirdPerson/ThirdPersonMap/1/29/JSFSL6YKZU8XHYQWSN320J.uasset
...
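
As a possible alternative to the WinMerge report, a list like the one above could also be built directly in Python by walking both directories. This is only a minimal sketch: the helper names `list_relative_files` and `only_in_old` are made up for illustration, and it compares files by path only, not by content, which corresponds to the "Left only" entries in the report.

```python
import os


def list_relative_files(root):
    """Return the set of file paths under root, relative to root, with forward slashes."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def only_in_old(old_root, new_root, prefix="Content"):
    """Return sorted repo-relative paths of files that exist in old_root but not in new_root."""
    missing = list_relative_files(old_root) - list_relative_files(new_root)
    # The prefix is an assumption: it should be the directory's path inside the git repo.
    return sorted(f"{prefix}/{path}" for path in missing)


def write_list(old_root, new_root, out_path="files_i_dont_want.txt"):
    """Write one path per line, in the form git-filter-repo can read."""
    with open(out_path, "w") as output:
        for path in only_in_old(old_root, new_root):
            output.write(path + "\n")
```

Calling `write_list` with the old and new Content directories should then give a file in the same shape as the one my script produced.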

Next, I looked at a couple of tools for cleaning git history and removing unused files. The most practical one turned out to be git-filter-repo, because it accepts a file listing the paths that need to be deleted. So I fetched a fresh clone of my repository and copied it as a backup.

Getting a fresh repo clone
git clone git://example.com/some-big-repo.git

Next, I ran git-filter-repo with the following parameters:

Running git-filter-repo script
python git-filter-repo.py --invert-paths --paths-from-file "files_i_dont_want.txt"
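
Before running a command like the one above, it may be worth checking the list against what git actually tracks, since a path with backslashes or a missing Content/ prefix will silently match nothing. The sketch below is an assumption of how such a check could look, not part of git-filter-repo: `tracked_files` and `split_by_tracked` are hypothetical helper names, and `git ls-files -z` only reports files tracked at the current HEAD.

```python
import subprocess


def tracked_files(repo_dir):
    """Return the set of repo-relative paths that git tracks at HEAD in repo_dir."""
    result = subprocess.run(
        ["git", "ls-files", "-z"],  # -z: NUL-separated, unquoted paths
        cwd=repo_dir, capture_output=True, check=True,
    )
    return {p for p in result.stdout.decode("utf-8").split("\0") if p}


def split_by_tracked(listed_paths, tracked):
    """Split the listed paths into (found, not_found) relative to the tracked set."""
    found, not_found = [], []
    for raw in listed_paths:
        path = raw.strip()
        if not path:
            continue  # ignore blank lines
        (found if path in tracked else not_found).append(path)
    return found, not_found
```

A path that ends up in `not_found` is not necessarily wrong, because a file deleted in an earlier commit is no longer tracked at HEAD but can still live in the history. It is just a cheap way to spot typos and wrong separators.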

It parsed all commits and wrote a new history, dropping the unneeded objects and repacking the repository. The whole operation took a fraction of a second. I could immediately see the difference in the number of objects by looking at the console log:

2314 objects were downloaded using clone command
Receiving objects: 100% (2314/2314), 2.11 MiB | 6.08 MiB/s, done.

1327 objects were repacked by git-filter-repo
Total 1327 (delta 126), reused 1327 (delta 126), pack-reused 0

Next, I ran the following commands to expire old reflog entries and run the garbage collector, which gets rid of the objects that are no longer referenced:

Cleaning repository out of unneeded files
git reflog expire --expire=now --all
git gc --prune=now --aggressive
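
To put numbers on the cleanup, the output of `git count-objects -v` can be compared between the backup copy and the cleaned clone. The following is a minimal sketch under that assumption; `parse_count_objects` and `repo_stats` are names I made up, and the parser simply reads the `key: value` lines that the command prints (for example `in-pack` and `size-pack`, the latter in KiB).

```python
import subprocess


def parse_count_objects(text):
    """Parse the 'key: value' lines printed by `git count-objects -v` into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats


def repo_stats(repo_dir):
    """Run `git count-objects -v` inside repo_dir and return the parsed numbers."""
    result = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_count_objects(result.stdout)
```

Running `repo_stats` on both directories and comparing the `size-pack` values should show roughly how much the packed history shrank.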

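Finally, I pushed the whole repository back to GitHub. git-filter-repo removes the origin remote after rewriting history, so it has to be added again first: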

Pushing the repository back to GitHub
git remote add origin git://example.com/some-big-repo.git
git push --set-upstream -f origin main

Ultimately, my repository decreased in size by half, and everything worked perfectly.

Conclusions

The next time I need to perform the same operation, I will have its steps written down here. Who knows, maybe it will help you, too. And if you know a better way of comparing two directories without using WinMerge and Python, let me know in the comments below. Thanks!