Limits on the number of files you can store in a directory
04/03/2022 Update: Instead of using ext4, consider other filesystems such as XFS, ZFS, or Btrfs for high-file-count storage.
It's normal for a typical research project to store each piece of data as an individual file under a directory. Although this is not ideal, no one seems to question what the limits of such a method are.
Ext4
Ext4 is the most commonly used filesystem in Linux environments nowadays, so we will focus on finding the limit on the number of files under a directory.
According to this Stack Overflow answer, the ext4 limit is 2^32 - 1 (the maximum size of the inode table), likely due to compatibility with 32-bit systems.
So the answer is 2^32 - 1, or roughly 4 billion, right?
Introducing Index node aka Inode
However, under Linux things don't quite work this way. To access each file efficiently, its metadata is stored in an index node, or what we typically call an inode. Each inode has a unique number, and hard links let multiple filenames point to the same inode. The inode is also where permissions, owner, and group settings are stored.
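To see inodes and hard links in action, here is a small sketch; the filenames are just examples:

```bash
# Create a file and a hard link to it; both names point to the same inode.
touch original.txt
ln original.txt hardlink.txt

# -i prints the inode number; both entries should show the same number.
ls -i original.txt hardlink.txt

# stat shows the inode number along with permissions, owner, group, and link count.
stat original.txt
```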
Now, the bad news is that inodes are limited. The good news is that the limit is usually very large, and the default value grows proportionally with disk size. For example, here are the inode counts I get on my machine using df -i:
| Disk | Storage Size | Inode Limit | Inodes in Use | Storage Usage |
|---|---|---|---|---|
| Samsung 860 | 512G | 30,531,584 | 4,069,444 | 97% |
| WD Blue SSD | 512G | 30,498,816 | 1,029,493 | 93% |
| WD Blue SSD | 1T | 61,054,976 | 9,918,907 | 92% |
| WD Blue HDD | 6T | 183,144,448 | 4,392,142 | 10% |
As you can see, filling my SSDs nearly to capacity with millions of files doesn't even come close to reaching the inode limit.
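For reference, the numbers in the table come from plain df; a minimal sketch of how to reproduce them on your own machine:

```bash
# Inode totals, inodes in use, and inode usage percentage per mounted filesystem.
df -i

# Storage usage for the same mounts, for comparison with the inode figures.
df -h
```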
Unless you are storing tens to hundreds of millions of very small files (~1 kB each), inodes won't be a problem. If you are in the 1% who do have such a problem, I would recommend taking a look at this thread on creating a filesystem with a larger inode count.
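If you do expect to exhaust the default inode count, here is a minimal sketch of creating an ext4 filesystem with more inodes; /dev/sdX1 is a placeholder device and the ratio is only an example, not a tuned recommendation:

```bash
# A lower bytes-per-inode ratio makes mkfs reserve more inodes
# (here roughly one inode per 4 KiB of disk space).
mkfs.ext4 -i 4096 /dev/sdX1

# Alternatively, request an explicit number of inodes with -N:
# mkfs.ext4 -N 100000000 /dev/sdX1

# Verify the resulting inode count.
tune2fs -l /dev/sdX1 | grep -i 'inode count'
```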
So everything is fine, right?
Say hello to hash collisions
This is practically CS data structures 101, so I won't dwell on what a hash collision is.
Since a single directory can potentially hold up to 2^32 files, ext4 also maintains a hashed directory index (the dir_index feature), mapping each filename into a 32-bit hash space to speed up filename lookups. With enough files in one directory, collisions in that hash space become a practical concern.
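As a rough illustration, here is a birthday-bound estimate of how many colliding filename pairs to expect, assuming the 32-bit hash space described above; the file count is a made-up example:

```bash
# Expected number of colliding pairs under a 32-bit hash is roughly
# n*(n-1)/2 divided by 2^32 buckets (birthday approximation).
n=10000000   # hypothetical: 10 million files in a single directory
awk -v n="$n" 'BEGIN { printf "expected colliding pairs: %.0f\n", n * (n - 1) / 2 / 2^32 }'
```

Even at 10 million files this already predicts thousands of collisions, which is why the filesystem has to handle them rather than assume they never happen.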
Does this translate to subdirectory limits as well?
dir_nlink
As mentioned in the ext4 man page:
Normally, ext4 allows an inode to have no more than 65,000 hard links. This applies to regular files as well as directories, which means that there can be no more than 64,998 subdirectories in a directory
So no, it doesn't: subdirectories are capped at 64,998 per directory.
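To check which directory-related features an existing ext4 filesystem has enabled (look for dir_index and dir_nlink in the output), something like the following works; the device path is a placeholder:

```bash
# List the feature flags of the filesystem on the given device.
tune2fs -l /dev/sdX1 | grep -i 'features'
```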
Issues with ext4 high inode setup
Although raising the ext4 inode count to its upper limit seems to solve most of the related Stack Overflow problems, it didn't solve mine.
Out-of-space errors still occurred after reaching 15 million unique files in the filesystem, even though df -i showed only 4% of the total inodes in use. Hence I believe something is still missing that I failed to consider.
I decided to switch to XFS to solve the problem once and for all, since it allocates inodes dynamically at the cost of some storage space (an empty XFS filesystem uses about 15G on an 8T hard disk). So far I have been able to store all 30M files without any hurdles.
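For completeness, a minimal sketch of the switch, assuming /dev/sdX1 as a placeholder device and /mnt/data as a placeholder mount point; mkfs.xfs refuses to overwrite an existing filesystem unless you pass -f:

```bash
# Create an XFS filesystem; XFS allocates inodes dynamically as files are created.
mkfs.xfs /dev/sdX1

# Mount it and confirm the inode headroom.
mount /dev/sdX1 /mnt/data
df -i /mnt/data
```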