Limits of files you can store in a directory

04/03/2022 Update: Instead of using ext4, you should simply consider other filesystems such as XFS, ZFS, or Btrfs for high-file-count storage

It's normal for a typical research project to store each piece of data as an individual file under a directory. Although this is not ideal, no one seems to question what the limits of this approach are.

Ext4

Ext4 is the most commonly used filesystem in Linux environments nowadays, so we will focus on finding its limit on the number of files under a directory.

According to this Stack Overflow answer, ext4's limit is 2^32 - 1 (the maximum size of the inode table), likely for compatibility with 32-bit filesystems.

So the answer is 2^32 - 1, or roughly 4 billion, right?

Introducing the index node, aka inode

However, under Linux things don't work this way. To access files efficiently, each one is stored as an index node, or inode as we typically call it. Because this index number uniquely identifies a file's data, hard links can point several filenames at the same inode. The inode is also where the permissions, owner, and group settings are stored.
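As a quick illustration (the filenames are throwaways for the demo), you can read a file's inode number with ls -i or stat and watch a hard link reuse it:

```sh
# A throwaway demo: both names end up sharing one inode
touch original.txt
ln original.txt hardlink.txt   # hard link, not a symlink

# ls -i prints the inode number in the first column;
# both entries should show the same number
ls -i original.txt hardlink.txt

# stat shows the inode along with owner, group and permission bits
stat original.txt
```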

Now the bad news is that inodes are limited, but the limit usually comes as a very large number, and this default grows proportionally with the disk size. For example, here are the inode counts I get on my machines using df -i:

Disk        | Storage Size | Inode Limit | Inodes in Use | Storage Usage
----------- | ------------ | ----------- | ------------- | -------------
Samsung 860 | 512 GB       | 30,531,584  | 4,069,444     | 97%
WD Blue SSD | 512 GB       | 30,498,816  | 1,029,493     | 93%
WD Blue SSD | 1 TB         | 61,054,976  | 9,918,907     | 92%
WD Blue HDD | 6 TB         | 183,144,448 | 4,392,142     | 10%

As you can see, filling both of my SSDs nearly to capacity with a few million files doesn't even come close to reaching the inode limit.
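If you want to check the same numbers on your own machine, df -i reports per-mount inode usage, and on an ext4 device tune2fs can read the fixed inode count from the superblock (the device name below is only a placeholder):

```sh
# Inode totals, usage and free counts for every mounted filesystem
df -i

# For an ext4 device, tune2fs reports the inode count and inode size
# chosen at mkfs time (replace /dev/sdX1 with your actual device)
sudo tune2fs -l /dev/sdX1 | grep -i inode
```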

Unless you are packing the disk with tens or hundreds of millions of very small files (~1 kB each), inodes won't be a problem. If you are in the 1% who do have such a problem, I would recommend taking a look at this thread on creating a filesystem with more inodes (a sketch follows below).
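As a rough sketch of what that could look like with mke2fs; the device name is a placeholder and the numbers are only illustrative, not taken from that thread:

```sh
# The inode count of an ext4 filesystem is fixed at format time, so this
# has to be done on an empty device (all data on /dev/sdX1 is destroyed)

# -i sets the bytes-per-inode ratio: one inode per 4 KiB of space here,
# i.e. roughly four times the default number of inodes
sudo mkfs.ext4 -i 4096 /dev/sdX1

# Alternatively, request an explicit number of inodes with -N
sudo mkfs.ext4 -N 100000000 /dev/sdX1
```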

So everything is fine, right?

Say hello to hash collision

This is practically CS data structures 101, so I won't dwell on what a hash collision is.

Since a single directory can potentially hold up to 2^32 files, ext4 also maintains a hash-based directory index, keyed on 32-bit filename hashes, to speed up filename lookups.
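To get a feel for why collisions matter at this scale, here is a rough birthday-bound estimate, taking the 2^32 hash space above at face value; the 15-million figure is borrowed from the problem described later in this post:

```sh
# Expected number of colliding filename-hash pairs when n names are
# hashed into d = 2^32 slots: roughly n*(n-1)/(2*d)
echo 'n = 15000000; d = 2^32; n * (n - 1) / (2 * d)' | bc -l
# prints ~26193 -- tens of thousands of collisions at 15 million files,
# so collisions are unavoidable long before the 2^32 file limit
```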

Does this translate to a subdirectory limit as well?

As mentioned in the ext4 manpage:

Normally, ext4 allows an inode to have no more than 65,000 hard links. This applies to regular files as well as directories, which means that there can be no more than 64,998 subdirectories in a directory

So no, it doesn't: subdirectories are capped at 64,998 by the hard-link limit, far below 2^32.
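A short aside on why a hard-link cap turns into a subdirectory cap: every subdirectory's ".." entry is a hard link back to its parent, which you can verify with a throwaway directory (the names below are made up):

```sh
# A directory's hard-link count is 2 plus the number of subdirectories,
# because each child's ".." entry links back to the parent
mkdir -p demo/a demo/b demo/c
stat -c %h demo    # prints 5: demo itself, "demo/.", and three ".." links
```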

Issues with a high-inode ext4 setup

Although raising the ext4 inode count toward its upper limit seems to solve most of the problems people report on Stack Overflow, it didn't solve mine.

Out-of-space errors still occurred after the filesystem reached about 15 million unique files, even though df -i showed only 4% of the inodes in use. Hence I believe something else was at play that I had failed to consider, most likely the hashed directory index itself running out of room (see the last reading material below).

I decided to switch to XFS to solve the problem once and for all, since it allocates inodes dynamically at the cost of some extra storage overhead (an empty XFS filesystem uses about 15 GB of an 8 TB hard disk). So far I have been able to store all 30 million files without any hurdles.
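For completeness, here is a minimal sketch of that setup, assuming a placeholder device and mount point and stock mkfs.xfs defaults:

```sh
# Format the empty disk as XFS and mount it (destroys existing data)
sudo mkfs.xfs /dev/sdX1
sudo mount /dev/sdX1 /mnt/data

# XFS allocates inodes on demand, so df -i shows a ceiling that scales
# with free space instead of a number fixed at mkfs time
df -i /mnt/data

# xfs_info prints the filesystem geometry, including the inode size
xfs_info /mnt/data
```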

Reading materials

  1. Ext4 high level design documentation

  2. Simple filesystem introduction if you don’t have time to read 1

  3. ext4: Mysterious “No space left on device”-errors

Written on November 27, 2021