Why does df -h incorrectly account for a 65.3G file in the Used column?


I reported an issue to the OpenBSD misc@ mailing list regarding the recent behaviour of SpamAssassin on my mail server. In the ls -alh output below, the bayes_toks file is 65.3G in size; wc -c confirms this value, and the file lives on the /var partition. Yet in the df -h output, no partition uses anywhere near that much space. The largest is /var, which reports 4.3G used. When the file was deleted, the value reported for /var dropped from 4.3G to 2.0G. Can anyone explain why the used disk space is reported incorrectly?

$ ls -alh
total 4738352
drwx------ 2 _spampd _spampd 512B Sep 4 10:00 .
drwxr-xr-x 3 _spampd _spampd 512B Sep 3 15:57 ..
-rw------- 1 _spampd _spampd 36B Sep 4 09:53 bayes.lock
-rw------- 1 _spampd _spampd 9.8M Sep 3 22:52 bayes_seen
-rw------- 1 _spampd _spampd 65.3G Sep 3 22:55 bayes_toks
$ df -h
Filesystem Size Used Avail Capacity Mounted on
/dev/sd0a 1008M 90.1M 868M 9% /
/dev/sd0k 9.8G 80.3M 9.3G 1% /home
/dev/sd0d 3.9G 118K 3.7G 0% /tmp
/dev/sd0f 3.9G 1.0G 2.7G 28% /usr
/dev/sd0g 1001M 212M 738M 22% /usr/X11R6
/dev/sd0h 9.8G 572M 8.8G 6% /usr/local
/dev/sd0j 3.9G 2.0K 3.7G 0% /usr/obj
/dev/sd0i 2.0G 2.0K 1.9G 0% /usr/src
/dev/sd0e 598G 4.3G 564G 1% /var


Dump bayes_toks to a text file and restore it from that dump. That should do the trick.


Actually, I'm curious why the size was not reported properly. I checked the df sources: it asks statfs for the number of used blocks and multiplies that by the block size. I wonder how often those stats are refreshed and why the difference would be so large (65.3G accounted for as 2.3G).
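The block arithmetic described above can be sketched in Python. This is only an illustration under stated assumptions, not the df source: os.statvfs stands in for statfs, and the helper names (df_used_bytes, file_sizes, make_sparse_demo) are made up for this example. It also compares a file's apparent size (what ls -l and wc -c report) with the space actually allocated to it. One possible explanation, not confirmed in this thread, is that bayes_toks is a sparse file: the "total 4738352" line in the ls output, if it counts 512-byte blocks, comes to about 2.3G, which matches the drop seen in df.

```python
import os
import tempfile


def df_used_bytes(path):
    """Approximate df's Used column: (total blocks - free blocks) * fragment size."""
    vfs = os.statvfs(path)
    return (vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize


def file_sizes(path):
    """Return (apparent_size, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_size is the length that ls -l and wc -c report.
    # st_blocks counts the 512-byte units actually allocated on disk.
    return st.st_size, st.st_blocks * 512


def make_sparse_demo(directory, apparent_size=64 * 1024 * 1024):
    """Create a file of the given length without writing any data to it."""
    path = os.path.join(directory, "sparse_demo")
    with open(path, "wb") as f:
        # truncate() extends the file; filesystems that support holes
        # allocate few or no blocks for the unwritten range.
        f.truncate(apparent_size)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = make_sparse_demo(tmp)
        apparent, allocated = file_sizes(demo)
        print("apparent size  :", apparent, "bytes")
        print("allocated space:", allocated, "bytes")
        print("df-style used  :", df_used_bytes(tmp), "bytes")
```

On a filesystem that supports holes, the allocated figure will usually be far smaller than the apparent size, and only the allocated blocks show up in the df-style total. Running file_sizes against the real file would have shown whether that was the case here.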
As for the SpamAssassin issue itself, I removed the DB, so it started from scratch.