FindDups can locate duplicate files hidden away in your server volumes.


FindDups <path> [/nosize] [/nocase] [/crc [/nonull] [/revsort]] [/L=<dir\logfile>] [/?]

Path: use volume:directory format, e.g. data:users. If no volume is specified, the path is taken as relative to sys:, so finddups sys:login and finddups login are equivalent.

/nosize: By default, FindDups flags files with the same name and file size. If you specify /nosize, it reports on matching filenames only.

/nocase: By default, the filename comparison is case sensitive (Filea is different from FileA). Using /nocase removes the case distinction.

/crc: Locate duplicates using file CRCs. Very accurate, but slower than the other methods.

/nonull: Can be used in conjunction with /crc. Prevents CRC checking of 0-byte files, which would otherwise all be flagged as duplicates with a CRC of 0.

/revsort: Display duplicates found by /crc in largest-to-smallest order.

/L=<dir\logfile>: Directory path and file to log the results to. The default is SYS:FindDups.log.

/?: Display usage info.

If you use the /nosize switch, individual file sizes and the total wastage aren't listed.
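
As a rough illustration of the name-based matching described above, here is a minimal Python sketch. It is not FindDups' actual code: the function names find_name_dups and wasted_bytes are hypothetical, and the wastage formula (every copy beyond the first counts as wasted) is an assumption about how the total is worked out.

```python
import os
from collections import defaultdict

def find_name_dups(root, nosize=False, nocase=False):
    """Group files under root by name, and by size too unless nosize is set."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            key_name = name.lower() if nocase else name   # mimics /nocase
            full = os.path.join(dirpath, name)
            size = None if nosize else os.path.getsize(full)  # mimics /nosize
            groups[(key_name, size)].append(dirpath)
    # Only keys seen in more than one place are duplicates.
    return {key: dirs for key, dirs in groups.items() if len(dirs) > 1}

def wasted_bytes(dups):
    """Assumption: every copy beyond the first is wasted space."""
    return sum(size * (len(dirs) - 1)
               for (_name, size), dirs in dups.items()
               if size is not None)
```

With nosize=True the sizes are never read, so wasted_bytes has nothing to add up, which lines up with sizes and wastage being left out of the log in that mode.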

Sample FindDups.log

Filename : 1px_spacer.gif
FileSize : 43
Instances: 5
- VMTEST/SYS:/SYSTEM/ndsimon/public/images
- VMTEST/SYS:/JAVA/NWGFX/help/tour/images
- VMTEST/SYS:/adminsrv/webapps/welcome/images
- VMTEST/SYS:/adminsrv/webapps/WebAdmin/Images
- VMTEST/SYS:/adminsrv/webapps/apacheadmin/images

Filename : 3000.vda
FileSize : 767
Instances: 2
- VMTEST/SYS:/JAVA/NWGFX/ace/monitors/viewsonic
- VMTEST/SYS:/JAVA/NWGFX/ace/monitors/optiquest


Files which have duplicates : 3,523
Total number duplicate Files : 8,322
Space Wasted due to Duplicate: 544,482,250

Using the /crc switch identifies duplicate files by their CRCs. This is incredibly accurate, but much more time- and disk-intensive. It's a two-pass process: the first pass reads the entire directory structure, enumerating all the files and their sizes. In the second pass, all files of the same size have their CRCs compared (this is the slow part, since the whole file has to be read to calculate the CRC).
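
The two-pass idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool's implementation: it assumes a 32-bit CRC (zlib.crc32), and the names crc32_of and find_crc_dups and the nonull parameter are made up for the example.

```python
import os
import zlib
from collections import defaultdict

def crc32_of(path, chunk=1 << 16):
    """Read the whole file in chunks and return its CRC-32 (assumed CRC type)."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return crc

def find_crc_dups(root, nonull=False):
    # Pass 1: walk the tree and bucket every file by its size.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)

    # Pass 2: only sizes shared by two or more files need a CRC (the slow part).
    dups = {}
    for size, paths in by_size.items():
        if len(paths) < 2 or (nonull and size == 0):
            continue
        by_crc = defaultdict(list)
        for path in paths:
            by_crc[crc32_of(path)].append(path)
        for crc, same in by_crc.items():
            if len(same) > 1:
                dups[(size, crc)] = same
    return dups
```

Sorting the result with sorted(dups, reverse=True) would give a largest-to-smallest listing along the lines of /revsort, since each key starts with the file size.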

Two files with the same size and CRC aren't necessarily identical, but the odds of them being different are something like 1 in 4 billion, so if the CRCs match, you can assume the files are a match <g>. The comparison doesn't rely on filenames at all, so even if a file has been renamed, it can be matched as a dup.

For example, a run against a fairly default 6.5 server turned up the following result, and the files were confirmed as identical via FC.exe:

CRC Match (352928234)
File Size : 298630
- VMTEST/SYS:/usr/bin\captoinfo.nlm
- VMTEST/SYS:/usr/bin\infotocap.nlm
- VMTEST/SYS:/usr/bin\tic.nlm
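
If you want to double-check a CRC match the way FC.exe was used here, a byte-for-byte comparison does the job. The Python sketch below uses the standard filecmp module; the helper name all_identical is hypothetical and the paths passed in are whatever a CRC match reported.

```python
import filecmp

def all_identical(paths):
    """Return True if every file in paths has the same contents as the first."""
    first = paths[0]
    # shallow=False forces a real content comparison, not just a stat() check.
    return all(filecmp.cmp(first, other, shallow=False) for other in paths[1:])
```

Running this only on groups that already share a size and CRC keeps the extra reads small compared with comparing every file against every other.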

Memory requirements are about the same for the CRC comparison as for the filename comparison: approximately 100MB of server memory per million files. The need to calculate the CRCs is what makes it slower.
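
For a quick back-of-envelope figure before a run, the rule of thumb above can be turned into a one-line calculation. This is just arithmetic on the quoted 100MB per million files; the function name estimated_memory_mb is hypothetical and real usage will vary.

```python
def estimated_memory_mb(file_count, mb_per_million=100):
    """Scale the quoted ~100MB per million files linearly to file_count."""
    return file_count * mb_per_million / 1_000_000

# A volume holding 2.5 million files would need roughly this many MB.
print(estimated_memory_mb(2_500_000))
```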

