My math-major friend sent me this link: http://hardware.slashdot.org/story/12/09/02/1223201/ask-slashdot-how-do-i-de-dupe-a-system-with-42-million-files


We discussed the most efficient solution.

**Problem:** We have 4.2 million files, which take up 4.9TB of space. I want to get rid of duplicate files.

**Solution (theoretical):**

*Step 1*: Check the filesize of every file. If a filesize is unique to one file, we know that file cannot be a duplicate.

__pseudocode__

we have an inverted index, map[filesize, list[filepath]], this is called **fsIndex**

- for each directory
    - for each file in the directory
        - look in **fsIndex** for filesize
        - if filesize exists, put the filepath in **fsIndex**
        - else, put the filesize, new list(with filepath) in **fsIndex**
- for each entry in **fsIndex**
    - if there is only 1 filepath for this filesize, delete the entry

__big-O__

N = number of files

1. reading all N file descriptors = O(N)

2. iterating through the index = O(N)

therefore O(N)
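Step 1 can be sketched in Python along these lines (a minimal sketch; the function name `build_size_index` is mine, not from the original post):

```python
import os
from collections import defaultdict

def build_size_index(root):
    """Step 1: build fsIndex, mapping filesize -> list of filepaths.

    Only directory entries and stat() metadata are read here;
    no file contents are touched yet.
    """
    fs_index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanished or are unreadable
            fs_index[size].append(path)
    # a filesize with only one filepath cannot be a duplicate: drop it
    return {size: paths for size, paths in fs_index.items() if len(paths) > 1}
```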

*Step 2*: Compare the CRC32 checksum of every potential dup file. Same checksum = duplicate.

__pseudocode__

from step 1 we have an inverted index with all possible dup files in it, this is still called **fsIndex**

we have another inverted index, map[checksum, list[filepath]], this is called **csIndex**

- for each filepath in **fsIndex**
    - run CRC32 on the file, this outputs the checksum
    - if checksum exists in **csIndex**, then put the filepath into this entry's list
    - else, put the checksum, new list(with filepath) in **csIndex**
- for each entry in **csIndex**
    - (do whatever)* to the file at the filepath, because it's a duplicate

__big-O__

N = number of files

1. reading all N files from the hard disk = O(N)

2. iterating through the index = O(N)

therefore O(N)
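A minimal Python sketch of Step 2 (the names `crc32_of_file` and `build_checksum_index` are mine). One hedge worth stating: CRC32 collisions are possible, so matching checksums really flag *probable* duplicates that a byte-for-byte compare could confirm:

```python
import zlib
from collections import defaultdict

def crc32_of_file(path, chunk_size=1 << 20):
    """CRC32 of one file, read in chunks so large files fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def build_checksum_index(fs_index):
    """Step 2: build csIndex, mapping checksum -> list of filepaths.

    fs_index is the size index from Step 1 with unique sizes removed,
    so only candidate duplicates are read from disk.
    """
    cs_index = defaultdict(list)
    for paths in fs_index.values():
        for path in paths:
            cs_index[crc32_of_file(path)].append(path)
    # entries with more than one filepath are the (probable) duplicate sets
    return {crc: paths for crc, paths in cs_index.items() if len(paths) > 1}
```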

*Overall big-O*

Step 1 gives us O(N) and Step 2 gives us O(N), therefore the overall complexity of this solution is O(N), where N is the number of files.

Note: each file's contents are read from disk at most once, which minimizes disk reads (the bottleneck).

Additionally, we could multi-thread the checksum gathering in Step 2 part 1. This would maximize processor throughput.
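That multi-threaded variant might look like this (a sketch with names of my choosing; since the work is dominated by disk I/O, a thread pool suffices and no process pool is needed). One caveat: on a single spinning disk, concurrent reads can add seek overhead, so the parallelism pays off most on SSDs or disk arrays that can serve reads in parallel:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def _crc32_of_file(path, chunk_size=1 << 20):
    """Chunked CRC32 of one file (same as the single-threaded version)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def checksum_many(paths, max_workers=8):
    """Step 2 part 1, multi-threaded: checksum candidate files concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results zip back to their paths
        return dict(zip(paths, pool.map(_crc32_of_file, paths)))
```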
