Monday, September 3, 2012

De-duping a huge amount of files (theoretical solution)

My math-major friend sent me this link:

We discussed the most efficient solution.

Problem: We have 4.2 million files, which take up 4.9 TB of space. I want to get rid of the duplicate files.

Solution (theoretical):

Step 1: Check the filesize of every file. If a filesize is unique to a single file, that file cannot be a duplicate.

We build an inverted index, map[filesize, list[filepath]], called fsIndex:

  1. for each directory
    1.  for each file in the directory
    2.  look up the file's filesize in fsIndex
    3.  if the filesize exists, append the filepath to that entry's list
    4. else, put (filesize, new list containing the filepath) in fsIndex
  2. for each entry in fsIndex
    1. if there is only 1 filepath for this filesize, delete the entry 
So now we have an index containing only the potential duplicate files.
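Step 1 could be sketched in Python like this (the function and variable names are my own; os.walk stands in for "for each directory"):

```python
import os
from collections import defaultdict

def build_size_index(root):
    """Walk `root`, grouping filepaths by filesize (the fsIndex).
    Sizes that map to a single file cannot be duplicates, so they
    are dropped before returning."""
    fs_index = defaultdict(list)  # map[filesize, list[filepath]]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # os.path.getsize only stats the file; it never reads
            # the file's contents
            fs_index[os.path.getsize(path)].append(path)
    # keep only sizes shared by more than one file (potential dups)
    return {size: paths for size, paths in fs_index.items()
            if len(paths) > 1}
```

Note that defaultdict makes the "if filesize exists / else" branch implicit: looking up a missing size creates the empty list automatically.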

N = number of files
1. reading all N file descriptors = O(N) (each fsIndex lookup/insert is an O(1) hash operation)
2. iterating through the index = O(N)

therefore Step 1 is O(N)

Step 2: Compare the CRC32 checksum of every potential dup file. Same checksum = duplicate. (Strictly speaking, CRC32 is only 32 bits, so unrelated files can collide; a byte-by-byte comparison of matching files would make it certain.)

From Step 1 we have an inverted index holding all the possible dup files; this is still called fsIndex.
We build another inverted index, map[checksum, list[filepath]], called csIndex:

  1.  for each filepath in fsIndex
    1. run CRC32 on the file, which outputs the checksum
    2. if the checksum exists in csIndex, append the filepath to that entry's list
    3. else, put (checksum, new list containing the filepath) in csIndex
  2. for each entry in csIndex
    1.  if the entry's list has more than 1 element, for each filepath in the list after the 1st
      1.  (do whatever)* to the file at that filepath, because it's a duplicate
*(do whatever) = delete the file, or output the filepath somewhere so the user can manually decide which files to delete.
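Step 2 could be sketched like this (again, names are assumed). zlib.crc32 from the standard library accepts a running value, so the checksum can be computed chunk by chunk without loading a whole file into memory. I build a csIndex per size bucket, since files of different sizes can't be duplicates of each other anyway:

```python
import zlib
from collections import defaultdict

def crc32_of(path, chunk_size=1 << 20):
    """CRC32 of a file, computed incrementally in 1 MB chunks."""
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            crc = zlib.crc32(chunk, crc)
    return crc

def find_duplicates(fs_index):
    """Group the potential-dup files by checksum (the csIndex);
    within each group, every file after the 1st is a duplicate."""
    duplicates = []
    for paths in fs_index.values():
        cs_index = defaultdict(list)  # map[checksum, list[filepath]]
        for path in paths:
            cs_index[crc32_of(path)].append(path)
        for group in cs_index.values():
            duplicates.extend(group[1:])  # "(do whatever)" to these
    return duplicates
```

Here find_duplicates just returns the filepaths, matching the "output the filepath somewhere" option rather than deleting anything.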

N = number of files
1. reading the potential dup files (at most N of them) from the hard disk = O(N)
2. iterating through the index = O(N)

therefore Step 2 is O(N)

Overall big-O
Step 1 is O(N) and Step 2 is O(N), therefore the overall complexity of this solution is O(N), where N is the number of files.

Note: each file's contents are read from the disk at most once (Step 1 only stats the files), which minimizes disk reads (the bottleneck).
Additionally, we could multi-thread the checksum computation in Step 2 part 1. This would maximize processor throughput.
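The multi-threading idea could be sketched with a thread pool (names assumed; the file reads are I/O-bound, so threads can overlap the disk waits):

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def crc32_of(path, chunk_size=1 << 20):
    """CRC32 of a file, computed incrementally in 1 MB chunks."""
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            crc = zlib.crc32(chunk, crc)
    return crc

def find_duplicates_threaded(fs_index, workers=4):
    """Checksum every potential-dup file on a thread pool, then
    group by checksum; files after the 1st in a group are dups."""
    paths = [p for group in fs_index.values() for p in group]
    cs_index = defaultdict(list)  # map[checksum, list[filepath]]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so grouping is deterministic
        for path, crc in zip(paths, pool.map(crc32_of, paths)):
            cs_index[crc].append(path)
    return [p for group in cs_index.values() for p in group[1:]]
```

One caveat: on a single spinning disk, parallel reads can actually hurt because of seek contention, so this helps most on SSDs or when the checksumming itself is the slow part.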

