How do you find duplicate URLs?
You are given a huge number of URLs, about 10 billion. How would you efficiently find the duplicates among them? Clearly, they cannot all be held in memory at once.
The difficulty lies in the scale: 10 billion URLs. How much storage would they take? If a URL is roughly 100 characters long and each character takes 4 bytes, then 10 billion URLs require about 4 terabytes. You almost certainly cannot keep that much data in memory.
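The back-of-the-envelope estimate above can be checked directly (the 100-character and 4-byte figures are the stated assumptions, not measured values):

```python
# Storage estimate under the stated assumptions:
# 10 billion URLs, ~100 characters each, 4 bytes per character.
num_urls = 10_000_000_000
bytes_per_url = 100 * 4

total_bytes = num_urls * bytes_per_url
print(total_bytes / 10**12, "TB")  # prints: 4.0 TB
```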
Let’s first solve a simplified version of the problem. Imagine the whole URL list fits in memory. In that case you can build a hash table that maps each URL seen so far to true; any URL already present in the table is a duplicate. (An alternative: sort the list and scan for adjacent equal entries. Sorting costs extra time, but it needs no auxiliary hash table, since duplicates end up next to each other.)
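The in-memory version can be sketched in a few lines (a minimal illustration, not tied to any particular URL format):

```python
def find_duplicates(urls):
    """Return the set of URLs that occur more than once.

    Assumes the whole list fits in memory: `seen` plays the role
    of the hash table described above.
    """
    seen = set()
    duplicates = set()
    for url in urls:
        if url in seen:
            duplicates.add(url)  # second (or later) occurrence
        else:
            seen.add(url)
    return duplicates
```

For example, `find_duplicates(["a.com", "b.com", "a.com"])` returns `{"a.com"}`.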
Now that the simplified version is solved, we can return to the full 4 TB of data, which cannot be held in memory all at once. We can either keep part of the data on disk or split it across several machines.
Solution 1: storing the data on disk
If you want to keep everything on a single machine, you can make two passes over the data. During the first pass, split the list into 4000 chunks of roughly 1 GB each. A simple way to do this is to write each URL u to a file <x>.txt, where x = hash(u) % 4000. This partitions the URLs by hash value, so all URLs with the same hash value land in the same file; in particular, every copy of a given URL ends up in the same chunk.
During the second pass, apply the earlier solution to each chunk in turn: load the file into memory, build a hash table of its URLs, and look for duplicates.
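The two passes can be sketched as follows. This is a simplified illustration: `NUM_CHUNKS`, the directory layout, and the use of Python's built-in `hash` are all assumptions made for the example (a real system would use a stable hash such as MD5, since `hash()` varies between Python processes).

```python
import os

NUM_CHUNKS = 4000  # ~1 GB per chunk for ~4 TB of input

def partition_pass(urls, out_dir):
    """First pass: append each URL u to <x>.txt, where x = hash(u) % NUM_CHUNKS.

    All copies of the same URL hash to the same chunk, so each chunk
    can later be deduplicated independently.
    """
    handles = {}
    try:
        for url in urls:
            x = hash(url) % NUM_CHUNKS
            if x not in handles:
                handles[x] = open(os.path.join(out_dir, f"{x}.txt"), "a")
            handles[x].write(url + "\n")
    finally:
        for f in handles.values():
            f.close()

def dedup_pass(chunk_path):
    """Second pass: a single ~1 GB chunk fits in memory,
    so reuse the in-memory hash-table approach on it."""
    seen, duplicates = set(), set()
    with open(chunk_path) as f:
        for line in f:
            url = line.rstrip("\n")
            if url in seen:
                duplicates.add(url)
            else:
                seen.add(url)
    return duplicates
```

Running `dedup_pass` over every file written by `partition_pass` yields the full duplicate set.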
Solution 2: many machines
This algorithm is very similar to the previous one, except that the chunks live on different machines. Instead of writing a URL to the file <x>.txt, you send it to machine x.
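The routing rule is the same modular hash as before, just interpreted as a machine index rather than a file name (a sketch; `num_machines` and the use of Python's `hash` are assumptions for illustration):

```python
def machine_for(url, num_machines=4000):
    """Return the index of the machine responsible for this URL.

    Because the mapping is deterministic, every copy of a URL is sent
    to the same machine, so each machine can deduplicate its own shard
    independently using the in-memory hash-table approach.
    """
    return hash(url) % num_machines
```

For instance, two occurrences of the same URL always satisfy `machine_for(u) == machine_for(u)`, which is what makes per-machine deduplication sufficient.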
This approach has both advantages and disadvantages.
The main benefit is parallelism: all 4000 chunks can be processed simultaneously, which gives a significant time saving on large datasets.
The main drawback is that all 4000 machines must run without errors or crashes, which is hard to guarantee in practice (especially with this much data and this many machines). You therefore have to plan for fault tolerance. In the end, both solutions are reasonable, and each has its place.