If `a` links to `b`, `c`, `d`, `e`, and `f`, and each of `b` through `f` is a 302 redirect to `z`, then `z` is downloaded five times. This is especially a problem with the Internet Archive, because 99% of the pages return code 302, causing the "good" pages to be downloaded hundreds of times each.
302s should be handled as links: add their targets to the end of the queue. Failing that, it would be nice to have an option not to follow 302s to their new page.
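A minimal sketch of the behaviour requested above, assuming a simple breadth-first queue. This is not Offline Explorer's actual code: `crawl`, `fetch`, and the toy `site` table are hypothetical names made up for illustration. A 302 target is treated like any other link, so it is queued once and skipped if it has already been seen.

```python
from collections import deque

def crawl(start, fetch):
    """Breadth-first crawl that treats a 302 target as an ordinary link.

    `fetch(url)` is a hypothetical callback returning
    (status, redirect_target_or_None, list_of_links).
    """
    queue = deque([start])
    seen = {start}      # every URL that has ever been queued
    downloads = []      # URLs whose content was actually saved
    while queue:
        url = queue.popleft()
        status, location, links = fetch(url)
        if status == 302:
            # Do not follow the redirect immediately; queue its target instead.
            targets = [location]
        else:
            downloads.append(url)
            targets = links
        for target in targets:
            if target not in seen:   # duplicates are dropped here
                seen.add(target)
                queue.append(target)
    return downloads

# Toy site from the example: a links to b..f, and each of b..f redirects to z.
site = {"a": (200, None, ["b", "c", "d", "e", "f"]), "z": (200, None, [])}
for name in "bcdef":
    site[name] = (302, "z", [])

fetch_log = []

def fetch(url):
    fetch_log.append(url)
    return site[url]

downloads = crawl("a", fetch)
print(downloads, fetch_log.count("z"))
```

With the `seen` check in place, `z` is requested once rather than five times.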
I would only suggest that you watch the Queue from time to time and abort duplicate URLs manually.
I understand this is not very convenient, but if I change the logic, some sites will not download correctly at all.
Might it be possible to make it an option, e.g., "Don't reload redirected links"? Aborting duplicate URLs manually is not a practical option for two reasons: 1) I'm downloading about a million URLs from one site, so just checking which ones are duplicates would be impossible. 2) I would have to download one link at a time, because the duplicates sometimes appear all at once.
The Internet Archive makes ridiculously extensive use of 302, and doesn't have a text search. I'm downloading history for one website, and it's taking about 1,000 times as long as it needs to because of the number of duplications.
> Well, there was a reason to do it this way. There are sites which have to be authenticated. After the logon they redirect to the page that was already loaded (but now it is authorized and with the stuff available to members of the site), so it is necessary to load that page again.
> I would only suggest that you watch the Queue from time to time and abort duplicate URLs manually.
> I understand this is not very convenient, but if I change the logic, some sites will not download correctly at all.
> Best regards,
> Oleg Chernavin
> MP Staff
I also have an idea on the URL Substitutes issue. You just need to add another rule there that converts URLs from:
This should work.
Thanks for all the hard work!
You will need to use:
Additional=NoMovedDuplicates in the Project URLs field.
Please let me know how it works. Thank you!
Do you have this built as OE Enterprise? The link you offer contains the executable for Pro. I'm using Enterprise.
Can you please tell me how I can reproduce that?