302 Page moved causes redundant downloading
|Keith Mason||05/22/2004 12:19 pm|
|When downloading from a site that returns many 302 responses, the same pages are loaded over and over again. It seems that a 302 causes OE to simply replace the old link with the new link in the queue, without checking whether the new link is already in the queue or has already been downloaded.
If `a` links to `b`, `c`, `d`, `e`, and `f`, and each of `b` through `f` returns a 302 to `z`, then `z` is downloaded five times. This is especially a problem with the Internet Archive, because 99% of the pages return code 302, causing the "good" pages to be downloaded hundreds of times each.
302s should be handled as links: add them to the end of the queue. Failing that, it would be nice to have an option to not follow 302s to their new page.
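To make the `a` to `z` example concrete, here is a minimal Python sketch of the two queue policies. It is only an illustration of the behavior described in this post, not OE's actual code; the `SITE` map, the `crawl` function, and the `dedupe_redirects` flag are all hypothetical names.

```python
from collections import deque

# Hypothetical site: each URL either lists its links or redirects (302) elsewhere.
SITE = {
    "a": ("links", ["b", "c", "d", "e", "f"]),
    "b": ("redirect", "z"),
    "c": ("redirect", "z"),
    "d": ("redirect", "z"),
    "e": ("redirect", "z"),
    "f": ("redirect", "z"),
    "z": ("links", []),
}

def crawl(start, dedupe_redirects):
    """Return the list of page bodies fetched, in order."""
    queue = deque([start])
    seen = {start}      # URLs already queued or downloaded
    downloads = []
    while queue:
        url = queue.popleft()
        kind, payload = SITE[url]
        if kind == "redirect":
            target = payload
            if dedupe_redirects:
                # Proposed policy: treat the 302 target like an ordinary link.
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
            else:
                # Reported policy: swap in the new URL without any check.
                queue.appendleft(target)
            continue
        downloads.append(url)
        for link in payload:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloads

print(crawl("a", dedupe_redirects=False).count("z"))  # z fetched once per redirect
print(crawl("a", dedupe_redirects=True).count("z"))   # z fetched once in total
```

With the check in place, the redirect target goes through the same "seen" test as any other link, so the number of fetches no longer grows with the number of 302s pointing at the same page.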
|Oleg Chernavin||05/22/2004 12:44 pm|
|Well, there was a reason to do it this way. There are sites which have to be authenticated. After the logon they redirect to the page that was already loaded (but now it is authorized and with the stuff available to members of the site), so it is necessary to load that page again.
I can only suggest that you watch the Queue from time to time and abort duplicate URLs manually.
I understand, this is not very convenient, but if I change the logic, some sites will be unable to download correctly at all.
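The login case can be sketched in Python as follows. This is a hypothetical illustration of the trade-off, not the program's real logic; the `/login` and `/members` URLs, the `fetch` helper, and the `skip_known_redirect_targets` flag are assumptions made up for the example.

```python
def fetch(url, session):
    """Hypothetical fetcher: returns (status, body or redirect location)."""
    if url == "/login":
        session["authed"] = True          # logging in authorizes the session
        return 302, "/members"            # then redirects to the members page
    if url == "/members":
        body = "member content" if session["authed"] else "please log in"
        return 200, body
    raise KeyError(url)

def crawl(urls, skip_known_redirect_targets):
    """Return a dict of saved pages after processing the queue."""
    session = {"authed": False}
    saved = {}
    queue = list(urls)
    while queue:
        url = queue.pop(0)
        status, payload = fetch(url, session)
        if status == 302:
            if skip_known_redirect_targets and payload in saved:
                continue                  # target already loaded: do not reload it
            queue.insert(0, payload)      # follow the redirect immediately
        else:
            saved[url] = payload
    return saved

# /members is loaded once before the login, then the login redirects back to it.
print(crawl(["/members", "/login"], skip_known_redirect_targets=False))
print(crawl(["/members", "/login"], skip_known_redirect_targets=True))
```

Without the duplicate check the page is reloaded after the login and the authorized version is saved; with the check, only the pre-login version is kept. That is why skipping duplicates cannot simply become the default.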
|Keith Mason||05/22/2004 12:49 pm|
|Hmmm... I understand. I've run across those sites myself.
Might it be possible to make it an option, e.g., "Don't reload redirected links"? Aborting duplicate URLs manually is not practical for two reasons: 1) I'm downloading about a million URLs from one site, so just checking which ones are duplicates would be impossible. 2) I would have to download one link at a time, because the duplicates sometimes appear all at once.
The Internet Archive makes ridiculously extensive use of 302, and doesn't have a text search. I'm downloading the history of one website, and it's taking about 1,000 times as long as it needs to because of the number of duplications.
|Oleg Chernavin||05/22/2004 12:55 pm|
|OK. I can add another Additional= option for the URLs field. Does this sound fine to you? If yes, I will work on it this Monday.
I also have an idea on the URL Substitutes issue. You just need to add another rule there that converts URLs from:
This should work.
|Keith Mason||05/22/2004 01:02 pm|
|Hmmm... I like it... I'll give it a try.
|Oleg Chernavin||05/22/2004 01:05 pm|
|What about the Additional= option? Would it be convenient for you to use it?
|Keith Mason||05/22/2004 01:08 pm|
|Sure, if there were an Additional= option that instructed OE to check the map/queue when hitting 302s, that would be superb!
Thanks for all the hard work!
|Keith Mason||05/22/2004 01:10 pm|
|Note: I still want to follow 302s if their targets haven't been downloaded already, and to have OE still write the redirect file. The desired behavior is just to not follow duplicates.|
|Oleg Chernavin||05/24/2004 08:36 am|
|Yes, sure. Redirect file will be composed, but the Queue will not allow duplicate files. I just made the improvement. Here is the updated oe.exe file:
You will need to use:
Additional=NoMovedDuplicates in the Project URLs field.
Please let me know how it works. Thank you!
|Keith Mason||05/24/2004 12:54 pm|
|Do you have this built as OE Enterprise? The link you offer contains the executable for Pro. I'm using Enterprise.
|Oleg Chernavin||05/25/2004 04:16 am|
|Sure. Here it is:
|Keith Mason||05/25/2004 08:44 pm|
|I may be using it wrong, but it seemed to have no effect.|
|Oleg Chernavin||05/26/2004 08:44 am|
|Can you please tell me how I can reproduce that?