Very slow parsing
|Themuzz||08/24/2008 12:14 pm|
I'm busy downloading a huge project (about 700,000 files), but the problem is that parsing is slower than downloading. After a day it has downloaded only 25,000 files and has 7,000 files in the parsing list. At that point everything becomes very slow, and it downloads files only at intervals of 30 minutes.
I've tried a lot, including the newest version. This is what I've tried so far:
Unchecked 'check files integrity' and 'suppress website errors'
Tried with the option 'evaluate script calculations'
'No link translation'
Only 4 connections, 2 seconds delay
'Prevent download directories from overloading' checked
But the list that needs to be parsed is still growing.
Most of the files I'm downloading don't need to be parsed; they don't contain any relevant links. I read somewhere here that the option SkipParsingFiles=*.pdf,*.exe exists.
The only problem is that the files that don't need to be downloaded are in a specific directory, so perhaps you could create something like SkipParsingFolders=*test* or SkipParsingLinks=*test*
That would be great and would save me about 20 days :) Oh yeah, with the SkipParsingFiles option, can you also skip files with no extension? Something like 'test/d78a6sd8776a7sd6'?
|Oleg Chernavin||08/24/2008 05:39 pm|
|If you have the "evaluate script calculations" box checked, this may greatly slow down parsing.
If you do not want some files to be downloaded, please use the URL Filters sections, for example the Filename - Excluded keywords list.
If you need to download them but do not want links extracted from them (parsing), then SkipParsingFiles=test will really help. It doesn't matter whether you specify an extension or part of a folder or filename there.
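To illustrate what a partial match like SkipParsingFiles=test means in practice, here is a minimal Python sketch. It is not Offline Explorer's actual code. It assumes the value is a comma-separated list, that each entry is matched case-insensitively anywhere in the URL, and that surrounding `*` characters can be ignored. The function names and the example.com URLs are made up for the illustration.

```python
def parse_skip_list(value):
    """Split a comma-separated option value into lowercase keywords.

    Surrounding '*' characters are dropped so that '*.pdf' and '.pdf'
    behave the same (an assumption made for this sketch).
    """
    keywords = []
    for item in value.split(","):
        keyword = item.strip().strip("*").lower()
        if keyword:
            keywords.append(keyword)
    return keywords


def should_skip_parsing(url, keywords):
    """Return True if any keyword occurs anywhere in the URL."""
    url = url.lower()
    return any(keyword in url for keyword in keywords)


keywords = parse_skip_list("test,*.pdf")
for url in [
    "http://example.com/test/d78a6sd8776a7sd6",  # folder match, no extension
    "http://example.com/docs/manual.pdf",        # extension match
    "http://example.com/index.html",             # no match
]:
    action = "skip parsing" if should_skip_parsing(url, keywords) else "parse"
    print(url, "->", action)
```

Under these assumptions a short keyword can match more than intended: `test` would also match a URL containing `latest`, so a longer fragment such as `/test/` would be a safer choice.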
|Themuzz||08/24/2008 06:52 pm|
|Thanks for the fast reply!
SkipParsingFiles works like a charm! Perhaps you should document it a little better, because I couldn't find anywhere that you can give just a part of the URL :)
Perhaps you could also add an option so that files with specific keywords in the URL are parsed, but the links found in those files are not parsed further. That would be handy too, I think :)
|Oleg Chernavin||08/25/2008 07:21 am|
|You are right! I completely forgot to describe it in the Help file. I just added a few sentences and two examples there.
Regarding the suggestion for another command - can you describe what you mean?
|Themuzz||08/25/2008 08:42 am|
|Thanks again for the fast reply!
About the new command I suggested, I'll try to explain it:
Sometimes I have to download a website with a huge number of files, for example with the following structure:
The files called (for example) justafile.html need to be parsed, and the files named lotsoffiles.html don't need to be parsed. The only problem is that justafile.html and lotsoffiles.html have a different name for each file, so I can't specify the name with the SkipParsingFiles command. So it would help if it were possible to start browsing a website and parse every page until some specific keywords (specified somewhere in the options) are found; the links in the file with the keywords would then be downloaded but not parsed.
I hope you understand :)
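The idea described above could be sketched roughly like this in Python. This is only an illustration of the requested behaviour, not an existing Offline Explorer feature. The `SITE` dictionary, the page names and the `crawl` function are hypothetical, and the sketch assumes the keywords are looked for in the page text (matching them against the URL would work the same way).

```python
from collections import deque

# Hypothetical in-memory site: URL -> (page text, outgoing links).
SITE = {
    "/index.html": ("welcome", ["/a.html", "/b.html"]),
    "/a.html": ("file listing", ["/f1.html", "/f2.html"]),
    "/b.html": ("about us", ["/c.html"]),
    "/c.html": ("contact", []),
    "/f1.html": ("data", ["/deep1.html"]),
    "/f2.html": ("data", ["/deep2.html"]),
}


def crawl(start, stop_keywords):
    """Download everything reachable, but stop parsing below keyword pages.

    Returns (downloaded, parsed) as lists of URLs in visiting order.
    """
    downloaded, parsed = [], []
    seen = {start}
    queue = deque([(start, True)])  # (url, parse_allowed)
    while queue:
        url, parse_allowed = queue.popleft()
        if url not in SITE:
            continue
        text, links = SITE[url]
        downloaded.append(url)
        if not parse_allowed:
            continue  # downloaded only; its links are never extracted
        parsed.append(url)
        # If this page contains a keyword, its links are still downloaded,
        # but those linked files will not be parsed themselves.
        children_parsed = not any(k in text for k in stop_keywords)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, children_parsed))
    return downloaded, parsed


downloaded, parsed = crawl("/index.html", ["file listing"])
print("downloaded:", downloaded)
print("parsed:    ", parsed)
```

In this toy site, /a.html contains the keyword, so /f1.html and /f2.html are downloaded but never parsed, and whatever they link to is never requested. One simplification to be aware of: a file first reached through a keyword page stays unparsed even if another, normal page links to it later.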
About your answer from yesterday: I started downloading yesterday, but today it's going very slowly. I have the option 'prevent download directories from overloading' checked, but I think that only works with files, doesn't it?
So now I have a folder from the website with over 200,000 other folders, and I think Windows doesn't like that either.
Perhaps you could add that to the 'prevent download directories from overloading' option, because it's going very, very slowly again :( unless you know a better way :)
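One general way to keep a single folder from collecting 200,000 subfolders is to spread them over a fixed number of intermediate "bucket" folders chosen by a hash of the name. The Python sketch below only illustrates that general technique. It is an assumption about how such protection could work, not how Offline Explorer's option is implemented, and `bucketed_path` and the bucket count of 256 are made up for the example.

```python
import hashlib
from pathlib import PurePosixPath


def bucketed_path(root, folder_name, buckets=256):
    """Place folder_name under root/<bucket>/ instead of directly under root.

    The bucket is derived from a hash of the name, so the same folder
    always ends up in the same place.
    """
    digest = hashlib.md5(folder_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return PurePosixPath(root) / f"{bucket:03d}" / folder_name


print(bucketed_path("site", "d78a6sd8776a7sd6"))
# With 200,000 folders and 256 buckets, each bucket holds about 780 folders.
print(200_000 // 256)
```

The catch is that the saved folder layout no longer matches the website, so every link between pages would have to be rewritten to the new locations.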
|Oleg Chernavin||08/25/2008 09:57 am|
|OK, I understand. I will think about such a command. Regarding the slowdown: yes, most probably it is because of the folders. But I do not have a solution for this now. It will not be easy to add this to the overloading prevention; there are too many things to check for.
|Themuzz||08/26/2008 08:13 am|
|Alright, too bad it's too hard. I'm still busy downloading the same website, so to solve my problem I had some other ideas :)
-an option to download all files, without folder structure, to the same folder with overload protection. Link translation won't be needed then :)
-because I have to download lots of folders, Windows will slow down. But if I suspend the project for 2 minutes, it can download at full speed for about 30 minutes. So could you create options on the Schedule page for 'suspend project every x minutes' and 'wait for x minutes while suspended before starting again'?
If you could create something like that, then people who have a problem with a huge list of files to parse could use this option so that parsing can be done while downloading is suspended. And for me, Windows can take a breath before it starts adding folders again to a huge folder with 200,000 other folders :)
Please let me know what you think about it and if you understand :)
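The suspend/resume cycle proposed above can be written down in a few lines of Python. This is just a sketch of the suggested schedule, using the 30 minute and 2 minute values from the post as defaults. The function names are hypothetical and nothing here corresponds to an existing Schedule option.

```python
def is_suspended(elapsed_seconds, run_minutes=30, pause_minutes=2):
    """Return True if the project would be paused at this moment.

    Each cycle is run_minutes of downloading followed by pause_minutes
    of waiting, counted from the moment the project was started.
    """
    run = run_minutes * 60
    cycle = run + pause_minutes * 60
    return elapsed_seconds % cycle >= run


def active_fraction(run_minutes=30, pause_minutes=2):
    """Share of wall-clock time that is spent downloading."""
    return run_minutes / (run_minutes + pause_minutes)


print(is_suspended(10 * 60))   # 10 minutes in: still downloading
print(is_suspended(31 * 60))   # 31 minutes in: inside the 2 minute pause
print(active_fraction())       # share of time spent downloading
```

With these values the project would spend 30 of every 32 minutes downloading, so the pauses cost only about 6% of the total time.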
|Oleg Chernavin||08/26/2008 08:35 am|
|The second thing can be done with the delay between downloads: just add Delay=10, where 10 is the time in seconds. This will slow down the download, but give the system time to refresh. You can experiment with the delay to find the optimal speed.
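When experimenting with the delay, a quick calculation shows how much waiting a given value adds on a project of this size. The Python sketch below assumes the delay is applied once per file, one file after another; how the delay actually interacts with several simultaneous connections is not stated in this thread. The function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def delay_overhead_days(file_count, delay_seconds):
    """Total time spent only on waiting between downloads, in days."""
    return file_count * delay_seconds / SECONDS_PER_DAY


# About 700,000 files, as in the project discussed above.
for delay in (10, 2, 0.5):
    days = delay_overhead_days(700_000, delay)
    print(f"Delay={delay}: about {days:.1f} days of waiting")
```

Under that assumption, Delay=10 on about 700,000 files adds roughly 81 days of pure waiting, so for a project this large a much smaller value is probably the place to start.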