Improving Performance in a Data Mining Project

pterrell
12/29/2003 02:43 pm
I'm attempting to mine an ASP-based, data-driven website. I believe the site has about 250K pages, each only 4-8K in size, reachable by searching from AAA to ZZZ.

So the project starts from 17,576 URLs (26 x 26 x 26 three-letter searches).

I've kicked the tires on several spidering products and found that most had no way for me to specify the long list of URLs, which is why I'm really impressed by Offline Explorer's macro feature.

http://www2.dre.ca.gov/PublicASP/pplinfo.asp?NAV=1&LicenseeName={:A..Z}{:A..Z}{:A..Z}
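
For readers who want to see what that macro amounts to, here is a minimal Python sketch of the same expansion. It assumes each {:A..Z} placeholder stands for one uppercase letter and that the three placeholders are combined in every possible way, which matches the 17,576 figure above. The function name expand_macro is hypothetical and is not part of Offline Explorer.

```python
from itertools import product
from string import ascii_uppercase

# Base URL taken from the post; the three-letter suffix is appended to it.
BASE = "http://www2.dre.ca.gov/PublicASP/pplinfo.asp?NAV=1&LicenseeName="

def expand_macro(base=BASE, width=3):
    """Yield one URL per letter combination, AAA through ZZZ (assumed macro semantics)."""
    for letters in product(ascii_uppercase, repeat=width):
        yield base + "".join(letters)

urls = list(expand_macro())
print(len(urls))   # number of starting URLs
print(urls[0])     # first URL, ending in AAA
print(urls[-1])    # last URL, ending in ZZZ
```

A list like this could also be saved to a text file for tools that accept a plain URL list instead of a macro.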

At first Offline Explorer chugged away, but after about 10K pages the screen periodically stops updating for long stretches of time. It also consumes 90%+ of the CPU and updates the screen at a snail's pace.

After two days, the screen is reporting:
Downloaded 90,440
Parsing: 44,324
Queue: 21

This is on a new Dell P4, XP Pro, 512M, 40G. Nothing else is running.

What can I do to improve performance, so that I know more accurately what is happening and can better control the download speed? (The download performance itself is not an issue: I've throttled it back to 2 threads so I don't hog the server.)
pterrell
12/30/2003 12:53 pm
I "discovered" that exiting the program and rebooting the system is a bad thing. After being back up for 12 hours, the status bar reports:
Parsing (38054)
Downloaded 62370
So, am I worse off than when I started 3 days ago?
Oleg Chernavin
01/01/2004 05:23 am
Hello,

I would suggest using version 3.0, which has significant improvements in downloading and parsing speed. Please download it here:

http://www.metaproducts.com/download/eebsetup.exe

Also, if the performance is still low, please set a 1-second delay between downloads in the Options dialog.

I would also suggest pressing Ctrl-F5 to start downloading the Project. This way Offline Explorer will go through all downloaded files to find the missing ones and load only them.
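
The idea behind that "load only the missing ones" pass can be sketched in Python as follows. This is only an illustration of the concept, not how Offline Explorer works internally. It assumes a hypothetical layout with one saved page per search key (for example AAA.html); the helper names search_keys, missing_keys and saved_names_in are made up for this example.

```python
from itertools import product
from pathlib import Path
from string import ascii_uppercase

def search_keys(width=3):
    """All three-letter search keys, AAA through ZZZ."""
    return ["".join(p) for p in product(ascii_uppercase, repeat=width)]

def missing_keys(keys, saved_names):
    """Return the keys that have no saved page yet, preserving order."""
    saved = set(saved_names)
    return [k for k in keys if k not in saved]

def saved_names_in(folder):
    """Collect key names from files like AAA.html (hypothetical naming scheme)."""
    return {p.stem for p in Path(folder).glob("*.html")}

keys = search_keys()
# Pretend every key except the last two was already downloaded.
todo = missing_keys(keys, keys[:-2])
print(todo)
```

Only the keys returned by missing_keys would need to be requested again, so an interrupted run does not have to start from AAA.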

I hope this helps.

Best regards,
Oleg Chernavin
MP Staff
pterrell
01/05/2004 04:18 pm
Oleg,

Thanks for the reply.

I was away for four days with the 2.9 version chugging away. I came back to find that the program was not running and that 198K files had been downloaded.

I'm not sure whether the program quit on its own, because I configured it in the settings not to do that. Is there any log being recorded that would tell me whether it finished mining the URLs or whether some problem caused it to quit?

The delay-between-downloads function improves performance...? Wouldn't adding a delay slow things down?

I upgraded to the 3.0 beta and started the download with Ctrl-F5. After 14 hours, it is saying:
Parsing: 38624
Downloaded: 68239
Is this performance to be expected?