project size ends up too large - I must be doing something wrong

Author Message
ivan 03/05/2010 11:09 am
I’m downloading a project with depth level = 3. I’m excluding all audio, images, and videos, and I’ve checked “load only within the starting URL”. The problem is that the project ends up very large (over 5 GB). Does it seem like I’m doing something wrong, given that I’m only collecting textual information? I’m downloading a relatively large online discussion forum (about 12 million posts). Please let me know. Thanks a lot!
Paul 03/05/2010 04:06 pm
With my project, I'm estimating about 100 GB for 10,000,000 files for demonoid.com, and the queue is still going but has slowed down substantially. With just 52,000 files I'm looking at about 1.5 or 2 GB. But I'm having trouble now with resuming from files, where it says it's only 20,000 KB, unless that's just what has been added since the resume process?

Write back how many files you have downloaded so far, and from what site. But it sounds like it could be right if it's videos and pictures. Amazing how it all adds up, isn't it!
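
As a rough check on the figures above, here is a back-of-the-envelope sketch in Python. It assumes decimal gigabytes and that the average file size stays the same as the download grows; both are assumptions, and the helper name `bytes_per_file` is made up for illustration.

```python
GB = 10**9  # assumption: decimal gigabytes


def bytes_per_file(total_bytes, file_count):
    """Average size of one downloaded file, in bytes."""
    return total_bytes / file_count


files_so_far = 52_000       # figure quoted in the post
target_files = 10_000_000   # figure quoted in the post

# The post gives a range of about 1.5 to 2 GB for the files so far.
low = bytes_per_file(1.5 * GB, files_so_far)
high = bytes_per_file(2.0 * GB, files_so_far)

print(f"average per file: {low / 1000:.1f} to {high / 1000:.1f} KB")
print(f"projected total:  {low * target_files / GB:.0f} to "
      f"{high * target_files / GB:.0f} GB")
```

The projection is only as good as the assumption that later files are about the same size as the first 52,000.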
Oleg Chernavin 03/06/2010 08:23 am
I think the project also downloads user profiles and other information that is useless. I would suggest using the Project Properties - URL Filters - Filename section - Excluded list to skip unwanted addresses.

Or you may use the Included list to allow only certain pages. For example, look at this forum:

http://www.i30ownersclub.com/forum/

The URL Filters - Filename - Included list should contain:

&board=
topic*.html
board*.html
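
To get a feel for what such a list does, here is a minimal Python sketch. It assumes each entry is a case-insensitive wildcard pattern that may match anywhere in the URL; that is a guess at the matching rule for illustration only, and the program's real filter may behave differently. The function name `is_included` and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Entries from the Included list above.
INCLUDED = ["&board=", "topic*.html", "board*.html"]


def is_included(url, patterns=INCLUDED):
    """Return True if any pattern occurs somewhere in the URL.

    Assumption: matching is case-insensitive and '*' stands for any run
    of characters. The real filter may use different rules.
    """
    lowered = url.lower()
    return any(fnmatchcase(lowered, "*" + p.lower() + "*") for p in patterns)


# Hypothetical URLs, made up for illustration.
samples = [
    "http://www.i30ownersclub.com/forum/index.php?action=forum&board=5",
    "http://www.i30ownersclub.com/forum/topic123.html",
    "http://www.i30ownersclub.com/forum/index.php?action=profile;u=42",
]

for url in samples:
    print("load" if is_included(url) else "skip", url)
```

Under these assumptions the first two addresses would be loaded and the profile address would be skipped.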

If you have a problem setting up the other site, let me know its URL and I will help you out.

Best regards,
Oleg Chernavin
MP Staff
ivan 03/08/2010 02:22 pm
Oleg,

What does this text you've included in the last message mean?

&board=
topic*.html
board*.html

I understand this is a code for some sort of inclusion parameters. But what exactly do those commands accomplish? And how did you come up with the keywords for the commands?

I'm trying to download this entire forum:

http://www.digitalcorvettes.com/forums/

I do want the users' profile info to be downloaded. I do not want any pictures, videos, music files, or any external sites that might be linked from this site. Please let me know how I can set up this project.

ivan
ivan 03/08/2010 02:27 pm
the login info is:

username: ivan3789
password: project3789
Oleg Chernavin 03/09/2010 04:43 am
OK. For this forum you should use in URL Filters - Filename - Included list:

forumdisplay.php
showthread.php
member.php

Set File Filters - Images, Video, etc. to "Load from the starting server" in the Location field.

Oleg.
ivan 03/09/2010 04:20 pm
Oleg,

Could you explain what the keywords mean? I am trying to figure out the logic of how to do it for a different site, without having to ask you... Thanks!

ivan
Oleg Chernavin 03/10/2010 01:41 am
Open the URL for the forum. You will see that there are links to various sections, like:

http://www.digitalcorvettes.com/forums/forumdisplay.php?f=112
http://www.digitalcorvettes.com/forums/forumdisplay.php?f=249
http://www.digitalcorvettes.com/forums/forumdisplay.php?f=9

and so on. The common part is forumdisplay.php

Then let's go inside some section. Particular topics have another kind of address:

http://www.digitalcorvettes.com/forums/showthread.php?t=132625
http://www.digitalcorvettes.com/forums/showthread.php?t=135475

etc.

Again, the common part (one that doesn't appear in other links on the site) is showthread.php

User profile links follow the same pattern, with member.php as the common part:

http://www.digitalcorvettes.com/forums/member.php?find=lastposter&t=135654
http://www.digitalcorvettes.com/forums/member.php?find=lastposter&t=132625
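
The same inspection can be done with a short script when there are many links to look through. This is only a sketch in Python using the standard library: it takes the last path component of each URL (the script name) and counts how often it occurs. The function name `script_names` is made up for illustration, and the list holds just the links quoted in this thread.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit


def script_names(urls):
    """Count the last path component (the script name) of each URL."""
    return Counter(posixpath.basename(urlsplit(u).path) for u in urls)


# Links quoted in this thread.
links = [
    "http://www.digitalcorvettes.com/forums/forumdisplay.php?f=112",
    "http://www.digitalcorvettes.com/forums/forumdisplay.php?f=249",
    "http://www.digitalcorvettes.com/forums/forumdisplay.php?f=9",
    "http://www.digitalcorvettes.com/forums/showthread.php?t=132625",
    "http://www.digitalcorvettes.com/forums/showthread.php?t=135475",
    "http://www.digitalcorvettes.com/forums/member.php?find=lastposter&t=135654",
    "http://www.digitalcorvettes.com/forums/member.php?find=lastposter&t=132625",
]

for name, count in script_names(links).most_common():
    print(name, count)
```

The names that come up repeatedly for the pages you want are the candidates for the Included list; names that belong to pages you don't need can go to the Excluded list instead.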

Oleg.