project size ends up too large - I must be doing something wrong
|ivan||03/05/2010 11:09 am|
|I’m downloading a project with depth level = 3. I’m excluding all audio, images, and videos, and I’m checking “load only within the starting URL”. The problem is that the project ends up very large (over 5 GB). Does it seem like I’m doing something wrong, given that I’m only collecting textual information? I’m downloading a relatively large online discussion forum (number of posts = 12 million). Please let me know. Thanks a lot!|
|Paul||03/05/2010 04:06 pm|
|With my project, I'm estimating about 100 GB for 10,000,000 files for demonoid.com, and the queue is still going but has slowed down substantially. With just 52,000 files I'm looking at about 1.5 GB or 2 GB. But I'm having trouble now with Resuming from Files, where it says the size is only 20,000 KB, unless that's just what has been added since the resume process?
Write back how many files you have downloaded so far, and from what site. But it sounds like it could be right if it's videos and pictures. AMAZING how it all adds up, isn't it!!
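As a rough cross-check of the figures above, here is a minimal sketch of a linear extrapolation from the 52,000-file sample to 10,000,000 files. It assumes decimal units and that every remaining file has the same average size as the first 52,000, which need not hold; the function name is made up for illustration.

```python
# Rough linear extrapolation from the figures quoted in the post above.
# Assumptions: 1 GB = 1,000,000,000 bytes, constant average file size.

GB = 1_000_000_000


def extrapolate_gb(sample_files, sample_gb, target_files):
    """Scale a measured sample size linearly to a larger file count."""
    bytes_per_file = sample_gb * GB / sample_files
    return bytes_per_file * target_files / GB


low = extrapolate_gb(52_000, 1.5, 10_000_000)
high = extrapolate_gb(52_000, 2.0, 10_000_000)
print(f"{low:.0f} to {high:.0f} GB")
```

Under that assumption the total comes out well above 100 GB, so the final size depends heavily on what kind of files are still in the queue.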
|Oleg Chernavin||03/06/2010 08:23 am|
|I think the project also downloads user profiles and other information that is of no use to you. I would suggest using the Project Properties - URL Filters - Filename section - Excluded list to skip unwanted addresses.
Or you may use the Included list to allow only certain pages. For example, look at this forum:
The URL Filters - Filename - Included list should contain:
If you have a problem setting up the other site, let me know its URL and I will help you out.
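To illustrate the idea of Excluded and Included lists, here is a minimal sketch in Python. It is not how the program itself evaluates its filters; it simply assumes that a keyword matches when it appears anywhere in the address. The `allowed` function, the addresses, and the keywords are all hypothetical.

```python
def allowed(url, included=(), excluded=()):
    """Return True if the URL would pass keyword-style filters.

    Assumption: a keyword matches when it appears anywhere in the URL.
    """
    if any(word in url for word in excluded):
        return False
    # An empty included list places no restriction.
    return not included or any(word in url for word in included)


# Hypothetical addresses and keywords, for illustration only.
urls = [
    "http://forum.example.com/topic.php?id=10",
    "http://forum.example.com/profile.php?user=3",
    "http://forum.example.com/calendar.php",
]

# Excluded list: skip everything that contains an unwanted keyword.
kept_by_exclusion = [u for u in urls if allowed(u, excluded=["profile", "calendar"])]

# Included list: allow only addresses that contain a wanted keyword.
kept_by_inclusion = [u for u in urls if allowed(u, included=["topic.php"])]

print(kept_by_exclusion)
print(kept_by_inclusion)
```

In this sketch both approaches keep only the topic page. The Excluded list is convenient when only a few kinds of pages are unwanted; the Included list is stricter and keeps out anything you did not think of in advance.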
|ivan||03/08/2010 02:22 pm|
What does this text you've included in the last message mean?
I understand this is some sort of code for inclusion parameters. But what exactly do those commands accomplish? And how did you come up with the keywords for the commands?
I'm trying to download this entire forum:
I do want the users' profile info to be downloaded. I do not want any pictures, videos, music files, or any external sites that might be linked from this site. Please let me know how I can set up this project.
|ivan||03/08/2010 02:27 pm|
|the login info is:
|Oleg Chernavin||03/09/2010 04:43 am|
|OK. For this forum you should use in URL Filters - Filename - Included list:
Set File Filters - Images, Video, etc. to "Load from the starting server" in the Location field.
|ivan||03/09/2010 04:20 pm|
Could you explain what the keywords mean? I am trying to figure out the logic of how to do it for a different site, without having to ask you... Thanks!
|Oleg Chernavin||03/10/2010 01:41 am|
|Open the URL for the forum. You will see that there are links to various sections, like:
and so on. The common part is forumdisplay.php
Then let's go inside some section. Particular topics have another kind of address:
Again, the common part (that doesn't appear in other links on the site) is showthread.php
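The same inspection can be sketched in a few lines of Python: collect some links of each kind and look at the script name they share. The host name and query values below are made up; only forumdisplay.php and showthread.php come from the description above, and `script_names` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse


def script_names(urls):
    """Count the last path component (the script name) of each URL."""
    return Counter(urlparse(u).path.rsplit("/", 1)[-1] for u in urls)


# Hypothetical links of the kind described above.
links = [
    "http://forum.example.com/forumdisplay.php?f=2",
    "http://forum.example.com/forumdisplay.php?f=7",
    "http://forum.example.com/showthread.php?t=1234",
    "http://forum.example.com/showthread.php?t=5678",
]

counts = script_names(links)
print(sorted(counts))
```

The names that repeat across many links of one kind are the candidates for keywords in the Included list.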
Respectively, user profile links: