Download Full Website and Only Certain Linked Pages
|jmazor||06/06/2004 11:26 pm|
|Oleg, how can I do these two things?
I'd like to download all of the pages in a website plus only some of the pages that the site links to (i.e., only the linked pages that contain the word "Bush").
I'd then like OEP to add new qualifying pages to the project daily, weekly or monthly without removing saved pages, even if those saved pages have been removed from their websites.
|jmazor||06/06/2004 11:46 pm|
It seems that I'm doing two downloads:
1. The entire website, with only pages on the website's server, regardless of content; and,
2. Linked pages on other servers, down to two levels, IF they have the word "Bush" in them.
|Oleg Chernavin||06/07/2004 09:53 am|
|Do you want to load the pages that contain "Bush" in links, like http://www.server.com/page_Bush.html or in the pages contents?
|jmazor||06/07/2004 10:55 am|
|Thank you for the fast response! I'd want to download linked pages:
a. Where the content of the linked page contains the word "Bush"; and,
b. Even if the page in the starting website does not contain that word.
For example, these pages are part of the starting website ("has" refers to content):
1. (Website page has "bush") AND (Linked page has "Bush") - Download
2. (Website page has "bush") AND (Linked page NOT have "Bush") - No Download
3. (Website page NOT have "bush") AND (Linked page has "Bush") - Download
4. (Website page NOT have "bush") AND (Linked page Not have "Bush") - No Download
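Read one way, the four cases above collapse to a single rule: the decision depends only on the linked page's own text, never on the referring page. A minimal sketch of that rule outside of OEP (the function name and regex are mine, not an OEP feature; case handling is a choice left to the user):

```python
import re

# Case-sensitive whole-word match for "Bush" ("ambush" and "Bushel" do not count).
KEYWORD = re.compile(r"\bBush\b")

def should_save(linked_page_text: str) -> bool:
    """The truth table reduces to: save the linked page if and only if
    ITS OWN text contains the keyword, regardless of whether the page
    that linked to it does."""
    return bool(KEYWORD.search(linked_page_text))
```

Cases 1 and 3 of the table both return True here (the linked page has "Bush"); cases 2 and 4 both return False.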
> Do you want to load the pages that contain "Bush" in links, like http://www.server.com/page_Bush.html or in the pages contents?
> Best regards,
> Oleg Chernavin
> MP Staff
|Oleg Chernavin||06/07/2004 11:28 am|
|There is a Contents Filter in the Project Properties; however, it works for all downloaded pages, and there is currently no way to specify different rules for different pages. It would have to be completely redesigned for that purpose.
How important is this kind of download for you? If it is really important, I will work on redesigning the Contents Filters.
|jmazor||06/07/2004 10:32 pm|
|Thank you, Oleg. This kind of filtering ability would be very useful for me, and I think it would be useful for others as well. I can think of two aspects that I would hope that you consider:
A. Presently, I find the content filtering choices ambiguous and hard to use:
1. "Search for all keywords." Will the condition be met if ANY of the keywords is present, or only if ALL of them are present?
2. "Save pages with no keywords in their text." All pages will be saved, except that any page that has even one keyword will not be saved.
3. "Do not save pages with keywords in their text." Isn't this the same as the one immediately above?
4. "Download graphics files for pages with no keywords in their text?" I have no idea what this means.
5. "Stop downloading when keywords are encountered." Does this mean abort the download as soon as one keyword is found? Does this mean pause the download each time a keyword is found? Functionally, how would this be used differently than 2 or 3? When and why would someone use this?
B. I’d like to be able to have OEP download or skip a page depending upon whether the text on the linked page has the following (i.e., “a”, “b”, “c”, etc. represent words):
a OR b OR c
a AND b
a /5 b (i.e. a and b are within 5 words of each other, in any order)
a +5 b (i.e. a and b are within 5 words of each other, but b is after a)
“a x b y c z” (i.e. the phrase “a x b y c z”)
a NOT b (i.e., this would be the least important type of filter)
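Of the operators above, the two proximity forms (`/5` and `+5`) are the only ones with non-obvious semantics. A rough Python sketch of one possible interpretation, where distance is the difference in word positions (all names here are mine; this is an illustration of the requested semantics, not an OEP feature):

```python
import re

def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def near(text, a, b, n, ordered=False):
    """a /n b: a and b occur within n words of each other, in any order.
    With ordered=True this becomes a +n b: b must come after a."""
    ws = words(text)
    pos_a = [i for i, w in enumerate(ws) if w == a]
    pos_b = [i for i, w in enumerate(ws) if w == b]
    for i in pos_a:
        for j in pos_b:
            dist = (j - i) if ordered else abs(j - i)
            if 0 < dist <= n:
                return True
    return False
```

The simpler operators fall out directly: `a OR b` is `near`-free substring presence of either word, `a AND b` is presence of both, and `a NOT b` is presence of `a` with absence of `b`.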
C. Important: Aside from the syntax of the filters, I’d like to download all of certain websites and some but not all of the pages they link to and I think that this would be very useful for people in many fields. Examples:
1. One user might want to download the entire Cancer Society site plus the linked pages that deal with liver cancer (but only the linked pages that deal with liver cancer).
2. Another user might want to download four entire political websites and also download (only) those linked pages that deal with Iraq.
I would think that this approach would be useful for many people. Maybe, on the URL filtering section, the portion that says “Load files only within the starting server....” and “Load up to ## links on other servers.” could be followed with: “Load only linked pages that contain one or more of the following words/phrases: ______________________”.
|Oleg Chernavin||06/08/2004 04:46 am|
|Yes, the current design is not easy to use. In fact, Contents Filter was a quick feature that I added when some users asked me about it. Different people had different needs for this, so I was adding more and more checkboxes on that page. I see that your need requires a serious redesign. I will add it to my plans. I am not sure which version will contain the new system you are asking about. Perhaps, I will have to combine them with the current URL Filters somehow.
|Jeffrey Mazor||06/08/2004 06:49 pm|
|Thank you again for your response. I, for one, am delighted that you added the content filters. I couldn't use the program without them.
|Oleg Chernavin||06/09/2004 03:23 am|
|Thank you!
|Steven||07/09/2005 01:37 am|
|> Thank you!
> > Oleg.
I like your product, but I think it'll be my dream software if the following features become available:
1. Is there any way to fully analyze the link? E.g., I want to download some pages within the same server, but I also want to download a link pointing to another server if the text of the link itself is "Complete Story" (only that link, with no further downloading on that site). For example, on www.linuxtoday.com, I want to download the page <a href="http://www.pocketpcthoughts.com/index.php?action=expand,41356">Complete Story</a> even though it doesn't come from www.linuxtoday.com. I tried your content filtering but cannot make this work.
2. For the same linuxtoday site, I'm only interested in the links in the main body of the home page. I don't want to download the tabs containing Preferences, Search, Contact Us on the left, or those like "Editor's Picks" on the right, even though they are from the same server.
3. "Level limit" is not the ideal way to control how much to download. If we set a small "level limit", a lot of useful pages will not be downloaded; if we set a big "level limit", a lot of irrelevant pages are downloaded as well. It would be ideal if we could follow the human way of downloading pages. Imagine I'm browsing a web forum: normally I click all the links in the main body of the first page, then click "Next Page", browse the links in the main body again, and so on until I have clicked "Next Page" n times. In this case I don't lose anything, yet I don't include the irrelevant pages.
Just my 2 cents,
|Oleg Chernavin||07/09/2005 04:30 am|
|For 1 and 2, you can use URL Filters | Filename with a suitable filename mask
to allow or disallow such links.
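Filtering on the visible text of a link, as in Steven's "Complete Story" example, is different from filtering on the URL itself, which is what a Filename mask matches. Purely as an illustration of the idea (not how OEP's URL Filters work; the class and function names are mine), a small Python sketch that extracts only the links whose anchor text matches:

```python
from html.parser import HTMLParser

class LinkTextFilter(HTMLParser):
    """Collect href values of <a> tags whose visible text equals wanted_text,
    e.g. off-server links labelled "Complete Story"."""
    def __init__(self, wanted_text):
        super().__init__()
        self.wanted = wanted_text
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.wanted:
                self.links.append(self._href)
            self._href = None

def complete_story_links(html):
    parser = LinkTextFilter("Complete Story")
    parser.feed(html)
    return parser.links
```

A crawler built on this would follow the returned off-server URLs one level deep and ignore every other external link, which is the behavior Steven describes.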
The 3rd wish is not easy to do. I will think about it.