GoogleNews blows right by File Filters

Author Message
Erik S. 02/27/2004 06:48 pm
I am having a problem when downloading from GoogleNews (http://news.google.com/). I think the following exemplifies my situation:

Using the following settings...:
File Filters: "Text," "Others" (I only want the articles, no pix, font info...)
Excluded Server Keywords: "google" (I don't want any other Google pages)
Level limit: 1
Skip Existing files on levels higher than: 0

...I try the following address (just as an example)...:

http://news.google.com/news?hl=en&ie=ISO-8859-1&edition=us&q=%22john+kerry%22&btnG=Search+News

...And despite having set a level limit of only 1, OE just keeps adding and adding files to the Queue.

I tried using "Additional=DisableJava;DisableScripts" but the same occurs.

I wanted 11 total pages, and got 67.

Any solutions?

-E. Schwartz
Oleg Chernavin 02/28/2004 02:25 pm
I tried to load the same search for "john kerry" with your settings. I got 42 files.

The files break down as follows: 1 is the initial Google page with links to articles, and 22 are the direct links to articles from the Google search results. Note that some search results include up to 5 links to different sites. For example:


Sorted by relevance Sort by date


Pakistanis to raise funds for John Kerry -->> link to hipakistan.com
Hi Pakistan, Pakistan - 12 hours ago
... Since the Democrat party frontrunner in the presidential campaign, Senator John
Kerry is poised to challenge incumbent President George Bush for the United ...
Focused on Florida - Intellivu -->> link to Intellivu.com
Kerry ranks as Senate's 'most liberal' - PHXNews -->> link to PHXNews.com
Graham, Nelson to back Kerry - The Tallahassee Democrat -->> link to Tallahassee.com
Click10.com -->> link to www.local10.com and more »

The remaining 19 files were downloaded because they are referenced in frames: these are subpages that are inserted into the page itself. Offline Explorer considers them to be on the same level as the page itself, because they are not links to other pages, but parts of the page.
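To illustrate the level counting described above, here is a minimal sketch in Python. It is not Offline Explorer's actual code; the `crawl` function, the `skip_frames` flag, and the toy `pages` data (22 articles, 19 of which embed one frame each) are all made up for illustration, chosen only so the totals match the numbers in this thread.

```python
from collections import deque

def crawl(start, pages, level_limit, skip_frames=False):
    """Return {url: level} for every URL that would be queued.

    pages is toy data mapping a URL to {"links": [...], "frames": [...]}.
    """
    queued = {start: 0}
    todo = deque([start])
    while todo:
        url = todo.popleft()
        level = queued[url]
        page = pages.get(url, {})
        # Frames are parts of the page, so they keep the page's own level
        # and are queued even when the page already sits at the level limit.
        if not skip_frames:
            for frame in page.get("frames", []):
                if frame not in queued:
                    queued[frame] = level
                    todo.append(frame)
        # Ordinary links go one level deeper and are dropped past the limit.
        if level < level_limit:
            for link in page.get("links", []):
                if link not in queued:
                    queued[link] = level + 1
                    todo.append(link)
    return queued

# Hypothetical site layout: one search page linking to 22 articles,
# 19 of which embed a single frame. Each article also links further on.
pages = {"search": {"links": [f"article{i}" for i in range(22)]}}
for i in range(22):
    frames = [f"frame{i}"] if i < 19 else []
    pages[f"article{i}"] = {"frames": frames, "links": [f"more{i}"]}

print(len(crawl("search", pages, level_limit=1)))                    # 42
print(len(crawl("search", pages, level_limit=1, skip_frames=True)))  # 23
```

With a level limit of 1, the links from the articles are never followed, but the frames still are, which is where the extra files come from in this toy model.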

I understand that you don't need these frames. If you want, I can add another Additional= parameter that will skip loading these frames. Just let me know.

Best regards,
Oleg Chernavin
MP Staff
Erik S. 03/01/2004 03:53 pm
> The remaining 19 files were downloaded because they are referenced in frames: these are subpages that are inserted into the page itself. Offline Explorer considers them to be on the same level as the page itself, because they are not links to other pages, but parts of the page.
>
> I understand that you don't need these frames. If you want, I can add another Additional= parameter that will skip loading these frames. Just let me know.

That explains why Google is the only website I visit that has this problem: other sites stay within their own servers, but search engines link to other servers. I thought it was a serious bug, but now I realize that it's not.

That would be a good parameter, IMO.

-E. Schwartz
Oleg Chernavin 03/02/2004 03:05 am
OK. I will add this parameter shortly.

Oleg.
Oleg Chernavin 03/02/2004 09:52 am
I just added the parameter to skip loading IFrames. Now you can use the following line:

Additional=DisableScripts;DisableJava;SkipIFrames

I uploaded the updated oe.exe file here:

http://www.metaproducts.com/download/betas/oep1511.zip

Oleg.