Parser is escaping the given restrictions

Author Message
FSchr 03/24/2011 03:01 pm
Hello,

I'm about to buy your Offline Explorer Pro, but there are still some issues.

First and most important: it seems that the parser escapes the project server on files that are
named like non-HTML files but contain HTML data.

Here is an example from the Download Queue:

URL: http://www.rechsteiner-basel.ch/publik/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/zSaeule_Reformdrucknzz02.pdf

Referer URL: http://www.rechsteiner-basel.ch/publik/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/kopf.gif

www.rechsteiner-basel.ch is, of course, not in the project addresses.
But GIFs and PDFs are allowed to load from anywhere.

If you need more information, just let me know.
I'm willing to help track the problem down.

Version is: Offline Explorer Pro 5.9.3318 Service Release 3

Regards,
Frank

P.S.: The Ctrl+A hotkey is not working in this editor (I'm writing from inside OE Pro), either.
Oleg Chernavin 03/24/2011 03:31 pm
Please uncheck the "Suppress server errors" box in the Properties - Parsing section. This should work.

Regarding Ctrl+A - I will check this. Thank you!

Best regards,
Oleg Chernavin
MP Staff
Oleg Chernavin 03/24/2011 04:15 pm
Yes, 200 OK means that the server outputs pages under a wrong address, and those pages contain wrong links. This is an example of not very correct site design. In your case I would suggest adding an Excluded Directory keyword in the URL Filters:

/bilder/*/bilder/

This will disallow such URLs. Just please change all File Filters sections to Load using URL Filters.
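For illustration, the effect of such a wildcard exclusion keyword can be sketched in a few lines of Python. This is a hypothetical check using `fnmatch`; Offline Explorer's actual keyword matching may work differently.

```python
from fnmatch import fnmatchcase

# Hypothetical illustration of an excluded-directory wildcard;
# OE's real keyword matching may behave differently.
EXCLUDE = "*/bilder/*/bilder/*"

def excluded(url: str) -> bool:
    """True when the URL matches the exclusion wildcard."""
    return fnmatchcase(url, EXCLUDE)

# A runaway URL with repeated /bilder/ folders is matched and skipped:
looping = ("http://www.rechsteiner-basel.ch/publik/fileadmin/templates/"
           "bilder/uploads/media/fileadmin/templates/bilder/kopf.gif")
# A URL with a single /bilder/ folder is not matched and still loads:
normal = ("http://www.rechsteiner-basel.ch/publik/fileadmin/templates/"
          "bilder/kopf.gif")

print(excluded(looping))  # True
print(excluded(normal))   # False
```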

I will think about suspending the parsing queue. So far it is very useful to pause downloads and keep the parser working.

Oleg.
FSchr 03/24/2011 04:15 pm
First, thank you for your fast reply!

But the suggested solution is not working out.
I unchecked 'Suppress Web Site Errors', clicked Apply and OK, and removed all
the unwanted URLs from the Download Queue.

But the parser is repopulating it with URLs like the reported one.

And to be honest, even if this had worked, I'd still have considered it
a bug.

Anyway, I checked the HTTP response with wget (see the full result as an
appendix to this reply).
The code was 200 OK, so the server didn't return an error.

Regards,
Frank

P.S.: The above already shows the second issue (#2) I have with this program:
even when the download is suspended, the parser keeps working. As I'm writing this, it
still has around 60,000 pages to process.
This behaviour sometimes makes it impossible to clear the queue completely of unwanted
URLs (I managed it this time, though).
It's also not possible to give full CPU power to another process. That's a problem,
because OEP is obviously supposed to run for days at a time, and restarting the program
carries a penalty of hours of reparsing.
It's great that the download is suspendable, but the same should be true for the parsing
process.
This issue is not a major one for me, though, as I can still freeze the whole Windows
VM if I want to.


Appendix:
-----------

frank@Frank:/home/stone/home/tmp/oe-pro> wget --server-response http://www.rechsteiner-basel.ch/publik/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/zSaeule_Reformdrucknzz02.pdf
--2011-03-24 20:50:33-- http://www.rechsteiner-basel.ch/publik/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/zSaeule_Reformdrucknzz02.pdf
Resolving www.rechsteiner-basel.ch... 92.42.184.90
Connecting to www.rechsteiner-basel.ch|92.42.184.90|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Thu, 24 Mar 2011 19:50:33 GMT
Server: Apache
Set-Cookie: fe_typo_user=fd0834b51a; path=/
Vary: Accept-Encoding
Connection: close
Content-Type: text/html;charset = utf-8
Length: unspecified [text/html]
Saving to: `zSaeule_Reformdrucknzz02.pdf.1'

[ <=> ] 18,830 107K/s in 0.2s

2011-03-24 20:50:34 (107 KB/s) - `zSaeule_Reformdrucknzz02.pdf.1' saved [18830]

frank@Frank:/home/stone/home/tmp/oe-pro>
frank@Frank:/home/stone/home/tmp/oe-pro>
frank@Frank:/home/stone/home/tmp/oe-pro>
frank@Frank:/home/stone/home/tmp/oe-pro> wget --server-response http://www.rechsteiner-basel.ch/publik/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/kopf.gif
--2011-03-24 20:51:27-- http://www.rechsteiner-basel.ch/publik/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/fileadmin/templates/bilder/uploads/media/fileadmin/templates/bilder/fileadmin/templates/bilder/kopf.gif
Resolving www.rechsteiner-basel.ch... 92.42.184.90
Connecting to www.rechsteiner-basel.ch|92.42.184.90|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Thu, 24 Mar 2011 19:51:27 GMT
Server: Apache
Set-Cookie: fe_typo_user=7cf519c0f8; path=/
Vary: Accept-Encoding
Connection: close
Content-Type: text/html;charset = utf-8
Length: unspecified [text/html]
Saving to: `kopf.gif'

[ <=> ] 18,830 106K/s in 0.2s

2011-03-24 20:51:27 (106 KB/s) - `kopf.gif' saved [18830]

frank@Frank:/home/stone/home/tmp/oe-pro>
Oleg Chernavin 03/24/2011 06:03 pm
Yes, I understand that an obvious solution is to skip parsing pages with a ...gif or ...pdf ending and a text/html MIME type, and actually not to consider them HTML at all.

But I have seen many examples where files with, say, an .asp or .php extension were images, and .png ones were real pages containing a slideshow or larger pictures. I can't come up with real samples right now, but I remember it happening on some sites.

So if I made this workaround, some sites would be handled improperly. The web is very diverse, and there are many odd things on it. I will think about this case and what kind of workaround to make, but it will be more complex than this obvious approach.

Oleg.
FSchr 03/24/2011 06:03 pm
> Yes, 200 OK means that the server outputs pages under a wrong address, and those pages contain wrong links. This is an example of not very correct site design. In your case I would suggest adding an Excluded Directory keyword in the URL Filters:

This is not correct. The HTTP 200 status code says that the HTTP server (Apache, in the case of my example) was able to deliver the requested resource without an error.

The naming of the resource is not relevant to the server, and that PDF files are usually named <something>.pdf is pure convention, not a requirement.

What really counts is the Content-Type header of the server response.
And as you can see in the wget output I provided, these content types are text/html.
No wonder the parser is parsing them: it really is HTML.

The problem, as I see it, is that the PDF and GIF files in this example are allowed by the filter for images, which says that images may be loaded from anywhere.
So I guess the filter judges by the .pdf or .gif in the URL's FILE NAME, but the parser
finds that the MIME type is text/html and treats the file NOT as a PDF or GIF but as a source of more links.

Well, at least that's what I think is happening :-)
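The mismatch I mean can be demonstrated with a few lines of Python (standard library only; the function and names are just my illustration, not anything from OE):

```python
import mimetypes
from urllib.parse import urlparse

def type_mismatch(url: str, content_type: str) -> bool:
    """True when the MIME type implied by the URL's file name differs
    from the Content-Type the server actually sent."""
    guessed, _ = mimetypes.guess_type(urlparse(url).path)
    actual = content_type.split(";")[0].strip().lower()
    return guessed is not None and guessed.lower() != actual

# The ".pdf" from my appendix came back as text/html:
print(type_mismatch("http://example.org/zSaeule_Reformdrucknzz02.pdf",
                    "text/html;charset = utf-8"))   # True
# An honest PDF would pass:
print(type_mismatch("http://example.org/paper.pdf",
                    "application/pdf"))             # False
```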

> /bilder/*/bilder/
>
> This will disallow such URLs. Just please change all File Filters sections to Load using URL Filters.

Yes, but this will only work for the one example I've given.
I don't want to do that for all the other URLs from other servers.
I'd miss most of them anyway, because I cannot watch all of them.

This problem must be fixed in the software, not in the user ;-)

> I will think about suspending the parsing queue. So far it is very useful to pause downloads and keep the parser working.

Yes, you are right! In most cases it makes sense as it is. The additional option to suspend parsing as well would be very useful in some cases, though!
But as I said, I can live with this minor problem. The above one is a major one, though; I hope you can find a solution for it!

Thanks,
Frank
FSchr 03/24/2011 06:36 pm
>Yes, I understand that an obvious solution is to skip parsing pages with a ...gif or ...pdf ending and a text/html MIME type, and actually not to consider them HTML at all.

>But I have seen many examples where files with, say, an .asp or .php extension were images, and .png ones were real pages containing a slideshow or larger pictures. I can't come up with real samples right now, but I remember it happening on some sites.

>So if I made this workaround, some sites would be handled improperly. The web is very diverse, and there are many odd things on it. I will think about this case and what kind of workaround to make, but it will be more complex than this obvious approach.

>Oleg.

Oleg,

as a programmer myself, I certainly understand the complexity of such programs. It is definitely not an easy task!

Maybe I'm wrong, but may I suggest the following: I think it should be quite easy for the parser to recheck, just before it starts parsing, whether the filter settings actually allow the resource to be parsed. In this example the resource turned out not to be an actual image or PDF, so the rules that allow images or PDFs from anywhere would not apply, and parsing would be denied.
At that point the program already knows the filter was tricked, because the type suggested by the URL/file name and the real type indicated by the Content-Type differ.
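A sketch of the recheck I have in mind (the names and the filter callback are hypothetical; I don't know OE's internals):

```python
from urllib.parse import urlparse

def may_parse(url: str, content_type: str, html_allowed) -> bool:
    """Re-run the filter decision just before parsing, using the real
    Content-Type instead of the type suggested by the file name."""
    actual = content_type.split(";")[0].strip().lower()
    if actual != "text/html":
        return False                  # not HTML -> nothing to parse
    # It IS HTML, so apply the text-file rules, not the image/PDF
    # rules that originally let the download through:
    return html_allowed(url)

# Hypothetical text rule: "load text only from the starting server".
html_rule = lambda u: urlparse(u).hostname == "el-abba.org"

# A mislabelled "PDF" that is HTML from a foreign server is not parsed:
print(may_parse("http://other-server.example/file.pdf",
                "text/html;charset = utf-8", html_rule))            # False
# A real page on the starting server is still parsed:
print(may_parse("http://el-abba.org/index.html", "text/html",
                html_rule))                                         # True
```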

Anyway, as soon as a real fix for this problem is available, I'm going to buy this program. I'm very sure about that!
I have already been evaluating several spiders, but I'd really prefer Offline Explorer Pro for its ease of use, its reliability and its design! I also like that I can always see what's going on ... it's really fun to work with!

You guys did an excellent job!

Thanks,
Frank
FSchr 03/24/2011 06:59 pm
P.S.: An even simpler solution (maybe): just don't parse any resource that was downloaded only because the image or PDF file filters (and the like) allowed it.
Oleg Chernavin 03/25/2011 07:21 am
Frank,

I have really faced situations where a page with a ...png extension was a regular web page with a large image.

It was a forum with links to images. However, when you follow such a link, you get a page with the image, ads, other links, etc.

If I implemented your solution, Offline Explorer would load the forum page, follow such ...png links, see that they are pages, and not parse them. It would not get the link to the actual image inside the page, and the job would fail.

This is why in such cases I prefer to load more rather than less.

I agree that it is inconvenient to download weird pages like this. However, it happens only on rare sites. Offline Explorer also has protection from URLs with folder repetitions, like:

http://www.server.com/dir/dir/dir/...
and
http://www.server.com/dir1/dir2/dir1/dir2/...

Your case is more complex: there are repetitions, but they are deeper, with more than 2 repeating folder names. I will think about how to detect and avoid such URLs. This approach will be more robust than just skipping the parsing of mismatched extension/MIME-type cases.
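Such deeper repetitions could be detected roughly like this (just a sketch of the idea; the function and threshold are assumptions, not the actual Offline Explorer code):

```python
from urllib.parse import urlparse

def has_folder_loop(url: str, max_repeats: int = 2) -> bool:
    """True when the URL path repeats the same run of folder names
    more than max_repeats times in a row, e.g. /dir/dir/dir/ or
    /a/b/a/b/a/b/."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    for size in range(1, len(parts) // 2 + 1):    # length of repeating unit
        for start in range(len(parts) - size):
            unit = parts[start:start + size]
            repeats, pos = 1, start + size
            # count how many times the unit repeats back-to-back
            while parts[pos:pos + size] == unit:
                repeats += 1
                pos += size
            if repeats > max_repeats:
                return True
    return False

print(has_folder_loop("http://www.server.com/dir/dir/dir/page.html"))  # True
print(has_folder_loop("http://www.server.com/a/b/c/page.html"))        # False
```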

Regarding parsing: I am thinking about adding a viewer for the parsing queue with the ability to pause/resume it, remove unwanted URLs from it, etc. This should be done after the 6.0 release.

Oleg.
FSchr 03/25/2011 08:12 am
Oleg,

I understand that resources named .png or .gif can be perfectly valid HTML files, but that is not the problem.

Maybe there's a misunderstanding here; let me explain it more concretely:

Project address is: http://el-abba.org/
File filters (Text) says: Load using URL filter settings
URL filters (Server) says: Load files only within the Starting Server
File filters (Image) says: Load from any Site
File filters (Archive) says: Load from any Site

So, let's say:
http://el-abba.org/index.html is pointing to http://example.org/image.gif

Now what I expect is for the program to download image.gif, but not to follow links inside image.gif, as it obviously currently does!
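In other words, downloading and parsing should be two separate decisions. My expectation, restated as code (a hypothetical simplification of the filter settings above, not OE's real logic):

```python
from urllib.parse import urlparse

STARTING_SERVER = "el-abba.org"   # the project address above

def decide(url: str, content_type: str):
    """Return (download, parse) under the settings above
    (hypothetical simplification)."""
    is_html = content_type.split(";")[0].strip().lower() == "text/html"
    on_start = urlparse(url).hostname == STARTING_SERVER
    if is_html:
        # Text is loaded only within the starting server,
        # and only text/html is ever parsed.
        return (on_start, on_start)
    # Images and archives may be loaded from any site, but never parsed.
    return (True, False)

print(decide("http://example.org/image.gif", "image/gif"))    # (True, False)
print(decide("http://el-abba.org/index.html", "text/html"))   # (True, True)
```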

Regards,
Frank
Oleg Chernavin 03/25/2011 10:12 am
Oh, yes, now I understand. You are correct - it should be improved! I will work on this and let you know when it's done.

Oleg.
Oleg Chernavin 04/01/2011 10:01 am
I finally added this workaround. Here is the updated oe.exe file:

http://www.metaproducts.com/download/betas/OEP3340.ZIP

Please let me know how it works. Thank you!

Oleg.
FSchr 04/04/2011 11:48 am
Great! This now works as expected.

Thank you.

Regards,
Frank

P.S.: I still have two other problems; I'm trying to find a test case for each of them.
Oleg Chernavin 04/04/2011 12:20 pm
OK. I will wait for the details to reproduce.

Oleg.