Only modified files

Jack 01/25/2005 06:05 am
When I download this site www.oecd.org I always get all of the files, even though I have specified only modified files. I have also tried the modifiedsince parameter but still get all the files.

Any ideas?
Defenestration 01/25/2005 01:14 pm
When I tried it, the html files were always downloaded, but the image files were not downloaded again.

This is because the HTML files are generated on the fly by the Vignette content management system. Annoyingly, the page content is actually the same, apart from Vignette adding the following string:

<!-- Vignette V6 Tue Jan 25 18:56:47 2005 -->

which indicates the date/time the file was generated. It also adds a similar string at the beginning of each file, although that one is the date the content was last modified (i.e. it only changes when the content changes).

Because the image files are not generated on the fly, they always stay the same and so are not downloaded again unless they are modified.

That's the explanation of what's going on.

Because the files are actually the same size though, I would have thought that enabling "Check file size" should stop them from being redownloaded, but it doesn't for some reason. Without checking, I would guess that OE cannot determine the file size and so it has to download the file anyway.
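
For anyone comparing downloaded copies by hand, here is a minimal sketch of ignoring the Vignette generation stamp before deciding whether a page really changed. It is not an Offline Explorer feature: the regular expression is an assumption modelled on the single comment quoted above, and the function names are made up for illustration. Note that the page still has to be downloaded before it can be compared.

```python
import hashlib
import re

# Assumed pattern, based only on the one example comment quoted in this thread.
VIGNETTE_STAMP = re.compile(r"<!--\s*Vignette V\d+ [^>]*?-->")


def content_fingerprint(html: str) -> str:
    """Hash the page text after removing Vignette timestamp comments."""
    stripped = VIGNETTE_STAMP.sub("", html)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def is_really_modified(old_html: str, new_html: str) -> bool:
    """True only if the pages differ in something other than the stamp."""
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```

Two copies that differ only in the generated date/time then compare as unchanged, while any edit to the visible content still shows up as a change.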
Oleg Chernavin 01/26/2005 04:38 am
Yes, many sites (ASP and PHP especially) generate all HTML pages on the fly, so they look newly created every time a browser downloads them, even if their contents are the same. Moreover, the Check File Size option often fails on these pages, because the server doesn't give the file length when Offline Explorer starts loading them.

Best regards,
Oleg Chernavin
MP Staff
Jack 01/26/2005 08:24 am
> Yes, many sites (ASP and PHP especially) generate all HTML pages on the fly, so they look newly created every time a browser downloads them, even if their contents are the same. Moreover, the Check File Size option often fails on these pages, because the server doesn't give the file length when Offline Explorer starts loading them.
>
> Best regards,
> Oleg Chernavin
> MP Staff

Is there an IIS server option that needs to be set to send the size?
Oleg Chernavin 01/26/2005 08:50 am
I do not think that there is such an IIS setting. We had the same problem on our site some time ago, but recently I noticed that our site now returns the file length. Maybe it is related to some of the ISAPI filters we use.

Oleg.
Defenestration 01/26/2005 01:02 pm
Oleg,

It would be cool if there was a "Stop downloading page if it contains above keywords" Content filters feature that worked "on-the-fly", allowing the download of a page to be stopped as soon as any of the keywords are found. For this feature to be really useful though, macros would need to be supported.
With this functionality, the problem experienced by Jack (and other ASP/PHP site problems of this nature) could be reduced to a minimum, because the whole file would not have to be downloaded.
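
To make the on-the-fly idea concrete, here is a rough sketch of scanning a page as it arrives and stopping at the first keyword. It is only an illustration of the proposal, not how Offline Explorer works: the function name and the chunked-input interface are assumptions. A short tail of each chunk is carried over so that a keyword split across two chunks is still found.

```python
from typing import Iterable, Optional, Sequence


def first_keyword_in_stream(
    chunks: Iterable[bytes], keywords: Sequence[bytes]
) -> Optional[bytes]:
    """Return the first keyword seen in the stream, or None if none appears.

    Returning early stops reading from `chunks`, which is the point:
    the rest of the page never has to be fetched.
    """
    longest = max((len(k) for k in keywords), default=0)
    if longest == 0:
        return None
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        best = None  # (position, keyword) of the earliest match in this window
        for kw in keywords:
            if not kw:
                continue
            pos = window.find(kw)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, kw)
        if best is not None:
            return best[1]
        # Keep just enough bytes to catch a keyword spanning two chunks.
        tail = window[-(longest - 1):] if longest > 1 else b""
    return None
```

Whether this saves time in practice depends on where the keywords tend to sit in the page, which is what the benchmarking would show.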

It might be worth benchmarking the download of the www.oecd.org site with and without this feature enabled, to see if the speed benefits are worthwhile (i.e. does the on-the-fly parsing/aborting take longer than the actual download). Each HTML file there is anywhere from a few KB to a couple of hundred KB.
Oleg Chernavin 01/26/2005 05:46 pm
Well, I don't understand how this filtering on the fly would help. What should OE look for in the file beginning?

Oleg.
Defenestration 01/26/2005 09:09 pm
You are right in this case. I hadn't thought it through properly. However, filtering on-the-fly could still be particularly advantageous to people on slower connections, when used in conjunction with the "Do not save any pages that contain keywords" and "Stop downloading pages when keywords found" options. Currently the whole file has to be downloaded anyway, before the file is parsed and a decision is made as to whether to save the file or discard it. If there are a lot of large files this could take a long time, even though only a few might end up being saved. With filtering on-the-fly, the download (of the page or project) would stop as soon as one of the keywords is found, which in the extreme case could be right at the start of the file.

I have a few more ideas related to how the power of Content filters could be improved, although they would require a reworking of the Content filters prefs page, and inner workings. For example, currently all keywords relate to all options. It would be much more powerful to have a list of content filters. When creating a new filter, you would select which option(s) to use, create a list of keywords this filter will act on. The filters could be moved up/down the list (filters are applied in top down order) so that some filters could be given priority over other filters. An example of this method in action would be firewall rules (eg. LooknStop firewall).

You could also have another option to only apply a particular filter if the file doesn't already exist.
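
As a rough sketch of the ordered filter list suggested above (first matching rule wins, as with firewall rules): everything here is hypothetical, including the class name, the "save"/"skip" action strings and the `only_if_new` flag. None of it reflects existing Offline Explorer settings.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ContentFilter:
    keywords: Sequence[str]
    action: str                # hypothetical actions: "save" or "skip"
    only_if_new: bool = False  # apply only when no local copy exists yet


def decide(page_text: str, file_exists: bool,
           filters: Sequence[ContentFilter], default: str = "save") -> str:
    """Walk the filters top-down and return the action of the first match."""
    lowered = page_text.lower()
    for f in filters:
        if f.only_if_new and file_exists:
            continue  # this rule is reserved for files not yet downloaded
        if any(k.lower() in lowered for k in f.keywords):
            return f.action
    return default


rules = [
    ContentFilter(keywords=["press release"], action="save"),
    ContentFilter(keywords=["release"], action="skip", only_if_new=True),
]
```

Moving a rule up or down the list changes which one wins when several match, which is the priority behaviour described above.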

PS. It may just be me having a blonde moment, but aren't the "Save all pages that do not contain the above keywords" and "Do not save any pages that contain the above keywords" options the same ??
Defenestration 01/26/2005 09:11 pm
> PS. It may just be me having a blonde moment, but aren't the "Save all pages that do not contain the above keywords" and "Do not save any pages that contain the above keywords" options the same ??

If they are different, then the first option should read "Save all pages that contain the above keywords" (ie. remove the "do not")
Jack 01/27/2005 09:21 am
> I do not think that there is such an IIS setting. We had the same problem on our site some time ago, but recently I noticed that our site now returns the file length. Maybe it is related to some of the ISAPI filters we use.
>
> Oleg.

Getting back to the last-modified issue: when I look at the OE log I see this GET for a CSS file:

Host www.oecd.org connected. Waiting for http://www.oecd.org/dataoecd/style/oecd_cda_0.css.
GET /dataoecd/style/oecd_cda_0.css HTTP/1.0
If-Modified-Since: Tue, 23 Nov 2004 08:30:45 GMT
Accept: *.*, */*

However for an html page I only see:

Host www.oecd.org connected. Waiting for Http://www.oecd.org/department/0,2688,en_2649_34487_1_1_1_1_1,00.html.
GET /department/0,2688,en_2649_34487_1_1_1_1_1,00.html HTTP/1.0
Accept: *.*, */*


How come I don't see the If-Modified-Since header?


Oleg Chernavin 01/27/2005 10:12 am
The server doesn't return the file modification date for the Web page. Please restart the Project download and you will see that the server returns the following for the CSS file:

HTTP/1.1 200 OK
...
Last-Modified: Tue, 23 Nov 2004 08:32:28 GMT
...

A similar line is not returned by the server for HTML. This is why Offline Explorer cannot check whether the file has changed. You may say that it would be worth adding the If-Modified-Since: <last download date> line in any case, but it is certain that this line would make no difference. If the server doesn't return a file modification date, it will not even look at the If-Modified-Since line for that file.
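
The behaviour described here can be sketched roughly as follows. This is only an illustration of the logic, not Offline Explorer's actual code: the function names are invented, and how OE picks the exact date it sends is not stated in this thread.

```python
from typing import List, Optional


def build_get_request(path: str, stored_last_modified: Optional[str]) -> List[str]:
    """Build request lines; add If-Modified-Since only when a date is known.

    `stored_last_modified` stands for a date remembered from an earlier
    response that carried a Last-Modified header (None if it had none).
    """
    lines = [f"GET {path} HTTP/1.0"]
    if stored_last_modified:
        lines.append(f"If-Modified-Since: {stored_last_modified}")
    lines.append("Accept: *.*, */*")
    return lines


def must_save_body(status_code: int) -> bool:
    """304 Not Modified means the local copy is still current."""
    return status_code != 304


css = build_get_request("/dataoecd/style/oecd_cda_0.css",
                        "Tue, 23 Nov 2004 08:30:45 GMT")
html = build_get_request("/department/0,2688,en_2649_34487_1_1_1_1_1,00.html",
                         None)
```

With these inputs the CSS request carries the If-Modified-Since line and the HTML request does not, matching the two log excerpts quoted earlier.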

Oleg.
Oleg Chernavin 01/27/2005 10:37 am
Regarding content filters. First, "Save all pages that do not contain the above keywords" forces OE to load and save all pages, regardless of whether they contain keywords or not. This was useful for one custom project we had in the past.

"Do not save any pages that contain the above keywords" can be used with the above filter. In this case, when a keyword is found, the page will not be saved; only pages without keywords are saved.

I agree with you that filters could be more flexible and even grouped. We plan to work on this, but so far there have been no real requests showing that it is necessary.

Oleg.