OEE Downloads PICS... but most aren't Viewable.

Author Message
lee 07/16/2004 02:00 pm
Hello... I'm trying to download all the files related to this page, but only those on the server itself, not "outside" links.
http://web.archive.org/web/20030419064753/www.restorationhistory.com/&ps.html
Note also that it's a Wayback Machine archive. The site is no longer on the net.
My settings are
a. Do not download existing files
b. No Level limits
c. Directory Filter of "Include" (restorationhistory)

Three problems that I can tell.

1. All the pics seem to download, but the vast majority of them are NOT viewable. The file is there and seems to be the right size, but nothing shows. It seems that for some reason most of them are not fetched correctly or are corrupted in some form.
There are a small number of pics, however, that do work.

2. Some of the files from the SAME URL seem to be downloaded several times.
The one that I've noticed is this one: http://web.archive.org/web/20030609035118/www.restorationhistory.com/rh/other-parallels-&-polemical-issues.html
It appears to be a large page, so I wondered: is it just using multiple threads, or starting over and over again in order to download it?

3. The "Do not download existing files" setting doesn't seem to work very well, because files that are already in other directories seem to be redownloaded, such as the above file. I've noticed it being downloaded from at least two different Wayback Machine archives (URL locations), and the length of the files is the same.

Finally, is there a cleaner and more efficient way I can download all this stuff, instead of having a million different directories?

Anyway, you have a neat product, but this is a really big issue for me if I can't download the images.
I've also noticed that the free program HTTrack doesn't download the pics at all.
The images are a significant visual key that provides considerable evidence for what is written on the pages.

Thank you.... :)
Oleg Chernavin 07/16/2004 04:48 pm
1. Can you please give me a few examples of that: a page URL that contains a corrupted image, and the image URL itself? I downloaded some files and noticed that the site sometimes reports that a file was not archived.

2. This is possible if the server breaks off the file download at some point and Offline Explorer redownloads it to get the complete file. It is also possible that a file with the same name is located in another directory on that site. These numbers are really confusing!

3. This feature affects the download only after you have completed the first download and want to update the offline copy of the site.

You can simplify the resulting directory structure by using the URL Substitutes feature in the Properties dialog | Advanced section. For example, you can have the following rule there:

URL:
http://web.archive.org/*
Replace:
/web/*/
With - keep this field empty. This will remove the numeric part from the file paths. Please add the rule and uncheck it, so that Offline Explorer loads URLs as they are, but renames files when it saves them. Links in downloaded files will be changed to reflect the new filenames.

I hope this helps.

Best regards,
Oleg Chernavin
MP Staff
lee 07/16/2004 06:14 pm
> 1. Can you please tell me few examples of that - a page URL that contains a corrupted image and the image URL itself. I made download of some files and I noticed that the site sometimes reports that a file was not archived.
>

The original URL I gave you above is one of the pages. For example, the main image at the top is downloaded, but it is blank when you try to view it or open the page after downloading.
Here is the URL of a page and an image that DOES download and is viewable, if you wish to compare it with the ones above that don't.
http://web.archive.org/web/20000823225814/www.restorationhistory.com/
http://web.archive.org/web/20000823225814/http://www.restorationhistory.com/Christ.gif

> 2. It is possible if the server breaks the file download at some moment and Offline Explorer redownloads it to get the complete file. Also, it is possible that the file with the same name is located on another directory on that site. These numbers are really confusing!
>

Yes, you sound right on both counts....

> 3. This feature affects the download when you have completed the first download and want to update the site offline.
>

Oh ya, I knew that, but forgot. :)

> You can simplify the resulting directories structure by using URL Substitutes feature in the Properties dialog | Advanced section. For example, you can have the following rule there:
>
> URL:
> http://web.archive.org/*
> Replace:
> /web/*/
> With - keep this field empty. This will remove that numbers part from the files. Please add the rule and uncheck it, so that Offline Explorer loads URLs as they are, but renames files when it saves them. Links in downloaded files will be changed to reflect new filenames.
>
> I hope this helps.
>
> Best regards,
> Oleg Chernavin
> MP Staff
>

Thanks... I'll see what happens. By the way, do I leave my settings the same as I have above, other than maybe using "Download All Files" instead?
lee 07/16/2004 06:33 pm
My folders names look like this when downloaded....

web.archive.orgwww.restorationhistory.com

Do I need to add TWO // to make it interpret correctly after .org???
lee 07/16/2004 06:46 pm
Well, I discovered something....
Most of the images that belong in the www.restorationhistory.com folder have been going into a second folder called restorationhistory.com. At the moment both of course have web.archive.org at the front of their names, such as web.archive.orgrestorationhistory.com
lee 07/16/2004 06:48 pm
Oh... Forgot to say that there are images in BOTH folders under the same file name and, I think, size, but ONLY in the restorationhistory.com folder are the images actually VIEWABLE, except for one at the moment: DC21.gif
lee 07/16/2004 07:00 pm
Let me say also that your help has been very valuable, and it's all downloading much more cleanly now.
This info is invaluable. It's something like 18 years of constant research, and I don't think he got around to publishing any of it yet. He was about ready to, so I think something may have happened to him. No contact or response, site went down, etc.

So thank you so much.
Any other thoughts?

Oh... I just realized something though. By making everything go into ONE directory, I'm overwriting the file that's there already, right?
Uh oh... How do I fix that? I don't want to overwrite the NEWER writings with the old ones.
Remember, these are archives "grabbed" at different dates over a three year period.

What say ye?
Oleg Chernavin 07/19/2004 07:51 am
Sorry, you need to place:

/
in the With field.
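
To illustrate the difference between an empty With field and `/`, here is a minimal Python sketch of what the rule does to a saved path. It is only an approximation: the function name `apply_rule` is hypothetical, and treating `*` as "any run of characters up to the next slash" is an assumption about how Offline Explorer matches wildcards, not documented behaviour.

```python
import re

def apply_rule(url: str, replace_pattern: str, with_text: str) -> str:
    """Replace the first match of a wildcard pattern in url.

    Assumption: '*' matches any run of characters other than '/'.
    """
    # Escape the literal parts of the pattern, then turn the wildcard into a regex.
    regex = re.escape(replace_pattern).replace(r"\*", "[^/]*")
    # A function is used as the replacement so with_text is inserted literally.
    return re.sub(regex, lambda _match: with_text, url, count=1)

url = "http://web.archive.org/web/20030419064753/www.restorationhistory.com/&ps.html"

empty_with = apply_rule(url, "/web/*/", "")   # With field left empty
slash_with = apply_rule(url, "/web/*/", "/")  # With field set to "/"

print(empty_with)
print(slash_with)
```

With the empty With field the host and the site name run together, which matches the folder name web.archive.orgwww.restorationhistory.com reported earlier in the thread; with `/` the site name stays a separate folder.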

Regarding images not being loaded - Christ.gif was loaded fine, but a few other images on the same page are missing. If you try to browse to them directly:

http://web.archive.org/web/20000823225814/http://www.restorationhistory.com/descent%202.gif

You will get a page with an error "The page is not in the archive".

Oleg.
leeuniverse 07/20/2004 06:27 pm
> Sorry, you need to place:
>
> /
> in the With field.

Thanks.... :)

> Regarding images not being loaded - the Christ.gif was loaded well, but few other images on the same page are missing. If you try to browse to them directly:
>
> http://web.archive.org/web/20000823225814/http://www.restorationhistory.com/descent%202.gif
>
> You will get a page with an error "The page is not in the archive".
>
> Oleg.

1. Actually, you are looking at the wrong page... You need to download the page that I gave in my first post all the way at the top (the site is designed very oddly).
I simply gave the above link as an example of a page that DID download an image that was viewable, and not just a blank file.

Once you see what I meant, read my following posts. In them I mention that viewable images DO end up downloading, but they are created in a new folder called "restorationhistory.com" instead of the folder "www.restorationhistory.com", which has all the downloaded pages AND the SAME images as the other folder, full sized, but blank when viewed. The viewable images that are in the other folder SHOULD be in this folder with the pages. There are images in this folder, but all are blank.

2. Also, as I mentioned above, how can I download ONLY the newest files, plus files from older archives that don't exist in the newer ones, without OVERWRITING the newest versions of particular pages?
What is occurring now is that I'm downloading several YEARS of archives of the site, and the newer archived pages are getting overwritten by older versions of the same page.
Oleg Chernavin 07/21/2004 06:47 am
> 1. Actually, you are looking at the wrong page... You need to download the page that I gave in my first post all the way at the top (the site is designed very oddly).
> I simply gave the above link as an example of a page that DID download an image that was viewable, and not just a blank file.
>
> Once you realize what I meant, then read my following posts, and you will see me mentioning that viewable images DO end up downloading but they are created in a new folder called "restorationhistory.com", instead of the downloaded file folder "www.restorationhistory.com" which has all the downloaded pages AND also the SAME images as in the other folder, full sized, but "blank" when viewed. The viewable images that are in the other folder SHOULD be in this folder with the pages. There are images in the folder, but all are "blank".

I am sorry, I probably got mixed up. If the only problem with the images is that they are in another folder, then it is because the server uses links to both restorationhistory.com and www.restorationhistory.com.

This can be easily corrected with one more URL Substitutes rule:

URL:
*restorationhistory.com*
Replace:
/restorationhistory.com/
With:
/www.restorationhistory.com/

You will also have to check the "Apply all matching rules" box there.
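
As a rough illustration of the two rules working together with "Apply all matching rules" checked, the following Python sketch rewrites both host spellings to one folder. The helper `normalize` is hypothetical and only mimics the effect with plain string operations; it is not how Offline Explorer implements URL Substitutes. The second test URL (the bare-host image link) is made up for the example.

```python
import re

def normalize(url: str) -> str:
    """Mimic both substitution rules applied one after the other."""
    # Rule 1: drop the /web/<number>/ part (With field set to "/").
    url = re.sub(r"/web/[^/]*/", "/", url, count=1)
    # Rule 2: map the bare host folder onto the www. one.
    # URLs that already use www. contain ".restorationhistory.com/",
    # not "/restorationhistory.com/", so they are left alone.
    return url.replace("/restorationhistory.com/", "/www.restorationhistory.com/")

page = normalize(
    "http://web.archive.org/web/20030419064753/www.restorationhistory.com/&ps.html"
)
# Hypothetical bare-host link, used only to show rule 2 firing.
image = normalize(
    "http://web.archive.org/web/20000823225814/restorationhistory.com/Christ.gif"
)

print(page)
print(image)
```

Both results end up under the same www.restorationhistory.com folder, so pages and the images they reference are saved side by side.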

> 2. Also, as I mentioned above, how can I ONLY download the newest files and files that don`t exhist from other "older" archives, but yet not OVERWRITE the newest versions of the particular pages?
> Because what is occuring now is that I`m downloading several YEARS of "archives" of the site, and the newer archived pages are getting OverWritten by older versions of the same page.

If you want to keep the various versions of the files, then you need to stop using URL Substitutes and accept the complex folder structure. These folders identify the archived version each file came from and they make all files unique, so you don't have to worry about losing some version of a certain file.
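
Because the numbered folder is the only thing that tells versions apart, one possible way to merge everything into a single tree without older snapshots replacing newer ones is to pick, for each original path, the capture with the highest number. The Python sketch below is a post-processing idea, not an Offline Explorer feature. `split_wayback` and `newest_per_path` are hypothetical names, the 14-digit number layout is taken from the URLs in this thread, and the second Christ.gif capture is invented for the example.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# /web/<14 digits>/ followed by the original address, with or without "http://".
WAYBACK = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(?:http://)?(.+)$")

def split_wayback(url: str) -> Optional[Tuple[str, str]]:
    """Return (number, original address) or None if url is not an archive URL."""
    match = WAYBACK.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)

def newest_per_path(urls: Iterable[str]) -> Dict[str, str]:
    """Map each original address to the archive URL with the highest number."""
    best: Dict[str, Tuple[str, str]] = {}
    for url in urls:
        parts = split_wayback(url)
        if parts is None:
            continue
        number, original = parts
        # The numbers have a fixed width, so string comparison orders them.
        if original not in best or number > best[original][0]:
            best[original] = (number, url)
    return {original: url for original, (_number, url) in best.items()}

captures = [
    "http://web.archive.org/web/20000823225814/http://www.restorationhistory.com/Christ.gif",
    # Invented later capture of the same file, for illustration only.
    "http://web.archive.org/web/20030609035118/www.restorationhistory.com/Christ.gif",
    "http://web.archive.org/web/20030419064753/www.restorationhistory.com/&ps.html",
]
chosen = newest_per_path(captures)
```

Files that exist only in an older archive are still kept, since they are the sole capture for their path.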

Oleg.
leeuniverse 07/21/2004 01:47 pm
Okay.... Great, I will try all this. :)

As for the existing files issue, wouldn't it be pretty easy to add a "file duplication" check, i.e. for NOT downloading files with the SAME NAME?
You know, the same way an FTP download works, with options to Overwrite, Resume, or Do Nothing when a file with the same name is encountered?

It would be great if this could be implemented, because you have no idea how many individual directories are downloaded.
So I definitely want to keep the URL Substitutes so that everything can go into one directory.
I'm sure you could pretty easily add a file name existence check in there...?

What think ye? :)
leeuniverse 07/21/2004 02:01 pm
Oh... And the reason for this is that I want to get files that may have been taken off the website in the past and may not exist in the newer archives. But I don't want the old files that also exist in the newer archives to overwrite the newer ones.

By the way, one thing about the images being downloaded into the "other" directory (restorationhistory.com): this "other" directory is NOT created if you download just the first archive or just the one page alone.
But when you start downloading lots of archives, this other directory is created, with the files apparently coming from PAST archives and not the recent one.
The newest archive downloads the full sized images (so there are images downloaded), but they are all BLANK when viewed, except a couple.
That's the distinction.

Test download the page (or just the most recent archive) yourself and you will see all the images downloaded, but most of them are blank.

What do you think?
Oleg Chernavin 07/22/2004 07:39 am
OK. Is the:

http://web.archive.org/web/20030419064753/www.restorationhistory.com/&ps.html

latest version of the page?

I just loaded it and all images are there. If you don`t see them properly, does the Export fix it?

Oleg.
leeuniverse 07/22/2004 03:29 pm
Good News...... :)
Exporting does apparently make all the images viewable on the page; they just aren't in the same directory, of course. That is without using the URL Substitutes.

Also, when using the URL Substitutes, everything goes into one directory and works perfectly without even having to export.
GREAT JOB!!! :) You're the man.

Now, what about the other issue, or if not, if it could be a Feature Request?

I am grateful for your work and patience in helping. :)
leeuniverse 07/22/2004 03:43 pm
Oh... I had a workaround idea for the other issue, and thought I would run it by you to see if it would work.
What if I download ONLY the most recent site archive, and then download again and tell it not to download files already downloaded?
Will that work, or will the newer files still be overwritten because the older ones come from older, different archives and are likely different file sizes?

I want to try to get files from the older archives which no longer exist in the newest one.
Oleg Chernavin 07/22/2004 04:52 pm
I think that files will not be overwritten in any case (if you don't use URL Substitutes), because each file version has its own path.

You can also download the newest version, export it to one directory and keep it there. Then download another version, export it to some other directory, etc.

This may work for you, although it is not that automatic. But you will be able to use the same Project, or copy/paste the Project with all its settings and just change the Project URL in the new copy, so you don't have to set up all the Project settings again.

Oleg.
leeuniverse 07/22/2004 06:08 pm
Yes... I understand all this, but there are too many archives to do individually. I was wondering if you could add in the next version at least a "Do nothing if file NAME already exists" feature, as a filter for existing downloaded files or something?
(Overwrite, Rename etc. could also be added)

You've already got that kind of thing partially for the downloading part, but I guess the URL Substitutes break that feature?
I don't know your program well enough yet, but can you maybe see if you can add such a feature?

I think there would be other benefits of this for people besides my particular case.
I'm sure other people encounter duplicate files in other directories that they don't want overwriting their first download of that file.

Thanks.... :)
leeuniverse 07/22/2004 07:56 pm
Dang it... I thought I had finally got this thing to download the entire site with your help.
The thing is, it does, but during the process BOTH images AND HTML files are being replaced by "blank" files.
So about half of my files work, and the other half are dead files.

During the download it seemed all the files were being fetched, but when I stopped watching to let it finish and came back, I saw that several of the files had been replaced with "blank" ones.
SHOOOOOT!!! :(
So it looks like it does that "blank" thing with both images and HTML files.

Man, this is too much of a pain... Got to try to think of another way to do things.
leeuniverse 07/22/2004 08:07 pm
Well, I'm thinking the only way to fix this is to NOT use URL Substitutes???
So, I`ll try that and see if it works.
Oleg Chernavin 07/23/2004 05:30 am
> Yes... I understand all this, too many archives to do individually though.
> But I was wondering if you could add in the next version at least a (Do
> Nothing if File NAME already Exists) feature as an exhisting downloaded
> files "Filter" or something? (Overwrite, Rename etc. could also be added)

These features already exist. "Do nothing" is "Do not load existing files"; Overwrite is "Download only modified and new files"; for Rename, look in the File Copies section of the Project Properties dialog.
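
For readers unsure what the three behaviours amount to, here is a simplified Python sketch of a "same file name already exists" decision. It is a conceptual model only: `OnExisting` and `resolve_target` are hypothetical names, the `_1`, `_2` rename suffix is an assumption, and the real Overwrite option described above replaces only modified files, which this sketch does not check.

```python
import tempfile
from enum import Enum
from pathlib import Path
from typing import Optional

class OnExisting(Enum):
    SKIP = "skip"            # do nothing if the name is taken
    OVERWRITE = "overwrite"  # replace the file that is there
    RENAME = "rename"        # keep both, saving the new one under a new name

def resolve_target(path: Path, policy: OnExisting) -> Optional[Path]:
    """Return the path to save to, or None when the download should be skipped."""
    if not path.exists():
        return path
    if policy is OnExisting.SKIP:
        return None
    if policy is OnExisting.OVERWRITE:
        return path
    # RENAME: try name_1.ext, name_2.ext, ... until a free name is found.
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "Christ.gif"
    existing.write_bytes(b"newer version")
    skip_result = resolve_target(existing, OnExisting.SKIP)
    overwrite_name = resolve_target(existing, OnExisting.OVERWRITE).name
    rename_name = resolve_target(existing, OnExisting.RENAME).name
```

In the demonstration the skip policy leaves the existing file untouched, while the rename policy would save the second copy next to it under a different name.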

These blank files are actually redirects from one URL to another. If you export the Project, they will be replaced with direct links to the files. You can also browse the site inside Offline Explorer Pro and it will process these files correctly.

Oleg.
leeuniverse 07/23/2004 04:20 pm
Okay, you lost me now.....

Could you now list, step by step, what every setting should be to download it correctly from beginning to end?
Might be easier now that we`ve gone through all that. :)
Oleg Chernavin 07/23/2004 04:38 pm
My settings were very simple for that Project:

a. URL - any URL of that site
b. Level - unchecked
c. URL Filters | Images & User Defined - "Load from any site" in their Location boxes
d. URL Filters | Server - load from the starting server
e. URL Filters | Directory - Custom Configuration with "restorationhistory" in the Included keywords list
f. URL Filters | Filename - Load all filenames

That`s all! The above loaded the whole archived site with all images and this is browseable inside OE (with enabled Internal HTTP server in the Options dialog) or after exporting the site.

All the various archived versions of the pages are preserved this way, but the result is messy because of the large number of directories used by that site.
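
The combined effect of these filters can be pictured with a small Python sketch. It is a loose model, not Offline Explorer's logic: `should_load` is a hypothetical function, the list of image extensions is assumed, and it is also an assumption that the "Load from any site" setting lets images bypass the server and directory filters. The example.com URLs are placeholders.

```python
from urllib.parse import urlsplit

START_HOST = "web.archive.org"          # the starting server
KEYWORD = "restorationhistory"          # Directory filter, Included keyword
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")  # assumed image types

def should_load(url: str) -> bool:
    """Approximate the Server, Directory and Images filters described above."""
    parts = urlsplit(url)
    path = parts.path.lower()
    # Images: "Load from any site".
    if path.endswith(IMAGE_EXTENSIONS):
        return True
    # Everything else: same server, and the keyword must appear in the path.
    return parts.hostname == START_HOST and KEYWORD in path

wanted_page = should_load(
    "http://web.archive.org/web/20030419064753/www.restorationhistory.com/&ps.html"
)
wanted_image = should_load(
    "http://web.archive.org/web/20000823225814/http://www.restorationhistory.com/Christ.gif"
)
# Placeholder URLs standing in for "outside" links.
outside_archived = should_load(
    "http://web.archive.org/web/20030419064753/www.example.com/index.html"
)
outside_live = should_load("http://www.example.com/page.html")
```

Note that the keyword test passes for every archive number, which is why this configuration pulls in all archived versions rather than a single one.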

Oleg.
leeuniverse 07/23/2004 05:59 pm
Okay, yes... Those settings are how I also figured out how to download it.
But the number of files it downloads is HUGE (there are LOTS of duplicated files).
If I use the URL Substitutes, it downloads everything into one directory and it's only about 1000 files or so, just the files that exist for the site, but during the download good versions of files are replaced by blank versions, HTML or pic.
Doing it your way seems to download MANY thousands, so obviously I didn't finish downloading it.

There has simply got to be a way, doing it my way (the way you showed me already), to download a file ONCE and not let it be overwritten by another archived file from some other archived directory.
This is what I meant when I asked if you could add a feature that causes files whose names already exist to NOT be redownloaded or replaced.
The feature that does exist in your program seems to work only when a site is already downloaded and you download it again; that's when it doesn't replace files etc.

Does it makes sense what I`m speaking of?

Does it make sense what I`m speaking of now?
leeuniverse 07/23/2004 06:01 pm
Oops... Sorry, repeated myself with the last two sentences above. I spaced out. :)
leeuniverse 07/23/2004 06:55 pm
Well, maybe I was wrong about the "1000`s" thing, I don`t know, I guess it just seemed like that before when I tried it.
So, I`m just using only your setup now and will see what happens. Will return and report soon. :)

I guess this will be alright, I just wanted things cleaner I suppose, and it just scared me that it was downloading blank files before which started this whole thing.
But after all of our experimentation, I suppose it will likely be just fine and good enough.

So, I will let you know how it turns out soon. And if you have any thoughts on my last comments above, please share, especially on whether you could implement the checking thing. :)
leeuniverse 07/25/2004 12:38 am
Turns out I was right... I AM downloading thousands of files, when the site is only about 1000 files large.

So, instead of using "restorationhistory" as a directory name filter, is there some other way?
How about limiting it to the latest archive alone?
I've tried doing that in different ways, level limits etc., but for some reason it doesn't want to download more than about 7 files.
It doesn't make sense to me, because when I look at the site, all the linked files seem to be under the SAME archive number. So I don't understand why OEE isn't following the links, and downloads only 7 files: just the particular page and its non-viewable images.

Why doesn't OEE follow the links? All the pages and images seem to be under the same archive (link) when viewed in the browser, so what's the beef?
Oleg Chernavin 07/26/2004 04:08 am
Can you give me some URLs that you do not want to be loaded? This will allow me to help with the filtering.

Oleg.
lee 07/26/2004 06:27 am
Well, if you look at the main page I want the download to start from, you'll see that all the links are under ONE archive directory number.
But when downloading normally, it seems to download only 7 files: the page and its images.
When downloading your way with the directory keyword, it seems to download about 20,000 files, all in different directory numbers (i.e. archives).
This is even though all the links on the page, the pages they lead to, and their pictures all seem to be under the same archive number as the main page.
But when downloading, a lot of the files seem to be in different archive directories.

I think you mentioned before that the links were like "shadow" links or something?

Anyway, all I would really like to do is download the files that are linked from the page, as well as the files they link to.
I did download once with settings where it seemed to get all the linked files using the URL Substitutes, but it seemed newer files were being overwritten by older or non-viewable ones.
So I had a directory where half the files were viewable and the other half weren't, but it DID seem like ONLY the files I needed were downloaded.

Anyway, it's all confusing to me... I've been trying a bunch of different things, but none of them does it just right.
It looks like I'm almost done downloading every single archive of the site, some 20,000 files.

Well, I just used OEE to browse the site, and it looks like the whole thing is working. I also realized I was wrong about one thing: on a linked page there is a link to another file and it IS under a different archive number directory.
However, some things still seem to be broken. I still have some 800 files to download, so maybe that's why.

Anyway, if you have any other ideas, let me know.
But I would like to know why the Level Limit setting doesn't seem to work. I guess it's because of the "shadow" linking thing???
Anyway, it's weird to me... It shows a link, so I don't know why the links don't download.

blah blah blah.... hee hee :) Think I`m going insane yet?
Oleg Chernavin 07/26/2004 06:51 am
Yes, I am confused as well. Maybe it is better to limit the site by the Level as well?

Oleg.
lee 07/26/2004 12:16 pm
Yes, I've tried that, and if my memory serves me, all that happens is that only 7 files download: the page and its images.
It doesn't even follow the links.
I also can't recall whether the images are even viewable.
Oleg Chernavin 07/27/2004 05:10 am
I would suggest using Level=2 or 3.

Oleg.
lee 07/27/2004 11:35 am
Uh, well I know that. :) Done level 2 already.
Naomi 12/05/2010 05:17 am
Hi, I read the topic with interest. It would be good if a brief guide were reposted here to reflect the changes to the application. I tried to follow the examples, but some of the settings have changed since this was written. I am trying to do this for www dot creweandnantwichlabour dot org dot uk from the web archive.
Oleg Chernavin 12/05/2010 06:35 am
Can you describe the issue in more detail? And give the exact URL here - no need to hide or encrypt it in any way.

Oleg.
Naomi 12/05/2010 07:43 am
thanks, i opened a new thread. any advice would be very much appreciated on that thread. regards, Naomi
Oleg Chernavin 12/05/2010 08:17 am
Yes, I posted there.

Oleg.