Using Offline Explorer Pro to download a site from the Internet Archive Wayback Machine

Author Message
dr john leckenby 07/26/2011 04:17 pm
I have downloaded the following site from the Wayback Machine. All original and backup files and drives were erased. I built and owned this site and very much want to recover it. I do not now own the domain name, as I thought all was lost in 2008 when the site was erased by others, and I let the name lapse. The site is:

http://web.archive.org/web/20080703083241/http://ciadvertising.org/

I have succeeded in downloading 203,770 files with 4.014 GB size.

1. When I try to follow a link from the first page in the Offline Explorer Pro browser, it launches IE (current version), which cannot find the files offline (these links are set to open in a new window in the HTML). I have checked the links, and the files all reside on my hard drive but cannot be displayed. IE just shows 170.... and keeps churning away with no error message.

2. I have tried to follow the recommendations from Oleg to client Naomi in this forum of 12/5/2010 entitled "Please help me with re-constructing a site from Wayback Machine!!! garage-door-specialists.co.uk"

3. Here are the settings I have made for this download in Offline Explorer Pro:

(This site, http://www.ciadvertising.org, was archived by the Internet Archive from 2001 to 2009, so there are many copies there)

URL Filters

URL Exclusions:
http://web.archive.org/web/2001*/*/
http://web.archive.org/web/2002*/*/
http://web.archive.org/web/2003*/*/
http://web.archive.org/web/2004*/*/
http://web.archive.org/web/2005*/*/
http://web.archive.org/web/2006*/*/
http://web.archive.org/web/2007*/*/
http://web.archive.org/web/2009*/*/
http://www.utexas.edu/

Server:

checked load only within this server

staticweb.archive.org
www.utexas.edu
http://web.archive.org/web/2001*/*/
http://web.archive.org/web/2002*/*/
http://web.archive.org/web/2003*/*/
http://web.archive.org/web/2004*/*/
http://web.archive.org/web/2005*/*/
http://web.archive.org/web/2006*/*/
http://web.archive.org/web/2007*/*/
http://web.archive.org/web/2009*/*/
http://www.utexas.edu/

Directory:

unchecked Load files only from starting directory and below

Filename:

nothing done here--used default values

Parsing:

set up a rule to remove the numbers, and unchecked the option to apply it to files

I did not do this quite correctly (I will re-run), as the numbers were not replaced. I tested this rule, and it does remove the numbers (dates of download on the Wayback Machine) from files:

URL:

http://web.archive.org/web/*www.ciadvertising.org

Replace:

http://web.archive.org/web/*/


I greatly appreciate your help, as I thought this site was lost forever and it represents my life's work as an academician (let alone my students' work). As you may know, the download program recommended by the Internet Archive site no longer works with the changed Wayback Machine, and they indicate it will not work until after August 2011.

john l

BTW, when I try to open a .gif, for example, from the offline downloaded content in Photoshop to verify it is on my hard drive, why do I get a message saying it cannot open the format?
Oleg Chernavin 07/26/2011 04:20 pm
Can you please give me exact Project settings? Select it, use Export - Project Settings - Copy and then paste to the forum message.

I will do the download and try to see what is wrong.

Regarding GIF files: yes, the site uses lots of redirects. When you request a URL, it points you to another timestamped version of the file. So many of the downloaded files are small HTML pages with redirections.

You may open them to see the exact location of the GIF and other such files.
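
As a quick way to tell which downloaded .gif files are real images and which are small HTML redirect pages, here is a minimal Python sketch. It is not part of Offline Explorer; the function names are hypothetical, and it only assumes that a genuine GIF starts with the standard GIF87a or GIF89a signature.

```python
from pathlib import Path

# Every valid GIF file starts with one of these two 6-byte signatures.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def classify_header(head: bytes) -> str:
    """Classify a file from its first bytes: 'gif', 'html', or 'unknown'."""
    if head[:6] in GIF_SIGNATURES:
        return "gif"
    lowered = head.lstrip().lower()
    if lowered.startswith(b"<!doctype") or b"<html" in lowered:
        return "html"
    return "unknown"


def find_fake_gifs(root: str) -> list:
    """Return paths of *.gif files under root that do not contain GIF data."""
    fakes = []
    for path in Path(root).rglob("*.gif"):
        with path.open("rb") as fh:
            # Only the first bytes are needed to recognize the file type.
            if classify_header(fh.read(512)) != "gif":
                fakes.append(path)
    return fakes
```

Calling find_fake_gifs on the project's download folder lists the .gif files that are worth opening in a text editor to see where they redirect.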

Best regards,
Oleg Chernavin
MP Staff
Tim 10/26/2011 09:25 am
Hi,

I am trying to do the same thing. Is there any way to make the exported files go into one directory instead of into date-stamped folders?

Many thanks
Oleg Chernavin 10/26/2011 10:15 am
The best way is to use URL Substitutes (Properties - Parsing) and add this rule:

URL:
http://web.archive.org/web/*
Replace:
http://web.archive.org/web/**/*
With:
http://web.archive.org/web/*
Apply to:
Filenames

Then redownload the project and export it.
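
To illustrate the effect this rule is meant to have on filenames, here is a small Python sketch that removes the timestamp segment from a Wayback Machine URL. It is only an approximation for checking results outside the program: the regular expression and the function name are assumptions, and Offline Explorer's own wildcard matching may behave differently.

```python
import re

# Matches the archive prefix plus the capture timestamp, e.g.
# "http://web.archive.org/web/20070819071002/". The optional letters cover
# suffixes such as "im_" that some archive URLs carry (an assumption here).
_WAYBACK_PREFIX = re.compile(r"^(https?://web\.archive\.org/web/)\d{4,14}[a-z_]*/")


def strip_wayback_timestamp(url: str) -> str:
    """Drop the date-stamp folder so all captures map to one directory."""
    return _WAYBACK_PREFIX.sub(r"\1", url, count=1)


print(strip_wayback_timestamp(
    "http://web.archive.org/web/20070819071002/http://www.domain.co.uk/"
))
# prints http://web.archive.org/web/http://www.domain.co.uk/
```

URLs that do not start with the archive prefix are returned unchanged.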

Oleg.
Tim 10/28/2011 08:27 am
Hi,

Thanks, that worked. The only problem is that I'm getting thousands of files and pages from years that I don't want. One site is archived for 2007 and I have this in URL Exclusions:

http://web.archive.org/web/2001*/*/
http://web.archive.org/web/2002*/*/
http://web.archive.org/web/2003*/*/
http://web.archive.org/web/2004*/*/
http://web.archive.org/web/2005*/*/
http://web.archive.org/web/2006*/*/

but it still downloads pages from years earlier than 2007.

Many thanks for your help.
Oleg Chernavin 10/29/2011 02:59 pm
Can you post your settings here? Select the Project, press Ctrl+C on keyboard and paste it in the forum message.

Oleg.
Tim 11/03/2011 04:07 pm
Hi Oleg,

[Object]
OEVersion=Pro 6.0.0.3658
Type=0
IID=7025
Caption=http://web.archive.org/web/20070819071002/http://www.domain.co.uk/
URL=http://web.archive.org/web/20070819071002/http://www.domain.co.uk/
MVer=5
Lev=10
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
LTMethod=1
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwfwebp
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4m4v
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaapeoggm4aaif
FTArchive.Exts=7zziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexeiso
FTUDef.Exts=jsaxdcssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=ooxooo
FTAudio.B=ooxooo
FTArchive.B=ooxooo
FTUDef.B=ooxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RPathIn=www.domain.co.uk x
RProt=255
LastStart=210:95:174:224:153:241:227:64:
LastEnd=172:231:125:225:153:241:227:64:
LastStarted=28/10/2011 19:24:29
LastEnded=28/10/2011 19:24:38
S200=9
SAbr=101
SPar=5
SSav=9
SLast=200
SSiz=267152
SMdf=8
SHTML=9
SSuccDowns=1
LFiles=9
LSize=347024
Stopped=True
Flags=1
SubstsB=aHR0cDovL3dlYi5hcmNoaXZlLm9yZy93ZWIvKglodHRwOi8vd2ViLmFyY2hpdmUub3JnL3dlYi8qKi8qCWh0dHA6Ly93ZWIuYXJjaGl2ZS5vcmcvd2ViLyoJWA0K
ImgDim=0,0,0,0
PrevURL=http://web.archive.org/web/20070819071002/www.domain.co.uk/
SkipURLs=http://web.archive.org/web/2001*/*/http://web.archive.org/web/2002*/*/http://web.archive.org/web/2003*/*/http://web.archive.org/web/2004*/*/http://web.archive.org/web/2005*/*/http://web.archive.org/web/2006*/*/
ConvertRSS=True
Exported=28/10/2011 19:10:24 - D:\directory\domain\
LIndexed=False
IndexFiles=False
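
For anyone reading these settings: the SubstsB value above appears to be Base64 text. The Python sketch below decodes it and splits it on tabs, on the assumption (not confirmed by the program's documentation) that each line is one substitution rule with tab-separated fields. The function name is hypothetical.

```python
import base64
from typing import List

# The SubstsB value copied from the exported settings above.
SUBSTS_B = (
    "aHR0cDovL3dlYi5hcmNoaXZlLm9yZy93ZWIvKglodHRwOi8vd2ViLmFyY2hpdmUub3JnL3dlYi8q"
    "Ki8qCWh0dHA6Ly93ZWIuYXJjaGl2ZS5vcmcvd2ViLyoJWA0K"
)


def decode_substs(value: str) -> List[List[str]]:
    """Decode the Base64 text and split each line into tab-separated fields."""
    text = base64.b64decode(value).decode("ascii")
    return [line.split("\t") for line in text.splitlines() if line]


for rule in decode_substs(SUBSTS_B):
    print(rule)
```

The first three fields match the URL, Replace, and With values of the URL Substitutes rule given earlier in the thread. The meaning of the last field ("X") is not documented here; it may correspond to the "Apply to" option.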


Cheers,

Tim
Oleg Chernavin 11/03/2011 04:09 pm
I see now. Please add these keywords to the URL Filters - Directory - Excluded keywords list.
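
To verify after the download that only the wanted year came through, a short Python sketch like the following can group URLs by the year in their Wayback timestamp. It is independent of Offline Explorer's filter syntax; the function names and the regular expression are assumptions made for illustration.

```python
import re
from typing import List, Optional

# Captures the 4-digit year at the start of the Wayback timestamp.
_WAYBACK_YEAR = re.compile(r"^https?://web\.archive\.org/web/(\d{4})\d*[a-z_]*/")


def wayback_year(url: str) -> Optional[int]:
    """Return the capture year of a Wayback URL, or None if it has none."""
    match = _WAYBACK_YEAR.match(url)
    return int(match.group(1)) if match else None


def keep_only_year(urls: List[str], year: int) -> List[str]:
    """Keep only the URLs captured in the given year."""
    return [u for u in urls if wayback_year(u) == year]


urls = [
    "http://web.archive.org/web/20070819071002/http://www.domain.co.uk/",
    "http://web.archive.org/web/20050312000000/http://www.domain.co.uk/about.html",
]
print(keep_only_year(urls, 2007))
# prints ['http://web.archive.org/web/20070819071002/http://www.domain.co.uk/']
```

Any URL left over from another year points to a filter that still needs adjusting.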

Oleg.