Library of Congress Crawl help needed

Author Message
Len Lydik 10/21/2004 02:23 pm
Hi,

With their permission, I've been crawling the Panoramic Photos collection at the Library of Congress. I was able to get the TIFF images, but not the actual HTML pages, which I need to run through TextPipe Pro to make a database.

I need help with the settings so I can get these HTML pages. Below are my project settings:

[Object]
OEVersion=Pro 3.3.0.1788
Type=0
IID=7016
Caption=Panoramic Photos
URL=http://memory.loc.gov/pnp/pan/http://lcweb2.loc.gov/cgi-bin/query/
Lev=8
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=tiftiff xx
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejar
FTUDef.Exts=jscssssivbsdtdxslswf
FTText.B=ooxooo
FTImages.B=ooxxoo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,1000,0,0,0,0,0,0,0,0,0,0,0,3,0,3,0,3,0
RSrvsIn=http://lcweb2.loc.gov/cgi-bin/query/ x
RPathBx=2
RPathIn=~pppan:pan xxx
RFileIn=pan:~pp xx
RProt=63
LastStart=82:190:239:193:205:176:226:64:
LastEnd=105:142:113:107:47:177:226:64:
S200=5238
S400=2
SAbr=15070
SPar=1045
SSav=5238
SLast=200
SSiz=27716672631
SMdf=5235
LFiles=5413
LSize=27717854143
Stopped=True
ImgDim=0,0,0,0
PrevURL=http://memory.loc.gov/pnp/pan/
SkipURLs=http://lcweb.loc.gov/rr/print/pphome.htmlhttp://lcweb2.loc.gov/pp/panabt.htmlhttp://lcweb2.loc.gov/pp/panquery.htmlhttp://lcweb2.loc.gov/pp/pphelp.htmlhttp://lcweb2.loc.gov/pp/pphome.htmlhttp://www.loc.gov/rr/print/tgm1/http://www.loc.gov/rr/print/tgm2/
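
Note for anyone reading the settings above: the URL and SkipURLs values appear here as run-together strings, with no visible separator between the individual addresses. As a minimal sketch only, assuming each entry starts with "http://" or "https://" and contains no embedded scheme of its own, the entries can be pulled apart again like this (the function name split_urls is made up for illustration and is not part of Offline Explorer):

```python
import re

def split_urls(run_together: str) -> list[str]:
    """Split a string of concatenated URLs at each 'http://' or 'https://'.

    Assumption: every entry begins with a scheme and none contains
    another scheme inside it (e.g. as a query parameter).
    """
    # Lazily match from one scheme up to the next scheme or end of string.
    return re.findall(r"https?://.*?(?=https?://|$)", run_together)

# The URL= value from the project settings above, copied verbatim.
project_url = "http://memory.loc.gov/pnp/pan/http://lcweb2.loc.gov/cgi-bin/query/"

for url in split_urls(project_url):
    print(url)
```

Read this way, the project has two starting addresses: the collection index on memory.loc.gov and the query script on lcweb2.loc.gov.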

Oleg Chernavin 10/22/2004 04:24 am
Good. What was the problem?

Best regards,
Oleg Chernavin
MP Staff