How would you go about crawling this page...

Author Message
Len Lydik 10/28/2004 05:14 pm
http://memory.loc.gov/ammem/gmdhtml/gmdhome.html

I need all the .sid images and the html pages of maps.

REALLY appreciate your help!
Oleg Chernavin 10/29/2004 06:25 am
Just allow downloading from all directories on that site. This should help. Uncheck the Level setting to make it unlimited.

Best regards,
Oleg Chernavin
MP Staff
Len Lydik 10/29/2004 11:28 am
I can never seem to get OEP to stay on one site. How can I force OEP to stay only on pages of the "loc.gov" domain, and go nowhere else?
Oleg Chernavin 10/29/2004 11:31 am
Just go to each of the File Filters sections of the Project Properties dialog and change their Location boxes to "Load using URL Filters settings".

Oleg.
Len Lydik 10/29/2004 11:50 am
[Object]
OEVersion=Pro 3.3.0.1788
Type=0
IID=7020
Caption=LOC Maps - ALL
URL=http://memory.loc.gov/ammem/gmdhtml/gmdsubjindex1.htmlhttp://memory.loc.gov/gmd/http://memory.loc.gov/ammem/gmdhtml/http://memory.loc.gov/cgi-bin/query/S?ammem/http://memory.loc.gov/cgi-bin/map_item.pl?data=
Lev=14
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FMGroup=2
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=sid x
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejar
FTUDef.Exts=jscssssivbsdtdxslswf
FTText.B=ooxooo
FTImages.B=xoxxoo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,3,0
RSrvsBx=3
RPathBx=2
RPathIn=gmdgmdhtmlgmd:mapmap_itemhttp://memory.loc.gov/cgi-bin/query/s?ammem/map_item.pl?data=data= xxxxxxxx
RProt=63
LastStart=35:201:173:150:21:178:226:64:
LastEnd=242:194:156:211:44:178:226:64:
S200=1113
S304=2004
S400=21
SAbr=4955
SPar=6281
SSav=1113
SLast=200
SSiz=260343472
SMdf=1110
LFiles=3136
LSize=315819912
CFKeywords=Download MrSID image
ImgDim=0,0,0,0
PrevURL=http://memory.loc.gov/ammem/gmdhtml/gmdsubjindex1.html
Oleg Chernavin 10/29/2004 12:10 pm
I downloaded part of the site and didn't find any link being loaded from another domain. Can you please watch the queue and, when you see an unwanted link, let me know its URL and Referer (right-click the link in the Queue)?

Oleg.
Len Lydik 10/29/2004 12:18 pm
URL: http://www.gomdot.com/maps/County_maps/Perry.pdf
REFERRER: http://www.gomdot.com/maps/county_maps.asp

URL: http://www.inforain.org/maparchive/centralcoastlanduse.htm
REFERRER: http://www.inforain.org/maparchive/

Oleg Chernavin 10/29/2004 03:02 pm
But which page (referer) leads to the http://www.inforain.org/maparchive/ link? I need to know a page on the original domain that has external links to test them with your Project settings.

Oleg.
Len Lydik 10/29/2004 03:34 pm
I don't know. I've probably crawled close to 100,000 pages already.

There are many with external links.

My original question applies, though. How do you tell OEP to follow only links to the *.loc.gov domain?

This is the answer I want.
Oleg Chernavin 10/29/2004 04:07 pm
I looked at your Project settings and they are correct; they should not allow downloading from any other domain. If it still happens, you can try a quick workaround: go to URL Filters | Server, select "Custom Configuration", and add the following to the Included keywords list:

loc.gov

Click the OK button.

Oleg.
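
For readers who want to check a list of queued URLs against the same rule outside the program, here is a minimal Python sketch of a "stay on loc.gov" test. It is not OEP's own code, and OEP's keyword matching may work differently (for example, as a plain substring match). The function name stays_on_domain is hypothetical; the sample URLs are the ones quoted in this thread.

```python
from urllib.parse import urlsplit

ALLOWED_DOMAIN = "loc.gov"  # the keyword suggested in the post above


def stays_on_domain(url: str, domain: str = ALLOWED_DOMAIN) -> bool:
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    # URLs taken from this thread: one wanted, two reported as unwanted.
    for u in (
        "http://memory.loc.gov/ammem/gmdhtml/gmdhome.html",
        "http://www.gomdot.com/maps/County_maps/Perry.pdf",
        "http://www.inforain.org/maparchive/centralcoastlanduse.htm",
    ):
        print(stays_on_domain(u), u)
```

Comparing the host name, rather than searching the whole URL for "loc.gov", avoids accepting an external link that merely mentions loc.gov somewhere in its path or query string.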
Len Lydik 10/29/2004 04:10 pm
...and OEP still seems to wander off course. I'll try it again.
Len Lydik 10/29/2004 04:16 pm
... page:

http://memory.loc.gov/cgi-bin/map_item.pl?data=/home/www/data/gmd/gmd3/g3400/g3400/ct000686.sid&style=gmd&itemLink=r?ammem/gmd:@field(NUMBER+@band(g3400+ct000686))&title=Hudson`s%20Bay`s%20country%20after%20La%20Veranderie,%20about%201740%20%2f%20La%20Veranderie.

Is there a special setting that needs to be configured to save pages that don't end with ".html"?
Oleg Chernavin 10/29/2004 04:28 pm
I see nothing special about this URL. OE should load it fine, and you should be able to browse it offline. What kind of problem do you have?

Oleg.
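
As a side note on the ".html" question, the following Python sketch shows how a URL like the one above breaks down: the path ends in ".pl" (a script), while the ".sid" file name sits inside the "data" query parameter. This only illustrates the URL's structure and says nothing about how OE decides which filter applies. The helper name url_extensions is hypothetical, and the URL is shortened to the first two parameters of the one quoted above.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def url_extensions(url: str) -> tuple[str, str]:
    """Return (extension of the URL path, extension of the 'data' query value)."""
    parts = urlsplit(url)
    path_ext = posixpath.splitext(parts.path)[1]
    # parse_qs maps each parameter name to a list of its values.
    data_value = parse_qs(parts.query).get("data", [""])[0]
    data_ext = posixpath.splitext(data_value)[1]
    return path_ext, data_ext


if __name__ == "__main__":
    # Shortened form of the map_item.pl URL from the post above.
    url = (
        "http://memory.loc.gov/cgi-bin/map_item.pl"
        "?data=/home/www/data/gmd/gmd3/g3400/g3400/ct000686.sid&style=gmd"
    )
    print(url_extensions(url))
```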