Help with URL macro

Steve Sieloff
01/21/2006 01:18 am
Oleg --

I would appreciate your help here ... I am working with a web site, www.nsopr.gov, that, after accepting terms, presents a map of US states and allows me to look for sexual predators from a global search form ... which in turn queries each state's server and returns state-specific pages and photos. I am trying to set the project to use the URL macro lastname={:a..z}, but OEP loads the first query (lastname=a), processes it correctly, and then consumes the remaining queries (lastname=b .. z) without processing sub pages ... the clue is that the search pages always remain constant.

It is interesting that if I manually submit 26 projects (1 each for lastname=a thru lastname=z) all the files download properly. I have tried Additional=DepthFirst but it does not help ... and I really do not want to have to process 26 projects for each of the 40+ states that have offender data. Please help me get the projects to work using the standard {:a..z} so that all pages and sub pages load and parse properly!!! Here is a sample project setting (I have also tried download only modified files, download all files, etc. but no changes occur):

[Object]
OEVersion=Pro 4.0.0.2314
Type=0
IID=62226
Caption=http://www.nsopr.gov/main_frameset.cfm
URL=http://www.nsopr.gov/main_frameset.cfmPOST=pageid=14&lastname=m&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=NCIgnoreLogOutLinksAdditional=DepthFirstAdditional=ConvertPOSTToFileNameReferer=http://www.nsopr.gov/SetCookie=CFTOKEN=80038330; CFID=2789912; BIGipServeriir=3356928192.20480.0000; JSESSIONID=e4303243af80$B7a8lVi; ACCEPTED_TC=1
Lev=1000001
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FMGroup=1
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=bmpfifgifipxj2cj2kjp2jpegjpglwfpngtiftiffwbmpxbm ooooooooxoooooo
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdf
FTUDef.Exts=jscssssivbsdtdxslswfclass
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RFileBx=2
RFileIn=?srn=main_frameset.cfm?pageid=.jpgresults.cfm xxxx
RProt=127
LastStart=153:193:64:119:65:234:226:64:
LastEnd=123:162:176:121:65:234:226:64:
S200=79
SPar=57
SSav=79
SLast=200
SSiz=1299022
SMdf=60
LFiles=79
LSize=1299022
Flags=1
ImgDim=0,0,0,0
PrevURL=http://www.nsopr.gov/main_frameset.cfm
ExploreSSMaps=True
ParseComplexScripts=True


Thanks,

Steve
Oleg Chernavin
01/23/2006 08:44 am
How about trying:

http://www.nsopr.gov/main_frameset.cfm?pageid=14&lastname={:a..z}&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=NC
IgnoreLogOutLinks
Additional=DepthFirst;ConvertPOSTToFileName
Referer=http://www.nsopr.gov/
Channels=1

I tried, but all downloads result in a 500 error from the server. Perhaps I have to log on there first.
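
To illustrate what the {:a..z} macro in the URL above is expected to do, here is a minimal Python sketch that expands range macros into individual URLs. It is not Offline Explorer's actual implementation: the names expand_range and expand_url are hypothetical, and the inclusive-range behavior is assumed from the "lastname=a thru lastname=z" description earlier in the thread.

```python
import re
from itertools import product

# Matches a range macro such as {:a..z} or {:00..11}.
MACRO = re.compile(r"\{:([^.{}]+)\.\.([^.{}]+)\}")


def expand_range(start, end):
    """Return every value of an inclusive letter or zero-padded numeric range."""
    if start.isdigit() and end.isdigit():
        width = len(start)  # keep the zero padding of the lower bound
        return [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
    if len(start) == 1 and len(end) == 1:
        return [chr(c) for c in range(ord(start), ord(end) + 1)]
    raise ValueError(f"unsupported range: {start}..{end}")


def expand_url(template):
    """Expand every range macro in the template into a list of concrete URLs."""
    # re.split with two groups yields: literal, start, end, literal, start, end, ...
    parts = MACRO.split(template)
    literals = parts[0::3]
    choices = [expand_range(s, e) for s, e in zip(parts[1::3], parts[2::3])]
    urls = []
    for combo in product(*choices):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        urls.append("".join(pieces))
    return urls


template = (
    "http://www.nsopr.gov/main_frameset.cfm?pageid=14&lastname={:a..z}"
    "&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4="
    "&zipcode5=&Submit=Search&state=NC"
)
urls = expand_url(template)
print(len(urls))  # 26
print(urls[0])
```

Each of the 26 resulting URLs corresponds to one of the single-letter queries that work when submitted as separate projects.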

Best regards,
Oleg Chernavin
MP Staff
Steve Sieloff
01/23/2006 11:48 am
Oleg --

You must first browse to http://www.nsopr.gov and click "I Agree" in the internal browser ...

Also, do I keep my file name filters in your suggested solution below?

Steve

Oleg Chernavin
01/24/2006 01:28 pm
All of your settings are intact, except the URLs field. I used the following there:

http://www.nsopr.gov/main_frameset.cfm
POST=pageid=14&lastname=m&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=NC
IgnoreLogOutLinks
Additional=DepthFirst;ConvertPOSTToFileName
Referer=http://www.nsopr.gov/
Channels=1

Please also set Delay between downloads to 1 second and update your oe.exe file to:

http://www.metaproducts.com/download/betas/oep2321.zip

Oleg.
Steve Sieloff
01/24/2006 05:31 pm
Oleg --

I am afraid to say it is still only processing the first valid link with data and then blowing the others away ... in the case of NC, lastname=a yields no names, so OEPro goes to lastname=b ... which yields names/links that are processed properly ... then OEPro goes to lastname=c and (although there are valid links returned for lastname=c) proceeds to load lastname=d, then lastname=e ... etc ... with no other detail files/links being processed. Here are my new Project settings:

[Object]
OEVersion=Pro 4.1.0.2321
Type=0
IID=62228
Caption=http://www.nsopr.gov/main_frameset.cfm
URL=http://www.nsopr.gov/main_frameset.cfmPOST=pageid=14&lastname={:a..z}&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=NCIgnoreLogOutLinksAdditional=DepthFirst;ConvertPOSTToFileNameReferer=http://www.nsopr.gov/Channels=1
Lev=1000001
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=bmpfifgifipxj2cj2kjp2jpegjpglwfpngtiftiffwbmpxbm ooooooooxoooooo
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdf
FTUDef.Exts=jscssssivbsdtdxslswfclass
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RFileBx=2
RFileIn=results.cfmmain_frameset.cfm?pageid=.jpg?srn= xxxx
RProt=127
LastStart=211:56:193:236:182:234:226:64:
LastEnd=3:140:242:213:182:234:226:64:
S200=120
SPar=89
SSav=120
SLast=200
SSiz=1783133
SMdf=94
LFiles=120
LSize=1783133
Flags=1
ImgDim=0,0,0,0
PrevURL=http://www.nsopr.gov/main_frameset.cfm

Perhaps I did not follow your instructions (but I think I did) ... any other ideas?

Thanks for your excellent support!

Steve

Oleg Chernavin
01/25/2006 12:08 pm
I fixed this:

http://www.metaproducts.com/download/betas/oep2323.ZIP

Oleg.
Steve Sieloff
01/25/2006 02:02 pm
Oleg --

This time it only downloaded the detail pages linked to lastname=b and lastname=l -- every other page was parsed but no detail pages loaded, parsed or processed ... I watched the entire run in the queue ... lastname=b did exactly as expected and spawned several detail pages ... then it blew thru the others until lastname=l when it spawned several detail pages and processed them ... then it blew thru to the end ...

Sorry to keep bugging you on this ... seems like it should be a fairly straightforward site to compile ... and I have been using the tool for 3 years but I am stuck!

Thanks,

Steve

Here are my latest project settings:

[Object]
OEVersion=Pro 4.1.0.2323
Type=0
IID=62229
Caption=http://www.nsopr.gov/main_frameset.cfm
URL=http://www.nsopr.gov/main_frameset.cfmPOST=pageid=14&lastname={:a..z}&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=NCIgnoreLogOutLinksAdditional=ConvertPOSTToFileNameAdditional=DepthFirstChannels=1SetCookie=CFID=3644332; BIGipServeriir=3356928192.20480.0000; CFTOKEN=15400558; JSESSIONID=e4302d997694l3$3F$24; ACCEPTED_TC=1Referer=http://www.nsopr.gov/
Lev=1000001
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=bmpfifgifipxj2cj2kjp2jpegjpglwfpngtiftiffwbmpxbm ooooooooxoooooo
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdf
FTUDef.Exts=jscssssivbsdtdxslswfclass
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RFileBx=2
RFileIn=?srn=main_frameset.cfm?pageid=results.cfm.jpg xxxx
RProt=127
LastStart=232:222:195:197:209:234:226:64:
LastEnd=126:208:101:229:209:234:226:64:
S200=246
SAbr=33
SPar=176
SSav=246
SLast=200
SSiz=3786526
SMdf=246
LFiles=246
LSize=3786526
Stopped=True
Flags=1
ImgDim=0,0,0,0
PrevURL=http://www.nsopr.gov/main_frameset.cfm
ParseComplexScripts=True



Oleg Chernavin
01/25/2006 04:00 pm
Can you turn on logs to see what exactly happens there with these main_frameset.cfm... files? Please try to set 10 connections and 2 seconds between downloads in the Options. Keep Channels=1 in the Project. With these settings everything worked OK for me.

Oleg.
Steve Sieloff
01/25/2006 04:59 pm
Oleg --

I watched the log window ... for all the skipped URLs the log does not show any status (like a 200 successful) ... it just counts down the 2 second delay and moves on ... just to show that there are records ... here is what OEP saves for the main page of lastname=t but fails to parse the links to get the data ...

This is in the project folder:

http://www.nsopr.gov/main_frameset.cfm?pageid=14&lastname=t&firstname=&county=&city=&zipcode=&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=NC


Search Results:
5 hits from 1 state (NC), for Last Name Like t

Name                   ST  County       City/Town      Zip Code
PORTER, TEON TERRELL   NC  EDGECOMB     ROCKY MOUNT    27801
KYMER, TERRY SCOTT     NC  CASWELL      PROSPECT HILL  27314
TEW, PERCY RAY         NC  SCOTLAND     LAURINBURG     28352
TEW, KENNETH BLACKMAN  NC  SAMPSON      GODWIN         28344
LINGERFELT, VERNON JR  NC  MECKLENBURG  CHARLOTTE      28208
* Offender is incarcerated, resides in a state other than the state queried, or does not have a known address.


But no sub pages are processed (like they are for lastname=b and lastname=l)

Steve


Oleg Chernavin
01/27/2006 08:09 am
I have better results with your Project settings. However, I noticed that the site mixed up all names. I don't know how to overcome this.

What if you used URL Macros simply to load all possible records for the people directly:

http://www.jus.state.nc.us/ncsor/?srn=0{:00000..99999}S{:00..11}

This will result in many links to try, but you can set up Content Filters to remove all unwanted pages.
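
As a rough sense of how many links the macro above produces, the following Python sketch counts the combinations without expanding them and estimates a lower bound on run time from the delay between downloads. It assumes both ranges are inclusive and that downloads run one at a time (Channels=1); the function names count_urls and estimated_days are hypothetical, and transfer time is ignored.

```python
import re

# Matches a numeric range macro such as {:00000..99999}.
NUMERIC_MACRO = re.compile(r"\{:(\d+)\.\.(\d+)\}")


def count_urls(template):
    """Multiply the sizes of all numeric range macros in the template."""
    total = 1
    for start, end in NUMERIC_MACRO.findall(template):
        total *= int(end) - int(start) + 1  # inclusive range (assumed)
    return total


def estimated_days(url_count, delay_seconds):
    """Lower bound on run time in days, counting only the delay between downloads."""
    return url_count * delay_seconds / 86400


template = "http://www.jus.state.nc.us/ncsor/?srn=0{:00000..99999}S{:00..11}"
total = count_urls(template)
print(total)                               # 1200000
print(round(estimated_days(total, 1), 1))  # 13.9 (1 second delay)
print(round(estimated_days(total, 2), 1))  # 27.8 (2 second delay)
```

Under these assumptions the full range takes roughly two to four weeks for one state, before Content Filters discard the unwanted pages.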

I set up a sample project for you - Tools - Published Projects - Government section.

Also, please make sure that you have directory overload protection in the Options dialog - File Location section. If the directory already contains too many files, it might be better to remove it.

Oleg.
Steve Sieloff
01/28/2006 10:16 pm
Oleg --

I understand the approach ... the issue is that I use the www.nsopr.gov site to compile ALL of the different states' data ... this is kind of a generic search page for me ... I really didn't want to have to go to the individual state's site 1 by 1 ... also, with the range indicated, the downloading will take forever and expose my extraction bot (OEP) on server logs much more ... can you help me understand what is going wrong?

The interesting point is if I create 26 projects for NC (1 for lastname=a thru lastname=z) they run perfectly ... it is just the processing of the queue when more than 1 initial lastname= value is specified (via the macro {:a..z}) that does not work ... OEP works beautifully when I designate 26 different projects!!!!

Given the above, I would prefer 26 projects that download only the actual links over the broad range (00000..99999) due to server log issues and time to download ... is there an easy way to tie the 26 different projects for each state together so that they can run 1 after the other in a semi-automated fashion -- kind of like a SUPER project with the 26 sub-projects inside?
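
Short of a built-in "super project", the 26 per-letter URL blocks could at least be generated rather than typed by hand. The Python sketch below builds one block per last-name initial for a given state, in the layout Oleg posted earlier in the thread (URL, POST= line, then the extra commands). The names url_block and blocks_for_state are hypothetical, and whether Offline Explorer can import such text automatically is not confirmed here; the output is only text to paste into each project's URLs field.

```python
from string import ascii_lowercase

BASE_URL = "http://www.nsopr.gov/main_frameset.cfm"
POST_TEMPLATE = (
    "pageid=14&lastname={letter}&firstname=&county=&city=&zipcode="
    "&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state={state}"
)


def url_block(letter, state):
    """Build the URLs-field text for one last-name initial and one state."""
    lines = [
        BASE_URL,
        "POST=" + POST_TEMPLATE.format(letter=letter, state=state),
        "IgnoreLogOutLinks",
        "Additional=DepthFirst;ConvertPOSTToFileName",
        "Referer=http://www.nsopr.gov/",
        "Channels=1",
    ]
    return "\n".join(lines)


def blocks_for_state(state):
    """Return a dict mapping each letter a..z to its URLs-field text."""
    return {letter: url_block(letter, state) for letter in ascii_lowercase}


blocks = blocks_for_state("NC")
print(len(blocks))  # 26
print(blocks["m"])
```

Calling blocks_for_state once per state code gives the text for every project without retyping the POST line 26 times.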

Thanks again for your help and your assistance with this web site!

PS -- I think I am going to love the Inquiry product ... another awesome piece of programming from you guys! I had been using Macropool's ContentSaver (www.macropool.com) but I think this tool will be better for what I need!

Steve

