Help with URL macros -- part 2

Author Message
Steve Sieloff 02/01/2006 07:10 pm
Oleg --

I hate to be bothersome, but here is an attempted download of the CA sexual offender data from www.nsopr.gov ... it loads the initial zip in the range but then bypasses all the others ... it looks like (from a proxy server) they initially call the link containing http://www.nsopr.gov/main_frameset.cfm?pageid=16&qid=291 where qid represents the row number of the record set (beginning at 1 if any records were returned) ... this is then followed by a fetch of the link containing http://www.nsopr.gov/results.cfm?r=57458176989 which represents the actual result set id (I think) ... within this there are several detail records ... a good zip code to test is 92101, which has several hundred detail links ...

Once again I am perplexed with this site, as it is not allowing me to automate any type of URL macro fetching ... it seems to want to force me to do a single term search at a time rather than accepting all the range values automatically ... it doesn't help using DepthFirst (even though this is needed, as each result set "reuses" the qid= numbers for each search) ... it still doesn't get past the first URL without skipping past all the others in the queue ... here is the latest attempt

the base site is www.nsopr.gov ... click "I agree" in internal browser ... select state = CA and zip code = 92101 ... this is how the project was created and I then added the zip range of 92081..96199
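
For readers unfamiliar with the range macro used in the project below, here is a minimal Python sketch of what a macro such as {:92081..96199} presumably expands to: one POST body per zip code. The function name and the assumption that both ends of the range are included are illustrative only; this is not Offline Explorer's actual implementation.

```python
import re

# Matches a numeric range macro of the form {:START..END}, as seen in the project URL.
RANGE_MACRO = re.compile(r"\{:(\d+)\.\.(\d+)\}")

def expand_range_macro(template):
    """Return one string per value of the first {:START..END} macro.

    Assumption: the range includes both START and END.
    """
    match = RANGE_MACRO.search(template)
    if match is None:
        return [template]  # no macro present, nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = template[:match.start()], template[match.end():]
    return [prefix + str(n) + suffix for n in range(start, end + 1)]

# A short range keeps the example readable.
bodies = expand_range_macro("zipcode={:92101..92103}&state=CA")
for body in bodies:
    print(body)
```

Each expanded body becomes a separate entry in the download queue, which is why the behavior of the second and later entries matters so much here.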

[Object]
OEVersion=Pro 4.1.0.2326
Type=0
IID=7024
Caption=http://www.nsopr.gov/search_frameresults.cfm
URL=http://www.nsopr.gov/search_frameresults.cfmPOST=lastname=&firstname=&county=&city=&zipcode={:92081..96199}&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=CAIgnoreLogOutLinksAdditional=ConvertPOSTToFileNameAdditional=DepthFirstReferer=http://www.nsopr.gov/SetCookie=CFID=3880567; BIGipServeriir=3390482624.20480.0000; CFTOKEN=57657382; JSESSIONID=46303f7e5ce7$7C$3Fj$; ACCEPTED_TC=1
Lev=1000001
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=bmpfifgifipxj2cj2kjp2jpegjpglwfpngtiftiffwbmpxbm ooooooooxoooooo
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdf
FTUDef.Exts=jscssssivbsdtdxslswfclass
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RPathBx=2
RPathIn=cgimaps xx
RFileBx=2
RFileIn=prosoma.dll?searchby=offender&id=.jpgsearch_frameset.cfm?pageid xxx
RProt=127
LastStart=192:28:100:4:176:235:226:64:
LastEnd=79:215:20:5:176:235:226:64:
S200=13
SAbr=4106
SPar=13
SSav=13
SLast=200
SSiz=56108
SMdf=13
LFiles=13
LSize=56108
Stopped=True
ImgDim=0,0,0,0
PrevURL=http://www.nsopr.gov/search_frameresults.cfm
Oleg Chernavin 02/02/2006 12:45 pm
No problem! I started working to reproduce this, but I found that the server doesn't like the link:

http://www.nsopr.gov/search_frameresults.cfm

It returns an error page for every request.

Oleg.
Steve Sieloff 02/02/2006 10:27 pm
Oleg --

I just started from scratch ... this project loads from 92101 thru 96199 ... it loads 92101 fine (all 291 records, offender photos ... everything as expected) ... then for each subsequent URL in the queue (zipcode=92102 .. 92103 .. 92104 .. etc) all OE Pro does is load the URL with the next zip, load the link that contains the string results.cfm?r= where r= appears to be a record set ... but then it goes on to the next zip code ... it does not appear to load the results.cfm?r= and process subsequent records/pages ... in looking at this thru proxy packets ... it appears that the site requires a sequencing of URLs in succession to process properly ... again, if I break each zip code into its own project the process works fine ... but I am not looking to have a unique project for each zip code in CA (90001 thru 96199) ... YIKES!!! Here is the project info that processes 92101 but then begins to skip thru the remainder of the links ...

URL [Object]
OEVersion=Pro 4.1.0.2323
Type=0
IID=62235
Caption=http://www.nsopr.gov/main_frameset.cfm
URL=http://www.nsopr.gov/main_frameset.cfmPOST=pageid=14&lastname=&firstname=&county=&city=&zipcode={:92101..96199}&zipcode2=&zipcode3=&zipcode4=&zipcode5=&Submit=Search&state=CAIgnoreLogOutLinksAdditional=ConvertPOSTToFileNameAdditional=DepthFirstChannels=1Referer=http://www.nsopr.gov/SetCookie=CFTOKEN=57487749; BIGipServeriir=3390482624.20480.0000; CFID=3943834; JSESSIONID=46304f462221$19D$14w; ACCEPTED_TC=1
Lev=1000001
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdf
FTUDef.Exts=jscssssivbsdtdxslswfclass
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=ooxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RPathIn=cgimaps xx
RFileBx=2
RFileIn=main_frameset.cfm?pageid=.jpgprosoma.dll?searchby=offender&id=results.cfm?r= xxxx
RProt=127
LastStart=168:214:252:29:221:235:226:64:
LastEnd=249:122:91:175:221:235:226:64:
S200=861
SAbr=4094
SPar=599
SSav=861
SLast=200
SSiz=7359333
SMdf=861
LFiles=861
LSize=7359333
Stopped=True
Flags=1
ImgDim=0,0,0,0
PrevURL=http://www.nsopr.gov/main_frameset.cfm
Steve Sieloff 02/05/2006 12:00 pm
Oleg --

Can you provide any further assistance?

Oleg Chernavin 02/05/2006 03:46 pm
Steve,

I was overloaded with work, which is why I have been so slow to help you. I will try to work on it this Monday.

Oleg.
Oleg Chernavin 02/07/2006 12:18 pm
Steve,

I found out why this happens. Every search results page contains links like:

http://www.nsopr.gov/main_frameset.cfm?pageid=16&qid=1
http://www.nsopr.gov/main_frameset.cfm?pageid=16&qid=2
...
http://www.nsopr.gov/main_frameset.cfm?pageid=16&qid=99

These links are identical on every results page that is loaded. This is why Offline Explorer loads them from the first page and skips them on all the others: they were already loaded. However, since the contents behind these links are different for each search, you need to get them again and again.

This can be done by running Offline Explorer from the command line this way:

oe.exe /NoURLs

This will force it to load the same URLs again when it finds them.

Oleg.
Steve Sieloff 02/07/2006 11:24 pm
Oleg --

You are the master! It is working perfectly using your command line option of /nourls ... can you please tell me what this is doing ... I couldn't find the option in the help files ...

Thanks a ton!!!

Steve
Oleg Chernavin 02/08/2006 04:13 am
This option allows the same links to be loaded again and again. Suppose two pages on a site contain a link to the same image. Offline Explorer keeps a list of all links that were followed, so once it loads the image from page A, the same image will not be followed (loaded) from page B during one download session of a Project.

/NoURLs disables this list.
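
The effect can be illustrated with a small Python sketch of a crawler that keeps a list of followed links. It is only a model of the behavior described above, not Offline Explorer's code: the crawl function, the track_followed flag, and the shortened URLs in SITE are all hypothetical.

```python
from collections import deque

def crawl(start_urls, links_on, track_followed=True):
    """Return every URL fetched, in order.

    links_on maps a URL to the links found on that page (a made-up site model).
    track_followed=False models the effect described for /NoURLs: no followed list.
    Note: without the list, a site with link cycles would never finish;
    the model below has none.
    """
    followed = set()
    fetched = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if track_followed:
            if url in followed:
                continue  # already loaded once in this session, so skip it
            followed.add(url)
        fetched.append(url)
        queue.extend(links_on.get(url, []))
    return fetched

# Two searches whose result pages link to the very same qid URLs.
QID_LINKS = ["main_frameset.cfm?pageid=16&qid=1", "main_frameset.cfm?pageid=16&qid=2"]
SITE = {
    "search?zipcode=92101": QID_LINKS,
    "search?zipcode=92102": QID_LINKS,
}

with_list = crawl(list(SITE), SITE)
without_list = crawl(list(SITE), SITE, track_followed=False)
print(len(with_list), len(without_list))
```

With the list enabled, the qid links are fetched only for the first search; with it disabled, they are fetched again for every search, which is what this site needs because the same URL returns different records each time.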

Oleg.
Steve Sieloff 02/08/2006 09:09 am
Oleg --

Thanks for the explanation and thanks again for the help on this site!!!

One last question ... can you only invoke NoURLs from the command line or can it be done in the normal Windows screen via an Additional= or some other switch?

Thanks again!

Steve
Oleg Chernavin 02/08/2006 09:28 am
This setting was added for debugging purposes only and is rarely used. This is why I haven't documented it so far. But if you think it is really necessary to have it at the Project level, I can add it.

Oleg.