Problem with the link inside a page

Author Message
Steven 07/29/2011 12:53 am
I downloaded http://www.economist.com/node/21524874/print

with version 5.9 of POB,

but when I opened the downloaded page I noticed that the link inside it is different from the one you see when browsing the page online.

Here is what I mean:

If you open the page above, and look at the middle of the fourth paragraph, you'll find (see article, www.economist.com/node/********)

But at the same place in the offline copy there is only (see article), with no link following it.

That would still be OK for me if the link were the same. The problem is that the hyperlinked words "see article" actually lead to www.economist.com/******* (without "node").

I even tried to add it using parsing, but it didn't work no matter what.

Hope you can understand what I'm talking about. Thank you!
Steven 07/29/2011 02:23 am
If POB works OK with the link I provided, please try this one:

http://www.economist.com/node/21524934/print (test the link at the end of the first paragraph)
Steven 07/29/2011 03:03 am
To make it easier still, here are my project settings. (It's quite a simple project: it downloads one link, with a URL substitute rule that adds "/print" after the original link.) But when you open the project, the page that "see article" leads to isn't right: "node" is missing from the link.



[Object]
OEVersion= 5.9.0.3374
Type=0
IID=7062
Caption=test
URL=http://www.economist.com/node/21524934
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jscssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RProt=255
LastStart=184:6:182:250:51:230:227:64:
LastEnd=136:26:232:251:51:230:227:64:
LastStarted=2011-7-29 14:59:04
LastEnded=2011-7-29 14:59:16
S200=3
S304=6
SAbr=1
SPar=1
SSav=3
SLast=200
SSiz=65136
SMdf=2
SHTML=1
SSuccDowns=1
LFiles=9
LSize=21839
SubstsB=Kgl3d3cuZWNvbm9taXN0LmNvbS9ub2RlLyoJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qL3ByaW50DQo=
ImgDim=0,0,0,0
PrevURL=http://www.economist.com/node/21524934/print
SkipURLs=
ConvertRSS=True
LIndexed=False
IndexFiles=False
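
The SubstsB value in the settings above looks like Base64. Here is a minimal sketch for reading it, assuming it is plain Base64 of tab-separated text with one rule per line. The helper name decode_substs is made up for illustration, and the meaning of each field is a guess rather than documented behaviour of the export format.

```python
import base64

# SubstsB value copied from the project settings above.
SUBSTS_B = "Kgl3d3cuZWNvbm9taXN0LmNvbS9ub2RlLyoJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qL3ByaW50DQo="


def decode_substs(value: str) -> list[list[str]]:
    """Decode a SubstsB string into rows of tab-separated fields.

    Assumption: the value is Base64-encoded ASCII text, one rule per line.
    """
    text = base64.b64decode(value).decode("ascii")
    return [line.split("\t") for line in text.splitlines() if line]


rules = decode_substs(SUBSTS_B)
for fields in rules:
    print(fields)
```

Decoded this way, the single rule contains the pattern www.economist.com/node/* and the replacement www.economist.com/node/*/print, which agrees with the "adds /print" description of the rule.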
Oleg Chernavin 07/30/2011 04:56 am
Steven, sorry for the late reply! Offline Explorer got confused by the substitution rule, which converts a filename into a directory, and built the relative link incorrectly as a result. You can work around this either by adding a trailing slash to the URL you browse offline, or by changing the initial URL to:

http://www.economist.com/node/21524934/
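
To illustrate why the trailing slash can matter, here is a minimal sketch using Python's urllib.parse.urljoin (standard relative URL resolution, not Offline Explorer's own code). The relative link "../../21524852/print" is a hypothetical example of what could be written into the saved page if the page were treated as a directory; the actual link in the saved file may differ.

```python
from urllib.parse import urljoin

# Hypothetical relative link, as it might be written into the saved page if the
# page were treated as the directory .../node/21524934/print/ (an assumption).
RELATIVE_LINK = "../../21524852/print"

# Base URL without a trailing slash: "print" counts as a file name,
# so the two ".." steps climb out of "node" as well.
without_slash = urljoin("http://www.economist.com/node/21524934/print", RELATIVE_LINK)

# Base URL with a trailing slash: "print/" counts as a directory,
# so the same link stays under "node".
with_slash = urljoin("http://www.economist.com/node/21524934/print/", RELATIVE_LINK)

print(without_slash)
print(with_slash)
```

Under these assumptions the first result has no "node" segment and the second one keeps it, so the same relative link is only correct when the page and the browser agree on the directory depth.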

Best regards,
Oleg Chernavin
MP Staff
Steven 07/30/2011 06:22 am
Thank you for your reply.

But actually this is only part of a larger project that uses the whole print edition page as the starting page. I've tried adding the trailing slash in the URL substitute setting, but that makes things worse, because the URLs downloaded by default do not contain the slash.

I've also tried your method of adding the slash when downloading only the link I provided the other day, and it doesn't work either. Maybe you misunderstood me? The URL of the downloaded page itself is fine; what is wrong is the hyperlinked "see article". Even with the slash added, "see article" leads to a link like "http://127.0.0.1:800/Default/www.economist.com/21524852/print", without "node" before "21524852". This is exactly what makes it difficult for me to browse offline. I've downloaded all the pages, but the internal links between them are wrong, so I have to add the missing part every time I click a "see article" link.


Here are the settings of my entire project. The first three URL substitute rules were written for the old version and do not affect the download; it's the fourth one that actually does the work.

[Object]
OEVersion= 5.9.0.3374
Type=0
IID=1
Caption=The Economist
URL=http://www.economist.com/printedition/
MVer=5
Lev=2
Hour=13
Weekday=288
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jscssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=ooxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RSrvsBx=1
RSrvsIn=economist.com x
RSrvsEx=pixel.fetchback.com x
RPathIn=printeditionprint xx
RPathEx=rssmarkets/indicatorsfacebookdiggtargetfacebook?target=#blogsdebatecommentsprintsubscribetwitter xxxxxxxxxxoxx
RFileIn=printerfriendly.cfm o
RFileEx=index.cfmeconomist_printedition.xmlcover_index.cfmbuttomrecommendtopcommentfacebookdefault.htm xxxxxxxxx
RProt=255
LastStart=73:38:147:35:88:230:227:64:
LastEnd=4:138:48:38:88:230:227:64:
LastStarted=2011-7-30 18:06:15
LastEnded=2011-7-30 18:06:42
S200=86
S304=111
SPar=71
SSav=86
SLast=200
SSiz=4931427
SMdf=83
SHTML=62
SSuccDowns=72
LFiles=197
LSize=1668457
SubstsB=Kgl3d3cuZWNvbm9taXN0LmNvbS8qKiovKiovZGlzcGxheXN0b3J5LmNmbT9zdG9yeV9pZD0qCXd3dy5lY29ub21pc3QuY29tL1ByaW50ZXJGcmllbmRseS5jZm0/c3RvcnlfaWQ9Kg0KKgl3d3cuZWNvbm9taXN0LmNvbS8qKi9kaXNwbGF5c3RvcnkuY2ZtP3N0b3J5X2lkPSoJd3d3LmVjb25vbWlzdC5jb20vUHJpbnRlckZyaWVuZGx5LmNmbT9zdG9yeV9pZD0qDQoqCXd3dy5lY29ub21pc3QuY29tL2Rpc3BsYXlzdG9yeS5jZm0/c3RvcnlfaWQ9Kgl3d3cuZWNvbm9taXN0LmNvbS9QcmludGVyRnJpZW5kbHkuY2ZtP3N0b3J5X2lkPSoNCioJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qCXd3dy5lY29ub21pc3QuY29tL25vZGUvKi9wcmludA0K
ApplyAllSubsts=True
ImgDim=0,0,0,0
PrevURL=http://www.economist.com/printedition/
SkipURLs=http://ad.doubleclick.net/*http://ads.revsci.net/*http://connect.facebook.net/en_us/all.jshttp://doubleclick.net/*http://fls.doubleclick.net/*http://media.economist.com/images/subscriptions/error.gifhttp://pixel.fetchback.com/serve/fb/pdc?cat=&name=landing&sid=1874http://pixel.quantserve.com/*http://platform.twitter.com/widgets.jshttp://switch.atdmt.com/*http://www.economist.com/printhttp://www.economist.com/printedition/cover-index
ConvertRSS=True
Exported=2009-9-18 18:42:19 - C:\Documents and Settings\mlt\My Documents\te1008\
LIndexed=False
IndexFiles=False
Steven 07/30/2011 08:23 pm
For example:

Basically, I download links like:

e.com/n/1
e.com/n/2
e.com/n/3
...

and the URL substitute rule will add "print" after each link, so the actual links downloaded would be

e.com/n/1/print
e.com/n/2/print
e.com/n/3/print

Inside the page "e.com/n/1" there is an inner link that points to "e.com/n/2" (here, the "see article" link).

The problem is that while POB downloaded all the "1/print", "2/print", "3/print" links correctly (without my having to add the trailing slash), the links inside the pages are wrong. The link that should have been

e.com/n/2/print became
e.com/2/print or e.com/n/2

either without "n" or "print"

Since all those links are downloaded correctly in the first place, I don't think POB confuses the parsing as you have explained. There must be something wrong elsewhere.
Steven 08/04/2011 11:57 am
Hi! Can you help me with another problem, please? Last time it was about the links inside an article. But just now when I tried to download, even though I set the level limit to 2, it failed to download anything other than the index page. I used the same project settings as above. Would you please have a look?
Steven 08/04/2011 12:08 pm
It seems that I should have added "node" to the included keywords instead of "print". But then the download never ended.

In the waiting list there are links like:

http://www.economist.com/node/21525396/economist.com/print

I don't know why since there are actually no such links. Still a problem with parsing.
Steven 08/04/2011 12:19 pm
Here is my new project setting by the way (I've excluded economist from directory) but links like

http://www.economist.com/node/21525365/print/print

were downloaded

That didn't happen last time.


[Object]
OEVersion= 5.9.0.3374
Type=0
IID=1
Caption=The Economist
URL=http://www.economist.com/printedition/
MVer=5
Lev=2
Hour=13
Weekday=288
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jscssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=ooxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RSrvsBx=1
RSrvsIn=economist.com x
RSrvsEx=pixel.fetchback.com x
RPathIn=printeditionnode xx
RPathEx=rssmarkets/indicatorsfacebookdiggtargetfacebook?target=#blogsdebatecommentssubscribetwitteremail xxxxxxxxxxxxx
RFileIn=printerfriendly.cfm o
RFileEx=index.cfmeconomist_printedition.xmlcover_index.cfmbuttomrecommendtopcommentfacebookdefault.htm xxxxxxxxx
RProt=255
LastStart=90:190:66:86:0:231:227:64:
LastEnd=107:67:236:86:0:231:227:64:
LastStarted=2011-8-5 0:15:09
LastEnded=2011-8-5 0:15:16
SLast=302
SSuccDowns=80
Flags=1
SubstsB=Kgl3d3cuZWNvbm9taXN0LmNvbS8qKiovKiovZGlzcGxheXN0b3J5LmNmbT9zdG9yeV9pZD0qCXd3dy5lY29ub21pc3QuY29tL1ByaW50ZXJGcmllbmRseS5jZm0/c3RvcnlfaWQ9Kg0KKgl3d3cuZWNvbm9taXN0LmNvbS8qKi9kaXNwbGF5c3RvcnkuY2ZtP3N0b3J5X2lkPSoJd3d3LmVjb25vbWlzdC5jb20vUHJpbnRlckZyaWVuZGx5LmNmbT9zdG9yeV9pZD0qDQoqCXd3dy5lY29ub21pc3QuY29tL2Rpc3BsYXlzdG9yeS5jZm0/c3RvcnlfaWQ9Kgl3d3cuZWNvbm9taXN0LmNvbS9QcmludGVyRnJpZW5kbHkuY2ZtP3N0b3J5X2lkPSoNCioJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qCXd3dy5lY29ub21pc3QuY29tL25vZGUvKi9wcmludA0K
ApplyAllSubsts=True
ImgDim=0,0,0,0
PrevURL=http://www.economist.com/printedition/
SkipURLs=http://ad.doubleclick.net/*http://ads.revsci.net/*http://connect.facebook.net/en_us/all.jshttp://doubleclick.net/*http://fls.doubleclick.net/*http://media.economist.com/images/subscriptions/error.gifhttp://pixel.fetchback.com/serve/fb/pdc?cat=&name=landing&sid=1874http://pixel.quantserve.com/*http://platform.twitter.com/widgets.jshttp://switch.atdmt.com/*http://www.economist.com/printhttp://www.economist.com/printedition/cover-index
ConvertRSS=True
Exported=2009-9-18 18:42:19 - C:\Documents and Settings\mlt\My Documents\te1008\
LIndexed=False
IndexFiles=False
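
The /print/print URLs mentioned above are what one would expect if a rule of the form www.economist.com/node/* -> www.economist.com/node/*/print were applied to a URL that had already been rewritten. Here is a minimal sketch, assuming "*" means "any run of characters"; the helper names are hypothetical and this is not Offline Explorer's matching engine. Whether this is what actually happened in the project is not confirmed.

```python
import re

# Assumed reading of the rule "www.economist.com/node/*" -> "www.economist.com/node/*/print",
# with "*" taken as "any run of characters up to the end of the URL".
RULE = re.compile(r"(www\.economist\.com/node/)(.*)$")


def add_print(url: str) -> str:
    """Append /print to a node URL, with no check for an existing suffix."""
    return RULE.sub(r"\1\2/print", url)


def add_print_once(url: str) -> str:
    """Same substitution, but leave URLs that already end in /print alone."""
    return url if url.endswith("/print") else add_print(url)


url = "http://www.economist.com/node/21525365"
once = add_print(url)    # rule applied to the original link
twice = add_print(once)  # rule applied again to the rewritten link
print(once)
print(twice)
print(add_print_once(once))
```

The second application matches "21525365/print" with the wildcard and appends another "/print", while the guarded version returns the URL unchanged.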
Oleg Chernavin 08/04/2011 03:00 pm
It downloads well for me. However, I only made a limited, selective download because of the mobile connection I have now. Try setting the project to use another download directory.

Oleg.
Steven 08/04/2011 10:57 pm
I don't understand. What changes should I make to the settings?

And is it possible to exclude links that end with a certain word from downloading?
Oleg Chernavin 08/08/2011 07:18 am
What about the URL Filters - Filename - Excluded keyword:

/*print/print

Oleg.
Steven 08/08/2011 11:31 am
I've changed the parsing rules to bypass the "double print" problem.

Now all the links downloaded end with a certain number.

Can I use the filter to block all links that end with /print ? (not print/print in this case)
Oleg Chernavin 08/08/2011 01:15 pm
Yes:

print$
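
As a rough illustration of what this keyword does, here is a small sketch that assumes the "$" anchors the match at the end of the URL, as it does in a regular expression. The function name is_excluded is made up for the example, and Offline Explorer's own filter syntax may differ in details.

```python
import re

# Assumption: "$" anchors the keyword at the end of the URL,
# the way it does in a regular expression.
ENDS_WITH_PRINT = re.compile(r"print$")


def is_excluded(url: str) -> bool:
    """Return True if the URL ends with "print"."""
    return ENDS_WITH_PRINT.search(url) is not None


for candidate in (
    "http://www.economist.com/node/21525365/print",
    "http://www.economist.com/node/21525365",
    "http://www.economist.com/printedition/",
):
    print(candidate, is_excluded(candidate))
```

Under that assumption only the first URL is excluded; "printedition/" contains "print" but does not end with it.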

Oleg.
Steven 08/08/2011 11:49 pm
That seems to be working fine now. Thank you. ^_^
Oleg Chernavin 08/09/2011 05:10 am
You are welcome!

Oleg.