with version 5.9 of POB,
but when opening the downloaded page I noticed that the link inside is actually different from what it is when you browse the page online.
Here is what I mean:
If you open the page above, and look at the middle of the fourth paragraph, you'll find (see article, www.economist.com/node/********)
But in the same place when browsed offline, there is only (see article), with no link following it.
That would still be OK for me if the link were the same. The problem is that the hyperlinked words "see article" actually lead to www.economist.com/******* (without "node").
I even tried to add it using parsing, but it didn't work no matter what.
Hope you can understand what I'm talking about. Thank you!
http://www.economist.com/node/21524934/print (test the link at the end of the first paragraph)
[Object]
OEVersion= 5.9.0.3374
Type=0
IID=7062
Caption=test
URL=http://www.economist.com/node/21524934
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jscssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RProt=255
LastStart=184:6:182:250:51:230:227:64:
LastEnd=136:26:232:251:51:230:227:64:
LastStarted=2011-7-29 14:59:04
LastEnded=2011-7-29 14:59:16
S200=3
S304=6
SAbr=1
SPar=1
SSav=3
SLast=200
SSiz=65136
SMdf=2
SHTML=1
SSuccDowns=1
LFiles=9
LSize=21839
SubstsB=Kgl3d3cuZWNvbm9taXN0LmNvbS9ub2RlLyoJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qL3ByaW50DQo=
ImgDim=0,0,0,0
PrevURL=http://www.economist.com/node/21524934/print
SkipURLs=
ConvertRSS=True
LIndexed=False
IndexFiles=False
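For readers of the exported settings: the SubstsB value looks like Base64. The following is a minimal Python sketch, assuming the value is plain Base64 that holds one tab-separated rule per line. The field meanings are guessed from the decoded text, not taken from POB documentation.

```python
import base64

# SubstsB value copied from the single-page test project above.
SUBSTS_B = "Kgl3d3cuZWNvbm9taXN0LmNvbS9ub2RlLyoJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qL3ByaW50DQo="


def decode_substs(value: str) -> list[list[str]]:
    """Decode a SubstsB value into rules: one list of tab-separated fields per line."""
    text = base64.b64decode(value).decode("ascii")
    return [line.split("\t") for line in text.splitlines() if line]


for fields in decode_substs(SUBSTS_B):
    print(fields)
# ['*', 'www.economist.com/node/*', 'www.economist.com/node/*/print']
```

The three fields appear to be a mask for the pages the rule applies to, the URL pattern to replace, and the replacement, which matches the "add /print after each node link" rule described in this thread.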
http://www.economist.com/node/21524934/
Best regards,
Oleg Chernavin
MP Staff
But actually this is only a part of a larger project, which sets the whole print edition page as the starting page. I've tried adding the ending slash in the URL substitute setting, but that makes things worse because the URLs downloaded by default do not contain the slash.
I tried your method of adding the slash while downloading only the link I provided the other day, and found that it doesn't work either. Maybe you misunderstood me? The URL of the downloaded page itself has no problem; what is wrong is the hyperlinked "see article". Even with the slash added, "see article" leads to a link like "http://127.0.0.1:800/Default/www.economist.com/21524852/print", without "node" before "21524852". This is exactly why it's difficult for me to browse offline. I've downloaded all the pages, but the internal links between them are wrong, so I have to add the missing part every time I click a "see article" link.
Here are the settings of my entire project. The first three URL substitute rules were written for the old version and do not affect the download; it is the fourth one that actually does the work.
[Object]
OEVersion= 5.9.0.3374
Type=0
IID=1
Caption=The Economist
URL=http://www.economist.com/printedition/
MVer=5
Lev=2
Hour=13
Weekday=288
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jscssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=ooxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RSrvsBx=1
RSrvsIn=economist.com x
RSrvsEx=pixel.fetchback.com x
RPathIn=printeditionprint xx
RPathEx=rssmarkets/indicatorsfacebookdiggtargetfacebook?target=#blogsdebatecommentsprintsubscribetwitter xxxxxxxxxxoxx
RFileIn=printerfriendly.cfm o
RFileEx=index.cfmeconomist_printedition.xmlcover_index.cfmbuttomrecommendtopcommentfacebookdefault.htm xxxxxxxxx
RProt=255
LastStart=73:38:147:35:88:230:227:64:
LastEnd=4:138:48:38:88:230:227:64:
LastStarted=2011-7-30 18:06:15
LastEnded=2011-7-30 18:06:42
S200=86
S304=111
SPar=71
SSav=86
SLast=200
SSiz=4931427
SMdf=83
SHTML=62
SSuccDowns=72
LFiles=197
LSize=1668457
SubstsB=Kgl3d3cuZWNvbm9taXN0LmNvbS8qKiovKiovZGlzcGxheXN0b3J5LmNmbT9zdG9yeV9pZD0qCXd3dy5lY29ub21pc3QuY29tL1ByaW50ZXJGcmllbmRseS5jZm0/c3RvcnlfaWQ9Kg0KKgl3d3cuZWNvbm9taXN0LmNvbS8qKi9kaXNwbGF5c3RvcnkuY2ZtP3N0b3J5X2lkPSoJd3d3LmVjb25vbWlzdC5jb20vUHJpbnRlckZyaWVuZGx5LmNmbT9zdG9yeV9pZD0qDQoqCXd3dy5lY29ub21pc3QuY29tL2Rpc3BsYXlzdG9yeS5jZm0/c3RvcnlfaWQ9Kgl3d3cuZWNvbm9taXN0LmNvbS9QcmludGVyRnJpZW5kbHkuY2ZtP3N0b3J5X2lkPSoNCioJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qCXd3dy5lY29ub21pc3QuY29tL25vZGUvKi9wcmludA0K
ApplyAllSubsts=True
ImgDim=0,0,0,0
PrevURL=http://www.economist.com/printedition/
SkipURLs=http://ad.doubleclick.net/*http://ads.revsci.net/*http://connect.facebook.net/en_us/all.jshttp://doubleclick.net/*http://fls.doubleclick.net/*http://media.economist.com/images/subscriptions/error.gifhttp://pixel.fetchback.com/serve/fb/pdc?cat=&name=landing&sid=1874http://pixel.quantserve.com/*http://platform.twitter.com/widgets.jshttp://switch.atdmt.com/*http://www.economist.com/printhttp://www.economist.com/printedition/cover-index
ConvertRSS=True
Exported=2009-9-18 18:42:19 - C:\Documents and Settings\mlt\My Documents\te1008\
LIndexed=False
IndexFiles=False
Basically, I download links like:
e.com/n/1
e.com/n/2
e.com/n/3
...
and the URL substitute rule will add "print" after each link, so the actual links downloaded would be
e.com/n/1/print
e.com/n/2/print
e.com/n/3/print
Inside the page "e.com/n/1" there is an inner link which points to "e.com/n/2" (here, the "see article" link).
The problem is that while POB downloaded all the "1/print", "2/print", "3/print" links correctly (without my having to add the ending slash), the links inside the pages are wrong. The link that should have been
e.com/n/2/print became
e.com/2/print or e.com/n/2
either without "n" or without "print".
Since all those links were downloaded correctly in the first place, I don't think POB confuses the parsing as you have explained. There must be something wrong elsewhere.
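As an illustration of the expected result, here is a small Python sketch that applies one wildcard rule both to the download queue and to a link found inside a saved page. The `substitute` helper and its handling of `*` are hypothetical and only model the behaviour described above; they are not how POB implements URL substitutes.

```python
import re


def substitute(url: str, mask: str, replacement: str) -> str:
    """Rewrite url if it matches mask; each '*' in mask captures any text."""
    pattern = "(.*)".join(re.escape(part) for part in mask.split("*"))
    match = re.fullmatch(pattern, url)
    if match is None:
        return url  # the rule does not apply to this URL
    result = replacement
    for captured in match.groups():
        # Fill the next '*' in the replacement with the captured text.
        result = result.replace("*", captured, 1)
    return result


MASK = "e.com/n/*"
REPLACEMENT = "e.com/n/*/print"

queue = ["e.com/n/1", "e.com/n/2", "e.com/n/3"]
downloaded = [substitute(u, MASK, REPLACEMENT) for u in queue]
print(downloaded)  # ['e.com/n/1/print', 'e.com/n/2/print', 'e.com/n/3/print']

# The "see article" link inside e.com/n/1 should be rewritten the same way.
inner_link = "e.com/n/2"
print(substitute(inner_link, MASK, REPLACEMENT))  # e.com/n/2/print
```

If queue entries and inner links went through the same rule, the offline link would point at a page that was actually saved. The symptoms reported here (e.com/2/print or e.com/n/2) suggest the two are handled differently.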
In the waiting list, there are links like:
http://www.economist.com/node/21525396/economist.com/print
I don't know why, since no such links actually exist. It still looks like a problem with parsing.
http://www.economist.com/node/21525365/print/print
were downloaded
That didn't happen last time.
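One way a URL ending in /print/print could arise is a wildcard rule being applied to a URL it has already rewritten. Whether POB does this is not confirmed in this thread; the Python sketch below only demonstrates the mechanism, and both helper functions are hypothetical.

```python
import re

# Models the rule "www.economist.com/node/*" -> "www.economist.com/node/*/print".
NODE_RULE = re.compile(r"^(www\.economist\.com/node/)(.*)$")


def add_print(url: str) -> str:
    """Append /print to any node URL, with no check for earlier rewrites."""
    return NODE_RULE.sub(r"\1\2/print", url)


def add_print_once(url: str) -> str:
    """Same rule, but skip URLs that already end with /print."""
    if url.endswith("/print"):
        return url
    return add_print(url)


once = add_print("www.economist.com/node/21525365")
twice = add_print(once)
print(once)   # www.economist.com/node/21525365/print
print(twice)  # www.economist.com/node/21525365/print/print

print(add_print_once(once))  # www.economist.com/node/21525365/print
```

The wildcard also matches the already rewritten URL, because `*` can capture "21525365/print" as easily as "21525365".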
[Object]
OEVersion= 5.9.0.3374
Type=0
IID=1
Caption=The Economist
URL=http://www.economist.com/printedition/
MVer=5
Lev=2
Hour=13
Weekday=288
LimTSize=10000
LimNumber=5000
LimTime=100
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jscssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=ooxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=ooxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
NotIgnoreLogout=False
RSrvsBx=1
RSrvsIn=economist.com x
RSrvsEx=pixel.fetchback.com x
RPathIn=printeditionnode xx
RPathEx=rssmarkets/indicatorsfacebookdiggtargetfacebook?target=#blogsdebatecommentssubscribetwitteremail xxxxxxxxxxxxx
RFileIn=printerfriendly.cfm o
RFileEx=index.cfmeconomist_printedition.xmlcover_index.cfmbuttomrecommendtopcommentfacebookdefault.htm xxxxxxxxx
RProt=255
LastStart=90:190:66:86:0:231:227:64:
LastEnd=107:67:236:86:0:231:227:64:
LastStarted=2011-8-5 0:15:09
LastEnded=2011-8-5 0:15:16
SLast=302
SSuccDowns=80
Flags=1
SubstsB=Kgl3d3cuZWNvbm9taXN0LmNvbS8qKiovKiovZGlzcGxheXN0b3J5LmNmbT9zdG9yeV9pZD0qCXd3dy5lY29ub21pc3QuY29tL1ByaW50ZXJGcmllbmRseS5jZm0/c3RvcnlfaWQ9Kg0KKgl3d3cuZWNvbm9taXN0LmNvbS8qKi9kaXNwbGF5c3RvcnkuY2ZtP3N0b3J5X2lkPSoJd3d3LmVjb25vbWlzdC5jb20vUHJpbnRlckZyaWVuZGx5LmNmbT9zdG9yeV9pZD0qDQoqCXd3dy5lY29ub21pc3QuY29tL2Rpc3BsYXlzdG9yeS5jZm0/c3RvcnlfaWQ9Kgl3d3cuZWNvbm9taXN0LmNvbS9QcmludGVyRnJpZW5kbHkuY2ZtP3N0b3J5X2lkPSoNCioJd3d3LmVjb25vbWlzdC5jb20vbm9kZS8qCXd3dy5lY29ub21pc3QuY29tL25vZGUvKi9wcmludA0K
ApplyAllSubsts=True
ImgDim=0,0,0,0
PrevURL=http://www.economist.com/printedition/
SkipURLs=http://ad.doubleclick.net/*http://ads.revsci.net/*http://connect.facebook.net/en_us/all.jshttp://doubleclick.net/*http://fls.doubleclick.net/*http://media.economist.com/images/subscriptions/error.gifhttp://pixel.fetchback.com/serve/fb/pdc?cat=&name=landing&sid=1874http://pixel.quantserve.com/*http://platform.twitter.com/widgets.jshttp://switch.atdmt.com/*http://www.economist.com/printhttp://www.economist.com/printedition/cover-index
ConvertRSS=True
Exported=2009-9-18 18:42:19 - C:\Documents and Settings\mlt\My Documents\te1008\
LIndexed=False
IndexFiles=False
Oleg.
And is it possible to exclude links that end with a certain word from downloading?
/*print/print
Oleg.
Now all the links downloaded end with a certain number.
Can I use the filter to block all links that end with /print? (Not print/print in this case.)
print$
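The `print$` suggestion reads like regular-expression syntax, where `$` anchors the match to the end of the URL. Assuming that reading (POB's own filter syntax may differ), this Python sketch shows which URLs such a filter would block. The `/blueprint` URL is made up for the example.

```python
import re

LOOSE = re.compile(r"print$")    # any URL whose last characters are "print"
STRICT = re.compile(r"/print$")  # only URLs whose last path segment is "print"


def is_blocked(url: str, rule: re.Pattern) -> bool:
    """Return True if the filter rule matches the URL."""
    return rule.search(url) is not None


urls = [
    "http://www.economist.com/node/21525365",
    "http://www.economist.com/node/21525365/print",
    "http://www.economist.com/node/21525365/print/print",
    "http://www.economist.com/blueprint",  # hypothetical URL
]

for url in urls:
    print(url, is_blocked(url, LOOSE), is_blocked(url, STRICT))
```

Both rules block the /print and /print/print URLs and let the plain node URL through. Only the loose rule also blocks the made-up /blueprint URL, so including the slash is safer if other pages could end in "print".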
Oleg.
Oleg.