I tried to download all the article URLs on this page: http://biz.yahoo.com/ic/news/510.html (including the page itself).
Some URLs are downloaded, for example:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://biz.yahoo.com/prnews/061213/nyw043.html?.v=75
Others are not downloaded, for example:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/wsj/SIG=11p4sig33/*http://online.wsj.com/article/SB116600441346348810.html?mod=yahoo_hs&ru=yahoo
I set URL filters (Directory) like these:
/finance/industry/news/latestnews/
/prnews/
/finance/external/wsj/
I don't understand why the first type of URL is downloaded and the second is not. Is there a limit on URL length?
Thanks for your help.
Estelle
Best regards,
Oleg Chernavin
MP Staff
In the log, the second URL is not rejected:
HTTP1: Transferring data from http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/wsj/SIG=11pqlfj13/*http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo.
HTTP1: HTTP/1.0 302 Moved Temporarily
Why is it moved temporarily?
estelle
http://us.rd.yahoo.com/finance/external/wsj/SIG=11pqlfj13/*http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo
http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo
http://users1.wsj.com/lmda/do/checkLogin?mg=wsj-users1&url=http%3A%2F%2Fonline.wsj.com%2Farticle%2FSB116596138778448104.html%3Fmod%3Dyahoo_hs%26ru%3Dyahoo
Only the last address is the actual article. You will need to enable all 3 addresses in order to load the files.
Oleg.
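The chain of addresses above can be read straight out of the original link, because each Yahoo redirect URL carries its target after a `/*` marker. The following is only a minimal sketch of that idea in Python; the function name `redirect_chain` is hypothetical, and it assumes (as the log in this thread suggests) that every hop redirects to whatever follows the next `/*`.

```python
def redirect_chain(url: str) -> list[str]:
    """Return the URL followed by every URL nested after a '/*' marker."""
    chain = [url]
    while True:
        # Search from index 1 so the URL's own leading scheme is never matched.
        pos = url.find("/*http", 1)
        if pos == -1:
            return chain
        url = url[pos + 2:]  # skip the '/*' marker itself
        chain.append(url)


wrapped = (
    "http://us.rd.yahoo.com/finance/industry/news/latestnews/"
    "*http://us.rd.yahoo.com/finance/external/wsj/SIG=11pqlfj13/"
    "*http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo"
)
chain = redirect_chain(wrapped)
for hop in chain:
    print(hop)
```

This prints the wrapped link, then the two addresses nested inside it. The final users1.wsj.com login address is not embedded in the link; it only appears once the WSJ server answers, so it cannot be derived this way.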
Thanks for your response.
I still have the problem. Here are two more examples:
Example 1, which doesn't work:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=120qcvstg/*http://money.cnn.com/2006/12/18/news/companies/lilly/index.htm?source=yahoo_quote
With these filters :
/finance/industry/news/latestnews/
/finance/external/cnnm/
/news/companies/
Example 2, which works:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://biz.yahoo.com/rb/061218/biomet_deal.html?.v=3
With these filters :
/finance/industry/news/latestnews/
/rb/
What should I do for the first example?
Thanks for your help.
Estelle
http://money.cnn.com/*
And allow downloading from all servers in URL Filters - Server.
Oleg.
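To see which servers a wrapped link passes through, and therefore which ones the Server filter has to allow, the host names can be pulled out of the link itself. This is a rough sketch in Python, not part of the downloader; `hosts_in_chain` is a hypothetical helper, and it assumes the nested targets are always introduced by `/*http`.

```python
from urllib.parse import urlsplit


def hosts_in_chain(url: str) -> list[str]:
    """List each distinct host named in a wrapped redirect URL, in order."""
    hosts = []
    while url:
        host = urlsplit(url).hostname
        if host and host not in hosts:
            hosts.append(host)
        # Move on to the next nested target, if there is one.
        pos = url.find("/*http", 1)
        url = url[pos + 2:] if pos != -1 else ""
    return hosts


example = (
    "http://us.rd.yahoo.com/finance/industry/news/latestnews/"
    "*http://us.rd.yahoo.com/finance/external/cnnm/SIG=120qcvstg/"
    "*http://money.cnn.com/2006/12/18/news/companies/lilly/index.htm?source=yahoo_quote"
)
hosts = hosts_in_chain(example)
print(hosts)  # ['us.rd.yahoo.com', 'money.cnn.com']
```

For Example 1 the article itself lives on money.cnn.com, a server that none of the directory keywords refer to, which is consistent with the advice to allow that server explicitly.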
Sorry, but how can I download a portal page (http://biz.yahoo.com/ic/news/510.html) and all its sub-article URLs with their contents?
I tried level limits of 2 and 3, but I always get the same result: when the URL is long and involves several redirections, it is not downloaded, whatever filters are defined.
Thanks again for your help.
estelle
Oleg.
Thanks for creating a project for me, but it still doesn't work in this project!
The article from the source "at CNNMoney.com", for example, is not downloaded.
Do you have any idea what the problem is?
Thanks a lot.
estelle
Oleg.
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote
And you have put keywords like these in the project:
/finance/industry/news/latestnews/
/finance/external/
/news/companies/
I don't know what other keywords we can use to make the download work!
Thanks for your help.
estelle
money.cnn.com
us.rd.yahoo.com
?
Oleg.
HTTP8 - 09/01/2007 15:00:35 - Transferring data from http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.
HTTP8 - 09/01/2007 15:00:35 - HTTP/1.0 302 Moved Temporarily
HTTP8 - 09/01/2007 15:00:35 - 274 bytes of http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.
HTTP8 - 09/01/2007 15:00:35 - Download complete. Status: 302 Object Moved.
HTTP8 - 09/01/2007 15:00:35 - Delay 1 seconds before http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.
Do you understand these log messages?
estelle
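One way to read the log lines above: a 302 status means the server answered with a redirect rather than a page, and the "Delay ... before" line names the address that is queued next. The sketch below, in Python, pulls both pieces out of lines copied from that log. The regular expressions are assumptions based only on the lines quoted in this thread, not a documented log format.

```python
import re

log_lines = [
    "HTTP8 - 09/01/2007 15:00:35 - HTTP/1.0 302 Moved Temporarily",
    "HTTP8 - 09/01/2007 15:00:35 - Download complete. Status: 302 Object Moved.",
    "HTTP8 - 09/01/2007 15:00:35 - Delay 1 seconds before "
    "http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/"
    "*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/"
    "index.htm?source=yahoo_quote.",
]

# Assumed patterns: a final status line and a line announcing the next request.
STATUS = re.compile(r"Status: (\d{3}) (.+)\.$")
NEXT_URL = re.compile(r"Delay \d+ seconds before (\S+)\.$")


def summarize(lines):
    """Collect final status codes and the URLs queued after a redirect."""
    statuses, queued = [], []
    for line in lines:
        if m := STATUS.search(line):
            statuses.append(int(m.group(1)))
        elif m := NEXT_URL.search(line):
            queued.append(m.group(1))
    return statuses, queued


statuses, queued = summarize(log_lines)
print(statuses)
print(queued)
```

Here the queued address is the inner Yahoo link, so the first redirect was followed; whether the chain then reaches money.cnn.com depends on what happens to that second request.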
Oleg.
HTTP8 - 09/01/2007 15:00:36 - Connecting to Proxy server...
HTTP8 - 09/01/2007 15:00:36 - Host us.rd.yahoo.com connected. Waiting for http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.
HTTP8 - 09/01/2007 15:00:36 - GET http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote HTTP/1.0
HTTP8 - 09/01/2007 15:00:37 - Download complete. Status: 302 Object Moved.
?
estelle
Oleg.
[Object]
OEVersion=Pro 4.5.0.2532
Type=0
IID=62233
Caption=http://biz.yahoo.com/ic/news/510.html
URL=http://biz.yahoo.com/ic/news/510.html
Additional=AutoExport=C:\testoffline\;11000011010;DeleteProjectFiles
Lev=1
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FMGroup=2
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejar
FTUDef.Exts=jscssssivbsdtdxslswf
FTText.B=ooxooo
FTImages.B=xoxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RSrvsBx=2
RSrvsIn=biz.yahoo.commoney.cnn.comus.rd.yahoo.com xxx
RPathBx=2
RPathIn=/finance/industry/news/latestnews/ /finance/external//news/companies/ http://biz.yahoo.com/*http://*wsj.com/*http://*thestreet.com/*http://*marketwatch.com/*/finance/industry/ xxxxooox
RProt=127
LastStart=137:19:15:1:116:22:227:64:
LastEnd=6:147:225:3:116:22:227:64:
S200=45
S304=2
SPar=47
SSav=45
SLast=302
SSiz=1676045
SMdf=45
LFiles=47
LSize=1686413
Flags=1
ImgDim=0,0,0,0
PrevURL=http://biz.yahoo.com/ic/news/510.html
IPAddr=1723860162
Exported=09/01/2007 15:00:40 - C:\testoffline\
Oleg.