Size of URLs downloaded

Author Message
Estelle 12/13/2006 08:53 am
Hi Oleg,


I tried to download all the article URLs on this page: http://biz.yahoo.com/ic/news/510.html (including the page itself).

Some URLs are downloaded, for example:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://biz.yahoo.com/prnews/061213/nyw043.html?.v=75

Others are not downloaded, for example:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/wsj/SIG=11p4sig33/*http://online.wsj.com/article/SB116600441346348810.html?mod=yahoo_hs&ru=yahoo


I set URL filters (directory) like these:
/finance/industry/news/latestnews/
/prnews/
/finance/external/wsj/

I don't understand why the first type of URL is downloaded and the second is not. Is there a limit on URL length?
Thanks for your help.

Estelle
Oleg Chernavin 12/13/2006 09:07 am
There is no such limit. I think you need to open the Log (Ctrl+W), turn it on, and allow logging of rejected URLs. Then download the Project again and see why these URLs were rejected.

Best regards,
Oleg Chernavin
MP Staff
Estelle 12/13/2006 11:54 am
In the log, the second URL is not rejected:
HTTP1: Transferring data from http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/wsj/SIG=11pqlfj13/*http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo.
HTTP1: HTTP/1.0 302 Moved Temporarily

Why is it moved temporarily?

estelle

Oleg Chernavin 12/14/2006 04:13 am
This is a question for the server. I loaded that link, and it looks like several redirects take place:

http://us.rd.yahoo.com/finance/external/wsj/SIG=11pqlfj13/*http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo
http://online.wsj.com/article/SB116596138778448104.html?mod=yahoo_hs&ru=yahoo
http://users1.wsj.com/lmda/do/checkLogin?mg=wsj-users1&url=http%3A%2F%2Fonline.wsj.com%2Farticle%2FSB116596138778448104.html%3Fmod%3Dyahoo_hs%26ru%3Dyahoo

Only the last address is the actual article. You will need to enable all 3 addresses in order to load the files.
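
As a rough illustration of why several addresses are involved: the Yahoo links embed each next address after a "/*" marker, so the embedded part of the chain can be listed by splitting on that marker. This is only a sketch in Python. The helper name nested_targets is made up for this example, and it cannot see redirects that the final server adds on its own (such as the checkLogin step above).

```python
def nested_targets(url):
    """Return every address embedded in a Yahoo-style redirect link.

    Assumes each nested address starts right after a "/*" marker,
    as in the links quoted in this thread.
    """
    targets = [url]
    marker = "/*http://"
    pos = url.find(marker)
    while pos != -1:
        url = url[pos + 2:]  # drop everything up to and including "/*"
        targets.append(url)
        pos = url.find(marker)
    return targets


link = ("http://us.rd.yahoo.com/finance/industry/news/latestnews/"
        "*http://us.rd.yahoo.com/finance/external/wsj/SIG=11pqlfj13/"
        "*http://online.wsj.com/article/SB116596138778448104.html"
        "?mod=yahoo_hs&ru=yahoo")

# Print one address per line, outermost first.
for hop in nested_targets(link):
    print(hop)
```

Each printed address has to pass the URL Filters, otherwise the chain stops before the article is reached.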

Oleg.
Estelle 12/18/2006 11:19 am
Hello Oleg,

Thanks for your response.
I still have the problem. Here are two other examples:

Example 1 which doesn't work:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=120qcvstg/*http://money.cnn.com/2006/12/18/news/companies/lilly/index.htm?source=yahoo_quote

With these filters :
/finance/industry/news/latestnews/
/finance/external/cnnm/
/news/companies/


Example 2 which works:
http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://biz.yahoo.com/rb/061218/biomet_deal.html?.v=3

With these filters :
/finance/industry/news/latestnews/
/rb/


What should I do for the first example?
Thanks for your help.

Estelle
Oleg Chernavin 12/18/2006 12:21 pm
Maybe add one more keyword:

http://money.cnn.com/*

And allow downloading from all servers in URL Filters - Server.

Oleg.
Estelle 12/20/2006 11:16 am
With this extra keyword the download takes too long, and it still doesn't work.
Sorry, but how can I download a portal page (http://biz.yahoo.com/ic/news/510.html) and all its sub-article URLs with their contents?
I tried level limits 2 and 3, but I always get the same result: when the URL is long and goes through several redirections, it is not downloaded, whatever filters are defined.
Thanks again for your help.
estelle
Oleg Chernavin 12/21/2006 09:47 am
OK. I created a Project for you. Please find it using the Tools menu - Published Projects - Finance section.

Oleg.
Estelle 01/09/2007 05:08 am
Hi Oleg,

Thanks for creating a project for me, but it still doesn't work with this project!
For example, the article from the source "at CNNMoney.com" is not downloaded.
Do you have any idea what the problem is?
Thanks a lot.

estelle
Oleg Chernavin 01/09/2007 05:35 am
Can you give its direct URL to me? Or add the corresponding server to the keywords list?

Oleg.
Estelle 01/09/2007 05:54 am
The URL is :

http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote


And you have put keywords like these in the project:
/finance/industry/news/latestnews/
/finance/external/
/news/companies/

I don't know what other keywords we can use to make the download work!
Thanks for your help.

estelle

Oleg Chernavin 01/09/2007 06:24 am
Does the server list include:

money.cnn.com
us.rd.yahoo.com

?

Oleg.
Estelle 01/09/2007 07:57 am
Yes.
Oleg Chernavin 01/09/2007 08:08 am
Strange. Can you please open the Log window (Ctrl+W), set the filters to show Rejected URLs, and enable logging? Then start downloading the Project and check the log: it should show every URL that was rejected from the download and the exact reason.

Oleg.
Estelle 01/09/2007 09:30 am
The URL is not rejected:

HTTP8 - 09/01/2007 15:00:35 - Transferring data from http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.

HTTP8 - 09/01/2007 15:00:35 - HTTP/1.0 302 Moved Temporarily
HTTP8 - 09/01/2007 15:00:35 - 274 bytes of http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.

HTTP8 - 09/01/2007 15:00:35 - Download complete. Status: 302 Object Moved.
HTTP8 - 09/01/2007 15:00:35 - Delay 1 seconds before http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.


Do you understand these log messages?
estelle
Oleg Chernavin 01/09/2007 09:39 am
This is normal: the server has moved the URL to another address, and Offline Explorer is about to load it. What happens next, after the 1 second delay?
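
If the log is long, the next hops can be pulled out mechanically. The Python sketch below assumes the line layout of the log excerpt quoted above (a "Status: 302" line followed by a "Delay N seconds before URL." line). The function name next_hops is invented for this example, and other Offline Explorer versions may word these lines differently.

```python
import re

# Matches the "Delay N seconds before <URL>." lines seen in the log above;
# the optional trailing period is left out of the captured URL.
DELAY = re.compile(r"Delay \d+ seconds before (\S+?)\.?$")


def next_hops(log_lines):
    """Collect the URLs queued right after a 302 response."""
    hops = []
    saw_redirect = False
    for line in log_lines:
        line = line.strip()
        if "Status: 302" in line:
            saw_redirect = True
            continue
        m = DELAY.search(line)
        if m and saw_redirect:
            hops.append(m.group(1))
            saw_redirect = False
    return hops


log = [
    "HTTP8 - 09/01/2007 15:00:35 - Download complete. Status: 302 Object Moved.",
    "HTTP8 - 09/01/2007 15:00:35 - Delay 1 seconds before "
    "http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/"
    "*http://money.cnn.com/2007/01/08/news/companies/"
    "schering_plough.reut/index.htm?source=yahoo_quote.",
]
print(next_hops(log))
```

A hop that shows up here but never reaches a "200" status later in the log is the one to check against the filters.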

Oleg.
Estelle 01/09/2007 10:56 am
Nothing really interesting :

HTTP8 - 09/01/2007 15:00:36 - Connecting to Proxy server...
HTTP8 - 09/01/2007 15:00:36 - Host us.rd.yahoo.com connected. Waiting for http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote.

HTTP8 - 09/01/2007 15:00:36 - GET http://us.rd.yahoo.com/finance/external/cnnm/SIG=12fm684aq/*http://money.cnn.com/2007/01/08/news/companies/schering_plough.reut/index.htm?source=yahoo_quote HTTP/1.0

HTTP8 - 09/01/2007 15:00:37 - Download complete. Status: 302 Object Moved.


?

estelle
Oleg Chernavin 01/09/2007 11:08 am
Can you post your Project settings here? Select the Project, click the Copy button on the toolbar, and paste the result into a forum message.

Oleg.
Estelle 01/09/2007 11:16 am
OK

[Object]
OEVersion=Pro 4.5.0.2532
Type=0
IID=62233
Caption=http://biz.yahoo.com/ic/news/510.html
URL=http://biz.yahoo.com/ic/news/510.html
Additional=AutoExport=C:\testoffline\;11000011010;DeleteProjectFiles
Lev=1
Weekday=257
LimTSize=10000
LimNumber=5000
LimTime=100
FMGroup=2
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwf
FTVideo.Exts=mpgavianimpegmovfliflcvivrmramrvasfasxwmvm1vm2vvob
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaape
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejar
FTUDef.Exts=jscssssivbsdtdxslswf
FTText.B=ooxooo
FTImages.B=xoxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0
RSrvsBx=2
RSrvsIn=biz.yahoo.commoney.cnn.comus.rd.yahoo.com xxx
RPathBx=2
RPathIn=/finance/industry/news/latestnews/ /finance/external//news/companies/ http://biz.yahoo.com/*http://*wsj.com/*http://*thestreet.com/*http://*marketwatch.com/*/finance/industry/ xxxxooox
RProt=127
LastStart=137:19:15:1:116:22:227:64:
LastEnd=6:147:225:3:116:22:227:64:
S200=45
S304=2
SPar=47
SSav=45
SLast=302
SSiz=1676045
SMdf=45
LFiles=47
LSize=1686413
Flags=1
ImgDim=0,0,0,0
PrevURL=http://biz.yahoo.com/ic/news/510.html
IPAddr=1723860162
Exported=09/01/2007 15:00:40 - C:\testoffline\
Oleg Chernavin 01/09/2007 11:37 am
I found it. The keyword /news/companies/ has a space character at the end. Please remove the space and it will work.
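
To show why one trailing space matters, here is a small Python sketch. It assumes that a directory keyword is matched as a plain substring of the URL, which is a simplification and not necessarily how Offline Explorer implements its filters. The names matches and find_padded are invented for this example.

```python
def matches(url, keywords):
    """Return True if any keyword occurs in the URL (assumed substring match)."""
    return any(k in url for k in keywords)


def find_padded(keywords):
    """Return the keywords that carry leading or trailing whitespace."""
    return [k for k in keywords if k != k.strip()]


url = ("http://money.cnn.com/2007/01/08/news/companies/"
       "schering_plough.reut/index.htm?source=yahoo_quote")

bad = ["/news/companies/ "]            # trailing space, as in the pasted Project
good = [k.strip() for k in bad]        # same keyword with the space removed

print(matches(url, bad))    # False: the URL has no space after the slash
print(matches(url, good))   # True
print(find_padded(bad))     # ['/news/companies/ ']
```

Running a check like find_padded over a keyword list before pasting it into URL Filters would catch this kind of invisible character.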

Oleg.