Trouble downloading Washington Post pages

Marc C 02/04/2004 01:28 am
Hello all,

My organization is trying to download opinion/editorials from the Washington Post such as this:

http://www.washingtonpost.com/ac2/wp-dyn?pagename=article&node=&contentId=A33312-2003Dec26&notFound=true

This is 1 level down from a Yahoo! Opinion & Editorial archive with a link like this:

http://us.rd.yahoo.com/dailynews/fc/World/mideast_conflict/opinion___editorials/SIG=1253nf48s/*http://www.washingtonpost.com/wp-dyn/articles/A33312-2003Dec26.html

As you can see, the Wash Post's web server converts the Yahoo! link to a dynamically generated page (I think), which doesn't seem to get followed by OE 2.9. My map for www.washingtonpost.com looks like this:

[ac2] (empty)
+[wp-adv] (advertisement stuff)
- [wp-dyn]
-[articles]
A10067-2002Oct10.html
A11135-2002Apr7.html
...
...
A9254-2002Jun6.html
+[opinion]
+[wp-srv]

As you can see, [ac2] never gets populated, even though that is ultimately the folder on the Wash Post's server where the HTML file resides.
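
For readers following along, here is a minimal Python sketch of how the Yahoo! redirect link quoted earlier appears to embed its target: the real Washington Post URL follows a "/*" marker. The helper name split_yahoo_redirect is made up for this illustration, and the "/*" convention is an assumption inferred from the one example link, not a documented Yahoo! format.

```python
from urllib.parse import urlsplit


def split_yahoo_redirect(url):
    """Split a redirect link into (tracking_prefix, target_url).

    Assumption: the target URL follows the first "/*" marker.
    Returns (url, None) when no such marker is present.
    """
    prefix, marker, target = url.partition("/*")
    if not marker:
        return url, None
    return prefix, target


# The redirect link from the post, split across lines for readability.
yahoo_link = (
    "http://us.rd.yahoo.com/dailynews/fc/World/mideast_conflict/"
    "opinion___editorials/SIG=1253nf48s/"
    "*http://www.washingtonpost.com/wp-dyn/articles/A33312-2003Dec26.html"
)

prefix, target = split_yahoo_redirect(yahoo_link)
parts = urlsplit(target)
print(parts.netloc)  # www.washingtonpost.com
print(parts.path)    # /wp-dyn/articles/A33312-2003Dec26.html
```

The extracted path is under /wp-dyn/articles/, which matches the folder that does get populated in the map; the /ac2/ address only shows up after the Washington Post server responds.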

Any help is much appreciated.

Regards,
Marc
Oleg Chernavin 02/04/2004 04:48 am
Marc,

I followed the second link, but Yahoo told me that there is no such page. Can you tell me a link to all Yahoo.com Opinions and Editorials?

Best regards,
Oleg Chernavin
MP Staff
Marc C 02/05/2004 02:22 pm
Oleg, please try:

http://story.news.yahoo.com/fc?tmpl=fc&cid=34&lp=1&ll=b1&pg=1&mod=opinion___editorials&in=world&cat=mideast_conflict_archive

and click on one of the Washington Post links.

Oleg Chernavin 02/06/2004 07:40 am
OK. You need to change the Project configuration: set Level to 2, because the Yahoo page links to a nonexistent page on washingtonpost.com, which then redirects to the actual article.

I would also suggest using URL Filters | Filename | Custom configuration to add two keywords to the Included filename keywords list:
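
To illustrate why the Level setting matters, here is a small Python sketch of a depth-limited crawl over a toy link graph. The graph, the node labels, and the reachable function are hypothetical, and the sketch assumes that a redirect counts as one extra level, which is how the advice above reads; it does not claim to reproduce the program's real crawl logic.

```python
from collections import deque

# Hypothetical, simplified link graph: each page maps to the pages it
# links to or redirects to.
LINKS = {
    "yahoo archive page": ["wp-dyn/articles page (linked from Yahoo)"],
    "wp-dyn/articles page (linked from Yahoo)": ["ac2 page (redirect target)"],
    "ac2 page (redirect target)": [],
}


def reachable(start, max_level):
    """Return {page: level} for every page within max_level hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if seen[page] == max_level:
            continue  # do not follow links beyond the level limit
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen[nxt] = seen[page] + 1
                queue.append(nxt)
    return seen


article = "ac2 page (redirect target)"
print(article in reachable("yahoo archive page", 1))  # False
print(article in reachable("yahoo archive page", 2))  # True
```

Under that assumption, a limit of 1 stops at the page Yahoo links to, and only a limit of 2 reaches the page that actually holds the article.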

http://*yahoo.com/*www.washingtonpost.com*/*
http://www.washingtonpost.com/*/*

This will filter exactly the pages you want to download.

Oleg.
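
As a rough check of what the two Included filename keywords above would let through, here is a Python sketch that treats them as shell-style wildcard patterns using the standard fnmatch module. This is only an approximation: the program's own matching rules may differ, the is_included helper is invented for this example, and the last test URL is hypothetical.

```python
from fnmatch import fnmatchcase

# The two keywords from the reply, treated here as shell-style wildcards
# (an assumption; "*" matches any run of characters, including "/").
INCLUDE_PATTERNS = [
    "http://*yahoo.com/*www.washingtonpost.com*/*",
    "http://www.washingtonpost.com/*/*",
]


def is_included(url):
    """Return True if the URL matches at least one include pattern."""
    return any(fnmatchcase(url, pattern) for pattern in INCLUDE_PATTERNS)


urls = [
    # Yahoo! redirect link from the first post
    "http://us.rd.yahoo.com/dailynews/fc/World/mideast_conflict/"
    "opinion___editorials/SIG=1253nf48s/"
    "*http://www.washingtonpost.com/wp-dyn/articles/A33312-2003Dec26.html",
    # Dynamically generated article page from the first post
    "http://www.washingtonpost.com/ac2/wp-dyn?pagename=article&node="
    "&contentId=A33312-2003Dec26&notFound=true",
    # Hypothetical top-level page, used only to show a non-match
    "http://www.washingtonpost.com/index.html",
]

for url in urls:
    print(is_included(url), url[:60])
# Prints True, True, False for the three URLs in order.
```

The second pattern requires at least one folder below the host name, so top-level pages on www.washingtonpost.com fall outside the filter in this approximation.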