How can I avoid downloading multiple search results of a car selection at www.mobile.de?

Author Message
Matthias Meisner 09/17/2004 06:02 am
Hello,

While downloading survey and detail pages of a car selection at www.mobile.de (for instance, "Audi Allroad"), I can't find a way to suppress the download of duplicates of these pages with OEE.

Therefore I have the following question:

Can OEE be configured in such a way that one can download only one instance of all search survey and all detail pages of a car selection at www.mobile.de?

(I think the solution to this problem (if one exists at all) will involve a script and perhaps TextPipePro, to analyze the links that appear in the HTML code of the currently loaded page. Probably OEE would download one document, TextPipePro would analyze all links in the source code and pass the results back to OEE, which would then download the next document based on those results, and so on.)
Oleg Chernavin 09/18/2004 02:54 am
I think it should be easy to do, but I cannot read German, so could you please explain in more detail which links/sections you want and which you do not?

Thank you!

Best regards,
Oleg Chernavin
MP Staff
Matthias Meisner 09/20/2004 10:36 am
> I think it should be easy to do, but I cannot read German, so could you please explain in more detail which links/sections you want and which you do not?
>
> Thank you!
>
> Best regards,
> Oleg Chernavin
> MP Staff

Hello Oleg!

I want to download the first survey page of a search for "Audi Allroad" (its URL is

http://www.mobile.de/SIDK5b2oFXitxVsndwoWFuYlA-t-vaNexlCsAsCsK%F3P~BmSB10LsearchPublicJ1095683355A1LsearchPublicD1100CCarS-t-vpLtt~BmPA1B20A0/cgi-bin/searchPublic.pl?_form=search&sr_make=1900&sr_model=Allroad&sr_priceFrom=-2&sr_priceTo=-2&sr_category=1100&sr_powerRange=-2&sr_registrationDateFrom=-2&sr_registrationDateTo=-2&sr_mileageFrom=-2&sr_mileageTo=-2&sr_engineType=-2&sr_doorCount=-2&sr_color=-2&sr_country=-2&sr_zip=&sr_zipRadiusTo=-2&sr_damaged=-2&sr_daysOldTo=-2&sr_sortOrder=0&doSearch.x=34&doSearch.y=9)

and all survey pages that are linked from this page (you load these pages by clicking consecutively on the page numbers "1", "2", "3", ..., which are in the upper half of the first survey page).
Currently there are 749 cars of the type "Audi Allroad" in the database of www.mobile.de, so altogether there are 38 survey pages to load.
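
The page count above can be checked with a short calculation. This is only a sketch: the page size of 20 results per survey page is an assumption (it is not stated in the thread, although it is consistent with 749 cars giving 38 pages), and `survey_page_count` is a hypothetical helper name.

```python
import math


def survey_page_count(total_cars: int, cars_per_page: int = 20) -> int:
    """Number of survey pages needed to list all cars.

    cars_per_page=20 is an assumed page size, not a documented value.
    """
    return math.ceil(total_cars / cars_per_page)


print(survey_page_count(749))
```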

In addition to this, I also want to download the detail page for each of the currently 749 cars. You load a detail page by clicking the "Details ansehen" ("View details") link of a listed car (a URL of a detail page is for example

http://www.mobile.de/SIDw0dkqiNrcFpG.TbDHbkSmQ-t-vaNexlCsAsCsK%F3P%F3R~BmSB10LsearchPublicJ1095683355A1LsearchPublicD1100CCarY-t-vctpLtt~BmPA1A1B20C749%81%40-t-vCaMkMoSm_X_Y_x_ysO~BSRA6D1100D1900GAllroadGALLROADA0A0A0A0A0/cgi-bin/da.pl?bereich=pkw&id=11111111144084132&top=8&

(this URL points to the detail page of the car listed in 8th position on the survey page)).

And here is the problem: I want to download all these survey and detail pages only once (that is, I want to avoid downloading duplicates of these pages). How is this possible with OEE?

(I can give you some background information about what must be done to avoid multiple downloads:

1. Every detail page contains a link "eMail an den Anbieter" ("E-mail the seller"). If you click on this link, the same detail page is loaded again, but with a different Session ID than before. Therefore the problem of downloading detail pages only once would be solved if the Session ID could be cut out of the URL (that is, if detail pages could be stored like this on the hard disk:

www.mobile.de\cgi-bin\da.pl?bereich=pkw&id=11111111144084132&top=8&
)

I tried to use URL Substitutes, as you and Anonymous suggested in reply to my earlier question. I wasn't successful: the detail pages were always stored with Session ID names "SID...." directly underneath the directory www.mobile.de!

(I used several variations of the following configuration within the "URL Substitutes" window:

a) Input of "http://www.mobile.de/sid*/cgi-bin/da.pl?*" within the "URL" text field;

b) Input of "sid*/" within the "Replace:" field;

c) Input of "" within the "With:" field;

d) I left the resulting item and the "Apply all matching rules" checkbox unchecked.)

How is it possible to cut the SID segment out of the URL address?

2. The task to avoid downloading multiple instances of survey pages is more complex:

a) In analogy to Step 1, the SID segment should be cut out of the URL address (so that a survey page is stored like this on the hard disk:

www.mobile.de\cgi-bin\searchPublic.pl?bereich=pkw&top=21&
)

b) Every link within the range of survey pages that refers back to the first survey page must be suppressed (this is because the URL of a reference from survey page x (with x > 1) to survey page 1 looks like this:

http://www.mobile.de/sid*/cgi-bin/searchPublic.pl?bereich=pkw&top=1&

whereas the URL of the "original" first survey page is

http://www.mobile.de/sid*/cgi-bin/searchPublic.pl?_form=search&sr_make=1900&sr_model=Allroad&sr_priceFrom=-2&sr_priceTo=-2&s
Oleg Chernavin 09/20/2004 10:47 am
I would suggest the following: run your search in the Internal Browser of Offline Explorer. Take the URL of the first Search Results page and create a new Project with that URL.

Now uncheck the Level setting and go to URL Filters | Filename. Select Custom configuration and add the following keywords to the Included list:

searchPublic.pl
da.pl

Click the OK button and try to download the Project.

Also, I would recommend not removing the SID number, because the site uses it to show the search results.

Oleg.
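
The effect of the Filename filter described above can be illustrated outside Offline Explorer with a minimal Python sketch. It assumes the filter keeps a link when the file name part of its URL contains one of the Included keywords, compared case-insensitively; the actual matching rules of Offline Explorer are not confirmed here, and `passes_filename_filter` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Keywords from the Included list suggested above.
INCLUDED_KEYWORDS = ("searchPublic.pl", "da.pl")


def passes_filename_filter(url: str, keywords=INCLUDED_KEYWORDS) -> bool:
    """Return True if the file name part of the URL contains a keyword.

    Assumption: substring match, case-insensitive. This only mimics the
    idea of the filter, not Offline Explorer's real implementation.
    """
    path = urlsplit(url).path              # query string is not part of the path
    filename = path.rsplit("/", 1)[-1]     # last path segment
    return any(k.lower() in filename.lower() for k in keywords)


# Example URLs with a made-up, shortened SID segment.
print(passes_filename_filter("http://www.mobile.de/SIDabc/cgi-bin/da.pl?bereich=pkw&id=1&top=8&"))
print(passes_filename_filter("http://www.mobile.de/SIDabc/cgi-bin/help.pl"))
```

With such a filter only survey pages (searchPublic.pl) and detail pages (da.pl) are followed, but the same page reached under two different SID values still counts as two different URLs.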

Matthias Meisner 09/20/2004 07:51 pm
> I would suggest the following: run your search in the Internal Browser of Offline Explorer. Take the URL of the first Search Results page and create a new Project with that URL.
>
> Now uncheck the Level setting and go to URL Filters | Filename. Select Custom configuration and add the following keywords to the Included list:
>
> searchPublic.pl
> da.pl
>
> Click the OK button and try to download the Project.
>
> Oleg.

Hello Oleg!

This doesn't solve the problem. For instance, if I have a search result that contains 15 survey pages, the number of survey and detail pages that will be downloaded is 30 times higher than needed. This is not what I want!

Don't you know the answer to my question? I think there must be a way to cut the SID segment out of the URL address while downloading the files.

> Also, I would recommend not removing the SID number, because the site uses it to show the search results.

This doesn't matter. I'll parse the results with TextPipePro.


Matthias
Oleg Chernavin 09/21/2004 07:46 am
> This doesn't solve the problem. For instance, if I have a search result that contains 15 survey pages, the number of survey and detail pages that will be downloaded is 30 times higher than needed. This is not what I want!

But what do you want? The above settings should load all search result pages and all detail pages of the search.

Does OE load anything else?

> Don't you know the answer to my question? I think there must be a way to cut the SID segment out of the URL address while downloading the files.

You can only use URL Substitutes to remove it:

URL: *
Replace: /SID*/
With: (keep this field empty)

Oleg.
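
The idea behind this rule, and why it removes duplicates, can be sketched in Python. This is an illustration only, not Offline Explorer's implementation: the regular expression is an assumed equivalent of the wildcard /SID*/, the helper names `strip_sid` and `unique_pages` are hypothetical, and the sketch replaces the segment with a single slash (rather than with nothing) so that host and path stay separated.

```python
import re

# Assumed equivalent of the wildcard "/SID*/": one path segment starting
# with "SID" (matched case-insensitively), up to the next slash.
SID_SEGMENT = re.compile(r"/SID[^/]*/", re.IGNORECASE)


def strip_sid(url: str) -> str:
    """Remove the first session ID segment from a URL."""
    return SID_SEGMENT.sub("/", url, count=1)


def unique_pages(urls):
    """Keep the first URL for each page, comparing URLs without their SID."""
    seen = set()
    result = []
    for url in urls:
        key = strip_sid(url)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


# Two links to the same detail page with made-up, shortened SIDs.
links = [
    "http://www.mobile.de/SIDaaa/cgi-bin/da.pl?bereich=pkw&id=11111111144084132&top=8&",
    "http://www.mobile.de/SIDbbb/cgi-bin/da.pl?bereich=pkw&id=11111111144084132&top=8&",
]
print(strip_sid(links[0]))
print(len(unique_pages(links)))
```

Once the SID segment is gone, both links map to the same address, so the page is stored only once. Whether the site still serves the page for a URL without the SID is a separate question, as noted earlier in the thread.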