How to avoid multiple instances of search results at www.mobile.de?
|Matthias Meisner||09/16/2004 02:41 pm|
While downloading survey and detail pages of a car selection at www.mobile.de (for instance, "Audi Allroad"), I can't find a way to suppress the download of duplicates of these pages with OEP.
Therefore I have the following question:
Can OEP (or Offline Explorer Enterprise) be configured so that one can download only one instance of all search survey and all detail pages of a car selection at www.mobile.de?
(I think the solution of this problem (if any exists at all) will involve the usage of a programming script and maybe the usage of TextPipePro (to analyze the links which appear in the HTML code of the currently loaded page). Probably OEP will download one document, TextPipePro analyzes all links in the source code and passes the results back again to OEP, which downloads the next document based on the results of the analyzing process, and so on.
Is this possible?)
|Oleg Chernavin||09/18/2004 02:54 am|
|I think it should be easy to do, but I cannot read German, so could you please explain more clearly which links/sections you want and which you do not?
|Matthias Meisner||09/20/2004 10:28 am|
|> I think it should be easy to do, but I cannot read German, so could you please explain more clearly which links/sections you want and which you do not?
> Thank you!
> Best regards,
> Oleg Chernavin
> MP Staff
I want to download the first survey page of a search for "Audi Allroad" (its URL is
and all survey pages that are linked to this page (you load all these pages if you click consecutively on the page numbers "1", "2", "3", ..., which are in the upper half of the first survey page).
Currently 749 cars of the type "Audi Allroad" are found in the database of www.mobile.de, so altogether there are 38 survey pages to load.
In addition, I also want to download the detail page for each of the currently 749 cars. You load a detail page by clicking on the link "Details ansehen" for a listed car (a URL of a detail page is for example
(this URL points to the detail page of the car, which is listed on the 8th position in the survey page)).
Now to the main point: I want to download all these survey and detail pages only once (that is, I want to avoid downloading duplicates of these pages). How is this possible with OEP?
(I can give you some background information about what must be done to avoid multiple downloads:
1. Every detail page contains a link "eMail an den Anbieter". If you click on this link, the same detail page will be loaded again (but with a different Session ID than the one before). Therefore it would solve the problem of downloading detail pages only once if the Session ID could be cut out of the URL (that is, if detail pages could be stored like this on the hard disk:
I tried to use URL Substitutes, as you and Anonymous suggested in reply to my earlier question. I wasn't successful: the detail pages were always stored with Session ID names "SID...." directly underneath the directory "www.mobile.de\"!
(I used several variations of the following configuration within the "URL Substitutes" window:
a) Input of "http://www.mobile.de/sid*/cgi-bin/da.pl?*" within the "URL" text field;
b) Input of "sid*/" within the "Replace:" field;
c) Input of "" within the "With:" field
d) I left the resulting item and the checkbox "Apply all matching rules" unchecked.)
How is it possible to cut the SID segment out of the URL address?
2. The task to avoid downloading multiple instances of survey pages is more complex:
a) In analogy to Step 1, the SID segment should be cut out of the URL address (so that a survey page is stored like this on the hard disk:
b) Every link within the range of survey pages that refers to the first survey page must be suppressed (this is because the URL of a reference from survey page x (with x > 1) to survey page 1 looks like this:
whereas the URL of the "original" first survey page is
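
The idea in point 1 (treating URLs that differ only in their Session ID as the same page) can be sketched outside OEP in a few lines of Python. This is only an illustration, not an Offline Explorer feature: it assumes the session ID is the first path segment and starts with "sid", as in the pattern "http://www.mobile.de/sid*/cgi-bin/da.pl?*" above. The function names and the example URLs in the usage note are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the session ID is the first path segment and starts with "sid"
# (any case), matching the "sid*/" part of the pattern quoted in the post.
_SID_SEGMENT = re.compile(r"^/sid[^/]*(?=/)", re.IGNORECASE)


def strip_sid(url):
    """Return the URL with a leading SID path segment removed, if present."""
    parts = urlsplit(url)
    path = _SID_SEGMENT.sub("", parts.path, count=1)
    return urlunsplit(parts._replace(path=path))


def unique_pages(urls):
    """Yield each URL once, treating URLs that differ only by SID as equal."""
    seen = set()
    for url in urls:
        key = strip_sid(url)
        if key not in seen:
            seen.add(key)
            yield url
```

For example, with the made-up URLs "http://www.mobile.de/SIDabc123/cgi-bin/da.pl?id=42" and "http://www.mobile.de/SIDxyz789/cgi-bin/da.pl?id=42", `strip_sid` maps both to the same key, so `unique_pages` keeps only the first of them.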
|Oleg Chernavin||09/20/2004 10:39 am|
|I would suggest the following: make your search in the Internal Browser of Offline Explorer. Take the URL of the first Search Results page and create a new Project with that URL.
Now uncheck the Level setting and go to URL Filters | Filename. Select Custom configuration and add the following keywords to the Included list:
Click the OK button and try to download the Project.
Also, I would recommend you not to remove the SID number, because the site uses it to show the search results.
|Matthias Meisner||09/20/2004 07:45 pm|
|> I would suggest the following: make your search in the Internal Browser of Offline Explorer. Take the URL of the first Search Results page and create a new Project with that URL.
> Now uncheck the Level setting and go to URL Filters | Filename. Select Custom configuration and add the following keywords to the Included list:
> Click OK button and try to download the Project.
This doesn't solve the problem. For instance, if I have a search result that contains 15 survey pages, the number of survey and detail pages that will be downloaded is 30 times higher than needed. This is not what I want!
Don't you know the answer to my question? I think there must be a way to cut the SID segment out of the URL address while downloading the files.
> Also, I would recommend you not to remove the SID number, because the site uses it to show the search results.
This doesn't matter.
|Oleg Chernavin||09/21/2004 07:46 am|
|I replied to your message in another topic to avoid duplicates.
|Matthias Meisner||09/22/2004 12:43 pm|
|> I replied to your message in another topic to avoid duplicates.
Thanks for your answer!
Now I can download the search result pages only once (as desired)!
Even so, I don't understand the following situation:
I've tested your configuration on the car selection "Jaguar Daimler" because of the small test set (currently only 8 survey pages to download). I expected that the downloading process would cover only the 8 files which contain the search results, and that OEP would abort any attempt to download files other than these 8.
But this didn't happen! OEP downloaded 41 files instead of only 8! As mentioned before, though, there were only 8 files in the target directory. Therefore I guess that the 8 files were repeatedly overwritten with the 41 downloaded files (even though I instructed OEP to download only new and modified files and also activated the checkbox "Check file size").
Why were the 8 target files repeatedly overwritten while downloading 41 files? How can OEP be configured to abort the unnecessary download transactions?
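
The guess above (more downloads than files, because several URLs end up under the same stored name) can be pictured with a small Python tally. This is a sketch under stated assumptions and says nothing about how OEP works internally: the `drop_sid` rule merely stands in for a "sid*/" substitute, and the URLs are invented for illustration.

```python
import re
from collections import Counter


def drop_sid(url):
    """Hypothetical stand-in for a URL substitute that removes a 'sid.../' segment."""
    return re.sub(r"sid[^/]*/", "", url, count=1, flags=re.IGNORECASE)


def writes_per_file(fetched_urls, to_filename):
    """Count how many fetched URLs are stored under each target name."""
    return Counter(to_filename(url) for url in fetched_urls)


# Invented URLs: the same pages requested under different session IDs.
fetched = [
    "http://www.mobile.de/SID111/cgi-bin/da.pl?id=1",
    "http://www.mobile.de/SID222/cgi-bin/da.pl?id=1",
    "http://www.mobile.de/SID111/cgi-bin/da.pl?id=2",
    "http://www.mobile.de/SID333/cgi-bin/da.pl?id=1",
]

counts = writes_per_file(fetched, drop_sid)
print(len(fetched), "downloads ->", len(counts), "files")
for name, n in counts.items():
    print(n, "write(s):", name)
```

If the substitution is applied only when a file is saved, every distinct SID still triggers a fetch; a check against the substituted name before fetching would be needed to skip the repeats.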
(To make my statements transparent for you, here is my test configuration for "Jaguar Daimler":
a) Text within the input text field "Addresses (URLs):" (see "Project")
SetCookie=Dwww.mobile.deSESTEST=1; BSUID=1; Dwww.mobile.deFRQSTR=18264418,18264418,18264418,18264418; Dwww.mobile.deWIDYMD=#17432:DIV#; Dwww.mobile.deKIDYMD=#111363:DIVP#86193:DIVU#112847:DITA#109927:DITB#109939:DIPC#71482:DINA#111287:DINA#108238:DIMA#; POPUPCHECK=1095874127812; ASLTRG=28#24#.net#11165#344#.pool.mediaWays.net##de##-312#342#1228#0#2; Dwww.mobile.deVISKAM=#111363-61064#86193-60802#; Dwww.mobile.deVISWEB=#17432-119409-61064#
b) Radio button "Download only Modified and new files" selected;
c) Checkbox "Check file size" checked;
d) Uncheck all File Filters with the exception of "Text" and "Other";
e) Settings for URL Filters:
e.i) "Load all protocols"
e.ii) "Load files only within the starting Server"
e.iii) "Load files from all directories"
e.iv) Custom filenames configuration:
I) "View included files keywords":
II) "View excluded files keywords":
All items for "View included files keywords" and "View excluded files keywords" are marked.
f) URL Substitutes:
With: (empty field)
The checkbox in front of the URL "*" and the checkbox "Apply all matching rules" are both unmarked!
g) That's all!)
Oleg, thanks for your answer in advance!
P.S.: I'll describe my problems with downloading detail pages only once in a follow-up question!
|Oleg Chernavin||09/23/2004 09:04 am|
Can I ask you to send me your Project settings by e-mail? Please select the Project, click the Copy button on the toolbar, paste into Notepad, save it to a text file and send the file to me at email@example.com.