How to download search results at www.mobile.de?
|Matthias Meisner||09/11/2004 12:46 pm|
I'm trying to download search results (that is, the overview and detail pages) at www.mobile.de (for instance, search results for "Audi Allroad").
I've entered the following parameters under "URL Filters":
- "Load only the starting protocol"
- "Load files only within the starting Server"
- "Custom directories configuration" selected; "http://www.mobile.de/sid*/cgi-bin/" entered within the input field of "View included directories keywords"
- "Custom filenames configuration" selected; keywords "da.pl" and "searchpublic.pl" entered within the input field of "View included files keywords"
The result is that OE downloads only one page (i.e. the first overview page of the search).
What did I do wrong? Or isn't OE able to download the requested pages?
I would greatly appreciate it if someone could tell me the solution.
|09/13/2004 03:19 am|
|> Or isn't OE able to download the requested pages?
OEP can download the requested pages. But in most cases you will get multiple identical pages (two, three, or more copies; the more result pages your search returns, the more duplicates). This is caused by the SID in the directory structure:
Perhaps there is a way to avoid this. I had no luck in my first tests with URL Substitutes, Cookies, ...
I wonder if Oleg can find a way to easily avoid the multiple (nearly) identical page downloads?
@ Oleg: A good starting point for tests would be a search request with 2 or 3 result pages, e.g.:
But let's get back to your question, Matthias.
Try the following to search for your car (PKW):
Click on "Suchen"
Modell: Audi Allroad
Hold CTRL+ALT while clicking on "Suche starten"
OEP will create a new Project, something like:
(I've cut out the SID-directory: http://www.mobile.de/SID*/ )
This is the starting URL.
The other settings:
No "Level limit"
Load using URL filters settings
Add the following keyword to the Extensions list: bild
Load from any site
Load only from the starting server
Load using URL filters settings
Load all protocols
Load files only within the starting Server
Load files from all directories
Custom filenames configuration
View included files keywords:
I wouldn't recommend downloading a search result with more than 5 result pages (120 cars), because you will get a lot of junk (multiple nearly identical pages with different SIDs).
(Note: Some links won't be usable offline; use BACK and FORWARD instead.)
I would also be glad if Oleg finds a better setting, without getting those junk pages... ;-)
I'm sure that there are a lot more sites around with those /SID*/ directories.
|Oleg Chernavin||09/13/2004 07:40 am|
|Thank you for helping!
I think that the only way to get rid of the SID* directory is to use URL Substitutes.
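To illustrate the idea behind such a substitution (this is not Offline Explorer's own syntax or implementation), here is a minimal Python sketch that removes the session directory from a URL and uses the result to spot duplicates. It assumes that the SID is always the first path segment and starts with "sid" in any case; the example URLs and the helper names `strip_sid` and `unique_by_sid` are made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the session ID is the first path segment and starts with "sid"
# (upper or lower case), as in http://www.mobile.de/SID*/ from this thread.
# Caveat: this would also strip an unrelated first segment such as "/sidebar/".
SID_SEGMENT = re.compile(r"^/sid[^/]*/", re.IGNORECASE)


def strip_sid(url: str) -> str:
    """Return the URL with a leading /SID.../ path segment removed."""
    parts = urlsplit(url)
    path = SID_SEGMENT.sub("/", parts.path, count=1)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def unique_by_sid(urls):
    """Keep the first URL of each group that differs only in its SID segment."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_sid(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    # Hypothetical URLs; the real SID values on the site look different.
    sample = [
        "http://www.mobile.de/SIDaaa111/cgi-bin/da.pl?id=1",
        "http://www.mobile.de/SIDbbb222/cgi-bin/da.pl?id=1",
        "http://www.mobile.de/SIDaaa111/cgi-bin/da.pl?id=2",
    ]
    for url in unique_by_sid(sample):
        print(url)
```

Whether the server still answers requests once the SID has been removed is a separate question that this sketch does not address.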
|09/13/2004 04:39 pm|
|> Thank you for helping!
No problem, although the settings don't give really good results (duplicates)...
> I think that the only way to get rid of the SID* directory is to use URL Substitutes.
Most files are stored on disk as "sid..." under the topmost directory www.mobile.de.
It's easy to avoid downloading links whose SID differs from the one in the starting URL, or to cut off the SID-directory completely. But then some files would be missing. The download process on the site produces many different temporary SID-directories (www.mobile.de/SID*/*). I can't find a way to download the search overview completely with only one SID.
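To see how many near-duplicates a download has produced, the stored files can be grouped by their path below the SID-directory. The following Python sketch assumes the layout described above (directories whose names start with "sid" directly under the www.mobile.de download folder); the function name and the folder path in the usage line are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_sid_free_path(root):
    """Map each path below a sid* directory to all files stored under it.

    Only paths that occur under more than one sid* directory are returned,
    since those are the candidates for (nearly) identical pages.
    """
    groups = defaultdict(list)
    root = Path(root)
    for sid_dir in sorted(root.iterdir()):
        # Assumption: session directories start with "sid" (any case).
        if not (sid_dir.is_dir() and sid_dir.name.lower().startswith("sid")):
            continue
        for file in sorted(sid_dir.rglob("*")):
            if file.is_file():
                key = file.relative_to(sid_dir).as_posix()
                groups[key].append(file)
    return {key: files for key, files in groups.items() if len(files) > 1}
```

Usage would look like `group_by_sid_free_path("Download/www.mobile.de")`, where the folder name depends on the local project settings. The pages are only nearly identical, so this groups by path rather than by file content.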
But there is another issue:
The Export process doesn't convert links like the following correctly:
Stored on disk (before and after the Export process):
The original URL was something like:
The only way you will be able to see the additional pictures (Details ansehen... Weitere Bilder, above the car image):
Filenames format: Keep as-is
Uncheck "Use standard extensions for files with no file types"
|Oleg Chernavin||09/14/2004 06:20 am|
|Yes, I am afraid it is almost impossible to convert such scripted links correctly. Sorry.
|Matthias Meisner||09/16/2004 02:17 pm|
|Hello Anonymous and Oleg!
Thanks for your answers.
Hello Anonymous, I tested the configuration described in your first reply, and it worked in principle (but there are many junk pages, as you mentioned).
I'll raise the question of how to suppress the download of these irrelevant pages in a separate thread.