Downloading only certain filenames
|Pablo||12/14/2006 07:37 am|
|Hello, I wanted to know how to spider a complete site but download (store) only certain filenames, such as "news.php?Id=10000", "news.php?Id=10001", etc.
In this case they are news sites, where each news item has an Id number. I want to store only the news files, not all the others.
As I understand it, if I set filters to store only "news.php" filenames, the site will not be spidered correctly, because files like index.php and other similar ones will not be downloaded.
How can I do this?
|Oleg Chernavin||12/14/2006 07:45 am|
|Well, the URLs field of the Project supports the DeleteAfterParsing= command, but you would have to tell it to remove the files of almost all other combinations.
|Pablo||12/14/2006 08:09 am|
|So there is no other solution than inserting as many DeleteAfterParsing commands as there are kinds of files on the site?
Don't you think it would be useful to have a StoreOnlyFilenames=file command?
Another nice solution would be a "StoreOnlyIncludedFiles" command. In that case, you could specify the list of files to store in the "Included files" section, but ALL files would still be parsed.
I don't think my need is very unusual: on every site I download, I see many "glue" files with no interesting information in them. The interesting information is only in certain files like "filename?ID=xxxxx", where xxxxx is the news number...
Thank you very much,
> Well, the URLs field of the Project supports DeleteAfterParsing= command, but you will have to tell it to remove the files of almost all combinations.
> Best regards,
> Oleg Chernavin
> MP Staff
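For what it's worth, the behavior I am describing - parse every page for links, but store only the pages whose URL matches a pattern - can be sketched in a few lines of Python. The toy site, the crawl function, and the pattern here are purely illustrative, not actual commands of the program:

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(fetch, start, store_pattern):
    """Breadth-first crawl: parse every reachable page for links,
    but keep (store) only pages whose URL matches store_pattern."""
    pattern = re.compile(store_pattern)
    stored = {}
    seen = set()
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        if page is None:
            continue
        if pattern.search(url):
            stored[url] = page  # only the "news" pages are kept
        parser = LinkExtractor()
        parser.feed(page)       # glue pages are still parsed for links
        queue.extend(parser.links)
    return stored

# Toy site: index.php is only "glue"; the news pages carry the content.
site = {
    "index.php": '<a href="news.php?Id=10000">a</a> <a href="news.php?Id=10001">b</a>',
    "news.php?Id=10000": "<p>First story</p>",
    "news.php?Id=10001": "<p>Second story</p>",
}
result = crawl(site.get, "index.php", r"news\.php\?Id=\d+")
```

Here the glue page index.php is still downloaded and parsed so its links are found, but only the news.php?Id=... pages end up stored.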
|Oleg Chernavin||12/14/2006 08:32 am|
|I understand, but we haven't planned this yet.
|Pablo||12/14/2006 12:12 pm|
|I hope it comes in the near future...
> I understand, but we didn't plan this yet.
|Oleg Chernavin||12/14/2006 12:25 pm|
|Well, there is one other way - to use Content Filters - if these news.xxx pages contain some kind of unique words...
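The idea behind such a Content Filter - keep a page only when its text contains certain unique words - could be sketched like this (the page texts and keyword list are made up for illustration; this is not the program's actual filter):

```python
def content_filter(pages, required_words):
    """Keep only pages whose text contains at least one required word."""
    kept = {}
    for url, text in pages.items():
        lowered = text.lower()
        if any(word.lower() in lowered for word in required_words):
            kept[url] = text
    return kept

# Hypothetical downloaded pages: the glue page lacks the unique words.
pages = {
    "index.php": "Links to today's stories",
    "news.php?Id=10000": "Breaking: markets rally on strong earnings",
}
news_only = content_filter(pages, ["breaking", "earnings"])
```

This works as long as the news pages really do contain words that never appear on the glue pages.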