Repeatable words in URL

Author Message
Ananta 07/06/2012 08:41 am
Hi,

I just want to know how to filter this kind of URL: 'http://www.site.com/area/city-2713//area/city-2713/'

The URL repeats the part '/area/city-2713' as '//area/city-2713/'; the only difference is that the second copy starts with // instead of /.

My question is how to filter this so OE captures only http://www.site.com/area/city-2713 and not http://www.site.com/area/city-2713//area/city-2713/, because both lead to the same page.

Also, OE captures every URL like 'http://www.site.com/area/city-2713//area/city-2713/' three times and doesn't seem to detect them as duplicates.
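
For example, a small script like this (outside of OE, just to show why a crawler sees these as two different addresses; the normalize() helper is made up for illustration and is not an OE feature):

import re
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    # Collapse runs of slashes and strip a repeated trailing path chunk.
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)     # '//' -> '/'
    path = re.sub(r"(/.+)\1/?$", r"\1", path)    # '/x/y/x/y/' -> '/x/y'
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(normalize("http://www.site.com/area/city-2713//area/city-2713/"))
# http://www.site.com/area/city-2713
print(normalize("http://www.site.com/area/city-2713"))
# http://www.site.com/area/city-2713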


By the way, I also want to say thank you for the previous answer, and I must say this company's support is legendary.
Oleg Chernavin 07/06/2012 09:42 am
Thank you very much for the kind words! You may use the Project Properties dialog - URL Filters - Directory section and add the keyword to the Excluded list:

/area/*/area/

I think it should be enough.
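
Roughly, that Excluded keyword acts like a wildcard match on the URL, something like this (a sketch of the idea only; OE's exact matching rules may differ):

import re

# '/area/*/area/' treated as a wildcard: '*' matches any characters.
exclude = re.compile(r"/area/.*/area/")

for url in ("http://www.site.com/area/city-2713",
            "http://www.site.com/area/city-2713//area/city-2713/"):
    print(url, "-> excluded" if exclude.search(url) else "-> kept")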

Best regards,
Oleg Chernavin
MP Staff
Ananta 07/06/2012 10:14 am
Yes, it works.

Another one: how do I filter this URL: http://www.site.com/area/home-city-new-755/home-city-new-755/

I just want to save http://www.site.com/area/home-city-new-755, not http://www.site.com/area/home-city-new-755/home-city-new-755/

Oleg Chernavin 07/06/2012 10:31 am
What about:

home-city-*home-city-

?

Also, please try to uncheck the "Suppress server errors" box in the Properties dialog - Parsing section. Maybe this will help to get rid of such weird URLs altogether.

Oleg.
Ananta 07/06/2012 10:55 am
I think using the filter home-city-*home-city- for http://www.site.com/area/home-city-new-755/home-city-new-755/ can't be done, because home-city-new is a dynamic part of the URL; I only used it as an example.
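
Since the repeated part changes from page to page, a fixed keyword can't cover every case; a generic check (outside of OE, purely to illustrate the pattern; the looks_duplicated() helper is hypothetical) would have to compare a path chunk against itself, for example:

import re
from urllib.parse import urlsplit

repeated_tail = re.compile(r"(/[^/].*)\1/?$")  # any trailing chunk repeated twice

def looks_duplicated(url):
    return bool(repeated_tail.search(urlsplit(url).path))

print(looks_duplicated("http://www.site.com/area/home-city-new-755/home-city-new-755/"))  # True
print(looks_duplicated("http://www.site.com/area/home-city-new-755"))                     # False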

I already suspended the project to a file, and it generated 600,000+ queued URLs with many strange URLs inside.

Do you think it's better for me to start from the beginning with "Suppress server errors" unchecked, or can I resume from the file with "Suppress server errors" unchecked, so that OE automatically filters out all the strange URLs inside the .wdqh file, or is there another way?
Oleg Chernavin 07/06/2012 11:55 am
Yes, resume from the file. It should try each of these links and abandon them if the server returns an error.

Oleg.
Ananta 07/06/2012 11:58 pm
I decided to use a file extension filter so OE only saves aspx and jpg files, because when resuming from the file it was crawling more than a million URLs, many of them duplicates.
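
The idea of the extension filter, roughly (the actual filtering happens in OE's File Filters; the example URL ending in page.aspx below is made up):

from urllib.parse import urlsplit

keep = (".aspx", ".jpg")

def should_save(url):
    # Keep only URLs whose path ends with one of the wanted extensions.
    return urlsplit(url).path.lower().endswith(keep)

print(should_save("http://www.site.com/area/page.aspx"))                             # True
print(should_save("http://www.site.com/area/home-city-new-755/home-city-new-755/"))  # False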

What I mean by duplicates:
http://www.site.com/area/home-city-new-755/home-city-new-755/
http://www.site.com/area/home-city-new-755/

These two URLs lead to the same page, but when I open http://www.site.com/area/home-city-new-755/home-city-new-755/ the page has no CSS/template; it is just a jumble of text without borders, graphics, etc.

Another question: last time, when you gave me the exe fix for the restore function, it worked faster with 300,000+ files, but when I tried it on a boe file with 600,000+ files the speed dropped by half.

So I just want to know: does restore speed really depend on how many files are inside (and maybe other factors), or can you make a fix so that the restore speed stays the same no matter how many files are inside?
Oleg Chernavin 07/08/2012 10:57 am
I will think on how to improve this further. So far, I have no better ideas, but I will keep working on this.

Oleg.