|AlexBaldwin||02/23/2017 09:25 am|
|On a side note: Damn, Customer service is fast!
So I had a few inquiries about filters.
I think there should be an exclude wildcard, so that I could write a filter that matches any two characters except ab or ef.
Unless there's a workaround I didn't see?
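The idea above can be sketched with a regular expression negative lookahead, which already expresses "match anything except these strings" (a hypothetical illustration of the requested semantics, not Offline Explorer's actual filter syntax):

```python
import re

def matches_except(s):
    # Match any two characters EXCEPT the exact strings "ab" or "ef".
    # The negative lookahead (?!ab|ef) rejects those two before the
    # two-character wildcard ".." is tried.
    return re.fullmatch(r'(?!ab|ef)..', s) is not None
```

For example, `matches_except("cd")` is true while `matches_except("ab")` is false.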
I needed something similar in one of my projects:
A website had pages classed into categories; the URLs looked like special:ancientpages or special:users.
I wanted to download the whole website while keeping only the categories I wanted.
So a filter like special:[#analysis:#Theory] would exclude all the categories except analysis and theory.
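One way to read the proposed special:[#analysis:#Theory] filter is as an exclusion rule that drops every special: page whose category is not analysis or theory. A small sketch of that reading (hypothetical semantics and example URLs; not Offline Explorer's actual syntax):

```python
import re

# Exclude any special: page whose category is NOT analysis or theory.
EXCLUDED = re.compile(r'special:(?!analysis$|theory$)[^/]+$', re.IGNORECASE)

def should_download(url):
    # True when the URL is either a normal page or one of the two
    # allowed special: categories.
    return EXCLUDED.search(url) is None
```

With this, special:analysis and special:Theory pass, while special:users and special:ancientpages are excluded.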
Sometimes when you download a website, you end up with copies of the same page saved as page.html, page-1.html, page-2.html, and so on. I tried to use URL substitutes in the parsing to no avail.
The only way I "managed" the problem was to create a separate substitute for each number from 1 to n:
-1.html replaced with .html, then -2.html replaced with .html, and so on, until I felt I had enough (I stopped at 5).
|Oleg Chernavin||02/23/2017 09:38 am|
|1. The regexp support is quite limited there. What if you used the Included list instead? Specify analysis and theory, and everything else will be excluded.
2. You could use a substitutes rule that handles all the numbers at once.
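A single regular-expression substitute can collapse every numbered duplicate back to the original file name, instead of one rule per number (a hypothetical Python illustration; Offline Explorer's own substitute syntax may differ):

```python
import re

def canonical(filename):
    # Strip a trailing "-<digits>" just before the .html extension:
    # "page-3.html" -> "page.html"; "page.html" is left unchanged.
    return re.sub(r'-\d+\.html$', '.html', filename)
```

Note the caveat: a page whose real name happens to end in a number, such as chapter-2.html, would also be rewritten, so such a rule should only be applied when the duplicates are known to be copies.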
However, it looks strange to me that it downloads such copies. Could it be because of links to such files on the site? Can you give me the site URL and let me know on which pages I can see such links?
|AlexBaldwin||02/24/2017 01:38 pm|
|Hi, the website I am trying to download is http://artofproblemsolving.com/wiki/index.php/Main_Page
The pages are generated by PHP, so I'm not sure how it ends up producing the -x links.
|Oleg Chernavin||02/26/2017 05:23 pm|
|I downloaded the site with Level=2 and didn't see such links. I found many like:
But those are correct and lead to different, valuable content.