RegExp in URL Substitutes?

Karsten 10/19/2004 01:10 pm
Hi Oleg

After playing around with OE Enterprise for a couple of days now, I have a problem when trying to download a PHP forum.

If I let OE download everything, it will end up with a queue of several million links, since there are a lot of filters and sorting options. One thread will be downloaded several times with different sort, search and prune options.

A solution that worked (partly) was to set up some URL substitutes.

For example, one forum page can look like:

http://www.someforum.com/forumdisplay.php?f=12&daysprune=1&sort=lastpost&order=desc
http://www.someforum.com/forumdisplay.php?f=12&daysprune=1&sort=lastpost&order=asc
http://www.someforum.com/forumdisplay.php?f=12&sort=&order=&daysprune=10

and so on.

If I set up substitutions like

URL: *
REPLACE: &daysprune=1
WITH: (empty)

URL: *
REPLACE: &sort=asc
WITH: (empty)

etc

I should end up with URLs like http://www.someforum.com/forumdisplay.php?f=12
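
To make the intended result concrete, here is a minimal Python sketch of the same normalization. It is not Offline Explorer's substitution syntax, only an illustration of what the substitutes are meant to achieve. The parameter names come from the example URLs above; the function name `strip_params` and the `NOISE_PARAMS` set are made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that only change how a thread list is sorted or pruned.
# Taken from the example URLs above; a real forum may use more.
NOISE_PARAMS = {"daysprune", "sort", "order"}

def strip_params(url, drop=NOISE_PARAMS):
    """Return url with every query parameter named in `drop` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))

urls = [
    "http://www.someforum.com/forumdisplay.php?f=12&daysprune=1&sort=lastpost&order=desc",
    "http://www.someforum.com/forumdisplay.php?f=12&daysprune=1&sort=lastpost&order=asc",
    "http://www.someforum.com/forumdisplay.php?f=12&sort=&order=&daysprune=10",
]

# All three variants collapse to the same canonical URL.
canonical = {strip_params(u) for u in urls}
print(canonical)
```

The three example URLs reduce to a single entry, which is the effect the substitution list is supposed to have on the download queue.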

One problem with this approach is that there are many different parameters, which should be easy to deal with, with RegExps like &sort=*[a-z]. I cannot find any mention of this anywhere online or in the documentation, though. Is it implemented in OE Enterprise, and if so, what is the syntax?

Another problem is that after setting up a long list with all the possible substitutions, OE seems to ignore a lot of them. Will it only make one substitution in each URL?

I hope you can help with this so I can enjoy your otherwise excellent application. I'm sure if I find out how to use it, it would be worth the small price.

10/20/2004 03:22 am
> A solution that worked (partly) was to set up some URL substitutes.

You could use URL Filters in order to avoid some file downloads.

Of course, the third line in your example should be:
http://www.someforum.com/forumdisplay.php?f=12&sort=asc&order=&daysprune=10

> If I set up substitutions like
>
> URL: *
> REPLACE: &daysprune=1
> WITH: (empty)
>
> URL: *
> REPLACE: &sort=asc
> WITH: (empty)
>
> I should end up with URLs like http://www.someforum.com/forumdisplay.php?f=12

No, this wouldn't work. Try this:

URL: *
REPLACE: &daysprune=1*
WITH: (empty)

URL: *
REPLACE: &sort=asc*
WITH: (empty)


> I cannot find any mention of this anywhere online or in the documentation, though.
> Is it implemented in OE Enterprise, and if so, what is the syntax?

The syntax is described in the Help file:

Advanced features... Using URL Macros
Advanced features... Fine tuning downloads using Project URL Filters...
(and click on the green link: "Custom configuration")
Advanced features... URL Substitutes

Oleg has written in:
http://www.metaproducts.com/mp/mpSupport_User_Forums_Message.asp?id=4966

----------
URL:
*
Replace:
custid=*[0-9]$
With:
(keep this field empty)

URL:
*
Replace:
custid=*[0-9]&
With:
(keep this field empty)

This should work.
----------

But it does *not* work correctly in URL Substitutes!
([a-z]; [0-9]; $)

Maybe it's a bug, or these keywords are not yet implemented in URL Substitutes.

> Another problem is that after setting up a long list with all the possible substitutions,
> OE seems to ignore a lot of them.

I don't think so.

> Will it only make one substitution in each URL?

Have you checked "Apply all matching rules"?

If OE finds a matching rule, the URL will be changed. This new *changed* URL will be checked against the next rule in the list, and so on....
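
The chaining described here can be sketched in Python. This is a simplified model, not OE's actual engine: plain literal matching stands in for OE's pattern matching, the function name `apply_rules` is hypothetical, and the behavior with `apply_all=False` (stop after the first rule that changes the URL) is an assumption about what an unchecked "Apply all matching rules" option would do.

```python
def apply_rules(url, rules, apply_all=True):
    """Apply (find, replace) pairs in list order.

    Each rule sees the URL as already changed by the rules before it.
    With apply_all=False (assumed behavior), stop after the first rule
    that actually changes the URL.
    """
    for find, replace in rules:
        changed = url.replace(find, replace)
        if changed != url:
            url = changed
            if not apply_all:
                break
    return url

rules = [("&daysprune=1", ""), ("&sort=lastpost", ""), ("&order=desc", "")]
url = ("http://www.someforum.com/forumdisplay.php"
       "?f=12&daysprune=1&sort=lastpost&order=desc")

all_applied = apply_rules(url, rules)
first_only = apply_rules(url, rules, apply_all=False)
print(all_applied)
print(first_only)
```

With every rule applied in turn, only `?f=12` remains; with the assumed first-match-only behavior, the `sort` and `order` parameters survive, which would look like rules being ignored.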

(Checked and unchecked rules in the rule list work independently; they are completely different kinds of substitutes. OE first goes through all checked rules (online substitutes), downloads the resulting URL, and then checks whether it has to change the URL name on disk (unchecked rules).)

Karsten 10/20/2004 06:15 pm
> You could use URL Filters in order to avoid some file downloads.
>
> Of course, the third line in your example should be:
> http://www.someforum.com/forumdisplay.php?f=12&sort=asc&order=&daysprune=10
>
I don't really want to exclude them, just trim them down to avoid (some of) the parameters.

>
> No, this wouldn't work. Try this:
>
> URL: *
> REPLACE: &daysprune=1*
> WITH: (empty)
>
> URL: *
> REPLACE: &sort=asc*
> WITH: (empty)
>
>
The problem here is that some of the parameters sometimes appear at the end of a line or at the start (with a ? instead of an &). I fixed this problem by setting up 2 subs for each parameter:

*sort=asc&
&sort=asc$

The first deals with parameters at the beginning and in the middle of a line, and the second with those at the end of a line.
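
For comparison, the same two-pattern idea can be written with standard regular expressions in Python. This is only a sketch: OE's `*` and `$` pattern syntax is not the same as Python's `re` syntax, and the helper name `drop_param` is made up for this example.

```python
import re

def drop_param(url, name):
    """Remove query parameter `name` using two patterns, one per position."""
    n = re.escape(name)
    # Start or middle of the query string: "sort=asc&" preceded by "?" or "&".
    url = re.sub(rf"(?<=[?&]){n}=[^&]*&", "", url)
    # End of the URL: "&sort=asc" with nothing after it.
    url = re.sub(rf"&{n}=[^&]*$", "", url)
    return url

base = "http://www.someforum.com/forumdisplay.php"
at_start = drop_param(base + "?sort=asc&f=12", "sort")
in_middle = drop_param(base + "?f=12&sort=asc&order=desc", "sort")
at_end = drop_param(base + "?f=12&sort=asc", "sort")
print(at_start)
print(in_middle)
print(at_end)
```

Note that neither pattern matches a parameter that is the only one in the query string (for example `?sort=asc`), so that case would need a third rule.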

> The syntax is described in the Help file:
>
> Advanced features... Using URL Macros
> Advanced features... Fine tuning downloads using Project URL Filters...
> (and click on the green link: "Custom configuration")
> Advanced features... URL Substitutes
>
Thanks. It was hidden too deep for me to find, and a search didn't find any mention of RegEx. That was welcome information.


> Oleg has written in:
> http://www.metaproducts.com/mp/mpSupport_User_Forums_Message.asp?id=4966
>
> ----------
> URL:
> *
> Replace:
> custid=*[0-9]$
> With:
> (keep this field empty)
>
> URL:
> *
> Replace:
> custid=*[0-9]&
> With:
> (keep this field empty)
>
> This should work.
> ----------
>
> But it does *not* work correctly in URL Substitutes!
> ([a-z]; [0-9]; $)
>
> Maybe it's a bug, or these keywords are not yet implemented in URL Substitutes.
>
Yeah, I found that article, which was the reason I set out to use RegExs in the first place. Too bad they don't work in URL subs, but hopefully they will.


> > Another problem is that after setting up a long list with all the possible substitutions,
> > OE seems to ignore a lot of them.
>
> I don't think so.
>
> > Will it only make one substitution in each URL?
>
> Have you checked "Apply all matching rules"?
>
> If OE finds a matching rule, the URL will be changed. This new *changed* URL will be checked against the next rule in the list, and so on....
>
> (Checked and unchecked rules in the rule list work independently; they are completely different kinds of substitutes. OE first goes through all checked rules (online substitutes), downloads the resulting URL, and then checks whether it has to change the URL name on disk (unchecked rules).)

Yes, I checked "Apply All...". I think the reason that some of my lines weren't substituted is that my expressions didn't match all possible cases. I think I solved that by using the substitution pairs I mentioned earlier.

Thanks for your help, it got me going in the right direction.