Exclude Dynamic URLs
|Bob||12/03/2004 09:16 pm|
|I am using a trial version of OE 3.5. The site I am attempting to download has foiled my efforts, even though this is obviously a very slick product. My issue is that there is a logoff button on every page, but its URL is completely different on each page, so a URL filter does not seem to work.
This is an example:
<a href="/AFI~V4126109~C33107~R0~OF25~N/0/70910848/78153709/78153710/78153713/34853741/34850750"onMouseover="imgAct('exit')"onMouseout="imgInact('exit')"><img src="/images/discexitn.gif" alt="End Session" border=0 vspace=0 hspace=0 name="exit"></a>
I was thinking of some ways to get around this, and wondered if any of them were available:
1) Is there some way to prevent, for example, the first 5 links on each page from being followed? (The header is always the same.)
2) Do a regular expression match on the line a URL is on, to exclude or include it based on a match?
3) Is there an extension to preprocess and modify the page being downloaded, so I could strip the offending links before OE processes it? (Perhaps I can find a proxy server that does such things.)
Any ideas or suggestions would be greatly appreciated.
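Idea 3 could be sketched roughly as follows in Python. This is only an illustration of the preprocessing step, not something OE provides: the function name strip_logoff_anchors and both patterns are made up for this example, and it assumes the logoff button can always be recognised by alt="End Session" somewhere inside the anchor.

```python
import re

# Matches one whole <a ...>...</a> element (non-greedy, case-insensitive).
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)

# Assumption: the logoff button is the only anchor containing this alt text.
LOGOFF_MARKER_RE = re.compile(r'alt\s*=\s*"End Session"', re.IGNORECASE)


def strip_logoff_anchors(html: str) -> str:
    """Return the page with every anchor that contains the marker removed."""

    def replace(match: re.Match) -> str:
        anchor = match.group(0)
        # Drop the anchor if the marker appears anywhere inside it.
        return "" if LOGOFF_MARKER_RE.search(anchor) else anchor

    return ANCHOR_RE.sub(replace, html)
```

A filtering proxy would run something like this on each page before handing it to the downloader. Regular expressions are fragile on real-world HTML, so a proper HTML parser may be the safer choice for anything beyond a quick test.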
|Defenestration||12/04/2004 01:52 pm|
|Have you tried placing "IgnoreLogoutLinks" (without quotes) on the next line after the URL, in the Addresses (URLs) field in Project Properties?
This tells OE to ignore any logoff links.
Press F12 in the Project Properties dialog for a list of additional commands.
Oleg - Any chance of documenting each command? (e.g. what it does, when to use it, etc.)
|Bob||12/04/2004 04:51 pm|
|Yes, I did try this, but unfortunately it doesn't work. If this command only examines the href of an anchor, then in this case there is nothing meaningful there to identify it as a logoff link. There is an ALT="End Session" in the anchor, so it would be fairly identifiable as a logout link if other parts of the anchor were examined. "Exit" is also a good keyword to look for in the other attributes of the anchor in this case.
|Defenestration||12/04/2004 05:02 pm|
|I'm not completely sure what IgnoreLogoutLinks checks for. You're probably right that OE doesn't examine this part of the link when determining logoff links.
If this is the case then it should be trivial to fix, and I'm sure Oleg will oblige in the not too distant future.
If only other software developers were like Oleg and released fixes within a day or two (where possible), the world would be a much happier place :o)
|Defenestration||12/04/2004 05:21 pm|
|I searched the forums and came across this reply from Oleg about how IgnoreLogoutLinks works:
"IgnoreLogoutLinks simply filters out all links with logoff logout expire or signoff words (in the entire URL - server, directory or filename part). "
So it would appear that IgnoreLogoutLinks should also search for "End Session" and "Exit" when trying to determine what constitutes a logoff link.
Even better would be a way to customize the keywords used by IgnoreLogoutLinks, e.g. through a text file or registry key. If the file or registry key is not present, the default list of keywords (with "End Session" and "Exit" added) would be used; otherwise IgnoreLogoutLinks would use the keywords from the file or registry key.
This could also be extended to other commands, where necessary.
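To illustrate the difference, here is a rough Python sketch of both checks. It is not OE's actual code: url_is_logoff only mimics the behaviour Oleg describes (a keyword search in the URL), while AnchorCollector and logoff_hrefs are hypothetical helpers showing what a check over all attributes of the anchor and its child tags, with a caller-supplied keyword list, might look like.

```python
from html.parser import HTMLParser

# Keywords taken from Oleg's description of IgnoreLogoutLinks.
DEFAULT_KEYWORDS = ("logoff", "logout", "expire", "signoff")


def url_is_logoff(url, keywords=DEFAULT_KEYWORDS):
    """URL-only check: true if any keyword occurs anywhere in the URL."""
    url = url.lower()
    return any(k.lower() in url for k in keywords)


class AnchorCollector(HTMLParser):
    """Collects, per anchor, its href and every attribute value inside it."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # list of [href, [attribute values]]
        self._current = None   # the anchor currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = [dict(attrs).get("href") or "", []]
            self.anchors.append(self._current)
        if self._current is not None:
            # Record attribute values of the anchor and of its child tags.
            self._current[1].extend(v for _, v in attrs if v)

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


def logoff_hrefs(html, keywords):
    """Return hrefs of anchors whose attributes mention any keyword."""
    parser = AnchorCollector()
    parser.feed(html)
    found = []
    for href, values in parser.anchors:
        haystack = " ".join(values).lower()
        if any(k.lower() in haystack for k in keywords):
            found.append(href)
    return found
```

With the keyword list ["End Session", "exit"], the second check would flag Bob's button through its alt and name attributes even though the href contains nothing recognisable. Short keywords such as "exit" are matched as plain substrings here, so they can also hit unrelated attribute values.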
|Oleg Chernavin||12/06/2004 04:22 am|
|Customizing IgnoreLogoff command will not help in this case. Can you please tell me, do other (normal) URLs on the site look like the logoff URL or they are different?
|Bob||12/06/2004 06:06 pm|
|They look very similar, i.e. all URLs consist of a non-meaningful path. The same logout button has a different URL on every page. However, other attributes of the anchor are consistent, such as the alt text. Any evaluation would have to look at those other attributes as well for this site to be downloaded.
The logout/account links are in a table at the start of the HTML. I would guess this is fairly common for links such as logout. That is why an option to skip the first n links could be quite useful.
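A "skip the first n links" option might behave roughly like this Python sketch. The function name links_after_header and the pattern are invented for illustration, and it assumes every page starts with the same fixed number of header links with double-quoted href values.

```python
import re

# Captures the href value of each anchor, in document order.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref\s*=\s*"([^"]*)"', re.IGNORECASE)


def links_after_header(html: str, skip: int = 5) -> list:
    """Return all anchor hrefs except the first `skip` ones on the page."""
    hrefs = HREF_RE.findall(html)
    return hrefs[skip:]
```

This only works if the header really is identical on every page; a page with fewer header links would lose real content links.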
|Oleg Chernavin||12/07/2004 08:36 am|
|Can you send me the URL of the site, with the username/password, to email@example.com? I will try to see if there is a quick way to find a solution.