Setting up Content Filters

09/01/2016 07:10 am

I'm trying to download a website and skip (don't save) the files that contain these two strings:

<title> 404
<TITLE>302 File moved</TITLE>

I've set the "Content Filters" dialogue in Project Properties like this:

-- Text keywords
Keywords: "<title> 404" "<TITLE>302 File moved</TITLE>"
Search for all keywords: checked
Search inside HTML tags: checked
-- When keywords are not found in the page
Save these pages: checked

All the other checkboxes are left unchecked.

My assumption is that it works like this:

1) Page is downloaded
2) Parser searches for "<title> 404" and "<TITLE>302 File moved</TITLE>"
3) If none of these is found, page is saved, otherwise it's discarded
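
To make the assumption concrete, here is a rough Python sketch of the logic I expected. It is only an illustration of my mental model, not how Offline Explorer is actually implemented; the function name is made up, and I am assuming a plain case-sensitive substring match.

```python
# Hypothetical sketch of the behaviour I expected (not Offline Explorer code).
KEYWORDS = ["<title> 404", "<TITLE>302 File moved</TITLE>"]


def should_save_expected(page_html: str) -> bool:
    """Return True only if none of the keywords occurs in the page."""
    # Assumption: plain, case-sensitive substring search.
    return not any(keyword in page_html for keyword in KEYWORDS)


if __name__ == "__main__":
    moved = "<html><head><TITLE>302 File moved</TITLE></head></html>"
    normal = "<html><head><title>Home</title></head></html>"
    print(should_save_expected(moved))   # expected: discarded
    print(should_save_expected(normal))  # expected: saved
```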

However, even with these settings, the pages containing "<TITLE>302 File moved</TITLE>" are still saved, so I guess it works in a different way. Can you please help me find out where my settings are wrong?

Thank you.
Oleg Chernavin
09/01/2016 07:13 am

You need to uncheck the "Search for all keywords" box. When it is checked, Offline Explorer requires both of these keywords to be present in a single web page, and as I understand it, only one of these lines can appear in any given page.

So, it is either 302 or 404.
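
Roughly, the difference between the two settings can be sketched like this. This is only an illustration in Python with hypothetical function names, not the actual Offline Explorer code, and it assumes plain case-sensitive substring matching:

```python
# Illustrative sketch only; names and matching rules are assumptions.
KEYWORDS = ["<title> 404", "<TITLE>302 File moved</TITLE>"]


def keywords_found(page_html: str, keywords: list, search_for_all: bool) -> bool:
    """Decide whether the page counts as 'keywords found'."""
    hits = [keyword in page_html for keyword in keywords]
    # Box checked: every keyword must be present. Unchecked: one is enough.
    return all(hits) if search_for_all else any(hits)


def is_saved(page_html: str, keywords: list, search_for_all: bool) -> bool:
    """Models 'When keywords are not found in the page: Save these pages'."""
    return not keywords_found(page_html, keywords, search_for_all)


if __name__ == "__main__":
    page_302 = "<html><head><TITLE>302 File moved</TITLE></head></html>"
    # Box checked: the 404 keyword is missing, so the page counts as
    # "not found" and is saved.
    print(is_saved(page_302, KEYWORDS, search_for_all=True))
    # Box unchecked: one keyword is enough, so the page is skipped.
    print(is_saved(page_302, KEYWORDS, search_for_all=False))
```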

Would this work?

Best regards,
Oleg Chernavin
MP Staff
09/01/2016 11:00 am
Thank you Oleg, I'll try it out.

However, I find this dialogue (especially the "Search for all keywords" option) counter-intuitive from the user standpoint. I'll try to explain.

Let's say I have two strings that I want to filter out. So I put them in the "Keywords" field and then check "Search for all keywords", because that's what I want to do - search for all of them and only do the action if none is found. Then I go to "When keywords are not found in a page" and check "Save these pages".

So in my thinking, I checked the options that mean "search for all keywords and when none of them is found, save page". However, it doesn't work that way, which might be slightly confusing.

I think it would be way more intuitive if there were two separate checkboxes - "Only apply when all keywords are found" in "When keywords are found in a page" and "Only apply when none of the keywords are found" in "When keywords are not found in a page".

Or, even better, radio buttons to switch between logical AND and logical OR - so it would be possible to choose between "Only when all keywords are found" and "When at least one keyword is found" in "When keywords are found in a page" and between "Only when none of the keywords are found" and "When at least one keyword is not found" in "When keywords are not found in a page".

That way, it would be easy to understand what the filter logic is going to do and how it will apply the rules.
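
To show what I mean, here is a small Python sketch of the four conditions. The enum and function names are made up for illustration, and the substring matching is an assumption; it is not meant to describe how the program works internally.

```python
from enum import Enum


class Condition(Enum):
    """The four hypothetical radio-button choices from my suggestion."""
    ALL_FOUND = "Only when all keywords are found"
    ANY_FOUND = "When at least one keyword is found"
    NONE_FOUND = "Only when none of the keywords are found"
    ANY_MISSING = "When at least one keyword is not found"


def condition_holds(condition: Condition, page_html: str, keywords: list) -> bool:
    """Return True if the chosen condition is satisfied for this page."""
    hits = [keyword in page_html for keyword in keywords]
    if condition is Condition.ALL_FOUND:
        return all(hits)
    if condition is Condition.ANY_FOUND:
        return any(hits)
    if condition is Condition.NONE_FOUND:
        return not any(hits)
    if condition is Condition.ANY_MISSING:
        return not all(hits)
    raise ValueError(f"Unknown condition: {condition!r}")


if __name__ == "__main__":
    keywords = ["<title> 404", "<TITLE>302 File moved</TITLE>"]
    page = "<html><head><TITLE>302 File moved</TITLE></head></html>"
    for condition in Condition:
        print(condition.value, "->", condition_holds(condition, page, keywords))
```

With "Only when none of the keywords are found" combined with "Save these pages", the 302 page in this sketch would be skipped, which is the behaviour I originally wanted.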

Just a suggestion :)

Thanks again for helping me out.
Oleg Chernavin
09/26/2016 07:08 pm
Yes, I like this idea; it makes the filter logic clearer and more flexible.

I added these options. Can you please take a look at the updated version:

Please let me know if it is OK or if anything should be improved or fixed. Thank you!