Feature: Enhanced AutoSave browsing; Page vs Media distinction.

Author Message
ToolmakerSteve 12/25/2004 12:11 pm
Here is a feature I'd like to see in OEP:

A. While AutoSave browsing, a button to force download of all links 1 away [IMG tags TWO away].


Goal: Speed-up surfing to following pages;
PLUS in the background acquire all the large media / archives referenced by this page.


Solution: Sort the URLs into the following categories, and download in this order:

1) .html pages [so you've immediately got the "outline" of each level-1 page].

2) IMG tags TWO away: for each .html page, the IMG tags & whatever else is needed to display that page [so if you stare at the level-0 page long enough, the first few level-1 pages get completely loaded].

3) .php pages that have no "?xxx" part of the url. These are typically "normal" html pages [as contrasted with media / archives for downloading].

4) .php pages with "?xxx". These are hard to categorize; it might result in a normal html page to view,
or it might be the results of a search,
or on sites such as www.boxedart.com, all copyrighted media is reached via .php REDIRECT "wrappers": until you access the .php URL, you don't know whether it is media or a normal .html page. The idea is:
4.1) download the php;
4.2) look for redirect,
4.2.1) IF NO redirection, treat like "3) .php",
4.2.2) ELSE look at the file name for ".zip", ".avi", ".jpg" etc.
4.2.3) IF FOUND, treat redirected URL as "5) media/archive files"
4.2.4) ELSE treat redirected URL like "3) .php".

5) media/archive files. ".avi", ".jpg", ".zip" etc.
Queue these for loading in the background. Limit to ONE connection, and add an option "Delay between background downloads".
Reason: to avoid getting banned on www.boxedart.com, it is important to download copyrighted files slowly - wait at least 30 seconds between each - and stop downloading for the day after 100 ~ 200 MB.
(And this would be the polite way to auto-download any site with lots of media.)
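
The categories above can be sketched in Python. This is only an illustration of the sorting idea, not how OEP works: the function names, the extension lists (the post only names ".avi", ".jpg", ".zip" etc.), and the choice to treat unknown extensions as pages are all assumptions.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension lists; extend as needed.
MEDIA_EXTS = {".avi", ".jpg", ".zip"}
HTML_EXTS = {".html", ".htm"}

def ext_of(url):
    """Lower-cased file extension of the URL path (query string ignored)."""
    return posixpath.splitext(urlparse(url).path)[1].lower()

def classify(url, is_img_tag=False):
    """Return the category number 1-5 from the list above."""
    if is_img_tag:
        return 2                      # needed to display a page
    ext = ext_of(url)
    if ext in HTML_EXTS:
        return 1
    if ext == ".php":
        return 4 if urlparse(url).query else 3
    if ext in MEDIA_EXTS:
        return 5
    return 1                          # assumption: unknown types are pages

def reclassify_after_fetch(redirect_target):
    """Steps 4.2.1 to 4.2.4, applied once a category-4 URL has been fetched.

    redirect_target is the URL it redirected to, or None if no redirect.
    """
    if redirect_target is None:
        return 3
    if ext_of(redirect_target) in MEDIA_EXTS:
        return 5
    return 3

def download_order(links):
    """Sort (url, is_img_tag) pairs by category; ties keep their order."""
    return sorted(links, key=lambda link: classify(*link))
```

Because `sorted` is stable, links within the same category stay in the order they were found on the page.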

Which reminds me: I'd like to MODIFY the three
Project Properties / Advanced / "Stop Loading ..." options,
to "SUSPEND" rather than "STOP". (Or maybe that's how it works now, and the wording just needs to be changed.) The user can then manually STOP the suspended project, or resume it later.

NOTE that graphics reached by IMG tags were already loaded in "2) IMG tags TWO away".
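
A rough sketch of the background queue from "5)" combined with the SUSPEND idea: one file at a time, a delay between downloads, and a daily cap that suspends the queue rather than discarding it. The class, the `fetch` callable (assumed to return the number of bytes downloaded), and the default values are hypothetical; only the 30 seconds and the 100 MB lower bound come from the text above.

```python
import time
from collections import deque

class BackgroundQueue:
    """Single-connection media queue with a delay and a daily byte cap."""

    def __init__(self, fetch, delay_seconds=30,
                 daily_cap_bytes=100 * 1024 * 1024, sleep=time.sleep):
        self.fetch = fetch                  # hypothetical: fetch(url) -> bytes downloaded
        self.delay_seconds = delay_seconds  # "Delay between background downloads"
        self.daily_cap_bytes = daily_cap_bytes
        self.sleep = sleep                  # injectable so the delay can be tested
        self.pending = deque()
        self.bytes_today = 0
        self.suspended = False

    def add(self, url):
        self.pending.append(url)

    def run(self):
        """Download until the queue is empty or the daily cap is reached."""
        while self.pending and not self.suspended:
            url = self.pending.popleft()
            self.bytes_today += self.fetch(url)
            if self.bytes_today >= self.daily_cap_bytes:
                self.suspended = True       # SUSPEND: pending URLs are kept
            elif self.pending:
                self.sleep(self.delay_seconds)

    def resume(self):
        """Clear the suspension, e.g. on the next day; then call run() again."""
        self.bytes_today = 0
        self.suspended = False
```

After a suspension, the URLs that were not reached remain in `pending`, so the user can either resume later or clear the queue to stop for good.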
ToolmakerSteve 12/25/2004 03:04 pm
----- Refinements that could then be added to "A) `Pre-Load Linked` logic" -----

B. AutoSave "Page Slideshow".

Goal: BASED ON "A) `Pre-Load Linked` logic while AutoSave browsing",
but AUTOMATICALLY download pages, while user is WATCHING them in the browser,
looking for trouble or for interesting stuff, at which time MANUALLY intervenes.

Solution:

----------
B.1) A button "Page Slideshow", that acts like "AutoSave", with these additions:

a. Does part of the "A) `Pre-Load Linked` logic" - but only "A.1) .html" thru "A.3) .php with no '?xxx'".
See "3." for handling of "A.4)" and "A.5)".

b. If the user doesn't click anything, after a specified number of seconds, goes to "next queued page".
This is an "Auto-Page" or "Slideshow" feature.
Control with an option "Delay between Auto-Pages" or "Delay between Slideshow Pages" [n seconds].

Possible "refined" version of algorithm:
To give the user pages that he is most likely to be interested in, keep TWO lists of Pages:
"Priority" pages are those that are 1-level from ANY page that the user MANUALLY went to (See "2.").
If user is manually moving from page to page a la AutoSave, this list will grow.
"Other" pages are parsed from the original project settings; for example "All pages within 5 levels of the specified URL(s)".

The reason to keep two lists is that the user might just begin by watching:
OEP downloads level-0 URL(s), parses those to find level 1 URL(s) & puts those on "Priority" [because they are 1-level from pages that user specified]; after Auto-Page Delay, OEP shows first level-1 page.
However, as the level-1 pages are parsed, the resulting level-2 pages should be placed on "Other" lists.
Reason: At any time, the user might start MANUALLY clicking on links; whenever he does so, the "Other" page parsing should be momentarily suspended, and the new page should be put at the FRONT of the parsing list, and marked as manually specified - all ITS level-1 links (AND level-2 IMG tags etc.) should immediately be pre-loaded by "A) `Pre-Load Linked` logic". These level-1 links have "Priority" over the "Other" pages.

Because pages may be reachable from multiple locations, it is necessary to keep track, for every page,
whether it has been shown in the browser pane yet.
[Project in memory, as well as Descr.WD3 in each folder, should keep track of which pages have actually been shown in the browser - otherwise we'll lose track of what the user has `Viewed` via Slideshow or AutoSave browsing, and what was merely `Pre-Loaded`.]
If All "Priority" pages have been downloaded & `Viewed` by the user, then Slideshow starts picking "Other" pages to view. However, unless user MANUALLY intervenes (See "2.") to indicate interest, the `Pre-Load Linked` logic should now be directing parsed pages to "Other" pages, as these pages are NOT w/i level-1 of a page user has shown interest in.
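
The two-list idea can be sketched roughly as follows. The class and method names are made up for illustration, and the rule that a link already on "Other" gets promoted when the user manually visits its parent is my reading of the description, not something OEP is known to do.

```python
from collections import deque

class SlideshowQueue:
    """Keeps "Priority" and "Other" page lists plus a record of viewed pages."""

    def __init__(self):
        self.priority = deque()   # 1 level from a page the user chose manually
        self.other = deque()      # everything else found by parsing
        self.viewed = set()       # pages already shown in the browser pane

    def add_links(self, links, parent_was_manual):
        for url in links:
            if url in self.viewed:
                continue
            if parent_was_manual:
                if url in self.other:
                    self.other.remove(url)      # promote to Priority
                if url not in self.priority:
                    self.priority.append(url)
            elif url not in self.priority and url not in self.other:
                self.other.append(url)

    def manual_visit(self, url, links):
        """User typed or clicked this page: its links get Priority."""
        self.viewed.add(url)
        self.add_links(links, parent_was_manual=True)

    def next_page(self):
        """Pick the next slideshow page: Priority first, then Other."""
        for queue in (self.priority, self.other):
            while queue:
                url = queue.popleft()
                if url not in self.viewed:
                    self.viewed.add(url)
                    return url
        return None
```

Links parsed from a page that the slideshow picked on its own would be added with `parent_was_manual=False`, so they land on "Other" until the user shows interest.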


----------
B.2) Manual Intervention buttons (and key commands):

2.a) Background-Queue all media links for this page.
Background-queued links would be downloaded one at a time -- see "A.5) media/archive files" &
Pause / Continue.
Reason to pause: it isn't clear whether the user now wants to manually pick the next page, or continue in slide show mode. Hit again to Continue [Slideshow picks next page]. Or click on a link to MANUALLY go there next. Or Hit "d)" as an equivalent to Continue. Or Hit "e)" to Go Back.

2.b) Skip MEDIA links for this page
DON'T add "A.5) media/archive files" to the download queue;
SPECIAL HANDLING OF "A.4) .php pages with '?xxx'":
[mark them as "don't redirect to media"; provisionally queue them as if they are html pages, but when their turn comes, if they turn out to be a redirect to a media/archive URL, then discard the redirect URL]

& Pause / Continue, as in "a)". Hit again to Continue [Slideshow picks next page].

2.c) Skip ALL links for this page [DON'T add ANY links to the download queue; if page-links have
Oleg Chernavin 12/27/2004 06:30 am
Well, I got mixed up with so many details. But if you want to download some page 1 level deep, just enter its online address into the Address bar at the top of Offline Explorer, set Level to 1 there and click the Download button. This is a very easy way.

Best regards,
Oleg Chernavin
MP Staff
ToolmakerSteve 12/27/2004 10:08 pm
Okay, that was too complex. Let me explain the problem, and then I'll try to suggest more basic features that would help:

Website administrators are understandably getting smarter about detecting and banning attempts to use tools such as OE to automatically download large portions of their copyrighted material.

I have now encountered several membership sites at which OE gets me in trouble, no matter how carefully I configure it for a responsible download [using a single connection and a low bandwidth].

The only way a tool such as OE could be successfully used at these websites, is if OE acts more like a normal user browsing the site.
=> In particular, it is vital to not overlook any login request pages that may appear.
And to not download an "excessive" amount of protected files, which are typically Archive or Video files.

=> The basic desire I am expressing is for a mode that is somewhat like "browsing with AutoSave", but is automated like "Download":

Something like this:

User says "download, showing pages as they are downloaded".

OE goes to the first URL listed, and shows it in the browser window (and therefore downloads any additional URLs needed to show it, such as IMG tags).

That first page remains visible for a few seconds, while user decides whether to download any media and archives mentioned from that page.

Then OE goes to any page [.html or .php ..., but not .zip or .wmv ...] that is linked to that first page.

And this is repeated, down to the specified Level, similar to existing "download" feature.

But there are several important differences from the existing "download":

1. User sees each page in the browser as it is downloaded. User doesn't just see the Channel .. URL info, they see the actual page in the browser, for a delay time that they can set.

2. A way to have OE chug through all the .html and .php files, using the Connection and Speed options user has selected - but add a new filtering choice that can be added to each File Filter: "Load using Cautious Settings".
User might set "Video" and "Archive" to "Load using Cautious Settings".
Then I would like new Options / Advanced / Cautious Settings, that specifies # Internet Connections and Speed for such "Cautious" downloads. Default would be "1 internet connection", and "Background Speed".

With such settings, OE would collect all "Video" and "Archive" files on a new "Cautious URLs" list. These would be downloaded using a single internet connection, and 1 KB/sec, while OE continued to browse through all the other files at its normal rate.

I would also like an option "Approve each Cautious URL". If this option were checked, as OE encountered each cautious URL, before adding it to its cautious URL list, it would ask the user for confirmation.
The idea here is that the page that linked to this URL is still visible in the browser, and the user can examine that browser page to decide whether they want this URL or not.

It would be a nicety to have some additional buttons;
If I decide this page is referring to good stuff, be able to say "Sure, I'll take all the media on this page", so that user doesn't have to approve those media one by one.
If I decide this page is not what I want, be able to say "Skip all media on this page".
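
To make the "Cautious Settings" idea concrete, here is a small sketch of how the links of one page might be split into a normal list and a "Cautious URLs" list. The function, its parameters, and the extension set standing in for the "Video" and "Archive" File Filters are hypothetical, not existing OE options.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical stand-in for File Filters marked "Load using Cautious Settings".
CAUTIOUS_EXTS = {".zip", ".wmv", ".avi"}

def route_links(links, approve=None, page_decision=None):
    """Split the links of one page into (normal, cautious) lists.

    approve: callable(url) -> bool, used when "Approve each Cautious URL"
             is checked; None means the option is unchecked.
    page_decision: "take_all", "skip_all", or None (the two extra buttons).
    """
    normal, cautious = [], []
    for url in links:
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        if ext not in CAUTIOUS_EXTS:
            normal.append(url)          # browsed at the normal rate
        elif page_decision == "skip_all":
            continue                    # "Skip all media on this page"
        elif page_decision == "take_all" or approve is None or approve(url):
            cautious.append(url)        # downloaded slowly, one connection
    return normal, cautious
```

The `cautious` list would then be fed to whatever slow, single-connection downloader the Cautious Settings describe, while the `normal` list continues at the regular Connection and Speed settings.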


-----
Concluding Thought: What I'm seeking is a convenient way to use the automation ability of OE, without being an irresponsible user who will get banned from membership sites. I am trying to describe a way to have "automation" -- by having OE go from page to page -- but not "blindly" -- I can see each page, so that I can intervene if needed. If I don't choose to intervene, then OE would go through the .html and .php pages at whatever speed I set, while I watched, and it would "pile up" a list of media/archive URLs that could then be SLOWLY downloaded in the background -- avoiding being banned by sites that monitor suspicious bursts of downloading. And I would have "skipped" any media I di
ToolmakerSteve 12/28/2004 05:48 am
Brainstorm: It may be possible to accomplish some of what I want, by manipulating the Queue.

For instance, select all the .zip files whose Referer is a particular page, and delete them (if unwanted), or move them to the end of the queue (so they aren't downloaded yet).

This gives me an idea: Would it be easy to add the following two features?

1. In the Queue window, be able to Pause / Unpause INDIVIDUAL FILES. OE would leave Paused files in the queue; it would only download Unpaused files. A Paused file could then be Unpaused later, if the user now wished to download it.

2. A File Filter option; when selected, such files would be added to the queue "Paused".

Then user could enable this new "Paused" option as part of the File Filter for "Video" and "Archive" files.

The result is that those files would just pile up in the queue; OE would temporarily skip these files, downloading any files that were not paused.

At any time, user could go to the queue window, and start "Unpausing" .zip files that he wanted to download, and use "Delete Selected URLs" on any .zip files he didn't want.

Or copy to a text file any URLs to be downloaded another day.
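
A minimal sketch of the two proposed features, assuming a simple in-memory queue. None of these names exist in OE; the per-entry `paused` flag and the `paused_exts` filter option are the hypothetical parts, and the extension match is a plain suffix test that ignores query strings.

```python
from dataclasses import dataclass

@dataclass
class QueueEntry:
    url: str
    referer: str = ""
    paused: bool = False

class DownloadQueue:
    def __init__(self, paused_exts=()):
        # Feature 2: files matching these extensions are queued "Paused".
        self.paused_exts = tuple(ext.lower() for ext in paused_exts)
        self.entries = []

    def add(self, url, referer=""):
        paused = url.lower().endswith(self.paused_exts)
        self.entries.append(QueueEntry(url, referer, paused))

    def next_unpaused(self):
        """Feature 1: skip Paused entries but leave them in the queue."""
        for i, entry in enumerate(self.entries):
            if not entry.paused:
                return self.entries.pop(i)
        return None

    def set_paused(self, url, paused):
        for entry in self.entries:
            if entry.url == url:
                entry.paused = paused

    def delete_by_referer(self, referer, ext):
        """Drop queued files of one type that were linked from one page."""
        self.entries = [
            e for e in self.entries
            if not (e.referer == referer and e.url.lower().endswith(ext))
        ]
```

With `paused_exts=[".zip"]`, archives pile up in `entries` while pages keep downloading, and the user can later unpause or delete them individually.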

=> Heh, having thought this through, I see how to approximate this feature today.
I just tried it out:
I downloaded a website, with URL Filters / Filename / excluded keywords of .zip.
This got all the non-zip files.
Then I turned off the zip exclusion, and set a long delay between downloads.
Then downloaded a second time.
The result is that OE parsed all the already downloaded files, building up a big queue of .zip entries (the long delay kept them from actually getting downloaded).

Then I Suspended the download, and copied the URLs to a text file.
Now, I can download those .zip files a few each day, to avoid being banned.
Very Cool!
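
For the last step, splitting the saved URL list into small daily portions, something like the following would do. It is only a sketch: the function name and the `per_day` value are assumptions, since "a few each day" is as specific as I can be.

```python
def daily_batches(lines, per_day):
    """Split URL lines (e.g. read from the saved text file) into daily batches."""
    if per_day < 1:
        raise ValueError("per_day must be at least 1")
    urls = [line.strip() for line in lines if line.strip()]  # drop blank lines
    return [urls[i:i + per_day] for i in range(0, len(urls), per_day)]
```

Each batch could then be pasted into a new Project on its own day.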
Oleg Chernavin 12/28/2004 09:40 am
Thank you for these ideas. Yes, for now this can only be achieved manually: you load only Web pages (maybe with images), and then allow other files and do "Download missing files".

Of course, the semi-visual way with the browser might be better, but it is really complex to implement. I think I will be able to implement file freezing, where files are moved to the end of the queue and paused until you allow them. But it will be harder for you to operate - you would have to find all .zip files in the Queue (they are usually mixed in there with pages and images) and then command them to be paused.

In fact, starting the Project twice (with Archives disabled and then enabled) is easier. At least, I do so sometimes.

So, it might not be worth having that individual file pause feature.

What do you think?

Oleg.