crashing on a large project

Author Message
placebo 03/23/2010 09:37 pm
http://img696.imageshack.us/img696/850/snap1g.gif


Dear Sirs,

I am trying to download a large project with maybe 150000-200000 files (totalling ~2.0GB estimated; simple html, gifs, jpegs, pdfs -- nothing fancy). My PC has 256MB RAM, but the swap file (virtual memory) is set to 4000MB.
OE does a great job, but then I ran into problems: first I ran out of HDD space, so I had to free some up; then I ran out of RAM, so I had to enlarge the swap file (fixed it at 4000MB); and OE (OEP) still crashes.

Of course I am using "Do not Download existing files"-mode for re-running the interrupted project. It goes well for a long while (the download queue is always short, 3-300 files; Parsing grows up to 25000, then drops to 17000, then OE crashes, see screenshot). I am running 40 download channels, and ProcessExplorer (procexp.exe) shows that OE.exe starts off with 40MB and after a while uses up to 1.6GB Private Bytes (that's why I thought I should fix the swap file at 4GB ;); it doesn't get higher than this.

I have a good feeling when the Private Bytes stagnates at 1.6GB and the 'oscillating' Parsing number tends to decline. Looks all stable. It takes about 2 hours from Parsing 25000 to 17000. All stable. But then OE.exe would crash - surprise and disappointment ;)

I ran this project years ago on a powerful computer and OE (old version) never crashed. This is the first time I have seen OEP crash on this project. I fear that it has something to do with this PC's low RAM.

I am on a WinXP-SP3 pc (no further updates), clean install.
Maybe it is my Windows (updates needed?),
maybe it is my weak PC (256MB as cause for all trouble?),
maybe it is the OE/OEP (some kind of bug?).

You've posted a new version of OEP, so i will test that version on my large project.
If you need screenshot or starting URL, let me know - i could unveil ;)
Thanks for all, best
Pete
placebo 03/24/2010 02:46 am
Hello again,

the night is over.. I've tested the new version. It ran well all night long, then all of a sudden (as expected) OEP just crashed. Here is a full screenshot:

http://img34.imageshack.us/img34/8273/snapshotfull.gif

It is getting closer to the end (Parsing 0).. but still quite some way to go (Parsing 14000).
Dear MP staff, what is your first-shot evaluation of the situation.. what is likely to be the reason for my problems?

Thanks, best
Pete
Oleg Chernavin 03/24/2010 08:12 am
Well, a big parsing number usually means that Offline Explorer downloads faster than it can process. Can you set a lower number of connections and some delay between downloads on the Internet tab of the Ribbon? This should help.
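
To illustrate the point about downloading faster than processing, here is a toy model in Python. It is only a sketch: the function name and the rates are made-up assumptions for illustration, not measurements or internals of Offline Explorer. It shows that a backlog grows whenever files arrive faster than they are parsed, and stays empty otherwise.

```python
def parsing_backlog(download_rate, parse_rate, ticks):
    """Return the backlog size after each tick in a toy model.

    Assumption: per tick, `download_rate` files arrive and at most
    `parse_rate` files get processed. Both numbers are illustrative.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        # Files that could not be parsed this tick stay in the backlog.
        backlog = max(0, backlog + download_rate - parse_rate)
        history.append(backlog)
    return history


# Many fast channels outrun the parser, so the backlog keeps growing.
print(parsing_backlog(download_rate=40, parse_rate=25, ticks=5))
# Fewer channels plus a delay keep arrivals below the parse rate.
print(parsing_backlog(download_rate=10, parse_rate=25, ticks=5))
```

In this model, lowering the number of connections or adding a delay corresponds to lowering `download_rate`.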

Best regards,
Oleg Chernavin
MP Staff
placebo 03/24/2010 11:08 am
Hello, thanks for the help, i am still working on the project (i.e. download project).

I've updated my WinXP-SP3 with all windowsupdate.microsoft.com downloads and re-ran the large project. This time OEP went through quickly, without a glitch, no crash -- but maybe I forgot to establish the VPN connection needed for my download (IP-authenticated website). I am not sure, so I am re-running the large project until I am sure that all files are on HDD.

High Parsing Numbers = OE downloads FASTER than OE can process?

Interesting, I didn't know this.
For sure I will set a 2 sec delay. I will also tweak the number of channels down if needed (from 40 to 20). In any case, I will keep you posted on the progress and the finish of the project. I have the feeling that the Windows update did much of the trick (stabilizing OEP) -- I will confirm this later.

Thanks again, CU later,
Pete
Oleg Chernavin 03/25/2010 05:42 am
Thank you! I also suspect that 256 MB RAM could be a real restriction, but not sure - because the page file should allow more virtual RAM. Of course, processing of the pages will be slower on low real RAM, but it should not give errors.

Oleg.
placebo 03/29/2010 03:05 pm
Hello! :))

I have pretty much completed the project: neither OE nor my PC (or HDD) crashed again. The first key to success was setting the URL filters (and File filters!) right so that OE would not download from both "www.serverone.domaingoal.com" and "serverone.domaingoal.com"; the second was the number of channels (20 was okay, a re-run with 10 was just as okay) and the time delay between downloads (2 seconds was okay, a re-run with 3 seconds was okay too). OE did a solid downloading job in both cases!

QUESTION1:
OE didn't download a folder (plus subfolders) because in the re-run I erroneously set a URL filter that excluded it. Many html webpages were affected by that filter (the offline webpages point to the online http source instead). So I've removed that URL filter. Now I have a modified project (project settings).

Will a "Download only Modified and new files" update (=replace) all affected webpages and download all missing pages and in particular that missing folder?

I fear that I will have to re-run the whole project again (17h running time)..

QUESTION2:
the download statistics say "Not found: 1", i.e. OE didn't find a URL (maybe the webpage programmer mistyped the link). How can I find out *which* URL OE is referring to? (I am guessing that this info is in one of the Descr.wd3 files.. Shall I do a text search on all Descr.wd3 files?)

Thanks for some helpful info, best
Pete
placebo 03/30/2010 09:01 am
Hello!
I am still on it. Please could you help with the following?

[URL1] http://mike.painting.balloon.com/*.*
[URL2] http://www.mike.painting.balloon.com/*.*

I am not sure why OEP creates/downloads the 2 folders with practically the same contents and size (the project is so large that it is unfeasible to complete it and then compare the 2 folders exactly). Maybe because both exist online? Or maybe both are valid URLs for the same single file on their server? So maybe the file(s) exist just once on the balloon.com server, but both URLs can be used to download them?
(Actually *some* links on the URL1-html-webpages seem to link to [URL2].. but that might be just an old link or error...)

I would not mind OEP creating a '~duplicate' folder on HDD.. but in this case the project IS TOO large, and OEP has been running for days and never ends (the download queue stays in the range 40000-60000, without OEP actually downloading them fast. My HDD sounds very busy, but I am wondering why OEP isn't SIMPLY downloading the remaining 40000 files at full DSL speed. Transfer speed now alternates between 0 and 10-15kB/s, unless I restart the project from scratch. Then transfer speed is 250kB/s and OEP is happily SIMPLY *DOWNLOADING* the queue.)

QUESTION1:
Can OEP really differentiate (distinguish) between URL1 and URL2?

QUESTION2:
I've tried all kinds of URL Filters and/or File Filters and/or URL Omissions, but nothing ever worked. I want just the URL1 folder on my HDD, plus other linked folders (due to File Filter settings) such as:

[URL3] http://www3.painting.balloon.com
[URL4] http://www.painting.balloon.com
[URL5] http://download.painting.balloon.com

So, how do I configure my project settings so that OEP downloads [URL1,3,4,5] but nothing from [URL2]?
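
The distinction being asked for can be sketched outside of OEP. The Python below is not OEP's filter syntax (the thread does not show it); it only illustrates matching on the exact host name. The set `ALLOWED_HOSTS` and the function `is_wanted` are hypothetical names; the hosts are the ones from [URL1,3,4,5]. Note that a plain substring test on "mike.painting.balloon.com" would also match the [URL2] host, which is why the sketch compares whole host names.

```python
from urllib.parse import urlsplit

# Hosts of [URL1], [URL3], [URL4], [URL5] from the post; [URL2] is left out.
ALLOWED_HOSTS = {
    "mike.painting.balloon.com",
    "www3.painting.balloon.com",
    "www.painting.balloon.com",
    "download.painting.balloon.com",
}


def is_wanted(url):
    """Return True only if the URL's host is exactly one of the allowed hosts."""
    host = urlsplit(url).hostname or ""
    return host.lower() in ALLOWED_HOSTS


print(is_wanted("http://mike.painting.balloon.com/index.html"))      # URL1 host
print(is_wanted("http://www.mike.painting.balloon.com/index.html"))  # URL2 host
```

The first call prints True and the second prints False, since "www.mike.painting.balloon.com" is a different host string even though it contains the URL1 host.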

Since I have been on this project for days (weeks?) I am willing to restart from scratch. Please tell me how to set the download settings, thanks! :D

Best, Pete
Oleg Chernavin 03/30/2010 09:03 am
Sorry for the late answer!

Q1: It is enough to press Ctrl+F5 on it or use "Do not load existing files". This will scan all downloaded files for links that can be downloaded (according to the changed Project settings) but are not yet on the disk. Parsing the files will take time, but much less than with the "Load modified and new" option.

Q2: This can be found only from the logs. So far, Offline Explorer doesn't keep this information. But we plan to include such a feature in future versions, maybe even 6.0.

Oleg.
placebo 03/30/2010 04:39 pm
Hello, thanks for getting back!! :))

I've re-run the project (4th time) from scratch, just to be sure. 8 channels, 4 secs delay. Took 17 hours. To circumvent the low RAM problem, I've found a workaround: when OE reached a critical size (Private Bytes, according to Process Explorer), I suspended the project to a file, terminated OE.exe, restarted OE, and resumed the project from the file. OE would then occupy much less RAM at first and gradually grow again. Then again: suspend to file, kill, restart, resume. I did that 3-4 times. Great workaround! (I am wondering if a future version of OE could handle such memory management on its own, without me having to use the suspend-kill-relaunch-resume workaround.. but that's another story ;)

re: Q1:
I would have *never* thought that the option "Do not load existing files" does the trick, thanks for the hint. Please amend the OE help-file (help, FAQ's) ;)

re: Q2:
I've tried to run a log, but even with 'minimal configuration' the log is huge (total file number is ~85000) and logging might crash my PC (hardware or operating system) or OEP ;) since my RAM is only 256MB.

============================================

statistics (info i dont understand):
Map entries: 88125 (interesting/important number?)
Downloaded files: 88129 (???) (this is the highest number of all "881**"-numbers seen here. what does that mean?)
Parsed files: 44943 (I guess that this is not important info for OE users. OE internal info)
Not found: 1 (i am curious to know, hehe...)
Error while downloading: 0 (that sounds great!)
Aborted/Limited: 44946 (???) (what does that mean?)
Total files: 88105 (???) (what does that mean? why is there a difference of 24?)

Then i scanned my HDD and figured out manually:
total file number is 88125 (excl. the OE-generated start "default.htm").
number of *.PRIMARY-files: 20, totalling 137MB (pdf's and html's-files)
Errors automatically logged: 20 (all referring to the 20 primary files: "Error code=00000020 Sharing violation")
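
A manual count like the one above can be scripted. This is a minimal Python sketch, assuming the download folder is an ordinary directory tree; the function name `summarize_tree` is made up for illustration and is not part of OE. It counts every file (leftover *.primary files included) and sums the size of the *.primary files. It does not exclude "default.htm", so subtract 1 to match the count above.

```python
import os


def summarize_tree(root):
    """Walk `root`; return (total files, *.primary files, *.primary bytes)."""
    total = 0
    primary_count = 0
    primary_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            # Extension check is case-insensitive (.primary / .PRIMARY).
            if name.lower().endswith(".primary"):
                primary_count += 1
                primary_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total, primary_count, primary_bytes
```

Calling `summarize_tree` on the project's download folder (the path depends on the installation) returns the three numbers as a tuple.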

-----------
Q1: the only logic I can see is that 88105 ("Total files") plus the 20 (*.primary) sums to "88125" (the number which I counted). Is that correct?
Q2: what do the other numbers (???) mean, and why is there still a discrepancy of "4" (88129 - 88125)?
Q3: what was the problem (error code) with the 20 files?
Q4: can I just delete the *.PRIMARY extension, and everything is fine? (or shall I re-download them, and how?)
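
For reference, the arithmetic behind these questions, written out in Python. The numbers are copied from the statistics above; the variable names are only labels, and the interpretation in the comments is the poster's guess, not a confirmed explanation.

```python
# Values copied from the statistics in this post.
map_entries = 88125
downloaded = 88129
total_files = 88105
files_on_disk = 88125   # counted by hand, excluding default.htm
primary_files = 20

# "Total files" plus the *.primary files matches the hand count.
print(total_files + primary_files == files_on_disk)  # True

# The "difference of 24" splits into the 20 *.primary files plus 4 unexplained.
print(downloaded - total_files)    # 24
print(downloaded - files_on_disk)  # 4
```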


Thanks for all!
Maybe some much more info on PRIMARY-files in the help-file would be welcome :)
Best, Pete
Oleg Chernavin 03/31/2010 06:05 am
Map - it is the number of files that directly belong to the Project. They are listed in its map and will be exported, backed up, etc.

Downloaded - the number of files actually loaded during the last download session.

Not found - server reported 404 error for some links.

Aborted - this is because you suspended to file.

Total - I need to check this....

Q1 - yes, perhaps so. I will look for it.

Q2 - perhaps, some file was retried.

Q3 - this happens sometimes. I still haven't found out why - very hard to reproduce.

Q4 - yes, it is safe to delete them. They are created immediately after the download. Then Offline Explorer starts processing (parsing) the file and creates a usual copy with all links converted for offline browsing. After processing of the file is complete, .primary copy is removed.

Oleg.
placebo 03/31/2010 09:43 pm
All right, thanks Oleg for all the answers!!
I am *right now* definitely done with the project, so that we can 'close' this thread. (Right now i am preparing the data, compile for DVD, archive/burn them --- this is a very satisfying task since i am happy with the download result).

I used the 'Do not download existing files' button (Ctrl+F5) 2 times, and that helped to get rid of some of the *.primary files, and some missing files were downloaded. Comparing that run (folder) with my previous 2 good runs (I kept the 2 large folders on disk) revealed that run#3 was still missing some folders and files which were present in, e.g., run#1. So luckily I could find and transfer the missing folders and files from run#1 and run#2 to the otherwise close-to-perfect run#3 (completing the download folder of run#3). So let me summarize my observations..

With large projects (this one was ~80000-90000 files), and even if the webpages are simple in structure (i.e. no "modern, state of the art" webpages), OE would do a close to perfect download job -- the major problem for the user is to FIND OUT to what degree the job is close to perfect..

As seen with some other offline browser tools, if OEP could list just the errors encountered (for links to folders and files which the user expects to see downloaded to HDD), then the 'downloaded files' PLUS the 'info on files not downloaded due to some problem/error' would give a complete picture, and I would be satisfied with that.

In my example, OE missed out some folders and files, although i re-pressed the Ctrl+F5 twice (=2 runs with long time parsing). Luckily i was able to find out(!) that even after those 2 re-runs my folder (of run#3) was *still* incomplete because i had some other runs to compare with (same relevant project settings). Otherwise i would have assumed that run#3 reached completion (perfection). NOW, i've supplemented my run#3-download folder, but.. there is no info or guarantee that the folder is 100% complete and perfect now. It could still be missing folders which all 3 runs missed out on.
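
The run-to-run comparison described here can be sketched in Python. This assumes each run was saved to its own folder with the same relative layout; the function names `relative_files` and `missing_from` are made up for illustration. It lists files that exist in a reference run but not in the run being checked. As noted above, it cannot reveal files that every run missed.

```python
import os


def relative_files(root):
    """Return the set of file paths under `root`, relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_from(reference_root, candidate_root):
    """Files present in the reference run but absent from the candidate run."""
    return sorted(relative_files(reference_root) - relative_files(candidate_root))
```

For example, `missing_from(run1_folder, run3_folder)` would list what run#3 lacks compared to run#1, where both folder paths are whatever the user chose for those runs.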

Yes please, in future, some kind of automatically retained details on encountered errors (an ERROR log) would be highly useful for the user's information. The user would 1) have a complete picture about the situation (his download folder + ERRORS = complete), and 2) the user could then take some appropriate action and try to fix the error or download manually.

Thanks for all,
best, Pete



Oleg Chernavin 04/01/2010 05:32 am
OK. I will add this to my plans for a future version, maybe 6.0. Thank you!

Oleg.