Using the OE Enterprise trial, downloaded and installed 4 days ago.
I have successfully added and am downloading a project that contains a large number of huge PDF files (300-700 MB each).
The problem: after each huge PDF is downloaded, parsing it takes enormous CPU power and takes AGES. My quad-core Xeon @ 2.2 GHz with 2048 MB RAM has now been running non-stop at 100% CPU for 3 days and has parsed only 6 files... Until a file is parsed it keeps its .primary extension and is therefore useless. The status bar constantly shows "Parsing (3476)", and the number grows every time another file is downloaded and added to the parsing queue.
The smaller PDFs (1-10 MB) are parsed fine.
The fact is that I do NOT want these PDFs parsed at all. They contain NO extra links.
Please tell me how to EXCLUDE selected file extension(s) from the list of files to be parsed (disable parsing for those extensions entirely). That should solve my problem.
Thanks,
Alex
Best regards,
Oleg Chernavin
MP Staff
I believe you could use any huge PDFs (>500 MB each) for testing purposes. You can even create one yourself (scan some images at 600 dpi, concatenate several copies, and save the result as a PDF). Or you can easily find some other PDF store (an e-book shop or similar)...
The question is not about the PDFs themselves - the question is how to exclude a selected file extension from the "to be parsed" rule (or, alternatively, how to teach your program to treat selected file extensions as binary data; I believe the program does NOT parse DLLs, EXEs and other binaries, does it?). In Teleport Pro (which I have used for years), you can do that in just a couple of clicks.
Thanks.
Alex
ftp://ftp.cuhk.hk/.1/cpatch/gis/mapinfo/source/mi_ug80.pdf
It is about 59 MB. Offline Explorer parses it in 3 seconds on my computer.
ftp://ftp.imag.fr/pub/bibliotheque/theses/1999/Thierry.Raphael/these.dir/these.pdf
This is 150 MB and it takes 8 seconds. My system is a Core 2 Duo E6600 @ 2.4 GHz with 2 GB RAM. Can you please check how long it takes to parse the above files on your computer? Perhaps the PDFs you have contain something special that causes Offline Explorer to slow down while parsing them.
Oleg.
ftp://ftp.worldofspectrum.org/pub/sinclair/books/*.*
For me, parsing has been "hung" for 10 minutes now on the file "AdvancedSpectrumMachineLanguage.pdf.primary" (97,722,984 bytes), while other files keep arriving (860 MB downloaded so far, with 38 files in the "Parsing..." queue), and every file after "AdvancedSpectrumMachineLanguage.pdf" still has the .primary extension when I look inside the folders on my HDD.
This is exactly what I mentioned.
PS: 20 minutes now and "Parsing... (47)". "AdvancedSpectrumMachineLanguage.pdf.primary" is still not gone (still being parsed).
My machine is a quad-core Intel Xeon, with all 4 cores at 100% usage right now (all of it going to the OE.EXE process).
I am pretty sure that if I leave it overnight, every file on the server will be downloaded successfully, but the files after the current one will remain unparsed, keeping their .primary extensions. And if I stop OE right now, those .primary files will never be converted; the next time I run this project, all the "missed" .primary files will be downloaded again and again (even though they are already on my HDD, just with .primary extensions) and pushed into the parsing queue again, and the whole story loops... I have tested this many times with my local server... :(
Please suggest how to exclude ALL files with the .pdf extension (or any other selectable extension) from parsing. How can I download them but NOT parse them - the same as .EXE, .ZIP or any other binary/octet-stream data, for example?
Thanks,
Alex
Oleg.
> usually it takes minutes to load huge files (500-700 MB), while it takes seconds to parse each of
> them, so the parsing queue should not be overloaded - unless you load from a very fast server on
> your local network.
Dear Oleg, the problem is about PARSING, not about DOWNLOADING. Downloading is very fast here (we have a fiber-optic connection and I am sitting right at the server, before the connection is shared). The files download really fast, but right AFTER downloading - after a file disappears from the downloading threads and is pushed to "Parsing... (1)" - it gets stuck at that stage. The whole project can already be downloaded (no active downloading threads left), yet parsing stays active. The problem is that it sometimes stays active forever (OK, maybe not forever, but until I lose my patience - sometimes 5-7 days of non-stop parsing).
Your program is great - it has many unique features and is much more powerful than Teleport (especially for huge projects; Teleport has a really annoying limit of 65K cross-links per project) - but with Teleport I never waited days for one single folder of PDFs... That is the one minus of your software; all the other points are pluses. :)
Thanks,
Alex
Oleg.
SkipParsingFiles=*.pdf,*.exe
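Assuming this setting uses ordinary shell-style wildcard matching on a comma-separated list (an assumption on my part; the product documentation would be authoritative), its skip check can be sketched in Python with `fnmatch` - the function name `should_skip_parsing` is hypothetical, for illustration only:

```python
from fnmatch import fnmatch

def should_skip_parsing(filename, patterns="*.pdf,*.exe"):
    """Return True if the file name matches any pattern in the
    comma-separated wildcard list (compared case-insensitively)."""
    name = filename.lower()
    return any(fnmatch(name, p.strip().lower()) for p in patterns.split(","))

print(should_skip_parsing("mi_ug80.pdf"))   # True  (matches *.pdf)
print(should_skip_parsing("index.html"))    # False (matches nothing)
```

A file for which this check returns True would be stored as-is, like any other binary download, instead of going into the parsing queue.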
Oleg.
Yes, agreed - in most cases. But MOST is not equal to ALL. :)
It would be better for the program's logic to be a bit more adjustable (script-based, and open-source in the ideal case) rather than fixed. More fixed means less tweakable. Just for special cases like mine (and only God knows how often your customers hit this problem in their work without reporting it to the developer)... :)
Also, I have seen somebody else on this forum recommend script-defined program logic too. It is a great idea, believe me. Say you combine your program with Perl-like scripts (with all of Perl's string-handling and parsing power) - your software would be unbeatable forever. With scripts, it would be able to download EVERYTHING without exception: dump databases, handle all non-standard protocols (including secured or encrypted ones), all password-protected areas and all dynamically generated pages (provided the user's hands are not too crooked to write the correct script, of course - but that would be up to the user, not a limitation of the program). Just as an example: I can download my whole PDF project with a 10-line Perl command-line script. No unnecessary parsing, no waiting for something unknown, no errors in the process. Imagine providing that kind of scripting power in your program... :)
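For illustration, the kind of short download-only script described above can be sketched in Python (Perl would be about as short). The host and path in the comment are taken from the FTP link earlier in the thread; `is_pdf` and `mirror_pdfs` are hypothetical names, and this is a minimal sketch of the idea, not a tested mirroring tool:

```python
from ftplib import FTP
import os

def is_pdf(name):
    """True for names ending in .pdf, compared case-insensitively."""
    return name.lower().endswith(".pdf")

def mirror_pdfs(host, path, dest="."):
    """Fetch every PDF in one FTP directory - download only, no link parsing."""
    ftp = FTP(host)
    ftp.login()                      # anonymous login
    ftp.cwd(path)
    for name in ftp.nlst():          # list the directory
        if is_pdf(name):
            target = os.path.join(dest, os.path.basename(name))
            with open(target, "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)
    ftp.quit()

# e.g. mirror_pdfs("ftp.worldofspectrum.org", "/pub/sinclair/books")
```

Because the files are written straight to disk, nothing ever enters a parsing stage - which is exactly the behavior being requested for selected extensions.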
I hope to see something like this in future versions, and I am ready to be a beta-tester for it... :)
Thanks,
Alex
But we do have this in our plans.
Oleg.
> than just executing a fixed code.
Yes, agreed. But..."What you pay is what you get" (c) :)
> And when you have too many scripts attached (users will be able to write their own), it takes much
> time to run each of them.
Partially agreed. As you said just before, DOWNLOADING a file usually takes much longer than PROCESSING it, and all modern CPUs (even low-end ones) are powerful enough to chew through the data - I don't think even 10-20 parallel scripting threads would slow the whole system down. Remember, web servers (with PHP, ASP, CGI, etc.) easily handle 10-20 concurrent threads per user (and hundreds of concurrent users at a time). Of course, all of this depends on the scripts' complexity, but usually they would not be too complicated (since your program itself is capable enough); the scripts would only be needed for the very special cases.
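The claim above - that many lightweight per-file hooks can run concurrently without stalling anything - can be sketched with a thread pool. Here `user_hook` is a hypothetical stand-in for a user-written per-file script; the sleep simulates light string-processing work:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def user_hook(filename):
    """Hypothetical per-file user script: simulate light post-processing."""
    time.sleep(0.05)                 # pretend to do some string work
    return filename + ".done"

files = ["file%d.pdf" % i for i in range(20)]
start = time.monotonic()
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(user_hook, files))
elapsed = time.monotonic() - start
print("%d hooks finished in %.2f s" % (len(results), elapsed))
```

With 20 workers the 20 hooks overlap, so the total wall time stays close to one hook's duration rather than twenty times that.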
That's how I imagine it.
Or there could be two versions of your software: Lite (as it is now) and Pro (scriptable). Your choice, buddy.
> But we do have this in our plans.
Where can I apply for the beta? :)
PS: I'll try your updated SR1 today and will let you know the results.
Thanks,
Alex
Oleg.