.pdf files being parsed for ages :(

Alex 08/29/2007 12:36 pm
Hi there.

Using OE Enterprise (trial), downloaded and installed just 4 days ago.
I have successfully added and started downloading a project that contains a large number of huge PDF files (300-700 MB each).
The problem: after each huge PDF finishes downloading, it takes enormous CPU power and AGES of time to parse the file. Right now it has been running non-stop for 3 days on a quad-core Xeon at 2.2 GHz with 2048 MB RAM at 100% CPU usage, and only 6 files have been parsed so far. Until parsing completes, every PDF keeps its .primary extension and is therefore useless. The status bar constantly shows "Parsing (3476)", and the number grows every time another file is downloaded and added to the parsing queue.

The smaller PDFs (1-10 MB) are parsed OK.

The fact is that I do NOT want these PDFs parsed at all. They contain NO extra links.

Please tell me how to EXCLUDE selected file extension(s) from the list of files to be parsed (disable parsing for those extensions entirely). That should solve my problem.

Thanks,
Alex
Oleg Chernavin 08/30/2007 06:14 am
Can you please give me a link to these PDFs? I want to see what is wrong myself and optimize the parsing code. Thank you!

Best regards,
Oleg Chernavin
MP Staff
Alex 08/30/2007 06:53 am
Negative. The server is on our local LAN (our backup server) and cannot be reached from outside the campus for security reasons (you can try anyway: ftp://10.1.1.112/backups_pub/MegaOne_project/2007-08/*). But there is nothing wrong with the PDFs themselves: they are just scanned scientific documents wrapped into .pdf format (sic!). No text, no extra links inside, nothing to parse, and certainly nothing to parse for _ages_. Just scanned data (scanned as images, not OCR'd).

I believe you could use any huge PDFs (>500 MB each) for testing purposes. You can even create one yourself (scan some images at 600 dpi, append several copies one after another, and save the result as a PDF). Or you can easily find some other PDF datastore (an e-book shop or something)...

The question is not about the PDFs themselves; the question is how to exclude a selected file extension from the "to be parsed" rule (or, alternatively, how to teach your program to treat selected file extensions as binary data; I assume the program does NOT parse DLLs, EXEs and other binaries, does it?). In Teleport Pro (which I have been using for years) you can do that with just a couple of clicks.

Thanks.
Alex 09/04/2007 09:22 am
No suggestions? :(

Alex
Oleg Chernavin 09/04/2007 12:00 pm
I am checking this. I found two quite big PDFs online and downloaded them:

ftp://ftp.cuhk.hk/.1/cpatch/gis/mapinfo/source/mi_ug80.pdf

It is about 59 MB. Offline Explorer parses it in 3 seconds on my computer.

ftp://ftp.imag.fr/pub/bibliotheque/theses/1999/Thierry.Raphael/these.dir/these.pdf

This is 150 MB and it takes 8 seconds. My system is a Core 2 Duo E6600 @ 2.4 GHz with 2 GB RAM. Can you please check how long it takes to parse the above files on your computer? Perhaps the PDFs you have contain something special that slows down Offline Explorer's parsing.

Oleg.
Alex 09/04/2007 01:29 pm
Hey, try this:

ftp://ftp.worldofspectrum.org/pub/sinclair/books/*.*

For me, parsing has been "hung" on the file "AdvancedSpectrumMachineLanguage.pdf.primary" (97,722,984 bytes) for 10 minutes now, while other files keep arriving (860 MB downloaded so far, and 38 files in the "Parsing..." queue), and every file after "AdvancedSpectrumMachineLanguage.pdf" still has its .primary extension when I look inside the folders on my HDD.
This is exactly what I mentioned.


PS: 20 minutes now, and the status shows "Parsing... (47)". "AdvancedSpectrumMachineLanguage.pdf.primary" is still not gone (still being parsed).

My machine is an Intel Xeon quad-core, with all 4 cores at 100% usage right now (all of it going to the OE.EXE process).
I am pretty sure that if I leave it overnight, all the files from the server will download successfully, but everything after the current file will remain unparsed with a .primary extension. And if I stop OE right now, the .primary files never go away, and the next time I run this project, all the "missed" .primary files get downloaded again and again (even though they are already on my HDD, just with .primary extensions), pushed into the parsing queue again, and the whole story loops... I have tested this many times with my local server... :(

Please suggest how to exclude ALL files with the .pdf extension (or any other selectable extension) from being parsed. How can I download them but NOT parse them, the same way .EXE, .ZIP or any other binary/octet-stream data is handled?

Thanks,
Alex
Oleg Chernavin 09/04/2007 01:30 pm
OK. Your logic is correct. If the files get added again and again, the parsing queue will grow. However, it usually takes minutes to load huge files of 500-700 MB, while it takes only seconds to parse each of them, so the parsing queue should not get overloaded - unless you load from a very fast server on your local network.

Oleg.
Alex 09/04/2007 01:51 pm
> OK. Your logic is correct. If the files get added again and again, the parsing queue will grow. However,
> it usually takes minutes to load huge files of 500-700 MB, while it takes only seconds to parse each of
> them, so the parsing queue should not get overloaded - unless you load from a very fast server on
> your local network.

Dear Oleg, the problem is with PARSING, not DOWNLOADING. Downloading is very fast here (we have a fiber-optic connection and I am sitting right at the server, ahead of the connection sharing). The files download really fast, but right AFTER downloading (once a file disappears from the download threads and is pushed into "Parsing... (1)"), they get stuck at that stage. The whole project may already be _downloaded_ (no active download threads anymore) while PARSING is still active. The problem is that sometimes it stays active forever (OK, maybe not forever, but until I lose my patience - sometimes 5-7 days of non-stop parsing).

Your program is great, has many unique features, and is much more powerful than Teleport (especially for huge projects, since Teleport has a really annoying limitation of no more than 65K cross-links per project). But there is this one thing: with Teleport I never had to wait for days on a single folder of PDFs... That is the one minus of your software; everything else is a plus. :)

Thanks,
Alex
Oleg Chernavin 09/04/2007 01:51 pm
I will try to reproduce the situation with multiple PDFs in the queue.

Oleg.
Oleg Chernavin 09/04/2007 01:58 pm
I understand that parsing is not necessary in your case. But in most cases it is good to look for links in PDFs. I can add an optional feature to skip parsing PDFs for you, if you want.

Oleg.
Oleg Chernavin 09/04/2007 03:21 pm
I just released version 4.8 SR1, which includes a new URLs field command:

SkipParsingFiles=*.pdf,*.exe
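
For example, place the command on its own line in the Project's URLs field, right after the starting URL. With your backup server it would look something like this (sketching from the address you posted above):

ftp://10.1.1.112/backups_pub/MegaOne_project/2007-08/*
SkipParsingFiles=*.pdf

Files matching the masks should then be saved as-is, like .EXE or .ZIP files, without entering the parsing queue.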

Oleg.
Alex 09/04/2007 11:36 pm
> in most cases it is good to look for links in PDFs.

Yes, agreed - in most cases. But MOST is not the same as ALL. :)
It would be better to make the program's logic a little more adjustable (scripting-based, and ideally open source) rather than fixed. More fixed means less tweakable. Just for special cases like mine (and only God knows how often your customers hit this problem in their work without ever reporting it to the developer)... :)

Also, I saw somebody else on this forum recommend scripting-defined program logic too. It is a great idea, believe me. Say you combined your program with Perl-like scripts (with all of Perl's string-handling and parsing power): your software would be unbeatable forever. With scripts, it could download EVERYTHING with no exceptions: dump databases, handle all non-standard protocols (including secure or encrypted ones), all password-protected areas, and all dynamically generated pages (provided the user is skilled enough to write a correct script, of course, but then it is up to the user, not the program's limitations). Just as an example, I can download my whole PDF project with a roughly 10-line Perl command-line script: no unnecessary parsing, no waiting for something unknown, no errors along the way. Imagine providing that kind of scripting power in your program... :)
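
Roughly like this (a quick sketch using Perl's standard Net::FTP module, pointed at the worldofspectrum folder above; the host, path and anonymous-login details are just placeholders for whatever server you aim it at):

use strict;
use warnings;
use Net::FTP;

# Grab every PDF in the folder as opaque binary data - no parsing at all.
my $ftp = Net::FTP->new('ftp.worldofspectrum.org') or die "connect failed: $@";
$ftp->login('anonymous', 'anonymous@example.com') or die 'login failed: ' . $ftp->message;
$ftp->cwd('/pub/sinclair/books') or die 'cwd failed: ' . $ftp->message;
$ftp->binary;                                  # octet-stream mode, same as any .EXE or .ZIP
for my $file (grep { /\.pdf\z/i } $ftp->ls) {  # only the .pdf names
    print "fetching $file\n";
    $ftp->get($file) or warn "get $file failed: " . $ftp->message;
}
$ftp->quit;

Nothing here ever looks inside the files, so a 700 MB scan costs no more CPU than a 1 MB one.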

Hope to see something like this in future versions, and I am ready to be a beta-tester of that...:)

Thanks,
Alex
Oleg Chernavin 09/05/2007 01:54 am
Yes, I am thinking about this. The biggest issue is performance. The software would have to run scripts, which usually takes more time than executing fixed code. And when too many scripts are attached (users would be able to write their own), running each of them takes considerable time.

But we do have this in our plans.

Oleg.
Alex 09/06/2007 10:04 am
> The biggest issue is performance. The software would have to run scripts, which usually takes more
> time than executing fixed code.
Yes, agreed. But... "you get what you pay for" (c) :)

> And when too many scripts are attached (users would be able to write their own), running each of
> them takes considerable time.
Partially agreed. As you said just before, DOWNLOADING a file usually takes much longer than PROCESSING it. And all modern CPUs (even low-end ones) are powerful enough to chew through the data; I don't think even 10-20 parallel scripting threads would slow the whole system down. Remember, web servers (with PHP, ASP, CGI, etc.) easily handle 10-20 concurrent threads per user (and hundreds of concurrent users at a time). Of course, all of this depends on the scripts' complexity, but they would usually not be too complicated (since your program already handles the common cases well). Scripts would only come in for the very special cases.
That's how I imagine it.

Or there could be two versions of your software: Lite (as it is now) and Pro (scriptable). Your choice, buddy.

> But we do have this in our plans.
Where can I apply for the beta? :)

PS: I'll try your updated SR1 today and let you know the results.

Thanks,
Alex
Alex 09/25/2007 11:48 am
I also think it would be a good idea to have some control over the PARSING queue (just as there is now for the DOWNLOAD queue): start/stop/delete/move files within the queue.

Thanks,
Alex
Oleg Chernavin 09/25/2007 11:52 am
Yes, I have thought about this. But I am not sure where to place the control for it in the user interface. Too many features are there already.

Oleg.
Oleg Chernavin 09/25/2007 01:00 pm
Yes, I will think about this.

Oleg.