PDF files and parsing issues

Offline Explorer Enterprise Edition

Akram

06/01/2013 02:23 pm

Hi,

This is a continuation of our previous discussion.

I will summarize the problem.

With the new version, the website is downloaded together with all of the PDF files.

I tried to reduce the number of links parsed from the PDF files, so I played with the URL substitute rules to get all the PDF files from the HTML pages.
As a result, the links that I get from the PDF files are almost all bad.

1. To improve parsing of PDF files:
Parsing errors:
Parser error 2: Out of memory URL: http://ia700304.us.archive.org/11/items/adakm/atlasda.pdf
Parser error 2: Out of memory URL: http://ia601502.us.archive.org/17/items/WAQ79051/79051.pdf
http://ia601502.us.archive.org/17/items/WAQ79051/79051.pdf
http://ia600301.us.archive.org/21/items/waq114543_854/114543.pdf
http://ia700806.us.archive.org/13/items/waqtmhmc/tmhmc.pdf
http://ia600402.us.archive.org/9/items/waq98398/98398.pdf

Links that do not exist in the downloaded PDF are created.
For example,
http://ia701202.us.archive.org/16/items/waq78361/01_78361.pdf
gives this URL:
http://ia701202.us.archive.org/16/items/waq78361/krahab0.pdf

2. What does this message mean?
Reget is not supported. URL: http://ia601509.us.archive.org/9/items/waq2801/02_2802.pdf

3. Is it possible to enable .primary only for certain types of files?
The project becomes very big when .primary files are created for PDF files.

4. When I disable parsing for certain types of files, does this prevent .primary files from being created?

5. I will send you by email
the project settings and the queue of files that were downloaded but are requeued
(I sorted them by type).
I am sorry, I tried to make a small project but couldn't.

6. With my URL substitute (filename):
www.site.com/a/b/aaa/
becomes
www.site.com/a/a/a/aaa
but other URL substitutes give me errors, i.e. all the links in the page aaa are bad.
Example:
http://ia601509.us.archive.org/9/items/waq2801/
this one is good
http://archive.org/details/items/items/waq2801
this one is bad
http://archive.org/items/waq2801
Maybe we get this because OEE does not support a file and a directory that share the same link.

7. Could you add support for links that are both a file and a folder, along these lines:
links to the file name are kept intact;
for the directory, add .dir or some other suffix.
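
The idea in point 7 can be illustrated with a small sketch. This is not how Offline Explorer maps URLs to disk; it is only one possible reading of the proposal. The function name `local_path`, the `.dir` suffix and the `index.html` file name for directory pages are assumptions made for the example.

```python
from urllib.parse import urlsplit


def local_path(url: str, dir_suffix: str = ".dir") -> str:
    """Map a URL to a local relative path (hypothetical scheme).

    File names are kept intact; every directory segment gets dir_suffix,
    so a file and a directory with the same name never collide on disk.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    is_directory = parts.path.endswith("/") or not segments

    if is_directory:
        # Directory URL: all segments are folders; the page itself is
        # stored as index.html (an assumed name) inside the last folder.
        folders = [s + dir_suffix for s in segments]
        return "/".join([parts.netloc, *folders, "index.html"])

    # File URL: parent segments are folders, the last segment is the file.
    *parents, filename = segments
    folders = [s + dir_suffix for s in parents]
    return "/".join([parts.netloc, *folders, filename])


print(local_path("http://archive.org/items/waq2801"))
print(local_path("http://archive.org/items/waq2801/"))
```

With this scheme the file link is saved as `archive.org/items.dir/waq2801`, while the directory link with the same name is saved under `archive.org/items.dir/waq2801.dir/`, so both can exist side by side.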

Oleg Chernavin

06/01/2013 03:21 pm

1. I improved it further:

http://www.metaproducts.com/download/betas/OEE3957.zip

2. This message appears when a connection was broken half-way: OE tries to resume the download instead of re-downloading the file from the beginning. But not all servers support this method.

3. Sorry, no.

4. Yes, primary files should not be created in this case.

5. Perhaps my fixes from today will deal with that.

6, 7. I need more details and particular examples for that.
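
For readers wondering about point 2: resuming a broken download over HTTP is normally done with a `Range` request header. The sketch below shows the general mechanism only; it is not Offline Explorer's code, and the function names are made up for the example. It is an assumption that the "Reget is not supported" message corresponds to the case where the server does not honor the range.

```python
from typing import Dict


def resume_request_headers(bytes_on_disk: int) -> Dict[str, str]:
    """Headers asking the server to send only the missing tail of a file."""
    if bytes_on_disk <= 0:
        return {}  # nothing saved yet: ordinary full download
    return {"Range": f"bytes={bytes_on_disk}-"}


def next_action(status: int, headers: Dict[str, str], bytes_on_disk: int) -> str:
    """Decide what to do with the response to a resume request."""
    lowered = {k.lower(): v for k, v in headers.items()}
    content_range = lowered.get("content-range", "")

    if status == 206 and content_range.startswith(f"bytes {bytes_on_disk}-"):
        return "append"   # server resumed at the requested offset
    if status == 200:
        return "restart"  # server ignored Range and sends the whole file again
    return "fail"         # e.g. 416 Range Not Satisfiable or another error


print(resume_request_headers(1024))
print(next_action(206, {"Content-Range": "bytes 1024-2047/2048"}, 1024))
print(next_action(200, {}, 1024))
```

A server that answers `200` instead of `206` does not support resuming for that file, so the client has to download it again from the first byte.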

Best regards,
Oleg Chernavin
MP Staff

Akram

06/02/2013 07:24 am

Hi,

1. and 5. I will try it later.

2. I get it.

3. :-(
but 4. will help me a lot :-)

6. and 7. I will try to write a good explanation in a few days.

Thank you very much.
