PDF files and parsing issues

Author Message
Akram 06/01/2013 02:23 pm
Hi,

This is a continuation of our previous conversation.

I will summarize the problem.

With the new version the website downloads like a charm, including all of the PDF files.

I tried to reduce the number of files parsed from the PDFs, so I played with the URL Substitutes to get all the PDF files from the HTML pages.
The links that I get from the PDFs are almost all bad.

1. To improve the parsing of PDF files:
Parsing errors:
Parser error 2: Out of memory URL: http://ia700304.us.archive.org/11/items/adakm/atlasda.pdf
Parser error 2: Out of memory URL: http://ia601502.us.archive.org/17/items/WAQ79051/79051.pdf
http://ia601502.us.archive.org/17/items/WAQ79051/79051.pdf
http://ia600301.us.archive.org/21/items/waq114543_854/114543.pdf
http://ia700806.us.archive.org/13/items/waqtmhmc/tmhmc.pdf
http://ia600402.us.archive.org/9/items/waq98398/98398.pdf

Links that don't exist in the downloaded PDF are created.
For example,
http://ia701202.us.archive.org/16/items/waq78361/01_78361.pdf
gives this URL:
http://ia701202.us.archive.org/16/items/waq78361/krahab0.pdf

2. What does this mean?
Reget is not supported. URL: http://ia601509.us.archive.org/9/items/waq2801/02_2802.pdf

3. Is it possible to enable .primary only for certain types of files?
The project becomes very big when .primary files are created for PDF files.

4. When I disable parsing for certain types of files, does this prevent .primary files from being created or not?

5. I will send you by email
the project settings and the queue of files that were downloaded but are requeued
(I sorted them by type).
I am sorry, I tried to make a small project but couldn't.

6. With my URL Substitute (filename):
www.site.com/a/b/aaa/
becomes
www.site.com/a/a/a/aaa
but other URL Substitutes give me errors, i.e. all the links in the page aaa are bad.
Example:
http://ia601509.us.archive.org/9/items/waq2801/
this one is good
http://archive.org/details/items/items/waq2801
this one is bad
http://archive.org/items/waq2801
Maybe we get this because OEE does not support a file and a directory that share the same link.

7. Could you add support for links that are both a file and a folder, using this idea:
links to the filename are kept intact;
for the directory, add .dir or something similar.
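
As a rough illustration of the idea in point 7, here is a minimal Python sketch. It is only an assumption about how such a naming rule could look, not how OEE maps URLs to local files. The function name `local_path`, the `.dir` suffix and the `index.html` default are all hypothetical. Every directory segment gets the suffix, and the final file name is left intact, so a file and a directory with the same name no longer collide.

```python
from urllib.parse import urlsplit


def local_path(url, dir_suffix=".dir", index_name="index.html"):
    """Map a URL to a local relative path (hypothetical naming rule).

    Directory segments get `dir_suffix` appended; a trailing file name is
    kept as-is. A URL ending in "/" is treated as a directory and stored
    under `index_name` (an assumed default document name).
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if parts.path.endswith("/") or not segments:
        dirs, filename = segments, index_name
    else:
        dirs, filename = segments[:-1], segments[-1]
    renamed = [d + dir_suffix for d in dirs]
    return "/".join([parts.netloc, *renamed, filename])


# The same name used as a file and as a directory maps to two different paths.
print(local_path("http://archive.org/items/waq2801"))
# archive.org/items.dir/waq2801
print(local_path("http://archive.org/items/waq2801/"))
# archive.org/items.dir/waq2801.dir/index.html
```

With this rule the two example links from point 6 would be saved side by side instead of overwriting each other.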
Oleg Chernavin 06/01/2013 03:21 pm
1. I improved it further:

http://www.metaproducts.com/download/betas/OEE3957.zip

2. It means that when a connection is broken half-way, OE tries to resume the download instead of redownloading the file from 0. But not all servers support this method.

3. Sorry, no.

4. Yes, primary files should not be created in this case.

5. Perhaps my fixes from today will deal with that.

6, 7. I need more details and particular examples for that.
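
The resume behaviour described in point 2 corresponds to the standard HTTP Range mechanism. The Python sketch below shows the general idea only; it is not OE's actual code, and the helper names `range_header` and `plan_resume` are made up for this example. A client asks for the remaining bytes with a `Range` header. A 206 reply means the server honoured the request. A 200 reply means the server ignored it and is sending the whole file again, which is the situation a "Reget is not supported" message points to.

```python
def range_header(offset):
    """Build the request header asking for bytes from `offset` to the end."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def plan_resume(status):
    """Decide what to do with the partial file, given the HTTP status code."""
    if status == 206:   # Partial Content: server resumed at the requested offset
        return "append"
    if status == 200:   # server ignored Range and sends the full file from 0
        return "restart"
    return "error"      # any other status is treated as a failure in this sketch


print(range_header(1024))   # {'Range': 'bytes=1024-'}
print(plan_resume(206))     # append
print(plan_resume(200))     # restart
```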

Best regards,
Oleg Chernavin
MP Staff
Akram 06/02/2013 07:24 am
Hi,

1. and 5. I will try it later.

2. I get it.

3. :-(
but 4. will help me a lot :-)

6. and 7. I will try to write a good explanation in a few days.

Thank you very much.