Download missing files not working well

Author Message
Akram 02/15/2013 03:27 pm
hi,

we are amazed with your software.

Could you please help with :

We finished downloading a website, www.waqfeya.com for example, with URL substitutes and additional=keepprimary ...
We can browse it like a charm; however,
when we try to download missing files, some of the already downloaded files get downloaded again,
and when exporting, those files don't get exported or the link translation is bad (no images, no styles in some files).

I will mail you the project properties.

thank you in advance

Oleg Chernavin 02/18/2013 06:33 am
I got the E-mail with the Project settings. Can you please give me some examples of such redownloaded and non-exportable files, and tell me where (on which pages) to look for them? It will help me to reproduce the problem.

Thank you!

Best regards,
Oleg Chernavin
MP Staff
Akram 02/18/2013 03:15 pm
XXX/category.php?cid=87
XXX/book.php?bid=7813
resalty.XXX/index.php
resalty.XXX/index.php/category-15
This problem persists with other websites

I think it may be related to how big the file is. I checked and unchecked "suppress website error", but no luck.


Another little problem that got on my nerves is how to add / at the end of a URL if it has no extension. For example, a/am should become a/am/, while a/am.exe should not be modified.
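For readers who want to see the rule spelled out, here is a minimal Python sketch of it. It is not how Offline Explorer implements URL substitutes; the function name `add_trailing_slash` is hypothetical, and "has an extension" is assumed to mean "the last path segment contains a dot".

```python
from urllib.parse import urlsplit, urlunsplit


def add_trailing_slash(url: str) -> str:
    """Append "/" to the path when the last path segment has no extension.

    Assumption: a segment "has an extension" if it contains a dot.
    """
    parts = urlsplit(url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    # Leave the URL alone if the path is empty, already ends with "/",
    # or the last segment looks like a file name with an extension.
    if not last_segment or "." in last_segment:
        return url
    return urlunsplit(parts._replace(path=path + "/"))


print(add_trailing_slash("a/am"))      # a/am/
print(add_trailing_slash("a/am.exe"))  # a/am.exe
```

The query string, if any, is kept after the added slash, because only the path component is changed.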


thank you.

Akram 02/20/2013 05:17 pm
Hi,

I just happened to pass by and found a new version of OEE. I will try it to find out whether it resolves my problems. Thank you for the hard work.
Oleg Chernavin 02/21/2013 12:16 am
Yes, please let me know how it works. If it still fails, I will continue my investigation. Thank you!

Oleg.
Akram 02/21/2013 03:25 pm
I get the same thing: almost 1800 files to redownload. Please help.
Akram 02/22/2013 06:06 am
Hi

Regarding this:

>Another little problem that got on my nerves is how to add / at the end of a URL
>if it has no extension. For example, a/am should become a/am/, while a/am.exe should not be modified.

I found a workaround for it. Thank you anyway.
Oleg Chernavin 02/22/2013 08:13 am
I fixed this. Here is the updated oe.exe file:

http://www.metaproducts.com/download/betas/OEP3908.zip

Oleg.
Akram 02/22/2013 01:00 pm
Thanks, but I have the OEE version, not OEP.
Oleg Chernavin 02/22/2013 01:19 pm
Sorry, I was unable to find you in our orders database. We send out such Offline Explorer Enterprise updates only to its registered users.

Oleg.
Akram 02/22/2013 01:26 pm
And again, it's the same problem. Files keep being downloaded when using "download missing files" only.
Akram 02/22/2013 01:48 pm
Sorry for my last message. I think it works well. I will check it once more and tell you.
Akram 02/22/2013 04:33 pm
Hi,

The same problem persists. I tried to download some files from another website and got stuck with the same problem: http://alhazme.net/ and a lot of other ones ...

However, the parser is a little quicker than before.

Oleg Chernavin 02/25/2013 06:10 am
I downloaded this site with Level=1 and it didn't try to get files in the Download Missing Files mode. Can you tell me which links it tried to get again and again?

Oleg.
Akram 02/26/2013 05:54 pm
I think the problem is with my project settings. Could you please verify my URL substitutes? I think the problem may be there.

I will try it on another PC to see whether anything changes, and will contact you.
Oleg Chernavin 02/27/2013 04:39 am
I don't have the settings for http://alhazme.net/. Can you please post them here? Select the Project, press Ctrl+C and paste to the forum message.

Oleg.
Akram 03/03/2013 03:26 pm
Hi,

For alhazme the problem is solved. The whole website is downloaded; then Ctrl+F5 gets only the links with errors and one that was downloaded before. After the second download of that one remaining link, no links remain. Another Ctrl+F5 to test, and no links at all. Thank you very much.

However, for the first website the problem remains almost the same:
Some links that were redownloaded every time are now not redownloaded.
Other ones, like these, are redownloaded every time:

http://resalty.waqfeya.com/index.php from http://resalty.waqfeya.com/
and so on.
http://www.waqfeya.com/book.php?bid=1043 from http://www.waqfeya.com/category.php?cid=87

Oleg Chernavin 03/05/2013 06:45 am
OK. I need the settings again.

Oleg.
Akram 03/05/2013 12:31 pm
Hi,

Another sub-problem that I was going to tell you about once we finished with the missing files download problem
is that the PDF files on archive.org are not all downloaded properly.

I downloaded almost 100 GB of files 3 times with no success.
I noticed three things in this project:
* 302 moved files (archive.org) are downloaded, but most of them are not found by OEE and are not browsable
* that may be due to the same problem
* maybe there is some problem with my URL substitutes or filters.

The website dedewnet.com is downloaded well with the latest version of OEE.
But with the latest OEP that you gave me, there are still missing files to download each time (no substitutes or filters,
only for "www." and ".net" to ".com").

I also mailed you the project settings for waqfeya.com.
Oleg Chernavin 03/07/2013 12:41 pm
I didn't receive the settings. Can you resend them?

Oleg.
Akram 03/08/2013 06:19 am
Did you receive them? I resent them to you yesterday.
Akram 03/12/2013 05:25 pm
Hi,

Oleg, are you OK?

Sorry for disturbing you, but is there anything new? If you didn't receive the settings, I will paste them directly here in a post.

Waiting for your response.
Oleg Chernavin 03/17/2013 03:12 pm
I am sorry for the late reply! I got the settings, but the long lines were corrupted when copying/pasting to Offline Explorer. Can you please post them here? I will try to reproduce the issue tomorrow and see what can be done.

Thank you!

Oleg.
Akram 03/22/2013 05:57 am
Hi,

I resent you the settings a couple of days ago as an attached file so that you don't get errors. Did you get them?

I'm waiting for your response and an eventual solution.
Oleg Chernavin 03/22/2013 06:44 am
Yes, the attached text file was still broken - special dividing symbols were replaced with spaces. Please select the Project, press Ctrl+C and paste to this forum message.

Oleg.
Akram 03/27/2013 04:54 pm
Hi,

Is there anything new on my problem, dear Oleg? I did resend you the right settings.

Thank you.
Oleg Chernavin 03/28/2013 08:46 am
I replied to you on March 25th by E-mail. I wrote the following:

OK. I downloaded the Project partially. It browses well. Downloading using Ctrl+F5 doesn't try to get existing files. What should I look for?

Oleg.
Akram 03/29/2013 10:45 am
I don't get it.

Maybe it's a minor problem with my installation, because I switched a lot between OEE and the OEP beta.

I will try to reinstall it or use another user's PC to see whether anything changes.

I will try to post after the week-end.
Akram 04/03/2013 08:55 am
Hi,

I tried it on other PCs and the same problem remains on some pages, so I changed the URL substitutes for the remaining ones:
category-1 (category-1 is a file)
category-1/thesis (category-1 is a folder)

I changed the first to category1, for example (it is impossible to have a folder and a file with the same name).
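To illustrate the file/folder clash described above, here is a small Python sketch that finds URLs whose local name would be needed both as a file and as a folder. The host-plus-path mapping and the function name `find_file_folder_collisions` are assumptions made for illustration; Offline Explorer's real on-disk layout may differ.

```python
from urllib.parse import urlsplit


def find_file_folder_collisions(urls):
    """Return local paths that would have to be both a file and a folder.

    Assumption: each URL is saved as <host>/<path segments>, and a path
    ending in "/" is stored as a folder.
    """
    files, folders = set(), set()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        if not segments:
            continue
        # Every parent of the last segment must exist as a folder.
        for i in range(1, len(segments)):
            folders.add("/".join([parts.netloc] + segments[:i]))
        full = "/".join([parts.netloc] + segments)
        if parts.path.endswith("/"):
            folders.add(full)
        else:
            files.add(full)
    return sorted(files & folders)


urls = [
    "http://example.com/category-1",
    "http://example.com/category-1/thesis",
]
print(find_file_folder_collisions(urls))  # ['example.com/category-1']
```

Any path reported by this check needs a rename on one side (for example, category-1 to category1 for the file) before both URLs can be stored.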

For the other links, I couldn't find the cause of the problem.

See you soon
Akram 04/06/2013 03:30 am
Hi,

Good news: the problem is solved. It was partly due to the PC I was using.

Thank you very much for your time.
Oleg Chernavin 04/06/2013 04:54 am
This is good to hear! Was it an antivirus or some other software interfering?

Oleg.
Akram 04/22/2013 03:13 pm
Hi,

Sorry for not answering for a long period.

Oleg, I think it was caused by one or all of the following:
1. Temp files not purged properly
2. A problem with URLs like xxx/cat/ and xxx/cat (file and directory)
3. Maybe the use of two versions at the same time to download the same project (during testing)

The project just got downloaded well after some changes I made after reading some topics.


But I got these minor problems.

1. The OEE window hangs at some stage when parsing.
The memory is not at the limit (56 MB used on average)
and all the files have been downloaded by OEE.

2. Some errors in the parser (I don't know yet whether this affects the project)
ERROR - ... - Error reading from file: C:\DOCUME~1\XXX\LOCALS~1\Temp\htt67.tmp
Error code=00000000 Referer=http://ia600709.us.archive.org/34/items/waqsfmkn_3/
I assume that the parser fails to parse PDF files to get links from their bookmarks.

3. Sometimes I get the error "Out of memory".
Is it possible to pause the parser, or to skip parsing files by size rather than by name?
I want to parse PDF files, but not the bigger ones if they affect the parser.
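The size-based filter requested in item 3 could look roughly like the Python sketch below. It only illustrates the idea and does not describe an existing Offline Explorer option; `split_by_size` and the 50 MB default are hypothetical.

```python
import os


def split_by_size(paths, max_bytes=50 * 1024 * 1024):
    """Split files into (to_parse, skipped) by size on disk, not by name.

    Assumption: files larger than max_bytes are too expensive to parse.
    """
    to_parse, skipped = [], []
    for path in paths:
        if os.path.getsize(path) <= max_bytes:
            to_parse.append(path)
        else:
            skipped.append(path)
    return to_parse, skipped
```

With such a filter, small PDF files would still be parsed for links, while very large ones would be kept on disk but skipped by the parser.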

P.S. :
If you think it's better to make a new topic just tell me
Akram 04/22/2013 03:15 pm
Sorry I didn't read the last message well.

I'm not sure if it's virus-related, but I am sure that no software was interfering.
Akram 04/22/2013 03:43 pm
Hi,

Some of the errors:
Parser error 2: Out of memory URL: http://ia600402.us.archive.org/9/items/waq98398/98398.pdf
Error reading from file: X:\ia700306.us.archive.org\7\items\41455waq\41455.pdf.primary Error code=00000000 Referer=http://ia700306.us.archive.org/7/items/41455waq/
Oleg Chernavin 04/23/2013 06:36 am
Looks like I have to change the way PDF files are parsed. They are loaded into memory completely and then processed.

I have to split them into smaller parts to keep memory usage low. Thank you for the tests!
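As a general illustration of processing a large file in smaller parts instead of loading it completely, here is a Python sketch that scans a binary stream for URL-like byte strings in fixed-size chunks. It is not a PDF parser and not Offline Explorer's code; the regular expression, the function name `iter_urls_chunked` and the chunk size are assumptions.

```python
import re

# Assumption: a "link" is any http(s) URL-like run of bytes.
URL_RE = re.compile(rb"https?://[^\s()<>]+")
MAX_PREFIX = len(b"https://")  # longest scheme prefix that may be split


def iter_urls_chunked(stream, chunk_size=64 * 1024):
    """Yield URL-like byte strings while holding only a small buffer."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        eof = not chunk
        buf += chunk
        # Keep a short tail in case "https://" is split across chunks.
        keep_from = max(len(buf) - MAX_PREFIX, 0)
        for match in URL_RE.finditer(buf):
            if match.end() == len(buf) and not eof:
                # The URL may continue in the next chunk: keep it whole.
                keep_from = match.start()
                break
            yield match.group()
            keep_from = max(keep_from, match.end())
        if eof:
            return
        buf = buf[keep_from:]
```

Memory use stays close to one chunk plus the longest unfinished match, rather than the size of the whole file.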

Oleg.
Akram 04/24/2013 08:17 am
Hi,

Never mind it.

Thank you too.
Oleg Chernavin 04/24/2013 08:27 am
I have already changed the code to use half as much memory for PDF parsing. It should also work a bit faster.

Oleg.
Akram 04/24/2013 08:28 am
Another strange behavior of the parser!

In the project's settings I check "download only missing files" and check the two options below it.
I press Ctrl+F5 and F9, and after some hours of parsing I get links like these:

http://archive.org/details/xx/
and
http://archive.org/XX/items/xx/

and when they are downloaded it says 304 Not Modified.

It's strange, isn't it?
Files are being queued and downloaded when they already exist.


And please explain to me the difference between URL and HTML Text in URL substitutes.
Oleg Chernavin 04/24/2013 09:55 am
Does the Project use URL Substitutes?

URL and Text: the first changes URLs before they get queued. The second changes text in downloaded HTML files.
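The two stages can be pictured with the following Python sketch. It is a conceptual model only: the rule format (plain find/replace pairs) and the names `substitute`, `enqueue` and `postprocess_html` are assumptions, not Offline Explorer's actual rule syntax or internals.

```python
from collections import deque


def substitute(text, rules):
    """Apply plain find/replace pairs in order (hypothetical rule format)."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


def enqueue(queue, url, url_rules):
    """"URL" stage: rewrite the address before it is queued for download."""
    queue.append(substitute(url, url_rules))


def postprocess_html(html, text_rules):
    """"Text" stage: rewrite the contents of an already downloaded page."""
    return substitute(html, text_rules)


queue = deque()
enqueue(queue, "http://www.example.net/a", [("www.", ""), (".net/", ".com/")])
print(queue[0])  # http://example.com/a
print(postprocess_html('<a href="a/am">x</a>', [('href="a/am"', 'href="a/am/"')]))
# <a href="a/am/">x</a>
```

In this model a URL rule affects what is requested and where it is stored, while a text rule only changes what ends up inside the saved page.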

Oleg.
Akram 04/24/2013 07:20 pm
Hi,

Yes to the first question, and "get all possible subdirectory" is also checked.
Oleg Chernavin 04/25/2013 05:26 am
Can you make a small Project that loads at most a few hundred files and reproduces this issue?

Oleg.
Akram 04/27/2013 10:05 am
Hi,

I don't understand what you said.
Oleg Chernavin 04/28/2013 06:58 am
I need to reproduce this problem. Do you have a sample Project that shows the issue without downloading a huge number of files?

Oleg.
Akram 04/28/2013 11:09 am
I see.

For the moment I don't have a small project.

I will redownload mine (a very big project) and reproduce the issue another time.

Yesterday I exported the project and got a lot of previous URL substitutes (not the latest ones) ...

So now I am not sure that the problem comes from OEE.

In fact, I changed the URL substitutes a lot without redownloading the project; I just downloaded the missing files.

However, Keepprimary is always used.
Akram 04/30/2013 05:04 am
Hi,

One of the things causing the problem is checking "index downloaded links".
When it is unchecked, a lot of (already downloaded) files are no longer queued and OEE crashes less.

Maybe the index didn't finish adding the file, or the index got corrupted at a certain point.

And when I index with "optimize the search index", I don't know when it will end. Is there a message?

Oleg Chernavin 04/30/2013 07:36 am
Yes, this is a weak spot. I will work on it to make it solid.

Thank you!

Oleg.