FTP folder list - parsing error?

Bubba 08/05/2009 02:31 am
Hi there.

While trying to download ftp://ftp.worldofspectrum.org (the WHOLE site), I am getting TONS of incorrectly parsed URLs in the queue (which, of course, take up time and traffic and generate "NOT FOUND" errors).

Example: the root of the FTP site contains:
ftp://ftp.worldofspectrum.org/AUP/
ftp://ftp.worldofspectrum.org/bin/
ftp://ftp.worldofspectrum.org/etc/
ftp://ftp.worldofspectrum.org/lib/
ftp://ftp.worldofspectrum.org/pub/
ftp://ftp.worldofspectrum.org/welcome.msg

But while downloading somewhere deep in the folders, OE adds URLs like these to the queue:
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/AUP
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/bin
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/etc
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/lib
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/pub
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/pub/
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/pub/sinclair/
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/pub/sinclair/incoming
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/usr
ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/welcome.msg

(they are incorrect and unavailable on the server of course, and the correct one MUST be a single file ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg - which is NOT downloaded!)


Could someone please help me with this problem?
WinXP Pro SP3, OE 5.4 Trial, the project is:
----------
ftp://ftp.worldofspectrum.org/
SkipParsingFiles=*.pdf,*.zip,*.rar
Additional=DepthFirst
----------

Thanks,
Bubba
Oleg Chernavin 08/05/2009 09:45 am
I was unable to reproduce this problem. Can you please repeat the download and, once such incorrect URLs appear in the Queue, right-click a few of them, copy their referrers, and paste them into a forum message? I will try to figure out what is wrong.

Thank you!

Best regards,
Oleg Chernavin
MP Staff
Bubba 08/05/2009 11:01 pm
> I was unable to reproduce this problem. Can you please repeat the download and, once such incorrect URLs appear in the Queue, right-click a few of them, copy their referrers, and paste them into a forum message?

Sure. Right below.

Forgot to tell you: the problem does not happen with every FTP file, only with some. Since the number of files on that FTP is large, we end up with TONS of wrong URLs (but once again, not for every FTP file). It may take you 5-15 minutes of downloading to catch the problem once or twice. For us, it happens often.

Thanks,
Bubba
------------------------------------------------
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/AUP
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/bin
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/etc
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/lib
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/pub
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/pub/
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/pub/sinclair/
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/pub/sinclair/incoming
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/usr
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg/welcome.msg
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/AUP
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/bin
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/etc
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/lib
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/pub
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/pub/
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/pub/sinclair/
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/pub/sinclair/incoming
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/usr
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png/welcome.msg
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/AUP
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/bin
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/etc
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/lib
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/pub
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/pub/
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/pub/sinclair/
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/pub/sinclair/incoming
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/usr
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg/welcome.msg
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine08_
Oleg Chernavin 08/06/2009 06:42 am
I was unable to see the error. Let's try this - can you please create a new Project with 3 URLs:

ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine05_Front.jpg
ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png
ftp://ftp.worldofspectrum.org/pub/sinclair/magazines/16-48Magazine/16-48Magazine07_Back.jpg

Level=1

Then enable logging (Ctrl+Q, press the first button on the Log toolbar, and allow everything in the Filters button menu) and start the download. If the error reproduces, copy the whole log and paste it into a forum message. Thank you!

Oleg.
Oleg Chernavin 08/10/2009 05:39 am
I think it could be a problem with PDF parsing. Can you try removing all extensions except PDF from this list and see if the download works well? PDFs really are parsed, because they may contain links.

Images are not parsed at all. The only exception is when a Web server returns an HTML page instead of an image. But this is an FTP server, so it should not be so.

Oleg.
Bubba 08/10/2009 10:53 pm
Dear Oleg,

SkipParsingFiles=*.pdf has been in the project for a long time (see the 1st post), and yes, there were problems with those too: some PDF files took infinitely long to parse, with 100% CPU usage. Adding SkipParsingFiles=*.pdf bypassed the problem at that time.

But this case is different. Could it happen because I am behind a proxy (an HTTP one)? I can't find any trace of the problem, and there is nothing strange in the logs... It seems that OE occasionally tries to parse some (random?) binaries in its queue, and the result is the same every time: a copy of the FTP root listing is appended to the file name, which is treated as a folder (1.jpg -> 1.jpg/bin/).

I don't know what more I can do, except manually add all the binary file types to the "SkipParsingFiles=" line. Luckily, there are not too many of them on that FTP.
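
A minimal sketch of how that line could be generated rather than typed by hand, assuming a plain list of file URLs from the site is available. The function name skip_parsing_line is hypothetical, the output format simply mirrors the SkipParsingFiles=*.pdf,*.zip,*.rar line from the first post, and nothing in this thread confirms that the setting cures the problem.

```python
import posixpath
from urllib.parse import urlsplit


def skip_parsing_line(urls):
    """Build a SkipParsingFiles= line from the extensions seen in `urls`.

    Hypothetical helper: the comma-separated *.ext format is copied from
    the project settings quoted in the first post.
    """
    extensions = set()
    for url in urls:
        # Last path segment; empty for folder URLs that end with "/".
        name = posixpath.basename(urlsplit(url).path)
        ext = posixpath.splitext(name)[1].lower()
        if ext:
            extensions.add(ext)
    return "SkipParsingFiles=" + ",".join("*" + e for e in sorted(extensions))


sample = [
    "ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg",
    "ftp://ftp.worldofspectrum.org/pub/sinclair/games-maps/t/Terminus-ThePrisonPlanet_Part2.png",
    "ftp://ftp.worldofspectrum.org/pub/",
]
print(skip_parsing_line(sample))
```

Text files that may legitimately contain links would have to be removed from the result by hand.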

Thanks.
Oleg Chernavin 08/11/2009 07:09 am
An HTTP proxy explains that. It returns some kind of HTML file instead of the images, etc. Can you please check the files that correspond to the referrers above? Are they HTML files on the disk?
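
One way to run that check without opening every file by hand is to look at the first bytes of each saved file. This is only a sketch: the function name sniff is hypothetical, and it assumes that an HTML page returned by a proxy starts with a doctype or an <html> tag, which is common but not guaranteed.

```python
def sniff(data: bytes) -> str:
    """Guess what a saved file really contains from its leading bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"  # JPEG files start with the bytes FF D8 FF
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"  # fixed 8-byte PNG signature
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"  # assumption: proxy pages look like ordinary HTML
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the start of a file on disk and classify it."""
    with open(path, "rb") as handle:
        return sniff(handle.read(1024))
```

A file named *.jpg for which sniff_file() returns "html" would support the proxy explanation.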

Oleg.
Bubba 08/12/2009 11:43 pm
> Can you please check the files that correspond to the referrers above? Are they HTML files on the disk?

No, I can't. The files are NOT on the disk at all. Instead of them, I see FOLDERS with the same names, like a FOLDER "SinclairUser02600066.jpg" filled with a copy of the FTP's ROOT structure: SinclairUser02600066.jpg/BIN, SinclairUser02600066.jpg/PUB etc. :(

Those URLs are not saved as files precisely because OE somehow treats them as folders. We can see that from OE's queue... :( The fix is simple but tedious: manually delete the "errored" URLs from the queue and restart the project with "Download missed files only". This helps with those missed URLs, but it also brings in some fresh ones, as the problem is still unpredictable.
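
The "errored" URLs share one recognizable shape: a path segment that looks like a file name is followed by more path. A minimal sketch for spotting them, assuming the queue can be copied out as a plain list of URLs (this thread does not say whether OE offers such an export). The names FILE_EXTENSIONS and recover_file_url are hypothetical, and the extension set is only a guess based on the files mentioned here.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: extensions that mark a path segment as a file, not a folder.
FILE_EXTENSIONS = {".jpg", ".png", ".pdf", ".zip", ".rar"}


def recover_file_url(url):
    """Return the real file URL hidden inside a bogus 'file as folder' URL.

    Returns None when no file-like segment is followed by further path,
    i.e. when the URL looks like a normal file or folder URL.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for index, segment in enumerate(segments[:-1]):
        if posixpath.splitext(segment)[1].lower() in FILE_EXTENSIONS:
            path = "/".join(segments[: index + 1])
            return f"{parts.scheme}://{parts.netloc}{path}"
    return None


queue = [
    "ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg/AUP",
    "ftp://ftp.worldofspectrum.org/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg",
    "ftp://ftp.worldofspectrum.org/pub/",
]
bogus = [u for u in queue if recover_file_url(u) is not None]
```

Here bogus holds only the first entry, and recover_file_url() on it gives back the single-file URL that should have been downloaded.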

Thanks,
Bubba
Oleg Chernavin 08/13/2009 06:14 am
I am quite sure it is related to the proxy. It may incorrectly pass the request for a file and get a directory listing (the root one, as I understand) from the FTP server. Is there any way to download directly, with no proxy?
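
If a direct connection is possible, a short script can show what the server itself says about one of the affected paths, independent of OE and of the proxy. This is a sketch using Python's standard ftplib, assuming anonymous login is allowed and the server supports the SIZE command. The function names classify_path and check_direct are hypothetical.

```python
from ftplib import FTP, error_perm


def classify_path(ftp, path):
    """Return 'directory', 'file', or 'missing' for `path` on an open session."""
    start = ftp.pwd()
    try:
        ftp.cwd(path)  # succeeds only for directories
        ftp.cwd(start)  # go back to where we were
        return "directory"
    except error_perm:
        pass
    try:
        ftp.voidcmd("TYPE I")  # many servers answer SIZE only in binary mode
        ftp.size(path)
        return "file"
    except error_perm:
        return "missing"


def check_direct(host, path):
    """Connect straight to `host` (no proxy) and classify `path`."""
    with FTP(host, timeout=30) as ftp:
        ftp.login()  # anonymous login, an assumption about the server
        return classify_path(ftp, path)
```

For example, check_direct("ftp.worldofspectrum.org", "/pub/sinclair/games-adverts/g/Ghostbusters2_6.jpg") reporting "file" over a direct connection would point at the proxy rather than the server.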

Oleg.