Sometimes OEE can't download URLs that contain multi-byte characters

Author Message
Mark 03/25/2010 07:27 am
I found that some parsed URLs with multi-byte characters are internally converted to UTF-8-encoded forms, while others are converted to ANSI-encoded forms. Unfortunately, a lot of web servers don't support the ANSI-encoded URL format, so OEE sometimes cannot download the required URLs.

Questions:
1. Is there any way to force OEE to convert such URLs either to UTF-8-encoded forms or to ANSI-encoded forms?

2. OEE converts offline folder and file names that originally contain multi-byte characters to something like "%XX%XX...". Is there any way to automatically convert them back to the corresponding multi-byte characters?
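
To illustrate what question 2 is asking for, here is a minimal Python sketch of the conversion. It is not how OEE works internally; the function name decode_escaped_name and the sample file name are only assumptions for illustration. It tries UTF-8 first and falls back to code page 936 when the bytes are not valid UTF-8.

```python
from urllib.parse import unquote_to_bytes


def decode_escaped_name(name: str, fallback: str = "cp936") -> str:
    """Turn a '%XX%XX...' name back into readable characters.

    Hypothetical helper: UTF-8 is tried first; if the bytes are not
    valid UTF-8, the fallback ANSI code page (cp936 here) is used.
    """
    raw = unquote_to_bytes(name)  # '%E8%87%AA' -> b'\xe8\x87\xaa'
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# The same file name escaped in UTF-8 form and in code page 936 form.
print(decode_escaped_name("%E8%87%AA%E7%84%B6%E7%A7%91%E5%AD%A6%E9%83%A8%E5%88%86.htm"))
print(decode_escaped_name("%D7%D4%C8%BB%BF%C6%D1%A7%B2%BF%B7%D6.htm"))
```

The UTF-8-first order is a design choice of this sketch: bytes produced by an ANSI code page are usually rejected by a strict UTF-8 decoder, so the fallback only runs when it is needed. Short ANSI names can occasionally look like valid UTF-8, so this is a heuristic, not a guarantee.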
Oleg Chernavin 03/25/2010 04:39 pm
Can you please give me a few real examples of such URLs and links? I will see what can be done to improve this.

Thank you!

Best regards,
Oleg Chernavin
MP Staff
Mark 03/25/2010 11:56 pm
The URLs I encountered are on an internal network and can’t be accessed from outside, so I used Google to find two real examples on the Internet. Both of the following tests were made on a WinXP SP3 computer set to Simplified Chinese.

Example 1:
Try to download “http://www.ccut.edu.tw/adminSection/ace/downloads/氣壓乙913-080002.pdf” (requested URL) from the starting page “http://www.ccut.edu.tw/adminSection/front/showContent.asp?m_id=108&site_id=ace” (starting URL, original charset is utf-8).
The requested URL’s UTF-8-encoded and ANSI-encoded (more precisely, encoded with code page 936) forms are “http://www.ccut.edu.tw/adminSection/ace/downloads/%E6%B0%A3%E5%A3%93%E4%B9%99913-080002.pdf” and “http://www.ccut.edu.tw/adminSection/ace/downloads/%9A%E2%89%BA%D2%D2913-080002.pdf”, respectively. It is easy to verify that only the UTF-8 form is supported by the server www.ccut.edu.tw. Fortunately, OEE internally converts the requested URL to its UTF-8 form, so the requested page can be downloaded.

Example 2:
Try to download “http://kyc.tjuci.edu.cn/xxcxx/xkdm/自然科学部分.htm” (requested URL) from the starting page “http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E5%AD%A6%E7%A7%91%E5%88%86%E7%B1%BB%E4%B8%8E%E4%BB%A3%E7%A0%81.htm” (starting URL, original charset is gb2312).
The requested URL’s UTF-8-encoded and ANSI-encoded (more precisely, encoded with code page 936) forms are “http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E8%87%AA%E7%84%B6%E7%A7%91%E5%AD%A6%E9%83%A8%E5%88%86.htm” and “http://kyc.tjuci.edu.cn/xxcxx/xkdm/%D7%D4%C8%BB%BF%C6%D1%A7%B2%BF%B7%D6.htm”, respectively. It is easy to verify that both URL forms are supported by the server kyc.tjuci.edu.cn. So whichever URL form OEE uses internally (tests showed that it uses the ANSI form in practice), the requested URL can always be downloaded.
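
For anyone who wants to reproduce the two encoded forms, here is a small Python sketch. It only shows how the UTF-8 and code page 936 forms of a file name differ; the helper name encoded_forms is my own, and nothing here reflects OEE's actual code.

```python
from urllib.parse import quote


def encoded_forms(name: str) -> dict:
    """Percent-encode one file name in UTF-8 and in code page 936.

    Hypothetical helper for comparing the two URL forms discussed above.
    """
    return {
        "utf-8": quote(name, encoding="utf-8"),
        "cp936": quote(name, encoding="cp936"),  # the "ANSI" form on Simplified Chinese Windows
    }


# File name from Example 2.
forms = encoded_forms("自然科学部分.htm")
print(forms["utf-8"])
print(forms["cp936"])
```

Each character takes three escaped bytes in the UTF-8 form and two in the code page 936 form, which is why a server that decodes request paths with only one of the two encodings cannot find the file when it receives the other form.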

It seems that the charset of the parent page from which the requested URL is parsed decides which encoded form OEE uses internally when the parsed URL contains multi-byte characters. So suppose the starting page in the first example had been saved with the gb2312 (or GB 18030) charset: what would happen when OEE downloads the project described in the first example?
Presumably, OEE would convert the requested URL to its ANSI form, and it could not be downloaded from www.ccut.edu.tw.
Oleg Chernavin 03/30/2010 10:08 am
I am confused! I tested both URLs:

http://www.ccut.edu.tw/adminSection/front/showContent.asp?m_id=108&site_id=ace
and
http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E5%AD%A6%E7%A7%91%E5%88%86%E7%B1%BB%E4%B8%8E%E4%BB%A3%E7%A0%81.htm

They were downloaded correctly with Offline Explorer Pro, and all links were also downloaded and can be easily browsed offline. Offline Explorer makes all links ANSI, so they can be easily browsed on any system, even a non-Unicode one.

Are there perhaps some public examples of links in a non-UTF-8 charset that I could test?

Oleg.
Hank 03/31/2010 09:50 am
Perhaps the easiest way to test this problem is to build a local web site. The necessary files have been uploaded to rapidshare.com. Please download them from "http://rapidshare.com/files/370341167/OEETEST.rar.html" and follow the steps described in readme.txt.
Mark 04/01/2010 12:25 pm
Oh, the archive's password is OEETEST.
Oleg Chernavin 04/01/2010 01:06 pm
Yes, I just started testing it - unpacked and read the readme.txt file.

Oleg.
Oleg Chernavin 04/06/2010 11:32 am
I finished the improvement. The links should be supported now. Here is the updated oe.exe file:

http://www.metaproducts.com/download/betas/OEP3180.ZIP

Oleg.
Mark 04/06/2010 09:23 pm
Firstly, many thanks for your hard work!

I tested the new OEP release six times: three times on Simplified Chinese WinXP SP3 and three times on English WinXP SP3. On each OS platform, one test used the file:// protocol and two tests used the http:// protocol (both OEP’s internal web server and a third-party web server were tested).
The tests show that all the problems remain in the new release, except that URLs with multi-byte characters are now internally converted to their UTF-8 form.

Maybe this problem is not easy to solve, because it depends not only on OE’s internal parsing mechanism but also on the web server’s capabilities. I have found that some web servers support both UTF-8-encoded and ANSI-encoded URLs, while others support only one of the two.

A possible solution: add an optional switch, ConvURL2SpecficForm=
Without the switch, OE works as it does now. With it, OE should internally convert multi-byte URLs to the form specified by the switch.
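
To make the proposal concrete, here is a rough Python sketch of the behaviour I have in mind. It is only a model: the function encode_url and its parameters are hypothetical, and the default branch (follow the parent page's charset) is just what my tests suggested, not confirmed OE behaviour.

```python
from typing import Optional
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str, parent_charset: str,
               forced_form: Optional[str] = None) -> str:
    """Percent-encode the path of a parsed link.

    forced_form models the proposed ConvURL2SpecficForm= switch
    (hypothetical): when it is set, it overrides the parent page's charset.
    """
    charset = forced_form or parent_charset
    parts = urlsplit(url)
    # Keep '/' and '%' so existing escapes are not encoded a second time.
    path = quote(parts.path, safe="/%", encoding=charset)
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


link = "http://kyc.tjuci.edu.cn/xxcxx/xkdm/自然科学部分.htm"
print(encode_url(link, "gb2312"))                       # follows the parent page
print(encode_url(link, "gb2312", forced_form="utf-8"))  # switch forces UTF-8
```

With a switch like this, a project whose starting page is saved as gb2312 could still request UTF-8 URLs from a server that accepts only that form.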
Oleg Chernavin 04/07/2010 04:42 am
It should work correctly with a third-party server. The links will be internally converted to ASCII. My main concern for now is that the links can be downloaded correctly.

I understand the conversion issues, but supporting fully Unicode filenames and links will require a lot of changes in the code. I plan to work on this sometime later.

Oleg.