Question:
1. Is there any way to force OEE to convert such URLs to their UTF-8-encoded or ANSI-encoded forms?
2. OEE converts offline folder and file names that originally contained multi-byte characters to something like "%XX%XX...". Is there any way to automatically convert them back to the corresponding multi-byte characters?
Thank you!
Best regards,
Oleg Chernavin
MP Staff
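For question 2, the "%XX%XX..." names can be turned back into readable characters outside of OEE. The following is only a rough sketch of that idea, not an OEE feature: the function name `decode_escaped_name` and the charset order are assumptions made for illustration.

```python
from urllib.parse import unquote_to_bytes

def decode_escaped_name(name, charsets=("utf-8", "gbk")):
    """Turn a '%XX%XX...' name into readable text, trying each charset in order."""
    raw = unquote_to_bytes(name)
    for charset in charsets:
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this charset, try the next one
    return name  # leave the name untouched if nothing decodes cleanly

# File names taken from the two examples in this thread.
print(decode_escaped_name("%E6%B0%A3%E5%A3%93%E4%B9%99913-080002.pdf"))
print(decode_escaped_name("%D7%D4%C8%BB%BF%C6%D1%A7%B2%BF%B7%D6.htm"))
```

UTF-8 is tried first because arbitrary code page 936 bytes rarely form valid UTF-8; the reverse order would accept almost anything and produce wrong characters.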
Example 1:
Try to download “http://www.ccut.edu.tw/adminSection/ace/downloads/氣壓乙913-080002.pdf” (requested URL) from the starting page “http://www.ccut.edu.tw/adminSection/front/showContent.asp?m_id=108&site_id=ace” (starting URL; original charset is utf-8).
The UTF-8-encoded and ANSI-encoded (more precisely, encoded with code page 936) forms of the requested URL are “http://www.ccut.edu.tw/adminSection/ace/downloads/%E6%B0%A3%E5%A3%93%E4%B9%99913-080002.pdf” and “http://www.ccut.edu.tw/adminSection/ace/downloads/%9A%E2%89%BA%D2%D2913-080002.pdf”, respectively. It is easy to verify that the server www.ccut.edu.tw accepts only the UTF-8 form. Fortunately, OEE internally converts the requested URL to its UTF-8 form, so the requested page can be downloaded.
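To make the two forms concrete, here is a small sketch that starts from the UTF-8 form quoted above, recovers the original characters, and escapes them again with code page 936. It uses only Python's standard `urllib.parse`; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# UTF-8 form of the file name, copied from Example 1.
utf8_form = "%E6%B0%A3%E5%A3%93%E4%B9%99913-080002.pdf"

name = unquote(utf8_form, encoding="utf-8")  # recover the original characters
ansi_form = quote(name, encoding="cp936")    # escape the same name with code page 936

print(quote(name, encoding="utf-8"))  # same name, UTF-8 bytes
print(ansi_form)                      # same name, code page 936 bytes
```

Both strings name the same file; only the byte sequence behind the percent escapes differs, which is why a server that decodes with one charset cannot find a file requested in the other form.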
Example 2:
Try to download “http://kyc.tjuci.edu.cn/xxcxx/xkdm/自然科学部分.htm” (requested URL) from the starting page “http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E5%AD%A6%E7%A7%91%E5%88%86%E7%B1%BB%E4%B8%8E%E4%BB%A3%E7%A0%81.htm” (starting URL; original charset is gb2312).
The UTF-8-encoded and ANSI-encoded (more precisely, encoded with code page 936) forms of the requested URL are “http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E8%87%AA%E7%84%B6%E7%A7%91%E5%AD%A6%E9%83%A8%E5%88%86.htm” and “http://kyc.tjuci.edu.cn/xxcxx/xkdm/%D7%D4%C8%BB%BF%C6%D1%A7%B2%BF%B7%D6.htm”, respectively. It is easy to verify that the server kyc.tjuci.edu.cn accepts both forms. So whichever form OEE uses internally (tests showed that it uses the ANSI form in practice), the requested URL can always be downloaded.
It seems that the charset of the parent page from which the requested URL is parsed determines which encoded form OEE uses internally when the parsed URL contains multi-byte characters. So suppose the starting page in the first example had been saved with the gb2312 (or GB 18030) charset: what would happen when OEE downloads the project described in the first example?
Obviously, OEE would convert the requested URL to its ANSI form, and it could not be downloaded from www.ccut.edu.tw.
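The behaviour guessed at here can be sketched as follows. This is not OEE's code, only an illustration of the hypothesis that the parent page's charset picks the escaped form; `request_url` and the `href` value are assumptions, with the link rebuilt from the UTF-8 form in Example 1.

```python
from urllib.parse import quote, unquote, urljoin

def request_url(page_url, href, page_charset):
    """Resolve href against page_url, then escape it using the page's charset.

    This mirrors the behaviour guessed at in the post; it is not OEE's code.
    """
    absolute = urljoin(page_url, href)
    # Keep URL delimiters and existing escapes; escape only the other characters.
    return quote(absolute, safe=":/?&=%#", encoding=page_charset)

page = "http://www.ccut.edu.tw/adminSection/front/showContent.asp?m_id=108&site_id=ace"
# Hypothetical href, rebuilt from the UTF-8 form quoted in Example 1.
href = "/adminSection/ace/downloads/" + unquote("%E6%B0%A3%E5%A3%93%E4%B9%99913-080002.pdf")

print(request_url(page, href, "utf-8"))  # the form the server accepts, per Example 1
print(request_url(page, href, "cp936"))  # the form the server rejects, per Example 1
```

The sketch uses `cp936` rather than `gb2312` because Python's strict `gb2312` codec cannot encode the traditional characters in this file name, whereas code page 936 can.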
http://www.ccut.edu.tw/adminSection/front/showContent.asp?m_id=108&site_id=ace
and
http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E5%AD%A6%E7%A7%91%E5%88%86%E7%B1%BB%E4%B8%8E%E4%BB%A3%E7%A0%81.htm
They were downloaded correctly with Offline Explorer Pro, and all links were also downloaded and can be easily browsed offline. Offline Explorer makes all links ANSI, so they can be easily browsed on any system, even a non-Unicode one.
Are there any public examples of links in a non-UTF-8 charset that I could test on?
Oleg.
http://www.metaproducts.com/download/betas/OEP3180.ZIP
Oleg.
I tested the new OEP release six times: three times on Simplified Chinese Windows XP SP3 and three times on English Windows XP SP3. On each OS, one test used the file:// protocol and two used the http:// protocol (both OEP’s internal web server and a third-party web server were tested).
The tests show that all the problems remain in the new release, except that URLs with multi-byte characters are now internally converted to their UTF-8 form.
This problem may not be easy to solve, because it depends not only on OE’s internal parsing mechanism but also on the web server’s capabilities. I have found that some web servers support both UTF-8-encoded and ANSI-encoded URLs, while others support only one of the two.
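Because servers differ in which form they accept, one client-side workaround is to prepare every escaped form of a link and try them in turn. The sketch below only builds that list (it makes no network requests); `candidate_urls` and the charset order are assumptions for illustration, not anything OE does today.

```python
from urllib.parse import quote, unquote

def candidate_urls(url, charsets=("utf-8", "cp936")):
    """Return the distinct escaped forms of url, one per charset that can encode it."""
    seen = []
    for charset in charsets:
        try:
            escaped = quote(url, safe=":/?&=%#", encoding=charset)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the characters in the URL
        if escaped not in seen:
            seen.append(escaped)
    return seen

# Requested URL from Example 2, recovered from its UTF-8 form.
url = unquote("http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E8%87%AA%E7%84%B6%E7%A7%91%E5%AD%A6%E9%83%A8%E5%88%86.htm")
for candidate in candidate_urls(url):
    print(candidate)
```

A pure-ASCII URL yields a single candidate, so the extra requests would only be needed for links that actually contain multi-byte characters.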
A possible solution: add an optional switch ConvURL2SpecficForm=
Without the switch, OE works as it does now. With it, OE internally converts multi-byte URLs to the form specified by the switch.
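A minimal sketch of how such a switch might behave is shown below. The switch values "UTF8" and "ANSI", the `FORMS` table, and the `convert_url` function are all hypothetical, since the proposal does not spell them out; "ANSI" is assumed to mean code page 936 as in the examples.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical switch values; the proposal in this thread does not specify them.
FORMS = {"UTF8": "utf-8", "ANSI": "cp936"}

def convert_url(url, switch_value=None, source_charset="utf-8"):
    """Re-escape a URL in the form named by the switch; no switch means no change."""
    if switch_value is None:
        return url  # unchanged behaviour when the switch is absent
    target = FORMS[switch_value]
    # Simplification: this also unescapes reserved characters such as %2F.
    text = unquote_to_bytes(url).decode(source_charset)
    return quote(text, safe=":/?&=#", encoding=target)

utf8_url = "http://kyc.tjuci.edu.cn/xxcxx/xkdm/%E8%87%AA%E7%84%B6%E7%A7%91%E5%AD%A6%E9%83%A8%E5%88%86.htm"
ansi_url = convert_url(utf8_url, "ANSI")
print(ansi_url)
print(convert_url(ansi_url, "UTF8", source_charset="cp936"))
```

The caller has to know which charset the incoming escapes use (`source_charset`), which is the same information the parent page's charset provides during parsing.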
I understand the conversion issues, but making fully Unicode filenames and links will require a lot of changes in the code. I plan to work on this sometime later.
Oleg.