Absent Member.

Convert from UTF-8 to Unicode

[Migrated content. Thread originally posted on 03 November 2011]

Hello,

I am trying to convert a Chinese string encoded in UTF-8 to Unicode.

In order to achieve that, I am working with a .NET managed code project. Here is my program:

       program-id. Program1 as "testcodepage.Program1".
       data division.
       working-storage section.
      * 01 greektext      type Byte[] value x"AFACAE9E".
       01 chinesetext    type Byte[] value x"E58F91E7A5A8".
       01 wsEncoding     type Encoding.
       01 codePageValues type Byte[].
       01 unicodeValues  string.
       01 b              type Byte.
       01 unicodeString  string.
       01 enumerator     type System.Globalization.TextElementEnumerator.
       01 s              string.
       01 i              binary-long.
       01 any-key        pic x.
       01 final-string   pic n(2).
       01 ind            pic 9.
       procedure division.
           invoke type System.IO.File::WriteAllBytes("chinese.txt", chinesetext)
           
        *> Specify the code page to correctly interpret byte values
      *     set wsEncoding to type Encoding::GetEncoding(737) *>(DOS) Greek code page
           set wsEncoding to type Encoding::UTF8
           set codePageValues to type System.IO.File::ReadAllBytes("chinese.txt")

        *> Same content is now encoded as UTF-16
           set unicodeValues to wsencoding::GetString(codePageValues)

        *> Show that the text content is still intact in Unicode string
        *> (Add a reference to System.Windows.Forms.dll)
           invoke type System.Windows.Forms.MessageBox::Show(unicodeValues)

        *> Same content is stored as UTF-8 (the WriteAllText default)
           invoke type System.IO.File::WriteAllText("chinese_unicode.txt", unicodeValues)

        *> Conversion is complete. Show the bytes to prove the conversion.
           display "8-bit encoding byte values:"
           perform varying b thru codePageValues
              invoke type Console::Write("{0:X}-", b)
           end-perform
           display " "   
           display "Unicode values:"
       
           set unicodeString to type System.IO.File::ReadAllText("chinese_unicode.txt")
           set enumerator to type System.Globalization.StringInfo::GetTextElementEnumerator(unicodeString)
           move 1 to ind
           perform until exit
              if enumerator::MoveNext()
                 set s to enumerator::GetTextElement()
                 set i to type Char::ConvertToUtf32(s, 0) 
                 set final-string(ind:1) to i *> How to perform the conversion to PIC(N)????
                 add 1 to ind
                 invoke type Console::Write("{0:X}-", i)
              else
                 exit perform
              end-if
           end-perform

        *> Show the chinese string converted
           invoke type System.Windows.Forms.MessageBox::Show(final-string)               
           display " "
           display "Press any key to exit."
           accept any-key
           goback.
           
       end program Program1.

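For reference, the conversion the program above attempts can be reproduced in a few lines. The following is a minimal sketch in Python (used here only for illustration; it is not part of the original COBOL project) that decodes the same six bytes as UTF-8 and lists the resulting code points and UTF-16 code units. All variable names are made up for this example.

```python
# The same six bytes as chinesetext in the COBOL program.
utf8_bytes = bytes.fromhex("E58F91E7A5A8")

# Decoding the bytes is the whole "UTF-8 to Unicode" step.
text = utf8_bytes.decode("utf-8")

# Unicode code points, in hexadecimal.
code_points = [f"{ord(ch):X}" for ch in text]

# UTF-16 code units (two bytes each, little-endian), in hexadecimal.
utf16_bytes = text.encode("utf-16-le")
utf16_units = [
    f"{int.from_bytes(utf16_bytes[k:k + 2], 'little'):04X}"
    for k in range(0, len(utf16_bytes), 2)
]

print(text, code_points, utf16_units)
```

Both characters lie in the Basic Multilingual Plane, so each one fits in a single UTF-16 code unit, which is the form a .NET string stores internally.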

I hope somebody can help me with this problem.

Thank you

Accepted Solutions
Absent Member.

RE: Convert from UTF-8 to Unicode

Phew... this is going to be difficult to explain!

The solution is simply to move the UTF-8 string to the TextBox. I did not need to perform any conversion or use any directive.

It is as simple as this:
               move datos-869n(1:ind1) to final-string
               set self::txtChinese::Text to final-string

The point is that I thought I needed to convert from UTF-8 to Unicode in order to display the data. (?_?)

So my problem is completely solved.

Thanks anyway for your help.

4 Replies
Absent Member.

RE: Convert from UTF-8 to Unicode

I have found a (partial) solution:

       program-id. Program1 as "testcodepage.Program1".
       data division.
       working-storage section.
      * 01 greektext      type Byte[] value x"AFACAE9E".
       01 chinesetext    type Byte[] value x"E58F91E7A5A8".
       01 wsEncoding     type Encoding.
       01 codePageValues type Byte[].
       01 unicodeValues  string.
       01 b              type Byte.
       01 unicodeString  string.
       01 enumerator     type System.Globalization.TextElementEnumerator.
       01 s              string.
       01 i              binary-long.
       01 c              type Char.
       01 any-key        pic x.
       01 final-string   pic n(2).
       01 ind            pic 9.
       procedure division.
           invoke type System.IO.File::WriteAllBytes("chinese.txt", chinesetext)
           
        *> Specify the code page to correctly interpret byte values
      *     set wsEncoding to type Encoding::GetEncoding(737) *>(DOS) Greek code page
           set wsEncoding to type Encoding::UTF8
           set codePageValues to type System.IO.File::ReadAllBytes("chinese.txt")

        *> Same content is now encoded as UTF-16
           set unicodeValues to wsencoding::GetString(codePageValues)

        *> Show that the text content is still intact in Unicode string
        *> (Add a reference to System.Windows.Forms.dll)
           invoke type System.Windows.Forms.MessageBox::Show(unicodeValues)

        *> Same content is stored as UTF-8 (the WriteAllText default)
           invoke type System.IO.File::WriteAllText("chinese_unicode.txt", unicodeValues)

        *> Conversion is complete. Show the bytes to prove the conversion.
           display "8-bit encoding byte values:"
           perform varying b thru codePageValues
              invoke type Console::Write("{0:X}-", b)
           end-perform
           display " "   
           display "Unicode values:"
       
           set unicodeString to type System.IO.File::ReadAllText("chinese_unicode.txt")
           set enumerator to type System.Globalization.StringInfo::GetTextElementEnumerator(unicodeString)
           move 1 to ind
           perform until exit
              if enumerator::MoveNext()
                 set s to enumerator::GetTextElement()
                 set i to type Char::ConvertToUtf32(s, 0) 
                 *>set final-string(ind:1) to i as  *> How to perform the conversion to PIC(N)????
                 set c to i
                 set final-string(ind:1) to c
                 add 1 to ind
                 invoke type Console::Write("{0:X}-", i)
              else
                 exit perform
              end-if
           end-perform

        *> Show the chinese string converted
           invoke type System.Windows.Forms.MessageBox::Show(final-string)               
           display " "
           display "Press any key to exit."
           accept any-key
           goback.
           
       end program Program1.


The problem now is displaying these characters. For some reason, neither MessageBox nor TextBox displays them.

Any ideas?
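
One caveat about assigning the result of Char::ConvertToUtf32 to a single Char: that only works for code points up to U+FFFF. Higher code points need two UTF-16 code units (a surrogate pair). The sketch below, in Python for illustration only, shows the arithmetic; utf16_units is a hypothetical helper written for this note, not a .NET or COBOL API.

```python
def utf16_units(code_point):
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit is enough.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# The two characters from this thread each fit in one unit.
print([hex(u) for u in utf16_units(0x53D1)])
# A code point above U+FFFF needs two units.
print([hex(u) for u in utf16_units(0x1F600)])
```

For the two characters in this thread (U+53D1 and U+7968) a single Char is enough, so the approach in the program works for them.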
Absent Member.

RE: Convert from UTF-8 to Unicode

I have found a Unicode Chinese font, and it almost works. The problem is that I have to leave a space between characters. To do so, I changed these lines:
set final-string(ind:1) to c
add 1 to ind

... to the following:
set final-string(ind:1) to c
add 2 to ind

If I do not include spaces between characters, the TextBox shows only the last one...

The final result is a line with too much space between characters. How would you suggest solving this problem?

Regards
Micro Focus Expert

RE: Convert from UTF-8 to Unicode

I think I just found out what your original problem was.
You are using PIC N characters, but you are not setting the directive:

$set nsymbol"national"

The default for nsymbol is $set nsymbol"dbcs", which is not Unicode.

If you add this directive, the MessageBox displays the Chinese characters correctly.
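
To illustrate why the interpretation of a two-byte character field matters, here is a rough sketch in Python (for illustration only). It is not a model of how the COBOL runtime handles PIC N under either directive setting. It only shows that the same bytes read back as different text depending on whether they are treated as UTF-16 or as a double-byte code page. GBK is picked here as an arbitrary example of such a code page, which is an assumption of this sketch.

```python
# The two characters from the thread, written as escapes.
text = "\u53d1\u7968"

# Their UTF-16 (little-endian) byte representation: D1 53 68 79.
raw = text.encode("utf-16-le")

# Read as UTF-16, the bytes give back the original text.
as_utf16 = raw.decode("utf-16-le")

# Read as a double-byte code page (GBK chosen as an example), they do not.
as_gbk = raw.decode("gbk", errors="replace")

print(as_utf16 == text)
print(as_gbk == text)
```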