Convert from UTF-8 to Unicode

[Migrated content. Thread originally posted on 03 November 2011]

Hello,

I am trying to convert a Chinese string encoded in UTF-8 to Unicode.

In order to achieve that, I am working with a .NET managed code project. Here is my program:

       program-id. Program1 as "testcodepage.Program1".
       data division.
       working-storage section.
      * 01 greektext      type Byte[] value x"AFACAE9E".
       01 chinesetext    type Byte[] value x"E58F91E7A5A8".
       01 wsEncoding     type Encoding.
       01 codePageValues type Byte[].
       01 unicodeValues  string.
       01 b              type Byte.
       01 unicodeString  string.
       01 enumerator     type System.Globalization.TextElementEnumerator.
       01 s              string.
       01 i              binary-long.
       01 any-key        pic x.
       01 final-string   pic n(2).
       01 ind            pic 9.
       procedure division.
           invoke type System.IO.File::WriteAllBytes("chinese.txt", chinesetext)
           
        *> Specify the code page to correctly interpret byte values
      *     set wsEncoding to type Encoding::GetEncoding(737) *>(DOS) Greek code page
           set wsEncoding to type Encoding::UTF8
           set codePageValues to type System.IO.File::ReadAllBytes("chinese.txt")

        *> Same content is now encoded as UTF-16
           set unicodeValues to wsencoding::GetString(codePageValues)

        *> Show that the text content is still intact in Unicode string
        *> (Add a reference to System.Windows.Forms.dll)
           invoke type System.Windows.Forms.MessageBox::Show(unicodeValues)

        *> Same content is stored as UTF-8 (the default encoding of WriteAllText)
           invoke type System.IO.File::WriteAllText("chinese_unicode.txt", unicodeValues)

        *> Conversion is complete. Show the bytes to prove the conversion.
           display "8-bit encoding byte values:"
           perform varying b thru codePageValues
              invoke type Console::Write("{0:X}-", b)
           end-perform
           display " "   
           display "Unicode values:"
       
           set unicodeString to type System.IO.File::ReadAllText("chinese_unicode.txt")
           set enumerator to type System.Globalization.StringInfo::GetTextElementEnumerator(unicodeString)
           move 1 to ind
           perform until exit
              if enumerator::MoveNext()
                 set s to enumerator::GetTextElement()
                 set i to type Char::ConvertToUtf32(s, 0) 
                 set final-string(ind:1) to i *> How to perform the conversion to PIC(N)????
                 add 1 to ind
                 invoke type Console::Write("{0:X}-", i)
              else
                 exit perform
              end-if
           end-perform

        *> Show the chinese string converted
           invoke type System.Windows.Forms.MessageBox::Show(final-string)               
           display " "
           display "Press any key to exit."
           accept any-key
           goback.
           
       end program Program1.


I hope somebody can help me with this problem.

Thank you
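
For reference, the six bytes in chinesetext can be checked outside COBOL. The following is a minimal sketch in Python, used here only as a neutral way to inspect the bytes; it is not part of the original program. It decodes the UTF-8 bytes into code points and re-encodes them as UTF-16 (little-endian, no BOM), which is the in-memory form of a .NET string.

```python
# The same six bytes as the chinesetext value in the program above.
utf8_bytes = bytes.fromhex("E58F91E7A5A8")

# Decode UTF-8 into a string of Unicode code points.
text = utf8_bytes.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Re-encode as UTF-16 little-endian without a BOM.
utf16_bytes = text.encode("utf-16-le")

print(code_points)                # ['U+53D1', 'U+7968']
print(utf16_bytes.hex().upper())  # D1536879
```

The six UTF-8 bytes are therefore two characters, each of which fits in a single 16-bit code unit.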
  • I have found a (partial) solution:

           program-id. Program1 as "testcodepage.Program1".
           data division.
           working-storage section.
          * 01 greektext      type Byte[] value x"AFACAE9E".
           01 chinesetext    type Byte[] value x"E58F91E7A5A8".
           01 wsEncoding     type Encoding.
           01 codePageValues type Byte[].
           01 unicodeValues  string.
           01 b              type Byte.
           01 unicodeString  string.
           01 enumerator     type System.Globalization.TextElementEnumerator.
           01 s              string.
           01 i              binary-long.
           01 c              type Char.
           01 any-key        pic x.
           01 final-string   pic n(2).
           01 ind            pic 9.
           procedure division.
               invoke type System.IO.File::WriteAllBytes("chinese.txt", chinesetext)
               
            *> Specify the code page to correctly interpret byte values
          *     set wsEncoding to type Encoding::GetEncoding(737) *>(DOS) Greek code page
               set wsEncoding to type Encoding::UTF8
               set codePageValues to type System.IO.File::ReadAllBytes("chinese.txt")

            *> Same content is now encoded as UTF-16
               set unicodeValues to wsencoding::GetString(codePageValues)

            *> Show that the text content is still intact in Unicode string
            *> (Add a reference to System.Windows.Forms.dll)
               invoke type System.Windows.Forms.MessageBox::Show(unicodeValues)

            *> Same content is stored as UTF-8 (the default encoding of WriteAllText)
               invoke type System.IO.File::WriteAllText("chinese_unicode.txt", unicodeValues)

            *> Conversion is complete. Show the bytes to prove the conversion.
               display "8-bit encoding byte values:"
               perform varying b thru codePageValues
                  invoke type Console::Write("{0:X}-", b)
               end-perform
               display " "   
               display "Unicode values:"
           
               set unicodeString to type System.IO.File::ReadAllText("chinese_unicode.txt")
               set enumerator to type System.Globalization.StringInfo::GetTextElementEnumerator(unicodeString)
               move 1 to ind
               perform until exit
                  if enumerator::MoveNext()
                     set s to enumerator::GetTextElement()
                     set i to type Char::ConvertToUtf32(s, 0) 
                     *>set final-string(ind:1) to i as  *> How to perform the conversion to PIC(N)????
                     set c to i
                     set final-string(ind:1) to c
                     add 1 to ind
                     invoke type Console::Write("{0:X}-", i)
                  else
                     exit perform
                  end-if
               end-perform

            *> Show the chinese string converted
               invoke type System.Windows.Forms.MessageBox::Show(final-string)               
               display " "
               display "Press any key to exit."
               accept any-key
               goback.
               
           end program Program1.


    The problem now is displaying these characters. For some reason, MessageBox and TextBox do not display them.

    Any ideas?
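
    One caveat about the "set c to i" step: a .NET Char holds a single 16-bit UTF-16 code unit. Both characters in this test are below U+FFFF, so each fits in one Char, but a code point above U+FFFF needs two code units (a surrogate pair). The sketch below, in Python purely for illustration, counts the code units; utf16_units is a hypothetical helper name, and U+20000 is just an arbitrary example of a supplementary character.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units for one code point as hex strings."""
    data = chr(code_point).encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

bmp = utf16_units(0x53D1)             # first character of the test string
supplementary = utf16_units(0x20000)  # example code point above U+FFFF

print(bmp)            # ['53D1']
print(supplementary)  # ['D840', 'DC00']
```

    In .NET, Char::ConvertFromUtf32 returns a string that covers both cases, so it is probably a safer choice than a direct assignment to a Char if the data may contain such characters.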
  • I have found a Unicode Chinese font, and it almost works. The problem is that I must leave a space between characters. To do so, I changed these lines:
    set final-string(ind:1) to c
    add 1 to ind

    ... to the following:
    set final-string(ind:1) to c
    add 2 to ind

    If I do not include spaces between characters, the textbox only shows the last one...

    The final result is a line with too much space between characters. How would you suggest I solve this problem?

    Regards
  • I think that I just found out what your original problem was.
    You are using PIC N characters but you are not setting the directive:

    $set nsymbol"national"

    The default is $set nsymbol"dbcs", which is not Unicode.

    If you add this directive, the MessageBox displays the Chinese characters correctly.
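
    To illustrate why DBCS storage and national (UTF-16) storage are not interchangeable, here is a small Python sketch. It assumes GBK (Windows code page 936) as the DBCS code page; that is only an example, since the code page actually in effect depends on the system locale. Both encodings use two bytes per character for this string, but the byte values differ, so data written as one and read as the other shows the wrong characters.

```python
# The two characters from the test string, written as code points.
chars = "\u53d1\u7968"

# Assumption: GBK stands in for whichever DBCS code page is active.
dbcs_bytes = chars.encode("gbk")

# UTF-16 little-endian, the form used by .NET strings.
national_bytes = chars.encode("utf-16-le")

print(dbcs_bytes.hex().upper())
print(national_bytes.hex().upper())
```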
  • Verified Answer

    uuuffff... this will be difficult to explain!

    The solution is just to move the UTF-8 string to the TextBox. I did not need to perform any conversion or use any directive.

    As simple as doing this:
                   move datos-869n(1:ind1) to final-string
                   set self::txtChinese::Text to final-string

    The point was that I thought I needed to convert from UTF-8 to Unicode to display the data. (?_?)

    So, my problem is completely solved.

    Thanks anyway for your help.