Using Unicode

Unicode is a character encoding standard that enables text display for most of the world's languages. Support for Unicode characters is built into PowerBuilder. This means that you can display characters from multiple languages on the same page of your application, create a flexible user interface suitable for deployment to different countries, and process data in multiple languages.

About Unicode

Before Unicode was developed, there were many different encoding systems, many of which conflicted with each other. For example, the same number could represent different characters in different encoding systems. Unicode provides a unique number for each character in all supported written languages. For languages that can be written in several scripts, Unicode provides a unique number for each character in each supported script.

For more information about the supported languages and scripts, see the Unicode website at http://www.unicode.org/cldr/charts/latest/supplemental/scripts_and_languages.html.

Encoding forms

There are three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. UTF originally stood for Unicode Transformation Format. The acronym now appears in the names of both the encoding forms, which map each character in the character set to the code units that represent it, and the encoding schemes, which are encoding forms with a specific byte serialization.

  • UTF-8 uses an unsigned byte sequence of one to four bytes to represent each Unicode character.

  • UTF-16 uses one or two unsigned 16-bit code units, depending on the range of the scalar value of the character, to represent each Unicode character.

  • UTF-32 uses a single unsigned 32-bit code unit to represent each Unicode character.

Encoding schemes

An encoding scheme specifies how the bytes in an encoding form are serialized. When you manipulate files, convert blobs and strings, and save DataWindow data in PowerBuilder, you can choose to use ANSI encoding, or one of three Unicode encoding schemes:

  • UTF-8 serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.

  • UTF-16BE serializes a UTF-16 code unit sequence as a byte sequence in big-endian format.

  • UTF-16LE serializes a UTF-16 code unit sequence as a byte sequence in little-endian format.

UTF-8 is frequently used in Web requests and responses. In big-endian format, the most significant byte of each code unit is stored at the lowest storage address; this format is typically used on UNIX systems. In little-endian format, the least significant byte is stored first; this format is used on Windows.
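
As a rough illustration of how the encoding schemes differ, the following sketch converts one string to a blob in each encoding with the Blob function and reports each blob's length in bytes with Len. The variable names are hypothetical, and the byte counts depend on the text and on the ANSI code page in use, so no particular results are asserted here.

```powerscript
// Sketch: compare the size of the same text in different encodings.
// Variable names are hypothetical; results depend on the text and code page.
String ls_text
Blob lblb_ansi, lblb_utf8, lblb_utf16le, lblb_utf16be

ls_text = "Unicode"

// Convert the string to each target encoding
lblb_ansi    = Blob(ls_text, EncodingANSI!)
lblb_utf8    = Blob(ls_text, EncodingUTF8!)
lblb_utf16le = Blob(ls_text, EncodingUTF16LE!)
lblb_utf16be = Blob(ls_text, EncodingUTF16BE!)

// Len applied to a blob reports its length in bytes
MessageBox("Encoded sizes", &
   "ANSI: " + String(Len(lblb_ansi)) + "~r~n" + &
   "UTF-8: " + String(Len(lblb_utf8)) + "~r~n" + &
   "UTF-16LE: " + String(Len(lblb_utf16le)) + "~r~n" + &
   "UTF-16BE: " + String(Len(lblb_utf16be)))
```

The two UTF-16 blobs should have the same length and differ only in the order of the bytes within each 16-bit code unit.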

Unicode support in PowerBuilder

PowerBuilder uses UTF-16LE encoding internally. The source code in PBLs is encoded in UTF-16LE, any text entered in an application is automatically converted to Unicode, and the string and character PowerScript datatypes hold Unicode data only. Any ANSI or DBCS characters assigned to these datatypes are converted internally to Unicode encoding.

Support for Unicode databases

Most PowerBuilder database interfaces support both ANSI and Unicode databases.

A Unicode database is a database whose character set is set to a Unicode format, such as UTF-8 or UTF-16. All data in the database is in Unicode format, and any data saved to the database must be converted to Unicode data implicitly or explicitly.

A database that uses ANSI (or DBCS) as its character set can use special datatypes to store Unicode data. These datatypes are NChar, NVarChar, and NVarChar2. Columns with one of these datatypes can store Unicode data, but data saved to such a column must be converted to Unicode explicitly.

For more specific information about each interface, see Connecting to Your Database.

String functions

PowerBuilder string functions, such as Fill, Len, Mid, and Pos, work with characters rather than bytes in their parameters and return values, and they return the same results in all environments. Each of these functions has a "wide" version (such as FillW) that produces the same results as the standard version; the wide versions are obsolete and will be removed in a future version of PowerBuilder. Some of these functions also have an ANSI version (such as FillA). The ANSI versions are provided for backward compatibility for users in DBCS environments who relied on the standard string functions in previous versions of PowerBuilder to work with bytes instead of characters.
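
The difference between the standard and ANSI versions can be sketched with Len and LenA. This is a minimal example that assumes LenA is among the functions with an ANSI version; the variable names are hypothetical.

```powerscript
// Sketch: character count versus byte count.
// ls_name is a hypothetical variable; assign any text, including DBCS text.
String ls_name
Long ll_chars, ll_bytes

ls_name = "PowerBuilder"

ll_chars = Len(ls_name)    // length in characters
ll_bytes = LenA(ls_name)   // length in bytes, for code written for earlier versions
```

For text that contains only single-byte characters the two values should match; they are expected to differ only when the text contains DBCS characters.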

You can use the GetEnvironment function to determine the character set used in the environment:

environment env
getenvironment(env)

choose case env.charset
case charsetdbcs!
   // DBCS processing
   ...
case charsetunicode!
   // Unicode processing
   ...
case charsetansi!
   // ANSI processing
   ...
case else
   // Other processing
   ...
end choose

Encoding enumeration

Several functions, including Blob, BlobEdit, FileEncoding, FileOpen, SaveAs, and String, have an optional encoding parameter. These functions let you work with blobs and files with ANSI, UTF-8, UTF-16LE, and UTF-16BE encoding. If you do not specify this parameter, the default encoding used for SaveAs and FileOpen is ANSI. For other functions, the default is UTF-16LE.
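
The following sketch shows the encoding parameter on Blob and String in a simple round trip. It assumes only the two functions and the enumeration values named above; the variable names are hypothetical.

```powerscript
// Sketch: encode a string as UTF-8, then decode it again.
String ls_original, ls_decoded
Blob lblb_utf8

ls_original = "Employee list"

// Specify the encoding explicitly instead of relying on the UTF-16LE default
lblb_utf8 = Blob(ls_original, EncodingUTF8!)

// Decode with the same encoding that produced the blob
ls_decoded = String(lblb_utf8, EncodingUTF8!)

IF ls_decoded <> ls_original THEN
   MessageBox("Encoding", "The round trip changed the text")
END IF
```

Decoding with a different encoding from the one used to create the blob is a common source of garbled text, which is why the sketch passes the same value to both calls.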

The following examples illustrate how to open different kinds of files using FileOpen:

// Read an ANSI File
Integer li_FileNum
String s_rec
li_FileNum = FileOpen("Employee.txt")
// or:
// li_FileNum = FileOpen("Employee.txt", &
//    LineMode!, Read!)
FileRead(li_FileNum, s_rec)

// Read a Unicode File
Integer li_FileNum
String s_rec
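// Encoding is the sixth argument, so the file lock and
// write mode arguments are supplied as well
li_FileNum = FileOpen("EmployeeU.txt", LineMode!, &
   Read!, LockReadWrite!, Append!, EncodingUTF16LE!)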
FileRead(li_FileNum, s_rec)

// Read a Binary File
Integer li_FileNum
blob bal_rec
li_FileNum = FileOpen("Employee.imp", StreamMode!, &
   Read!)
FileRead(li_FileNum, bal_rec)

Initialization files

The SetProfileString function can write to initialization files with ANSI or UTF-16LE encoding on Windows systems, and ANSI or UTF-16BE encoding on UNIX systems. The ProfileInt and ProfileString PowerScript functions and DataWindow expression functions can read files with these encoding schemes.
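
A minimal sketch of writing a value and reading it back follows. The file name, section, and key are hypothetical, and the sketch assumes the initialization file already exists and is writable.

```powerscript
// Sketch: write a value to an initialization file and read it back.
// The file name, section, and key are hypothetical.
String ls_ini, ls_greeting
Integer li_rc

ls_ini = "myapp.ini"

// SetProfileString returns 1 when the value is written
li_rc = SetProfileString(ls_ini, "Display", "Greeting", "Welcome")

IF li_rc = 1 THEN
   // The last argument is the default returned when the key is not found
   ls_greeting = ProfileString(ls_ini, "Display", "Greeting", "")
END IF
```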

Exporting and importing source

The Export Library Entry dialog box lets you select the type of encoding for an exported file. The choices are ANSI/DBCS, which lets you import the file into PowerBuilder 9 or earlier, HEXASCII, UTF8, or Unicode LE.

The HEXASCII export format is used for source-controlled files. Unicode strings are represented by hexadecimal/ASCII strings in the exported file, which has the letters HA at the beginning of the header to identify it as a file that might contain such strings. You cannot import HEXASCII files into PowerBuilder 9 or earlier.

If you import an exported file from PowerBuilder 9 or earlier, the source code in the file is converted to Unicode before the object is added to the PBL.

External functions

When you call an external function that returns an ANSI string or has an ANSI string argument, you must use an ALIAS clause in the external function declaration and add ;ansi to the function name. For example:

FUNCTION int MessageBox(int handle, string content, string title, int showtype) &
LIBRARY "user32.dll" ALIAS FOR "MessageBoxA;ansi"

The following declaration is for the "wide" version of the function, which uses Unicode strings:

FUNCTION int MessageBox(int handle, string content, string title, int showtype) &
LIBRARY "user32.dll" ALIAS FOR "MessageBoxW"

If you are upgrading an application from PowerBuilder 9 or earlier, PowerBuilder replaces function declarations that use ANSI strings with the correct syntax automatically.

Setting fonts for multiple language support

The default font in the System Options and Design Options dialog boxes is Tahoma.

Setting the font in the System Options dialog box to Tahoma ensures that multiple languages display correctly in the Layout and Properties views in the Window, User Object, and Menu painters and in the wizards.

If the font on the Editor Font page in the Design Options dialog box is not set to Tahoma, multiple languages cannot be displayed in Script editors, the File and Source editors, the ISQL view in the Database painter, and the Debug window.

You can select a different font for printing on the Printer Font tab page of the Design Options dialog box for Script editors, the File and Source editors, and the ISQL view in the Database painter. If the printer font is set to Tahoma and the Tahoma font is not installed on the printer, PowerBuilder downloads the entire font set to the printer when it encounters a multilanguage character. If you need to print multilanguage characters, specify a printer font that is installed on your printer.

To support multiple languages in DataWindow objects, set the font in every column and text control to Tahoma.

The default font for print functions is the system font. Use the PrintDefineFont and PrintSetFont functions to specify a font that is available on users' printers and supports multiple languages.
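
The following sketch shows one way to define and select a font for a print job. The font number and the face name "Arial Unicode MS" are illustrative assumptions; substitute a face that is installed on your users' printers and covers the languages you need.

```powerscript
// Sketch: define and select a font for a print job.
// Font number 1 and the face name are hypothetical choices.
Long ll_job

ll_job = PrintOpen()

IF ll_job <> -1 THEN
   // Define font 1 as 12 point (a negative height is a point size),
   // normal weight (400), not italic, not underlined
   PrintDefineFont(ll_job, 1, "Arial Unicode MS", -12, 400, &
      Default!, AnyFont!, FALSE, FALSE)

   // Make font 1 the current font for this job
   PrintSetFont(ll_job, 1)

   Print(ll_job, "Multilanguage text")
   PrintClose(ll_job)
END IF
```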

PBNI

The PowerBuilder Native Interface is Unicode based. PBNI extensions must be compiled with the _UNICODE preprocessor symbol defined in your C++ development environment.

Your extension's code must use TCHAR, LPTSTR, or LPCTSTR instead of char, char*, and const char* to ensure that it works correctly in a Unicode environment. Alternatively, you can use the MultiByteToWideChar function to map character strings to Unicode strings. For more information about enabling Unicode in your application, see the documentation for your C++ development environment.

Unicode enabling for Web services

In a PowerScript target, the PBNI extension classes instantiated by Web service client applications use Unicode for all internal processing. However, calls to component methods are converted to ANSI for processing by EasySoap (discontinued), and data returned from these calls is converted to Unicode.

XML string encoding

The XML parser cannot parse a string that uses an eight-bit character code such as windows-1253. For example, a string with the following declaration cannot be parsed:

string ls_xml
ls_xml += '<?xml version="1.0" encoding="windows-1253"?>'

You must use a Unicode encoding value such as UTF-16LE.