|
Unicode has been part of the mix since SQL Server 7.0 .It was introduced
primarily to deal with storing data using a mix of international character
standards.
As a consequence the same bit pattern is always converted to the same
character on all computers.Recently I took over the management of a SQL
Server db and one of the checks required analysing the data for damages and
improper storage of data.
One of the interesting things I discovered was
that mixing different Unicode multiple encoding methods can cause problems.
SQL Server uses UCS-2 or UTF-16 Unicode format, which reflects the 32-bit
Windows kernel method of storing inforrmation. The main purpose is so that
data doesn't have to converted back and forth which leads to enhanced
performance.In contrast , web based applications use the UTF-8 format to
store Unicode data. One of the UTF-8 features is that stores all 7-bit ASCII
characters as seen , then uses control characters to store the remainining
Unicode set.
This gets interesting when UTF-16 data is retrieved and misinterpreted as
UTF-8. The consequnce is usually mangled data. If this data is reentered ,
the most probable outcome is ruined data
Instructions to avoid this problem:
1.Pages that contain data retrieved from the server should set the codepage
to 65001 . This will automatically convert any UCS-2/UTF-16 data to UTF-8
format when the page is rendered and sent to the client.
2. The web server automatically converts UTF-8 data to UCS-2 (into SQL
Server) , if you set the codepage to 65001 any data sent from a Web page
It's worth remembering that a database won't explicitly support Unicode data
unless the field types in question are also explicitly supported. If you use
text instead of ntext, for instance, any Unicode entities sent to the
database will instead be translated into ISO-8859-1 encoding, which should
not be intermixed with true Unicode entities to avoid data damage.
Hopefully this helps in either troubleshooting or planning a web application
interacting with a SQL Server database.
|