Java Unicode System

The Unicode system in Java is designed to handle characters and symbols from virtually any language or script in the world. It was introduced to overcome the limitations of older character encoding systems like ASCII, which could represent only a limited set of characters (mainly English). With the rise of global applications, Java needed a way to represent and process text in many different languages. That's where Unicode comes in.

Key Points about Unicode in Java:

Universal Character Set: Unicode is a universal character set that includes characters from almost all writing systems in the world (e.g., English, Arabic, Chinese, and Cyrillic scripts), as well as special characters like emojis and mathematical symbols.
16-bit char Type: Java's char type is 16 bits wide and holds a single UTF-16 code unit. One char can take 65,536 distinct values (from '\u0000' to '\uFFFF'), which is enough for every character in the Basic Multilingual Plane (BMP). Characters above U+FFFF do not fit in a single char and need two char values.
Compatibility with ASCII: The first 128 characters of Unicode correspond to the ASCII character set, making Unicode backward-compatible with older systems.
UTF-16 Encoding: Java's char and String APIs expose text as UTF-16 (a Unicode Transformation Format). Each character (code point) is represented as one or two 16-bit code units. Common letters and digits need only one unit, while characters above U+FFFF, such as most emojis and some rare symbols, need two.
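
The difference between code units and characters can be seen in a minimal sketch like the one below. The class name CodeUnitDemo and its two helper methods are illustrative only, not part of any library. The emoji is written as a pair of escapes so that the source file stays ASCII-only.

```java
public class CodeUnitDemo {
    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String letter = "A";            // U+0041, fits in one char
        String emoji = "\uD83D\uDE00";  // U+1F600, needs two chars

        System.out.println(codeUnits(letter) + " " + codePoints(letter)); // 1 1
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji));   // 2 1
    }
}
```

Note that String.length() counts code units, so it reports 2 for a single emoji. Code that needs the number of characters should count code points instead.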

Unicode Representation in Java

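In Java, Unicode characters can be written in source code using escape sequences that start with \u, followed by exactly four hexadecimal digits. The compiler translates each escape into the corresponding 16-bit code unit before it parses the rest of the program.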

Example of Unicode Representation:

char letter = '\u0041';  // Unicode for 'A'
System.out.println(letter); // Output: A

More Examples:
- The letter 'A' is represented by \u0041.
- The letter 'a' is represented by \u0061.
- The symbol '₹' (Indian Rupee) is represented by \u20B9.
- The symbol '€' (Euro) is represented by \u20AC.
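
Going the other way, from a character to its escape sequence, is a matter of formatting the char's numeric value as four hexadecimal digits. The following is a small sketch; the class CodePointLookup and the method toEscape are hypothetical names chosen for illustration, and the method only handles single char values (BMP characters).

```java
public class CodePointLookup {
    // Formats one char as a backslash-u escape with four uppercase hex digits.
    static String toEscape(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        System.out.println(toEscape('A'));       // escape for U+0041
        System.out.println(toEscape('a'));       // escape for U+0061
        System.out.println(toEscape('\u20B9'));  // escape for U+20B9 (Indian Rupee sign)
        System.out.println(toEscape('\u20AC'));  // escape for U+20AC (Euro sign)

        // A char is a numeric type, so its value can be read directly.
        System.out.println((int) 'A');           // 65
    }
}
```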

Example Program Using Unicode

public class UnicodeExample {
    public static void main(String[] args) {
        // Using Unicode to represent characters
        char ch1 = '\u0041'; // Unicode for 'A'
        char ch2 = '\u0905'; // Unicode for 'अ' (Devanagari letter A, used in Hindi)
        char ch3 = '\u20B9'; // Unicode for '₹' (Indian Rupee Symbol)

        System.out.println("Character 1: " + ch1); // Output: A
        System.out.println("Character 2: " + ch2); // Output: अ
        System.out.println("Character 3: " + ch3); // Output: ₹
    }
}

Output:

Character 1: A
Character 2: अ
Character 3: ₹

Unicode Escape Sequences in Strings

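You can also use Unicode escape sequences inside string literals. Each escape stands for one 16-bit code unit, so any BMP character takes a single escape, while a character above U+FFFF takes two consecutive escapes (its surrogate pair).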

public class UnicodeStringExample {
    public static void main(String[] args) {
        String str = "\u0048\u0065\u006C\u006C\u006F";  // Unicode for "Hello"
        System.out.println(str);  // Output: Hello
    }
}

Output:

Hello
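
Because the compiler resolves the escapes, an escaped string and the same text typed directly end up as identical String values. A short sketch of this, with EscapeEqualityDemo and its methods as illustrative names only:

```java
public class EscapeEqualityDemo {
    // Builds the text from escape sequences.
    static String escaped() {
        return "\u0048\u0065\u006C\u006C\u006F";
    }

    // Compares the escaped form with the plainly typed form.
    static boolean sameText() {
        return escaped().equals("Hello");
    }

    public static void main(String[] args) {
        System.out.println(sameText());          // true
        System.out.println(escaped().length());  // 5
    }
}
```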

Unicode Range and Supplementary Characters

The Unicode standard defines a code space of 1,114,112 code points, from U+0000 to U+10FFFF. However, a single Java char, which is 16 bits wide, can only represent characters in the Basic Multilingual Plane (BMP), which covers U+0000 to U+FFFF.
Supplementary Characters: Characters outside the BMP (i.e., from U+10000 to U+10FFFF) are called supplementary characters. Java represents these using a pair of char values called surrogate pairs.

For example, most emojis (such as U+1F600) and less common CJK ideographs (U+20000 and above) lie outside the BMP and therefore require a surrogate pair.
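
The sketch below shows one way to work with surrogate pairs using methods of java.lang.Character and String. It assumes Java 8 or later (for String.codePoints()); the class SurrogatePairDemo and the helpers encode and decode are hypothetical names used only for this illustration.

```java
public class SurrogatePairDemo {
    // Encodes a code point as one or two UTF-16 code units.
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Recombines a high and low surrogate into the original code point.
    static int decode(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600;  // a supplementary character
        char[] units = encode(grinningFace);

        System.out.println(units.length);                         // 2
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true
        System.out.println(Integer.toHexString(decode(units[0], units[1]))); // 1f600

        // Iterating by code point keeps the surrogate pair together.
        String text = "A" + new String(units);
        text.codePoints().forEach(cp ->
                System.out.println("U+" + Integer.toHexString(cp).toUpperCase()));
        // prints U+41, then U+1F600
    }
}
```

Iterating with charAt() would visit the two surrogates separately, which is why code-point-based methods are usually the safer choice for text that may contain supplementary characters.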

Why Java Uses Unicode

Globalization Support: Since Java is a language designed to be platform-independent and used globally, it needed to support a wide variety of languages, scripts, and symbols. Unicode allows Java to handle text in almost any language.
Internationalization and Localization: Java applications can be developed in one language and later translated into multiple languages without changing the internal character handling, thanks to Unicode.
Cross-Platform Consistency: Since Java runs on many different platforms, using Unicode ensures consistent character representation across all systems.

Summary

Unicode is a universal character encoding standard that allows Java to represent characters from almost all languages.
Java's char is a 16-bit UTF-16 code unit, so a single char can represent any of the 65,536 code points from U+0000 to U+FFFF.
Unicode escape sequences (\uXXXX) can be used to represent specific characters.
Java supports supplementary characters beyond the BMP using surrogate pairs.

Unicode in Java plays a crucial role in enabling the development of globalized applications that can handle diverse languages and symbols uniformly across different platforms.
