Unicode, Code Points, and Graphemes
Text handling gets easier once you stop treating every char as a whole character. Fonts exposes the text-unit levels it actually uses during layout so you can reason about indexing, fallback, shaping, and glyph coverage with the same vocabulary as the library.
The text-unit levels
- char: a single UTF-16 code unit in a .NET string
- CodePoint: a Unicode scalar value
- grapheme: a user-perceived text element, represented by a ReadOnlySpan<char> returned from SpanGraphemeEnumerator
For plain ASCII text, the three levels line up one-to-one: each char is one code point and one grapheme. Once you move beyond ASCII, they diverge quickly.
What is a char?
In .NET, string is UTF-16. That means a single char is just one UTF-16 code unit.
A char is not always a full Unicode character:
- BMP scalars such as A fit in one char
- supplementary-plane scalars such as many emoji use two char values as a surrogate pair
- combining sequences can use multiple char values to represent what a user sees as one text element
So if you index raw char positions, you are working at the storage level, not the text-semantics level.
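The gap between storage and text semantics can be seen without Fonts at all. The following is a minimal sketch that uses only .NET base-library types (string.Length, System.Text.Rune, and System.Globalization.StringInfo) as stand-ins for the three levels; it assumes .NET Core 3.0 or later for Rune, and it is not the Fonts API.

```csharp
using System;
using System.Globalization;
using System.Text;

// 'A', combining acute accent (U+0301), space, grinning face (U+1F600).
string sample = "A\u0301 \U0001F600";

// UTF-16 code units: what string.Length and the string indexer see.
int codeUnits = sample.Length;

// Unicode scalar values, counted with the base library's Rune enumerator.
int scalars = 0;
foreach (Rune rune in sample.EnumerateRunes())
{
    scalars++;
}

// Text elements (grapheme clusters), counted with StringInfo.
int textElements = new StringInfo(sample).LengthInTextElements;

// The emoji occupies two char slots: a high and a low surrogate.
bool emojiIsPair = char.IsSurrogatePair(sample[3], sample[4]);

Console.WriteLine(
    $"code units: {codeUnits}, scalars: {scalars}, text elements: {textElements}");
// prints "code units: 5, scalars: 4, text elements: 3"
```

Indexing sample[3] on its own returns half of the emoji, which is why raw char offsets are a poor fit for anything a user sees.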
What is a CodePoint?
In strict Unicode terminology:
- a code point is any value in the range U+0000 through U+10FFFF
- a Unicode scalar value is any code point except the surrogate range U+D800 through U+DFFF
CodePoint represents Unicode scalar values. Despite the type name, it intentionally excludes standalone surrogate code points because those are UTF-16 encoding artifacts, not meaningful text values to shape or render.
That makes it the right unit when you want to talk about valid Unicode text values directly.
Useful CodePoint members include:
- Value
- Utf16SequenceLength
- Utf8SequenceLength
- IsAscii
- IsBmp
- Plane
- ReplacementChar
This is also the unit used by glyph-probing APIs such as Font.TryGetGlyphs(...).
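To make the scalar-value rule concrete, here is a small sketch using System.Text.Rune from the .NET base library. Rune models the same concept (a Unicode scalar value) and happens to expose similarly named members, so it works as an analogy. It is not the Fonts CodePoint type, and treating the two as behaving identically in every case is an assumption.

```csharp
using System;
using System.Text;

// Surrogate code points are not scalar values, so Rune rejects them.
bool surrogateIsScalar = Rune.IsValid(0xD800);
bool emojiIsScalar = Rune.IsValid(0x1F600);

// A supplementary-plane scalar: two UTF-16 code units, four UTF-8 bytes.
Rune emoji = new Rune(0x1F600);

Console.WriteLine($"U+D800 valid: {surrogateIsScalar}, U+1F600 valid: {emojiIsScalar}");
// prints "U+D800 valid: False, U+1F600 valid: True"

Console.WriteLine(
    $"U+{emoji.Value:X}: UTF-16 {emoji.Utf16SequenceLength}, " +
    $"UTF-8 {emoji.Utf8SequenceLength}, BMP {emoji.IsBmp}, plane {emoji.Plane}");
// prints "U+1F600: UTF-16 2, UTF-8 4, BMP False, plane 1"
```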
What is a grapheme?
A grapheme cluster is the closest thing to a user-perceived text element.
Examples:
- A is one grapheme
- A followed by a combining acute accent is still one grapheme
- many emoji sequences joined with zero-width joiners are one grapheme
- a flag emoji made from two regional indicators is one grapheme
Fonts exposes grapheme enumeration through SpanGraphemeEnumerator, which implements the Unicode grapheme cluster algorithm from UAX #29.
This is why TextRun.Start and TextRun.End are grapheme indices rather than raw char indices.
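The grapheme examples listed above can be checked with a short sketch. It uses System.Globalization.StringInfo from the base library rather than SpanGraphemeEnumerator, and it assumes .NET 5 or later, where StringInfo follows the UAX #29 extended grapheme cluster rules. CountTextElements is a hypothetical helper introduced only for this illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: number of text elements (grapheme clusters) in a string.
static int CountTextElements(string s) => new StringInfo(s).LengthInTextElements;

int plain = CountTextElements("A");
int accented = CountTextElements("A\u0301");

// Man + zero-width joiner + woman + zero-width joiner + girl (a family emoji).
int family = CountTextElements("\U0001F468\u200D\U0001F469\u200D\U0001F467");

// Regional indicators U and S (a flag emoji).
int flag = CountTextElements("\U0001F1FA\U0001F1F8");

Console.WriteLine($"{plain} {accented} {family} {flag}");
// On .NET 5 or later this prints "1 1 1 1".
```

The family emoji is eight UTF-16 code units and five code points, yet a reader sees a single element, which is the case grapheme indices are designed for.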
Enumerate CodePoint values
The Unicode enumeration helpers live in SixLabors.Fonts.Unicode.
using System;
using SixLabors.Fonts.Unicode;
// 'A' + combining acute accent (U+0301) renders as a single accented A grapheme,
// followed by a space and the grinning-face emoji (U+1F600).
// Escapes keep the decomposed sequence intact even if an editor normalizes the file.
string text = "A\u0301 \U0001F600";
foreach (CodePoint codePoint in text.AsSpan().EnumerateCodePoints())
{
    Console.WriteLine(
        $"U+{codePoint.Value:X}: UTF-16 length {codePoint.Utf16SequenceLength}");
}
EnumerateCodePoints() returns a SpanCodePointEnumerator that yields CodePoint values, so enumeration always operates on Unicode scalar values. Invalid UTF-16 sequences, such as unpaired surrogates, are surfaced as CodePoint.ReplacementChar.
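The replacement behaviour is easiest to understand by feeding an enumerator ill-formed input. The sketch below shows the same idea with the base library's string.EnumerateRunes(), which substitutes Rune.ReplacementChar (U+FFFD) for an unpaired surrogate. It illustrates the pattern only; that CodePoint.ReplacementChar has the same value is an assumption not confirmed here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// A lone high surrogate (U+D800) between two ASCII letters is ill-formed UTF-16.
string broken = "a\uD800z";

// Collect the scalar value of every Rune the enumerator yields.
var values = new List<int>();
foreach (Rune rune in broken.EnumerateRunes())
{
    values.Add(rune.Value);
}

Console.WriteLine(string.Join(" ", values.ConvertAll(v => $"U+{v:X4}")));
// prints "U+0061 U+FFFD U+007A"
```

The string still has three code units, but the middle one can no longer be recovered from the enumerated values, so validate input earlier if you need to preserve it.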
Count helpers are also available:
using SixLabors.Fonts.Unicode;
// 'A' + combining acute (U+0301), space, grinning-face emoji (U+1F600).
// 4 code points: 'A', U+0301, ' ', U+1F600.
int count = "A\u0301 \U0001F600".GetCodePointCount();
Enumerate graphemes
Use grapheme enumeration when you need units that better match what a reader sees.
using System;
using SixLabors.Fonts.Unicode;
// Same text as before, but graphemes group the accented A into one cluster.
string text = "A\u0301 \U0001F600";
int index = 0;
foreach (ReadOnlySpan<char> grapheme in text.AsSpan().EnumerateGraphemes())
{
    Console.WriteLine($"{index++}: {grapheme.ToString()}");
}
EnumerateGraphemes() returns a SpanGraphemeEnumerator.
Count helpers are available here too:
using SixLabors.Fonts.Unicode;
// 3 graphemes: the accented A, the space, and the emoji.
int count = "A\u0301 \U0001F600".GetGraphemeCount();
Enumerate word-boundary segments
Use word enumeration when your code needs to reason about whole words, for example caret movement that jumps a word at a time, double-click word selection, or search-as-you-type tokenization. Word segmentation follows the Unicode Word Boundary Algorithm in UAX #29.
using System;
using SixLabors.Fonts.Unicode;
string text = "Don't stop.";
foreach (WordSegment word in text.AsSpan().EnumerateWordSegments())
{
    Console.WriteLine(
        $"[{word.Utf16Offset}..{word.Utf16Offset + word.Utf16Length}] '{word.Span.ToString()}'");
}
The output for the example above is:
[0..5] 'Don't'
[5..6] ' '
[6..10] 'stop'
[10..11] '.'
UAX #29 segments include separators — the space between Don't and stop is its own segment, and the trailing . is another. Higher-level editor commands can decide whether to stop on those segments or skip past them; the raw enumerator stays aligned with the standard.
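One way an editor command might skip separator segments is sketched below. To stay runnable without the Fonts package, the segment strings are hard-coded from the output above instead of being produced by EnumerateWordSegments(). IsWordLike is a hypothetical helper, and its "contains a letter or digit" rule is a simplification chosen for this illustration, not something the library defines.

```csharp
using System;
using System.Collections.Generic;

// Segment text copied from the sample output above (hard-coded for the sketch).
string[] segments = { "Don't", " ", "stop", "." };

// Hypothetical helper: a segment counts as a word if it contains at least one
// letter or digit. It checks UTF-16 code units, so letters outside the BMP
// (encoded as surrogate pairs) are not recognised by this simple version.
static bool IsWordLike(ReadOnlySpan<char> segment)
{
    foreach (char c in segment)
    {
        if (char.IsLetterOrDigit(c))
        {
            return true;
        }
    }

    return false;
}

// Keep only the segments a "next word" command would stop on.
var words = new List<string>();
foreach (string segment in segments)
{
    if (IsWordLike(segment))
    {
        words.Add(segment);
    }
}

Console.WriteLine(string.Join("|", words));
// prints "Don't|stop"
```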
EnumerateWordSegments() returns a SpanWordEnumerator. Each WordSegment exposes:
- Span: the UTF-16 slice of the segment.
- Utf16Offset and Utf16Length: UTF-16 indices into the original text.
- CodePointOffset and CodePointCount: code-point indices into the original text.
This is the same Unicode word-boundary model used by TextMetrics.WordMetrics, MoveCaret(CaretMovement.NextWord), and GetWordMetrics(hit). Use the enumerator when you need word boundaries against raw text without going through a full layout pass; use the metrics APIs when you need positioned word geometry as well. See Hit Testing and Caret Movement for the layout-aware side.
Which unit should you use?
Use char when:
- you are working with raw .NET string storage
- you truly need UTF-16 code-unit offsets
Use CodePoint when:
- you are inspecting Unicode scalar values
- you are probing glyph availability with TryGetGlyphs(...)
- you care about Unicode values, planes, or encoded sequence lengths
Use graphemes when:
- you are slicing visible text ranges
- you are working with TextRun.Start and TextRun.End
- you want indices that align better with user-visible text elements
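A common practical task is converting a UTF-16 offset (for example, from string.IndexOf) into a grapheme index. The sketch below does this with StringInfo.ParseCombiningCharacters from the base library. GraphemeIndexFromUtf16Offset is a hypothetical helper, and base-library segmentation may differ from SpanGraphemeEnumerator in edge cases, so real code should derive indices with the Fonts enumerator itself.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: counts the text elements that start before utf16Offset.
// An offset that falls inside an element rounds up to the next boundary.
static int GraphemeIndexFromUtf16Offset(string text, int utf16Offset)
{
    // Start offsets, in UTF-16 code units, of each text element.
    int[] starts = StringInfo.ParseCombiningCharacters(text);

    int index = 0;
    while (index < starts.Length && starts[index] < utf16Offset)
    {
        index++;
    }

    return index;
}

// 'A' + U+0301, space, grinning face (U+1F600), '!'.
string text = "A\u0301 \U0001F600!";

// UTF-16 offset of the emoji, which is two code units long.
int emojiOffset = text.IndexOf("\U0001F600", StringComparison.Ordinal);

int start = GraphemeIndexFromUtf16Offset(text, emojiOffset);
int end = GraphemeIndexFromUtf16Offset(text, emojiOffset + 2);

Console.WriteLine($"UTF-16 [{emojiOffset}..{emojiOffset + 2}) -> graphemes [{start}..{end})");
// prints "UTF-16 [3..5) -> graphemes [2..3)"
```

The sketch treats the end index as exclusive. Whether TextRun.End is inclusive or exclusive is not stated in this section, so check the TextRun documentation before using values like these directly.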
Relation to layout
Fonts uses additional Unicode logic internally during layout, including line-breaking and script/shaping data. But the public text-unit APIs you will use most often are:
- EnumerateCodePoints()
- EnumerateGraphemes()
- GetCodePointCount()
- GetGraphemeCount()
- CodePoint
If you are debugging a TextRun range, a missing glyph, or a mismatch between visible text and string indices, start by checking whether you are reasoning in char, CodePoint, or grapheme units.
Practical guidance
Use grapheme indices for user-visible ranges: styling, selection, caret movement, placeholder insertion, and rich text runs. That is the unit closest to what a person thinks of as one visible text element, even when it is made from multiple code points.
Use code points when the question is about Unicode scalar values: probing glyph availability, inspecting script coverage, or understanding encoded sequence length. Use UTF-16 indices only when interoperating with raw .NET string storage or APIs that explicitly require char offsets.
Never assume visible characters, code points, and UTF-16 code units have the same count. That assumption is the root cause of most off-by-one text range bugs in emoji, combining marks, and complex scripts.