Uucp.Break
Break properties.
These properties are mainly for the Unicode text segmentation and line breaking algorithms.
type line = [
| `AI
| `AK
| `AL
| `AP
| `AS
| `B2
| `BA
| `BB
| `BK
| `CB
| `CJ
| `CL
| `CM
| `CP
| `CR
| `EX
| `EB
| `EM
| `GL
| `H2
| `H3
| `HL
| `HY
| `ID
| `IN
| `IS
| `JL
| `JT
| `JV
| `LF
| `NL
| `NS
| `NU
| `OP
| `PO
| `PR
| `QU
| `RI
| `SA
| `SG
| `SP
| `SY
| `VF
| `VI
| `WJ
| `XX
| `ZW
| `ZWJ
]
The type for line breaks.
val pp_line : Format.formatter -> line -> unit
pp_line ppf l prints an unspecified representation of l on ppf.
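Since the printers in this module all have the shape Format.formatter -> 'a -> unit, they can be used with the %a directive. The following is a minimal sketch; to_string is a hypothetical helper, not part of Uucp, and it takes the printer as an argument so the block depends only on the standard library.

```ocaml
(* Hypothetical helper: render any value to a string with a
   Format-style printer such as the pp_* functions of this module. *)
let to_string (pp : Format.formatter -> 'a -> unit) (v : 'a) : string =
  Format.asprintf "%a" pp v
```

Assuming Uucp is available, to_string Uucp.Break.pp_line (Uucp.Break.line u) would give a string for u's line break class. The representation is unspecified, so it is best suited to logs and debugging output.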
val line : Uchar.t -> line
line u is u's line break property.
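As an illustration of consuming this property, the sketch below tests for the classes after which UAX #14 mandates a break (BK, CR, LF and NL). The name forces_break is hypothetical and not part of Uucp, and the sketch deliberately ignores the rule that keeps a CR LF pair together.

```ocaml
(* Hypothetical helper: true for line break classes after which
   UAX #14 requires a break. The CR LF pair rule is not handled. *)
let forces_break = function
  | `BK | `CR | `LF | `NL -> true
  | _ -> false
```

Because the argument type is an open polymorphic variant, a value returned by line can be passed directly, as in forces_break (Uucp.Break.line u).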
val pp_grapheme_cluster : Format.formatter -> grapheme_cluster -> unit
pp_grapheme_cluster ppf g prints an unspecified representation of g on ppf.
val grapheme_cluster : Uchar.t -> grapheme_cluster
grapheme_cluster u is u's grapheme cluster break property.
type word = [
| `CR
| `DQ
| `EX
| `EB
| `EBG
| `EM
| `Extend
| `FO
| `GAZ
| `HL
| `KA
| `LE
| `LF
| `MB
| `ML
| `MN
| `NL
| `NU
| `RI
| `SQ
| `WSegSpace
| `XX
| `ZWJ
]
The type for word breaks.
val pp_word : Format.formatter -> word -> unit
pp_word ppf b prints an unspecified representation of b on ppf.
val word : Uchar.t -> word
word u is u's word break property.
The type for sentence breaks.
val pp_sentence : Format.formatter -> sentence -> unit
pp_sentence ppf b prints an unspecified representation of b on ppf.
val sentence : Uchar.t -> sentence
sentence u is u's sentence break property.
The type for Indic Conjunct Break.
val pp_indic_conjunct_break : Format.formatter -> indic_conjunct_break -> unit
pp_indic_conjunct_break ppf b prints an unspecified representation of b on ppf.
val indic_conjunct_break : Uchar.t -> indic_conjunct_break
indic_conjunct_break u is u's Indic conjunct break property.
val pp_east_asian_width : Format.formatter -> east_asian_width -> unit
pp_east_asian_width ppf w prints an unspecified representation of w on ppf.
val east_asian_width : Uchar.t -> east_asian_width
east_asian_width u is u's East Asian width property.
val tty_width_hint : Uchar.t -> int
tty_width_hint u approximates u's column width as rendered by a typical character terminal.
The current implementation of the function returns either 0, 1, 2 or -1. The value -1 is only returned for scalar values for which the property is nonsensical; clients are expected to sanitize their inputs and not to use the function with these scalar values, which are those in the ranges U+0001-U+001F (C0 controls without U+0000) and U+007F-U+009F (DELETE and C1 controls).
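The sanitization step can be sketched as a predicate over the two ranges named above. The name has_undefined_width is hypothetical and not part of Uucp; the block uses only the standard library.

```ocaml
(* Hypothetical helper: true exactly for the scalar values documented
   as yielding -1, i.e. U+0001-U+001F and U+007F-U+009F. *)
let has_undefined_width (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  (0x0001 <= c && c <= 0x001F) || (0x007F <= c && c <= 0x009F)
```

A client might filter or replace such scalar values before measuring text, rather than checking for -1 after the fact.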
Note. Converting a string to normalization form C before folding this function over its scalar values will, in general, yield better approximations (e.g. on Hangul).
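Folding a width function over the scalar values of a string might look like the sketch below. It assumes UTF-8 input and OCaml 4.14 or later (for String.get_utf_8_uchar); string_width is a hypothetical name, and the width function is passed as an argument so the block has no dependency on Uucp. No normalization is performed here; that would need a separate library.

```ocaml
(* Sketch: sum per-scalar-value widths over a UTF-8 string.
   [width] is meant to be a function like Uucp.Break.tty_width_hint.
   Returns None as soon as a scalar value has negative width, which
   signals input that should have been sanitized. Malformed bytes
   decode to U+FFFD, as specified by String.get_utf_8_uchar. *)
let string_width ~(width : Uchar.t -> int) (s : string) : int option =
  let len = String.length s in
  let rec loop i acc =
    if i >= len then Some acc else
    let d = String.get_utf_8_uchar s i in
    let w = width (Uchar.utf_decode_uchar d) in
    if w < 0 then None
    else loop (i + Uchar.utf_decode_length d) (acc + w)
  in
  loop 0 0
```

With Uucp installed, string_width ~width:Uucp.Break.tty_width_hint s gives an estimate for s. Returning an option keeps the -1 case from silently shrinking the total.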
Warning. This is a heuristic, not a normative property. If you find yourself using this function, please read the following carefully.
This function is the moral equivalent of POSIX wcwidth, in that its purpose is to help align text displayed by a character terminal. It mimics wcwidth, as widely implemented, in yet another way: it is mostly wrong.
Computing column width is a surprisingly difficult task in general. Much of the software infrastructure still carries legacy assumptions about the nature of text harking back to the ASCII era. Different terminal emulators attempt to cope with general Unicode text in different ways, creating a fundamental problem: the width of text fragments varies across terminal emulators, and there is no way of getting feedback from the output layer back into the text-producing layer.
For example: on a modern Linux system, a collection of terminals will disagree on some or all of U+00AD, U+0CBF, and U+2029. They will likewise disagree about unassigned characters (category Cn), sometimes contradicting the system's wcwidth (e.g. U+0378, U+0530). Terminals using bare libxft will display complex scripts differently from terminals using HarfBuzz, and the rendering on OS X will be slightly different from both.
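tty_width_hint uses a simple and predictable width algorithm, based on Markus Kuhn's portable wcwidth:
Scalar values in the ranges U+0001-U+001F and U+007F-U+009F have undefined width (-1).
Characters with East Asian Width Fullwidth or Wide have a width of 2.
Characters with General Category Mn, Me, Cf and U+0000 have a width of 0.
Most other characters have a width of 1, including Cn.
This approach works well, in that it gives results generally consistent with a wide range of terminals, for alphabetic scripts, and for East Asian syllabic and logographic scripts in non-decomposed form. Support varies for abjad scripts in the presence of vowel marks, and it mostly breaks down on abugidas.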
Moreover, non-text symbols like Emoji or Yijing hexagrams will be incorrectly classified as 1-wide, but this in fact agrees with their rendering on many terminals.
Clients should not over-rely on tty_width_hint. It provides a best-effort approximation which will sometimes fail in practice.
module Low : sig ... end
Low level interface.