unicode/unicode.ads

------------------------------------------------------------------------------
--                     XML/Ada - An XML suite for Ada95                     --
--                                                                          --
--                     Copyright (C) 2001-2017, AdaCore                     --
--                                                                          --
-- This library is free software;  you can redistribute it and/or modify it --
-- under terms of the  GNU General Public License  as published by the Free --
-- Software  Foundation;  either version 3,  or (at your  option) any later --
-- version. This library is distributed in the hope that it will be useful, --
-- but WITHOUT ANY WARRANTY;  without even the implied warranty of MERCHAN- --
-- TABILITY or FITNESS FOR A PARTICULAR PURPOSE.                            --
--                                                                          --
-- As a special exception under Section 7 of GPL version 3, you are granted --
-- additional permissions described in the GCC Runtime Library Exception,   --
-- version 3.1, as published by the Free Software Foundation.               --
--                                                                          --
-- You should have received a copy of the GNU General Public License and    --
-- a copy of the GCC Runtime Library Exception along with this program;     --
-- see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see    --
-- <http://www.gnu.org/licenses/>.                                          --
--                                                                          --
------------------------------------------------------------------------------

--  This package provides support for wide characters in Unicode/ISO 10646
--  encoding.
--  A series of child packages is provided to convert from any encoding to
--  Unicode.
--  It also supports several transformation formats (i.e. serializations of
--  these characters to files), such as UTF-8 and UTF-16.

--  Vocabulary used in this package: This is only a small extract of
--  documents found at http://www.unicode.org/unicode/reports/tr17
--
--  Repertoire
--  ==========
--  Set of abstract characters to be encoded, normally a familiar alphabet or
--  symbol set.
--  Unicode is one such repertoire, although an open one. New entries are
--  added to it, but none is ever deleted from it.
--  Internally, this package converts all characters to entries in the Unicode
--  repertoire.
--
--  Glyphs
--  ======
--  A particular image which represents a character or part of a character. For
--  instance, a given character might have a slightly different aspect in
--  different fonts.
--  Note that a single glyph can correspond to a sequence of characters, or a
--  single character to a sequence of glyphs.
--  This package doesn't deal with glyphs at all; they are left to the
--  end-user application.
--
--  Subsets
--  =======
--  Unicode is intended to be a universal repertoire, with all possible
--  characters. Most applications will only support a subset of it, given the
--  complexity of some scripts.
--  The Unicode standard includes a set of internal catalogs, called
--  collections. Several child packages exist to support these collections.
--
--  Coded character sets  (packages Unicode.CCS.*)
--  ====================
--  Mapping from a set of abstract characters to the set of non-negative
--  integers.
--  The integer associated with a character is called its "code point", and
--  the character is called an "encoded character".
--  Examples of these are:  ISO 8859-1, JIS X 0208, ...
--
--  Character naming (packages Unicode.Names.*)
--  ================
--  A unique name is assigned to each abstract character, so that it is
--  possible to get the same character no matter what repertoire is used.
--
--  Character Encoding Forms
--  ========================
--  Mapping from the set of integers used in a Coded Character Set to the set
--  of sequences of code units.
--  A "code unit" is an integer occupying a specified binary width in a
--  computer architecture.
--  Examples of fixed-width encoding forms:  7-bit, 8-bit, EBCDIC
--  Examples of variable-width encoding forms:  Utf-8, Utf-16,...
--
--  Character Encoding Scheme (packages Unicode.CES.*)
--  =========================
--  Mapping of code units into serialized byte sequences. It also takes into
--  account the byte-order serialization.

--  As a summary, converting a file containing Latin-1 characters coded on
--  8 bits to a Utf8 Latin-2 file involves the following steps:
--
--     Latin1 string  (contains bytes associated with code points in Latin1)
--       |    "use Unicode.CES.Basic_8bit.To_Utf32"
--       v
--     Utf32 latin1 string (contains code points in Latin1)
--       |    "Convert argument to To_Utf32 should be
--       v         Unicode.CCS.Iso_8859_1.Convert"
--     Utf32 Unicode string (contains code points in Unicode)
--       |    "use Unicode.CES.Utf8.From_Utf32"
--       v
--     Utf8 Unicode string (contains code points in Unicode)
--       |    "Convert argument to From_Utf32 should be
--       v         Unicode.CCS.Iso_8859_2.Convert"
--     Utf8 Latin2 string (contains code points in Latin2)

--  In the package below, all the Is_* functions are based on values defined
--  in the XML standard.
--  Several child packages are provided, which support different encoding
--  forms and can all convert from and to Utf32, which thus serves as the
--  reference encoding.

package Unicode is
   pragma Preelaborate (Unicode);

   type Unicode_Char is mod 2**32;
   --  A code point associated with a given character, taken in the Unicode
   --  repertoire.
   --  Note that by design, the first 128 entries (code points 0 .. 127) are
   --  taken from the ASCII set and are fully compatible with it. You can thus
   --  easily compare a value of this type with a character literal by using
   --  Character'Pos ('.').
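   --
   --  A minimal sketch of such a comparison (Is_Dot is a hypothetical helper,
   --  not part of this package):
   --
   --     function Is_Dot (C : Unicode_Char) return Boolean is
   --     begin
   --        return C = Character'Pos ('.');
   --     end Is_Dot;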

   function Is_White_Space (Char : Unicode_Char) return Boolean;
   --  Return True if Char is a space character, i.e. a space, horizontal tab,
   --  line feed or carriage return.
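   --
   --  A minimal usage sketch (Skip_Spaces is a hypothetical helper, not part
   --  of this package). It returns the index of the first character of S
   --  that is not white space, or S'Last + 1 if there is none:
   --
   --     function Skip_Spaces (S : String) return Natural is
   --     begin
   --        for J in S'Range loop
   --           if not Is_White_Space (To_Unicode (S (J))) then
   --              return J;
   --           end if;
   --        end loop;
   --        return S'Last + 1;
   --     end Skip_Spaces;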

   function Is_Letter (Char : Unicode_Char) return Boolean;
   --  True if Char is a letter.

   function Is_Base_Char (Char : Unicode_Char) return Boolean;
   --  True if Char is a base character.

   function Is_Digit (Char : Unicode_Char) return Boolean;
   --  True if Char is a digit.

   function Is_Combining_Char (Char : Unicode_Char) return Boolean;
   --  True if Char is a combining character (i.e. a character that
   --  applies to the preceding character to change its meaning, such as
   --  an accent).

   function Is_Extender (Char : Unicode_Char) return Boolean;
   --  True if Char is an extender character.

   function Is_Ideographic (Char : Unicode_Char) return Boolean;
   --  True if Char is an ideographic character (as used in Asian languages).

   function To_Unicode (C : Character) return Unicode_Char;
   --  Convert a character in Ada's Character encoding (Latin-1) to the
   --  corresponding Unicode character.

   function To_Lower (C : Unicode_Char) return Unicode_Char;
   --  Convert C to lower case.
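   --
   --  A minimal usage sketch (Same_Ignoring_Case is a hypothetical helper,
   --  not part of this package):
   --
   --     function Same_Ignoring_Case (A, B : Unicode_Char) return Boolean is
   --     begin
   --        return To_Lower (A) = To_Lower (B);
   --     end Same_Ignoring_Case;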

private
   pragma Inline (Is_Ideographic);
   pragma Inline (Is_Letter);
   pragma Inline (Is_White_Space);
   pragma Inline (To_Unicode);
end Unicode;