summaryrefslogtreecommitdiffstats
path: root/contrib/perl5/pod/perlunicode.pod
diff options
context:
space:
mode:
Diffstat (limited to 'contrib/perl5/pod/perlunicode.pod')
-rw-r--r--contrib/perl5/pod/perlunicode.pod242
1 files changed, 0 insertions, 242 deletions
diff --git a/contrib/perl5/pod/perlunicode.pod b/contrib/perl5/pod/perlunicode.pod
deleted file mode 100644
index 5b0fe2f..0000000
--- a/contrib/perl5/pod/perlunicode.pod
+++ /dev/null
@@ -1,242 +0,0 @@
-=head1 NAME
-
-perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)
-
-=head1 DESCRIPTION
-
-=head2 Important Caveat
-
- WARNING: As of the 5.6.1 release, the implementation of Unicode
- support in Perl is incomplete, and continues to be highly experimental.
-
-The following areas need further work. They are being rapidly addressed
-in the 5.7.x development branch.
-
-=over 4
-
-=item Input and Output Disciplines
-
-There is currently no easy way to mark data read from a file or other
-external source as being utf8. This will be one of the major areas of
-focus in the near future.
-
-=item Regular Expressions
-
-The existing regular expression compiler does not produce polymorphic
-opcodes. This means that the determination on whether to match Unicode
-characters is made when the pattern is compiled, based on whether the
-pattern contains Unicode characters, and not when the matching happens
-at run time. This needs to be changed to adaptively match Unicode if
-the string to be matched is Unicode.
-
-=item C<use utf8> still needed to enable a few features
-
-The C<utf8> pragma implements the tables used for Unicode support. These
-tables are automatically loaded on demand, so the C<utf8> pragma need not
-normally be used.
-
-However, as a compatibility measure, this pragma must be explicitly used
-to enable recognition of UTF-8 encoded literals and identifiers in the
-source text.
-
-=back
-
-=head2 Byte and Character semantics
-
-Beginning with version 5.6, Perl uses logically wide characters to
-represent strings internally. This internal representation of strings
-uses the UTF-8 encoding.
-
-In future, Perl-level operations can be expected to work with characters
-rather than bytes, in general.
-
-However, as strictly an interim compatibility measure, Perl v5.6 aims to
-provide a safe migration path from byte semantics to character semantics
-for programs. For operations where Perl can unambiguously decide that the
-input data is characters, Perl now switches to character semantics.
-For operations where this determination cannot be made without additional
-information from the user, Perl decides in favor of compatibility, and
-chooses to use byte semantics.
-
-This behavior preserves compatibility with earlier versions of Perl,
-which allowed byte semantics in Perl operations, but only as long as
-none of the program's inputs are marked as being as source of Unicode
-character data. Such data may come from filehandles, from calls to
-external programs, from information provided by the system (such as %ENV),
-or from literals and constants in the source text.
-
-If the C<-C> command line switch is used, (or the ${^WIDE_SYSTEM_CALLS}
-global flag is set to C<1>), all system calls will use the
-corresponding wide character APIs. This is currently only implemented
-on Windows.
-
-Regardless of the above, the C<bytes> pragma can always be used to force
-byte semantics in a particular lexical scope. See L<bytes>.
-
-The C<utf8> pragma is primarily a compatibility device that enables
-recognition of UTF-8 in literals encountered by the parser. It may also
-be used for enabling some of the more experimental Unicode support features.
-Note that this pragma is only required until a future version of Perl
-in which character semantics will become the default. This pragma may
-then become a no-op. See L<utf8>.
-
-Unless mentioned otherwise, Perl operators will use character semantics
-when they are dealing with Unicode data, and byte semantics otherwise.
-Thus, character semantics for these operations apply transparently; if
-the input data came from a Unicode source (for example, by adding a
-character encoding discipline to the filehandle whence it came, or a
-literal UTF-8 string constant in the program), character semantics
-apply; otherwise, byte semantics are in effect. To force byte semantics
-on Unicode data, the C<bytes> pragma should be used.
-
-Under character semantics, many operations that formerly operated on
-bytes change to operating on characters. For ASCII data this makes
-no difference, because UTF-8 stores ASCII in single bytes, but for
-any character greater than C<chr(127)>, the character may be stored in
-a sequence of two or more bytes, all of which have the high bit set.
-But by and large, the user need not worry about this, because Perl
-hides it from the user. A character in Perl is logically just a number
-ranging from 0 to 2**32 or so. Larger characters encode to longer
-sequences of bytes internally, but again, this is just an internal
-detail which is hidden at the Perl level.
-
-=head2 Effects of character semantics
-
-Character semantics have the following effects:
-
-=over 4
-
-=item *
-
-Strings and patterns may contain characters that have an ordinal value
-larger than 255.
-
-Presuming you use a Unicode editor to edit your program, such characters
-will typically occur directly within the literal strings as UTF-8
-characters, but you can also specify a particular character with an
-extension of the C<\x> notation. UTF-8 characters are specified by
-putting the hexadecimal code within curlies after the C<\x>. For instance,
-a Unicode smiley face is C<\x{263A}>.
-
-=item *
-
-Identifiers within the Perl script may contain Unicode alphanumeric
-characters, including ideographs. (You are currently on your own when
-it comes to using the canonical forms of characters--Perl doesn't (yet)
-attempt to canonicalize variable names for you.)
-
-=item *
-
-Regular expressions match characters instead of bytes. For instance,
-"." matches a character instead of a byte. (However, the C<\C> pattern
-is provided to force a match a single byte ("C<char>" in C, hence
-C<\C>).)
-
-=item *
-
-Character classes in regular expressions match characters instead of
-bytes, and match against the character properties specified in the
-Unicode properties database. So C<\w> can be used to match an ideograph,
-for instance.
-
-=item *
-
-Named Unicode properties and block ranges make be used as character
-classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
-match property) constructs. For instance, C<\p{Lu}> matches any
-character with the Unicode uppercase property, while C<\p{M}> matches
-any mark character. Single letter properties may omit the brackets, so
-that can be written C<\pM> also. Many predefined character classes are
-available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
-
-=item *
-
-The special pattern C<\X> match matches any extended Unicode sequence
-(a "combining character sequence" in Standardese), where the first
-character is a base character and subsequent characters are mark
-characters that apply to the base character. It is equivalent to
-C<(?:\PM\pM*)>.
-
-=item *
-
-The C<tr///> operator translates characters instead of bytes. Note
-that the C<tr///CU> functionality has been removed, as the interface
-was a mistake. For similar functionality see pack('U0', ...) and
-pack('C0', ...).
-
-=item *
-
-Case translation operators use the Unicode case translation tables
-when provided character input. Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
-
-=item *
-
-Most operators that deal with positions or lengths in the string will
-automatically switch to using character positions, including C<chop()>,
-C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
-C<write()>, and C<length()>. Operators that specifically don't switch
-include C<vec()>, C<pack()>, and C<unpack()>. Operators that really
-don't care include C<chomp()>, as well as any other operator that
-treats a string as a bucket of bits, such as C<sort()>, and the
-operators dealing with filenames.
-
-=item *
-
-The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
-since they're often used for byte-oriented formats. (Again, think
-"C<char>" in the C language.) However, there is a new "C<U>" specifier
-that will convert between UTF-8 characters and integers. (It works
-outside of the utf8 pragma too.)
-
-=item *
-
-The C<chr()> and C<ord()> functions work on characters. This is like
-C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
-C<unpack("C")>. In fact, the latter are how you now emulate
-byte-oriented C<chr()> and C<ord()> under utf8.
-
-=item *
-
-The bit string operators C<& | ^ ~> can operate on character data.
-However, for backward compatibility reasons (bit string operations
-when the characters all are less than 256 in ordinal value) one cannot
-mix C<~> (the bit complement) and characters both less than 256 and
-equal or greater than 256. Most importantly, the DeMorgan's laws
-(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
-Another way to look at this is that the complement cannot return
-B<both> the 8-bit (byte) wide bit complement, and the full character
-wide bit complement.
-
-=item *
-
-And finally, C<scalar reverse()> reverses by character rather than by byte.
-
-=back
-
-=head2 Character encodings for input and output
-
-[XXX: This feature is not yet implemented.]
-
-=head1 CAVEATS
-
-As of yet, there is no method for automatically coercing input and
-output to some encoding other than UTF-8. This is planned in the near
-future, however.
-
-Whether an arbitrary piece of data will be treated as "characters" or
-"bytes" by internal operations cannot be divined at the current time.
-
-Use of locales with utf8 may lead to odd results. Currently there is
-some attempt to apply 8-bit locale info to characters in the range
-0..255, but this is demonstrably incorrect for locales that use
-characters above that range (when mapped into Unicode). It will also
-tend to run slower. Avoidance of locales is strongly encouraged.
-
-=head1 SEE ALSO
-
-L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
-
-=cut
OpenPOWER on IntegriCloud