diff options
Diffstat (limited to 'contrib/perl5/pod/perlunicode.pod')
-rw-r--r-- | contrib/perl5/pod/perlunicode.pod | 242 |
1 files changed, 0 insertions, 242 deletions
diff --git a/contrib/perl5/pod/perlunicode.pod b/contrib/perl5/pod/perlunicode.pod deleted file mode 100644 index 5b0fe2f..0000000 --- a/contrib/perl5/pod/perlunicode.pod +++ /dev/null @@ -1,242 +0,0 @@ -=head1 NAME - -perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change) - -=head1 DESCRIPTION - -=head2 Important Caveat - - WARNING: As of the 5.6.1 release, the implementation of Unicode - support in Perl is incomplete, and continues to be highly experimental. - -The following areas need further work. They are being rapidly addressed -in the 5.7.x development branch. - -=over 4 - -=item Input and Output Disciplines - -There is currently no easy way to mark data read from a file or other -external source as being utf8. This will be one of the major areas of -focus in the near future. - -=item Regular Expressions - -The existing regular expression compiler does not produce polymorphic -opcodes. This means that the determination on whether to match Unicode -characters is made when the pattern is compiled, based on whether the -pattern contains Unicode characters, and not when the matching happens -at run time. This needs to be changed to adaptively match Unicode if -the string to be matched is Unicode. - -=item C<use utf8> still needed to enable a few features - -The C<utf8> pragma implements the tables used for Unicode support. These -tables are automatically loaded on demand, so the C<utf8> pragma need not -normally be used. - -However, as a compatibility measure, this pragma must be explicitly used -to enable recognition of UTF-8 encoded literals and identifiers in the -source text. - -=back - -=head2 Byte and Character semantics - -Beginning with version 5.6, Perl uses logically wide characters to -represent strings internally. This internal representation of strings -uses the UTF-8 encoding. - -In future, Perl-level operations can be expected to work with characters -rather than bytes, in general. - -However, as strictly an interim compatibility measure, Perl v5.6 aims to -provide a safe migration path from byte semantics to character semantics -for programs. For operations where Perl can unambiguously decide that the -input data is characters, Perl now switches to character semantics. -For operations where this determination cannot be made without additional -information from the user, Perl decides in favor of compatibility, and -chooses to use byte semantics. - -This behavior preserves compatibility with earlier versions of Perl, -which allowed byte semantics in Perl operations, but only as long as -none of the program's inputs are marked as being as source of Unicode -character data. Such data may come from filehandles, from calls to -external programs, from information provided by the system (such as %ENV), -or from literals and constants in the source text. - -If the C<-C> command line switch is used, (or the ${^WIDE_SYSTEM_CALLS} -global flag is set to C<1>), all system calls will use the -corresponding wide character APIs. This is currently only implemented -on Windows. - -Regardless of the above, the C<bytes> pragma can always be used to force -byte semantics in a particular lexical scope. See L<bytes>. - -The C<utf8> pragma is primarily a compatibility device that enables -recognition of UTF-8 in literals encountered by the parser. It may also -be used for enabling some of the more experimental Unicode support features. -Note that this pragma is only required until a future version of Perl -in which character semantics will become the default. This pragma may -then become a no-op. See L<utf8>. - -Unless mentioned otherwise, Perl operators will use character semantics -when they are dealing with Unicode data, and byte semantics otherwise. -Thus, character semantics for these operations apply transparently; if -the input data came from a Unicode source (for example, by adding a -character encoding discipline to the filehandle whence it came, or a -literal UTF-8 string constant in the program), character semantics -apply; otherwise, byte semantics are in effect. To force byte semantics -on Unicode data, the C<bytes> pragma should be used. - -Under character semantics, many operations that formerly operated on -bytes change to operating on characters. For ASCII data this makes -no difference, because UTF-8 stores ASCII in single bytes, but for -any character greater than C<chr(127)>, the character may be stored in -a sequence of two or more bytes, all of which have the high bit set. -But by and large, the user need not worry about this, because Perl -hides it from the user. A character in Perl is logically just a number -ranging from 0 to 2**32 or so. Larger characters encode to longer -sequences of bytes internally, but again, this is just an internal -detail which is hidden at the Perl level. - -=head2 Effects of character semantics - -Character semantics have the following effects: - -=over 4 - -=item * - -Strings and patterns may contain characters that have an ordinal value -larger than 255. - -Presuming you use a Unicode editor to edit your program, such characters -will typically occur directly within the literal strings as UTF-8 -characters, but you can also specify a particular character with an -extension of the C<\x> notation. UTF-8 characters are specified by -putting the hexadecimal code within curlies after the C<\x>. For instance, -a Unicode smiley face is C<\x{263A}>. - -=item * - -Identifiers within the Perl script may contain Unicode alphanumeric -characters, including ideographs. (You are currently on your own when -it comes to using the canonical forms of characters--Perl doesn't (yet) -attempt to canonicalize variable names for you.) - -=item * - -Regular expressions match characters instead of bytes. For instance, -"." matches a character instead of a byte. (However, the C<\C> pattern -is provided to force a match a single byte ("C<char>" in C, hence -C<\C>).) - -=item * - -Character classes in regular expressions match characters instead of -bytes, and match against the character properties specified in the -Unicode properties database. So C<\w> can be used to match an ideograph, -for instance. - -=item * - -Named Unicode properties and block ranges make be used as character -classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't -match property) constructs. For instance, C<\p{Lu}> matches any -character with the Unicode uppercase property, while C<\p{M}> matches -any mark character. Single letter properties may omit the brackets, so -that can be written C<\pM> also. Many predefined character classes are -available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. - -=item * - -The special pattern C<\X> match matches any extended Unicode sequence -(a "combining character sequence" in Standardese), where the first -character is a base character and subsequent characters are mark -characters that apply to the base character. It is equivalent to -C<(?:\PM\pM*)>. - -=item * - -The C<tr///> operator translates characters instead of bytes. Note -that the C<tr///CU> functionality has been removed, as the interface -was a mistake. For similar functionality see pack('U0', ...) and -pack('C0', ...). - -=item * - -Case translation operators use the Unicode case translation tables -when provided character input. Note that C<uc()> translates to -uppercase, while C<ucfirst> translates to titlecase (for languages -that make the distinction). Naturally the corresponding backslash -sequences have the same semantics. - -=item * - -Most operators that deal with positions or lengths in the string will -automatically switch to using character positions, including C<chop()>, -C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>, -C<write()>, and C<length()>. Operators that specifically don't switch -include C<vec()>, C<pack()>, and C<unpack()>. Operators that really -don't care include C<chomp()>, as well as any other operator that -treats a string as a bucket of bits, such as C<sort()>, and the -operators dealing with filenames. - -=item * - -The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change, -since they're often used for byte-oriented formats. (Again, think -"C<char>" in the C language.) However, there is a new "C<U>" specifier -that will convert between UTF-8 characters and integers. (It works -outside of the utf8 pragma too.) - -=item * - -The C<chr()> and C<ord()> functions work on characters. This is like -C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and -C<unpack("C")>. In fact, the latter are how you now emulate -byte-oriented C<chr()> and C<ord()> under utf8. - -=item * - -The bit string operators C<& | ^ ~> can operate on character data. -However, for backward compatibility reasons (bit string operations -when the characters all are less than 256 in ordinal value) one cannot -mix C<~> (the bit complement) and characters both less than 256 and -equal or greater than 256. Most importantly, the DeMorgan's laws -(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold. -Another way to look at this is that the complement cannot return -B<both> the 8-bit (byte) wide bit complement, and the full character -wide bit complement. - -=item * - -And finally, C<scalar reverse()> reverses by character rather than by byte. - -=back - -=head2 Character encodings for input and output - -[XXX: This feature is not yet implemented.] - -=head1 CAVEATS - -As of yet, there is no method for automatically coercing input and -output to some encoding other than UTF-8. This is planned in the near -future, however. - -Whether an arbitrary piece of data will be treated as "characters" or -"bytes" by internal operations cannot be divined at the current time. - -Use of locales with utf8 may lead to odd results. Currently there is -some attempt to apply 8-bit locale info to characters in the range -0..255, but this is demonstrably incorrect for locales that use -characters above that range (when mapped into Unicode). It will also -tend to run slower. Avoidance of locales is strongly encouraged. - -=head1 SEE ALSO - -L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}"> - -=cut |