Diffstat (limited to 'contrib/perl5/pod/perlguts.pod')
-rw-r--r-- | contrib/perl5/pod/perlguts.pod | 2318 |
1 files changed, 0 insertions, 2318 deletions
diff --git a/contrib/perl5/pod/perlguts.pod b/contrib/perl5/pod/perlguts.pod deleted file mode 100644 index 9993cc1..0000000 --- a/contrib/perl5/pod/perlguts.pod +++ /dev/null @@ -1,2318 +0,0 @@ -=head1 NAME - -perlguts - Introduction to the Perl API - -=head1 DESCRIPTION - -This document attempts to describe how to use the Perl API, as well as -containing some info on the basic workings of the Perl core. It is far -from complete and probably contains many errors. Please refer any -questions or comments to the author below. - -=head1 Variables - -=head2 Datatypes - -Perl has three typedefs that handle Perl's three main data types: - - SV Scalar Value - AV Array Value - HV Hash Value - -Each typedef has specific routines that manipulate the various data types. - -=head2 What is an "IV"? - -Perl uses a special typedef IV which is a simple signed integer type that is -guaranteed to be large enough to hold a pointer (as well as an integer). -Additionally, there is the UV, which is simply an unsigned IV. - -Perl also uses two special typedefs, I32 and I16, which will always be at -least 32-bits and 16-bits long, respectively. (Again, there are U32 and U16, -as well.) - -=head2 Working with SVs - -An SV can be created and loaded with one command. There are four types of -values that can be loaded: an integer value (IV), a double (NV), -a string (PV), and another scalar (SV). 
- -The six routines are: - - SV* newSViv(IV); - SV* newSVnv(double); - SV* newSVpv(const char*, int); - SV* newSVpvn(const char*, int); - SV* newSVpvf(const char*, ...); - SV* newSVsv(SV*); - -To change the value of an *already-existing* SV, there are eight routines: - - void sv_setiv(SV*, IV); - void sv_setuv(SV*, UV); - void sv_setnv(SV*, double); - void sv_setpv(SV*, const char*); - void sv_setpvn(SV*, const char*, int); - void sv_setpvf(SV*, const char*, ...); - void sv_setpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool); - void sv_setsv(SV*, SV*); - -Notice that you can choose to specify the length of the string to be -assigned by using C<sv_setpvn>, C<newSVpvn>, or C<newSVpv>, or you may -allow Perl to calculate the length by using C<sv_setpv> or by specifying -0 as the second argument to C<newSVpv>. Be warned, though, that Perl will -determine the string's length by using C<strlen>, which depends on the -string terminating with a NUL character. - -The arguments of C<sv_setpvf> are processed like C<sprintf>, and the -formatted output becomes the value. - -C<sv_setpvfn> is an analogue of C<vsprintf>, but it allows you to specify -either a pointer to a variable argument list or the address and length of -an array of SVs. The last argument points to a boolean; on return, if that -boolean is true, then locale-specific information has been used to format -the string, and the string's contents are therefore untrustworthy (see -L<perlsec>). This pointer may be NULL if that information is not -important. Note that this function requires you to specify the length of -the format. - -STRLEN is an integer type (Size_t, usually defined as size_t in -config.h) guaranteed to be large enough to represent the size of -any string that perl can handle. - -The C<sv_set*()> functions are not generic enough to operate on values -that have "magic". See L<Magic Virtual Tables> later in this document. 
- -All SVs that contain strings should be terminated with a NUL character. -If it is not NUL-terminated there is a risk of -core dumps and corruptions from code which passes the string to C -functions or system calls which expect a NUL-terminated string. -Perl's own functions typically add a trailing NUL for this reason. -Nevertheless, you should be very careful when you pass a string stored -in an SV to a C function or system call. - -To access the actual value that an SV points to, you can use the macros: - - SvIV(SV*) - SvUV(SV*) - SvNV(SV*) - SvPV(SV*, STRLEN len) - SvPV_nolen(SV*) - -which will automatically coerce the actual scalar type into an IV, UV, double, -or string. - -In the C<SvPV> macro, the length of the string returned is placed into the -variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do -not care what the length of the data is, use the C<SvPV_nolen> macro. -Historically the C<SvPV> macro with the global variable C<PL_na> has been -used in this case. But that can be quite inefficient because C<PL_na> must -be accessed in thread-local storage in threaded Perl. In any case, remember -that Perl allows arbitrary strings of data that may both contain NULs and -might not be terminated by a NUL. - -Also remember that C doesn't allow you to safely say C<foo(SvPV(s, len), -len);>. It might work with your compiler, but it won't work for everyone. -Break this sort of statement up into separate assignments: - - SV *s; - STRLEN len; - char * ptr; - ptr = SvPV(s, len); - foo(ptr, len); - -If you want to know if the scalar value is TRUE, you can use: - - SvTRUE(SV*) - -Although Perl will automatically grow strings for you, if you need to force -Perl to allocate more memory for your SV, you can use the macro - - SvGROW(SV*, STRLEN newlen) - -which will determine if more memory needs to be allocated. If so, it will -call the function C<sv_grow>. 
Note that C<SvGROW> can only increase, not -decrease, the allocated memory of an SV and that it does not automatically -add a byte for a trailing NUL (perl's own string functions typically do -C<SvGROW(sv, len + 1)>). - -If you have an SV and want to know what kind of data Perl thinks is stored -in it, you can use the following macros to check the type of SV you have. - - SvIOK(SV*) - SvNOK(SV*) - SvPOK(SV*) - -You can get and set the current length of the string stored in an SV with -the following macros: - - SvCUR(SV*) - SvCUR_set(SV*, I32 val) - -You can also get a pointer to the end of the string stored in the SV -with the macro: - - SvEND(SV*) - -But note that these last three macros are valid only if C<SvPOK()> is true. - -If you want to append something to the end of the string stored in an C<SV*>, -you can use the following functions: - - void sv_catpv(SV*, const char*); - void sv_catpvn(SV*, const char*, STRLEN); - void sv_catpvf(SV*, const char*, ...); - void sv_catpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool); - void sv_catsv(SV*, SV*); - -The first function calculates the length of the string to be appended by -using C<strlen>. In the second, you specify the length of the string -yourself. The third function processes its arguments like C<sprintf> and -appends the formatted output. The fourth function works like C<vsprintf>. -You can specify the address and length of an array of SVs instead of the -va_list argument. The fifth function extends the string stored in the first -SV with the string stored in the second SV. It also forces the second SV -to be interpreted as a string. - -The C<sv_cat*()> functions are not generic enough to operate on values that -have "magic". See L<Magic Virtual Tables> later in this document. - -If you know the name of a scalar variable, you can get a pointer to its SV -by using the following: - - SV* get_sv("package::varname", FALSE); - -This returns NULL if the variable does not exist. 
- -If you want to know if this variable (or any other SV) is actually C<defined>, -you can call: - - SvOK(SV*) - -The scalar C<undef> value is stored in an SV instance called C<PL_sv_undef>. Its -address can be used whenever an C<SV*> is needed. - -There are also the two values C<PL_sv_yes> and C<PL_sv_no>, which contain Boolean -TRUE and FALSE values, respectively. Like C<PL_sv_undef>, their addresses can -be used whenever an C<SV*> is needed. - -Do not be fooled into thinking that C<(SV *) 0> is the same as C<&PL_sv_undef>. -Take this code: - - SV* sv = (SV*) 0; - if (I-am-to-return-a-real-value) { - sv = sv_2mortal(newSViv(42)); - } - sv_setsv(ST(0), sv); - -This code tries to return a new SV (which contains the value 42) if it should -return a real value, or undef otherwise. Instead it has returned a NULL -pointer which, somewhere down the line, will cause a segmentation violation, -bus error, or just weird results. Change the zero to C<&PL_sv_undef> in the first -line and all will be well. - -To free an SV that you've created, call C<SvREFCNT_dec(SV*)>. Normally this -call is not necessary (see L<Reference Counts and Mortality>). - -=head2 Offsets - -Perl provides the function C<sv_chop> to efficiently remove characters -from the beginning of a string; you give it an SV and a pointer to -somewhere inside the PV, and it discards everything before the -pointer. The efficiency comes by means of a little hack: instead of -actually removing the characters, C<sv_chop> sets the flag C<OOK> -(offset OK) to signal to other functions that the offset hack is in -effect, and it puts the number of bytes chopped off into the IV field -of the SV. It then moves the PV pointer (called C<SvPVX>) forward that -many bytes, and adjusts C<SvCUR> and C<SvLEN>. - -Hence, at this point, the start of the buffer that we allocated lives -at C<SvPVX(sv) - SvIV(sv)> in memory and the PV pointer is pointing -into the middle of this allocated storage. 
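The bookkeeping behind the offset hack can be sketched in plain C without any Perl headers. This is a toy model only: the `toy_sv` struct, its field names, and the `toy_sv_*` helpers are hypothetical and do not reflect Perl's real SV layout. The sketch only shows why the real start of the buffer sits at "visible pointer minus offset".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the string part of an SV; NOT Perl's real layout. */
typedef struct {
    char  *pv;     /* visible start of the string (plays the role of SvPVX) */
    size_t cur;    /* visible length, excluding the NUL (role of SvCUR) */
    size_t len;    /* bytes allocated from pv onward (role of SvLEN) */
    size_t offset; /* bytes chopped so far (Perl keeps this in the IV slot) */
    int    ook;    /* nonzero once the offset hack is active (role of OOK) */
} toy_sv;

/* Allocate a toy_sv holding a copy of str. Returns 0 on success, -1 on
   allocation failure. */
static int toy_sv_init(toy_sv *sv, const char *str)
{
    size_t n = strlen(str);
    sv->pv = malloc(n + 1);
    if (sv->pv == NULL)
        return -1;
    memcpy(sv->pv, str, n + 1);
    sv->cur = n;
    sv->len = n + 1;
    sv->offset = 0;
    sv->ook = 0;
    return 0;
}

/* Discard everything before ptr without moving any bytes: advance the
   visible pointer, shrink cur and len, and remember the offset. */
static void toy_sv_chop(toy_sv *sv, const char *ptr)
{
    size_t delta;
    assert(ptr >= sv->pv && ptr <= sv->pv + sv->cur);
    delta = (size_t)(ptr - sv->pv);
    sv->pv += delta;
    sv->cur -= delta;
    sv->len -= delta;
    sv->offset += delta;
    sv->ook = 1;
}

/* The allocator must be handed the real start of the buffer, which is
   the visible pointer minus the recorded offset. */
static void toy_sv_free(toy_sv *sv)
{
    free(sv->pv - sv->offset);
    sv->pv = NULL;
}
```

Initialising a `toy_sv` with "12345" and chopping at `sv.pv + 1` leaves the visible string "2345" with `cur` 4, `len` 5 and `offset` 1.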
- -This is best demonstrated by example: - - % ./perl -Ilib -MDevel::Peek -le '$a="12345"; $a=~s/.//; Dump($a)' - SV = PVIV(0x8128450) at 0x81340f0 - REFCNT = 1 - FLAGS = (POK,OOK,pPOK) - IV = 1 (OFFSET) - PV = 0x8135781 ( "1" . ) "2345"\0 - CUR = 4 - LEN = 5 - -Here the number of bytes chopped off (1) is put into IV, and -C<Devel::Peek::Dump> helpfully reminds us that this is an offset. The -portion of the string between the "real" and the "fake" beginnings is -shown in parentheses, and the values of C<SvCUR> and C<SvLEN> reflect -the fake beginning, not the real one. - -Something similar to the offset hack is performed on AVs to enable -efficient shifting and splicing off the beginning of the array; while -C<AvARRAY> points to the first element in the array that is visible from -Perl, C<AvALLOC> points to the real start of the C array. These are -usually the same, but a C<shift> operation can be carried out by -increasing C<AvARRAY> by one and decreasing C<AvFILL> and C<AvLEN>. -Again, the location of the real start of the C array only comes into -play when freeing the array. See C<av_shift> in F<av.c>. - -=head2 What's Really Stored in an SV? - -Recall that the usual method of determining the type of scalar you have is -to use C<Sv*OK> macros. Because a scalar can be both a number and a string, -usually these macros will always return TRUE and calling the C<Sv*V> -macros will do the appropriate conversion of string to integer/double or -integer/double to string. - -If you I<really> need to know if you have an integer, double, or string -pointer in an SV, you can use the following three macros instead: - - SvIOKp(SV*) - SvNOKp(SV*) - SvPOKp(SV*) - -These will tell you if you truly have an integer, double, or string pointer -stored in your SV. The "p" stands for private. - -In general, though, it's best to use the C<Sv*V> macros. - -=head2 Working with AVs - -There are two ways to create and load an AV. 
The first method creates an -empty AV: - - AV* newAV(); - -The second method both creates the AV and initially populates it with SVs: - - AV* av_make(I32 num, SV **ptr); - -The second argument points to an array containing C<num> C<SV*>'s. Once the -AV has been created, the SVs can be destroyed, if so desired. - -Once the AV has been created, the following operations are possible on AVs: - - void av_push(AV*, SV*); - SV* av_pop(AV*); - SV* av_shift(AV*); - void av_unshift(AV*, I32 num); - -These should be familiar operations, with the exception of C<av_unshift>. -This routine adds C<num> elements at the front of the array with the C<undef> -value. You must then use C<av_store> (described below) to assign values -to these new elements. - -Here are some other functions: - - I32 av_len(AV*); - SV** av_fetch(AV*, I32 key, I32 lval); - SV** av_store(AV*, I32 key, SV* val); - -The C<av_len> function returns the highest index value in array (just -like $#array in Perl). If the array is empty, -1 is returned. The -C<av_fetch> function returns the value at index C<key>, but if C<lval> -is non-zero, then C<av_fetch> will store an undef value at that index. -The C<av_store> function stores the value C<val> at index C<key>, and does -not increment the reference count of C<val>. Thus the caller is responsible -for taking care of that, and if C<av_store> returns NULL, the caller will -have to decrement the reference count to avoid a memory leak. Note that -C<av_fetch> and C<av_store> both return C<SV**>'s, not C<SV*>'s as their -return value. - - void av_clear(AV*); - void av_undef(AV*); - void av_extend(AV*, I32 key); - -The C<av_clear> function deletes all the elements in the AV* array, but -does not actually delete the array itself. The C<av_undef> function will -delete all the elements in the array plus the array itself. The -C<av_extend> function extends the array so that it contains at least C<key+1> -elements. 
If C<key+1> is less than the currently allocated length of the array, -then nothing is done. - -If you know the name of an array variable, you can get a pointer to its AV -by using the following: - - AV* get_av("package::varname", FALSE); - -This returns NULL if the variable does not exist. - -See L<Understanding the Magic of Tied Hashes and Arrays> for more -information on how to use the array access functions on tied arrays. - -=head2 Working with HVs - -To create an HV, you use the following routine: - - HV* newHV(); - -Once the HV has been created, the following operations are possible on HVs: - - SV** hv_store(HV*, const char* key, U32 klen, SV* val, U32 hash); - SV** hv_fetch(HV*, const char* key, U32 klen, I32 lval); - -The C<klen> parameter is the length of the key being passed in (Note that -you cannot pass 0 in as a value of C<klen> to tell Perl to measure the -length of the key). The C<val> argument contains the SV pointer to the -scalar being stored, and C<hash> is the precomputed hash value (zero if -you want C<hv_store> to calculate it for you). The C<lval> parameter -indicates whether this fetch is actually a part of a store operation, in -which case a new undefined value will be added to the HV with the supplied -key and C<hv_fetch> will return as if the value had already existed. - -Remember that C<hv_store> and C<hv_fetch> return C<SV**>'s and not just -C<SV*>. To access the scalar value, you must first dereference the return -value. However, you should check to make sure that the return value is -not NULL before dereferencing it. - -The first of these two functions checks whether a hash table entry exists; -the second deletes it. - - bool hv_exists(HV*, const char* key, U32 klen); - SV* hv_delete(HV*, const char* key, U32 klen, I32 flags); - -If C<flags> does not include the C<G_DISCARD> flag then C<hv_delete> will -create and return a mortal copy of the deleted value. 
- -And more miscellaneous functions: - - void hv_clear(HV*); - void hv_undef(HV*); - -Like their AV counterparts, C<hv_clear> deletes all the entries in the hash -table but does not actually delete the hash table. The C<hv_undef> deletes -both the entries and the hash table itself. - -Perl keeps the actual data in a linked list of structures with a typedef of HE. -These contain the actual key and value pointers (plus extra administrative -overhead). The key is a string pointer; the value is an C<SV*>. However, -once you have an C<HE*>, to get the actual key and value, use the routines -specified below. - - I32 hv_iterinit(HV*); - /* Prepares starting point to traverse hash table */ - HE* hv_iternext(HV*); - /* Get the next entry, and return a pointer to a - structure that has both the key and value */ - char* hv_iterkey(HE* entry, I32* retlen); - /* Get the key from an HE structure and also return - the length of the key string */ - SV* hv_iterval(HV*, HE* entry); - /* Return a SV pointer to the value of the HE - structure */ - SV* hv_iternextsv(HV*, char** key, I32* retlen); - /* This convenience routine combines hv_iternext, - hv_iterkey, and hv_iterval. The key and retlen - arguments are return values for the key and its - length. The value is returned in the SV* argument */ - -If you know the name of a hash variable, you can get a pointer to its HV -by using the following: - - HV* get_hv("package::varname", FALSE); - -This returns NULL if the variable does not exist. - -The hash algorithm is defined in the C<PERL_HASH(hash, key, klen)> macro: - - hash = 0; - while (klen--) - hash = (hash * 33) + *key++; - hash = hash + (hash >> 5); /* after 5.6 */ - -The last step was added in version 5.6 to improve distribution of -lower bits in the resulting hash value. - -See L<Understanding the Magic of Tied Hashes and Arrays> for more -information on how to use the hash access functions on tied hashes. 
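The C<PERL_HASH> pseudocode quoted above can be written as a standalone C function for experimenting outside of Perl. This is a sketch under stated assumptions: the name `toy_perl_hash` is hypothetical, the 32-bit unsigned result type stands in for C<U32>, and each key byte is read as `unsigned char`, which the macro text above does not specify.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-by-33 string hash with the extra mixing step described
   above. Assumption: bytes are treated as unsigned. */
static uint32_t toy_perl_hash(const char *key, size_t klen)
{
    uint32_t hash = 0;

    while (klen--)
        hash = hash * 33 + (unsigned char)*key++;

    /* The step the text says was added in 5.6 to improve the
       distribution of the lower bits. */
    hash = hash + (hash >> 5);
    return hash;
}
```

A value computed this way illustrates what the `hash` argument of C<hv_store> stands for; real extension code should use the C<PERL_HASH> macro from Perl's own headers instead, since the algorithm is tied to the Perl version.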
- -=head2 Hash API Extensions - -Beginning with version 5.004, the following functions are also supported: - - HE* hv_fetch_ent (HV* tb, SV* key, I32 lval, U32 hash); - HE* hv_store_ent (HV* tb, SV* key, SV* val, U32 hash); - - bool hv_exists_ent (HV* tb, SV* key, U32 hash); - SV* hv_delete_ent (HV* tb, SV* key, I32 flags, U32 hash); - - SV* hv_iterkeysv (HE* entry); - -Note that these functions take C<SV*> keys, which simplifies writing -of extension code that deals with hash structures. These functions -also allow passing of C<SV*> keys to C<tie> functions without forcing -you to stringify the keys (unlike the previous set of functions). - -They also return and accept whole hash entries (C<HE*>), making their -use more efficient (since the hash number for a particular string -doesn't have to be recomputed every time). See L<perlapi> for detailed -descriptions. - -The following macros must always be used to access the contents of hash -entries. Note that the arguments to these macros must be simple -variables, since they may get evaluated more than once. See -L<perlapi> for detailed descriptions of these macros. - - HePV(HE* he, STRLEN len) - HeVAL(HE* he) - HeHASH(HE* he) - HeSVKEY(HE* he) - HeSVKEY_force(HE* he) - HeSVKEY_set(HE* he, SV* sv) - -These two lower level macros are defined, but must only be used when -dealing with keys that are not C<SV*>s: - - HeKEY(HE* he) - HeKLEN(HE* he) - -Note that both C<hv_store> and C<hv_store_ent> do not increment the -reference count of the stored C<val>, which is the caller's responsibility. -If these functions return a NULL value, the caller will usually have to -decrement the reference count of C<val> to avoid a memory leak. - -=head2 References - -References are a special type of scalar that point to other data types -(including references). 
- -To create a reference, use either of the following functions: - - SV* newRV_inc((SV*) thing); - SV* newRV_noinc((SV*) thing); - -The C<thing> argument can be any of an C<SV*>, C<AV*>, or C<HV*>. The -functions are identical except that C<newRV_inc> increments the reference -count of the C<thing>, while C<newRV_noinc> does not. For historical -reasons, C<newRV> is a synonym for C<newRV_inc>. - -Once you have a reference, you can use the following macro to dereference -the reference: - - SvRV(SV*) - -then call the appropriate routines, casting the returned C<SV*> to either an -C<AV*> or C<HV*>, if required. - -To determine if an SV is a reference, you can use the following macro: - - SvROK(SV*) - -To discover what type of value the reference refers to, use the following -macro and then check the return value. - - SvTYPE(SvRV(SV*)) - -The most useful types that will be returned are: - - SVt_IV Scalar - SVt_NV Scalar - SVt_PV Scalar - SVt_RV Scalar - SVt_PVAV Array - SVt_PVHV Hash - SVt_PVCV Code - SVt_PVGV Glob (possibly a file handle) - SVt_PVMG Blessed or Magical Scalar - - See the sv.h header file for more details. - -=head2 Blessed References and Class Objects - -References are also used to support object-oriented programming. In the -OO lexicon, an object is simply a reference that has been blessed into a -package (or class). Once blessed, the programmer may now use the reference -to access the various methods in the class. - -A reference can be blessed into a package with the following function: - - SV* sv_bless(SV* sv, HV* stash); - -The C<sv> argument must be a reference. The C<stash> argument specifies -which class the reference will belong to. See -L<Stashes and Globs> for information on converting class names into stashes. - -/* Still under construction */ - -Upgrades rv to reference if not already one. Creates new SV for rv to -point to. If C<classname> is non-null, the SV is blessed into the specified -class. SV is returned. 
- - SV* newSVrv(SV* rv, const char* classname); - -Copies integer or double into an SV whose reference is C<rv>. SV is blessed -if C<classname> is non-null. - - SV* sv_setref_iv(SV* rv, const char* classname, IV iv); - SV* sv_setref_nv(SV* rv, const char* classname, NV iv); - -Copies the pointer value (I<the address, not the string!>) into an SV whose -reference is rv. SV is blessed if C<classname> is non-null. - - SV* sv_setref_pv(SV* rv, const char* classname, PV iv); - -Copies string into an SV whose reference is C<rv>. Set length to 0 to let -Perl calculate the string length. SV is blessed if C<classname> is non-null. - - SV* sv_setref_pvn(SV* rv, const char* classname, PV iv, STRLEN length); - -Tests whether the SV is blessed into the specified class. It does not -check inheritance relationships. - - int sv_isa(SV* sv, const char* name); - -Tests whether the SV is a reference to a blessed object. - - int sv_isobject(SV* sv); - -Tests whether the SV is derived from the specified class. SV can be either -a reference to a blessed object or a string containing a class name. This -is the function implementing the C<UNIVERSAL::isa> functionality. - - bool sv_derived_from(SV* sv, const char* name); - -To check if you've got an object derived from a specific class you have -to write: - - if (sv_isobject(sv) && sv_derived_from(sv, class)) { ... } - -=head2 Creating New Variables - -To create a new Perl variable with an undef value which can be accessed from -your Perl script, use the following routines, depending on the variable type. - - SV* get_sv("package::varname", TRUE); - AV* get_av("package::varname", TRUE); - HV* get_hv("package::varname", TRUE); - -Notice the use of TRUE as the second parameter. The new variable can now -be set, using the routines appropriate to the data type. - -There are additional macros whose values may be bitwise OR'ed with the -C<TRUE> argument to enable certain extra features. 
Those bits are: - - GV_ADDMULTI Marks the variable as multiply defined, thus preventing the - "Name <varname> used only once: possible typo" warning. - GV_ADDWARN Issues the warning "Had to create <varname> unexpectedly" if - the variable did not exist before the function was called. - -If you do not specify a package name, the variable is created in the current -package. - -=head2 Reference Counts and Mortality - -Perl uses a reference count-driven garbage collection mechanism. SVs, -AVs, or HVs (xV for short in the following) start their life with a -reference count of 1. If the reference count of an xV ever drops to 0, -then it will be destroyed and its memory made available for reuse. - -This normally doesn't happen at the Perl level unless a variable is -undef'ed or the last variable holding a reference to it is changed or -overwritten. At the internal level, however, reference counts can be -manipulated with the following macros: - - int SvREFCNT(SV* sv); - SV* SvREFCNT_inc(SV* sv); - void SvREFCNT_dec(SV* sv); - -However, there is one other function which manipulates the reference -count of its argument. The C<newRV_inc> function, you will recall, -creates a reference to the specified argument. As a side effect, -it increments the argument's reference count. If this is not what -you want, use C<newRV_noinc> instead. - -For example, imagine you want to return a reference from an XSUB function. -Inside the XSUB routine, you create an SV which initially has a reference -count of one. Then you call C<newRV_inc>, passing it the just-created SV. -This returns the reference as a new SV, but the reference count of the -SV you passed to C<newRV_inc> has been incremented to two. Now you -return the reference from the XSUB routine and forget about the SV. -But Perl hasn't! Whenever the returned reference is destroyed, the -reference count of the original SV is decreased to one and nothing happens. 
-The SV will hang around without any way to access it until Perl itself -terminates. This is a memory leak. - -The correct procedure, then, is to use C<newRV_noinc> instead of -C<newRV_inc>. Then, if and when the last reference is destroyed, -the reference count of the SV will go to zero and it will be destroyed, -stopping any memory leak. - -There are some convenience functions available that can help with the -destruction of xVs. These functions introduce the concept of "mortality". -An xV that is mortal has had its reference count marked to be decremented, -but not actually decremented, until "a short time later". Generally the -term "short time later" means a single Perl statement, such as a call to -an XSUB function. The actual determinant for when mortal xVs have their -reference count decremented depends on two macros, SAVETMPS and FREETMPS. -See L<perlcall> and L<perlxs> for more details on these macros. - -"Mortalization" then is at its simplest a deferred C<SvREFCNT_dec>. -However, if you mortalize a variable twice, the reference count will -later be decremented twice. - -You should be careful about creating mortal variables. Strange things -can happen if you make the same value mortal within multiple contexts, -or if you make a variable mortal multiple times. - -To create a mortal variable, use the functions: - - SV* sv_newmortal() - SV* sv_2mortal(SV*) - SV* sv_mortalcopy(SV*) - -The first call creates a mortal SV, the second converts an existing -SV to a mortal SV (and thus defers a call to C<SvREFCNT_dec>), and the -third creates a mortal copy of an existing SV. - -The mortal routines are not just for SVs -- AVs and HVs can be -made mortal by passing their address (type-casted to C<SV*>) to the -C<sv_2mortal> or C<sv_mortalcopy> routines. - -=head2 Stashes and Globs - -A "stash" is a hash that contains all of the different objects that -are contained within a package. 
Each key of the stash is a symbol -name (shared by all the different types of objects that have the same -name), and each value in the hash table is a GV (Glob Value). This GV -in turn contains references to the various objects of that name, -including (but not limited to) the following: - - Scalar Value - Array Value - Hash Value - I/O Handle - Format - Subroutine - -There is a single stash called "PL_defstash" that holds the items that exist -in the "main" package. To get at the items in other packages, append the -string "::" to the package name. The items in the "Foo" package are in -the stash "Foo::" in PL_defstash. The items in the "Bar::Baz" package are -in the stash "Baz::" in "Bar::"'s stash. - -To get the stash pointer for a particular package, use the function: - - HV* gv_stashpv(const char* name, I32 create) - HV* gv_stashsv(SV*, I32 create) - -The first function takes a literal string, the second uses the string stored -in the SV. Remember that a stash is just a hash table, so you get back an -C<HV*>. The C<create> flag will create a new package if it is set. - -The name that C<gv_stash*v> wants is the name of the package whose symbol table -you want. The default package is called C<main>. If you have multiply nested -packages, pass their names to C<gv_stash*v>, separated by C<::> as in the Perl -language itself. - -Alternately, if you have an SV that is a blessed reference, you can find -out the stash pointer by using: - - HV* SvSTASH(SvRV(SV*)); - -then use the following to get the package name itself: - - char* HvNAME(HV* stash); - -If you need to bless or re-bless an object you can use the following -function: - - SV* sv_bless(SV*, HV* stash) - -where the first argument, an C<SV*>, must be a reference, and the second -argument is a stash. The returned C<SV*> can now be used in the same way -as any other SV. - -For more information on references and blessings, consult L<perlref>. 
- -=head2 Double-Typed SVs - -Scalar variables normally contain only one type of value, an integer, -double, pointer, or reference. Perl will automatically convert the -actual scalar data from the stored type into the requested type. - -Some scalar variables contain more than one type of scalar data. For -example, the variable C<$!> contains either the numeric value of C<errno> -or its string equivalent from either C<strerror> or C<sys_errlist[]>. - -To force multiple data values into an SV, you must do two things: use the -C<sv_set*v> routines to add the additional scalar type, then set a flag -so that Perl will believe it contains more than one type of data. The -four macros to set the flags are: - - SvIOK_on - SvNOK_on - SvPOK_on - SvROK_on - -The particular macro you must use depends on which C<sv_set*v> routine -you called first. This is because every C<sv_set*v> routine turns on -only the bit for the particular type of data being set, and turns off -all the rest. - -For example, to create a new Perl variable called "dberror" that contains -both the numeric and descriptive string error values, you could use the -following code: - - extern int dberror; - extern char **dberror_list; /* array of message strings, indexed by dberror */ - - SV* sv = get_sv("dberror", TRUE); - sv_setiv(sv, (IV) dberror); - sv_setpv(sv, dberror_list[dberror]); - SvIOK_on(sv); - -If the order of C<sv_setiv> and C<sv_setpv> had been reversed, then the -macro C<SvPOK_on> would need to be called instead of C<SvIOK_on>. - -=head2 Magic Variables - -[This section still under construction. Ignore everything here. Post no -bills. Everything not permitted is forbidden.] - -Any SV may be magical, that is, it has special features that a normal -SV does not have. These features are stored in the SV structure in a -linked list of C<struct magic>'s, typedef'ed to C<MAGIC>. 
- - struct magic { - MAGIC* mg_moremagic; - MGVTBL* mg_virtual; - U16 mg_private; - char mg_type; - U8 mg_flags; - SV* mg_obj; - char* mg_ptr; - I32 mg_len; - }; - -Note this is current as of patchlevel 0, and could change at any time. - -=head2 Assigning Magic - -Perl adds magic to an SV using the sv_magic function: - - void sv_magic(SV* sv, SV* obj, int how, const char* name, I32 namlen); - -The C<sv> argument is a pointer to the SV that is to acquire a new magical -feature. - -If C<sv> is not already magical, Perl uses the C<SvUPGRADE> macro to -set the C<SVt_PVMG> flag for the C<sv>. Perl then continues by adding -it to the beginning of the linked list of magical features. Any prior -entry of the same type of magic is deleted. Note that this can be -overridden, and multiple instances of the same type of magic can be -associated with an SV. - -The C<name> and C<namlen> arguments are used to associate a string with -the magic, typically the name of a variable. C<namlen> is stored in the -C<mg_len> field and if C<name> is non-null and C<namlen> >= 0 a malloc'd -copy of the name is stored in C<mg_ptr> field. - -The sv_magic function uses C<how> to determine which, if any, predefined -"Magic Virtual Table" should be assigned to the C<mg_virtual> field. -See the "Magic Virtual Table" section below. The C<how> argument is also -stored in the C<mg_type> field. - -The C<obj> argument is stored in the C<mg_obj> field of the C<MAGIC> -structure. If it is not the same as the C<sv> argument, the reference -count of the C<obj> object is incremented. If it is the same, or if -the C<how> argument is "#", or if it is a NULL pointer, then C<obj> is -merely stored, without the reference count being incremented. - -There is also a function to add magic to an C<HV>: - - void hv_magic(HV *hv, GV *gv, int how); - -This simply calls C<sv_magic> and coerces the C<gv> argument into an C<SV>. 
- -To remove the magic from an SV, call the function sv_unmagic: - - void sv_unmagic(SV *sv, int type); - -The C<type> argument should be equal to the C<how> value when the C<SV> -was initially made magical. - -=head2 Magic Virtual Tables - -The C<mg_virtual> field in the C<MAGIC> structure is a pointer to a -C<MGVTBL>, which is a structure of function pointers and stands for -"Magic Virtual Table" to handle the various operations that might be -applied to that variable. - -The C<MGVTBL> has five pointers to the following routine types: - - int (*svt_get)(SV* sv, MAGIC* mg); - int (*svt_set)(SV* sv, MAGIC* mg); - U32 (*svt_len)(SV* sv, MAGIC* mg); - int (*svt_clear)(SV* sv, MAGIC* mg); - int (*svt_free)(SV* sv, MAGIC* mg); - -This MGVTBL structure is set at compile-time in C<perl.h> and there are -currently 19 types (or 21 with overloading turned on). These different -structures contain pointers to various routines that perform additional -actions depending on which function is being called. - - Function pointer Action taken - ---------------- ------------ - svt_get Do something after the value of the SV is retrieved. - svt_set Do something after the SV is assigned a value. - svt_len Report on the SV's length. - svt_clear Clear something the SV represents. - svt_free Free any extra storage associated with the SV. - -For instance, the MGVTBL structure called C<vtbl_sv> (which corresponds -to an C<mg_type> of '\0') contains: - - { magic_get, magic_set, magic_len, 0, 0 } - -Thus, when an SV is determined to be magical and of type '\0', if a get -operation is being performed, the routine C<magic_get> is called. All -the various routines for the various magical types begin with C<magic_>. -NOTE: the magic routines are not considered part of the Perl API, and may -not be exported by the Perl library. 
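The dispatch through C<mg_virtual> described above can be pictured with a small standalone C model. This is only a sketch under stated assumptions, not Perl's source: every C<toy_*> name is hypothetical, only the get and set slots are modelled, and the model simply runs each get hook before handing back the stored value. It shows a linked list of magic entries, each with an optional table of function pointers, being walked when the value is read.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for SV, MAGIC and MGVTBL. All names are hypothetical. */
struct toy_sv;
struct toy_magic;

typedef struct toy_vtbl {
    int (*get)(struct toy_sv *sv, struct toy_magic *mg);
    int (*set)(struct toy_sv *sv, struct toy_magic *mg);
} toy_vtbl;

typedef struct toy_magic {
    struct toy_magic *moremagic;  /* next entry in the linked list */
    const toy_vtbl   *virt;       /* NULL models the "(none)" table rows */
    char              type;       /* plays the role of mg_type */
} toy_magic;

typedef struct toy_sv {
    long       iv;
    toy_magic *magic;             /* head of the magic chain, or NULL */
} toy_sv;

static int toy_get_calls;         /* counts how often the hook ran */

/* Example get hook: refreshes the value and records the call. */
static int toy_counting_get(toy_sv *sv, toy_magic *mg)
{
    (void)mg;
    toy_get_calls++;
    sv->iv = 42;
    return 0;
}

/* Walk the chain and call every non-NULL get hook. */
static void toy_mg_get(toy_sv *sv)
{
    assert(sv != NULL);
    for (toy_magic *mg = sv->magic; mg != NULL; mg = mg->moremagic)
        if (mg->virt != NULL && mg->virt->get != NULL)
            mg->virt->get(sv, mg);
}

/* A "magical" read: run the get hooks, then return the stored value. */
static long toy_read(toy_sv *sv)
{
    toy_mg_get(sv);
    return sv->iv;
}
```

An entry whose table pointer is NULL is skipped, which is how the model treats magic types that carry data but have no table of their own.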
- -The current kinds of Magic Virtual Tables are: - - mg_type MGVTBL Type of magic - ------- ------ ---------------------------- - \0 vtbl_sv Special scalar variable - A vtbl_amagic %OVERLOAD hash - a vtbl_amagicelem %OVERLOAD hash element - c (none) Holds overload table (AMT) on stash - B vtbl_bm Boyer-Moore (fast string search) - D vtbl_regdata Regex match position data (@+ and @- vars) - d vtbl_regdatum Regex match position data element - E vtbl_env %ENV hash - e vtbl_envelem %ENV hash element - f vtbl_fm Formline ('compiled' format) - g vtbl_mglob m//g target / study()ed string - I vtbl_isa @ISA array - i vtbl_isaelem @ISA array element - k vtbl_nkeys scalar(keys()) lvalue - L (none) Debugger %_<filename - l vtbl_dbline Debugger %_<filename element - o vtbl_collxfrm Locale transformation - P vtbl_pack Tied array or hash - p vtbl_packelem Tied array or hash element - q vtbl_packelem Tied scalar or handle - S vtbl_sig %SIG hash - s vtbl_sigelem %SIG hash element - t vtbl_taint Taintedness - U vtbl_uvar Available for use by extensions - v vtbl_vec vec() lvalue - x vtbl_substr substr() lvalue - y vtbl_defelem Shadow "foreach" iterator variable / - smart parameter vivification - * vtbl_glob GV (typeglob) - # vtbl_arylen Array length ($#ary) - . vtbl_pos pos() lvalue - ~ (none) Available for use by extensions - -When an uppercase and lowercase letter both exist in the table, then the -uppercase letter is used to represent some kind of composite type (a list -or a hash), and the lowercase letter is used to represent an element of -that composite type. - -The '~' and 'U' magic types are defined specifically for use by -extensions and will not be used by perl itself. Extensions can use -'~' magic to 'attach' private information to variables (typically -objects). This is especially useful because there is no way for -normal perl code to corrupt this private information (unlike using -extra elements of a hash object). 
- -Similarly, 'U' magic can be used much like tie() to call a C function -any time a scalar's value is used or changed. The C<MAGIC>'s -C<mg_ptr> field points to a C<ufuncs> structure: - - struct ufuncs { - I32 (*uf_val)(IV, SV*); - I32 (*uf_set)(IV, SV*); - IV uf_index; - }; - -When the SV is read from or written to, the C<uf_val> or C<uf_set> -function will be called with C<uf_index> as the first arg and a -pointer to the SV as the second. A simple example of how to add 'U' -magic is shown below. Note that the ufuncs structure is copied by -sv_magic, so you can safely allocate it on the stack. - - void - Umagic(sv) - SV *sv; - PREINIT: - struct ufuncs uf; - CODE: - uf.uf_val = &my_get_fn; - uf.uf_set = &my_set_fn; - uf.uf_index = 0; - sv_magic(sv, 0, 'U', (char*)&uf, sizeof(uf)); - -Note that because multiple extensions may be using '~' or 'U' magic, -it is important for extensions to take extra care to avoid conflict. -Typically only using the magic on objects blessed into the same class -as the extension is sufficient. For '~' magic, it may also be -appropriate to add an I32 'signature' at the top of the private data -area and check that. - -Also note that the C<sv_set*()> and C<sv_cat*()> functions described -earlier do B<not> invoke 'set' magic on their targets. This must -be done by the user either by calling the C<SvSETMAGIC()> macro after -calling these functions, or by using one of the C<sv_set*_mg()> or -C<sv_cat*_mg()> functions. Similarly, generic C code must call the -C<SvGETMAGIC()> macro to invoke any 'get' magic if they use an SV -obtained from external sources in functions that don't handle magic. -See L<perlapi> for a description of these functions. -For example, calls to the C<sv_cat*()> functions typically need to be -followed by C<SvSETMAGIC()>, but they don't need a prior C<SvGETMAGIC()> -since their implementation handles 'get' magic. 
- -=head2 Finding Magic - - MAGIC* mg_find(SV*, int type); /* Finds the magic pointer of that type */ - -This routine returns a pointer to the C<MAGIC> structure stored in the SV. -If the SV does not have that magical feature, C<NULL> is returned. Also, -if the SV is not of type SVt_PVMG, Perl may core dump. - - int mg_copy(SV* sv, SV* nsv, const char* key, STRLEN klen); - -This routine checks to see what types of magic C<sv> has. If the mg_type -field is an uppercase letter, then the mg_obj is copied to C<nsv>, but -the mg_type field is changed to be the lowercase letter. - -=head2 Understanding the Magic of Tied Hashes and Arrays - -Tied hashes and arrays are magical beasts of the 'P' magic type. - -WARNING: As of the 5.004 release, proper usage of the array and hash -access functions requires understanding a few caveats. Some -of these caveats are actually considered bugs in the API, to be fixed -in later releases, and are bracketed with [MAYCHANGE] below. If -you find yourself actually applying such information in this section, be -aware that the behavior may change in the future, umm, without warning. - -The perl tie function associates a variable with an object that implements -the various GET, SET etc methods. To perform the equivalent of the perl -tie function from an XSUB, you must mimic this behaviour. The code below -carries out the necessary steps - firstly it creates a new hash, and then -creates a second hash which it blesses into the class which will implement -the tie methods. Lastly it ties the two hashes together, and returns a -reference to the new tied hash. Note that the code below does NOT call the -TIEHASH method in the MyTie class - -see L<Calling Perl Routines from within C Programs> for details on how -to do this. 
- - SV* - mytie() - PREINIT: - HV *hash; - HV *stash; - SV *tie; - CODE: - hash = newHV(); - tie = newRV_noinc((SV*)newHV()); - stash = gv_stashpv("MyTie", TRUE); - sv_bless(tie, stash); - hv_magic(hash, tie, 'P'); - RETVAL = newRV_noinc(hash); - OUTPUT: - RETVAL - -The C<av_store> function, when given a tied array argument, merely -copies the magic of the array onto the value to be "stored", using -C<mg_copy>. It may also return NULL, indicating that the value did not -actually need to be stored in the array. [MAYCHANGE] After a call to -C<av_store> on a tied array, the caller will usually need to call -C<mg_set(val)> to actually invoke the perl level "STORE" method on the -TIEARRAY object. If C<av_store> did return NULL, a call to -C<SvREFCNT_dec(val)> will also be usually necessary to avoid a memory -leak. [/MAYCHANGE] - -The previous paragraph is applicable verbatim to tied hash access using the -C<hv_store> and C<hv_store_ent> functions as well. - -C<av_fetch> and the corresponding hash functions C<hv_fetch> and -C<hv_fetch_ent> actually return an undefined mortal value whose magic -has been initialized using C<mg_copy>. Note the value so returned does not -need to be deallocated, as it is already mortal. [MAYCHANGE] But you will -need to call C<mg_get()> on the returned value in order to actually invoke -the perl level "FETCH" method on the underlying TIE object. Similarly, -you may also call C<mg_set()> on the return value after possibly assigning -a suitable value to it using C<sv_setsv>, which will invoke the "STORE" -method on the TIE object. [/MAYCHANGE] - -[MAYCHANGE] -In other words, the array or hash fetch/store functions don't really -fetch and store actual values in the case of tied arrays and hashes. They -merely call C<mg_copy> to attach magic to the values that were meant to be -"stored" or "fetched". Later calls to C<mg_get> and C<mg_set> actually -do the job of invoking the TIE methods on the underlying objects. 
Thus -the magic mechanism currently implements a kind of lazy access to arrays -and hashes. - -Currently (as of perl version 5.004), use of the hash and array access -functions requires the user to be aware of whether they are operating on -"normal" hashes and arrays, or on their tied variants. The API may be -changed to provide more transparent access to both tied and normal data -types in future versions. -[/MAYCHANGE] - -You would do well to understand that the TIEARRAY and TIEHASH interfaces -are mere sugar to invoke some perl method calls while using the uniform hash -and array syntax. The use of this sugar imposes some overhead (typically -about two to four extra opcodes per FETCH/STORE operation, in addition to -the creation of all the mortal variables required to invoke the methods). -This overhead will be comparatively small if the TIE methods are themselves -substantial, but if they are only a few statements long, the overhead -will not be insignificant. - -=head2 Localizing changes - -Perl has a very handy construction - - { - local $var = 2; - ... - } - -This construction is I<approximately> equivalent to - - { - my $oldvar = $var; - $var = 2; - ... - $var = $oldvar; - } - -The biggest difference is that the first construction would -reinstate the initial value of $var, irrespective of how control exits -the block: C<goto>, C<return>, C<die>/C<eval> etc. It is a little bit -more efficient as well. - -There is a way to achieve a similar task from C via Perl API: create a -I<pseudo-block>, and arrange for some changes to be automatically -undone at the end of it, either explicit, or via a non-local exit (via -die()). A I<block>-like construct is created by a pair of -C<ENTER>/C<LEAVE> macros (see L<perlcall/"Returning a Scalar">). -Such a construct may be created specially for some important localized -task, or an existing one (like boundaries of enclosing Perl -subroutine/block, or an existing pair for freeing TMPs) may be -used. 
(In the second case the overhead of additional localization must -be almost negligible.) Note that any XSUB is automatically enclosed in -an C<ENTER>/C<LEAVE> pair. - -Inside such a I<pseudo-block> the following service is available: - -=over 4 - -=item C<SAVEINT(int i)> - -=item C<SAVEIV(IV i)> - -=item C<SAVEI32(I32 i)> - -=item C<SAVELONG(long i)> - -These macros arrange things to restore the value of integer variable -C<i> at the end of enclosing I<pseudo-block>. - -=item C<SAVESPTR(s)> - -=item C<SAVEPPTR(p)> - -These macros arrange things to restore the value of pointers C<s> and -C<p>. C<s> must be a pointer of a type which survives conversion to -C<SV*> and back, C<p> should be able to survive conversion to C<char*> -and back. - -=item C<SAVEFREESV(SV *sv)> - -The refcount of C<sv> would be decremented at the end of -I<pseudo-block>. This is similar to C<sv_2mortal> in that it is also a -mechanism for doing a delayed C<SvREFCNT_dec>. However, while C<sv_2mortal> -extends the lifetime of C<sv> until the beginning of the next statement, -C<SAVEFREESV> extends it until the end of the enclosing scope. These -lifetimes can be wildly different. - -Also compare C<SAVEMORTALIZESV>. - -=item C<SAVEMORTALIZESV(SV *sv)> - -Just like C<SAVEFREESV>, but mortalizes C<sv> at the end of the current -scope instead of decrementing its reference count. This usually has the -effect of keeping C<sv> alive until the statement that called the currently -live scope has finished executing. - -=item C<SAVEFREEOP(OP *op)> - -The C<OP *> is op_free()ed at the end of I<pseudo-block>. - -=item C<SAVEFREEPV(p)> - -The chunk of memory which is pointed to by C<p> is Safefree()ed at the -end of I<pseudo-block>. - -=item C<SAVECLEARSV(SV *sv)> - -Clears a slot in the current scratchpad which corresponds to C<sv> at -the end of I<pseudo-block>. - -=item C<SAVEDELETE(HV *hv, char *key, I32 length)> - -The key C<key> of C<hv> is deleted at the end of I<pseudo-block>. 
The -string pointed to by C<key> is Safefree()ed. If one has a I<key> in -short-lived storage, the corresponding string may be reallocated like -this: - - SAVEDELETE(PL_defstash, savepv(tmpbuf), strlen(tmpbuf)); - -=item C<SAVEDESTRUCTOR(DESTRUCTORFUNC_NOCONTEXT_t f, void *p)> - -At the end of I<pseudo-block> the function C<f> is called with the -only argument C<p>. - -=item C<SAVEDESTRUCTOR_X(DESTRUCTORFUNC_t f, void *p)> - -At the end of I<pseudo-block> the function C<f> is called with the -implicit context argument (if any), and C<p>. - -=item C<SAVESTACK_POS()> - -The current offset on the Perl internal stack (cf. C<SP>) is restored -at the end of I<pseudo-block>. - -=back - -The following API list contains functions, thus one needs to -provide pointers to the modifiable data explicitly (either C pointers, -or Perlish C<GV *>s). Where the above macros take C<int>, a similar -function takes C<int *>. - -=over 4 - -=item C<SV* save_scalar(GV *gv)> - -Equivalent to Perl code C<local $gv>. - -=item C<AV* save_ary(GV *gv)> - -=item C<HV* save_hash(GV *gv)> - -Similar to C<save_scalar>, but localize C<@gv> and C<%gv>. - -=item C<void save_item(SV *item)> - -Duplicates the current value of C<SV>, on the exit from the current -C<ENTER>/C<LEAVE> I<pseudo-block> will restore the value of C<SV> -using the stored value. - -=item C<void save_list(SV **sarg, I32 maxsarg)> - -A variant of C<save_item> which takes multiple arguments via an array -C<sarg> of C<SV*> of length C<maxsarg>. - -=item C<SV* save_svref(SV **sptr)> - -Similar to C<save_scalar>, but will reinstate a C<SV *>. - -=item C<void save_aptr(AV **aptr)> - -=item C<void save_hptr(HV **hptr)> - -Similar to C<save_svref>, but localize C<AV *> and C<HV *>. - -=back - -The C<Alias> module implements localization of the basic types within the -I<caller's scope>. People who are interested in how to localize things in -the containing scope should take a look there too. 
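The save-and-restore idea behind C<ENTER>/C<LEAVE> and C<SAVEINT> can be illustrated with a standalone C model. This is a sketch under stated assumptions, not the interpreter's real save stack: the C<toy_*> names and the fixed-size arrays are hypothetical, and only C<int> values are handled. Entering a pseudo-block records a mark, each save records an address and its old value, and leaving the block unwinds back to the mark in reverse order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SAVE_MAX 64

/* One saved (address, old value) pair. All names are hypothetical. */
typedef struct {
    int *addr;
    int  old;
} toy_save;

static toy_save toy_savestack[TOY_SAVE_MAX];
static size_t   toy_save_top;                   /* next free save slot */
static size_t   toy_scope_marks[TOY_SAVE_MAX];
static size_t   toy_scope_top;                  /* nesting depth */

/* Like ENTER: remember where this pseudo-block's saves begin. */
static void toy_enter(void)
{
    assert(toy_scope_top < TOY_SAVE_MAX);
    toy_scope_marks[toy_scope_top++] = toy_save_top;
}

/* Like SAVEINT(i): record the current value so it can be restored. */
static void toy_saveint(int *p)
{
    assert(toy_save_top < TOY_SAVE_MAX);
    toy_savestack[toy_save_top].addr = p;
    toy_savestack[toy_save_top].old  = *p;
    toy_save_top++;
}

/* Like LEAVE: undo the saves in reverse order, back to the mark. */
static void toy_leave(void)
{
    assert(toy_scope_top > 0);
    size_t mark = toy_scope_marks[--toy_scope_top];
    while (toy_save_top > mark) {
        toy_save *s = &toy_savestack[--toy_save_top];
        *s->addr = s->old;
    }
}

static int toy_var = 1;

/* Mirrors "{ local $var = 2; ... }": returns 10*inside + after. */
static int toy_localize_demo(void)
{
    int inside;
    toy_enter();
    toy_saveint(&toy_var);
    toy_var = 2;
    inside = toy_var;
    toy_leave();
    return inside * 10 + toy_var;
}
```

Because the restore happens in the leave step rather than at the point of assignment, nested pseudo-blocks unwind correctly as long as every enter is paired with a leave.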
- -=head1 Subroutines - -=head2 XSUBs and the Argument Stack - -The XSUB mechanism is a simple way for Perl programs to access C subroutines. -An XSUB routine will have a stack that contains the arguments from the Perl -program, and a way to map from the Perl data structures to a C equivalent. - -The stack arguments are accessible through the C<ST(n)> macro, which returns -the C<n>'th stack argument. Argument 0 is the first argument passed in the -Perl subroutine call. These arguments are C<SV*>, and can be used anywhere -an C<SV*> is used. - -Most of the time, output from the C routine can be handled through use of -the RETVAL and OUTPUT directives. However, there are some cases where the -argument stack is not already long enough to handle all the return values. -An example is the POSIX tzname() call, which takes no arguments, but returns -two, the local time zone's standard and summer time abbreviations. - -To handle this situation, the PPCODE directive is used and the stack is -extended using the macro: - - EXTEND(SP, num); - -where C<SP> is the macro that represents the local copy of the stack pointer, -and C<num> is the number of elements the stack should be extended by. - -Now that there is room on the stack, values can be pushed on it using the -macros to push IVs, doubles, strings, and SV pointers respectively: - - PUSHi(IV) - PUSHn(double) - PUSHp(char*, I32) - PUSHs(SV*) - -And now the Perl program calling C<tzname>, the two values will be assigned -as in: - - ($standard_abbrev, $summer_abbrev) = POSIX::tzname; - -An alternate (and possibly simpler) method to pushing values on the stack is -to use the macros: - - XPUSHi(IV) - XPUSHn(double) - XPUSHp(char*, I32) - XPUSHs(SV*) - -These macros automatically adjust the stack for you, if needed. Thus, you -do not need to call C<EXTEND> to extend the stack. -However, see L</Putting a C value on Perl stack> - -For more information, consult L<perlxs> and L<perlxstut>. 
- -=head2 Calling Perl Routines from within C Programs - -There are four routines that can be used to call a Perl subroutine from -within a C program. These four are: - - I32 call_sv(SV*, I32); - I32 call_pv(const char*, I32); - I32 call_method(const char*, I32); - I32 call_argv(const char*, I32, register char**); - -The routine most often used is C<call_sv>. The C<SV*> argument -contains either the name of the Perl subroutine to be called, or a -reference to the subroutine. The second argument consists of flags -that control the context in which the subroutine is called, whether -or not the subroutine is being passed arguments, how errors should be -trapped, and how to treat return values. - -All four routines return the number of arguments that the subroutine returned -on the Perl stack. - -These routines used to be called C<perl_call_sv> etc., before Perl v5.6.0, -but those names are now deprecated; macros of the same name are provided for -compatibility. - -When using any of these routines (except C<call_argv>), the programmer -must manipulate the Perl stack. These include the following macros and -functions: - - dSP - SP - PUSHMARK() - PUTBACK - SPAGAIN - ENTER - SAVETMPS - FREETMPS - LEAVE - XPUSH*() - POP*() - -For a detailed description of calling conventions from C to Perl, -consult L<perlcall>. - -=head2 Memory Allocation - -All memory meant to be used with the Perl API functions should be manipulated -using the macros described in this section. The macros provide the necessary -transparency between differences in the actual malloc implementation that is -used within perl. - -It is suggested that you enable the version of malloc that is distributed -with Perl. It keeps pools of various sizes of unallocated memory in -order to satisfy allocation requests more quickly. However, on some -platforms, it may cause spurious malloc or free errors. 
- - New(x, pointer, number, type); - Newc(x, pointer, number, type, cast); - Newz(x, pointer, number, type); - -These three macros are used to initially allocate memory. - -The first argument C<x> was a "magic cookie" that was used to keep track -of who called the macro, to help when debugging memory problems. However, -the current code makes no use of this feature (most Perl developers now -use run-time memory checkers), so this argument can be any number. - -The second argument C<pointer> should be the name of a variable that will -point to the newly allocated memory. - -The third and fourth arguments C<number> and C<type> specify how many of -the specified type of data structure should be allocated. The argument -C<type> is passed to C<sizeof>. The final argument to C<Newc>, C<cast>, -should be used if the C<pointer> argument is different from the C<type> -argument. - -Unlike the C<New> and C<Newc> macros, the C<Newz> macro calls C<memzero> -to zero out all the newly allocated memory. - - Renew(pointer, number, type); - Renewc(pointer, number, type, cast); - Safefree(pointer); - -These three macros are used to change a memory buffer size or to free a -piece of memory no longer needed. The arguments to C<Renew> and C<Renewc> -match those of C<New> and C<Newc> with the exception of not needing the -"magic cookie" argument. - - Move(source, dest, number, type); - Copy(source, dest, number, type); - Zero(dest, number, type); - -These three macros are used to move, copy, or zero out previously allocated -memory. The C<source> and C<dest> arguments point to the source and -destination starting points. Perl will move, copy, or zero out C<number> -instances of the size of the C<type> data structure (using the C<sizeof> -operator). - -=head2 PerlIO - -The most recent development releases of Perl have been experimenting with -removing Perl's dependency on the "normal" standard I/O suite and allowing -other stdio implementations to be used.
This involves creating a new -abstraction layer that then calls whichever implementation of stdio Perl -was compiled with. All XSUBs should now use the functions in the PerlIO -abstraction layer and not make any assumptions about what kind of stdio -is being used. - -For a complete description of the PerlIO abstraction, consult L<perlapio>. - -=head2 Putting a C value on Perl stack - -A lot of opcodes (this is an elementary operation in the internal perl -stack machine) put an SV* on the stack. However, as an optimization -the corresponding SV is (usually) not recreated each time. The opcodes -reuse specially assigned SVs (I<target>s) which are (as a corollary) -not constantly freed/created. - -Each of the targets is created only once (but see -L<Scratchpads and recursion> below), and when an opcode needs to put -an integer, a double, or a string on stack, it just sets the -corresponding parts of its I<target> and puts the I<target> on stack. - -The macro to put this target on stack is C<PUSHTARG>, and it is -directly used in some opcodes, as well as indirectly in zillions of -others, which use it via C<(X)PUSH[pni]>. - -Because the target is reused, you must be careful when pushing multiple -values on the stack. The following code will not do what you think: - - XPUSHi(10); - XPUSHi(20); - -This translates as "set C<TARG> to 10, push a pointer to C<TARG> onto -the stack; set C<TARG> to 20, push a pointer to C<TARG> onto the stack". -At the end of the operation, the stack does not contain the values 10 -and 20, but actually contains two pointers to C<TARG>, which we have set -to 20. If you need to push multiple different values, use C<XPUSHs>, -which bypasses C<TARG>. - -On a related note, if you do use C<(X)PUSH[npi]>, then you're going to -need a C<dTARG> in your variable declarations so that the C<*PUSH*> -macros can make use of the local variable C<TARG>. 
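The pitfall with a reused target can be reproduced in a standalone C model. This is a sketch under stated assumptions rather than the real macros: the C<toy_*> names are hypothetical, the stack is a small fixed array, and a single static structure plays the role of C<TARG>. Pushing two integers through the target leaves two pointers to the same storage on the stack, while pushing pointers to distinct values keeps them apart.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_STACK_MAX 8

/* Hypothetical stand-in for an SV holding only an integer. */
typedef struct {
    long iv;
} toy_sv;

static toy_sv  toy_targ;                  /* the single reusable target */
static toy_sv *toy_stack[TOY_STACK_MAX];  /* the stack holds pointers   */
static size_t  toy_sp;

/* Like XPUSHi: set the target, then push a pointer to it. */
static void toy_xpushi(long v)
{
    assert(toy_sp < TOY_STACK_MAX);
    toy_targ.iv = v;
    toy_stack[toy_sp++] = &toy_targ;
}

/* Like XPUSHs: push a pointer to a caller-supplied value. */
static void toy_xpushs(toy_sv *sv)
{
    assert(toy_sp < TOY_STACK_MAX);
    toy_stack[toy_sp++] = sv;
}

/* Read the integer that stack slot i currently points at. */
static long toy_stack_iv(size_t i)
{
    assert(i < toy_sp);
    return toy_stack[i]->iv;
}

/* Empty the toy stack. */
static void toy_reset(void)
{
    toy_sp = 0;
}
```

In real XS code the usual way to supply distinct values is to push a fresh mortal SV each time, for example C<XPUSHs(sv_2mortal(newSViv(10)))>.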
- -=head2 Scratchpads - -The question remains as to when the SVs which are I<target>s for opcodes -are created. The answer is that they are created when the current unit -- -a subroutine or a file (for opcodes for statements outside of -subroutines) -- is compiled. During this time a special anonymous Perl -array is created, which is called a scratchpad for the current -unit. - -A scratchpad keeps SVs which are lexicals for the current unit and are -targets for opcodes. One can deduce that an SV lives on a scratchpad -by looking at its flags: lexicals have C<SVs_PADMY> set, and -I<target>s have C<SVs_PADTMP> set. - -The correspondence between OPs and I<target>s is not 1-to-1. Different -OPs in the compile tree of the unit can use the same target, if this -would not conflict with the expected life of the temporary. - -=head2 Scratchpads and recursion - -In fact it is not 100% true that a compiled unit contains a pointer to -the scratchpad AV. In fact it contains a pointer to an AV of -(initially) one element, and this element is the scratchpad AV. Why do -we need an extra level of indirection? - -The answer is B<recursion>, and maybe (sometime soon) B<threads>. Both -these can create several execution pointers going into the same -subroutine. For the subroutine-child not to write over the temporaries -for the subroutine-parent (lifespan of which covers the call to the -child), the parent and the child should have different -scratchpads. (I<And> the lexicals should be separate anyway!) - -So each subroutine is born with an array of scratchpads (of length 1). -On each entry to the subroutine it is checked that the current -depth of the recursion is not more than the length of this array, and -if it is, a new scratchpad is created and pushed into the array. - -The I<target>s on this scratchpad are C<undef>s, but they are already -marked with correct flags. - -=head1 Compiled code - -=head2 Code tree - -Here we describe the internal form your code is converted to by -Perl.
Start with a simple example: - - $a = $b + $c; - -This is converted to a tree similar to this one: - - assign-to - / \ - + $a - / \ - $b $c - -(but slightly more complicated). This tree reflects the way Perl -parsed your code, but has nothing to do with the execution order. -There is an additional "thread" going through the nodes of the tree -which shows the order of execution of the nodes. In our simplified -example above it looks like: - - $b ---> $c ---> + ---> $a ---> assign-to - -But with the actual compile tree for C<$a = $b + $c> it is different: -some nodes are I<optimized away>. As a corollary, though the actual tree -contains more nodes than our simplified example, the execution order -is the same as in our example. - -=head2 Examining the tree - -If you have your perl compiled for debugging (usually done with C<-D -optimize=-g> on C<Configure> command line), you may examine the -compiled tree by specifying C<-Dx> on the Perl command line. The -output takes several lines per node, and for C<$b+$c> it looks like -this: - - 5 TYPE = add ===> 6 - TARG = 1 - FLAGS = (SCALAR,KIDS) - { - TYPE = null ===> (4) - (was rv2sv) - FLAGS = (SCALAR,KIDS) - { - 3 TYPE = gvsv ===> 4 - FLAGS = (SCALAR) - GV = main::b - } - } - { - TYPE = null ===> (5) - (was rv2sv) - FLAGS = (SCALAR,KIDS) - { - 4 TYPE = gvsv ===> 5 - FLAGS = (SCALAR) - GV = main::c - } - } - -This tree has 5 nodes (one per C<TYPE> specifier), but only 3 of them are -not optimized away (one per number in the left column). The immediate -children of the given node correspond to C<{}> pairs on the same level -of indentation, thus this listing corresponds to the tree: - - add - / \ - null null - | | - gvsv gvsv - -The execution order is indicated by C<===E<gt>> marks, thus it is C<3 -4 5 6> (node C<6> is not included in the above listing), i.e., -C<gvsv gvsv add whatever>. - -Each of these nodes represents an op, a fundamental operation inside the -Perl core.
The code which implements each operation can be found in the -F<pp*.c> files; the function which implements the op with type C<gvsv> -is C<pp_gvsv>, and so on. As the tree above shows, different ops have -different numbers of children: C<add> is a binary operator, as one would -expect, and so has two children. To accommodate the various different -numbers of children, there are various types of op data structure, and -they link together in different ways. - -The simplest type of op structure is C<OP>: this has no children. Unary -operators, C<UNOP>s, have one child, and this is pointed to by the -C<op_first> field. Binary operators (C<BINOP>s) have not only an -C<op_first> field but also an C<op_last> field. The most complex type of -op is a C<LISTOP>, which has any number of children. In this case, the -first child is pointed to by C<op_first> and the last child by -C<op_last>. The children in between can be found by iteratively -following the C<op_sibling> pointer from the first child to the last. - -There are also two other op types: a C<PMOP> holds a regular expression, -and has no children, and a C<LOOP> may or may not have children. If the -C<op_children> field is non-zero, it behaves like a C<LISTOP>. To -complicate matters, if a C<UNOP> is actually a C<null> op after -optimization (see L</Compile pass 2: context propagation>) it will still -have children in accordance with its former type. - -=head2 Compile pass 1: check routines - -The tree is created by the compiler while I<yacc> code feeds it -the constructions it recognizes. Since I<yacc> works bottom-up, so does -the first pass of perl compilation. - -What makes this pass interesting for perl developers is that some -optimization may be performed on this pass. This is optimization by -so-called "check routines". The correspondence between node names -and corresponding check routines is described in F<opcode.pl> (do not -forget to run C<make regen_headers> if you modify this file). 
- -A check routine is called when the node is fully constructed except -for the execution-order thread. Since at this time there are no -back-links to the currently constructed node, one can do most any -operation to the top-level node, including freeing it and/or creating -new nodes above/below it. - -The check routine returns the node which should be inserted into the -tree (if the top-level node was not modified, check routine returns -its argument). - -By convention, check routines have names C<ck_*>. They are usually -called from C<new*OP> subroutines (or C<convert>) (which in turn are -called from F<perly.y>). - -=head2 Compile pass 1a: constant folding - -Immediately after the check routine is called the returned node is -checked for being compile-time executable. If it is (the value is -judged to be constant) it is immediately executed, and a I<constant> -node with the "return value" of the corresponding subtree is -substituted instead. The subtree is deleted. - -If constant folding was not performed, the execution-order thread is -created. - -=head2 Compile pass 2: context propagation - -When a context for a part of compile tree is known, it is propagated -down through the tree. At this time the context can have 5 values -(instead of 2 for runtime context): void, boolean, scalar, list, and -lvalue. In contrast with the pass 1 this pass is processed from top -to bottom: a node's context determines the context for its children. - -Additional context-dependent optimizations are performed at this time. -Since at this moment the compile tree contains back-references (via -"thread" pointers), nodes cannot be free()d now. To allow -optimized-away nodes at this stage, such nodes are null()ified instead -of free()ing (i.e. their type is changed to OP_NULL). - -=head2 Compile pass 3: peephole optimization - -After the compile tree for a subroutine (or for an C<eval> or a file) -is created, an additional pass over the code is performed. 
This pass -is neither top-down nor bottom-up, but in the execution order (with -additional complications for conditionals). These optimizations are -done in the subroutine peep(). Optimizations performed at this stage -are subject to the same restrictions as in the pass 2. - -=head1 Examining internal data structures with the C<dump> functions - -To aid debugging, the source file F<dump.c> contains a number of -functions which produce formatted output of internal data structures. - -The most commonly used of these functions is C<Perl_sv_dump>; it's used -for dumping SVs, AVs, HVs, and CVs. The C<Devel::Peek> module calls -C<sv_dump> to produce debugging output from Perl-space, so users of that -module should already be familiar with its format. - -C<Perl_op_dump> can be used to dump an C<OP> structure or any of its -derivatives, and produces output similar to C<perl -Dx>; in fact, -C<Perl_dump_eval> will dump the main root of the code being evaluated, -exactly like C<-Dx>. - -Other useful functions are C<Perl_dump_sub>, which turns a C<GV> into an -op tree, C<Perl_dump_packsubs> which calls C<Perl_dump_sub> on all the -subroutines in a package like so: (Thankfully, these are all xsubs, so -there is no op tree) - - (gdb) print Perl_dump_packsubs(PL_defstash) - - SUB attributes::bootstrap = (xsub 0x811fedc 0) - - SUB UNIVERSAL::can = (xsub 0x811f50c 0) - - SUB UNIVERSAL::isa = (xsub 0x811f304 0) - - SUB UNIVERSAL::VERSION = (xsub 0x811f7ac 0) - - SUB DynaLoader::boot_DynaLoader = (xsub 0x805b188 0) - -and C<Perl_dump_all>, which dumps all the subroutines in the stash and -the op tree of the main root. - -=head1 How multiple interpreters and concurrency are supported - -=head2 Background and PERL_IMPLICIT_CONTEXT - -The Perl interpreter can be regarded as a closed box: it has an API -for feeding it code or otherwise making it do things, but it also has -functions for its own use.
This smells a lot like an object, and -there are ways for you to build Perl so that you can have multiple -interpreters, with one interpreter represented either as a C++ object, -a C structure, or inside a thread. The thread, the C structure, or -the C++ object will contain all the context, the state of that -interpreter. - -Three macros control the major Perl build flavors: MULTIPLICITY, -USE_THREADS and PERL_OBJECT. The MULTIPLICITY build has a C structure -that packages all the interpreter state, there is a similar thread-specific -data structure under USE_THREADS, and the (now deprecated) PERL_OBJECT -build has a C++ class to maintain interpreter state. In all three cases, -PERL_IMPLICIT_CONTEXT is also normally defined, and enables the -support for passing in a "hidden" first argument that represents all three -data structures. - -All this obviously requires a way for the Perl internal functions to be -C++ methods, subroutines taking some kind of structure as the first -argument, or subroutines taking nothing as the first argument. To -enable these three very different ways of building the interpreter, -the Perl source (as it does in so many other situations) makes heavy -use of macros and subroutine naming conventions. - -First problem: deciding which functions will be public API functions and -which will be private. All functions whose names begin C<S_> are private -(think "S" for "secret" or "static"). All other functions begin with -"Perl_", but just because a function begins with "Perl_" does not mean it is -part of the API. (See L</Internal Functions>.) The easiest way to be B<sure> a -function is part of the API is to find its entry in L<perlapi>. -If it exists in L<perlapi>, it's part of the API. If it doesn't, and you -think it should be (i.e., you need it for your extension), send mail via -L<perlbug> explaining why you think it should be. 
- -Second problem: there must be a syntax so that the same subroutine -declarations and calls can pass a structure as their first argument, -or pass nothing. To solve this, the subroutines are named and -declared in a particular way. Here's a typical start of a static -function used within the Perl guts: - - STATIC void - S_incline(pTHX_ char *s) - -STATIC becomes "static" in C, and is #define'd to nothing in C++. - -A public function (i.e. part of the internal API, but not necessarily -sanctioned for use in extensions) begins like this: - - void - Perl_sv_setsv(pTHX_ SV* dsv, SV* ssv) - -C<pTHX_> is one of a number of macros (in perl.h) that hide the -details of the interpreter's context. THX stands for "thread", "this", -or "thingy", as the case may be. (And no, George Lucas is not involved. :-) -The first character could be 'p' for a B<p>rototype, 'a' for B<a>rgument, -or 'd' for B<d>eclaration, so we have C<pTHX>, C<aTHX> and C<dTHX>, and -their variants. - -When Perl is built without options that set PERL_IMPLICIT_CONTEXT, there is no -first argument containing the interpreter's context. The trailing underscore -in the pTHX_ macro indicates that the macro expansion needs a comma -after the context argument because other arguments follow it. If -PERL_IMPLICIT_CONTEXT is not defined, pTHX_ will be ignored, and the -subroutine is not prototyped to take the extra argument. The form of the -macro without the trailing underscore is used when there are no additional -explicit arguments. - -When a core function calls another, it must pass the context. This -is normally hidden via macros. Consider C<sv_setsv>. 
It expands into -something like this: - - #ifdef PERL_IMPLICIT_CONTEXT - #define sv_setsv(a,b) Perl_sv_setsv(aTHX_ a, b) - /* can't do this for vararg functions, see below */ - #else - #define sv_setsv Perl_sv_setsv - #endif - -This works well, and means that XS authors can gleefully write: - - sv_setsv(foo, bar); - -and still have it work under all the modes Perl could have been -compiled with. - -Under PERL_OBJECT in the core, that will translate to either: - - CPerlObj::Perl_sv_setsv(foo,bar); # in CPerlObj functions, - # C++ takes care of 'this' - or - - pPerl->Perl_sv_setsv(foo,bar); # in truly static functions, - # see objXSUB.h - -Under PERL_OBJECT in extensions (aka PERL_CAPI), or under -MULTIPLICITY/USE_THREADS with PERL_IMPLICIT_CONTEXT in both core -and extensions, it will become: - - Perl_sv_setsv(aTHX_ foo, bar); # the canonical Perl "API" - # for all build flavors - -This doesn't work so cleanly for varargs functions, though, as macros -imply that the number of arguments is known in advance. Instead we -either need to spell them out fully, passing C<aTHX_> as the first -argument (the Perl core tends to do this with functions like -Perl_warner), or use a context-free version. - -The context-free version of Perl_warner is called -Perl_warner_nocontext, and does not take the extra argument. Instead -it does dTHX; to get the context from thread-local storage. We -C<#define warner Perl_warner_nocontext> so that extensions get source -compatibility at the expense of performance. (Passing an arg is -cheaper than grabbing it from thread-local storage.) - -You can ignore [pad]THX[xo] when browsing the Perl headers/sources. -Those are strictly for use within the core. Extensions and embedders -need only be aware of [pad]THX. - -=head2 So what happened to dTHR? - -C<dTHR> was introduced in perl 5.005 to support the older thread model. -The older thread model now uses the C<THX> mechanism to pass context -pointers around, so C<dTHR> is not useful any more.
Perl 5.6.0 and -later still have it for backward source compatibility, but it is defined -to be a no-op. - -=head2 How do I use all this in extensions? - -When Perl is built with PERL_IMPLICIT_CONTEXT, extensions that call -any functions in the Perl API will need to pass the initial context -argument somehow. The kicker is that you will need to write it in -such a way that the extension still compiles when Perl hasn't been -built with PERL_IMPLICIT_CONTEXT enabled. - -There are three ways to do this. First, the easy but inefficient way, -which is also the default, in order to maintain source compatibility -with extensions: whenever XSUB.h is #included, it redefines the aTHX -and aTHX_ macros to call a function that will return the context. -Thus, something like: - - sv_setsv(asv, bsv); - -in your extension will translate to this when PERL_IMPLICIT_CONTEXT is -in effect: - - Perl_sv_setsv(Perl_get_context(), asv, bsv); - -or to this otherwise: - - Perl_sv_setsv(asv, bsv); - -You have to do nothing new in your extension to get this; since -the Perl library provides Perl_get_context(), it will all just -work. - -The second, more efficient way is to use the following template for -your Foo.xs: - - #define PERL_NO_GET_CONTEXT /* we want efficiency */ - #include "EXTERN.h" - #include "perl.h" - #include "XSUB.h" - - static SV *my_private_function(int arg1, int arg2); - - static SV * - my_private_function(int arg1, int arg2) - { - dTHX; /* fetch context */ - ... call many Perl API functions ... - } - - [... etc ...] - - MODULE = Foo PACKAGE = Foo - - /* typical XSUB */ - - void - my_xsub(arg) - int arg - CODE: - my_private_function(arg, 10); - -Note that the only two changes from the normal way of writing an -extension are the addition of a C<#define PERL_NO_GET_CONTEXT> before -including the Perl headers, and a C<dTHX;> declaration at -the start of every function that will call the Perl API.
(You'll -know which functions need this, because the C compiler will complain -that there's an undeclared identifier in those functions.) No changes -are needed for the XSUBs themselves, because the XS() macro is -correctly defined to pass in the implicit context if needed. - -The third, even more efficient way is to ape how it is done within -the Perl guts: - - - #define PERL_NO_GET_CONTEXT /* we want efficiency */ - #include "EXTERN.h" - #include "perl.h" - #include "XSUB.h" - - /* pTHX_ only needed for functions that call Perl API */ - static SV *my_private_function(pTHX_ int arg1, int arg2); - - static SV * - my_private_function(pTHX_ int arg1, int arg2) - { - /* dTHX; not needed here, because THX is an argument */ - ... call Perl API functions ... - } - - [... etc ...] - - MODULE = Foo PACKAGE = Foo - - /* typical XSUB */ - - void - my_xsub(arg) - int arg - CODE: - my_private_function(aTHX_ arg, 10); - -This implementation never has to fetch the context using a function -call, since it is always passed as an extra argument. Depending on -your needs for simplicity or efficiency, you may mix the previous -two approaches freely. - -Never add a comma after C<pTHX> yourself--always use the form of the -macro with the underscore for functions that take explicit arguments, -or the form without the underscore for functions with no explicit arguments. - -=head2 Should I do anything special if I call perl from multiple threads? - -If you create interpreters in one thread and then proceed to call them in -another, you need to make sure perl's own Thread Local Storage (TLS) slot is -initialized correctly in each of those threads. - -The C<perl_alloc> and C<perl_clone> API functions will automatically set -the TLS slot to the interpreter they created, so that there is no need to do -anything special if the interpreter is always accessed in the same thread that -created it, and that thread did not create or call any other interpreters -afterwards.
If that is not the case, you have to set the TLS slot of the -thread before calling any functions in the Perl API on that particular -interpreter. This is done by calling the C<PERL_SET_CONTEXT> macro in that -thread as the first thing you do: - - /* do this before doing anything else with some_perl */ - PERL_SET_CONTEXT(some_perl); - - ... other Perl API calls on some_perl go here ... - -=head2 Future Plans and PERL_IMPLICIT_SYS - -Just as PERL_IMPLICIT_CONTEXT provides a way to bundle up everything -that the interpreter knows about itself and pass it around, so too are -there plans to allow the interpreter to bundle up everything it knows -about the environment it's running on. This is enabled with the -PERL_IMPLICIT_SYS macro. Currently it only works with PERL_OBJECT -and USE_THREADS on Windows (see inside iperlsys.h). - -This allows the ability to provide an extra pointer (called the "host" -environment) for all the system calls. This makes it possible for -all the system stuff to maintain their own state, broken down into -seven C structures. These are thin wrappers around the usual system -calls (see win32/perllib.c) for the default perl executable, but for a -more ambitious host (like the one that would do fork() emulation) all -the extra work needed to pretend that different interpreters are -actually different "processes", would be done here. - -The Perl engine/interpreter and the host are orthogonal entities. -There could be one or more interpreters in a process, and one or -more "hosts", with free association between them. - -=head1 Internal Functions - -All of Perl's internal functions which will be exposed to the outside -world are prefixed by C<Perl_> so that they will not conflict with XS -functions or functions used in a program in which Perl is embedded. -Similarly, all global variables begin with C<PL_>.
(By convention, -static functions start with C<S_>.) - -Inside the Perl core, you can get at the functions either with or -without the C<Perl_> prefix, thanks to a bunch of defines that live in -F<embed.h>. This header file is generated automatically from -F<embed.pl>. F<embed.pl> also creates the prototyping header files for -the internal functions, generates the documentation and a lot of other -bits and pieces. It's important that when you add a new function to the -core or change an existing one, you change the data in the table at the -end of F<embed.pl> as well. Here's a sample entry from that table: - - Apd |SV** |av_fetch |AV* ar|I32 key|I32 lval - -The second column is the return type, the third column the name. Columns -after that are the arguments. The first column is a set of flags: - -=over 3 - -=item A - -This function is a part of the public API. - -=item p - -This function has a C<Perl_> prefix; i.e., it is defined as C<Perl_av_fetch> - -=item d - -This function has documentation using the C<apidoc> feature which we'll -look at in a second. - -=back - -Other available flags are: - -=over 3 - -=item s - -This is a static function and is defined as C<S_whatever>, and usually -called within the sources as C<whatever(...)>. - -=item n - -This does not use C<aTHX_> and C<pTHX> to pass interpreter context. (See -L<perlguts/Background and PERL_IMPLICIT_CONTEXT>.) - -=item r - -This function never returns; C<croak>, C<exit> and friends. - -=item f - -This function takes a variable number of arguments, C<printf> style. -The argument list should end with C<...>, like this: - - Afprd |void |croak |const char* pat|... - -=item M - -This function is part of the experimental development API, and may change -or disappear without notice. - -=item o - -This function should not have a compatibility macro to define, say, -C<Perl_parse> to C<parse>. It must be called as C<Perl_parse>. - -=item j - -This function is not a member of C<CPerlObj>.
If you don't know -what this means, don't use it. - -=item x - -This function isn't exported out of the Perl core. - -=back - -If you edit F<embed.pl>, you will need to run C<make regen_headers> to -force a rebuild of F<embed.h> and other auto-generated files. - -=head2 Formatted Printing of IVs, UVs, and NVs - -If you are printing IVs, UVs, or NVs, then instead of the stdio(3) style -formatting codes like C<%d>, C<%ld>, C<%f>, you should use the -following macros for portability: - - IVdf IV in decimal - UVuf UV in decimal - UVof UV in octal - UVxf UV in hexadecimal - NVef NV %e-like - NVff NV %f-like - NVgf NV %g-like - -These will take care of 64-bit integers and long doubles. -For example: - - printf("IV is %"IVdf"\n", iv); - -The IVdf will expand to whatever is the correct format for the IVs. - -If you are printing addresses of pointers, use UVxf combined -with PTR2UV(); do not use %lx or %p. - -=head2 Pointer-To-Integer and Integer-To-Pointer - -Because pointer size does not necessarily equal integer size, -use the following macros to do it right. - - PTR2UV(pointer) - PTR2IV(pointer) - PTR2NV(pointer) - INT2PTR(pointertotype, integer) - -For example: - - IV iv = ...; - SV *sv = INT2PTR(SV*, iv); - -and - - AV *av = ...; - UV uv = PTR2UV(av); - -=head2 Source Documentation - -There's an effort going on to document the internal functions and -automatically produce reference manuals from them - L<perlapi> is one -such manual which details all the functions which are available to XS -writers. L<perlintern> is the autogenerated manual for the functions -which are not part of the API and are supposedly for internal use only. - -Source documentation is created by putting POD comments into the C -source, like this: - - /* - =for apidoc sv_setiv - - Copies an integer into the given SV. Does not handle 'set' magic. See - C<sv_setiv_mg>. - - =cut - */ - -Please try and supply some documentation if you add functions to the -Perl core.
- -=head1 Unicode Support - -Perl 5.6.0 introduced Unicode support. It's important for porters and XS -writers to understand this support and make sure that the code they -write does not corrupt Unicode data. - -=head2 What B<is> Unicode, anyway? - -In the olden, less enlightened times, we all used to use ASCII. Most of -us did, anyway. The big problem with ASCII is that it's American. Well, -no, that's not actually the problem; the problem is that it's not -particularly useful for people who don't use the Roman alphabet. What -used to happen was that particular languages would stick their own -alphabet in the upper range of the sequence, between 128 and 255. Of -course, we then ended up with plenty of variants that weren't quite -ASCII, and the whole point of it being a standard was lost. - -Worse still, if you've got a language like Chinese or -Japanese that has hundreds or thousands of characters, then you really -can't fit them into a mere 256, so they had to forget about ASCII -altogether, and build their own systems using pairs of numbers to refer -to one character. - -To fix this, some people formed Unicode, Inc. and -produced a new character set containing all the characters you can -possibly think of and more. There are several ways of representing these -characters, and the one Perl uses is called UTF8. UTF8 uses -a variable number of bytes to represent a character, instead of just -one. You can learn more about Unicode at http://www.unicode.org/ - -=head2 How can I recognise a UTF8 string? - -You can't. This is because UTF8 data is stored in bytes just like -non-UTF8 data. The Unicode character 200, (C<0xC8> for you hex types) -capital E with a grave accent, is represented by the two bytes -C<v195.136>. Unfortunately, the non-Unicode string C<chr(195).chr(136)> -has that byte sequence as well. So you can't tell just by looking - this -is what makes Unicode input an interesting problem.
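To make the ambiguity concrete, here is a minimal standalone C sketch. It does not use the Perl API: C<sketch_is_utf8> is a hypothetical name invented for illustration, and the check is structural only (it does not reject overlong encodings or surrogates). A buffer holding the two bytes 0xC3 0xA9 passes the check, yet the same two bytes are also an ordinary two-character string in an 8-bit encoding, so passing says nothing about what the data was meant to be.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns 1 if the len bytes at s are structurally valid UTF-8
 * (correct lead bytes followed by the right number of 10xxxxxx
 * continuation bytes), 0 otherwise. Overlong forms and surrogates
 * are deliberately not checked in this sketch. */
static int
sketch_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t extra;   /* continuation bytes expected after b */
        size_t k;

        if (b < 0x80)
            extra = 0;                  /* 0xxxxxxx */
        else if ((b & 0xE0) == 0xC0)
            extra = 1;                  /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0)
            extra = 2;                  /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0)
            extra = 3;                  /* 11110xxx */
        else
            return 0;                   /* stray continuation or bad lead */

        if (extra > len - i - 1)
            return 0;                   /* sequence runs off the end */

        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */

        i += extra + 1;
    }
    return 1;
}
```

A true result here only means "this could be UTF8"; deciding whether it actually is remains the caller's job.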
- -The API function C<is_utf8_string> can help; it'll tell you if a string -contains only valid UTF8 characters. However, it can't do the work for -you. On a character-by-character basis, C<is_utf8_char> will tell you -whether the current character in a string is valid UTF8. - -=head2 How does UTF8 represent Unicode characters? - -As mentioned above, UTF8 uses a variable number of bytes to store a -character. Characters with values 0...127 are stored in one byte, just -like good ol' ASCII. Character 128 is stored as C<v194.128>; this -continues up to character 191, which is C<v194.191>. Now we've run out of -bits (191 is binary C<10111111>) so we move on; 192 is C<v195.128>. And -so it goes on, moving to three bytes at character 2048. - -Assuming you know you're dealing with a UTF8 string, you can find out -how long the first character in it is with the C<UTF8SKIP> macro: - - char *utf = "\305\233\340\240\201"; - I32 len; - - len = UTF8SKIP(utf); /* len is 2 here */ - utf += len; - len = UTF8SKIP(utf); /* len is 3 here */ - -Another way to skip over characters in a UTF8 string is to use -C<utf8_hop>, which takes a string and a number of characters to skip -over. You're on your own about bounds checking, though, so don't use it -lightly.
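The byte layout described in this section can be sketched in standalone C. This is only an illustration under stated assumptions, not how Perl implements C<UTF8SKIP> or C<uv_to_utf8>: the names C<sketch_utf8_len> and C<sketch_utf8_encode> are hypothetical, the encoder handles only code points below 0x10000, and it makes no attempt to reject surrogates.

```c
#include <stddef.h>

/* Hypothetical helper, not a Perl API function: length in bytes of the
 * UTF-8 sequence introduced by lead byte b, or 0 if b cannot start a
 * sequence (for example a 10xxxxxx continuation byte). */
static size_t
sketch_utf8_len(unsigned char b)
{
    if (b < 0x80)           return 1;   /* 0xxxxxxx: one byte, as in ASCII */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Hypothetical helper: write the UTF-8 form of cp into out (at least
 * 3 bytes of room) and return the number of bytes written, or 0 for
 * code points this sketch does not handle (0x10000 and above). */
static size_t
sketch_utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                    /* fits in a single byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

Feeding the boundary values from the text through C<sketch_utf8_encode> (191, 192, 2048) is a quick way to check the byte pairs quoted above, and C<sketch_utf8_len> applied to the bytes C<\305> and C<\340> reproduces the lengths shown in the C<UTF8SKIP> example.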
- -All bytes in a multi-byte UTF8 character will have the high bit set, so -you can test if you need to do something special with this character -like this: - - UV uv; - - if (*utf & 0x80) - /* Must treat this as UTF8 */ - uv = utf8_to_uv(utf); - else - /* OK to treat this character as a byte */ - uv = *utf; - -You can also see in that example that we use C<utf8_to_uv> to get the -value of the character; the inverse function C<uv_to_utf8> is available -for putting a UV into UTF8: - - if (uv >= 0x80) - /* Must treat this as UTF8 */ - utf8 = uv_to_utf8(utf8, uv); - else - /* OK to treat this character as a byte */ - *utf8++ = uv; - -You B<must> convert characters to UVs using the above functions if -you're ever in a situation where you have to match UTF8 and non-UTF8 -characters. You may not skip over UTF8 characters in this case. If you -do this, you'll lose the ability to match hi-bit non-UTF8 characters; -for instance, if your UTF8 string contains C<v195.136>, and you skip -that character, you can never match a C<chr(200)> in a non-UTF8 string. -So don't do that! - -=head2 How does Perl store UTF8 strings? - -Currently, Perl deals with Unicode strings and non-Unicode strings -slightly differently. If a string has been identified as being UTF-8 -encoded, Perl will set a flag in the SV, C<SVf_UTF8>. You can check and -manipulate this flag with the following macros: - - SvUTF8(sv) - SvUTF8_on(sv) - SvUTF8_off(sv) - -This flag has an important effect on Perl's treatment of the string: if -Unicode data is not properly distinguished, regular expressions, -C<length>, C<substr> and other string handling operations will have -undesirable results. - -The problem comes when you have, for instance, a string that isn't -flagged as UTF8, and contains a byte sequence that could be UTF8 - -especially when combining non-UTF8 and UTF8 strings.
- -Never forget that the C<SVf_UTF8> flag is separate to the PV value; you -need to be sure you don't accidentally knock it off while you're -manipulating SVs. More specifically, you cannot expect to do this: - - SV *sv; - SV *nsv; - STRLEN len; - char *p; - - p = SvPV(sv, len); - frobnicate(p); - nsv = newSVpvn(p, len); - -The C<char*> string does not tell you the whole story, and you can't -copy or reconstruct an SV just by copying the string value. Check if the -old SV has the UTF8 flag set, and act accordingly: - - p = SvPV(sv, len); - frobnicate(p); - nsv = newSVpvn(p, len); - if (SvUTF8(sv)) - SvUTF8_on(nsv); - -In fact, your C<frobnicate> function should be made aware of whether or -not it's dealing with UTF8 data, so that it can handle the string -appropriately. - -=head2 How do I convert a string to UTF8? - -If you're mixing UTF8 and non-UTF8 strings, you might find it necessary -to upgrade one of the strings to UTF8. If you've got an SV, the easiest -way to do this is: - - sv_utf8_upgrade(sv); - -However, you must not do this, for example: - - if (!SvUTF8(left)) - sv_utf8_upgrade(left); - -If you do this in a binary operator, you will actually change one of the -strings that came into the operator, and, while it shouldn't be noticeable -by the end user, it can cause problems. - -Instead, C<bytes_to_utf8> will give you a UTF8-encoded B<copy> of its -string argument. This is useful for having the data available for -comparisons and so on, without harming the original SV. There's also -C<utf8_to_bytes> to go the other way, but naturally, this will fail if -the string contains any characters above 255 that can't be represented -in a single byte. - -=head2 Is there anything else I need to know? - -Not really. Just remember these things: - -=over 3 - -=item * - -There's no way to tell if a string is UTF8 or not. You can tell if an SV -is UTF8 by looking at its C<SvUTF8> flag. Don't forget to set the flag if -something should be UTF8.
Treat the flag as part of the PV, even though -it's not - if you pass on the PV to somewhere, pass on the flag too. - -=item * - -If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value, -unless C<!(*s & 0x80)> in which case you can use C<*s>. - -=item * - -When writing to a UTF8 string, B<always> use C<uv_to_utf8>, unless -C<uv < 0x80> in which case you can use C<*s = uv>. - -=item * - -Mixing UTF8 and non-UTF8 strings is tricky. Use C<bytes_to_utf8> to get -a new string which is UTF8 encoded. There are tricks you can use to -delay deciding whether you need to use a UTF8 string until you get to a -high character - C<HALF_UPGRADE> is one of those. - -=back - -=head1 AUTHORS - -Until May 1997, this document was maintained by Jeff Okamoto -<okamoto@corp.hp.com>. It is now maintained as part of Perl itself -by the Perl 5 Porters <perl5-porters@perl.org>. - -With lots of help and suggestions from Dean Roehrich, Malcolm Beattie, -Andreas Koenig, Paul Hudson, Ilya Zakharevich, Paul Marquess, Neil -Bowers, Matthew Green, Tim Bunce, Spider Boardman, Ulrich Pfeifer, -Stephen McCamant, and Gurusamy Sarathy. - -API Listing originally by Dean Roehrich <roehrich@cray.com>. - -Modifications to autogenerate the API listing (L<perlapi>) by Benjamin -Stuhl. - -=head1 SEE ALSO - -perlapi(1), perlintern(1), perlxs(1), perlembed(1) |