path: root/sys/kern/vfs_vnops.c
Commit message | Author | Age | Files | Lines
* More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t. | alfred | 2002-06-29 | 1 | -2/+2
* Clean up vn_rdwr locking. | jeff | 2002-06-28 | 1 | -6/+12
  - Do shared locks on read.
  - Only do vn_{start,finished}_write when writing.
* Use proper size in bzero of stat structure. | mckusick | 2002-06-24 | 1 | -1/+1
  Submitted by: Jake Burkholder <jake@locore.ca>
  Sponsored by: DARPA & NAI Labs.
* This patch fixes a size problem with the stat structure for | mckusick | 2002-06-22 | 1 | -2/+1
  64-bit architectures that was introduced in the UFS2 code merge two days ago. The stat structure change that caused the problem was the addition of the file create time.
  Submitted by: Bruce Evans <bde@zeta.org.au>
  Sponsored by: DARPA & NAI Labs.
* This commit adds basic support for the UFS2 filesystem. The UFS2 | mckusick | 2002-06-21 | 1 | -2/+2
  filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes).
  UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable.
  In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined.
  Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t.
  Still TODO:
  - Verify that the first level bootstraps work for all the architectures.
  - Convert the utility ffsinfo to understand UFS2 and test growfs.
  - Add support for the extended attribute storage.
  - Update soft updates to ensure integrity of extended attribute storage.
  - Switch the current extended attribute interfaces to use the extended attribute storage.
  - Add the extent like functionality (framework is there, but is currently never used).
  Sponsored by: DARPA & NAI Labs.
  Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
* Disable the shared locking namei() code for now. It breaks several stacking | jeff | 2002-05-14 | 1 | -5/+5
  filesystems. This is on hold until the rest of VFS Locking is reviewed and deemed safe. It can be enabled with 'options LOOKUP_SHARED'.
* Lock proctree_lock instead of pgrpsess_lock. | jhb | 2002-04-16 | 1 | -3/+3
* Use VOP_GETVOBJECT instead of accessing the member directly. This fixed | jeff | 2002-04-14 | 1 | -1/+1
  an issue with nullfs and NAMEI shared.
  Submitted by: Alexander Kabaev
* Turn #ifdef LOOKUP_SHARED into #ifndef LOOKUP_EXCLUSIVE to enable this | jeff | 2002-04-09 | 1 | -5/+5
  behavior by default. Also, change the options line to reflect this. If there are no problems reported this will become the only behavior and the knob will be removed in a month or so.
  Demanded by: obrien
* Change the suser() API to take advantage of td_ucred as well as do a | jhb | 2002-04-01 | 1 | -1/+1
  general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag.
  Discussed on: smp@
* Added used include of <sys/sx.h>. Don't depend on namespace pollution in | bde | 2002-03-26 | 1 | -0/+1
  <sys/file.h>.
* Remove __P. | alfred | 2002-03-19 | 1 | -11/+11
* Giant pushdown for read/write/pread/pwrite syscalls. | alfred | 2002-03-15 | 1 | -1/+7
  kern/kern_descrip.c: Acquire Giant in fdrop_locked when file refcount hits zero, this removes the requirement for the caller to own Giant for the most part.
  kern/kern_ktrace.c: Acquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.
  kern/vfs_bio.c: Acquire Giant in bwillwrite if needed.
  kern/sys_generic.c: Giant pushdown, remove Giant for: read, pread, write and pwrite. readv and writev aren't done yet because of the possible malloc calls for iov to uio processing.
  kern/sys_socket.c: Grab Giant in the socket fo_read/write functions.
  kern/vfs_vnops.c: Grab Giant in the vnode fo_read/write functions.
* This patch adds the "LOCKSHARED" option to namei which causes it to only | jeff | 2002-03-12 | 1 | -0/+34
  acquire shared locks on leafs. The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also, this reduces the number of exclusive locks that are floating around in the system, which helps reduce the number of deadlocks that occur.
  A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for testing, and should eventually go away once it is proven to be stable. I have personally been running this patch for over a year now, so it is believed to be fully stable.
  Reviewed by: jake, obrien
  Approved by: jake
* Stop abusing the pgrpsess_lock. | tanimura | 2002-03-11 | 1 | -3/+3
* Document all functions, global and static variables, and sysctls. | eivind | 2002-03-05 | 1 | -3/+9
  Includes some minor whitespace changes, and re-ordering to be able to document properly (e.g., grouping of variables and the SYSCTL macro calls for them, where the documentation has been added.)
  Reviewed by: phk (but all errors are mine)
* Simple p_ucred -> td_ucred changes to start using the per-thread ucred | jhb | 2002-02-27 | 1 | -6/+6
  reference.
* Lock struct pgrp, session and sigio. | tanimura | 2002-02-23 | 1 | -6/+15
  New locks are:
  - pgrpsess_lock which locks the whole pgrps and sessions,
  - pg_mtx which protects the pgrp members, and
  - s_mtx which protects the session members.
  Please refer to sys/proc.h for the coverage of these locks.
  Changes on the pgrp/session interface:
  - pgfind() needs the pgrpsess_lock held.
  - The caller of enterpgrp() is responsible to allocate a new pgrp and session.
  - Call enterthispgrp() in order to enter an existing pgrp.
  - pgsignal() requires a pgrp lock held.
  Reviewed by: jhb, alfred
  Tested on: cvsup.jp.FreeBSD.org (which is a quad-CPU machine running -current)
* More cleanups relating to vm object allocation failure: make sure we | rwatson | 2002-02-20 | 1 | -1/+5
  call VOP_CLOSE() with vp unlocked; clean up the return path a little, in as much as our namei/vnode operation return paths can be cleared up. For a return case that was apparently never taken, this sure is ugly.
  Reviewed by: jeffr
* Add the braces missed by revision 1.131. | iedowse | 2002-02-18 | 1 | -1/+2
  Pointy hat to: rwatson
* When vn_open() is failing because it cannot allocate a vm object, call | rwatson | 2002-02-18 | 1 | -1/+1
  VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls are symmetric in all failure cases. This prevents an 'open' reference from being leaked in that unlikely failure scenario.
* Make sure to hold vnode lock when calling into VOP_GETATTR(). | rwatson | 2002-02-10 | 1 | -1/+8
  Discussed with: mckusick, phk
* Part I: Update extended attribute API and ABI: | rwatson | 2002-02-10 | 1 | -1/+2
  o Modify the system call syntax for extattr_{get,set}_{fd,file}() so as not to use the scatter gather API (which appeared not to be used by any consumers, and be less portable), rather, accepts 'data' and 'nbytes' in the style of other simple read/write interfaces. This changes the API and ABI.
  o Modify system call semantics so that extattr_get_{fd,file}() return a size_t. When performing a read, the number of bytes read will be returned, unless the data pointer is NULL, in which case the number of bytes of data are returned. This changes the API only.
  o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t argument so as to return the size, if desirable. If set to NULL, the size will not be returned.
  o Update various filesystems (pseudofs, ufs) to DTRT.
  These changes should make extended attributes more useful and more portable. More commits to rebuild the system call files, as well as update userland utilities to follow.
  Obtained from: TrustedBSD Project
  Sponsored by: DARPA, NAI Labs
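The return-value semantics described in the entry above (a NULL data pointer yields the attribute size, otherwise the number of bytes read) can be sketched in user-space C. This is only an illustration under stated assumptions: `struct attr` and `attr_get` are hypothetical names, not kernel interfaces, and truncating to `nbytes` when the buffer is too small is an assumption, since the commit message does not say what the real call does in that case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory attribute; not a kernel structure. */
struct attr {
	const void *value;	/* attribute data */
	size_t size;		/* length of the data in bytes */
};

/*
 * Sketch of the described semantics: with a NULL data pointer, report
 * the attribute's full size; otherwise copy at most nbytes and report
 * how many bytes were copied. Truncation is an assumption of this sketch.
 */
size_t
attr_get(const struct attr *a, void *data, size_t nbytes)
{
	size_t n;

	if (data == NULL)
		return (a->size);
	n = a->size < nbytes ? a->size : nbytes;
	memcpy(data, a->value, n);
	return (n);
}
```

A caller could first pass NULL to learn the size, allocate a buffer of that size, and then call again to fetch the data.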
* Make st_blksize default to PAGE_SIZE instead of zero. | phk | 2002-01-25 | 1 | -2/+2
* Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal | dillon | 2002-01-19 | 1 | -0/+3
  operation. The vgonel() code has always called vclean() but until we started proactively freeing vnodes it would never actually be called with a dirty vnode, so this situation did not occur prior to the vnlru() code. Now that we proactively free vnodes when kern.maxvnodes is hit, however, vclean() winds up with work to do and improperly generates the warnings.
  Reviewed by: peter
  Approved by: re (for MFC)
  MFC after: 1 day
* SMP Lock struct file, filedesc and the global file list. | alfred | 2002-01-13 | 1 | -1/+1
  Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE.
  Locks:
  - 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked.
  - 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex.
  - 1 sx lock for the global filelist.
  struct file * fhold(struct file *fp); /* increments reference count on a file */
  struct file * fhold_locked(struct file *fp); /* like fhold but expects file to be locked */
  struct file * ffind_hold(struct thread *, int fd); /* finds the struct file in thread, adds one reference and returns it unlocked */
  struct file * ffind_lock(struct thread *, int fd); /* ffind_hold, but returns file locked */
  I still have to smp-safe the fget cruft, I'll get to that asap.
* This is a forward port of Peter's vlrureclaim() fix, with some minor mods | dillon | 2001-12-18 | 1 | -1/+2
  by me to make it more efficient. The original code had serious balancing problems and could also deadlock easily. This code relegates the vnode reclamation to its own kproc and relaxes the vnode reclamation requirements to better maintain kern.maxvnodes. This code still doesn't balance as well as it could, but it does a much better job than the original code.
  Approved by: re@freebsd.org
  Obtained from: ps, peter, dillon
  MFS Assuming: Assuming no problems crop up in Yahoo testing
  MFC after: 7 days
* turn vn_open() into a wrapper around vn_open_cred() which allows | alfred | 2001-11-11 | 1 | -2/+12
  one to perform a vn_open using temporary/other/fake credentials.
  Modify the nfs client side locking code to use vn_open_cred() passing proc0's ucred instead of the old way which was to temporarily raise privs while running vn_open(). This should close the race hopefully.
* o vn_open() fails to call VOP_CLOSE() if vfs_object_create fails. Ideally | rwatson | 2001-10-23 | 1 | -0/+1
  all successful calls to VOP_OPEN() might be reflected in a call to VOP_CLOSE(). For now, simply add a comment reflecting this problem; this should be fixed at some point.
* Add missing includes of sys/lock.h. | jhb | 2001-10-11 | 1 | -0/+1
* Make uio_yield() a global. Call uio_yield() between chunks | dillon | 2001-09-26 | 1 | -1/+4
  in vn_rdwr_inchunks(), allowing other processes to gain an exclusive lock on the vnode. Specifically: directory scanning, to avoid a race to the root directory, and multiple child processes coring simultaneously so they can figure out that some other core'ing child has an exclusive adv lock and just exit instead.
  This completely fixes performance problems when large programs core. You can have hundreds of copies (forked children) of the same binary core all at once and not notice.
  MFC after: 3 days
* KSE Milestone 2 | julian | 2001-09-12 | 1 | -79/+79
  Note ALL MODULES MUST BE RECOMPILED
  make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process.
  Sorry john! (your next MFC will be a doosie!)
  Reviewed by: peter@freebsd.org, dillon@freebsd.org
  X-MFC after: ha ha ha ha
* This brings in a Yahoo coredump patch from Paul, with additional mods by | dillon | 2001-09-08 | 1 | -0/+39
  me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that if you have large process images core dumping, or you have a large number of forked processes all core dumping at the same time, the original coredump code would leave the vnode locked throughout. This can cause the directory vnode to get locked up, which can cause the parent directory vnode to get locked up, and so on all the way to the root node, locking the entire machine up for extremely long periods of time.
  This patch solves the problem in two ways. First it uses an advisory non-blocking lock to abort multiple processes trying to core to the same file. Second (my contribution) it chunks up the writes and uses bwillwrite() to avoid holding the vnode locked while blocking in the buffer cache.
  Submitted by: ps
  Reviewed by: dillon
  MFC after: 2 weeks
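The chunking idea in the entry above can be illustrated with a small user-space sketch. It is not the kernel's vn_rdwr_inchunks(): `write_in_chunks`, `chunk_fn`, `struct chunk_stats`, and `count_chunk` are hypothetical names introduced here, and the bwillwrite() call and vnode locking of the real code are not modeled. The point is only that one large transfer becomes several bounded ones, so a lock taken inside each call is released between them.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: performs one bounded I/O; returns 0 or an errno value. */
typedef int (*chunk_fn)(const char *buf, size_t len, void *ctx);

/*
 * Split one large transfer into pieces of at most max_chunk bytes.
 * In a kernel setting the lock would be taken and dropped inside each
 * call to io, giving other threads a chance to run between chunks.
 */
int
write_in_chunks(const char *buf, size_t len, size_t max_chunk,
    chunk_fn io, void *ctx)
{
	int error = 0;

	if (max_chunk == 0)
		return (EINVAL);	/* a zero chunk size would never finish */
	while (len > 0 && error == 0) {
		size_t n = len < max_chunk ? len : max_chunk;

		error = io(buf, n, ctx);
		buf += n;
		len -= n;
	}
	return (error);
}

/* Example callback state: counts calls and bytes instead of doing real I/O. */
struct chunk_stats {
	size_t calls;
	size_t bytes;
};

int
count_chunk(const char *buf, size_t len, void *ctx)
{
	struct chunk_stats *st = ctx;

	(void)buf;		/* contents are irrelevant to the count */
	st->calls++;
	st->bytes += len;
	return (0);
}
```

With a 10-byte buffer and a 4-byte chunk limit, the callback is invoked three times (4, 4, and 2 bytes).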
* vn_stat(): if va_size (u_quad_t) > OFF_MAX, return EOVERFLOW, don't copy it | ache | 2001-08-23 | 1 | -0/+4
  blindly to st_size
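The check described in the vn_stat() entry above amounts to a range test before a narrowing copy. A minimal user-space sketch follows; `copy_size_checked` is a hypothetical helper, not the kernel code, and `INT64_MAX` stands in for `OFF_MAX` on the assumption that `off_t` is a signed 64-bit type.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy an unsigned 64-bit size into a signed
 * 64-bit field, refusing values the signed type cannot represent.
 * INT64_MAX stands in for OFF_MAX (assumption: off_t is 64-bit signed).
 */
int
copy_size_checked(uint64_t va_size, int64_t *st_size)
{
	if (va_size > (uint64_t)INT64_MAX)
		return (EOVERFLOW);	/* do not copy blindly */
	*st_size = (int64_t)va_size;
	return (0);
}
```

On failure the destination is left untouched, so the caller never sees a wrapped, negative size.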
* This patch implements O_DIRECT about 80% of the way. It takes a patchset | dillon | 2001-05-24 | 1 | -0/+4
  Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece.
  Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency.
  I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet.
  Submitted by: tegge, dillon
* Revert consequences of changes to mount.h, part 2. | grog | 2001-04-29 | 1 | -2/+0
  Requested by: bde
* When closing the last reference to an unlinked file, it is freed | mckusick | 2001-04-25 | 1 | -0/+9
  by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended.
  For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written.
  Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.
* Correct #includes to work with fixed sys/mount.h. | grog | 2001-04-23 | 1 | -0/+2
* Previous commit broke interlock locking for !LK_RETRY case. | bp | 2001-03-26 | 1 | -2/+3
* Prevent race condition by using msleep() instead of mtx_unlock()/tsleep(). | bp | 2001-03-26 | 1 | -2/+1
  Reviewed by: alfred
* o Rename "namespace" argument to "attrnamespace" as namespace is a C++ | rwatson | 2001-03-19 | 1 | -7/+7
  reserved word.
  Submitted by: jkh
  Obtained from: TrustedBSD Project
* o Change the API and ABI of the Extended Attribute kernel interfaces to | rwatson | 2001-03-15 | 1 | -8/+9
  introduce a new argument, "namespace", rather than relying on a first-character namespace indicator. This is in line with more recent thinking on EA interfaces on various mailing lists, including the posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces are defined by default, EXTATTR_NAMESPACE_SYSTEM and EXTATTR_NAMESPACE_USER, where the primary distinction lies in the access control model: user EAs are accessible based on the normal MAC and DAC file/directory protections, and system attributes are limited to kernel-originated or appropriately privileged userland requests.
  o These API changes occur at several levels: the namespace argument is introduced in the extattr_{get,set}_file() system call interfaces, at the vnode operation level in the vop_{get,set}extattr() interfaces, and in the UFS extended attribute implementation. Changes are also introduced in the VFS extattrctl() interface (system call, VFS, and UFS implementation), where the arguments are modified to include a namespace field, as well as modified to avoid direct access to userspace variables from below the VFS layer (in the style of recent changes to mount by adrian@FreeBSD.org). This required some cleanup and bug fixing regarding VFS locks and the VFS interface, as a vnode pointer may now be optionally submitted to the VFS_EXTATTRCTL() call. Updated documentation for the VFS interface will be committed shortly.
  o In the near future, the auto-starting feature will be updated to search two sub-directories to the ".attribute" directory in appropriate file systems: "user" and "system" to locate attributes intended for those namespaces, as the single filename is no longer sufficient to indicate what namespace the attribute is intended for. Until this is committed, all attributes auto-started by UFS will be placed in the EXTATTR_NAMESPACE_SYSTEM namespace.
  o The default POSIX.1e attribute names for ACLs and Capabilities have been updated to no longer include the '$' in their filename. As such, if you're using these features, you'll need to rename the attribute backing files to the same names without '$' symbols in front.
  o Note that these changes will require changes in userland, which will be committed shortly. These include modifications to the extended attribute utilities, as well as to libutil for new namespace string conversion routines. Once the matching userland changes are committed, a buildworld is recommended to update all the necessary include files and verify that the kernel and userland environments are in sync. Note: If you do not use extended attributes (most people won't), upgrading is not imperative although since the system call API has changed, the new userland extended attribute code will no longer compile with old include files.
  o Couple of minor cleanups while I'm there: make more code compilation conditional on FFS_EXTATTR, which should recover a bit of space on kernels running without EA's, as well as update copyright dates.
  Obtained from: TrustedBSD Project
* Extend kqueue down to the device layer. | jlemon | 2001-02-15 | 1 | -81/+6
  Backwards compatible approach suggested by: peter
* Change and clean the mutex lock interface. | bmilekic | 2001-02-09 | 1 | -6/+6
  mtx_enter(lock, type) becomes:
    mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
    mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
  similarly, for releasing a lock, we now have:
    mtx_unlock(lock) for MTX_DEF and
    mtx_unlock_spin(lock) for MTX_SPIN.
  We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument.
  The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind.
  Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH. The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively.
  Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case.
  Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled.
  Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those.
  Finally, caught up to the interface changes in all sys code.
  Contributors: jake, jhb, jasone (in no particular order)
* Convert all simplelocks to mutexes and remove the simplelock implementations. | jasone | 2001-01-24 | 1 | -4/+4
* Implement a low-memory deadlock solution. | dillon | 2000-11-18 | 1 | -1/+3
  Removed most of the hacks that were trying to deal with low-memory situations prior to now.
  The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather than block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired.
  Code has been added to stall in a low-memory situation prior to a vnode being locked.
  Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist.
  Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled.
  In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS.
  Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op.
  In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the *WRONG* page(!), leading to corruption.
  There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported.
  Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
  Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
* Take VBLK devices further out of their misery. | phk | 2000-11-02 | 1 | -2/+2
  This should fix the panic I introduced in my previous commit on this topic.
* Catch up to moving headers: | jhb | 2000-10-20 | 1 | -2/+1
  - machine/ipl.h -> sys/ipl.h
  - machine/mutex.h -> sys/mutex.h
* Convert lockmgr locks from using simple locks to using mutexes. | jasone | 2000-10-04 | 1 | -2/+4
  Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
* o Introduce vn_extattr_rm(), a helper function in the style of | rwatson | 2000-09-22 | 1 | -0/+23
  vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the specified extended attribute from a vnode, authorizing the change as the kernel (NULL cred).
  Obtained from: TrustedBSD Project