path: root/sys/ufs
Commit message (Collapse) | Author | Age | Files | Lines
...
* Fixes to track snapshot copy-on-write checking in the specinfomckusick2001-03-078-58/+57
| | | | | | structure rather than assuming that the device vnode would reside in the FFS filesystem (which is obviously a broken assumption with the device filesystem).
* Grab the process lock while calling psignal and before calling psignal.jhb2001-03-071-0/+2
|
* Protect SIGDELSET of p_siglist with the proc lock.jhb2001-03-071-1/+4
|
* Free lock before returning from process_worklist_item.mckusick2001-03-011-1/+3
| | | | Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>
* Reviewed by: jlemonadrian2001-03-011-2/+1
| An initial tidyup of the mount() syscall and VFS mount code. This code replaces the earlier work done by jlemon in an attempt to make linux_mount() work.
| * the guts of the mount work has been moved into vfs_mount().
| * move `type', `path' and `flags' from being userland variables into being kernel variables in vfs_mount(). `data' remains a pointer into userspace.
| * Attempt to verify the `type' and `path' strings passed to vfs_mount() aren't too long.
| * rework mount() and linux_mount() to take the userland parameters (besides data, as mentioned) and pass kernel variables to vfs_mount(). (linux_mount() already did this, I've just tidied it up a little more.)
| * remove the copyin*() stuff for `path'. `data' still requires copyin*() since it's a pointer into userland.
| * set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each filesystem. This variable is generally initialised with `path', and each filesystem can override it if they want to.
| * NOTE: f_mntonname is initialised with "/" in the case of a root mount.
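The length verification described in this commit can be sketched in userland C. This is only an illustration under stated assumptions: `copy_bounded`, `sketch_vfs_mount` and the two `SKETCH_*_MAX` limits are hypothetical names and values, not the kernel's, and no real mounting takes place.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits; the real kernel constants may differ. */
#define SKETCH_TYPE_MAX 16
#define SKETCH_PATH_MAX 1024

/*
 * Copy a caller-supplied string into a fixed-size buffer, failing
 * with ENAMETOOLONG instead of truncating silently.
 */
static int
copy_bounded(char *dst, size_t dstlen, const char *src)
{
	const char *end = memchr(src, '\0', dstlen);

	if (end == NULL)
		return (ENAMETOOLONG);	/* no terminator within dstlen bytes */
	memcpy(dst, src, (size_t)(end - src) + 1);
	return (0);
}

/* Stand-in for vfs_mount(): it only proceeds with validated strings. */
static int
sketch_vfs_mount(const char *fstype, const char *fspath)
{
	char type[SKETCH_TYPE_MAX];
	char path[SKETCH_PATH_MAX];
	int error;

	if ((error = copy_bounded(type, sizeof(type), fstype)) != 0)
		return (error);
	if ((error = copy_bounded(path, sizeof(path), fspath)) != 0)
		return (error);
	/* The actual mount work would use type and path here. */
	return (0);
}
```

Rejecting an over-long string outright, rather than truncating it, avoids mounting on a path the caller never named.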
* Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().jlemon2001-02-231-0/+13
| | | | | | | Use this to tell a filter attached to a vnode that the underlying vnode is no longer valid, by returning EV_EOF. PR: kern/25309, kern/25206
* Use correct list pointer when detaching knote from list.jlemon2001-02-231-1/+2
|
* Free lock before calling panic so that subsequent attempt to write outmckusick2001-02-231-62/+191
| | | | | buffers does not re-panic with `locking against myself'. This change should not affect normal operations of soft updates in any way.
* When cleaning up excess inode dependencies, check for being done.mckusick2001-02-221-0/+2
| | | | Reviewed by: Jan Koum <jkb@yahoo-inc.com>
* This patch corrects two problems with the rate limiting codemckusick2001-02-201-4/+9
| that was introduced in revision 1.80. The problem manifested itself with a `locking against myself' panic and could also result in soft updates inconsistencies associated with inodedeps. The two problems are:
|
| 1) One of the background operations could manipulate the bitmap while holding it locked with intent to create. This held lock results in a `locking against myself' panic, when the background processing that we have been coopted to do tries to lock the bitmap which we are already holding locked. To understand how to fix this problem, first, observe that we can do the background cleanups in inodedep_lookup only when allocating inodedeps (DEPALLOC is set in the call to inodedep_lookup). Second, observe that calls to inodedep_lookup with DEPALLOC set can only happen from the following calls into the softdep code:
|      softdep_setup_inomapdep
|      softdep_setup_allocdirect
|      softdep_setup_remove
|      softdep_setup_freeblocks
|      softdep_setup_directory_change
|      softdep_setup_directory_add
|      softdep_change_linkcnt
|    Only the first two of these can come from ffs_alloc.c while holding a bitmap locked. Thus, inodedep_lookup must not go off to do request_cleanups when being called from these functions. This change adds a flag, NODELAY, that can be passed to inodedep_lookup to let it know that it should not do background processing in those cases.
|
| 2) The return value from request_cleanup when helping out with the cleanup was 0 instead of 1. This meant that despite the fact that we may have slept while doing the cleanups, the code did not recheck for the appearance of an inodedep (e.g., goto top in inodedep_lookup). This led to the softdep inconsistency in which we ended up with two inodedeps for the same inode.
|
| Reviewed by: Peter Wemm <peter@yahoo-inc.com>, Matt Dillon <dillon@earth.backplane.com>
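Both fixes can be modelled in a small userland sketch. It is a simplification, not the soft updates code: the flag values, the `deptable` structure and the `racer` field (which simulates another process adding an entry while we sleep) are all assumptions made for illustration, and the sketch always attempts a cleanup, whereas the real code only does so under memory pressure.

```c
#include <stddef.h>

#define DEPALLOC 0x1	/* allocate an entry if none is found */
#define NODELAY  0x2	/* caller holds a bitmap lock: no background work */
#define MAXDEPS  8

struct deptable {
	unsigned long inums[MAXDEPS];
	size_t count;
	int cleanups;		/* how often we helped with cleanup */
	unsigned long racer;	/* inode added by "someone else" while we sleep */
};

static int
find(const struct deptable *t, unsigned long inum)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->inums[i] == inum)
			return (1);
	return (0);
}

/* Returns 1 because we may have slept, so the caller must recheck. */
static int
sketch_request_cleanup(struct deptable *t)
{
	t->cleanups++;
	if (t->racer != 0 && t->count < MAXDEPS) {
		t->inums[t->count++] = t->racer;
		t->racer = 0;
	}
	return (1);
}

/* Returns 1 if an entry for inum exists on return, 0 otherwise. */
static int
sketch_inodedep_lookup(struct deptable *t, unsigned long inum, int flags)
{
	int helped = 0;
top:
	if (find(t, inum))
		return (1);
	if ((flags & DEPALLOC) == 0)
		return (0);
	if ((flags & NODELAY) == 0 && !helped) {
		helped = 1;
		if (sketch_request_cleanup(t))
			goto top;	/* state may have changed while asleep */
	}
	if (t->count == MAXDEPS)
		return (0);
	t->inums[t->count++] = inum;
	return (1);
}
```

Without the `goto top` recheck, the lookup would append a second entry for an inode that was added while it slept, which is the duplicate-inodedep inconsistency the commit describes.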
* Preceed/preceeding are not english words. Use precede and preceding.asmodai2001-02-181-1/+1
|
* Extend kqueue down to the device layer.jlemon2001-02-151-0/+72
| | | | Backwards compatible approach suggested by: peter
* Implement a unified run queue and adjust priority levels accordingly.jake2001-02-121-2/+2
| - All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propagated up into interrupt thread range if need be.
| - I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels.
| - The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure.
| - Group the priority level, user priority, native priority (before propagation) and the scheduling class into a struct priority.
| - Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI).
| - Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt.
| - Activate propogate_priority(). It should now have the desired effect without needing to also propagate the scheduling class.
| - Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propogate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propagate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at appropriate times by the vm system.
| - Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.
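A rough sketch of a unified run queue that is passed in as a parameter follows. The figure of 64 queues comes from the commit; the 256-level priority space, the per-queue counters standing in for process lists, and every `sketch_` name are assumptions for illustration only.

```c
#include <stdint.h>

#define RQ_NQS		64			/* number of run queues (from the commit) */
#define PRI_LEVELS	256			/* assumed size of the priority space */
#define RQ_PPQ		(PRI_LEVELS / RQ_NQS)	/* priorities sharing one queue */

struct sketch_runq {
	uint64_t status;		/* bit q set => queue q is non-empty */
	int length[RQ_NQS];		/* stand-in for the per-queue lists */
};

/* Queue a process of priority pri on the given run queue. */
static void
sketch_runq_add(struct sketch_runq *rq, int pri)
{
	int q = pri / RQ_PPQ;

	rq->length[q]++;
	rq->status |= (uint64_t)1 << q;
}

/* Return the lowest-numbered non-empty queue, or -1 if all are empty. */
static int
sketch_runq_best(const struct sketch_runq *rq)
{
	for (int q = 0; q < RQ_NQS; q++)
		if (rq->status & ((uint64_t)1 << q))
			return (q);
	return (-1);
}

/* Remove one process of priority pri, clearing the bit if the queue empties. */
static void
sketch_runq_remove(struct sketch_runq *rq, int pri)
{
	int q = pri / RQ_PPQ;

	if (rq->length[q] > 0 && --rq->length[q] == 0)
		rq->status &= ~((uint64_t)1 << q);
}
```

Because every function takes the queue as an argument, a per-CPU arrangement would only need one `struct sketch_runq` per CPU, and changing the queue count means changing the two constants.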
* Change and clean the mutex lock interface.bmilekic2001-02-095-55/+55
| mtx_enter(lock, type) becomes:
|      mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
|      mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
| similarly, for releasing a lock, we now have:
|      mtx_unlock(lock) for MTX_DEF and
|      mtx_unlock_spin(lock) for MTX_SPIN.
| We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument.
|
| The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind.
|
| Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH. The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively.
|
| Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case.
|
| Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled.
|
| Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those.
|
| Finally, caught up to the interface changes in all sys code.
|
| Contributors: jake, jhb, jasone (in no particular order)
* Another round of the <sys/queue.h> FOREACH transmogriffer.phk2001-02-043-30/+15
| | | | | Created with: sed(1) Reviewed by: md5(1)
* Mechanical change to use <sys/queue.h> macro API instead ofphk2001-02-044-26/+26
| | | | | | | fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
* Use <sys/queue.h> macro API.phk2001-02-042-18/+18
|
* Remove a DIAGNOSTIC check which belongs in <sys/queue.h> if anyplace at all.phk2001-02-041-4/+0
|
* Extend the sanity checks in ufs_lookup to ensure that each directoryiedowse2001-02-041-1/+2
| | | | | | | | | | | | | | entry fits within its DIRBLKSIZ block. The surrounding code is extremely fragile with respect to corruption of the directory entry 'd_reclen' field; if directory corruption occurs, it can blindly scan forward beyond the end of the filesystem block. Usually this results in a 'fault on nofault entry' panic. Directory corruption is now much more likely to be detected, resulting in a 'ufs_dirbad' panic. If the filesystem is read-only, it will simply print a warning message, and skip the corrupted block. Reviewed by: mckusick
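The added bounds check can be illustrated with a small stand-alone function. This is a sketch, not the code from ufs_lookup: the `DIRBLKSIZ` value of 512 and the extra zero-length test are assumptions, and the real sanity checks cover more fields than `d_reclen`.

```c
#include <stddef.h>
#include <stdint.h>

#define DIRBLKSIZ 512	/* assumed value; must be a power of two */

/*
 * Return 1 if a directory entry of length reclen that starts at byte
 * offset `offset' lies entirely inside its DIRBLKSIZ block, else 0.
 */
static int
sketch_entry_fits(size_t offset, uint16_t reclen)
{
	size_t inblock = offset & (DIRBLKSIZ - 1);	/* offset within the block */

	if (reclen == 0)
		return (0);	/* a zero-length record would never advance */
	return ((size_t)reclen <= DIRBLKSIZ - inblock);
}
```

A scan loop that refuses to follow a `d_reclen` failing this test cannot walk past the end of the current block, which is the failure mode the commit message describes.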
* Use the correct flags field when checking for a read-only filesystemiedowse2001-02-031-1/+1
| | | | | | | | | | in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the syscalls *statfs and getfsstat, so mnt_flag should be used instead. This only affects whether or not a panic is generated on detection of certain types of directory corruption. Reviewed by: mckusick
* Fix a race between the syncer and umount. When you umount a softupdatesdillon2001-01-301-12/+38
| | | | | | | | | | | filesystem softdep_process_worklist() is called in a loop until it indicates that no dependencies remain, but the determination of that fact depends on there only being one softdep_process_worklist() instance running. It was possible for the syncer to also be running softdep_process_worklist() and the pre-existing checks in the code to prevent this were not sufficient to prevent the race. This patch solves the problem. Approved-by: mckusick
* Convert all simplelocks to mutexes and remove the simplelock implementations.jasone2001-01-244-37/+34
|
* The ffs superblock includes a 128-byte region for use by temporaryiedowse2001-01-153-34/+36
| | | | | | | | | | | | | | | | | | | | in-core pointers to summary information. An array in this region (fs_csp) could overflow on filesystems with a very large number of cylinder groups (~16000 on i386 with 8k blocks). When this happens, other fields in the superblock get corrupted, and fsck refuses to check the filesystem. Solve this problem by replacing the fs_csp array in 'struct fs' with a single pointer, and add padding to keep the length of the 128-byte region fixed. Update the kernel and userland utilities to use just this single pointer. With this change, the kernel no longer makes use of the superblock fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c to indicate that these fields must be calculated for compatibility with older kernels. Reviewed by: mckusick
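The layout trick, a single pointer plus padding that keeps the region at 128 bytes, can be sketched as below. The structure and function names are hypothetical, and the capacity helper rests on the assumption that each old array slot pointed at one filesystem block of summary entries.

```c
#include <stddef.h>

#define CS_REGION_SIZE 128	/* size of the in-core pointer region (from the commit) */

struct csum;			/* opaque summary-information type */

/* The region after the change: one pointer plus explicit padding. */
struct sketch_cs_region {
	struct csum *csp;
	unsigned char pad[CS_REGION_SIZE - sizeof(struct csum *)];
};

/* Compile-time guard: the on-disk layout must not move. */
_Static_assert(sizeof(struct sketch_cs_region) == CS_REGION_SIZE,
    "summary pointer region must stay 128 bytes");

/*
 * Cylinder groups the old pointer array could describe, assuming one
 * block of summary entries per pointer slot.
 */
static size_t
sketch_old_capacity(size_t ptr_size, size_t block_size, size_t entry_size)
{
	return ((CS_REGION_SIZE / ptr_size) * (block_size / entry_size));
}
```

With 4-byte pointers, 8 KB blocks and an assumed 16-byte summary entry, `sketch_old_capacity(4, 8192, 16)` gives 16384, which is consistent with the "~16000 on i386 with 8k blocks" figure in the commit message.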
* Properly compute the size of the final block of superblock summary information.mckusick2001-01-121-1/+1
| | | | Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
* o Commit reams of style(9) changes, whitespace improvements, and commentrwatson2001-01-071-61/+60
| | | | | | cleanups. Obtained from: TrustedBSD Project
* o Zero the ufs_extattr_header length field (not necessary, but not a badrwatson2001-01-071-1/+8
| idea either) in ufs_extattr_rm.
| o More completely fill out the local_aio structure when writing out the zero'd extended attribute in ufs_extattr_rm -- previously, this worked fine, but probably should not have. This corrects extraneous warnings about inconsistent inodes following file deletion.
| Reviewed by: jedgar
* o Add an additional EA inconsistency reporting opportunity inrwatson2001-01-071-2/+16
| | | | | | | | ufs_extattr_rm. o Make both reporting locations report the function name where the inconsistency is discovered, as well as the inode number in question. Reviewed by: jedgar
* o Make call to ufs_extattr_rm() in ufs_extattr_vnode_inactive() userwatson2001-01-071-1/+1
| | | | | | | NULL as the credential, not 0, so as to make it more clear what's going on. Obtained from: TrustedBSD Project
* o Remove unnecessary sanity check involving requested offset of extendedrwatson2001-01-071-5/+0
| | | | | | | | | | attribute read--the offset is required to be 0 by an earlier check, meaning that it will always be within the scope of the attribute data. This change should have no impact on executed code paths other than removing the unnecessary check: please report if any new failures start to occur as a result. Obtained from: TrustedBSD Project
* This implements a better launder limiting solution. There was a solutiondillon2000-12-261-1/+3
| in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then successive passes will be run without any limit.
|
| Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, among other things.
|
| NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.
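One plausible reading of the two sysctls is a high/low watermark scheme: writers stall once the amount of write I/O in flight exceeds the high mark and resume once it drains to the low mark. The sketch below models only that assumed behaviour; the structure, its fields and both functions are hypothetical and are not taken from the kernel.

```c
#include <stddef.h>

struct sketch_wthrottle {
	size_t running;		/* bytes of write I/O currently in flight */
	size_t lo, hi;		/* stand-ins for vfs.lorunningspace / vfs.hirunningspace */
	int stalled;		/* 1 while new writers must wait */
};

/* Account for an issued write; returns 1 if later writers must stall. */
static int
sketch_write_start(struct sketch_wthrottle *t, size_t bytes)
{
	t->running += bytes;
	if (t->running > t->hi)
		t->stalled = 1;
	return (t->stalled);
}

/* Account for a completed write; returns 1 if stalled writers may resume. */
static int
sketch_write_done(struct sketch_wthrottle *t, size_t bytes)
{
	t->running -= (bytes < t->running) ? bytes : t->running;
	if (t->stalled && t->running <= t->lo) {
		t->stalled = 0;
		return (1);
	}
	return (0);
}
```

The gap between the two marks keeps writers from being woken and stalled again on every completion, while capping the queued write data that a read has to wait behind.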
* Several small but important fixes for snapshots:mckusick2000-12-194-17/+40
| | | | | | | | | | | 1) Be more tolerant of missing snapshot files by only trying to decrement their reference count if they are registered as active. 2) Fix for snapshots of filesystems with block sizes larger than 8K (from Ollivier Robert <roberto@eurocontrol.fr>). 3) Fix to avoid losing last block in snapshot file when calculating blocks that need to be copied (from Don Coleman <coleman@coleman.org>).
* Get rid of spurious check in ffs_truncate for i_size == lengthmckusick2000-12-191-2/+0
| | | | | | | which fails to set the modification time on the file. The same check a few lines later takes the correct action. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
* add a stub for softdep_slowdown so that it's possible to build theassar2000-12-171-0/+6
| | | | kernel without SOFTUPDATES
* Avoid a data-consistency race between write() and mmap()dillon2000-12-171-0/+9
| | | | | | | | by ensuring that newly allocated blocks are zeroed. The race can occur even in the case where the write covers the entire block. Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>
* - Move ifs_init() so that it can initialize ifs_inode_hash_mtx.tanimura2000-12-141-12/+12
| | | | - s/ffs_inode_hash_lock/ifs_inode_hash_lock/
* Do not race for the lock of an inode hash.tanimura2000-12-132-12/+84
| | | | Reviewed by: jhb
* Preventing runaway kernel soft updates memory, take three.mckusick2000-12-134-73/+168
| | | | | | | | | | | | | | | | | | | | | Previously, the syncer process was the only process in the system that could process the soft updates background work list. If enough other processes were adding requests to that list, it would eventually grow without bound. Because some of the work list requests require vnodes to be locked, it was not generally safe to let random processes process the work list while they already held vnodes locked. By adding a flag to the work list queue processing function to indicate whether the calling process could safely lock vnodes, it becomes possible to co-opt other processes into helping out with the work list. Now when the worklist gets too large, other processes can safely help out by picking off those work requests that can be handled without locking a vnode, leaving only the small number of requests requiring a vnode lock for the syncer process. With this change, it appears possible to keep even the nastiest workloads under control. Submitted by: Paul Saab <ps@yahoo-inc.com>
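The flag-based co-opting scheme can be sketched as follows. The item structure and function are hypothetical simplifications: real work items carry dependency data, and the real queue is a linked list rather than an array.

```c
#include <stddef.h>

struct sketch_item {
	int needs_vnode_lock;	/* 1 if handling this item must lock a vnode */
	int done;
};

/*
 * Process pending work items. A caller that may already hold vnode
 * locks passes can_lock_vnodes = 0 and skips items that would need
 * one. Returns the number of items handled.
 */
static size_t
sketch_process_worklist(struct sketch_item *items, size_t n, int can_lock_vnodes)
{
	size_t handled = 0;

	for (size_t i = 0; i < n; i++) {
		if (items[i].done)
			continue;
		if (items[i].needs_vnode_lock && !can_lock_vnodes)
			continue;	/* leave it for the syncer */
		items[i].done = 1;
		handled++;
	}
	return (handled);
}
```

An ordinary process would call this with `can_lock_vnodes` set to 0 when the list grows too long, and the syncer would call it with 1 to pick up whatever remains.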
* Convert more malloc+bzero to malloc+M_ZERO.dwmalone2000-12-083-18/+11
| | | | | Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
* Staticize some malloc M_ instances.phk2000-12-081-13/+13
|
* Add necessary bwillwrite() in writev() entry point.dillon2000-12-061-3/+3
| | | | | | Deal with excessive dirty buffers when msync() syncs non-contiguous dirty buffers by checking for the case in UFS *before* checking for clusterability.
* More aggressively rate limit the growth of soft dependency structuresmckusick2000-11-201-33/+21
| | | | | | | | | | | in the face of multiple processes doing massive numbers of filesystem operations. While this patch will work in nearly all situations, there are still some perverse workloads that can overwhelm the system. Detecting and handling these perverse workloads will be the subject of another patch. Reviewed by: Paul Saab <ps@yahoo-inc.com> Obtained from: Ethan Solomita <ethan@geocast.com>
* Implement a low-memory deadlock solution.dillon2000-11-183-17/+26
| Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather than block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired.
|
| Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist.
|
| Implement a number of VFS/BIO fixes (found by Ian):
|
| In biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled.
|
| In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS.
|
| Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op.
|
| In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the *WRONG* page(!), leading to corruption.
|
| There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often; it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported.
|
| Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
| Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
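The ~PAGE_MASK remark is easiest to see with a concrete 64-bit offset. The three functions below are an illustrative sketch: a 4096-byte page and a 32-bit `int` are assumed, and none of this is the kernel's own code.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

/* Invert a 32-bit unsigned mask, then widen: the high 32 bits become 0. */
static int64_t
sketch_trunc_unsigned_mask(int64_t off)
{
	uint32_t mask = SKETCH_PAGE_SIZE - 1;

	return (off & ~mask);
}

/* Signed 32-bit mask: sign extension happens to give the right answer. */
static int64_t
sketch_trunc_signed_mask(int64_t off)
{
	int32_t mask = SKETCH_PAGE_SIZE - 1;

	return (off & ~mask);
}

/* Widen first, then invert: correct regardless of signedness. */
static int64_t
sketch_trunc_widened_mask(int64_t off)
{
	int64_t mask = SKETCH_PAGE_SIZE - 1;

	return (off & ~mask);
}
```

For an offset above 4 GB, the unsigned variant silently drops the upper half of the offset, the signed variant survives only through sign extension, and the widened variant is the form the commit recommends.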
* When deleting a file, the ordering of events imposed by soft updatesmckusick2000-11-141-15/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | is to first write the deleted directory entry to disk, second write the zero'ed inode to disk, and finally to release the freed blocks and the inode back to the cylinder-group map. As this ordering requires two disk writes to occur which are normally spaced about 30 seconds apart (except when memory is under duress), it takes about a minute from the time that a file is deleted until its inode and data blocks show up in the cylinder-group map for reallocation. If a file has had only a brief lifetime (less than 30 seconds from creation to deletion), neither its inode nor its directory entry may have been written to disk. If its directory entry has not been written to disk, then we need not wait for that directory block to be written as the on-disk directory block does not reference the inode. Similarly, if the allocated inode has never been written to disk, we do not have to wait for it to be written back either as its on-disk representation is still zero'ed out. Thus, in the case of a short lived file, we can simply release the blocks and inode to the cylinder-group map immediately. As the inode and its blocks are released immediately, they are immediately available for other uses. If they are not released for a minute, then other inodes and blocks must be allocated for short lived files, cluttering up the vnode and buffer caches. The previous code was a bit too aggressive in trying to release the blocks and inode back to the cylinder-group map resulting in their being made available when in fact the inode on disk had not yet been zero'ed. This patch takes a more conservative approach to doing the release which avoids doing the release prematurely.
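The reasoning about short-lived files reduces to counting which on-disk writes still have to complete before the blocks and inode can be reused. The sketch below captures only that decision; the structure and both functions are hypothetical, and the real soft updates code tracks this through dependency structures rather than two flags.

```c
struct sketch_file_state {
	int dirent_on_disk;	/* directory entry has reached the disk */
	int inode_on_disk;	/* allocated inode has reached the disk */
};

/* Disk writes that must complete before the blocks and inode are freed. */
static int
sketch_writes_to_wait_for(const struct sketch_file_state *f)
{
	int writes = 0;

	if (f->dirent_on_disk)
		writes++;	/* directory block without the entry */
	if (f->inode_on_disk)
		writes++;	/* zeroed inode */
	return (writes);
}

/* Returns 1 if the release to the cylinder-group map can happen at once. */
static int
sketch_can_release_now(const struct sketch_file_state *f)
{
	return (sketch_writes_to_wait_for(f) == 0);
}
```

The conservative approach in the commit corresponds to treating either flag as set whenever there is doubt, so that nothing is released before the on-disk inode has been zeroed.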
* Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 ofbde2000-11-041-1/+3
| | | | | | | | | | | | | | | | | | | | ufs_vnops.c: 1) i_ino was confused with i_number, so the inode number passed to VFS_VGET() was usually wrong (usually 0U). 2) ip was dereferenced after vgone() freed it, so the inode number passed to VFS_VGET() was sometimes not even wrong. Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the inode number, so inode number 0U gives a way out of bounds array index. Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba() doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?) from the disk for the invalid inode number 0U; ufs_mknod() returns a wrong vnode, but most callers just vput() it; the correct vnode is eventually obtained by an implicit VFS_VGET() just like it used to be. Bug (2) usually doesn't happen.
* Give vop_mmap an untimely death. The opportunity to give it a timelyeivind2000-11-011-21/+0
| | | | death timed out in 1996.
* Add a missing <sys/systm.h>phk2000-10-301-0/+1
|
* Move suser() and suser_xxx() prototypes and a related #define fromphk2000-10-291-1/+0
| | | | | | | | | <sys/proc.h> to <sys/systm.h>. Correctly document the #includes needed in the manpage. Add one now needed #include of <sys/systm.h>. Remove the consequent 48 unused #includes of <sys/proc.h>.
* Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ingphk2000-10-293-3/+0
| | | | | | | | | | the offending inline function (BUF_KERNPROC) on it being #included already. I'm not sure BUF_KERNPROC() is even the right thing to do or in the right place or implemented the right way (inline vs normal function). Remove consequently unneeded #includes of <sys/proc.h>
* Remove unneeded #include <sys/proc.h> lines.phk2000-10-291-1/+0
|
* o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to performrwatson2000-10-192-26/+48
| | | | | | | | | | | | | | | | | | | | "administrative" authorization checks. In most cases, the VADMIN test checks to make sure the credential effective uid is the same as the file owner. o Modify vaccess() to set VADMIN as an available right if the uid is appropriate. o Modify references to uid-based access control operations such that they now always invoke VOP_ACCESS() instead of using hard-coded policy checks. o This allows alternative UFS policies to be implemented by replacing only ufs_access() (such as mandatory system policies). o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the vnode: I believe that new invocations of VOP_ACCESS() are always called with the lock held. o Some direct checks of the uid remain, largely associated with the QUOTA and SUIDDIR code. Reviewed by: eivind Obtained from: TrustedBSD Project
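The VADMIN idea, where ownership is expressed as one more right granted by the access routine, can be sketched as below. The flag values and the function are hypothetical; group permissions and superuser handling are left out, and the choice of EPERM for a failed administrative request is an assumption rather than something the commit states.

```c
#include <errno.h>

/* Hypothetical flag values; the kernel's bit assignments may differ. */
#define SK_VREAD	0x1
#define SK_VWRITE	0x2
#define SK_VADMIN	0x4

/*
 * Return 0 if every right in `wanted' is granted to cred_uid,
 * otherwise an errno value.
 */
static int
sketch_vaccess(unsigned file_uid, int owner_bits, int other_bits,
    unsigned cred_uid, int wanted)
{
	int granted;

	if (cred_uid == file_uid)
		granted = owner_bits | SK_VADMIN;	/* owner gains the admin right */
	else
		granted = other_bits;
	if ((wanted & granted) == wanted)
		return (0);
	return ((wanted & SK_VADMIN) ? EPERM : EACCES);
}
```

Callers that previously compared uids directly would instead request `SK_VADMIN`, so an alternative policy only has to replace this one routine.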