path: root/sys/ufs
Commit message | Author | Age | Files | Lines
* Rework the credential code to support larger values of NGROUPS and | brooks | 2009-06-19 | 1 | -0/+2
  NGROUPS_MAX, eliminate ABI dependencies on them, and raise them to 1024 and 1023 respectively. (Previously they were equal, but under a close reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it is the number of supplemental groups, not total number of groups.)

  The bulk of the change consists of converting the struct ucred member cr_groups from a static array to a pointer. Do the equivalent in kinfo_proc.

  Introduce new interfaces crcopysafe() and crsetgroups() for duplicating a process credential before modifying it and for setting group lists respectively. Both interfaces take care of the details of allocating the groups array. crsetgroups() takes care of truncating the group list to the current maximum (NGROUPS) if necessary. In the future, crsetgroups() may be responsible for ensuring invariants such as sorting the supplemental groups to allow groupmember() to be implemented as a binary search.

  Because we cannot change struct xucred without breaking application ABIs, we leave it alone and introduce a new XU_NGROUPS value which is always 16 and is to be used or NGRPS as appropriate for things such as NFS which need to use no more than 16 groups. When feasible, truncate the group list rather than generating an error.

  Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since they are immutable once referenced by more than one entity.

  Submitted by: Isilon Systems (initial implementation)
  X-MFC after: never
  PR: bin/113398 kern/133867
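The truncation behaviour and the possible future sorted-list invariant described in this entry can be sketched in userspace C. This is only an illustration under stated assumptions: `struct cred_model`, `cred_model_setgroups()` and `cred_model_groupmember()` are hypothetical names, not the kernel's `crsetgroups()` or `groupmember()`, and keeping entry 0 in place as the primary group is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int gid_model_t;       /* stand-in for gid_t */

/* Hypothetical credential whose group list is a pointer, not a fixed array. */
struct cred_model {
    gid_model_t *groups;
    int ngroups;
};

static int cmp_gid(const void *a, const void *b)
{
    gid_model_t x = *(const gid_model_t *)a, y = *(const gid_model_t *)b;
    return (x > y) - (x < y);
}

/* Copy at most max_groups entries (n >= 0); truncate rather than fail. */
static int cred_model_setgroups(struct cred_model *cr, const gid_model_t *src,
    int n, int max_groups)
{
    if (n > max_groups)
        n = max_groups;
    gid_model_t *copy = malloc((size_t)(n > 0 ? n : 1) * sizeof(*copy));
    if (copy == NULL)
        return -1;
    memcpy(copy, src, (size_t)n * sizeof(*copy));
    /* Assumption: entry 0 stays put; only the supplemental groups are sorted. */
    if (n > 1)
        qsort(copy + 1, (size_t)(n - 1), sizeof(*copy), cmp_gid);
    free(cr->groups);
    cr->groups = copy;
    cr->ngroups = n;
    return n;
}

/* Membership test: check entry 0, then binary-search the sorted remainder. */
static int cred_model_groupmember(const struct cred_model *cr, gid_model_t gid)
{
    if (cr->ngroups == 0)
        return 0;
    if (cr->groups[0] == gid)
        return 1;
    return bsearch(&gid, cr->groups + 1, (size_t)(cr->ngroups - 1),
        sizeof(gid), cmp_gid) != NULL;
}
```

A caller that passes five groups with a limit of four gets a four-entry list back instead of an error, which mirrors the "truncate rather than generating an error" policy stated above.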
* Keep dirhash tailq locked throughout the entirety of ufsdirhash_destroy() to fix | snb | 2009-06-17 | 1 | -11/+11
  a potential race pointed out by pjd. Also use TAILQ_FOREACH_SAFE to iterate over dirhashes in ufsdirhash_lowmem(), so that we can continue iterating even after a dirhash is destroyed.

  Suggested by: pjd
  Tested by: pho
  Approved by: dwmalone (mentor)
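The reason a "safe" iteration macro is needed can be shown with a hand-rolled list, so the sketch does not depend on `<sys/queue.h>`. The idea is the same one TAILQ_FOREACH_SAFE relies on: save the successor before the current element may be freed. The `struct node` type and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    int key;
    struct node *next;
};

/* Prepend a node; abort on allocation failure to keep the sketch short. */
static struct node *push(struct node *head, int key)
{
    struct node *n = malloc(sizeof(*n));
    if (n == NULL)
        abort();
    n->key = key;
    n->next = head;
    return n;
}

/* Free every node with an odd key while continuing to walk the list. */
static struct node *destroy_odd(struct node *head, int *removed)
{
    struct node **linkp = &head;
    struct node *n, *next;

    for (n = head; n != NULL; n = next) {
        next = n->next;         /* saved before n may be freed */
        if (n->key % 2 != 0) {
            *linkp = next;      /* unlink n from the list */
            free(n);
            (*removed)++;
        } else {
            linkp = &n->next;
        }
    }
    return head;
}
```

Reading `n->next` in the loop increment after `free(n)` would be a use-after-free, which is the hazard a plain foreach loop has when elements are destroyed during iteration.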
* Do not use casts (int *)0 and (struct thread *)0 for the arguments of | kib | 2009-06-16 | 2 | -2/+2
  vn_rdwr, use NULL.

  Reviewed by: jhb
  MFC after: 1 week
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC | rwatson | 2009-06-05 | 2 | -2/+0
  and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include.

  Discussed with: pjd
* Add vm_lowmem event handler for dirhash. This will cause dirhashes to be | snb | 2009-06-03 | 2 | -26/+109
  deleted when the system is low on memory. This ought to allow an increase to vfs.ufs.dirhash_maxmem on machines that have lots of memory, without degrading performance by having too much memory reserved for dirhash when other things need it. The default value for dirhash_maxmem is being kept at 2MB for now, though.

  This work was mostly done during the 2008 Google Summer of Code.

  Approved by: dwmalone (mentor), re
  MFC after: 3 months
* Handle lock recursion differently by always checking against LO_RECURSABLE | attilio | 2009-06-02 | 1 | -2/+2
  instead of the lock's own flag itself.

  Tested by: pho
* Add hierarchical jails. A jail may further virtualize its environment | jamie | 2009-05-27 | 1 | -1/+0
  by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings.

  Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge().

  Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call.

  Approved by: bz (mentor)
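The MIB-style naming convention mentioned in this entry lends itself to a small illustration. The sketch below only models the naming scheme (a child's name extends its parent's name after a dot); `jail_name_is_under()` is a hypothetical helper and says nothing about how the kernel actually tracks the jail hierarchy or visibility.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if `name` is `ancestor` itself or a dot-separated descendant
 * of it, e.g. "web.db" is under "web" but "webfarm" is not.
 */
static int jail_name_is_under(const char *name, const char *ancestor)
{
    size_t alen = strlen(ancestor);

    if (strncmp(name, ancestor, alen) != 0)
        return 0;
    /* The match must end on a component boundary. */
    return name[alen] == '\0' || name[alen] == '.';
}
```

The boundary check is the important detail: a plain prefix comparison would wrongly treat "webfarm" as a child of "web".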
* Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide | trasz | 2009-05-22 | 1 | -126/+162
  compatibility interfaces in both kernel and libc.

  Reviewed by: rwatson
* Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This | alc | 2009-05-17 | 1 | -8/+6
  eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

  In collaboration with: tegge
* Remove the thread argument from the FSD (File-System Dependent) parts of | attilio | 2009-05-11 | 4 | -20/+26
  the VFS. Now all the VFS_* functions and related parts don't want the context as long as it always refers to curthread.

  In some points, in particular when dealing with VOPs and functions living in the same namespace (e.g. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP.

  While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option.

  The VFS KPI is heavily changed by this commit, so third-party modules need to be recompiled. Bump __FreeBSD_version in order to signal this situation.
* Do not embed struct ucred into larger netcred parent structures. | kan | 2009-05-09 | 1 | -1/+0
  Credential might need to hang around longer than its parent and be used outside of mnt_explock scope controlling netcred lifetime. Use separate reference-counted ucred allocated separately instead.

  While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up some unused declarations in new NFS code.

  Reported by: John Hickey
  PR: kern/133439
  Reviewed by: dfr, kib
* Change the semantics of i_modrev/va_filerev to what is required for | rmacklem | 2009-04-27 | 3 | -6/+6
  the nfsv4 Change attribute. There are 2 changes:
  1 - The value now changes on metadata changes as well as data modifications (incremented for IN_CHANGE instead of IN_UPDATE).
  2 - It is now saved in spare space in the on-disk i-node so that it survives a crash.

  Since va_filerev is not passed out into user space, the only current use of va_filerev is in the nfs server, which uses it as the directory cookie verifier. Since this verifier is only passed back to the server by a client verbatim and then the server doesn't check it, changing the semantics should not break anything currently in FreeBSD.

  Reviewed by: bde
  Approved by: kib (mentor)
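The first of the two changes can be illustrated with a toy model of the flag test. Everything here is an assumption made for illustration: the `M_IN_*` values, `struct inode_model`, and the two helper functions are hypothetical, and the model assumes that a data write marks the inode with both flags while a metadata-only change (for example a mode change) marks only the change flag.

```c
#include <assert.h>
#include <stdint.h>

#define M_IN_CHANGE 0x2     /* illustrative value: inode (metadata) changed */
#define M_IN_UPDATE 0x4     /* illustrative value: file data modified */

struct inode_model {
    int flags;
    uint64_t modrev;        /* models i_modrev / va_filerev */
};

/* Old semantics: bump the revision only when data was modified. */
static void modrev_old(struct inode_model *ip)
{
    if (ip->flags & M_IN_UPDATE)
        ip->modrev++;
    ip->flags = 0;
}

/* New semantics: bump on any inode change, metadata included. */
static void modrev_new(struct inode_model *ip)
{
    if (ip->flags & M_IN_CHANGE)
        ip->modrev++;
    ip->flags = 0;
}
```

Under the old rule a metadata-only change leaves the revision untouched, which is why it could not serve as an NFSv4 Change attribute.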
* In ufs_checkpath(), recheck that '..' still points to the inode with | kib | 2009-04-20 | 3 | -41/+55
  the same inode number after VFS_VGET() and relock of the vp. If '..' changed, redo the lookup. To reduce code duplication, move the code to read '..' dirent into the static helper function ufs_dir_dd_ino().

  Supply the source inode number as an argument to ufs_checkpath() instead of the source inode itself. The inode is unlocked, thus it might be reclaimed, causing accesses to the freed memory.

  Use vn_vget_ino() to get the '..' vnode by its inode number, instead of directly coding VFS_VGET() and relock, to properly busy the mount point while vp lock is dropped.

  Noted and reviewed by: tegge
  Tested by: pho
  MFC after: 1 month
* When verifying '..' after VFS_VGET() in ufs_lookup(), do not return | kib | 2009-04-19 | 1 | -15/+17
  error if '..' is still there but changed between lookup and check. Start relookup instead. Rename is supposed to change '..' reference atomically, so transient failures introduced by r191137 are wrong.

  While rearranging the code to allow lookup restart in ufs_lookup(), remove the comment that only distracts the reader.

  Noted and reviewed by: tegge
  Also reported by: pho
  MFC after: 1 month
* Use acl_alloc() and acl_free() instead of using uma(9) directly. | trasz | 2009-04-18 | 1 | -19/+19
  This will make switching to malloc(9) easier; also, it would be necessary to add these routines if/when we implement variable-size ACLs.
* Verify that '..' still exists with the same inode number after | kib | 2009-04-16 | 1 | -9/+35
  VFS_VGET() has returned in ufs_lookup(). If the '..' lookup started immediately before the parent directory was removed, we might return either cleared or unrelated inode otherwise.

  Ufs_lookup() is split into new function ufs_lookup_() that either does lookup, or verifies that directory entry exists and references supplied inode number.

  Reviewed by: tegge
  Tested by: pho, Andreas Tobler <andreast-list fgznet ch> (previous version)
  MFC after: 1 month
* Remove VOP_LEASE and supporting functions. This hasn't been used since | rwatson | 2009-04-10 | 1 | -1/+0
  the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease-related code and interfaces.

  Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd.

  Proposed by: jeff
  Reviewed by: jeff
  Discussed with: rmacklem, zach loafman @ isilon
* When removing or renaming snapshot, do not delve into request_cleanup(). | kib | 2009-04-04 | 1 | -1/+1
  The latter may need blocks from the underlying device that belong to normal files, that should not be locked while snap lock is held.

  Reported and tested by: pho
  MFC after: 1 month
* Correct typo. | kib | 2009-03-27 | 1 | -2/+2
  Noted by: kensmith
* Fix two issues with bufdaemon, often causing the processes to hang in | kib | 2009-03-16 | 1 | -1/+4
  the "nbufkv" sleep.

  First, ffs background cg group block write requests a new buffer for the shadow copy. When ffs_bufwrite() is called from the bufdaemon due to buffers shortage, requesting the buffer deadlocks bufdaemon.

  Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk to not block while allocating the buffer, and return failure instead. Add a flag argument to the geteblk to allow to pass the flags to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer allocation failed and either GB_NOWAIT_BD is specified, or geteblk() is called from bufdaemon (or its helper, see below). In ffs_bufwrite(), fall back to synchronous cg block write if shadow block allocation failed.

  Since r107847, buffer write assumes that vnode owning the buffer is locked. The second problem is that buffer cache may accumulate many buffers belonging to limited number of vnodes. With such workload, quite often threads that own the mentioned vnodes locks are trying to read another block from the vnodes, and, due to buffer cache exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make any substantial progress because the vnodes are locked.

  Allow the threads owning vnode locks to help the bufdaemon by doing the flush pass over the buffer cache before getnewbuf() is going to uninterruptible sleep. Move the flushing code from buf_daemon() to new helper function buf_do_flush(), that is called from getnewbuf(). The number of buffers flushed by single call to buf_do_flush() from getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent recursive calls to buf_do_flush() by marking the bufdaemon and threads that temporarily help bufdaemon by TDP_BUFNEED flag.

  In collaboration with: pho
  Reviewed by: tegge (previous version)
  Tested by: glebius, yandex ...
  MFC after: 3 weeks
* The non-modifying EA VOPs are executed with only shared vnode lock taken. | kib | 2009-03-12 | 3 | -63/+94
  Provide a custom lock around initializing and tearing down EA area, to prevent both memory leaks and double-free of it. Count the number of EA area accessors.

  Lock protocol requires either holding exclusive vnode lock to modify i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.

  Noted by: YAMAMOTO, Taku <taku tackymt homeip net>
  Tested by: pho (previous version)
  MFC after: 2 weeks
* Do not double-free the struct inode when insmntque failed. Default | kib | 2009-03-11 | 1 | -1/+0
  insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory.

  Reviewed by: tegge
  MFC after: 3 days
* Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a | jhb | 2009-03-11 | 1 | -3/+7
  filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors.
  - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set.
  - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT.
  - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set.
  - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock.
  - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.)
  - Enable extended shared operations on UFS, cd9660, and UDF.

  Submitted by: ups
  Reviewed by: pjd (ZFS bits)
  MFC after: 1 month
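The conditions in the list above can be condensed into one predicate. This is a userspace sketch, not the kernel code: `leaf_lock_is_shared()` and its boolean parameters are hypothetical stand-ins for the mount flag check and the FIFO exclusion, and it uses the userland `O_*` open flags from `<fcntl.h>`.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Decide whether an open() could take the leaf vnode lock shared:
 * the request must be read-only without O_CREAT, the filesystem must
 * advertise extended shared operations, and the vnode must not be a FIFO.
 */
static bool leaf_lock_is_shared(int oflags, bool mnt_extended_shared,
    bool is_fifo)
{
    /* O_RDONLY is typically 0, so compare the access mode, do not AND it. */
    bool read_only = (oflags & O_ACCMODE) == O_RDONLY;
    bool creating = (oflags & O_CREAT) != 0;

    return read_only && !creating && mnt_extended_shared && !is_fifo;
}
```

Any case that fails the predicate falls back to an exclusive lock, which keeps the behaviour unchanged for filesystems that do not set the new flag.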
* Adjust some variables (mostly related to the buffer cache) that hold | jhb | 2009-03-09 | 2 | -3/+3
  address space sizes to be longs instead of ints. Specifically, the following values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for a nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow.

  Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require some gross shims to allow for that.

  MFC after: 1 month
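The overflow described in this entry is easy to reproduce on paper. The sketch below computes a buffer-space product in 64 bits and checks whether it would still fit in a 32-bit int. The per-buffer size of 65536 bytes is an assumption chosen for illustration; the commit message does not say which multiplier is involved, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes for nbuf buffers, computed in 64 bits so it cannot wrap here. */
static int64_t bufspace_bytes(int64_t nbuf, int64_t per_buf_bytes)
{
    return nbuf * per_buf_bytes;
}

/* Would the value survive being stored in a 32-bit signed int? */
static int fits_in_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

With the assumed 64 KB per buffer, 44000 buffers come to 2,883,584,000 bytes, which exceeds INT32_MAX (2,147,483,647), while 16000 buffers (1,048,576,000 bytes) still fit.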
* Right now, when trying to unmount a device that's already gone, | trasz | 2009-02-23 | 1 | -2/+2
  msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO. However, dounmount() treats ENXIO as a success and proceeds with unmounting. In effect, the filesystem gets unmounted without closing GEOM provider etc.

  Reviewed by: kib
  Approved by: rwatson (mentor)
  Tested by: dho
  Sponsored by: FreeBSD Foundation
* Refactor, moving error checking outside of the | trasz | 2009-02-23 | 1 | -7/+7
  'if (mp->mnt_flag & MNT_SOFTDEP)' conditional. No functional changes.

  Reviewed by: kib
  Approved by: rwatson (mentor)
  Tested by: pho
  Sponsored by: FreeBSD Foundation
* - If the g_access() call for the initial root mount fails, then fully | jhb | 2009-02-11 | 1 | -6/+6
  cleanup. Before, the GEOM consumer would not have been closed.
  - Bump the reference on the character device being mounted while the associated devfs vnode is locked.

  Reviewed by: kib
* When a device containing mounted UFS filesystem disappears, the type | trasz | 2009-02-06 | 1 | -4/+4
  of devvp becomes VBAD, which UFS incorrectly interprets as snapshot vnode, which in turn causes a panic. Fix it by replacing '!= VCHR' with '== VREG'.

  With this fix in place, you should no longer be able to panic the system by removing a device with a UFS filesystem mounted from it - assuming you don't use softupdates.

  Reviewed by: kib
  Tested by: pho
  Approved by: rwatson (mentor)
  Sponsored by: FreeBSD Foundation
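The difference between the two tests in this entry is a matter of negative versus positive matching. The following sketch uses a local model of the vnode types; the `M_` prefixed enum and both function names are hypothetical, and only the comparison operators are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Local model of a few vnode types; not the kernel's enum vtype. */
enum vtype_model { M_VNON, M_VREG, M_VDIR, M_VBLK, M_VCHR, M_VBAD };

/* Old test: anything that is not a character device counts as a snapshot. */
static bool is_snapshot_old(enum vtype_model t)
{
    return t != M_VCHR;
}

/* New test: only a regular file can be a snapshot. */
static bool is_snapshot_new(enum vtype_model t)
{
    return t == M_VREG;
}
```

Once the device vnode turns into the bad type, the old negative test misclassifies it as a snapshot, while the positive test rejects it.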
* Make sure the cdev doesn't go away while the filesystem is still mounted. | trasz | 2009-01-29 | 1 | -0/+3
  Otherwise dev2udev() could return garbage.

  Reviewed by: kib
  Approved by: rwatson (mentor)
  Sponsored by: FreeBSD Foundation
* Following a fair amount of real world experience with ACLs and | rwatson | 2009-01-27 | 5 | -39/+44
  extended attributes since FreeBSD 5, make the following semantic changes:
  - Don't update the inode modification time (mtime) when extended attributes (and hence also ACLs) are added, modified, or removed.
  - Don't update the inode access time (atime) when extended attributes (and hence also ACLs) are queried.

  This means that rsync (and related tools) won't improperly think that the data in the file has changed when only the ACL has changed.

  Note that ffs_reallocblks() has not been changed to not update on an IO_EXT transaction, but currently EAs don't use the cluster write routines so this shouldn't be a problem. If EAs grow support for clustering, then VOP_REALLOCBLKS() will need to grow a flag argument to carry down IO_EXT to UFS.

  MFC after: 1 week
  PR: ports/125739
  Reported by: Alexander Zagrebin <alexz@visp.ru>
  Tested by: pluknet <pluknet@gmail.com>, Greg Byshenk <freebsd@byshenk.net>
  Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>
* Fix a few style bogons. | jhb | 2009-01-21 | 1 | -2/+2
  Submitted by: bde
* Move the code from ufs_lookup.c used to do dotdot lookup, into | kib | 2009-01-21 | 1 | -22/+1
  the helper function. It is supposed to be useful for any filesystem that has to unlock dvp to walk to the ".." entry in lookup routine.

  Requested by: jhb
  Tested by: pho
  MFC after: 1 month
* Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: | jhb | 2009-01-21 | 1 | -11/+22
  VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature.

  Inspired by: ups
  Tested by: pho
* The r187467 should remove all pages for V_NORMAL case too, because | kib | 2009-01-20 | 1 | -8/+17
  indirect block pages are not removed by the mentioned invocation of the vnode_pager_setsize().

  Put a common code into the helper function ffs_pages_remove().

  Reported and tested by: dchagin
  Reviewed by: ups
  MFC after: 3 weeks
* Add a comment explaining why the "bufwait" / "dirhash" LOR reported by | jhb | 2009-01-20 | 1 | -0/+12
  WITNESS will not actually result in a deadlock.

  Discussed with: kib
  MFC after: 1 week
* When extending inode size, we call vnode_pager_setsize(), to have a | kib | 2009-01-20 | 2 | -2/+6
  address space where to put vnode pages, and then call UFS_BALLOC(), to actually allocate new block and map it. When UFS_BALLOC() returns error, sometimes we forget to revert the vm object size increase, allowing for the pages that are not backed by the logical disk blocks.

  Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for ffs_truncate() and ffs_write().

  PR: 129956
  Reviewed by: ups
  MFC after: 3 weeks
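The fix in this entry follows a common grow-then-roll-back pattern, sketched here with a toy model. `struct file_model`, `balloc_model()` and `extend_model()` are hypothetical; the assignment to `vm_size` merely stands in for the role vnode_pager_setsize() plays, and the allocator stands in for UFS_BALLOC().

```c
#include <assert.h>
#include <errno.h>

struct file_model {
    long vm_size;       /* size of the backing VM object */
    long blocks_left;   /* free blocks the toy allocator can hand out */
};

/* Toy block allocator: fails with ENOSPC once no blocks remain. */
static int balloc_model(struct file_model *f)
{
    if (f->blocks_left == 0)
        return ENOSPC;
    f->blocks_left--;
    return 0;
}

/* Grow the object first, then allocate; undo the growth on failure. */
static int extend_model(struct file_model *f, long new_size)
{
    long old_size = f->vm_size;
    int error;

    f->vm_size = new_size;
    error = balloc_model(f);
    if (error != 0)
        f->vm_size = old_size;  /* no pages without backing blocks */
    return error;
}
```

Without the rollback branch, a failed allocation would leave the object larger than the space actually backed by disk blocks, which is the inconsistency the commit describes.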
* FFS puts the extended attributes blocks at the negative blocks for the | kib | 2009-01-20 | 1 | -0/+9
  vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data.

  Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already.

  Reported by: csjp
  Reviewed by: ups
  MFC after: 3 weeks
* Lock the uepm_lock around the autostart of extattrs. | kib | 2009-01-08 | 1 | -15/+30
  Reported and tested by: pho
  Reviewed by: rwatson
  MFC after: 3 weeks
* If unmount of the ffs mp failed, reinitialize the extended attributes | kib | 2009-01-08 | 1 | -0/+14
  for the mp, and restart them if autostart is enabled.

  Reported and tested by: pho
  Reviewed by: rwatson
  MFC after: 3 weeks
* Do not busy twice the mount point where a quota operation is performed. | kib | 2008-12-18 | 1 | -4/+0
  Tested by: pho
  MFC after: 1 month
* According to phk@, VOP_STRATEGY should never, _ever_, return | trasz | 2008-12-16 | 1 | -1/+1
  anything other than 0. Make it so. This fixes "panic: VOP_STRATEGY failed bp=0xc320dd90 vp=0xc3b9f648", encountered when writing to an orphaned filesystem. Reason for the panic was the following assert:

  KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp));

  at vfs_bio:bufstrategy().

  Reviewed by: scottl, phk
  Approved by: rwatson (mentor)
  Sponsored by: FreeBSD Foundation
* The dqrele() function syncs the dq, then acquires the dqh lock, and then | kib | 2008-12-08 | 1 | -1/+13
  does final drop of the dq reference to put it onto the free list. There is a possibility that the dq would be found by another thread after sync and before the dqh lock is acquired. If that other thread drops the dq before we have taken the dqh lock, the dirty dq is put on the free list.

  Recheck the DQ_MOD after the dqh lock is relocked. Repeat dqsync() if the dq is dirty. This ensures that up to date dq is written in the quota file and fixes assertion in dqget().

  Reported and tested by: Frode Nordahl <frode nordahl net>
  MFC after: 3 days
* Improve usefulness of the panic by printing the pointer to the problematic | kib | 2008-12-07 | 1 | -1/+1
  dquot. In-tree gdb is often unable to get the dq value, so supply it in panic message.

  MFC after: 3 days
* Do not lock vnode interlock around reading of v_iflag to check VI_DOOMED. | kib | 2008-12-02 | 1 | -9/+2
  Read of the pointer is atomic, and flag cannot be set while vnode lock is held.

  Requested by: jhb
  MFC after: 1 month
* Busy ufs filesystem around block of code that does ".." lookup. Since | kib | 2008-11-22 | 1 | -1/+26
  mnt_lock is before lock of any vnode on the mp, it uses LK_NOWAIT. Since MNTK_UNMOUNT may be transient, pdp lock is dropped when vfs_busy() failed, and operation is retried after some time. This way, ffs_vget() is not called on the mp that may be in the process of being destroyed by unmount.

  Check for the VI_DOOMED flag on pdp after its lock is reacquired, to better detect some situations where directory containing ".." entry is removed during the lookup.

  Reviewed by: tegge, attilio (previous version)
  Tested by: pho
  MFC after: 1 month
* Fix typo. | jhb | 2008-11-19 | 1 | -1/+1
* For now, on every 10 cylinder groups, flush the buffer cache to free | ambrisko | 2008-11-13 | 1 | -0/+4
  up space. If the buffer cache fills up then the disk systems can grind to a halt. Better tuning can be figured out later.

  Tested by: Tim, others and work
  Reviewed by: Kostik Belousov
  PR: 128832
* Quiet a WITNESS warning with the dirhash sx locks by setting the DUPOK | jhb | 2008-11-04 | 1 | -1/+10
  flag. Specifically, if two threads race to create a dirhash for a directory, then one might already have created a private dirhash structure (and locked it) when it realizes the directory now has a structure and tries to lock that one.
* In UFS, when reading EA that contains ACL fails for some reason, include | trasz | 2008-11-04 | 1 | -2/+5
  inode number and filesystem name, so the administrator can fix the problem.

  Approved by: rwatson (mentor)
* Improve VFS locking: | attilio | 2008-11-02 | 2 | -2/+2
  - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters.
  - Due to the change above, remove the mnt_lock lockmgr because it is now useless.
  - Due to the change above, vfs_busy() is no longer linked to a lockmgr. So change its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT, which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr already), and MBF_MNTLSTLOCK, which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers).
  - The stub used in vfs_mount_destroy(), which allows overriding the mnt_ref if running for more than 3 seconds, makes it totally useless. Remove it, as it was thought to work in older versions. If a problem of "refcount held never going away" should appear, we will need to fix it properly instead of trusting such a hackish solution.
  - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter.
  - Remove the markercnt refcount as it is useless.

  This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly.

  Discussed with: kib
  Tested by: pho