summaryrefslogtreecommitdiffstats
path: root/sys/fs
Commit message (Collapse)AuthorAgeFilesLines
* MFC r295717:kib2016-02-241-0/+9
| | | | | | | | After nullfs rmdir operation, reclaim the directory vnode which was unlinked. Otherwise the vnode stays cached, causing leak. This is similar to r292961 for regular files. Approved by: re (marius)
* MFC r295574:markj2016-02-221-1/+4
| | | | | | Clear the cookie pointer on error in tmpfs_readdir(). Approved by: re (glebius)
* MFC r295616:pfg2016-02-171-5/+8
| | | | | | | | | | | | ext2fs: Remove panics for rename() race conditions. Sync with r84642 from UFS: The panics are inappropriate because the IN_RENAME flag only fixes a few of the huge number of race conditions that can result in the source path becoming invalid even prior to the VOP_RENAME() call. Approved by: re (glebius)
* MFC r294595:kib2016-02-141-0/+7
| | | | | | | | When devfs dirent is freed, a vnode might still keep a pointer to it, apparently. Interlock and clear the pointer to avoid free memory dereference. Approved by: re (marius)
* MFC r295209;pfg2016-02-061-5/+7
| | | | | | | | | | | Revert r294695; passthrough any extra timestamps to the dinode struct. The original ext2fs change worked fine on disks formated with default values, but it was the cause of a regression when inodes are small. Revert it for now, while we figure out safer ways pass such values, PR: 206820 Approved by: re
* MFC r294695:pfg2016-01-281-7/+5
| | | | | | | | | | | | | ext2fs: passthrough any extra timestamps to the dinode struct. In general we don't trust any of the extended timestamps unless the EXT2F_ROCOMPAT_EXTRA_ISIZE feature is set. However, in the case where we freshly allocated a new inode the information is valid and it is better to pass it along instead of leaving the value undefined. This should have no practical effect but should reduce the amount of garbage if EXT2F_ROCOMPAT_EXTRA_ISIZE is set, like in cases where the filesystem is converted from ext3 to ext4.
* MFC r294200: [PR 206224] bv_cnt is sometimes examined without holding therpokala2016-01-261-0/+5
| | | | | | | | | bufobj lock Add locking around access to bv_cnt which is currently being done unlocked Approved by: jhb Sponsored by: Panasas, Inc.
* MFC r293059:kib2016-01-231-3/+3
| | | | | Hide transient EBADF errors caused by the parallel revoke(2) or forced unmount of devfs mounts, by restarting the failed syscall.
* MFC 286974,291653:jhb2016-01-231-1/+1
| | | | | | | | | | | | | | | | 286974: Remove reference to non-existent kern_openat(9). 291653: The cdevpriv_dtr_t typedef was not able to be used in a function prototype like the various d_*_t typedefs since it declared a function pointer rather than a function. Add a new d_priv_dtor_t typedef that declares the function and can be used as a function prototype. The previous typedef wasn't useful outside of the cdevpriv implementation, so retire it. The name d_priv_dtor_t was chosen to be more consistent with cdev methods since it is commonly used in place of d_close_t even though it is not a direct pointer in struct cdevsw.
* Revert r294271:pfg2016-01-224-80/+34
| | | | | | | ext4: add support for reading sparse files Our older GCC can't handle anonymous unions, so ia64 and powerpc LINT kernels are now failing.
* MFC r293680pfg2016-01-184-34/+80
| | | | | | | | | | | | ext4: add support for reading sparse files Add support for sparse files in ext4. Also implement read-ahead, which greatly increases the performance when transferring files from ext4. The sparse file support has become very common in ext4. Both features implemented by Damjan Jovanovic. PR: 205816
* MFC r293679:ae2016-01-183-6/+8
| | | | | | | | | | | | Change the type of newsize argument in the smbfs_smb_setfsize() function from int to int64. MSDN says that SMB_SET_FILE_END_OF_FILE_INFO uses signed 64-bit integer to specify offset, but since smbfs_smb_setfsize() has used plain int, a value was truncated in case when offset was larger than 2G. https://msdn.microsoft.com/en-us/library/ff469975.aspx In particular, now `truncate -s 10G` will work correctly on the mounted SMB share.
* MFC r293683:pfg2016-01-141-1/+1
| | | | | | | | | | | ext4: mount panic from freeing invalid pointers Initialize the struct with those fields to zeroes on allocation, preventing the panic. Patch by: Damjan Jovanovic. PR: 206056
* MFC r293370:pfg2016-01-101-6/+13
| | | | | | | | | | ext2fs: reading mmaped file in Ext4 causes panic Always call brelse(path.ep_bp), fixing reading EXT4 files using mmap(). Patch by Damjan Jovanovic. PR: 205938
* MFC r283495:dchagin2016-01-091-0/+2
| | | | Hide vfs.pfs.trace variable if it is not used.
* To facillitate an upcoming Linuxulator merging partiallydchagin2016-01-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | MFC r275121 (by kib). Only merge the syntax changes from r275121, PROC_*LOCK() macros still lock the same proc spinlock. The process spin lock currently has the following distinct uses: - Threads lifetime cycle, in particular, counting of the threads in the process, and interlocking with process mutex and thread lock. The main reason of this is that turnstile locks are after thread locks, so you e.g. cannot unlock blockable mutex (think process mutex) while owning thread lock. - Virtual and profiling itimers, since the timers activation is done from the clock interrupt context. Replace the p_slock by p_itimmtx and PROC_ITIMLOCK(). - Profiling code (profil(2)), for similar reason. Replace the p_slock by p_profmtx and PROC_PROFLOCK(). - Resource usage accounting. Need for the spinlock there is subtle, my understanding is that spinlock blocks context switching for the current thread, which prevents td_runtime and similar fields from changing (updates are done at the mi_switch()). Replace the p_slock by p_statmtx and PROC_STATLOCK(). Discussed with: kib
* MFC r293042:kib2016-01-081-1/+1
| | | | Minor style cleanup.
* MFC r292872:pfg2016-01-061-0/+3
| | | | | | | | | | ext2: recognize ext4 INCOMPAT_RECOVER flag This is a flag specific for journalling in ext4. Add it to the list of ext4 features we ignore for read-only purposes. PR: 205668
* MFC r292961:kib2016-01-061-3/+5
| | | | | Force nullfs vnode reclaim after unlinking, to potentially unlink lower vnode.
* MFC of 291244, 291380, 291459, 291460, 291671, and 291743:mckusick2015-12-303-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This MFC includes changes to better manage the vnode freelist and to streamline the allocation and freeing of vnodes. Note that to maintain the KPI the VI_AGE flag is left defined in sys/vnode.h though its use is dropped as described in 291380. To maintain KBI the vfs.vlru_alloc_cache_src sysctl variable remains though it no longer has any effect as described in 291244. MFC of 291244: Move the comment about resident pages preventing vnode from leaving active list, into the header comment for vdrop(), which is the function that decides whether to leave the vnode on the list. Note that dirty page write-out in vinactive() is asynchronous. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC of 291380: Remove VI_AGE vnode iflag, it is unused. Noted by: bde Sponsored by: The FreeBSD Foundation MFC of 291459: For performance reasons, it is useful to have a single string used as the name of a filesystem when setting it as the first parameter to the getnewvnode() function. Most filesystems call getnewvnode from just one place so can use a literal string as the first parameter. However, NFS calls getnewvnode from two places, so we create a global constant string that can be used by the two instances. This change also collapses two instances of getnewvnode() in the UFS filesystem to a single call. Reviewed by: kib Tested by: Peter Holm MFC of 291460: As the kernel allocates and frees vnodes, it fully initializes them on every allocation and fully releases them on every free. These are not trivial costs: it starts by zeroing a large structure then initializes a mutex, a lock manager lock, an rw lock, four lists, and six pointers. And looking at vfs.vnodes_created, these operations are being done millions of times an hour on a busy machine. As a performance optimization, this code update uses the uma_init and uma_fini routines to do these initializations and cleanups only as the vnodes enter and leave the vnode_zone. With this change the initializations are only done kern.maxvnodes times at system startup and then only rarely again. The frees are done only if the vnode_zone shrinks which never happens in practice. For those curious about the avoided work, look at the vnode_init() and vnode_fini() functions in kern/vfs_subr.c to see the code that has been removed from the main vnode allocation/free path. Reviewed by: kib Tested by: Peter Holm MFC of 291671: We need to zero out the union of pointers in a freed vnode structure. Fix from: Mateusz Guzik Tested by: Jason Unovitch MFC of 291743: We need to zero out the clustering variables in a freed vnode structure. For completeness add a VNASSERT that there are no threads waiting on a range lock (this was previously checked on every vnode free). Reported by; Rick Macklem Fix from: Mateusz Guzik
* MFC r292621:kib2015-12-291-7/+14
| | | | | Keep devfs mount locked for the whole duration of the devfs_setattr(), and ensure that our dirent is instantiated.
* MFC: r291638rmacklem2015-12-163-0/+52
| | | | | | | Fix the memory leak that occurs when the nfscommon.ko module is unloaded. This leak was introduced by r291527 (r292223 in stable/10). Since the nfscommon.ko module is rarely unloaded, this leak would not have been much of an issue.
* MFC: r291527rmacklem2015-12-146-220/+547
| | | | | | | | | | | | | | | | Add kernel support to the NFS server for the "-manage-gids" option that will be added to the nfsuserd daemon in a future commit. It modifies the cache used by NFSv4 for name<-->id translation (both username/uid and group/gid) to support this. When "-manage-gids" is set, the server looks up each uid for the RPC and uses the list of groups cached in the server instead of the list of groups provided in the RPC request. The cached group list is acquired for the cache by the nfsuserd daemon via getgrouplist(3). This avoids the 16 groups limit for the list in the RPC request. Since the cache is now used for every RPC when "-manage-gids" is enabled, the code also modifies the cache to use a separate mutex for each hash list instead of a single global mutex.
* MFC: r291150rmacklem2015-12-053-7/+50
| | | | | | | | | | | | | When the nfsd threads are terminated, the NFSv4 server state (opens, locks, etc) is retained, which I believe is correct behaviour. However, for NFSv4.1, the server also retained a reference to the xprt (RPC transport socket structure) for the backchannel. This caused svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed a socket upcall to occur after the mutexes in the svcpool were destroyed, causing a crash. This patch fixes the code so that the backchannel xprt structure is dereferenced just before svcpool_destroy() is called, so the code does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.
* MFC: r291117rmacklem2015-12-051-0/+38
| | | | | | | | Revert r283330 since it broke directory caching in the client. At this time I cannot see a way to fix directory caching when it has partial blocks in the buffer cache, due to the fact that the syscall's uio_offset won't stay the same as the lblkno * NFS_DIRBLKSIZ offset.
* MFC: r290970rmacklem2015-12-011-1/+3
| | | | | | | | | | mnt_stat.f_iosize (which is used to set bo_bsize) must be set to the largest size of buffer cache block or the mapping of the buffer is bogus. When a mount with rsize=4096,wsize=4096 was done, f_iosize would be set to 4096. This resulted in corrupted directory data, since the buffer cache block size for directories is NFS_DIRBLKSIZ (8192). This patch fixes the code so that it always sets f_iosize to at least NFS_DIRBLKSIZ.
* MFC r287033:trasz2015-10-181-1/+2
| | | | | | | After r286237 it should be fine to call vgone(9) on a busy GEOM vnode; remove KASSERT that would prevent forced devfs unmount from working. Sponsored by: The FreeBSD Foundation
* MFC r283924vangyzen2015-10-021-4/+10
| | | | | | | | | | | | | | | Provide vnode in memory map info for files on tmpfs When providing memory map information to userland, populate the vnode pointer for tmpfs files. Set the memory mapping to appear as a vnode type, to match FreeBSD 9 behavior. This fixes the use of tmpfs files with the dtrace pid provider, procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY). Submitted by: Eric Badger <eric@badgerio.us> (initial revision) Obtained from: Dell Inc. PR: 198431
* MFC r288044:kib2015-09-271-2/+9
| | | | | Ensure that when a blockable open of fifo returns success, a valid file descriptor opened for complimentary access exists as well.
* MFC 283281,283282,283562,283647,283836,284000,286158:jhb2015-09-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Various fixes to orphan handling which also fix issues with following forks. 283281: Always set p_oppid when attaching to an existing process via procfs tracing. This matches the behavior of ptrace(PT_ATTACH). Also, the procfs detach request assumes p_oppid is always set. 283282: Only reparent a traced process to its old parent if the tracing process is not the old parent. Otherwise, proc_reap() will leave the zombie in place resulting in the process' status being returned twice to its parent. Add test cases for PT_TRACE_ME and PT_ATTACH which are fixed by this change. 283562: Do not allow a process to reap an orphan (a child currently being traced by another process such as a debugger). The parent process does need to check for matching orphan pids to avoid returning ECHILD if an orphan has exited, but it should not return the exited status for the child until after the debugger has detached from the orphan process either explicitly or implicitly via wait(). Add two tests for for this case: one where the debugger is the direct child (thus the parent has a non-empty children list) and one where the debugger is not a direct child (so the only "child" of the parent is the orphan). 283647: Tweak the description of when waitpid() doesn't return any status for a non-blocking wait to avoid the word "empty". 283836: Consistently only use one end of the pipe in the parent and debugger processes and do not rely on EOF due to a close() in the debugger. 284000: Add a CHILD_REQUIRE macro similar to ATF_REQUIRE for use in child processes of the main test process. 286158: Clear P_TRACED before reparenting a detached process back to its original parent. Otherwise the debugee will be set as an orphan of the debugger. Add tests for tracing forks via PT_FOLLOW_FORK.
* MFC: r286790rmacklem2015-08-281-2/+5
| | | | | | | | For the case where an NFSv4.1 ExchangeID operation has the client identifier that already has a confirmed ClientID, the nfsrv_setclient() function would not fill in the clientidp being returned. As such, the value of ClientID returned would be whatever garbage was on the stack. This patch fixes the problem by filling in these fields.
* MFC: r286046rmacklem2015-08-011-1/+2
| | | | | | | | This patch fixes a problem where, if the NFSv4 server has a previous unconfirmed clientid structure for the same client on the last hash list, this old entry would not be removed/deleted. I do not think this bug would have caused serious problems, since the new entry would have been before the old one on the list. This old entry would have eventually been scavenged/removed.
* MFC: r285113rmacklem2015-07-311-2/+6
| | | | | | | | | If a "principal" argument isn't provided for a Kerberized NFS mount, the kernel would generate a bogus one with a ":/<path>" suffix. This would only occur for the case where there was no explicit "principal" argument and the getaddrinfo() call in mount_nfs.c failed to a return a cannonical name for the server. This patch fixes this unusual case.
* MFC: r285066rmacklem2015-07-271-3/+20
| | | | | | | | | | | Alex Burlyga reported a POLA violation for the new NFS client as compared to the old NFS client via email to the freebsd-fs@ mailing list. For the new client, when multiple clients attempted to create a symbolic link concurrently, more that one client would report success instead of EEXIST. This was caused by code in the new client that mapped EEXIST to OK assuming it was caused by a retried RPC request. Since the old client did not do this, the patch defaults to the old behaviour and permits the new behaviour to be enabled via a sysctl.
* MFC r284594:kib2015-06-261-0/+1
| | | | Restore the td_cookie value upon detach.
* MFC r283889,r283891:delphij2015-06-151-0/+1
| | | | | | | | | | | | Clear p_stops when doing PT_DETACH and PROCFS_CTL_DETACH. Without this, if a process was being traced by truss(1), which uses different p_stops bits than gdb(1), the latter would misbehave because of the unexpected bits. Reported by: jceel Submitted by: sef Sponsored by: iXsystems, Inc.
* MFC: r283753rmacklem2015-06-121-5/+12
| | | | | | Make the NFS server use shared vnode locks for a few cases that are allowed by the VFS/VOP interface instead of using exclusive locks.
* Add TUNABLE_INT() macros so that the tunables MFC'd fromrmacklem2015-06-121-0/+5
| | | | | | head as r284216 are set via /boot/loader.conf in stable/10. This is a direct commit to stable/10 because TUNABLE_INT() is deprecated in head.
* MFC: r283635rmacklem2015-06-108-56/+105
| | | | | | | | | | Make the size of the hash tables used by the NFSv4 server tunable. No appreciable change in performance was observed after increasing the sizes of these tables and then testing with a single client. However, there was an email that indicated high CPU overheads for a heavily loaded NFSv4 and it is hoped that increasing the sizes of the hash tables via these tunables might help. The tables remain the same size by default.
* MFC r283600:kib2015-06-101-0/+10
| | | | | | | | Perform SU cleanup in the AST handler. Do not sleep waiting for SU cleanup while owning vnode lock. On MFC, for KBI stability, td_su member was moved to the end of the struct thread.
* MFC: r283330rmacklem2015-06-061-38/+0
| | | | | | | | | | | | The NFS client generated directory block(s) with d_fileno == 0 so that it would not return less data than requested. Since returning less directory data than requested is not a problem for FreeBSD and even UFS no longer returns directory structures with d_fileno == 0, this patch stops the client from doing this. Although entries with d_fileno == 0 should not be a problem, the man pages no longer document that these entries should be ignored, so there was a concern that these entries might be an issue in the future.
* MFC: r283273rmacklem2015-06-041-1/+15
| | | | | | | | The NFS client wasn't handling getdirentries(2) requests for sizes that are not an exact multiple of DIRBLKSIZ correctly. Fortunately readdir(3) always uses an exact multiple of DIRBLKSIZ, so few applications were affected. This patch fixes this problem by reducing the size of the directory read to an exact multiple of DIRBLKSIZ.
* MFC r282881: Do not promote large async writes to sync.mav2015-05-241-34/+15
| | | | | | | | | Present implementation of large sync writes is too strict and so can be quite slow. Instead of doing that, execute large async write in chunks, syncing each chunk separately. It would be good to fix large sync writes too, but I leave it to somebody with more skills in this area.
* MFC r279536:trasz2015-05-153-1/+29
| | | | | | | | | | Make fuse(4) respect FOPEN_DIRECT_IO. This is required for correct operation of GlusterFS. PR: 192701 Submitted by: harsha at harshavardhana.net Reviewed by: kib@ Sponsored by: The FreeBSD Foundation
* MFC: r281960rmacklem2015-05-142-5/+5
| | | | | | | | | | | | | MAXBSIZE defines both the largest UFS block size and the largest size for a buffer in the buffer cache. This patch defines a new constant MAXBCACHEBUF, which is the largest size for a buffer in the buffer cache. Having a separate constant allows MAXBCACHEBUF to be set larger than MAXBSIZE on a per-architecture basis, so that NFS can do larger read/writes for these architectures. It modifies sys/param.h so that BKVASIZE can also be set on a per-architecture basis. A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE is fixed as well.
* MFC r281738:mav2015-05-031-4/+7
| | | | | | | | | | | | | | | | | | Change wcommitsize default from one empirical value to another. The new value is more predictable with growing RAM size: hibufspace maxvnodes old new i386: 256MB 32980992 15800 2198732 2097152 2GB 94027776 107677 878764 4194304 amd64: 256MB 32980992 15800 2198732 2097152 1GB 114114560 68062 1678155 4194304 4GB 217055232 111807 1955452 4194304 16GB 1717846016 337308 5097465 16777216 64GB 1734918144 1164427 1490479 16777216 256GB 1734918144 4426453 391983 16777216
* MFC: r281962rmacklem2015-05-021-1/+2
| | | | | | Fix the NFS server's handling of a bogus NFSv2 ROOT RPC. The ROOT RPC is deprecated in the NFSv2 RFC, RFC-1094 and should never be used by a client.
* MFC: r281628rmacklem2015-04-304-7/+18
| | | | | | | | | | | mav@ has found that NFS servers exporting ZFS file systems can perform better when using a 128K read/write data size. This patch changes NFS_MAXDATA from 64K to 128K so that clients can use 128K for NFS mounts to allow this. The patch also renames NFS_MAXDATA to NFS_SRVMAXIO so that it is clear that it applies to the NFS server side only. It also avoids a name conflict with the NFS_MAXDATA defined in rpcsvc/nfs_prot.h, that is used for userland RPC.
* MFC: r281562rmacklem2015-04-307-3/+13
| | | | | | | | | | | File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache.
* MFC r281756:pfg2015-04-241-1/+3
| | | | | | | | nfsrpc_createv4: fix double free. Reported by: Oliver Pinter, clang static checker Obtained from: HardenedBSD (63cac77c42c0c3fc67da62f97d5ab651d52ae707) Reviewed by: rmacklem
OpenPOWER on IntegriCloud