summaryrefslogtreecommitdiffstats
path: root/sys/kern
Commit message (Collapse)AuthorAgeFilesLines
* MFC r258893, r258956:cperciva2014-01-081-4/+9
| | | | | | | Add a new sysctl / loader tunable kern.panic_reboot_wait_time which defaults to PANIC_REBOOT_WAIT_TIME (a long-existing kernel config setting). Use this now-variable value in place of the defined constant to control how long the system waits after a panic before rebooting.
* MFC r260232:mjg2014-01-071-6/+1
| | | | | | | | Don't check for fd limits in fdgrowtable_exp. Callers do that already and additional check races with process decreasing limits and can result in not growing the table at all, which is currently not handled.
* MFC Alexander Motin's direct dispatch, multi-queue, and finer-grainedscottl2014-01-071-27/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | locking support for CAM r256826: Fix several target mode SIMs to not blindly clear ccb_h.flags field of ATIO CCBs. Not all CCB flags there belong to them. r256836: Remove hard limit on number of BIOs handled with one ATA TRIM request. r256843: Merge CAM locking changes from the projects/camlock branch to radically reduce lock congestion and improve SMP scalability of the SCSI/ATA stack, preparing the ground for the coming next GEOM direct dispatch support. r256888: Unconditionally acquire periph reference on CCB allocation failure. r256895: Fix memory and references leak due to unfreed path. r256960: Move CAM_UNQUEUED_INDEX setting to the last moment and under the periph lock. This fixes race condition with cam_periph_ccbwait(), causing use-after-free. r256975: Minor (mostly cosmetical) addition to r256960. r257054: Some microoptimizations for da and ada drivers: - Replace ordered_tag_count counter with single flag; - From da remove outstanding_cmds counter, duplicating pending_ccbs list; - From da_softc remove unused links field. r257482: Fix lock recursion, triggered by `smartctl -a /dev/adaX`. r257501: Make getenv_*() functions and respectively TUNABLE_*_FETCH() macros not allocate memory and so not require sleepable environment. getenv() has already used on-stack temporary storage, so just use it more rationally. getenv_string() receives buffer as argument, so don't need another one. r257914: Some CAM locks polishing: - Fix LOR and possible lock recursion when handling high-power commands. Introduce new lock to protect left power quota and list of frozen devices. - Correct locking around xpt periph creation. - Remove seems never used XPT_FLAG_OPEN xpt periph flag. Again, Netflix assisted with testing the merge, but all of the credit goes to Alexander and iX Systems. Submitted by: mav Sponsored by: iX Systems
* MFC Alexander Motin's GEOM direct dispatch work:scottl2014-01-072-12/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | r256603: Introduce new function devstat_end_transaction_bio_bt(), adding new argument to specify present time. Use this function to move binuptime() out of lock, substantially reducing lock congestion when slow timecounter is used. r256606: Move g_io_deliver() out of the lock, as required for direct dispatch. Move g_destroy_bio() out too to reduce lock scope even more. r256607: Fix passing uninitialized bio_resid argument to g_trace(). r256610: Add unmapped I/O support to GEOM RAID. r256830: Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping temporary mapped buffer. That fixes double unmap if biodone() called twice for the same BIO (but with different done methods). r256880: Merge GEOM direct dispatch changes from the projects/camlock branch. When safety requirements are met, it allows to avoid passing I/O requests to GEOM g_up/g_down thread, executing them directly in the caller context. That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid several context switches per I/O. r259247: Fix bug introduced at r256607. We have to recalculate bp_resid here since sizes of original and completed requests may differ due to end of media. Testing of the stable/10 merge was done by Netflix, but all of the credit goes to Alexander and iX Systems. Submitted by: mav Sponsored by: iX Systems
* MFC r256885:mav2014-01-051-5/+3
| | | | | Remove global device lock acquisition from dev_relthread(), replacing it with atomics on per-device data.
* MFC r256614:mav2014-01-051-10/+9
| | | | | | | - Take BIO lock in biodone() only when there is no completion callback set and so we should wake up thread waiting in biowait(). - Remove msleep() timeout from biowait(). It was added 11 years ago, when there was no locks used, and it should not be needed any more.
* MFC r259232:mav2014-01-041-9/+17
| | | | | | | | | | | | | | | | | | | | | Create own free list for each of the first 32 possible allocation sizes. In case of 4K allocation quantum that means for allocations up to 128K. With growth of memory fragmentation these lists may grow to quite a large sizes (tenths and hundreds of thousands items). Having in one list items of different sizes in worst case may require full linear list traversal, that may be very expensive. Having lists for items of single size means that unless user specify some alignment or border requirements (that are very rare cases) first item found on the list should satisfy the request. While running SPEC NFS benchmark on top of ZFS on 24-core machine with 84GB RAM this change reduces CPU time spent in vmem_xalloc() from 8% and lock congestion spinning around it from 20% to invisible levels. And that all is by the cost of just 26 more pointers per vmem instance. If at some point our kernel will start to actively use KVA allocations with odd sizes above 128K, something may need to be done to bigger lists also.
* MFC r259464:mav2014-01-031-1/+2
| | | | Fix periodic per-CPU timers startup on boot.
* MFC r259953:kib2014-01-031-4/+10
| | | | Fix accounting for the negative cache entries when reusing v_cache_dd.
* MFC r258281: Fix siginfo_t.si_status for wait6/waitid/SIGCHLD.jilles2014-01-012-12/+18
| | | | | | | | | | | | | Per POSIX, si_status should contain the value passed to exit() for si_code==CLD_EXITED and the signal number for other si_code. This was incorrect for CLD_EXITED and CLD_DUMPED. This is still not fully POSIX-compliant (Austin group issue #594 says that the full value passed to exit() shall be returned via si_status, not just the low 8 bits) but is sufficient for a si_status-related test in libnih (upstart, Debian/kFreeBSD). PR: kern/184002
* MFC r259892:dim2013-12-281-7/+0
| | | | | In sys/kern/vfs_mountroot.c, remove static function parse_isspace(), which is unused since r214006.
* MFC r259876:dim2013-12-281-7/+0
| | | | | In sys/kern/subr_witness.c, remove static function witness_lock_order_key_empty(), which is unused since r181695.
* MFC r259875:dim2013-12-281-24/+0
| | | | | In sys/kern/sched_ule.c, remove static function sched_both(), which is unused since r232207.
* MFC r259520:ae2013-12-241-1/+1
| | | | Fix copy/paste typo.
* MFC 258869:jhb2013-12-241-1/+1
| | | | | | | Fix an off-by-one error in r228960. The maximum priority delta provided by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting priority value (before nice adjustment) is between SCHED_PRI_MIN and SCHED_PRI_MAX, inclusive.
* MFC r259522:kib2013-12-241-0/+3
| | | | | | If vn_open_vnode() succeeded in opening the vnode, but subsequent advisory lock cannot be obtained, prevent double-close of the vnode in vn_close() called from the fdrop(), by resetting file' f_ops methods.
* MFC r257228:kib2013-12-171-8/+17
| | | | | Add bus_dmamap_load_ma() function to load map with the array of vm_pages.
* MFC r258039:kib2013-12-171-10/+9
| | | | | | | | Avoid overflow for the page counts. MFC r258365: Revert back to use int for the page counts. Rearrange the checks to correctly handle overflowing address arithmetic.
* MFC r258819,258928:nwhitehorn2013-12-161-0/+7
| | | | | | | | | | | | | | | | | | Add new sysctl, kern.supported_archs, containing the list of FreeBSD MACHINE_ARCH values whose binaries this kernel can run. This patch provides a feature requested for implementing pkgng ABI identifiers in a robust way. The list is designed to indicate whether, say, an i386 package can be run on the current system. If kern.supported_abis contains "i386", then the answer is yes. Otherwise, the answer is no. At the moment, this only supports MACHINE_ARCH and MACHINE_ARCH32. As we gain support for more interesting combinations, this needs to become more flexible, possibily through the sysent framework, along with the hw.machine_arch emulation immediately preceding this code in kern_mib.c. Reviewed by: imp
* MFC r257899:kib2013-12-131-1/+1
| | | | | If filesystem declares that it supports shared locking for writes, use shared vnode lock for VOP_PUTPAGES() as well.
* MFC r257898:kib2013-12-132-9/+4
| | | | | Change VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and MNTK_EXTENDED_SHARED tests into inline functions.
* MFC: r258718: fix emulated jail_v0 byte orderpeter2013-12-041-1/+1
| | | | Approved by: re (gjb)
* MFC r258661:kib2013-12-031-0/+58
| | | | | | | Add sysctl KERN_PROC_SIGTRAMP to retrieve signal trampoline location for the given process. Approved by: re (gjb)
* Merge r258128 from head:glebius2013-11-221-0/+1
| | | | | | Fix a very bad typo from r248887. Approved by: re (kib)
* MFC r258148,r258149,r258150,r258152,r258153,r258154,r258181,r258182:pjd2013-11-184-14/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | r258148: Add a note that this file is compiled as part of the kernel and libc. Requested by: kib r258149: Change cap_rights_merge(3) and cap_rights_remove(3) to return pointer to the destination cap_rights_t structure. This already matches manual page. r258150: Sync return value with actual implementation. r258151: Style. r258152: Precisely document capability rights here too (they are already documented in rights(4)). r258153: The CAP_LINKAT, CAP_MKDIRAT, CAP_MKFIFOAT, CAP_MKNODAT, CAP_RENAMEAT, CAP_SYMLINKAT and CAP_UNLINKAT capability rights make no sense without the CAP_LOOKUP right, so include this rights. r258154: - Move CAP_EXTATTR_* and CAP_ACL_* rights to index 1 to have more room in index 0 for the future. - Move CAP_BINDAT and CAP_CONNECTAT rights to index 0 so we can include CAP_LOOKUP right in them. - Shuffle the bits around so there are no gaps. This is last chance to do that as all moved rights are not used yet. r258181: Replace CAP_POLL_EVENT and CAP_POST_EVENT capability rights (which I had a very hard time to fully understand) with much more intuitive rights: CAP_EVENT - when set on descriptor, the descriptor can be monitored with syscalls like select(2), poll(2), kevent(2). CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2) syscall can be called on this kqueue to with the eventlist argument set to non-NULL value; in other words the given kqueue descriptor can be used to monitor other descriptors. CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2) syscall can be called on this kqueue to with the changelist argument set to non-NULL value; in other words it allows to modify events monitored with the given kqueue descriptor. Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and CAP_KQUEUE_CHANGE. Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT. r258182: Correct right names. Sponsored by: The FreeBSD Foundation Approved by: re (kib)
* Merge r257996,r258001,r258069 from head: fixes for HyperV guest.pluknet2013-11-141-0/+2
| | | | | | | | - Set description string for VM_GUEST_HV (HyperV guest). - Add a brief comment about VM_GUEST and vm_guest_sysctl_names relationship. - CTASSERT that vm_guest range is covered by vm_guest_sysctl_names. Approved by: re (glebius)
* MFC r257214:kib2013-11-031-0/+2
| | | | | | Inform about the kdb re-entry. Approved by: re (gjb)
* MFC r257221:kib2013-10-311-1/+1
| | | | | | Fix typo. Approved by: re (glebius)
* MFC r256847:kib2013-10-291-1/+2
| | | | | | Print more useful information about the transfer that trigger the assertion. Approved by: re (glebius)
* MFC r256504:kib2013-10-251-0/+14
| | | | | | | | Add a sysctl kern.disallow_high_osrel which disables executing the images compiled on the world with higher major version number than the high version number of the booted kernel. Default to disable. Approved by: re (glebius)
* MFC r256502:kib2013-10-251-0/+4
| | | | | | | | Similar to debug.iosize_max_clamp sysctl, introduce devfs_iosize_max_clamp sysctl, which allows/disables SSIZE_MAX-sized i/o requests on the devfs files. Approved by: re (glebius)
* Merge from project branch via main. Uninteresting commits are trimmed.markm2013-10-122-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor of /dev/random device. Main points include: * Userland seeding is no longer used. This auto-seeds at boot time on PC/Desktop setups; this may need some tweeking and intelligence from those folks setting up embedded boxes, but the work is believed to be minimal. * An entropy cache is written to /entropy (even during installation) and the kernel uses this at next boot. * An entropy file written to /boot/entropy can be loaded by loader(8) * Hardware sources such as rdrand are fed into Yarrow, and are no longer available raw. ------------------------------------------------------------------------ r256240 | des | 2013-10-09 21:14:16 +0100 (Wed, 09 Oct 2013) | 4 lines Add a RANDOM_RWFILE option and hide the entropy cache code behind it. Rename YARROW_RNG and FORTUNA_RNG to RANDOM_YARROW and RANDOM_FORTUNA. Add the RANDOM_* options to LINT. ------------------------------------------------------------------------ r256239 | des | 2013-10-09 21:12:59 +0100 (Wed, 09 Oct 2013) | 2 lines Define RANDOM_PURE_RNDTEST for rndtest(4). ------------------------------------------------------------------------ r256204 | des | 2013-10-09 18:51:38 +0100 (Wed, 09 Oct 2013) | 2 lines staticize struct random_hardware_source ------------------------------------------------------------------------ r256203 | markm | 2013-10-09 18:50:36 +0100 (Wed, 09 Oct 2013) | 2 lines Wrap some policy-rich code in 'if NOTYET' until we can thresh out what it really needs to do. ------------------------------------------------------------------------ r256184 | des | 2013-10-09 10:13:12 +0100 (Wed, 09 Oct 2013) | 2 lines Re-add /dev/urandom for compatibility purposes. ------------------------------------------------------------------------ r256182 | des | 2013-10-09 10:11:14 +0100 (Wed, 09 Oct 2013) | 3 lines Add missing include guards and move the existing ones out of the implementation namespace. ------------------------------------------------------------------------ r256168 | markm | 2013-10-08 23:14:07 +0100 (Tue, 08 Oct 2013) | 10 lines Fix some just-noticed problems: o Allow this to work with "nodevice random" by fixing where the MALLOC pool is defined. o Fix the explicit reseed code. This was correct as submitted, but in the project branch doesn't need to set the "seeded" bit as this is done correctly in the "unblock" function. o Remove some debug ifdeffing. o Adjust comments. ------------------------------------------------------------------------ r256159 | markm | 2013-10-08 19:48:11 +0100 (Tue, 08 Oct 2013) | 6 lines Time to eat crow for me. I replaced the sx_* locks that Arthur used with regular mutexes; this turned out the be the wrong thing to do as the locks need to be sleepable. Revert this folly. # Submitted by: Arthur Mesh <arthurmesh@gmail.com> (In original diff) ------------------------------------------------------------------------ r256138 | des | 2013-10-08 12:05:26 +0100 (Tue, 08 Oct 2013) | 10 lines Add YARROW_RNG and FORTUNA_RNG to sys/conf/options. Add a SYSINIT that forces a reseed during proc0 setup, which happens fairly late in the boot process. Add a RANDOM_DEBUG option which enables some debugging printf()s. Add a new RANDOM_ATTACH entropy source which harvests entropy from the get_cyclecount() delta across each call to a device attach method. ------------------------------------------------------------------------ r256135 | markm | 2013-10-08 07:54:52 +0100 (Tue, 08 Oct 2013) | 8 lines Debugging. My attempt at EVENTHANDLER(multiuser) was a failure; use EVENTHANDLER(mountroot) instead. This means we can't count on /var being present, so something will need to be done about harvesting /var/db/entropy/... . Some policy now needs to be sorted out, and a pre-sync cache needs to be written, but apart from that we are now ready to go. Over to review. ------------------------------------------------------------------------ r256094 | markm | 2013-10-06 23:45:02 +0100 (Sun, 06 Oct 2013) | 8 lines Snapshot. Looking pretty good; this mostly works now. New code includes: * Read cached entropy at startup, both from files and from loader(8) preloaded entropy. Failures are soft, but announced. Untested. * Use EVENTHANDLER to do above just before we go multiuser. Untested. ------------------------------------------------------------------------ r256088 | markm | 2013-10-06 14:01:42 +0100 (Sun, 06 Oct 2013) | 2 lines Fix up the man page for random(4). This mainly removes no-longer-relevant details about HW RNGs, reseeding explicitly and user-supplied entropy. ------------------------------------------------------------------------ r256087 | markm | 2013-10-06 13:43:42 +0100 (Sun, 06 Oct 2013) | 6 lines As userland writing to /dev/random is no more, remove the "better than nothing" bootstrap mode. Add SWI harvesting to the mix. My box seeds Yarrow by itself in a few seconds! YMMV; more to follow. ------------------------------------------------------------------------ r256086 | markm | 2013-10-06 13:40:32 +0100 (Sun, 06 Oct 2013) | 11 lines Debug run. This now works, except that the "live" sources haven't been tested. With all sources turned on, this unlocks itself in a couple of seconds! That is no my box, and there is no guarantee that this will be the case everywhere. * Cut debug prints. * Use the same locks/mutexes all the way through. * Be a tad more conservative about entropy estimates. ------------------------------------------------------------------------ r256084 | markm | 2013-10-06 13:35:29 +0100 (Sun, 06 Oct 2013) | 5 lines Don't use the "real" assembler mnemonics; older compilers may not understand them (like when building CURRENT on 9.x). # Submitted by: Konstantin Belousov <kostikbel@gmail.com> ------------------------------------------------------------------------ r256081 | markm | 2013-10-06 10:55:28 +0100 (Sun, 06 Oct 2013) | 12 lines SNAPSHOT. Simplify the malloc pools; We only need one for this device. Simplify the harvest queue. Marginally improve the entropy pool hashing, making it a bit faster in the process. Connect up the hardware "live" source harvesting. This is simplistic for now, and will need to be made rate-adaptive. All of the above passes a compile test but needs to be debugged. ------------------------------------------------------------------------ r256042 | markm | 2013-10-04 07:55:06 +0100 (Fri, 04 Oct 2013) | 25 lines Snapshot. This passes the build test, but has not yet been finished or debugged. Contains: * Refactor the hardware RNG CPU instruction sources to feed into the software mixer. This is unfinished. The actual harvesting needs to be sorted out. Modified by me (see below). * Remove 'frac' parameter from random_harvest(). This was never used and adds extra code for no good reason. * Remove device write entropy harvesting. This provided a weak attack vector, was not very good at bootstrapping the device. To follow will be a replacement explicit reseed knob. * Separate out all the RANDOM_PURE sources into separate harvest entities. This adds some secuity in the case where more than one is present. * Review all the code and fix anything obviously messy or inconsistent. Address som review concerns while I'm here, like rename the pseudo-rng to 'dummy'. # Submitted by: Arthur Mesh <arthurmesh@gmail.com> (the first item) ------------------------------------------------------------------------ r255319 | markm | 2013-09-06 18:51:52 +0100 (Fri, 06 Sep 2013) | 4 lines Yarrow wants entropy estimations to be conservative; the usual idea is that if you are certain you have N bits of entropy, you declare N/2. ------------------------------------------------------------------------ r255075 | markm | 2013-08-30 18:47:53 +0100 (Fri, 30 Aug 2013) | 4 lines Remove short-lived idea; thread to harvest (eg) RDRAND enropy into the usual harvest queues. It was a nifty idea, but too heavyweight. # Submitted by: Arthur Mesh <arthurmesh@gmail.com> ------------------------------------------------------------------------ r255071 | markm | 2013-08-30 12:42:57 +0100 (Fri, 30 Aug 2013) | 4 lines Separate out the Software RNG entropy harvesting queue and thread into its own files. # Submitted by: Arthur Mesh <arthurmesh@gmail.com> ------------------------------------------------------------------------ r254934 | markm | 2013-08-26 20:07:03 +0100 (Mon, 26 Aug 2013) | 2 lines Remove the short-lived namei experiment. ------------------------------------------------------------------------ r254928 | markm | 2013-08-26 19:35:21 +0100 (Mon, 26 Aug 2013) | 2 lines Snapshot; Do some running repairs on entropy harvesting. More needs to follow. ------------------------------------------------------------------------ r254927 | markm | 2013-08-26 19:29:51 +0100 (Mon, 26 Aug 2013) | 15 lines Snapshot of current work; 1) Clean up namespace; only use "Yarrow" where it is Yarrow-specific or close enough to the Yarrow algorithm. For the rest use a neutral name. 2) Tidy up headers; put private stuff in private places. More could be done here. 3) Streamline the hashing/encryption; no need for a 256-bit counter; 128 bits will last for long enough. There are bits of debug code lying around; these will be removed at a later stage. ------------------------------------------------------------------------ r254784 | markm | 2013-08-24 14:54:56 +0100 (Sat, 24 Aug 2013) | 39 lines 1) example (partially humorous random_adaptor, that I call "EXAMPLE") * It's not meant to be used in a real system, it's there to show how the basics of how to create interfaces for random_adaptors. Perhaps it should belong in a manual page 2) Move probe.c's functionality in to random_adaptors.c * rename random_ident_hardware() to random_adaptor_choose() 3) Introduce a new way to choose (or select) random_adaptors via tunable "rngs_want" It's a list of comma separated names of adaptors, ordered by preferences. I.e.: rngs_want="yarrow,rdrand" Such setting would cause yarrow to be preferred to rdrand. If neither of them are available (or registered), then system will default to something reasonable (currently yarrow). If yarrow is not present, then we fall back to the adaptor that's first on the list of registered adaptors. 4) Introduce a way where RNGs can play a role of entropy source. This is mostly useful for HW rngs. The way I envision this is that every HW RNG will use this functionality by default. Functionality to disable this is also present. I have an example of how to use this in random_adaptor_example.c (see modload event, and init function) 5) fix kern.random.adaptors from kern.random.adaptors: yarrowpanicblock to kern.random.adaptors: yarrow,panic,block 6) add kern.random.active_adaptor to indicate currently selected adaptor: root@freebsd04:~ # sysctl kern.random.active_adaptor kern.random.active_adaptor: yarrow # Submitted by: Arthur Mesh <arthurmesh@gmail.com> Submitted by: Dag-Erling Smørgrav <des@FreeBSD.org>, Arthur Mesh <arthurmesh@gmail.com> Reviewed by: des@FreeBSD.org Approved by: re (delphij) Approved by: secteam (des,delphij)
* Ignore attempts to set the nmbcluster sysctls to their current valuejhb2013-10-101-5/+5
| | | | | | | | rather than failing with an error. Reviewed by: andre Approved by: re (delphij) MFC after: 2 weeks
* The device vnodes are often unlocked when bread() or bwrite() iskib2013-10-091-1/+2
| | | | | | | | | | | called. This probably should be fixed eventually, but for now it is not needed to try to flush such vnodes from the buffer allocation context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
* Do not flush buffers when the v_object of the passed vnode does notkib2013-10-091-0/+2
| | | | | | | | | | | | | | | | | | really belong to it. Such vnodes, with the pointers to other vnodes v_objects, are typically instantiated by the bypass filesystems. Invalidating mappings of other vnode pages and the pages is wrong, since reclamation of the upper vnode does not imply that lower vnode is reclaimed too. One of the consequences of the improper reclamation was destruction of the wired mappings of the lower vnode pages, triggering miscellaneous assertions in the VM system. Reported by: John Marshall <john.marshall@riverwillow.com.au> Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
* When growing the file descriptor table, new larger memory chunk iskib2013-10-091-2/+21
| | | | | | | | | | | | | | | allocated, but the old table is kept around to handle the case of threads still performing unlocked accesses to it. Grow the table exponentially instead of increasing its size by sizeof(long) * 8 chunks when overflowing. This mode significantly reduces the total memory use for the processes consuming large numbers of the file descriptors which open them one by one. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)
* Reduce code duplication, introduce the getmaxfd() helper to calculatekib2013-10-091-9/+16
| | | | | | | | | the max filedescriptor index. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)
* - Substitute sbdrop_internal() with sbcut_internal(). The latter doesn't freeglebius2013-10-091-11/+34
| | | | | | | | | | | mbufs, but return chain of free mbufs to a caller. Caller can either reuse them or return to allocator in a batch manner. - Implement sbdrop()/sbdrop_locked() as a wrapper around sbcut_internal(). - Expose sbcut_locked() for outside usage. Sponsored by: Netflix Sponsored by: Nginx, Inc. Approved by: re (marius)
* Remove the uipc_cow.c file, which is not used since the zero copykib2013-10-061-182/+0
| | | | | | | | sockets removal. Noted by: alc Sponsored by: The FreeBSD Foundation Approved by: re (delphij)
* Tidy up kmeminit(): Since r245575, 'nmbclusters' is calculated afteralc2013-10-051-3/+2
| | | | | | | | | | kmeminit() runs, so it contributes nothing to 'vm_kmem_size'; update a comment to reflect that r254025 replaced the kmem submap with the kmem arena. Reviewed by: kib Approved by: re (gjb) Sponsored by: EMC / Isilon Storage Division
* Change len checks for fstypelen and fspathlen to be against absolute lensbruno2013-10-031-1/+1
| | | | | | | | | | | | | not strlen as they are *not* strings. Discovered by GSOC student, Mike Ma <mikemandarine@gmail.com> during his fuse.glusterfs port to FreeBSD. Final patch from mckusick@ Submitted by: mckusick@ Approved by: re (hrs) MFC after: 2 weeks
* When helping the bufdaemon from the buffer allocation context, therekib2013-10-021-63/+41
| | | | | | | | | | | | | | | | | | | | | | | | | is no sense to walk the whole dirty buffer queue. We are only interested in, and can operate on, the buffers owned by the current vnode [1]. Instead of calling generic queue flush routine, do VOP_FSYNC() if possible. Holding the dirty buffer queue lock in the bufdaemon, without dropping it, can cause starvation of buffer writes from other threads. This is esp. easy to reproduce on the big memory machines, where large files are written, causing almost all dirty buffers accumulating in several big files, which vnodes are locked by writers. Bufdaemon cannot flush any buffer, but is iterating over the whole dirty queue continuously. Since dirty queue mutex is not dropped, bufdone() in g_up thread is starved, usually deadlocking the machine [2]. Mitigate this by dropping the queue lock after the vnode is locked, allowing other queue lock contenders to make a progress. Discussed with: Jeff [1] Reported by: pho [2] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (hrs)
* When printing the vnode information from ddb, print the lengths of thekib2013-10-011-2/+5
| | | | | | | | dirty and clean buffer queues. Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
* For vunref(), try to upgrade the vnode lock if the function was calledkib2013-09-291-2/+4
| | | | | | | | | | with the vnode shared-locked. If upgrade succeeded, the inactivation can be done immediately, instead of being postponed. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
* Reimplement r255797 using LK_TRYUPGRADE.kib2013-09-291-2/+14
| | | | | | | | | | | | | | The r255797 was: Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
* Add LK_TRYUPGRADE operation for lockmgr(9), which attempts tokib2013-09-291-0/+13
| | | | | | | | | | | atomically upgrade shared lock to exclusive. On failure, error is returned and lock is not dropped in the process. Tested by: pho (previous version) No objections from: attilio Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
* it must be the last member, not might...jmg2013-09-261-1/+1
| | | | | Reviewed by: attilio Approved by: re (delphij, gjb)
* Do not allow negative timeouts for kqueue timers, check for thekib2013-09-261-2/+10
| | | | | | | | | | | | | negative timeout both before and after the conversion to sbintime_t. For periodic kqueue timer, convert zero timeout into 1ms, to avoid interrupt storm on fast event timers. Reported and tested by: pho Discussed with: mav Reviewed by: davide Sponsored by: The FreeBSD Foundation Approved by: re (marius)
* Acquire a hold reference on the vnode when a knote is instantiated.kib2013-09-261-0/+2
| | | | | | | | | | | | Otherwise, knote keeps a pointer to a vnode which could become invalid any time. Reported by: many Tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Discussed with: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)
* Make the callout arithmetic more robust adding checks for overflow.davide2013-09-261-1/+6
| | | | | | | | | | | | | | | | Without these, if the timeout value passed is "large enough", the value of the sum of it and other factors (e.g. current time as returned by sbinuptime() or 'precision' argument) might result in a negative number. This negative number is then passed to eventtimers(4), which causes et_start() routine to load et_min_period into eventtimer, making the CPU where the thread is stuck forever in timer interrupt handler routine. This is now avoided rounding to INT64_MAX the timeout period in case of overflow. Reported by: kib, pho Discussed with: kib, mav Tested by: pho (stress2 suite, kevent7.sh scenario) Approved by: re (kib)
OpenPOWER on IntegriCloud