path: root/sys/kern
Commit message | Author | Age | Files | Lines
* Don't make Linux stat() open character devices to resolve its name. | ed | 2009-02-20 | 1 | -4/+11
  The existing code calls kern_open() to resolve the vnode of a pathname right after a stat(). This is not correct, because it causes random character devices to be opened in /dev. This means ls'ing a tape streamer will cause it to rewind, for example.

  Changes I have made:
  - Add kern_statat_vnhook() to allow binary emulators to `post-process' struct stat, using the proper vnode.
  - Remove unneeded printf's from stat() and statfs().
  - Make the Linuxolator use kern_statat_vnhook(), replacing translate_path_major_minor_at().
  - Let translate_fd_major_minor() use vp->v_rdev instead of vp->v_un.vu_cdev.

  Result:
    crw-rw-rw- 1 root root   0, 14 Feb 20 13:54 /dev/ptmx
    crw--w---- 1 root adm  136,  0 Feb 20 14:03 /dev/pts/0
    crw--w---- 1 root adm  136,  1 Feb 20 14:02 /dev/pts/1
    crw--w---- 1 ed   tty  136,  2 Feb 20 14:03 /dev/pts/2

  Before this commit, ptmx also had a major number of 136, because it silently allocated and deallocated a pseudo-terminal. Device nodes that cannot be opened now have proper major/minor-numbers.

  Reviewed by: kib, netchild, rdivacky (thanks!)
* Enable caching of negative pathname lookups in the NFS client. | jhb | 2009-02-19 | 1 | -0/+18
  To avoid stale entries, we save a copy of the directory's modification time when the first negative cache entry was added in the directory's NFS node. When a negative cache entry is hit during a pathname lookup, the parent directory's modification time is checked. If it has changed, all of the negative cache entries for that parent are purged and the lookup falls back to using the RPC.

  This required adding a new cache_purge_negative() method to the name cache to purge only negative cache entries for a given directory.

  Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
  Reviewed by: mohans
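The validation scheme described in this commit can be illustrated with a small user-space model. This is only a sketch: `struct dir_node`, `add_negative()`, `negative_hit()` and `purge_negative()` are hypothetical names invented for the illustration, and they do not reflect the real NFS node layout or the name cache API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NEG_MAX 8

/* Hypothetical per-directory state; not the real NFS node. */
struct dir_node {
    time_t mtime;                    /* current modification time */
    time_t neg_mtime;                /* mtime saved with the first negative entry */
    int    neg_count;                /* number of cached negative entries */
    char   neg_names[NEG_MAX][32];   /* names known not to exist */
};

/* Drop only the negative entries of this directory. */
static void purge_negative(struct dir_node *dp)
{
    dp->neg_count = 0;
}

/* Record that `name` does not exist in the directory. */
static void add_negative(struct dir_node *dp, const char *name)
{
    assert(name != NULL);
    if (dp->neg_count == 0)
        dp->neg_mtime = dp->mtime;   /* first entry: remember the mtime */
    if (dp->neg_count < NEG_MAX) {
        snprintf(dp->neg_names[dp->neg_count],
            sizeof(dp->neg_names[0]), "%s", name);
        dp->neg_count++;
    }
}

/*
 * Return true if the cache may answer "no such file" on its own.
 * Return false if the caller has to fall back to the RPC.
 */
static bool negative_hit(struct dir_node *dp, const char *name)
{
    for (int i = 0; i < dp->neg_count; i++) {
        if (strcmp(dp->neg_names[i], name) != 0)
            continue;
        if (dp->neg_mtime != dp->mtime) {
            /* Directory changed: every negative entry is suspect. */
            purge_negative(dp);
            return false;
        }
        return true;
    }
    return false;
}
```

In this model a hit is trusted only while the directory's modification time still equals the saved copy; any change discards all negative entries for that directory at once.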
* Squash some small bugs in pts(4). | ed | 2009-02-19 | 1 | -6/+3
  - Don't return a negative errno when using an unknown ioctl() on a pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY, just like the TTY layer does.
  - Even though we should return st_rdev of the master device node when emulating pty(4) devices, FIODGNAME should still return the name of the slave device. Otherwise ptsname(3) and ttyname(3) return an invalid device name.
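The first fix follows a common layering pattern: an internal "not handled here" code must be translated before it reaches userland. A minimal sketch, assuming a negative stand-in value for ENOIOCTL and hypothetical function names (`lower_ioctl_sketch()`, `master_ioctl_sketch()`, `CMD_KNOWN`), none of which come from the commit itself:

```c
#include <assert.h>
#include <errno.h>

/* Assumed stand-in for the kernel-internal ENOIOCTL code. */
#define ENOIOCTL_SKETCH (-3)
/* Hypothetical command number the lower layer understands. */
#define CMD_KNOWN 1UL

/* Lower layer: reports "not my ioctl" with the internal code. */
static int lower_ioctl_sketch(unsigned long cmd)
{
    return (cmd == CMD_KNOWN) ? 0 : ENOIOCTL_SKETCH;
}

/* Entry point: never lets the internal code escape to the caller. */
static int master_ioctl_sketch(unsigned long cmd)
{
    int error = lower_ioctl_sketch(cmd);

    if (error == ENOIOCTL_SKETCH)
        error = ENOTTY;
    assert(error >= 0);   /* errno values handed to userland are not negative */
    return error;
}
```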
* - Add a function (fill_kinfo_aggregate()) which aggregates relevant members for a kinfo entry on a process-wide basis. | attilio | 2009-02-18 | 1 | -22/+44
  - Use the newly introduced function in order to fix cases like KERN_PROC_PROC where aggregate stats are broken because they just consider the first thread in the pool for each process. (Note, additionally, that KERN_PROC_PROC is rather inaccurate on thread-wide information like the 'state' of the process. Such information should maybe be invalidated and forcibly discarded by the consumers?)
  - Simplify the logic of sysctl_out_proc() and adjust fill_kinfo_thread() accordingly.
  - Remove checks on FIRST_THREAD_IN_PROC() being NULL but add assertions.

  This patch should fix aggregate statistics for KERN_PROC_PROC. This is one of the reasons why top doesn't use this option, and now it can use it safely. ps, when launched in order to display just processes, should now report correct CPU utilization percentages and times (as opposed to the old code).

  Reviewed by: jhb, emaste
  Sponsored by: Sandvine Incorporated
* Remove the printf's when the vnode to be exported for procstat is not a VDIR. | marcus | 2009-02-14 | 1 | -4/+0
  If the file system backing a process' cwd is removed, and procstat -f PID is called, then these messages would have been printed. The extra verbosity is not required in this situation.

  Requested by: kib
  Approved by: kib
* Change two KASSERTS to printfs and simple returns. | marcus | 2009-02-14 | 1 | -2/+12
  Stress testing has revealed that a process' current working directory can be VBAD if the directory is removed. This can trigger a panic when procstat -f PID is run.

  Tested by: pho
  Discovered by: phobot
  Reviewed by: kib
  Approved by: kib
* Remove semicolon left in the last commit | thompsa | 2009-02-13 | 1 | -1/+1
  Spotted by: csjp
* Use shared vnode locks when invoking VOP_READDIR(). | jhb | 2009-02-13 | 1 | -3/+2
  MFC after: 1 month
* Clarify and reimplement the bioq API so that bioq_disksort() has the correct behaviour (sorting by distance from the current head position in the scan direction) and bioq_insert_head() and bioq_insert_tail() have a well defined (and useful) behaviour, especially when intermixed with calls to bioq_disksort(). | luigi | 2009-02-13 | 1 | -65/+120
  In particular:
  - fix a bug in the existing bioq_disksort() that did not use the current head position correctly;
  - redefine semantics of bioq_insert_head() and bioq_insert_tail(). bioq_insert_tail() can now be used as a barrier between previous and subsequent calls to bioq_disksort().

  The code is heavily documented in the source code so please refer to that for the details.

  Much of this code comes from Fabio Checconi. Also thanks to Kirk for feedback on the (re)definition of bioq_insert_tail().

  NOTE: in the current tree there is only a handful of files which intermix calls to bioq_disksort() with bioq_insert_head() and bioq_insert_tail(). The ordering of the queue in these situations was not specified (nor easy to figure out) before, so I doubt any of that code could be affected by the specification of the API.

  Also note that the current implementation is significantly simpler than the previous one (also used in ata_sort_queue()). It would be useful to reimplement ata_sort_queue() using the same code used in bioq_disksort().

  MFC after: 1 week
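"Sorting by distance from the current head position in the scan direction" can be pictured with a short user-space sketch. It is not the bioq code: `sort_for_scan()`, `by_scan_distance()` and the `head_pos` variable are hypothetical, and the sketch assumes a one-way scan toward higher offsets, where requests behind the head are served only after the scan wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: offset at which the head currently sits. */
static uint64_t head_pos;

/*
 * Compare two offsets by their forward distance from the head.
 * Unsigned subtraction wraps, so offsets behind the head get very
 * large distances and sort after everything still ahead of it.
 */
static int by_scan_distance(const void *a, const void *b)
{
    uint64_t da = *(const uint64_t *)a - head_pos;
    uint64_t db = *(const uint64_t *)b - head_pos;

    return (da > db) - (da < db);
}

/* Order pending offsets the way a one-way scan would visit them. */
static void sort_for_scan(uint64_t *offsets, size_t n, uint64_t head)
{
    assert(offsets != NULL || n == 0);
    head_pos = head;
    qsort(offsets, n, sizeof(offsets[0]), by_scan_distance);
}
```

With the head at 50 and pending offsets 10, 70, 40, 90 and 55, this ordering visits 55, 70 and 90 first and then 10 and 40 on the next pass.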
* Check the exit flag at the start of the taskqueue loop rather than the end. | thompsa | 2009-02-13 | 1 | -2/+2
  It is possible to tear down the taskqueue before the thread has run and the taskqueue loop would sleep forever.

  Reviewed by: sam
  MFC after: 1 week
* Serialize write() calls on TTYs. | ed | 2009-02-11 | 1 | -6/+24
  Just like the old TTY layer, the current MPSAFE TTY layer does not make any attempt to serialize calls of write(). Data is copied into the kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at the same time, the data may interleave. This is especially likely when the TTY starts blocking, because the output queue reaches the high watermark.

  I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used to mark a TTY as having a thread stuck in write(). Because I don't want non-blocking processes to be possibly blocked by a sleeping thread, I'm still allowing it to bypass the protection. According to this message, the Linux kernel returns EAGAIN in such cases, but I think that's a little too restrictive:

  http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread

  PR: kern/118287
* Modify fdcopy() so that, during fork(2), it won't copy file descriptors from the parent to the child process if they have an operation vector of &badfileops. | rwatson | 2009-02-11 | 1 | -1/+2
  This narrows a set of races involving system calls that allocate a new file descriptor, potentially block for some extended period, and then return the file descriptor, when invoked by a threaded program that concurrently invokes fork(2). Similar approaches are used in both Solaris and Linux, and the wideness of this race was introduced in FreeBSD when we moved to a more optimistic implementation of accept(2) in order to simplify locking.

  A small race necessarily remains because the fork(2) might occur after the finit() in accept(2) but before the system call has returned, but that appears unavoidable using current APIs. However, this race is vastly narrower.

  The fix can be validated using the newfileops_on_fork regression test.

  PR: kern/130348
  Reported by: Ivan Shcheklein <shcheklein at gmail dot com>
  Reviewed by: jhb, kib
  MFC after: 1 week
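The filtering rule in this commit can be modelled in user space as follows. The structures are heavily simplified stand-ins, and `fdcopy_sketch()` and `socketops_sketch` are hypothetical names; the only element taken from the commit message is the idea of comparing a descriptor's operation vector against `&badfileops`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct fileops { const char *name; };
struct file    { const struct fileops *f_ops; };

/* Vector carried by descriptors that are allocated but not yet set up. */
static const struct fileops badfileops = { "bad" };
/* Hypothetical vector of a fully initialised descriptor. */
static const struct fileops socketops_sketch = { "socket" };

/*
 * Copy the parent's descriptor table into the child's, skipping any
 * slot that is empty or still has the placeholder vector.
 * Returns the number of descriptors actually inherited.
 */
static size_t fdcopy_sketch(struct file *const *parent,
    struct file **child, size_t n)
{
    size_t copied = 0;

    assert(parent != NULL && child != NULL);
    for (size_t i = 0; i < n; i++) {
        if (parent[i] != NULL && parent[i]->f_ops != &badfileops) {
            child[i] = parent[i];
            copied++;
        } else {
            child[i] = NULL;
        }
    }
    return copied;
}
```

A descriptor that another thread is still setting up (for example inside accept(2)) carries the placeholder vector, so under this rule the child never sees it.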
* o Use NULL in preference to 0 in pointer contexts. | imp | 2009-02-11 | 1 | -8/+8
  o Use newly minted KOBJMETHOD_END as appropriate
  o fix prototype for root_setup_intr.
* Check for device_set_devclass() errors and skip driver probe/attach if any. | mav | 2009-02-10 | 1 | -4/+12
  Attach call without devclass set crashes the system.

  On resume AHCI driver sometimes tries to create duplicate adX device. It is surely his own problem, but IMHO it is not a reason to crash here. Other reasons are also possible.
* Scanning all the formats for binary translation of modules loading can result in errors for a format loading but subsequent correct recognizing for another format. | attilio | 2009-02-10 | 3 | -6/+11
  File format loading functions should avoid printing any additional information and should just return an appropriate error condition, distinct for each kind of failure. Additionally, the linker should handle the different format loading errors appropriately.

  While a general mechanism is desired, fix a simple and common case on amd64: file type is not recognized for link elf and confuses the linker. Print out an error if all the registered linker classes can't recognize and load the module.

  Reviewed by: jhb
  Sponsored by: Sandvine Incorporated
* Remove extra 'comma = 0' in socket state printing code, which otherwise could lead to an extra comma in output. | rwatson | 2009-02-09 | 1 | -1/+0
  Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de>
* s/SS_FDREF/SS_NOFDREF/ | mbr | 2009-02-09 | 1 | -1/+1
* Remove a stale comment from the clists code. | ed | 2009-02-09 | 1 | -4/+0
  We don't support quote bits.
* Tweak the output of VOP_PRINT/vn_printf() some. | jhb | 2009-02-06 | 3 | -4/+4
  - Align the fifo output in fifo_print() with other vn_printf() output.
  - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf().
  - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().
* Add KASSERTs to make it easier to debug problems like the one fixed in r188141. | trasz | 2009-02-06 | 1 | -0/+1
  Reviewed by: kib, attilio
  Approved by: rwatson (mentor)
  Tested by: pho
  Sponsored by: FreeBSD Foundation
* Expand the scope of the sysctllock sx lock to protect the sysctl tree itself. | jhb | 2009-02-06 | 3 | -25/+110
  Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland buffer for most sysctls (everything except kern.vnode.*). I think to prevent issues with wiring too much memory it used a 'memlock' to serialize all sysctl(2) invocations, meaning that only one user buffer could be wired at a time.

  In 5.0 the 'memlock' was converted to an sx lock and renamed to 'sysctl lock'. However, it still only served the purpose of serializing sysctls to avoid wiring too much memory and didn't actually protect the sysctl tree as its name suggested. These changes expand the lock to actually protect the tree.

  Later on in 5.0, sysctl was changed to not wire buffers for requests by default (sysctl_handle_opaque() will still wire buffers larger than a single page, however). As a result, user buffers are no longer wired as often. However, many sysctl handlers still wire user buffers, so it is still desirable to serialize userland sysctl requests. Kernel sysctl requests are allowed to run in parallel, however.

  - Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the sysctl tree for a few places outside of kern_sysctl.c that manipulate the sysctl tree directly including the kernel linker and vfs_register().
  - sysctl_register() and sysctl_unregister() require the caller to lock the sysctl lock using sysctl_lock() and sysctl_unlock(). The rest of the public sysctl API manage the locking internally.
  - Add a locked variant of sysctl_remove_oid() for internal use so that external uses of the API do not need to be aware of locking requirements.
  - The kernel linker no longer needs Giant when manipulating the sysctl tree.
  - Add a missing break to the loop in vfs_register() so that we stop looking at the sysctl MIB once we have changed it.

  MFC after: 1 month
* Drop the kernel linker lock while running SYSUNINIT routines and removing sysctls during a linker file unload. | jhb | 2009-02-05 | 1 | -0/+3
  We drop the lock when doing similar operations during a linker file load. To close races, clear the LINKED flag before dropping the lock so that the linker file is no longer visible to userland.

  MFC after: 1 week
* Add more KTR_VFS logging points in order to have more effective tracing. | attilio | 2009-02-05 | 2 | -23/+78
  Reviewed by: brueffer, kib
  Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>
* Don't leave the console TTY constantly open. | ed | 2009-02-05 | 1 | -31/+40
  When we leave the console TTY constantly open, we never reset the termios attributes. This causes output processing, echoing, etc. not to be reset to the proper values when going into single user mode after the system has booted. It also causes nl-to-crnl-conversion not to take place during shutdown, which causes a `staircase effect'.

  This patch adds a new TTY flag, TF_OPENED_CONS, which is set when the TTY is opened through /dev/console. Because the flags are only used by the kernel and the pstat(8) utility, I've decided to renumber the TTY flags. This shouldn't be an issue, because the TTY layer is not yet part of a stable release.

  Reported by: Mark Atkinson <atkin901 yahoo com>
  Tested by: sepotvin
* Don't allow creating a socket with a protocol family that the current jail doesn't support. | jamie | 2009-02-05 | 2 | -8/+43
  This involves a new function prison_check_af, like prison_check_ip[46] but that checks only the family.

  With this change, most of the errors generated by jailed sockets shouldn't ever occur, at least until jails are changeable.

  Approved by: bz (mentor)
* Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. | jamie | 2009-02-05 | 1 | -70/+74
  The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL.

  Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls.

  Approved by: bz (mentor)
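The return convention described here can be sketched in a few lines. Everything below is a simplified model: `struct prison_sketch`, `struct cred_sketch` and `check_ip4_sketch()` are hypothetical and only mirror the three outcomes named in the commit message (success, EADDRNOTAVAIL, EAFNOSUPPORT) plus the built-in "not jailed" shortcut.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified jail description: a list of IPv4 addresses. */
struct prison_sketch {
    const uint32_t *ip4;
    size_t          nip4;
};

/* Hypothetical credential: a NULL prison means "not jailed". */
struct cred_sketch {
    const struct prison_sketch *prison;
};

/*
 * Return 0 if the credential may use `addr`, otherwise an errno value.
 * The jailed check lives inside the function, so callers need no
 * separate test before calling it.
 */
static int check_ip4_sketch(const struct cred_sketch *cred, uint32_t addr)
{
    assert(cred != NULL);
    if (cred->prison == NULL)
        return 0;                   /* non-jailed: always succeeds */
    if (cred->prison->nip4 == 0)
        return EAFNOSUPPORT;        /* jail has no IPv4 addresses at all */
    for (size_t i = 0; i < cred->prison->nip4; i++) {
        if (cred->prison->ip4[i] == addr)
            return 0;
    }
    return EADDRNOTAVAIL;           /* family is fine, address is not ours */
}
```

A caller can then pass the result straight through, for example `if ((error = check_ip4_sketch(cred, addr)) != 0) return error;`, instead of substituting a hard-coded errno.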
* In some situations, mnt_lockref could go negative due to vfs_unbusy() being called without calling vfs_busy() first. | trasz | 2009-02-05 | 1 | -3/+5
  This made umount(8) hang waiting for mnt_lockref to become zero, which would never happen.

  Reviewed by: kib
  Approved by: rwatson (mentor)
  Reported by: pho
  Found with: stress2
  Sponsored by: FreeBSD Foundation
* Remove written-to but never read local variable 'offset' from soreceive_dgram(). | rwatson | 2009-02-04 | 1 | -2/+1
  Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de>
  MFC after: 1 week
* Remove slush space from clists. | ed | 2009-02-04 | 1 | -75/+10
  Right now we only have a very small amount of drivers that use clists, but we still allocate 50 cblocks as slush space, which allows drivers to temporarily overcommit their storage. Most of the drivers don't allow this anyway.

  I've performed the following changes:
  - We don't allocate any cblocks on startup.
  - I've removed the DDB command, because it has nothing useful to print now. You can obtain the amount of allocated blocks by running `vmstat -m | grep clist'.
  - I've removed cfreecount, which is now unused.
  - The old code first tries to allocate using M_NOWAIT, followed by M_WAITOK. This doesn't make any sense, so just remove this logic. It seems the drivers allow us to sleep anyway.

  We can even remove ccmax from clist_alloc_cblocks and c_cbmax from struct clist, but this breaks binary compatibility.

  This reduces the amount of allocated cblocks on my system from 54 to 4.
* Slightly improve the design of the TTY buffer. | ed | 2009-02-03 | 3 | -73/+102
  The TTY buffers used the standard <sys/queue.h> lists. Unfortunately they have a big shortcoming. If you want to have a double linked list, but no tail pointer, it's still not possible to obtain the previous element in the list. Inside the buffers we don't need them. This is why I switched to custom linked list macros. The macros will also keep track of the amount of items in the list. Because it doesn't use a sentinel, we can just initialize the queues with zero.

  In its simplest form (the output queue), we will only keep two references to blocks in the queue, namely the head of the list and the last block in use. All free blocks are stored behind the last block in use.

  I noticed there was a very subtle bug in the previous code: in a very uncommon corner case, it would uma_zfree() a block in the queue before calling memcpy() to extract the data from the block.
* Use NULL in preference to 0 in pointer contexts. | imp | 2009-02-03 | 2 | -9/+9
* Make bioq_disksort have an ANSI-C definition rather than a K&R definition. | imp | 2009-02-03 | 1 | -3/+1
* rman_debug should be static, so make it static. | imp | 2009-02-03 | 1 | -1/+1
* Use ANSI function definition for profil. | imp | 2009-02-03 | 1 | -3/+1
* Prefer ANSI function definitions to K&R ones. | imp | 2009-02-03 | 1 | -6/+3
* Use NULL in preference to 0 for pointers. | imp | 2009-02-03 | 2 | -7/+7
* Use NULL in preference to 0 for pointers. | imp | 2009-02-03 | 2 | -2/+2
* o Use unsigned for bit fields. | imp | 2009-02-03 | 1 | -3/+3
  o Use NULL for pointers in preference to 0.
* int foo(void) is the proper ANSI function definition when there's no parameters. | imp | 2009-02-03 | 1 | -1/+1
  Use it for resettodr().
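For readers unfamiliar with the two definition styles these cleanup commits refer to, here is a minimal illustration. The functions `wrap_index()` and `default_size()` are hypothetical examples, not code from the tree; the K&R form appears only in a comment because current compilers may reject it.

```c
#include <assert.h>

/*
 * K&R (pre-ANSI) definition style, shown for comparison only:
 *
 *   int
 *   wrap_index(i, n)
 *       int i;
 *       int n;
 *   {
 *       ...
 *   }
 *
 * The parameter types sit outside the parentheses, so the compiler
 * cannot check calls against them.
 */

/* ANSI definition: parameter types are part of the declarator. */
static int wrap_index(int i, int n)
{
    assert(n > 0);
    return ((i % n) + n) % n;   /* result is always in [0, n) */
}

/* With no parameters, (void) says so explicitly; () would leave them unspecified. */
static int default_size(void)
{
    return 8;
}
```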
* Declare bus_data_devices to be static: it isn't used elsewhere. | imp | 2009-02-03 | 1 | -5/+5
  Use NULL in a couple of places rather than 0 in the context of pointers to be consistent with the rest of the file.
* Fix select on platforms where sizeof(long) != sizeof(int). | sepotvin | 2009-02-02 | 1 | -2/+2
  This used to work by accident before the cleanup done in revision 187693.

  Approved by: kan (mentor)
* If a process is a zombie and we couldn't identify another useful state, print out the state as "zombie" in preference to "unknown" when ^T is pressed. | rwatson | 2009-01-29 | 1 | -0/+2
  MFC after: 3 days
  Sponsored by: Google, Inc.
* Mark most often used sysctl's as MPSAFE. | ed | 2009-01-28 | 3 | -14/+18
  After running a `make buildkernel', I noticed most of the Giant locks in sysctl are only caused by a very small amount of sysctl's:
  - sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like sysctl.oidfmt.
  - kern.ident, kern.osrelease, kern.version, etc. These are just constant strings.
  - kern.arandom, used by the stack protector. It is already protected by arc4_mtx.

  I also saw the following sysctl's show up. Not as often as the ones above, but still quite often:
  - security.jail.jailed. Also mark security.jail.list as MPSAFE. They don't need locking or already use allprison_lock.
  - kern.devname, used by devname(3), ttyname(3), etc.

  This seems to reduce Giant locking inside sysctl by ~75% in my primitive test setup.
* Convert the global mutex protecting the directory lookup name cache from a mutex to a reader/writer lock. | jhb | 2009-01-28 | 1 | -46/+81
  Lookup operations first grab a read lock and perform the lookup. If the operation results in a need to modify the cache, then it tries to do an upgrade. If that fails, it drops the read lock, obtains a write lock, and redoes the lookup.
* Use the proper flag to let kern.ttys be executed without Giant. | ed | 2009-01-26 | 1 | -1/+1
  Pointed out by: jhb
* Whitespace tweak. | jhb | 2009-01-26 | 1 | -1/+1
* - bit has to be fd_mask to work properly on 64bit platforms. Constants must also be cast even though the result ultimately is promoted to 64bit. | jeff | 2009-01-25 | 1 | -5/+6
  - Correct a loop index upper bound in selscan().
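The point about casting constants can be shown with a small sketch. It assumes that the mask type is `unsigned long`; the names `fd_mask_sketch`, `NFDBITS_SKETCH`, `fd_set_sketch()` and `fd_isset_sketch()` are hypothetical stand-ins, not the kernel's definitions. In C, `1 << 40` is evaluated as an `int` shift before any promotion, so the constant has to be widened first.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed stand-in for the descriptor mask word type. */
typedef unsigned long fd_mask_sketch;

/* Number of descriptor bits held by one mask word. */
#define NFDBITS_SKETCH (sizeof(fd_mask_sketch) * CHAR_BIT)

/* Mark descriptor `fd` in the bit array. */
static void fd_set_sketch(fd_mask_sketch *bits, int fd)
{
    assert(fd >= 0);
    /* Cast before shifting: a bare 1 is an int and may be too narrow. */
    bits[(size_t)fd / NFDBITS_SKETCH] |=
        (fd_mask_sketch)1 << ((size_t)fd % NFDBITS_SKETCH);
}

/* Test whether descriptor `fd` is marked. */
static int fd_isset_sketch(const fd_mask_sketch *bits, int fd)
{
    fd_mask_sketch bit;

    assert(fd >= 0);
    bit = (fd_mask_sketch)1 << ((size_t)fd % NFDBITS_SKETCH);
    return (bits[(size_t)fd / NFDBITS_SKETCH] & bit) != 0;
}
```

Without the cast, a shift count of 32 or more on a platform with 32-bit `int` is undefined behaviour, which is the kind of bug that only shows up when `long` is wider than `int`.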
* When a statically linked binary is executed (or at least, one without an interpreter definition in its program header), set the auxiliary ELF argument AT_BASE to 0 rather than to the address that we would have mapped the interpreter at if there had been one. | rwatson | 2009-01-25 | 1 | -1/+2
  The ELF ABI specifications appear to be ambiguous as to the desired behavior in this situation, as they define AT_BASE as the base address of the interpreter, but do not mention what to do if there is none. On Solaris, AT_BASE will be set to the base address of the static binary if there is no interpreter, and on Linux, AT_BASE is set to 0. We go with the Linux semantics as they are of more immediate utility and allow the early runtime environment to know that the kernel has not mapped an interpreter, but because AT_PHDR points at the ELF header for the running binary, it is still possible to retrieve all required mapping information when the process starts should it be required.

  Either approach would be preferable to our current behavior of passing a pointer to an unmapped region of user memory as AT_BASE.

  MFC after: 3 weeks
* For consistency with prison_{local,remote,check}_ipN rename prison_getipN to prison_get_ipN. | bz | 2009-01-25 | 1 | -2/+2
  Submitted by: jamie (as part of a larger patch)
  MFC after: 1 week
* - Correct a typo in a comment. | jeff | 2009-01-25 | 1 | -1/+1
  Noticed by: danger