path: root/sys/cddl
Commit message [Author, Age; Files, Lines -/+]
* MFS r320605, r320610: MFC r303052, r309017 (by alc) [markj, 2017-07-05; 1 file, -1/+1]
  Omit v_cache_count when computing the number of free pages, since its value
  is always 0.
  Approved by: re (gjb, kib)
* MFC r318943 (avg) [gjb, 2017-06-06; 1 file, -0/+3]
  MFV r318942: 8166 zpool scrub thinks it repaired offline device
  https://www.illumos.org/issues/8166
  If we do a scrub while a leaf device is offline (via "zpool offline"), we
  will inadvertently clear the DTL (dirty time log) of the offline device,
  even though it is still damaged. When the device comes back online, we will
  incompletely resilver it, thinking that the scrub repaired blocks written
  before the scrub was started. The incomplete resilver can lead to data loss
  if there is a subsequent failure of a different leaf device.
  The fix is to never clear the DTL of offline devices. Note that if a device
  is onlined while a scrub is in progress, the scrub will be restarted.
  The problem can be worked around by running "zpool scrub" after
  "zpool online".
  See also https://github.com/zfsonlinux/zfs/issues/5806
  PR: 219537
  Approved by: re (kib)
  Sponsored by: The FreeBSD Foundation
* MFC r318832: MFV r316923: 8026 retire zfs_throttle_delay and zfs_throttle_resolution [avg, 2017-06-01; 1 file, -3/+0]
* MFC r318830: MFV r316921: 8027 tighten up dsl_pool_dirty_delta [avg, 2017-06-01; 1 file, -1/+1]
* MFC r319096: zfs_lookup: fix bogus arguments to lookup of "snapshot" directory [avg, 2017-06-01; 1 file, -1/+1]
* MFC r308826: zfs: fix up after the removal of PG_CACHED pages in r308691 [avg, 2017-05-29; 1 file, -2/+2]
  Now that r308691 has been MFC-ed as a part of r318716, r308826 must be
  MFC-ed as well.
  PR: 214629
  Reported by: mshirk@daemon-security.com [head], lev [stable/11]
* MFC r316854: rename vfs.zfs.debug_flags to vfs.zfs.debugflags [avg, 2017-05-24; 1 file, -1/+5]
  Since this is a stable branch, the vfs.zfs.debug_flags sysctl is also kept.
  The corresponding tunable could never work.
* MFC r308474, r308691, r309203, r309365, r309703, r309898, r310720, r308489, r308706 [markj, 2017-05-23; 2 files, -6/+1]
  Add PQ_LAUNDRY and remove PG_CACHED pages.
* MFC r318189: vdev_geom may associate multiple vdevs per g_consumer [asomers, 2017-05-22; 1 file, -49/+65]
  vdev_geom.c currently uses the g_consumer's private field to point to a
  vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
  example, when you remove a disk, the vdev's status will change to REMOVED.
  However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
  consumer. If this happens, then geom events will only be propagated to one
  of the vdevs. Fix this by storing a linked list of vdevs in g_consumer's
  private field; a minimal sketch of the pattern follows this entry.
  sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
  * g_consumer.private now stores a linked list of vdev pointers associated
    with the consumer instead of just a single vdev pointer.
  * Change vdev_geom_set_physpath's signature to more closely match
    vdev_geom_set_rotation_rate.
  * Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed
    that we've already accessed the consumer by the time we get here.
  * Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it
    in vdev_geom_open, after we know that the open has succeeded.
  PR: 218634
  Reviewed by: gibbs
  Sponsored by: Spectra Logic Corp
  Differential Revision: https://reviews.freebsd.org/D10391
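  Not FreeBSD's actual code, but a minimal userspace sketch of the pattern
  described above: a driver object's single private pointer holds a linked
  list of back-pointers, so one event fans out to every attached vdev. All
  names here (my_consumer, my_vdev, attach_vdev) are hypothetical stand-ins.

    #include <stdio.h>

    /* Hypothetical stand-ins for g_consumer and vdev_t. */
    struct my_vdev {
        const char      *name;
        struct my_vdev  *next;  /* link in the consumer's vdev list */
    };

    struct my_consumer {
        void *private;          /* head of a list of vdevs, not one vdev */
    };

    /* Attach: push the vdev onto the list instead of overwriting. */
    static void
    attach_vdev(struct my_consumer *cp, struct my_vdev *vd)
    {
        vd->next = cp->private;
        cp->private = vd;
    }

    /* A GEOM-style event now reaches every vdev, not just the last one. */
    static void
    on_media_removed(struct my_consumer *cp)
    {
        for (struct my_vdev *vd = cp->private; vd != NULL; vd = vd->next)
            printf("vdev %s: state -> REMOVED\n", vd->name);
    }

    int
    main(void)
    {
        struct my_consumer cp = { NULL };
        struct my_vdev a = { "pool1/mirror-0/ada0", NULL };
        struct my_vdev b = { "pool2/mirror-1/ada0", NULL };

        attach_vdev(&cp, &a);
        attach_vdev(&cp, &b);
        on_media_removed(&cp);  /* both vdevs see the event */
        return (0);
    }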
* MFC r314370, r318130, r318167: DTrace-related fixes for PowerPC [jhibbits, 2017-05-20; 2 files, -31/+28]
  r314370: Unbreak kernel breakpoints, broken for ~4 years now.
  r318130: Fix the encoded instruction for FBT traps on powerpc.
  r318167: Fix stack tracing in dtrace for powerpc.
* MFC r316760: Fix vdev_geom_attach_by_guids for partitioned disks [asomers, 2017-05-05; 1 file, -31/+44]
  When opening a vdev whose path is unknown, vdev_geom must find a geom
  provider with a label whose guids match the desired vdev. However, due to
  partitioning, it is possible that two non-synonymous providers will share
  some labels. For example, if the first partition starts at the beginning of
  the drive, then ada0 and ada0p1 will share the first label. More troubling,
  if the last partition runs to the end of the drive, then ada0p3 and ada0
  will share the last label. If vdev_geom opens ada0 when it should've opened
  ada0p3, then the pool won't be readable. If it opens ada0 when it should've
  opened ada0p1, then it will corrupt some other partition when it writes the
  3rd and 4th labels.
  The easiest way to reproduce this problem is to install a mirrored root
  pool with the default partition layout, then swap the positions of the two
  boot drives and reboot. Whether the bug manifests depends on the order in
  which geom lists its providers, which is arbitrary.
  Fix this situation by modifying the search algorithm to prefer geom
  providers that have all four labels intact. If no such provider exists,
  then open whichever provider has the most; see the sketch after this entry.
  Reviewed by: mav
  Sponsored by: Spectra Logic Corp
  Differential Revision: https://reviews.freebsd.org/D10365
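  A minimal sketch of the selection rule, assuming a hypothetical provider
  descriptor with a precomputed count of matching labels; the real code walks
  GEOM's provider list and reads the labels itself.

    #include <stdio.h>

    #define VDEV_LABELS 4

    /* Hypothetical candidate; matching_labels counts labels whose guids
       match the desired vdev. */
    struct provider {
        const char *name;
        int         matching_labels;
    };

    /*
     * Pick the provider with the most intact labels.  A fully labeled
     * provider (all four) wins outright; otherwise the best partial
     * match is used.
     */
    static const struct provider *
    best_provider(const struct provider *p, int n)
    {
        const struct provider *best = NULL;

        for (int i = 0; i < n; i++) {
            if (best == NULL || p[i].matching_labels > best->matching_labels)
                best = &p[i];
            if (best->matching_labels == VDEV_LABELS)
                break;          /* can't do better than all four */
        }
        return (best);
    }

    int
    main(void)
    {
        /* ada0 shares labels with its partitions but matches only 2 of 4. */
        struct provider cand[] = {
            { "ada0",   2 },
            { "ada0p1", 1 },
            { "ada0p3", 4 },
        };

        printf("chose %s\n", best_provider(cand, 3)->name); /* ada0p3 */
        return (0);
    }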
* MFC r315449: Reduce ARC fragmentation threshold [smh, 2017-04-26; 1 file, -1/+1]
  Sponsored by: Multiplay
* MFC r316460: Fix expandsz 16.0E vals and vdev_min_asize of RAIDZ children [smh, 2017-04-26; 1 file, -7/+17]
  Sponsored by: Multiplay
* MFC r315852: zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate [avg, 2017-04-14; 3 files, -7/+28]
* MFC r315853: zfs_putpages: use TXG_WAIT [avg, 2017-04-14; 1 file, -7/+1]
* MFC r315208: Fix a backwards comparison in the code to dump a DTrace debug buffer [markj, 2017-04-11; 1 file, -1/+1]
* MFC r313483: Fix setting birthtime in ZFS [asomers, 2017-04-02; 2 files, -15/+20]
  sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  * In zfs_freebsd_setattr, if the caller wants to set the birthtime, set the
    bits that zfs_setattr expects.
  * In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime, expected by
    zfs_xvattr_set. The two levels of indirection seem excessive, but it
    minimizes diffs vs OpenZFS.
  * In zfs_setattr, check for overflow of va_birthtime (from delphij).
  * Remove red herring in zfs_getattr.
  sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h
  * Un-booby-trap some macros.
  New tests are under review at https://github.com/pjd/pjdfstest/pull/6
  Reviewed by: avg
  MFC after: 3 weeks
  Sponsored by: Spectra Logic Corp
  Differential Revision: https://reviews.freebsd.org/D9353
* MFC r313176, r313177, r313359 [gnn, 2017-03-30; 4 files, -2/+157]
  Replace the implementation of DTrace's RAND subroutine for generating
  low-quality random numbers with a modern implementation (xoroshiro128+)
  that is capable of generating better-quality randomness without
  compromising performance. A reference sketch of the generator follows this
  entry.
  Submitted by: Graeme Jenkinson
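  For reference, a plain-C rendition of xoroshiro128+ (Blackman and Vigna's
  public-domain generator, original 55/14/36 parameter set); the DTrace
  wiring is omitted and the seed below is an arbitrary placeholder.

    #include <stdint.h>
    #include <stdio.h>

    /* State must not be all zero; this seed is just a placeholder. */
    static uint64_t s[2] = { 0x9E3779B97F4A7C15ULL, 0xBF58476D1CE4E5B9ULL };

    static inline uint64_t
    rotl(uint64_t x, int k)
    {
        return ((x << k) | (x >> (64 - k)));
    }

    /* One step: the sum of the two state words is the output. */
    static uint64_t
    xoroshiro128plus(void)
    {
        uint64_t s0 = s[0];
        uint64_t s1 = s[1];
        uint64_t result = s0 + s1;

        s1 ^= s0;
        s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);  /* a = 55, b = 14 */
        s[1] = rotl(s1, 36);                    /* c = 36 */
        return (result);
    }

    int
    main(void)
    {
        for (int i = 0; i < 4; i++)
            printf("%016jx\n", (uintmax_t)xoroshiro128plus());
        return (0);
    }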
* MFC r315076: zfs: provide a special vptocnp method for the .zfs vnode [avg, 2017-03-23; 1 file, -0/+23]
* MFC r314048, r314194: reimplement zfsctl (.zfs) support [avg, 2017-03-23; 10 files, -2749/+756]
* MFC r314913: MFV r314911: 7867 ARC space accounting leak [avg, 2017-03-23; 1 file, -0/+6]
* MFC r314912: MFV r314910: 7843 get_clones_stat() is suboptimal for lots of clones [avg, 2017-03-23; 1 file, -1/+12]
* MFC r309856: Postpone ZVOL media/block size caching till first open [mav, 2017-03-17; 1 file, -11/+15]
  At least on FreeBSD there is no legal way to access media or get its size
  without opening the device/provider first. Postponing this caching allows
  us to skip several disk seeks per ZVOL/snapshot during import. For an HDD
  pool with 1 ZVOL in dev mode with 1000 snapshots, this reduces pool import
  time from 40 to 10 seconds. A sketch of the lazy-caching pattern follows.
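  A minimal sketch of the lazy-caching idea, with hypothetical names (the
  zvol_state fields and query_media_size_expensive below are stand-ins for
  the real zvol code).

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical zvol state; the real structure lives in zvol.c. */
    struct zvol_state {
        uint64_t volsize;       /* cached media size */
        bool     size_cached;
    };

    /* Stand-in for the expensive query that seeks on the disk. */
    static uint64_t
    query_media_size_expensive(void)
    {
        return (1ULL << 30);    /* pretend the volume is 1 GiB */
    }

    /*
     * Open path: fill the cache on first open only.  Import no longer
     * pays the per-zvol/per-snapshot seek cost up front; the first
     * opener pays it instead.
     */
    static int
    zvol_open_sketch(struct zvol_state *zv)
    {
        if (!zv->size_cached) {
            zv->volsize = query_media_size_expensive();
            zv->size_cached = true;
        }
        return (0);
    }

    int
    main(void)
    {
        struct zvol_state zv = { 0, false };

        zvol_open_sketch(&zv);  /* first open: queries and caches */
        zvol_open_sketch(&zv);  /* later opens: cache hit, no disk access */
        printf("volsize=%ju\n", (uintmax_t)zv.volsize);
        return (0);
    }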
* MFC r309833: Add missed vfs.zfs.zfetch.max_idistance sysctl [mav, 2017-03-17; 1 file, -0/+2]
* MFC r308099: Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz [mav, 2017-03-17; 2 files, -0/+9]
* MFC r308782 [mav, 2017-03-17; 7 files, -109/+100]
  After some ZIL changes 6 years ago, zil_slog_limit got partially broken
  because zl_itx_list_sz was not updated when async itx'es were upgraded to
  sync. Because of other changes around that time, zl_itx_list_sz is no
  longer actually required to implement the functionality, so this patch
  removes some unneeded broken code and variables.
  The original idea of zil_slog_limit was to reduce the chance of SLOG abuse
  by a single heavy logger, which increased latency for other (more
  latency-critical) loggers, by pushing the heavy log out into the main pool
  instead of the SLOG. Besides a huge latency increase for heavy writers,
  this implementation caused a double write of all data, since the log
  records were explicitly prepared for the SLOG. Since we now have an I/O
  scheduler, I've found it can be much more efficient to reduce the priority
  of heavy-logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to
  ZIO_PRIORITY_ASYNC_WRITE, while still leaving them on the SLOG.
  The existing ZIL implementation had a space-efficiency problem when it had
  to write large chunks of data into log blocks of limited size. In some
  cases efficiency dropped to almost as low as 50%. For a ZIL stored on
  spinning rust, that also cut log write speed in half, since the head had to
  uselessly fly over allocated but unwritten areas. This change improves the
  situation by offloading the problematic operations from z*_log_write() to
  zil_lwb_commit(), which knows the real state of log block allocation and
  can split large requests into pieces much more efficiently. As a side
  effect it also removes one of the two data copy operations done by the ZIL
  code in the WR_COPIED case.
  While there, untangle and unify the code of the z*_log_write() functions.
  zfs_log_write(), like zvol_log_write(), can now handle writes crossing a
  block boundary, which may also improve efficiency if the ZPL is made to do
  that. A sketch of the block-splitting idea follows this entry.
  Sponsored by: iXsystems, Inc.
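  A toy illustration of the splitting idea, assuming a hypothetical fixed
  log-block payload size; the real zil_lwb_commit() computes usable space
  from the currently open lwb.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical fixed payload capacity of one log block. */
    #define LOG_BLOCK_PAYLOAD 4096

    /*
     * Split one large write into payload-sized records at commit time,
     * when the free space in the current log block ('remaining') is
     * actually known, so no block tail is wasted.
     */
    static void
    log_write_split(uint64_t off, uint64_t len, uint64_t remaining)
    {
        while (len > 0) {
            if (remaining == 0)
                remaining = LOG_BLOCK_PAYLOAD;  /* open a fresh log block */

            uint64_t chunk = len < remaining ? len : remaining;

            printf("record: off=%ju len=%ju\n",
                (uintmax_t)off, (uintmax_t)chunk);
            off += chunk;
            len -= chunk;
            remaining -= chunk;
        }
    }

    int
    main(void)
    {
        /* A 10000-byte write lands as 4096 + 4096 + 1808. */
        log_write_split(0, 10000, LOG_BLOCK_PAYLOAD);
        return (0);
    }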
* MFC r307397: Add vfs.zfs.zil_log_limit sysctl [mav, 2017-03-17; 1 file, -0/+2]
  It is at least partially broken now, but that is another question.
* MFC r314549: Execute last ZIO of log commit synchronously [mav, 2017-03-16; 1 file, -3/+5]
  For short transactions the overhead of a context switch can be too large,
  and skipping it gives a significant latency reduction. For large ones,
  including multiple ZIOs, latency is less critical, while throughput may
  become limited by the checksumming speed of a single CPU core. To get the
  best of both cases, execute the last ZIO directly from the calling thread
  context to save latency, while all others (if there are any) are enqueued
  to taskqueues in the traditional way. A sketch of this dispatch pattern
  follows.
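  A minimal sketch of the dispatch pattern, where enqueue_to_taskqueue is a
  hypothetical stand-in for handing a ZIO to a taskqueue thread.

    #include <stdio.h>

    /* Hypothetical work item; the real objects are ZIOs. */
    struct work {
        int id;
    };

    static void
    process(struct work *w)
    {
        printf("processing zio %d\n", w->id);
    }

    /* Stand-in for a worker-thread handoff; in the kernel this would be
       something like taskqueue_enqueue(). */
    static void
    enqueue_to_taskqueue(struct work *w)
    {
        process(w);
    }

    /*
     * Everything except the last item goes to worker threads (parallel
     * checksumming for big commits); the last item runs inline in the
     * caller to skip one context switch on the latency path.
     */
    static void
    dispatch(struct work *items, int n)
    {
        for (int i = 0; i < n - 1; i++)
            enqueue_to_taskqueue(&items[i]);
        if (n > 0)
            process(&items[n - 1]); /* last ZIO: calling thread */
    }

    int
    main(void)
    {
        struct work w[] = { {1}, {2}, {3} };

        dispatch(w, 3); /* 1 and 2 via "taskqueue", 3 inline */
        return (0);
    }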
* MFC r314548: Completely skip cache flushing for log devices that do not support it [mav, 2017-03-16; 1 file, -5/+7]
* MFC r314274: l2arc: fix write size calculation broken by the Compressed ARC commit [avg, 2017-03-11; 1 file, -18/+18]
* MFC r313841, r313850: Prevent CPU migration when checking the DTrace nofault flag on x86 [markj, 2017-03-10; 2 files, -2/+20]
* MFC r314497 [ae, 2017-03-08; 1 file, -1/+10]
  Do not invoke the resize event when the previous provider's size was zero.
  This is similar to the r303637 fix for geom_disk.
* MFC r313879 [jpaetzel, 2017-03-07; 1 file, -3/+8]
  MFV r313876: 7504 kmem_reap hangs spa_sync and administrative tasks
  illumos/illumos-gate@405a5a0f5c3ab36cb76559467d1a62ba648bd809
  https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd809
  https://www.illumos.org/issues/7504
  We see a long spa_sync(). We are waiting to hold dp_config_rwlock for
  writer. Some other thread holds dp_config_rwlock for reader, then calls
  arc_get_data_buf(), which finds that arc_is_overflowing()==B_TRUE. So it
  waits (while holding dp_config_rwlock for reader) for arc_reclaim_thread to
  signal arc_reclaim_waiters_cv. Before signaling, arc_reclaim_thread does
  arc_kmem_reap_now(), which takes ~seconds.
  Author: Matthew Ahrens <mahrens@delphix.com>
  Reviewed by: George Wilson <george.wilson@delphix.com>
  Reviewed by: Prakash Surya <prakash.surya@delphix.com>
  Approved by: Dan McDonald <danmcd@omniti.com>
* MFC r314058: zfs: lower priority of zio_write_issue threads by four [avg, 2017-03-07; 1 file, -1/+9]
  Obtained from: Panzura
  Sponsored by: Panzura
* MFC r314572 [mm, 2017-03-05; 1 file, -0/+3]
  Fix null pointer dereference in zfs_freebsd_setacl().
  Prevents unprivileged users from panicking the kernel by calling
  __acl_delete_*() on files or directories inside a ZFS mount.
* MFC r314273: zfs: call spa_deadman on a taskqueue thread [avg, 2017-03-04; 3 files, -8/+32]
* MFC r314059: zfs: move zio_taskq_basedc under SYSDC [avg, 2017-02-27; 1 file, -1/+1]
* MFC r313687: remove l2_padding_needed statistic from zfs arc [avg, 2017-02-21; 1 file, -2/+0]
* MFC r313686: check remaining space in zfs implementations of vptocnp [avg, 2017-02-21; 2 files, -6/+15]
  PR: 216939
* MFC r313133: Sync the x86 dis_tables.c with upstream [markj, 2017-02-17; 2 files, -53/+639]
* MFC r312893: Fix an off-by-one in an assertion on fasttrap tracepoint sizes [markj, 2017-02-03; 1 file, -1/+1]
* MFC r310647: Remove an obsolete pragma from dtrace.h [markj, 2017-01-03; 1 file, -2/+0]
* MFC r305378, r305379, r305386, r305684, r306224, r306608, r306803, r307650, r307685, r308407, r308665, r308667, r309067 [mjg, 2016-12-31; 1 file, -1/+1]
  cache: put all negative entry management code into dedicated functions
  ==
  cache: manage negative entry list with a dedicated lock
  Since negative entries are managed with a LRU list, a hit requires a
  modification. Currently the code tries to upgrade the global lock if needed
  and is forced to retry the lookup if it fails. Provide a dedicated lock for
  use when the cache is only shared-locked.
  ==
  cache: defer freeing entries until after the global lock is dropped
  This also defers vdrop for held vnodes.
  ==
  cache: improve scalability by introducing bucket locks
  An array of bucket locks is added. All modifications still require the
  global cache_lock to be held for writing. However, most readers only need
  the relevant bucket lock and in effect can run concurrently with the writer
  as long as they use a different lock. See the added comment for more
  details. This is an intermediate step towards removal of the global lock.
  ==
  cache: get rid of the global lock
  Add a table of vnode locks and use them along with bucket locks to provide
  concurrent modification support. The approach taken is to preserve the
  current behaviour of the namecache and just lock all relevant parts before
  any changes are made. Lookups still require the relevant bucket to be
  locked.
  ==
  cache: ignore purgevfs requests for filesystems with few vnodes
  purgevfs is purely optional and induces lock contention in workloads which
  frequently mount and unmount filesystems. In particular, poudriere will do
  this for filesystems with 4 vnodes or fewer. A full cache scan is clearly
  wasteful. Since there is no explicit counter for namecache entries, the
  number of vnodes used by the target fs is checked. The default limit is the
  number of bucket locks.
  ==
  (by kib) Limit the scope of the optimization in r306608 to the dounmount()
  caller only. Other uses of cache_purgevfs() do rely on the cache purge for
  correct operation, when paths are invalidated without unmount.
  ==
  cache: split negative entry LRU into multiple lists
  This splits the ncneg_mtx lock while preserving the hit ratio at least
  during buildworld. Create N dedicated lists for new negative entries.
  Entries with at least one hit get promoted to the hot list, where they get
  requeued every M hits. Shrinking demotes one hot entry and performs a
  round-robin shrinking of the regular lists. A sketch of this hot/cold
  promotion scheme follows this entry.
  ==
  cache: fix up a corner case in r307650
  If no negative entry is found on the last list, the ncp pointer will be
  left uninitialized and a non-null value will make the function assume an
  entry was found. Fix the problem by initializing it to NULL on entry.
  ==
  (by kib) vn_fullpath1() checked VV_ROOT and then dereferenced
  vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race. Lock
  the vnode after we notice the VV_ROOT flag. See comments for an explanation
  of why the unlocked check for the flag is considered safe.
  ==
  cache: fix a race between entry removal and demotion
  The negative list shrinker can demote an entry with only the hotlist +
  neglist locks held. On the other hand, entry removal possibly sets
  NCF_DVDROP without the aforementioned locks held prior to detaching the
  entry from the respective neglist, which can lose the update made by the
  shrinker.
  ==
  cache: plug a write-only variable in cache_negative_zap_one
  ==
  cache: ensure that the number of bucket locks does not exceed the hash size
  The size can be changed as a side effect of modifying kern.maxvnodes. Since
  numbucketlocks was not modified, setting a sufficiently low value would
  give more locks than actual buckets, which would then lead to corruption.
  Force the number of buckets to be not smaller. Note this should not matter
  for real world cases.
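  A userspace sketch of the hot/cold negative-entry scheme from r307650,
  using sys/queue.h lists; the list count, promotion interval, and all names
  here are illustrative, not the kernel's actual values.

    #include <sys/queue.h>
    #include <stdio.h>

    #define NNEGLISTS 4   /* N regular (cold) lists; hypothetical value */
    #define HOT_EVERY 16  /* hot entries requeued every M hits; hypothetical */

    struct negentry {
        TAILQ_ENTRY(negentry) lru;
        int hits;
        int hot;
        const char *name;
    };

    TAILQ_HEAD(neglist, negentry);

    static struct neglist regular[NNEGLISTS];
    static struct neglist hotlist;
    static int shrink_cursor;   /* round-robin over regular lists */

    /* New negative entries land on one of the N regular lists. */
    static void
    neg_insert(struct negentry *ncp, unsigned hash)
    {
        TAILQ_INSERT_TAIL(&regular[hash % NNEGLISTS], ncp, lru);
    }

    /*
     * The first hit promotes an entry to the hot list; after that it is
     * only moved to the tail every M hits, keeping hot-list lock traffic
     * low for frequently hit entries.
     */
    static void
    neg_hit(struct negentry *ncp, unsigned hash)
    {
        ncp->hits++;
        if (!ncp->hot) {
            TAILQ_REMOVE(&regular[hash % NNEGLISTS], ncp, lru);
            TAILQ_INSERT_TAIL(&hotlist, ncp, lru);
            ncp->hot = 1;
        } else if (ncp->hits % HOT_EVERY == 0) {
            TAILQ_REMOVE(&hotlist, ncp, lru);
            TAILQ_INSERT_TAIL(&hotlist, ncp, lru);
        }
    }

    /*
     * Shrink: demote one hot entry back to a cold list, then pick an
     * eviction victim from the regular lists in round-robin order.
     */
    static struct negentry *
    neg_shrink(void)
    {
        struct negentry *ncp = TAILQ_FIRST(&hotlist);

        if (ncp != NULL) {
            TAILQ_REMOVE(&hotlist, ncp, lru);
            ncp->hot = 0;
            TAILQ_INSERT_TAIL(&regular[shrink_cursor], ncp, lru);
        }
        shrink_cursor = (shrink_cursor + 1) % NNEGLISTS;
        ncp = TAILQ_FIRST(&regular[shrink_cursor]);
        if (ncp != NULL)
            TAILQ_REMOVE(&regular[shrink_cursor], ncp, lru);
        return (ncp);   /* victim to evict, or NULL */
    }

    int
    main(void)
    {
        struct negentry e = { .name = "nonexistent.h" };

        for (int i = 0; i < NNEGLISTS; i++)
            TAILQ_INIT(&regular[i]);
        TAILQ_INIT(&hotlist);

        neg_insert(&e, 0);
        neg_hit(&e, 0);     /* first hit: promoted to the hot list */
        printf("%s hot=%d\n", e.name, e.hot);

        /* Demotes e to a cold list; the next cold list is empty. */
        struct negentry *victim = neg_shrink();
        printf("victim=%s\n", victim ? victim->name : "(none)");
        return (0);
    }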
* MFC r310175 [gnn, 2016-12-30; 1 file, -2/+2]
  Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro.
  Sponsored by: DARPA, AFRL
* MFC r309250: MFV r309249: 3821 Race in rollback, zil close, and zil flush [avg, 2016-12-24; 3 files, -10/+68]
* MFC r309099: MFV r308990: 7181 race between zfs_mount and zfs_ioc_rollback [avg, 2016-12-24; 1 file, -7/+7]
* MFC r309098: MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try to free already free blocks [avg, 2016-12-24; 3 files, -18/+59]
* MFC r309097: MFV r308987: 7180 potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename [avg, 2016-12-24; 3 files, -11/+18]
* MFC r303796: Two fixups for dtrace [jhibbits, 2016-12-23; 2 files, -2/+8]
  * Use the right incantation to get the next stack pointer.
  * Clear EE using the correct instruction sequence.
* MFC r309669 [gnn, 2016-12-20; 1 file, -4/+4]
  Fix a kernel panic in DTrace's rw_iswriter subroutine.
  On FreeBSD the senses of rw_write_held() and rw_iswriter() were reversed,
  probably due to a cut-and-paste error. Using rw_iswriter() would cause the
  kernel to panic.
  Reviewed by: markj
  Sponsored by: DARPA, AFRL