Remove the obsolete register keyword from opensolaris's sysmacros.h.
When compiling zfsd with recent clang, this keyword triggers a warning
about the register storage class being incompatible with C++17.
(cherry picked from commit 016480dfbf6cf81626fb8c9cb4827635b58cdd12)
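For illustration, the kind of change involved, as a hedged sketch modeled
on the highbit() inline historically found in sysmacros.h (the actual
header contents may differ):

    /*
     * C++17 removed the register storage class, so recent clang warns
     * when headers still using it are included from C++ code such as
     * zfsd.  The fix is simply to drop the keyword:
     */
    static __inline int
    highbit(unsigned long long i)
    {
            int h = 1;      /* was: register int h = 1; */

            if (i == 0)
                    return (0);
            if (i & 0xffffffff00000000ULL) {
                    h += 32;
                    i >>= 32;
            }
            if (i & 0xffff0000) {
                    h += 16;
                    i >>= 16;
            }
            if (i & 0xff00) {
                    h += 8;
                    i >>= 8;
            }
            if (i & 0xf0) {
                    h += 4;
                    i >>= 4;
            }
            if (i & 0xc) {
                    h += 2;
                    i >>= 2;
            }
            if (i & 0x2)
                    h += 1;
            return (h);
    }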
|
MFV r318942: 8166 zpool scrub thinks it repaired offline device
https://www.illumos.org/issues/8166
If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged. When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started. The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.
The fix is to never clear the DTL of offline devices. Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.
The problem can be worked around by running "zpool scrub" after
"zpool online".
See also https://github.com/zfsonlinux/zfs/issues/5806
PR: 219537
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
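A hedged sketch of the shape of the fix (modeled on the OpenZFS change for
this issue; the in-tree function handles more cases):

    /*
     * A scrub may only excise (clear) DTL ranges of a leaf vdev it was
     * actually able to read.  An offline or faulted leaf was not
     * examined by the scrub, so its DTL must be preserved.
     */
    static boolean_t
    vdev_dtl_should_excise(vdev_t *vd)
    {
            if (vd->vdev_state < VDEV_STATE_DEGRADED)
                    return (B_FALSE);
            return (B_TRUE);
    }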
|
zfs_throttle_resolution
|
Now that r308691 has been MFC-ed as a part of r318716,
r308826 must be MFC-ed as well.
PR: 214629
Reported by: mshirk@daemon-security.com [head], lev [stable/11]
|
Since this is a stable branch, the vfs.zfs.debug_flags sysctl is also
kept. The corresponding tunable could never work.
|
r308489, r308706:
Add PQ_LAUNDRY and remove PG_CACHED pages.
|
vdev_geom may associate multiple vdevs per g_consumer
vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.
Fix this by storing a linked list of vdevs in g_consumer's private field.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
* g_consumer.private now stores a linked list of vdev pointers associated
with the consumer instead of just a single vdev pointer.
* Change vdev_geom_set_physpath's signature to more closely match
vdev_geom_set_rotation_rate
* Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed
that we've already accessed the consumer by the time we get here.
* Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it
in vdev_geom_open, after we know that the open has succeeded.
PR: 218634
Reviewed by: gibbs
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10391
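A hedged sketch of the new arrangement (the committed code may differ in
detail; vdev_geom_orphan_vdev() is a hypothetical helper):

    /* One list element per vdev attached to the consumer. */
    struct consumer_vdev_elem {
            SLIST_ENTRY(consumer_vdev_elem) elems;
            vdev_t                          *vd;
    };
    SLIST_HEAD(consumer_priv_t, consumer_vdev_elem);

    /* A GEOM event now fans out to every vdev on the consumer. */
    static void
    vdev_geom_orphan(struct g_consumer *cp)
    {
            struct consumer_priv_t *priv = cp->private;
            struct consumer_vdev_elem *elem;

            SLIST_FOREACH(elem, priv, elems)
                    vdev_geom_orphan_vdev(elem->vd);
    }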
|
Fix vdev_geom_attach_by_guids for partitioned disks
When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.
The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot. Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.
Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.
Reviewed by: mav
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10365
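A hedged sketch of the selection rule (vdev_geom_label_matches() is a
hypothetical helper; VDEV_LABELS is the on-disk constant, 4):

    /*
     * Count how many of the four labels on this provider match the
     * wanted pool/vdev guid pair; the caller prefers the provider
     * with the highest count.
     */
    static int
    vdev_geom_match_count(struct g_consumer *cp, uint64_t pool_guid,
        uint64_t vdev_guid)
    {
            int i, matches = 0;

            for (i = 0; i < VDEV_LABELS; i++) {
                    if (vdev_geom_label_matches(cp, i, pool_guid,
                        vdev_guid))
                            matches++;
            }
            return (matches);
    }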
|
Reduce ARC fragmentation threshold
Sponsored by: Multiplay
|
Fix expandsz 16.0E values and vdev_min_asize of RAIDZ children
Sponsored by: Multiplay
|
Fix setting birthtime in ZFS
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
* In zfs_freebsd_setattr, if the caller wants to set the birthtime,
set the bits that zfs_setattr expects
* In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime,
expected by zfs_xvattr_set. The two levels of indirection seem
excessive, but it minimizes diffs vs OpenZFS.
* In zfs_setattr, check for overflow of va_birthtime (from delphij)
* Remove red herring in zfs_getattr
sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h
* Un-booby-trap some macros
New tests are under review at https://github.com/pjd/pjdfstest/pull/6
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9353
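A hedged sketch of the mapping (simplified; the real code handles many more
attributes and splits this work across the two functions as described
above):

    /* Translate FreeBSD's va_birthtime into the xvattr request bit
     * that the OpenZFS-derived zfs_setattr expects. */
    if (vap->va_birthtime.tv_sec != VNOVAL) {
            xvap.xva_vattr.va_mask |= AT_XVATTR;
            XVA_SET_REQ(&xvap, XAT_CREATETIME);
            xvap.xva_xoptattrs.xoa_createtime = vap->va_birthtime;
    }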
|
Replace the implementation of DTrace's RAND subroutine for generating
low-quality random numbers with a modern algorithm (xoroshiro128+) that
generates better-quality randomness without compromising performance.
Submitted by: Graeme Jenkinson
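For reference, xoroshiro128+ in its published form (a public-domain
algorithm by Blackman and Vigna; the in-kernel version necessarily differs
at least in how the state is stored and seeded):

    #include <stdint.h>

    static uint64_t rng_state[2];   /* must be seeded to nonzero */

    static inline uint64_t
    rotl(const uint64_t x, int k)
    {
            return ((x << k) | (x >> (64 - k)));
    }

    uint64_t
    xoroshiro128plus_next(void)
    {
            const uint64_t s0 = rng_state[0];
            uint64_t s1 = rng_state[1];
            const uint64_t result = s0 + s1;

            s1 ^= s0;
            rng_state[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
            rng_state[1] = rotl(s1, 36);
            return (result);
    }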
|
clones
|
At least on FreeBSD there is no legal way to access media or get its
size without first opening the device/provider. Postponing this caching
allows several disk seeks per ZVOL/snapshot to be skipped during import.
For an HDD pool with one ZVOL in dev mode and 1000 snapshots, this
reduces pool import time from 40 to 10 seconds.
|
Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz.
|
After some ZIL changes 6 years ago, zil_slog_limit got partially broken
because zl_itx_list_sz was not updated when async itx'es were upgraded
to sync. Because of other changes made around the same time,
zl_itx_list_sz is not actually required to implement the functionality,
so this patch removes some unneeded broken code and variables.

The original idea of zil_slog_limit was to reduce the chance of SLOG
abuse by a single heavy logger, which increased latency for other (more
latency-critical) loggers, by pushing the heavy log out into the main
pool instead of the SLOG. Besides a huge latency increase for heavy
writers, this implementation caused a double write of all data, since
the log records were explicitly prepared for the SLOG. Now that we have
an I/O scheduler, it can be much more efficient to reduce the priority
of a heavy logger's SLOG writes from ZIO_PRIORITY_SYNC_WRITE to
ZIO_PRIORITY_ASYNC_WRITE, while still leaving them on the SLOG.

The existing ZIL implementation had a space-efficiency problem when it
had to write large chunks of data into log blocks of limited size. In
some cases efficiency dropped to almost as low as 50%. With a ZIL
stored on spinning rust, that also cut log write speed in half, since
the head had to uselessly fly over allocated but unwritten areas. This
change improves the situation by offloading the problematic operations
from z*_log_write() to zil_lwb_commit(), which knows the real state of
log block allocation and can split large requests into pieces much more
efficiently. As a side effect it also removes one of the two data copy
operations done by the ZIL code in the WR_COPIED case.

While there, untangle and unify the code of the z*_log_write()
functions. Also, zfs_log_write(), like zvol_log_write(), can now handle
writes crossing a block boundary, which may further improve efficiency
if the ZPL is made to do that.
Sponsored by: iXsystems, Inc.
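A hedged sketch of the priority demotion described above (modeled on the
logic around log-write-block issue; zil_slog_bulk is the threshold tunable
this change introduces):

    zio_priority_t prio;

    /* Small synchronous bursts keep sync priority on the SLOG; a
     * heavy logger is demoted to async priority but stays on the
     * SLOG instead of being pushed to the main pool. */
    if (!lwb->lwb_slog || zilog->zl_cur_used <= zil_slog_bulk)
            prio = ZIO_PRIORITY_SYNC_WRITE;
    else
            prio = ZIO_PRIORITY_ASYNC_WRITE;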
|
It is at least partially broken now, but that is another question.
|
For short transactions the overhead of a context switch can be too
large. Skipping it gives a significant latency reduction. For large
ones, including multiple ZIOs, latency is less critical, while
throughput may become limited by the checksumming speed of a single CPU
core. To get the best of both cases, execute the last ZIO directly from
the calling thread context to save latency, while all others (if there
are any) are enqueued to taskqueues in the traditional way.
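A hedged sketch of the dispatch rule (illustrative only;
zio_next_sibling() is a hypothetical stand-in for however the remaining
ZIOs are enumerated):

    while (zio != NULL) {
            zio_t *next = zio_next_sibling(zio);

            if (next == NULL)
                    zio_execute(zio);       /* last one: run in-context */
            else
                    zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_FALSE);
            zio = next;
    }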
|
Do not invoke the resize event when the provider's previous size was
zero. This is similar to the r303637 fix for geom_disk.
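A hedged sketch of the guard (simplified from the ZVOL GEOM code):

    off_t old_size = pp->mediasize;

    /* A 0 -> N transition means the provider is only now being
     * created; report a resize only for a genuine size change. */
    if (old_size != 0 && old_size != new_size)
            g_resize_provider(pp, new_size);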
|
MFV r313876:
7504 kmem_reap hangs spa_sync and administrative tasks
illumos/illumos-gate@405a5a0f5c3ab36cb76559467d1a62ba648bd809
https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd809
https://www.illumos.org/issues/7504
We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some
other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(),
which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding
dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv.
Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds.
Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
|
Obtained from: Panzura
Sponsored by: Panzura
|
Fix null pointer dereference in zfs_freebsd_setacl().
Prevents unprivileged users from panicking the kernel by calling
__acl_delete_*() on files or directories inside a ZFS mount.
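A hedged sketch of the fix, an up-front check in zfs_freebsd_setacl()
(the __acl_delete_*() system calls reach the VOP with a NULL ACL pointer):

    if (ap->a_aclp == NULL)
            return (EINVAL);        /* reject instead of dereferencing */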
|
PR: 216939
|
Fix an off-by-one in an assertion on fasttrap tracepoint sizes.
|
Remove an obsolete pragma from dtrace.h.
|
r308407,r308665,r308667,r309067:
cache: put all negative entry management code into dedicated functions
==
cache: manage negative entry list with a dedicated lock
Since negative entries are managed with a LRU list, a hit requires a
modificaton.
Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.
Provide a dedicated lock for use when the cache is only shared-locked.
==
cache: defer freeing entries until after the global lock is dropped
This also defers vdrop for held vnodes.
==
cache: improve scalability by introducing bucket locks
An array of bucket locks is added.
All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently with the writer as long as they use a
different lock. See the added comment for more details.
This is an intermediate step towards removal of the global lock.
==
cache: get rid of the global lock
Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.
Lookups still require the relevant bucket to be locked.
==
cache: ignore purgevfs requests for filesystems with few vnodes
purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.
In particular, poudriere will do this for filesystems with 4 vnodes or
fewer. A full cache scan is clearly wasteful.
Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.
The default limit is the number of bucket locks.
== (by kib)
Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.
==
cache: split negative entry LRU into multiple lists
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.
Create N dedicated lists for new negative entries.
Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.
Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.
==
cache: fix up a corner case in r307650
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.
Fix the problem by initializing to NULL on entry.
== (by kib)
vn_fullpath1() checked VV_ROOT and then dereferenced
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.
==
cache: fix a race between entry removal and demotion
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand, entry removal can set the NCF_DVDROP flag
without the aforementioned locks held prior to detaching the entry from
the respective neglist, which can lose the update made by the shrinker.
==
cache: plug a write-only variable in cache_negative_zap_one
==
cache: ensure that the number of bucket locks does not exceed hash size
The size can be changed by side effect of modifying kern.maxvnodes.
Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.
Force the number of buckets to be no smaller than the number of bucket locks.
Note this should not matter for real world cases.
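A hedged sketch of the bucket-lock idea at the core of this series (the
real vfs_cache.c differs in detail):

    /* Many hash chains share one rwlock; readers lock only the lock
     * covering their chain, so lookups in different chains do not
     * contend with each other or with writers on other chains. */
    static struct rwlock    bucketlocks[64];  /* sized at boot in reality */

    static struct rwlock *
    cache_bucketlock(uint32_t hash)
    {
            return (&bucketlocks[hash % nitems(bucketlocks)]);
    }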
|
Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro
Sponsored by: DARPA, AFRL
|
to free already free blocks
|
zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename
|
Fix a kernel panic in DTrace's rw_iswriter subroutine.
On FreeBSD the sense of rw_write_held() and rw_iswriter() was reversed,
probably due to a cut-and-paste error. Using rw_iswriter() would cause
the kernel to panic.
Reviewed by: markj
Sponsored by: DARPA, AFRL
|
Add tunable to disable destructive dtrace
Submitted by: Joerg Pernfuss <code.jpe@gmail.com>
Reviewed by: rstone, markj
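A hedged sketch of the wiring (dtrace_destructive_disallow already exists
in the DTrace core; the tunable name below is an assumption for
illustration):

    extern int dtrace_destructive_disallow;

    static int      dtrace_allow_destructive = 1;
    TUNABLE_INT("security.bsd.allow_destructive_dtrace",
        &dtrace_allow_destructive);

    /* At attach time: honor the boot-time knob. */
    if (!dtrace_allow_destructive)
            dtrace_destructive_disallow = 1;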
|
The original commit "7090 zfs should improve allocation order" declares
the alloc queue sorted by time and offset. But in practice io_offset is
always zero, so sorting happened only by time, while the order of writes
with equal time was completely random. On illumos this did not matter
much thanks to high-resolution timestamps. On FreeBSD, which uses much
faster but low-resolution timestamps, it caused bad data placement on
disks, hurting subsequent read performance.
This change switches zio_timestamp_compare() from comparing the
uninitialized io_offset to the really populated io_bookmark values. I
haven't decided yet what to do with the timestamps, but in simple tests
this change gives the same performance results by just making the code
work as declared.
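A hedged sketch of the comparator change (the real function may break any
remaining ties further):

    static int
    zio_timestamp_compare(const void *x1, const void *x2)
    {
            const zio_t *z1 = x1, *z2 = x2;

            if (z1->io_timestamp < z2->io_timestamp)
                    return (-1);
            if (z1->io_timestamp > z2->io_timestamp)
                    return (1);
            /* Equal timestamps: order by the populated bookmark
             * instead of the never-initialized io_offset. */
            if (z1->io_bookmark.zb_objset < z2->io_bookmark.zb_objset)
                    return (-1);
            if (z1->io_bookmark.zb_objset > z2->io_bookmark.zb_objset)
                    return (1);
            if (z1->io_bookmark.zb_object < z2->io_bookmark.zb_object)
                    return (-1);
            if (z1->io_bookmark.zb_object > z2->io_bookmark.zb_object)
                    return (1);
            return (0);
    }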
|
DIAGNOSTIC is enabled
|
options for zfsboot
|
Fix ZIL records ordering when ZVOL opened both with and without FSYNC.
Before this change, earlier writes to a ZVOL opened without FSYNC could
get to the ZIL after later writes to the same ZVOL opened with FSYNC.
Fix this by replicating the ZPL's functionality (zv_sync_cnt, equivalent
to z_sync_cnt), marking all log records sync if anybody has opened the
ZVOL with FSYNC.
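A hedged sketch of the mechanism (mirroring the ZPL's z_sync_cnt; the
zvol open/close bookkeeping is omitted):

    /* In the ZVOL open path: remember that a sync opener exists. */
    if (flag & (FSYNC | FDSYNC))
            zv->zv_sync_cnt++;

    /* In zvol_log_write(): any sync opener forces sync records. */
    itx->itx_sync = ((ioflag & (FSYNC | FDSYNC)) != 0 ||
        zv->zv_sync_cnt != 0);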