summaryrefslogtreecommitdiffstats
path: root/sys/ufs/ffs/ffs_alloc.c
Commit message (Collapse)AuthorAgeFilesLines
* Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reducemckusick2001-05-041-3/+1
| | | | | the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.
* Revert consequences of changes to mount.h, part 2.grog2001-04-291-2/+0
| | | | Requested by: bde
* Correct #includes to work with fixed sys/mount.h.grog2001-04-231-0/+2
|
* Background fsck sysctl operations must use vn_start_write andmckusick2001-04-171-8/+14
| | | | | vn_finished_write so that they do not attempt to modify a suspended filesystem.
* Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.mckusick2001-04-101-17/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | His description of the problem and solution follow. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved. ------ One day I noticed that some file operations run much faster on small file systems then on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm. First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the perfomance speedup may be. I have done two file/directory intensive tests on a two OpenBSD systems with old and new dirpref algorithm. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are: 1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1. Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35 2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1. Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50 You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html Test Results tar -xzf ports.tar.gz rm -rf ports mode old dirpref new dirpref speedup old dirprefnew dirpref speedup First system normal 667 472 1.41 477 331 1.44 async 285 144 1.98 130 14 9.29 sync 768 616 1.25 477 334 1.43 softdep 413 252 1.64 241 38 6.34 Second system normal 329 81 4.06 263.5 93.5 2.81 async 302 25.7 11.75 112 2.26 49.56 sync 281 57.0 4.93 263 90.5 2.9 softdep 341 40.6 8.4 284 4.76 59.66 "old dirpref" and "new dirpref" columns give a test time in seconds. speedup - speed increasement in times, ie. old dirpref / new dirpref. ------ Algorithm description The old dirpref algorithm is described in comments: /* * Find a cylinder to place a directory. * * The policy implemented by this algorithm is to select from * among those cylinder groups with above the average number of * free inodes, the one with the smallest number of directories. */ A new directory is allocated in a different cylinder groups than its parent directory resulting in a directory tree that is spreaded across all the cylinder groups. This spreading out results in a non-optimal access to the directories and files. When we have a small filesystem it is not a problem but when the filesystem is big then perfomance degradation becomes very apparent. What I mean by a big file system ? 1. A big filesystem is a filesystem which occupy 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other. 2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache. The first results in long access times, while the second results in many buffers being used by metadata operations. Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations. My solution for this problem is very simple. I allocate many directories in one cylinder group. I also do some things, so that the new allocation method does not cause excessive fragmentation and all directory inodes will not be located at a location far from its file's inodes and data. The algorithm is: /* * Find a cylinder group to place a directory. * * The policy implemented by this algorithm is to allocate a * directory inode in the same cylinder group as its parent * directory, but also to reserve space for its files inodes * and data. Restrict the number of directories which may be * allocated one after another in the same cylinder group * without intervening allocation of files. * * If we allocate a first level directory then force allocation * in another cylinder group. */ My early versions of dirpref give me a good results for a wide range of file operations and different filesystem capacities except one case: those applications that create their entire directory structure first and only later fill this structure with files. My solution for such and similar cases is to limit a number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array. The maxcontigdirs is a maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as its containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are: int32_t fs_avgfilesize; /* expected average file size */ int32_t fs_avgfpdir; /* expected # of files per directory */ These parameters have reasonable defaults but may be tweeked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache. I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem perfomance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/coping of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speedup a compilation process, but also doesn't slow it down. Obtained from: Grigoriy Orlov <gluk@ptci.ru>
* Fix typo ); -> ,asmodai2001-03-241-1/+1
|
* Check that background fsck operation is being done on a ufs filesystem.mckusick2001-03-231-0/+2
| | | | Obtained from: Robert Watson <rwatson@FreeBSD.org>
* Add kernel support for running fsck on active filesystems.mckusick2001-03-211-14/+219
|
* Report the correct inode number when panicing with freeing free inode.mckusick2001-03-211-14/+14
| | | | Report the correct block number when panicing with freeing free block.
* Preceed/preceeding are not english words. Use precede and preceding.asmodai2001-02-181-1/+1
|
* Minor change: fix warning - move a 'struct vnode *vp' declaration inside apeter2000-07-281-0/+2
| | | | #ifdef DIAGNOSTIC to match its corresponding usage.
* Add snapshots to the fast filesystem. Most of the changes supportmckusick2000-07-111-1/+16
| | | | | | | | | | | | | | | | | | | | the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
* Separate the struct bio related stuff out of <sys/buf.h> intophk2000-05-051-0/+1
| | | | | | | | | | | | | | | <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
* Introduce extended attribute support for FFS, allowing arbitraryrwatson2000-04-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | (name, value) pairs to be associated with inodes. This support is used for ACLs, MAC labels, and Capabilities in the TrustedBSD security extensions, which are currently under development. In this implementation, attributes are backed to data vnodes in the style of the quota support in FFS. Support for FFS extended attributes may be enabled using the FFS_EXTATTR kernel option (disabled by default). Userland utilities and man pages will be committed in the next batch. VFS interfaces and man pages have been in the repo since 4.0-RELEASE and are unchanged. o ufs/ufs/extattr.h: UFS-specific extattr defines o ufs/ufs/ufs_extattr.c: bulk of support routines o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes o contrib/softupdates/ffs_softdep.c: extattr.h includes o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h (This should not be the case, and will be fixed in a future commit) Currently attributes are not supported in MFS. This will be fixed. Reviewed by: adrian, bp, freebsd-fs, other unthanked souls Obtained from: TrustedBSD Project
* Use 64-bit math to decide if optimization needs to be changed.mckusick2000-03-151-30/+48
| | | | | | Necessary for coherent results on filesystems bigger than 0.5Tb. Submitted by: Paul Saab <ps@yahoo-inc.com>
* Several performance improvements for soft updates have been added:mckusick2000-01-101-0/+7
| | | | | | | | | | | | | | | 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for futher allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
* Change incorrect NULLs to 0seivind1999-12-211-1/+1
|
* Preferentially allocate the first indirect block in the samemckusick1999-12-011-1/+1
| | | | | cylinder group as the inode. This makes a 15% difference in read speed for files in the 96K to 500K size range.
* $Id$ -> $FreeBSD$peter1999-08-281-1/+1
|
* Fix bug introduced in rev 1.28, which causes kernel build to break forsheldonh1999-08-241-2/+4
| | | | | | | | the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is declared conditionally on DIAGNOSTIC, not DEBUG. PR: 13314 Reviewed by: bde
* Use devtoname() to print dev_t's instead of casting them to long or u_longbde1999-08-231-17/+20
| | | | for misprinting in %lx format.
* Try and fix a dev_t/major/minor etc nit.peter1999-05-121-3/+3
|
* Add sufficient braces to keep egcs happy about potentially ambiguouspeter1999-05-061-2/+3
| | | | if/else nesting.
* Don't pass unused unused timestamp args to UFS_UPDATE() or wastebde1999-01-071-6/+3
| | | | | time initializing them. This almost finishes centralizing (in-core) timestamp updates in ufs_itimes().
* Ifdefed the conditionally used variable `prtrealloc'. Declare itbde1999-01-061-2/+4
| | | | | as volatile so that there is no chance that the code that it controls is optimised away.
* Restored the "reallocblks" code to its former glory. What this does isdg1998-11-131-14/+4
| | | | | | basically do a on-the-fly defragmentation of the FFS filesystem, changing file block allocations to make them contiguous. Thanks to Kirk McKusick for providing hints on what needed to be done to get this working.
* Put the zombie ffs sysctl node in "notyet" state together with its fewbde1998-09-071-1/+3
| | | | remaining children. Prepare it for MOUNT_UFS going away.
* Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to informphk1998-09-051-1/+2
| | | | | | | | | | | | | device drivers about sectors no longer in use. Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags. This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions. Reviewed by: Kirk Mckusick, bde Sponsored by: M-Systems (www.m-sys.com)
* Removed unused includes.bde1998-08-171-2/+3
|
* Fixed printf format errors.bde1998-07-111-19/+21
|
* Eradicate the variable "time" from the kernel, using various measures.phk1998-03-301-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "time" wasn't a atomic variable, so splfoo() protection were needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now uses the new variable time_second instead. gettime() changed to getmicrotime(0. Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeard time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passwd &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
* Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)julian1998-03-081-23/+70
| | | | | Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
* Back out DIAGNOSTIC changes.eivind1998-02-061-2/+1
|
* Turn DIAGNOSTIC into a new-style option.eivind1998-02-041-1/+2
|
* Fix a small style bug in the generation number change (rev.1.33) beforebde1997-12-021-2/+2
| | | | copying the change to other fs's.
* Staticized.bde1997-11-221-3/+6
|
* Unremoved prtrealloc and the declaration of ffs_clusteralloc(). Thesebde1997-11-221-1/+9
| | | | | | are used in the `#ifdef notyet' case :-). This case is used except in the `#if !defined (not_yes)' case :-|. This has something to do with the `#ifdef notyet_block_reallocation_enabled' case in vfs_cluster.c :-(.
* Remove a bunch of variables which were unused both in GENERIC and LINT.phk1997-11-071-5/+1
| | | | Found by: -Wunused
* Another VFS cleanup "kilo commit"phk1997-10-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS} intereface function, and now lives in the ufsmount structure. 2. Remove VOP_SEEK, it was unused. 3. Add mode default vops: VOP_ADVLOCK vop_einval VOP_CLOSE vop_null VOP_FSYNC vop_null VOP_IOCTL vop_enotty VOP_MMAP vop_einval VOP_OPEN vop_null VOP_PATHCONF vop_einval VOP_READLINK vop_einval VOP_REALLOCBLKS vop_eopnotsupp And remove identical functionality from filesystems 4. Add vop_stdpathconf, which returns the canonical stuff. Use it in the filesystems. (XXX: It's probably wrong that specfs and fifofs sets this vop, shouldn't it come from the "host" filesystem, for instance ufs or cd9660 ?) 5. Try to make system wide VOP functions have vop_* names. 6. Initialize the um_* vectors in LFS. (Recompile your LKMS!!!)
* VFS mega cleanup commit (x/N)phk1997-10-161-24/+18
| | | | | | | | | | | | | | | | | | | | | | | 1. Add new file "sys/kern/vfs_default.c" where default actions for VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE, POLL, REVOKE and STRATEGY. Various stuff spread over the entire tree belongs here. 2. Change VOP_BLKATOFF to a normal function in cd9660. 3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These are private interface functions between UFS and the underlying storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now live in struct ufsmount instead. 4. Remove a kludge of VOP_ functions in all filesystems, that did nothing but obscure the simplicity and break the expandability. If a filesystem doesn't implement VOP_FOO, it shouldn't have an entry for it in its vnops table. The system will try to DTRT if it is not implemented. There are still some cruft left, but the bulk of it is done. 5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
* I think my previous change may have opened a race conditio.phk1997-10-141-4/+1
| | | | This patch does the same thing, with no change in semantics.
* ufs_ihashrem() should not be called from the UFS layer, but from thephk1997-10-141-1/+4
| | | | | | lower layer (LFS/FFS/?) like the rest of the ihash functions. Otherwise it is impossible to make a lower layer that doesn't use the ihash facility.
* [Regarding the previous patch] This is completely wrong.phk1997-09-191-3/+5
| | | | | | | | | | | | | 1. ffs_alloc() actually allowed writing one block less one frag (normally 7 frags or 7/8 blocks) beyond the limit. 2. freebufspace() gives the free space in frags, but `size' is in bytes, so the change results in approximately `size' fragments too many being reserved. 3. ffs_realloccg() has the same bug but wasn't changed. PR: 3398 Submitted by: bde Eyeballed by: phk
* Ffs_alloc allow users to write one block beyond the limit.phk1997-09-181-2/+2
| | | | | | PR: 3398 Reviewed by: phk Submitted by: Wolfram Schneider <wosch@apfel.de>
* Removed unused #includes.bde1997-09-021-4/+1
|
* We got a couple of "map mismatch" panics from the followingphk1997-08-041-2/+2
| | | | | | | | | | | | | code. According to the crash dump, bpref is set to 445 and cgp->cg_nclusterblks is 444. Hence in the for loop, the test fails immediately but the following failure check (got == cgp->cg_nclusterblks) doesn't trigger because got > cgp->cg_nclusterblks. This wreaks havoc in the code after that. Fix: Move one source bit to the left :-) Noticed by: Mike Hibler <mike@fast.cs.utah.edu> Submitted by: Kirk McKusick <mckusick@McKusick.COM>
* Add generation number randomization. Newly created filesystems wil nowguido1997-03-231-6/+3
| | | | | | | | | automatically have random generation numbers. The kenel way of handling those also changed. Further it is advised to run fsirand on all your nfs exported filesystems. the code is mostly copied from OpenBSD, with the randomization chanegd to use /dev/urandom Reviewed by: Garrett Obtained from: OpenBSD
* Fixed some invalid (non-atomic) accesses to `time', mostly ones of thebde1997-03-221-2/+2
| | | | | | form `tv = time'. Use a new function gettime(). The current version just forces atomicicity without fixing precision or efficiency bugs. Simplified some related valid accesses by using the central function.
* Update a number of panic messages to reflect the actual namempp1997-03-091-7/+7
| | | | of the routine that caused the panic.
* Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are notpeter1997-02-221-1/+1
| | | | ready for it yet.
OpenPOWER on IntegriCloud