ffmpeg-streaming - Raptor Engineering's fork of FFmpeg with streaming enhancements https://git.ffmpeg.org/ffmpeg.git

	Commit message	Author	Age	Files	Lines
*	avcodec/arm/hevcdsp_sao : add NEON optimization for sao	Meng Wang	2018-04-09	3	-1/+242
\| \| \| \| \| \|	Signed-off-by: Meng Wang <wangmeng.kids@bytedance.com> Reviewed-by: Shengbin Meng <shengbinmeng@gmail.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
*	arm: hevcdsp: Add commas between macro arguments	Martin Storsjö	2018-03-31	1	-18/+18
\| \| \| \| \| \| \| \| \| \|	When targeting darwin, clang requires commas between arguments, while the no-comma form is allowed for other targets. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Signed-off-by: Martin Storsjö <martin@martin.st>
*	arm: hevcdsp: Avoid using macro expansion counters	Martin Storsjö	2018-03-31	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Clang supports the macro expansion counter (used for making unique labels within macro expansions), but not when targeting darwin. Convert uses of the counter into normal local labels, as used elsewhere. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Signed-off-by: Martin Storsjö <martin@martin.st>
*	Merge commit 'ab05d3934de8e932dbd77979a687e6598e67535c'	James Almer	2018-03-30	1	-47/+47
\|\ \| \| \| \| \| \| \| \| \| \| \| \|	* commit 'ab05d3934de8e932dbd77979a687e6598e67535c': arm: vc1dsp: Add commas between macro arguments Merged-by: James Almer <jamrial@gmail.com>
\| *	arm: vc1dsp: Add commas between macro arguments	Martin Storsjö	2018-03-30	1	-47/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When targeting darwin, clang requires commas between arguments, while the no-comma form is allowed for other targets. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	hevc: Add hevc_get_pixel_4/8/12/16/24/32/48/64	Alexandra Hájková	2017-12-08	3	-1/+464
\| \|	Checkasm timings:
\| \|	block size  bitdepth         C      NEON
\| \|	 4           8 bit:      146.7      48.7
\| \|	            10 bit:      146.7      52.7
\| \|	 8           8 bit:      430.3      84.4
\| \|	            10 bit:      430.4     119.5
\| \|	12           8 bit:      812.8     141.0
\| \|	            10 bit:      812.8     195.0
\| \|	16           8 bit:     1499.1     268.0
\| \|	            10 bit:     1498.9     368.4
\| \|	24           8 bit:     4394.2     574.8
\| \|	            10 bit:     3696.3     804.8
\| \|	32           8 bit:     5108.6     568.9
\| \|	            10 bit:     4249.6     918.8
\| \|	48           8 bit:    16819.6    2304.9
\| \|	            10 bit:    13882.0    3178.5
\| \|	64           8 bit:    13490.8    1799.5
\| \|	            10 bit:    11018.5    2519.4
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	sbcenc: add armv6 and neon asm optimizations	Aurelien Jacobs	2018-03-07	4	-0/+1067
\| \| \| \| \| \| \| \|	This was originally based on libsbc, and was fully integrated into ffmpeg.
* \|	avcodec/arm/sbrdsp_neon: Use a free register instead of putting 2 things in one	Michael Niedermayer	2018-01-12	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes high pitched shriek Fixes: 25420848_1478428308873746_4255813235963330560_n.mp4 Reported-by: Dale Curtis <dalecurtis@google.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* \|	arm/hevc_idct: fix compilation on Android	James Almer	2017-12-09	1	-59/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes the "Offset out of range" compilation error for armeabi-v7a. Compilation failed when building libvlc.aar for ARMv7 Android on an Ubuntu 16.04 host. The error message is "Offset out of range". The cause is that the assembler LDR directives in the function "ff_hevc_transform_luma_4x4_neon_8" need local storage within a range of <1k, but no such storage is provided. Based on a patch by Ihor Bobalo <bob@eleks.com> Suggested-by: wbs Signed-off-by: James Almer <jamrial@gmail.com>
* \|	Merge commit 'b487add7ecf78efda36d49815f8f8757bd24d4cb'	James Almer	2017-11-11	1	-4/+2
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit 'b487add7ecf78efda36d49815f8f8757bd24d4cb': arm: Remove a redundant check in fmtconvert_init_arm.c Merged-by: James Almer <jamrial@gmail.com>
\| *	arm: Remove a redundant check in fmtconvert_init_arm.c	Martin Storsjö	2017-10-24	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This was missed in e2710e790c0, where have_vfp && !have_vfpv3 were converted into have_vfp_vm. Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '9dde6ab06c48f9447cd16f39bee33569cddb7be4'	James Almer	2017-11-11	1	-8/+12
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit '9dde6ab06c48f9447cd16f39bee33569cddb7be4': arm: Fix SIGBUS on ARM when compiled with binutils 2.29 Merged-by: James Almer <jamrial@gmail.com>
\| *	arm: Fix SIGBUS on ARM when compiled with binutils 2.29	Martin Storsjö	2017-09-02	1	-8/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In binutils 2.29, the behavior of the ADR instruction changed so that 1 is added to the address of a Thumb function (previously nothing was added). This allows the loaded address to be passed to a BLX instruction and the correct mode change will occur. See: https://sourceware.org/bugzilla/show_bug.cgi?id=21458 By using adr with a label that isn't annotated as a thumb function, we avoid the new behaviour in binutils 2.29 and get the same behaviour as in prior releases, and as in other assemblers (ms armasm.exe, clang's built in assembler) - an idea that Janne Grunau came up with. Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit 'd7320ca3ed10f0d35b3740fa03341161e74275ea'	James Almer	2017-10-30	2	-20/+5
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit 'd7320ca3ed10f0d35b3740fa03341161e74275ea': arm: Avoid using .dn register aliases Merged-by: James Almer <jamrial@gmail.com>
\| *	arm: Avoid using .dn register aliases	Martin Storsjö	2017-05-15	2	-20/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clang now (in the upcoming 5.0 version) is capable of building our arm assembly without relying on gas-preprocessor, although clang/LLVM doesn't support .dn register aliases. The VC1 MC assembly was only built and used if the chosen assembler supported the .dn directives though. This was supported as long as gas-preprocessor was used. This means that VC1 decoding got a speed regression on clang 5.0, unless the user manually chose using gas-preprocessor again. By avoiding using the .dn register aliases, we can build the VC1 MC assembly with the latest clang version. Support for the .dn/.qn directives in clang/LLVM isn't actively planned, see https://bugs.llvm.org/show_bug.cgi?id=18199. This partially reverts 896a5bff64264f4d01ed98eacc97a67260c1e17e. Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit 'ce080f47b8b55ab3d41eb00487b138d9906d114d'	James Almer	2017-10-30	2	-21/+294
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit 'ce080f47b8b55ab3d41eb00487b138d9906d114d': hevc: Add NEON 32x32 IDCT Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add NEON 32x32 IDCT	Alexandra Hájková	2017-05-04	2	-21/+294
\| \| \| \| \| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '118dd4a321a2d67f67c21b076abd0b4d939ab642'	James Almer	2017-10-30	1	-8/+8
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit '118dd4a321a2d67f67c21b076abd0b4d939ab642': hevc: 16x16 NEON idct: Use the right element size for loads/stores Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: 16x16 NEON idct: Use the right element size for loads/stores	Alexandra Hájková	2017-05-04	1	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This doesn't change the actual behaviour of the code but improves readability. Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit 'edbf0fffb15dde7a1de70b05855529d5fc769f14'	James Almer	2017-10-30	2	-0/+102
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit 'edbf0fffb15dde7a1de70b05855529d5fc769f14': hevc: Add NEON add_residual for bitdepth 10 Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add NEON add_residual for bitdepth 10	Alexandra Hájková	2017-05-01	2	-0/+102
\| \| \| \| \| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit 'e1c2453a4fac1f7116244d0d05310935c20887e6'	James Almer	2017-10-30	1	-16/+35
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit 'e1c2453a4fac1f7116244d0d05310935c20887e6': arm: hevc_idct: Tune the add_res_8x8 and add_res_32x32 functions Merged-by: James Almer <jamrial@gmail.com>
\| *	arm: hevc_idct: Tune the add_res_8x8 and add_res_32x32 functions	Martin Storsjö	2017-04-28	1	-16/+35
\| \|	Before:                   Cortex A7       A8       A9      A53
\| \|	hevc_add_res_8x8_8_neon:      116.0     58.7     80.2     90.7
\| \|	hevc_add_res_32x32_8_neon:   1230.0    737.5   1187.5    974.4
\| \|	After:
\| \|	hevc_add_res_8x8_8_neon:       97.7     57.0     73.7     80.0
\| \|	hevc_add_res_32x32_8_neon:   1216.0    698.7   1127.5    827.1
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '0d4d43513786f1df4d561e1fac924fb0722c6700'	James Almer	2017-10-30	2	-75/+92
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* commit '0d4d43513786f1df4d561e1fac924fb0722c6700': hevc: Add NEON add_residual for bitdepth 8 See 03cecf45c134ebbaecb62505fe444ade423ea7dc Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add NEON add_residual for bitdepth 8	Seppo Tomperi	2017-04-27	2	-0/+103
\| \| \| \| \| \| \| \| \| \| \| \|	Optimized by Alexandra Hájková. Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '3d69dd65c6771c28d3bf4e8e53a905aa8cd01fd9'	James Almer	2017-10-30	2	-12/+37
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit '3d69dd65c6771c28d3bf4e8e53a905aa8cd01fd9': hevc: Add support for bitdepth 10 for IDCT DC Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add support for bitdepth 10 for IDCT DC	Alexandra Hájková	2017-04-25	2	-12/+37
\| \| \| \| \| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '358adef0305618219522858e471edf7e0cb4043e'	James Almer	2017-10-30	2	-84/+84
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* commit '358adef0305618219522858e471edf7e0cb4043e': hevc: Add NEON IDCT DC functions for bitdepth 8 See 03cecf45c134ebbaecb62505fe444ade423ea7dc Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add NEON IDCT DC functions for bitdepth 8	Seppo Tomperi	2017-04-25	2	-0/+88
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Alexandra Hájková <alexandra@khirnov.net> Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '89d9869d2491d4209d707a8e7f29c58227ae5a4e'	James Almer	2017-10-27	2	-0/+201
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \|	* commit '89d9869d2491d4209d707a8e7f29c58227ae5a4e': hevc: Add NEON 16x16 IDCT Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add NEON 16x16 IDCT	Alexandra Hájková	2017-04-12	2	-0/+201
\| \| \| \| \| \| \| \| \| \| \| \|	The speedup vs C code is around 6-13x. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: Always build the hevcdsp_init_arm.c file	Martin Storsjö	2017-03-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The main hevcdsp.c file calls this init function if HAVE_ARM is set, regardless of whether neon support is available or not. This fixes builds where neon isn't supported by the build tools at all. Signed-off-by: Martin Storsjö <martin@martin.st>
* \|	Merge commit '0b9a237b2386ff84a6f99716bd58fa27a1b767e7'	James Almer	2017-10-24	4	-238/+217
\|\ \	* commit '0b9a237b2386ff84a6f99716bd58fa27a1b767e7':
\| \|/	  hevc: Add NEON 4x4 and 8x8 IDCT
\| \|
\| \|	[15:12:59] <@ubitux> hevc_idct_4x4_8_c: 389.1
\| \|	[15:13:00] <@ubitux> hevc_idct_4x4_8_neon: 126.6
\| \|	[15:13:02] <@ubitux> our ^
\| \|	[15:13:06] <@ubitux> hevc_idct_4x4_8_c: 389.3
\| \|	[15:13:08] <@ubitux> hevc_idct_4x4_8_neon: 107.8
\| \|	[15:13:10] <@ubitux> hevc_idct_4x4_10_c: 418.6
\| \|	[15:13:12] <@ubitux> hevc_idct_4x4_10_neon: 108.1
\| \|	[15:13:14] <@ubitux> libav ^
\| \|	[15:13:30] <@ubitux> so yeah, we can probably trash our versions here
\| \|
\| \|	Merged-by: James Almer <jamrial@gmail.com>
\| *	hevc: Add NEON 4x4 and 8x8 IDCT	Alexandra Hájková	2017-03-27	3	-0/+277
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Optimized by Martin Storsjö <martin@martin.st>. The speedup vs C code is around 3.2-4.4x. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm/aarch64: vp9: Fix vertical alignment	Martin Storsjö	2017-03-16	2	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used	Martin Storsjö	2017-03-11	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Template the quarter/half idct32 function	Martin Storsjö	2017-03-11	1	-37/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as it was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Reorder iadst16 coeffs	Martin Storsjö	2017-02-24	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Reorder the idct coefficients for better pairing	Martin Storsjö	2017-02-24	1	-62/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Avoid reloading the idct32 coefficients	Martin Storsjö	2017-02-24	1	-126/+120
\| \|	The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform.
\| \|
\| \|	Since the idct16 core transform avoids clobbering q4-q7 (but clobbers q2-q3 instead, to avoid needing to back up and restore q4-q7 at all in the idct16 function), and the lanewise vmul needs a register in the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5 while doing idct16.
\| \|
\| \|	While keeping these coefficients in registers, we still can skip pushing q7.
\| \|
\| \|	Before:                              Cortex A7       A8       A9      A53
\| \|	vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
\| \|	After:
\| \|	vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9lpf: Implement the mix2_44 function with one single filter pass	Martin Storsjö	2017-02-24	2	-3/+195
\| \|	For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter.
\| \|
\| \|	The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though.
\| \|
\| \|	Before:                           Cortex A7       A8       A9      A53
\| \|	vp9_loop_filter_mix2_v_44_16_neon:    289.7    256.2    237.5    181.2
\| \|	After:
\| \|	vp9_loop_filter_mix2_v_44_16_neon:    221.2    150.5    177.7    138.0
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit	Martin Storsjö	2017-02-24	1	-6/+5
\| \|	The theoretical maximum value of E is 193, so we can just saturate the addition to 255.
\| \|
\| \|	Before:                      Cortex A7       A8       A9      A53  A53/AArch64
\| \|	vp9_loop_filter_v_4_8_neon:      143.0    127.7    114.8     88.0         87.7
\| \|	vp9_loop_filter_v_8_8_neon:      241.0    197.2    173.7    140.0        136.7
\| \|	vp9_loop_filter_v_16_8_neon:     497.0    419.5    379.7    293.0        275.7
\| \|	vp9_loop_filter_v_16_16_neon:    965.2    818.7    731.4    579.0        452.0
\| \|	After:
\| \|	vp9_loop_filter_v_4_8_neon:      136.0    125.7    112.6     84.0         83.0
\| \|	vp9_loop_filter_v_8_8_neon:      234.0    195.5    171.5    136.0        133.7
\| \|	vp9_loop_filter_v_16_8_neon:     490.0    417.5    377.7    289.0        271.0
\| \|	vp9_loop_filter_v_16_16_neon:    951.2    814.7    732.3    571.0        446.7
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9lpf: Interleave the start of flat8in into the calculation above	Martin Storsjö	2017-02-11	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9lpf: Use orrs instead of orr+cmp	Martin Storsjö	2017-02-11	1	-8/+4
\| \| \| \| \| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm/aarch64: vp9lpf: Calculate !hev directly	Martin Storsjö	2017-02-11	1	-3/+2
\| \|	Previously we first calculated hev, and then negated it.
\| \|
\| \|	Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases.
\| \|
\| \|	Before:                      Cortex A7       A8       A9      A53  A53/AArch64
\| \|	vp9_loop_filter_v_4_8_neon:      147.0    129.0    115.8     89.0         88.7
\| \|	vp9_loop_filter_v_8_8_neon:      242.0    198.5    174.7    140.0        136.7
\| \|	vp9_loop_filter_v_16_8_neon:     500.0    419.5    382.7    293.0        275.7
\| \|	vp9_loop_filter_v_16_16_neon:    971.2    825.5    731.5    579.0        453.0
\| \|	After:
\| \|	vp9_loop_filter_v_4_8_neon:      143.0    127.7    114.8     88.0         87.7
\| \|	vp9_loop_filter_v_8_8_neon:      241.0    197.2    173.7    140.0        136.7
\| \|	vp9_loop_filter_v_16_8_neon:     497.0    419.5    379.7    293.0        275.7
\| \|	vp9_loop_filter_v_16_16_neon:    965.2    818.7    731.4    579.0        452.0
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling	Martin Storsjö	2017-02-11	1	-18/+36
\| \|	This work is sponsored by, and copyright, Google.
\| \|
\| \|	Before:                             Cortex A7       A8       A9      A53
\| \|	vp9_inv_dct_dct_16x16_sub1_add_neon:    273.0    189.5    211.7    235.8
\| \|	vp9_inv_dct_dct_32x32_sub1_add_neon:    752.0    459.2    862.2    553.9
\| \|	After:
\| \|	vp9_inv_dct_dct_16x16_sub1_add_neon:    226.5    145.0    225.1    171.8
\| \|	vp9_inv_dct_dct_32x32_sub1_add_neon:    721.2    415.7    727.6    475.0
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter	Martin Storsjö	2017-02-11	1	-11/+22
\| \|	Before:                     Cortex A7       A8       A9      A53
\| \|	vp9_put_8tap_smooth_4h_neon:    378.1    273.2    340.7    229.5
\| \|	After:
\| \|	vp9_put_8tap_smooth_4h_neon:    352.1    222.2    290.5    229.5
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function	Martin Storsjö	2017-02-09	1	-2/+1
\| \| \| \| \| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible	Martin Storsjö	2017-02-09	1	-54/+537
\| \|	This work is sponsored by, and copyright, Google.
\| \|
\| \|	This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read.
\| \|
\| \|	This gives a pretty substantial speedup for the smaller subpartitions.
\| \|
\| \|	The code size increases from 12388 bytes to 19784 bytes.
\| \|
\| \|	The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.
\| \|
\| \|	Before:                              Cortex A7       A8       A9      A53
\| \|	vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
\| \|	vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
\| \|	vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
\| \|	vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
\| \|	vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
\| \|	vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
\| \|	vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
\| \|	vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
\| \|	vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
\| \|	vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
\| \|	vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
\| \|	vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
\| \|	vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
\| \|	vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
\| \|	vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
\| \|	vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7
\| \|	After:
\| \|	vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
\| \|	vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
\| \|	vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
\| \|	vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
\| \|	vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
\| \|	vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
\| \|	vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
\| \|	vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
\| \|	vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
\| \|	vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
\| \|	vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
\| \|	vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
\| \|	vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
\| \|	vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
\| \|	vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
\| \|	vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
\| \|
\| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
\| *	arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function	Martin Storsjö	2017-02-09	1	-36/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>