path: root/libavcodec/aarch64
...
* mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813adc5dbb7771dfe615da826a2947d18
  Diego Biurrun, 2016-11-10 (1 file, -2/+3)

* aarch64/vp9dsp: add missing header includes
  James Almer, 2017-03-28 (2 files, -0/+2)

* vp9: re-split the decoder/format/dsp interface header files.
  Ronald S. Bultje, 2017-03-28 (2 files, -2/+2)

  The advantage here is that the internal software decoder interface is not
  exposed to the DSP functions or the hardware accelerations.

* lavc/vp9: split into vp9{block,data,mvs}
  Clément Bœsch, 2017-03-27 (2 files, -2/+2)

  This follows the Libav layout to ease merges.

* Merge commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b'
  Clément Bœsch, 2017-03-23 (1 file, -0/+1)

  * commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b':
    aarch64: Add missing sign extension in ff_h264_idct8_add_neon

  Merged-by: Clément Bœsch <u@pkh.me>

* aarch64: Add missing sign extension in ff_h264_idct8_add_neon
  Martin Storsjö, 2016-10-10 (1 file, -0/+1)

  Signed-off-by: Martin Storsjö <martin@martin.st>

* Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18'
  James Almer, 2017-03-21 (2 files, -3/+2)

  * commit '2caa93b813adc5dbb7771dfe615da826a2947d18':
    mpegaudiodsp: Change type of array stride parameters to ptrdiff_t

  Merged-by: James Almer <jamrial@gmail.com>

* mpegaudiodsp: Change type of array stride parameters to ptrdiff_t
  Diego Biurrun, 2016-09-29 (1 file, -1/+0)

  This avoids SIMD-optimized functions having to sign-extend their stride
  argument manually to be able to do pointer arithmetic.

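  A sketch of the underlying issue, with illustrative registers rather than
  the actual FFmpeg code: under the AAPCS64 calling convention a 32-bit int
  argument arrives in a w register and the upper half of the corresponding
  x register is undefined, so hand-written assembly must sign-extend it
  before 64-bit pointer arithmetic, whereas a ptrdiff_t arrives as a full
  64-bit value.

      // int stride: only w2 is valid; sign-extend before pointer math
      sxtw    x2, w2
      add     x0, x0, x2          // dst += stride

      // ptrdiff_t stride: x2 already holds a valid 64-bit value
      add     x0, x0, x2          // dst += stride, no sxtw needed
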
* Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c'
  James Almer, 2017-03-21 (4 files, -27/+24)

  * commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c':
    h264chroma: Change type of stride parameters to ptrdiff_t

  Merged-by: James Almer <jamrial@gmail.com>

* h264chroma: Change type of stride parameters to ptrdiff_t
  Diego Biurrun, 2016-09-29 (4 files, -27/+24)

  This avoids SIMD-optimized functions having to sign-extend their stride
  argument manually to be able to do pointer arithmetic.

* Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428'
  James Almer, 2017-03-21 (1 file, -2/+2)

  * commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428':
    idct: Change type of array stride parameters to ptrdiff_t

  Merged-by: James Almer <jamrial@gmail.com>

* Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0'
  Clément Bœsch, 2017-03-21 (1 file, -4/+4)

  * commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0':
    lavc: add clobber tests for the new encoding/decoding API

  The merge only re-orders what we already have.

  Merged-by: Clément Bœsch <u@pkh.me>

* lavc: add clobber tests for the new encoding/decoding API
  Anton Khirnov, 2016-09-28 (1 file, -0/+20)

* libavcodec: fix constness in clobber test avcodec_open2() wrappers
  Clément Bœsch, 2016-06-26 (1 file, -1/+1)

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
  Martin Storsjö, 2017-03-19 (1 file, -58/+547)

  This work is sponsored by, and copyright, Google.

  This avoids loading and calculating coefficients that we know will be zero,
  and avoids filling the temp buffer with zeros in places where we know the
  second pass won't read.

  This gives a pretty substantial speedup for the smaller subpartitions.

  The code size increases from 21512 bytes to 31400 bytes.

  The idct16/32_end macros are moved above the individual functions; the
  instructions themselves are unchanged, but since new functions are added at
  the same place where the code is moved from, the diff looks rather messy.

  Before:
  vp9_inv_dct_dct_16x16_sub1_add_10_neon:     284.6
  vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1902.7
  vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1903.0
  vp9_inv_dct_dct_16x16_sub8_add_10_neon:    2201.1
  vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2510.0
  vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2821.3
  vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1011.6
  vp9_inv_dct_dct_32x32_sub2_add_10_neon:    9716.5
  vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9704.9
  vp9_inv_dct_dct_32x32_sub8_add_10_neon:   10641.7
  vp9_inv_dct_dct_32x32_sub12_add_10_neon:  11555.7
  vp9_inv_dct_dct_32x32_sub16_add_10_neon:  12499.8
  vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13403.7
  vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14335.8
  vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15253.6
  vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16179.5

  After:
  vp9_inv_dct_dct_16x16_sub1_add_10_neon:     282.8
  vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1142.4
  vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1139.0
  vp9_inv_dct_dct_16x16_sub8_add_10_neon:    1772.9
  vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2515.2
  vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2823.5
  vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1012.7
  vp9_inv_dct_dct_32x32_sub2_add_10_neon:    6944.4
  vp9_inv_dct_dct_32x32_sub4_add_10_neon:    6944.2
  vp9_inv_dct_dct_32x32_sub8_add_10_neon:    7609.8
  vp9_inv_dct_dct_32x32_sub12_add_10_neon:   9953.4
  vp9_inv_dct_dct_32x32_sub16_add_10_neon:  10770.1
  vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13418.8
  vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14330.7
  vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15257.1
  vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16190.6

  Signed-off-by: Martin Storsjö <martin@martin.st>

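  The dispatch idea, sketched with hypothetical eob thresholds and label
  names, and assuming the end-of-block count arrives in w3: compare the
  number of nonzero coefficients against cutoffs and branch to a cheaper
  transform that only processes the nonzero top-left block.

      cmp     w3, #10                 // threshold values are illustrative
      b.le    idct16_quarter          // only the top-left 4x4 coeffs nonzero
      cmp     w3, #38
      b.le    idct16_half             // only the top-left 8x8 coeffs nonzero
      b       idct16_full             // otherwise run the full transform
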
* aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function
  Martin Storsjö, 2017-03-19 (1 file, -49/+49)

  This allows reusing the macro for a separate implementation of the pass2
  function.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm16: Make the larger core transforms standalone functions
  Martin Storsjö, 2017-03-19 (1 file, -17/+28)

  This work is sponsored by, and copyright, Google.

  This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
  26288 to 21512 bytes.

  This gives a small slowdown of a couple of tens of cycles, but makes it
  more feasible to add more optimized versions of these transforms.

  Before:
  vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1887.4
  vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2801.5
  vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9691.4
  vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16154.9

  After:
  vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1899.5
  vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2827.2
  vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9714.7
  vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16175.9

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm16: Restructure the idct32 store macros
  Martin Storsjö, 2017-03-19 (1 file, -45/+45)

  This avoids concatenation, which can't be used if the whole macro is
  wrapped within another macro.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines
  Martin Storsjö, 2017-03-19 (1 file, -12/+12)

  This makes the code a bit more readable.

  Signed-off-by: Martin Storsjö <martin@martin.st>

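  For reference, .irp expands its body once per listed value; with only two
  values the directive saves nothing, as in this illustration (registers
  arbitrary):

      // three source lines with .irp ...
      .irp    i, 16, 17
      add     v\i\().8h, v\i\().8h, v0.8h
      .endr

      // ... versus two when written out directly
      add     v16.8h, v16.8h, v0.8h
      add     v17.8h, v17.8h, v0.8h
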
* aarch64: vp9itxfm16: Fix a typo in a comment
  Martin Storsjö, 2017-03-19 (1 file, -1/+1)

  Signed-off-by: Martin Storsjö <martin@martin.st>

* arm/aarch64: vp9: Fix vertical alignment
  Martin Storsjö, 2017-03-19 (1 file, -18/+18)

  Align the second/third operands as they usually are. Due to the wildly
  varying sizes of the written-out operands in aarch64 assembly, the column
  alignment is usually not as clear as in arm assembly.

  This is cherrypicked from libav commit 7995ebfad12002033c73feed422a1cfc62081e8f.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
  Martin Storsjö, 2017-03-19 (1 file, -1/+2)

  In the half/quarter cases where we don't use the min_eob array, defer
  loading the pointer until we know it will be needed.

  This is cherrypicked from libav commit 3a0d5e206d24d41d87a25ba16a79b2ea04c39d4c.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functions
  Matthieu Bouron, 2017-03-16 (4 files, -0/+433)

* aarch64: vp9itxfm: Reorder iadst16 coeffs
  Martin Storsjö, 2017-03-11 (1 file, -6/+6)

  This matches the order they are in the 16 bpp version. There they are in
  this order, to make sure we access them in the same order they are
  declared, easing loading only half of the coefficients at a time.

  This makes the 8 bpp version match the 16 bpp version better.

  This is cherrypicked from libav commit b8f66c0838b4c645227f23a35b4d54373da4c60a.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
  Martin Storsjö, 2017-03-11 (1 file, -62/+62)

  All elements are used pairwise, except for the first one. Previously, the
  16th element was unused. Move the unused element to the second slot, to
  make the later element pairs not split across registers.

  This simplifies loading only parts of the coefficients, reducing the
  difference to the 16 bpp version.

  This is cherrypicked from libav commit 09eb88a12e008d10a3f7a6be75d18ad98b368e68.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
  Martin Storsjö, 2017-03-11 (1 file, -67/+43)

  The idct32x32 function actually pushed d8-d15 onto the stack even though it
  didn't clobber them; there are plenty of registers that can be used to
  allow keeping all the idct coefficients in registers without having to
  reload different subsets of them at different stages in the transform.

  After this, we still can skip pushing d12-d15.

  Before:
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8128.3

  After:
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8053.3

  This is cherrypicked from libav commit 65aa002d54433154a6924dc13e498bec98451ad0.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
  Martin Storsjö, 2017-03-11 (1 file, -12/+9)

  This is one cycle faster in total, and three instructions fewer.

  Before:
  vp9_loop_filter_mix2_v_44_16_neon:   123.2

  After:
  vp9_loop_filter_mix2_v_44_16_neon:   122.2

  This is cherrypicked from libav commit 3bf9c48320f25f3d5557485b0202f22ae60748b0.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
  Martin Storsjö, 2017-03-11 (1 file, -31/+9)

  The theoretical maximum value of E is 193, so we can just saturate the
  addition to 255.

  Before:                        Cortex A7      A8      A9     A53  A53/AArch64
  vp9_loop_filter_v_4_8_neon:        143.0   127.7   114.8    88.0         87.7
  vp9_loop_filter_v_8_8_neon:        241.0   197.2   173.7   140.0        136.7
  vp9_loop_filter_v_16_8_neon:       497.0   419.5   379.7   293.0        275.7
  vp9_loop_filter_v_16_16_neon:      965.2   818.7   731.4   579.0        452.0

  After:
  vp9_loop_filter_v_4_8_neon:        136.0   125.7   112.6    84.0         83.0
  vp9_loop_filter_v_8_8_neon:        234.0   195.5   171.5   136.0        133.7
  vp9_loop_filter_v_16_8_neon:       490.0   417.5   377.7   289.0        271.0
  vp9_loop_filter_v_16_16_neon:      951.2   814.7   732.3   571.0        446.7

  This is cherrypicked from libav commit c582cb8537367721bb399a5d01b652c20142b756.

  Signed-off-by: Martin Storsjö <martin@martin.st>

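  A sketch of the trick with illustrative register assignments, assuming E
  has been broadcast into v0.16b: the threshold sum
  abs(p0 - q0) * 2 + abs(p1 - q1) / 2 can exceed 8 bits, but since E <= 193,
  clamping the sum at 255 with unsigned saturating adds cannot flip the
  comparison, so no widening to 16 bits is needed.

      uabd    v4.16b, v22.16b, v25.16b    // abs(p0 - q0)
      uabd    v5.16b, v21.16b, v26.16b    // abs(p1 - q1)
      uqadd   v4.16b, v4.16b, v4.16b      // 2 * abs(p0 - q0), saturated
      ushr    v5.16b, v5.16b, #1          // abs(p1 - q1) / 2
      uqadd   v4.16b, v4.16b, v5.16b      // sum, saturated to 255
      cmhs    v4.16b, v0.16b, v4.16b      // E >= sum?
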
* aarch64: vp9lpf: Fix broken indentation/vertical alignment
  Martin Storsjö, 2017-03-11 (1 file, -2/+2)

  This is cherrypicked from libav commit 07b5136c481d394992c7e951967df0cfbb346c0b.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9lpf: Interleave the start of flat8in into the calculation above
  Martin Storsjö, 2017-03-11 (1 file, -3/+11)

  This adds lots of extra .ifs, but speeds it up by a couple of cycles by
  avoiding stalls.

  This is cherrypicked from libav commit b0806088d3b27044145b20421da8d39089ae0c6a.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* arm/aarch64: vp9lpf: Calculate !hev directly
  Martin Storsjö, 2017-03-11 (1 file, -3/+2)

  Previously we first calculated hev, and then negated it. Since the negation
  could be scheduled in the middle of another calculation, we don't see a
  gain in all cases.

  Before:                        Cortex A7      A8      A9     A53  A53/AArch64
  vp9_loop_filter_v_4_8_neon:        147.0   129.0   115.8    89.0         88.7
  vp9_loop_filter_v_8_8_neon:        242.0   198.5   174.7   140.0        136.7
  vp9_loop_filter_v_16_8_neon:       500.0   419.5   382.7   293.0        275.7
  vp9_loop_filter_v_16_16_neon:      971.2   825.5   731.5   579.0        453.0

  After:
  vp9_loop_filter_v_4_8_neon:        143.0   127.7   114.8    88.0         87.7
  vp9_loop_filter_v_8_8_neon:        241.0   197.2   173.7   140.0        136.7
  vp9_loop_filter_v_16_8_neon:       497.0   419.5   379.7   293.0        275.7
  vp9_loop_filter_v_16_16_neon:      965.2   818.7   731.4   579.0        452.0

  This is cherrypicked from libav commit e1f9de86f454861b69b199ad801adc2ec6c3b220.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
  Martin Storsjö, 2017-03-11 (1 file, -18/+36)

  This work is sponsored by, and copyright, Google.

  Before:                                 Cortex A53
  vp9_inv_dct_dct_16x16_sub1_add_neon:         235.3
  vp9_inv_dct_dct_32x32_sub1_add_neon:         555.1

  After:
  vp9_inv_dct_dct_16x16_sub1_add_neon:         180.2
  vp9_inv_dct_dct_32x32_sub1_add_neon:         475.3

  This is cherrypicked from libav commit 3fcf788fbbccc4130868e7abe58a88990290f7c1.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
  Martin Storsjö, 2017-03-11 (1 file, -2/+13)

  No measured speedup on a Cortex A53, but other cores might benefit.

  This is cherrypicked from libav commit 388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9mc: Simplify the extmla macro parameters
  Martin Storsjö, 2017-03-11 (1 file, -25/+25)

  Fold the field lengths into the macro. This makes the macro invocations
  much more readable, when the lines are shorter.

  This also makes it easier to use only half the registers within the macro.

  This is cherrypicked from libav commit 5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8.

  Signed-off-by: Martin Storsjö <martin@martin.st>

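  The shape of the change, as a hypothetical macro in the same spirit (this
  is not the actual FFmpeg extmla definition; the ext/mla body and register
  choices are made up for illustration):

      // before: element sizes repeated on every operand at each call site
      //     extmla  v1.8h, v2.8h, v3.8h, 1
      // after: the macro appends the sizes itself, so calls get shorter
      //     extmla  v1, v2, v3, 1
      .macro  extmla  dst, src1, src2, offset
              ext     v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
              mla     \dst\().8h, v20.8h, v0.h[\offset]
      .endm
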
* aarch64: vp9itxfm: Fix incorrect vertical alignment
  Martin Storsjö, 2017-03-11 (1 file, -3/+3)

  This is cherrypicked from libav commit 0c0b87f12d48d4e7f0d3d13f9345e828a3a5ea32.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Update a comment to refer to a register with a different name
  Martin Storsjö, 2017-03-11 (1 file, -2/+2)

  This is cherrypicked from libav commit 8476eb0d3ab1f7a52317b23346646389c08fb57a.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability
  Martin Storsjö, 2017-03-11 (1 file, -8/+8)

  This is cherrypicked from libav commit 3dd7827258ddaa2e51085d0c677d6f3b1be3572f.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible
  Martin Storsjö, 2017-03-11 (1 file, -8/+8)

  The ld1r is a leftover from the arm version, where this trick is beneficial
  on some cores. Use a single-lane load where we don't need the semantics of
  ld1r.

  This is cherrypicked from libav commit ed8d293306e12c9b79022d37d39f48825ce7f2fa.

  Signed-off-by: Martin Storsjö <martin@martin.st>

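  The difference, in isolation (address and destination registers are
  illustrative):

      // ld1r loads one element and broadcasts it into every lane
      ld1r    {v0.8h}, [x2]

      // a single-lane ld1 writes only lane 0 and leaves the rest alone;
      // when later code only reads lane 0, the broadcast is wasted work
      // on some cores
      ld1     {v0.h}[0], [x2]
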
* aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
  Martin Storsjö, 2017-03-11 (1 file, -2/+1)

  This is cherrypicked from libav commit 4da4b2b87f08a1331650c7e36eb7d4029a160776.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32
  Martin Storsjö, 2017-03-11 (1 file, -59/+466)

  This work is sponsored by, and copyright, Google.

  This avoids loading and calculating coefficients that we know will be zero,
  and avoids filling the temp buffer with zeros in places where we know the
  second pass won't read.

  This gives a pretty substantial speedup for the smaller subpartitions.

  The code size increases from 14740 bytes to 24292 bytes.

  The idct16/32_end macros are moved above the individual functions; the
  instructions themselves are unchanged, but since new functions are added at
  the same place where the code is moved from, the diff looks rather messy.

  Before:
  vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
  vp9_inv_dct_dct_16x16_sub2_add_neon:    1051.0
  vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
  vp9_inv_dct_dct_16x16_sub8_add_neon:    1051.0
  vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
  vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
  vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
  vp9_inv_dct_dct_32x32_sub2_add_neon:    5198.5
  vp9_inv_dct_dct_32x32_sub4_add_neon:    5198.6
  vp9_inv_dct_dct_32x32_sub8_add_neon:    5196.3
  vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
  vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
  vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
  vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
  vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7

  After:
  vp9_inv_dct_dct_16x16_sub1_add_neon:     236.7
  vp9_inv_dct_dct_16x16_sub2_add_neon:     640.8
  vp9_inv_dct_dct_16x16_sub4_add_neon:     639.0
  vp9_inv_dct_dct_16x16_sub8_add_neon:     842.0
  vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
  vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
  vp9_inv_dct_dct_32x32_sub1_add_neon:     554.1
  vp9_inv_dct_dct_32x32_sub2_add_neon:    3685.5
  vp9_inv_dct_dct_32x32_sub4_add_neon:    3685.1
  vp9_inv_dct_dct_32x32_sub8_add_neon:    3684.4
  vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
  vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
  vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
  vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
  vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2

  This is cherrypicked from libav commit a63da4511d0fee66695ff4afd264ba1dbf1e812d.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function
  Martin Storsjö, 2017-03-11 (1 file, -45/+45)

  This allows reusing the macro for a separate implementation of the pass2
  function.

  This is cherrypicked from libav commit 79d332ebbde8c0a3e9da094dcfd10abd33ba7378.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Make the larger core transforms standalone functions
  Martin Storsjö, 2017-03-11 (1 file, -17/+25)

  This work is sponsored by, and copyright, Google.

  This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496
  to 14740 bytes.

  This gives a small slowdown of a couple of tens of cycles, but makes it
  more feasible to add more optimized versions of these transforms.

  Before:
  vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
  vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
  vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7

  After:
  vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
  vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
  vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8

  This is cherrypicked from libav commit 115476018d2c97df7e9b4445fe8f6cc7420ab91f.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9itxfm: Restructure the idct32 store macros
  Martin Storsjö, 2017-03-11 (1 file, -40/+40)

  This avoids concatenation, which can't be used if the whole macro is
  wrapped within another macro. This is also arguably more readable.

  This is cherrypicked from libav commit 58d87e0f49bcbbc6f426328f53b657bae7430cd2.

  Signed-off-by: Martin Storsjö <martin@martin.st>

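  The restructuring pattern, reduced to a minimal illustration (the macro
  and register names here are made up, not the actual idct32 store macros):

      // fragile: the register name is built by concatenating a numeric
      // argument; the \n\() escaping breaks down once this macro body is
      // itself generated from within another macro
      .macro  store_n n
              st1     {v\n\().8h}, [x0], #16
      .endm
      store_n 16

      // robust: the caller passes the complete operand instead
      .macro  store_r reg
              st1     {\reg}, [x0], #16
      .endm
      store_r v16.8h
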
* aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter
  Martin Storsjö, 2017-01-24 (3 files, -0/+936)

  This work is sponsored by, and copyright, Google.

  This is similar to the arm version, but due to the larger registers on
  aarch64, we can do 8 pixels at a time for all filter sizes.

  Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                               ARM     AArch64
  vp9_loop_filter_h_4_8_10bpp_neon:          213.2       172.6
  vp9_loop_filter_h_8_8_10bpp_neon:          281.2       244.2
  vp9_loop_filter_h_16_8_10bpp_neon:         657.0       444.5
  vp9_loop_filter_h_16_16_10bpp_neon:       1280.4       877.7
  vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7       358.0
  vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7       429.0
  vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7       428.0
  vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7       499.0
  vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5       244.0
  vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0       305.0
  vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0       306.0
  vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0       365.0
  vp9_loop_filter_v_4_8_10bpp_neon:          150.0       115.2
  vp9_loop_filter_v_8_8_10bpp_neon:          209.0       175.5
  vp9_loop_filter_v_16_8_10bpp_neon:         492.7       345.2
  vp9_loop_filter_v_16_16_10bpp_neon:        951.0       682.7

  This is significantly faster than the ARM version in almost all cases
  except for the mix2 functions.

  Based on START_TIMER/STOP_TIMER wrapping around a few individual functions,
  the speedup vs C code is around 2-3x.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm
  Martin Storsjö, 2017-01-24 (3 files, -1/+1566)

  This work is sponsored by, and copyright, Google.

  Compared to the arm version, on aarch64 we can keep the full 8x8 transform
  in registers, and for 16x16 and 32x32, we can process it in slices of 4
  pixels instead of 2.

  Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                   ARM     AArch64
  vp9_inv_adst_adst_4x4_sub4_add_10_neon:        111.0       109.7
  vp9_inv_adst_adst_8x8_sub8_add_10_neon:        914.0       733.5
  vp9_inv_adst_adst_16x16_sub16_add_10_neon:    5184.0      3745.7
  vp9_inv_dct_dct_4x4_sub1_add_10_neon:           65.0        65.7
  vp9_inv_dct_dct_4x4_sub4_add_10_neon:          100.0        96.7
  vp9_inv_dct_dct_8x8_sub1_add_10_neon:          111.0       119.7
  vp9_inv_dct_dct_8x8_sub8_add_10_neon:          618.0       494.7
  vp9_inv_dct_dct_16x16_sub1_add_10_neon:        295.1       284.6
  vp9_inv_dct_dct_16x16_sub2_add_10_neon:       2303.2      1883.9
  vp9_inv_dct_dct_16x16_sub8_add_10_neon:       2984.8      2189.3
  vp9_inv_dct_dct_16x16_sub16_add_10_neon:      3890.0      2799.4
  vp9_inv_dct_dct_32x32_sub1_add_10_neon:       1044.4      1012.7
  vp9_inv_dct_dct_32x32_sub2_add_10_neon:      13333.7      9695.1
  vp9_inv_dct_dct_32x32_sub16_add_10_neon:     18531.3     12459.8
  vp9_inv_dct_dct_32x32_sub32_add_10_neon:     24470.7     16160.2
  vp9_inv_wht_wht_4x4_sub4_add_10_neon:           83.0        79.7

  The larger transforms are significantly faster than the corresponding ARM
  versions.

  The speedup vs C code is smaller than in 32 bit mode, probably because the
  64 bit intermediates in the C code can be expressed more efficiently in
  aarch64.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC
  Martin Storsjö, 2017-01-24 (7 files, -2/+881)

  This work is sponsored by, and copyright, Google.

  This has mostly got the same differences to the 8 bit version as in the arm
  version. For the horizontal filters, we do 16 pixels in parallel as well.
  For the 8 pixel wide vertical filters, we can accumulate 4 rows before
  storing, just as in the 8 bit version.

  Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                 ARM     AArch64
  vp9_avg4_10bpp_neon:                          35.7        30.7
  vp9_avg8_10bpp_neon:                          93.5        84.7
  vp9_avg16_10bpp_neon:                        324.4       296.6
  vp9_avg32_10bpp_neon:                       1236.5      1148.2
  vp9_avg64_10bpp_neon:                       4639.6      4571.1
  vp9_avg_8tap_smooth_4h_10bpp_neon:           130.0       128.0
  vp9_avg_8tap_smooth_4hv_10bpp_neon:          440.0       440.5
  vp9_avg_8tap_smooth_4v_10bpp_neon:           114.0       105.5
  vp9_avg_8tap_smooth_8h_10bpp_neon:           327.0       314.0
  vp9_avg_8tap_smooth_8hv_10bpp_neon:          918.7       865.4
  vp9_avg_8tap_smooth_8v_10bpp_neon:           330.0       300.2
  vp9_avg_8tap_smooth_16h_10bpp_neon:         1187.5      1155.5
  vp9_avg_8tap_smooth_16hv_10bpp_neon:        2663.1      2591.0
  vp9_avg_8tap_smooth_16v_10bpp_neon:         1107.4      1078.3
  vp9_avg_8tap_smooth_64h_10bpp_neon:        17754.6     17454.7
  vp9_avg_8tap_smooth_64hv_10bpp_neon:       33285.2     33001.5
  vp9_avg_8tap_smooth_64v_10bpp_neon:        16066.9     16048.6
  vp9_put4_10bpp_neon:                          25.5        21.7
  vp9_put8_10bpp_neon:                          56.0        52.0
  vp9_put16_10bpp_neon/armv8:                  183.0       163.1
  vp9_put32_10bpp_neon/armv8:                  678.6       563.1
  vp9_put64_10bpp_neon/armv8:                 2679.9      2195.8
  vp9_put_8tap_smooth_4h_10bpp_neon:           120.0       118.0
  vp9_put_8tap_smooth_4hv_10bpp_neon:          435.2       435.0
  vp9_put_8tap_smooth_4v_10bpp_neon:           107.0        98.2
  vp9_put_8tap_smooth_8h_10bpp_neon:           303.0       290.0
  vp9_put_8tap_smooth_8hv_10bpp_neon:          893.7       828.7
  vp9_put_8tap_smooth_8v_10bpp_neon:           305.5       263.5
  vp9_put_8tap_smooth_16h_10bpp_neon:         1089.1      1059.2
  vp9_put_8tap_smooth_16hv_10bpp_neon:        2578.8      2452.4
  vp9_put_8tap_smooth_16v_10bpp_neon:         1009.5       933.5
  vp9_put_8tap_smooth_64h_10bpp_neon:        16223.4     15918.6
  vp9_put_8tap_smooth_64hv_10bpp_neon:       32153.0     31016.2
  vp9_put_8tap_smooth_64v_10bpp_neon:        14516.5     13748.1

  These are generally about as fast as the corresponding ARM routines on the
  same CPU (at least on the A53), in most cases marginally faster.

  The speedup vs C code is around 4-9x.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9dsp: Restructure the bpp checks
  Martin Storsjö, 2017-01-24 (1 file, -15/+9)

  This work is sponsored by, and copyright, Google.

  This is more in line with how it will be extended for more bitdepths.

  Signed-off-by: Martin Storsjö <martin@martin.st>

* aarch64: vp9mc: Fix a comment to refer to a register with the right name
  Martin Storsjö, 2017-01-14 (1 file, -1/+1)

  This is cherrypicked from libav commit 85ad5ea72ce3983947a3b07e4b35c66cb16dfaba.

  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

* aarch64: vp9dsp: Fix vertical alignment in the init file
  Martin Storsjö, 2017-01-14 (1 file, -3/+3)

  This is cherrypicked from libav commit 65074791e8f8397600aacc9801efdd17777eb6e3.

  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

* aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32
  Martin Storsjö, 2017-01-14 (1 file, -5/+56)

  This work is sponsored by, and copyright, Google.

  Previously all subpartitions except the eob=1 (DC) case ran with the same
  runtime:
  vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

  By skipping individual 8x16 or 8x32 pixel slices in the first pass, we
  reduce the runtime of these functions like this:
  vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
  vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
  vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
  vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
  vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
  vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
  vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
  vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
  vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
  vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
  vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
  vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
  vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
  vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
  vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
  vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8

  I.e. in general a very minor overhead for the full subpartition case due
  to the additional cmps, but a significant speedup for the cases when we
  only need to process a small part of the actual input data.

  This is cherrypicked from libav commits cad42fadcd2c2ae1b3676bb398844a1f521a2d7b
  and a0c443a3980dc22eb02b067ac4cb9ffa2f9b04d2.

  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>