author      dim <dim@FreeBSD.org>    2017-04-02 17:24:58 +0000
committer   dim <dim@FreeBSD.org>    2017-04-02 17:24:58 +0000
commit      60b571e49a90d38697b3aca23020d9da42fc7d7f (patch)
tree        99351324c24d6cb146b6285b6caffa4d26fce188 /contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
parent      bea1b22c7a9bce1dfdd73e6e5b65bc4752215180 (diff)
download    FreeBSD-src-60b571e49a90d38697b3aca23020d9da42fc7d7f.zip
            FreeBSD-src-60b571e49a90d38697b3aca23020d9da42fc7d7f.tar.gz
Update clang, llvm, lld, lldb, compiler-rt and libc++ to 4.0.0 release:
MFC r309142 (by emaste):
Add WITH_LLD_AS_LD build knob
If set, it installs LLD as /usr/bin/ld. LLD (as of version 3.9) is not
capable of linking the world and kernel, but can self-host and link many
substantial applications. GNU ld continues to be used for the world and
kernel build, regardless of how this knob is set.
It is on by default for arm64, and off for all other CPU architectures.
Sponsored by: The FreeBSD Foundation
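As a usage sketch: knobs like this are set in src.conf(5), and the build
only tests whether the WITH_/WITHOUT_ variable is defined, so the "yes"
value below is purely conventional. On a non-arm64 machine you would opt
in with:

    # /etc/src.conf
    WITH_LLD_AS_LD=yes

/usr/bin/ld then becomes LLD after the next buildworld/installworld cycle.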
MFC r310840:
Reapply r310775; now it also builds correctly if lldb is disabled:
Move llvm-objdump from CLANG_EXTRAS to installed by default
We currently install three tools from binutils 2.17.50: as, ld, and
objdump. Work is underway to migrate to a permissively-licensed
tool-chain, with one goal being the retirement of binutils 2.17.50.
LLVM's llvm-objdump is intended to be compatible with GNU objdump
although it is currently missing some options and may have formatting
differences. Enable it by default for testing and further investigation.
It may later be changed to install as /usr/bin/objdump, once it becomes a
fully viable replacement.
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D8879
MFC r312855 (by emaste):
Rename LLD_AS_LD to LLD_IS_LD, for consistency with CLANG_IS_CC
Reported by: Dan McGregor <dan.mcgregor usask.ca>
MFC r313559 (by glebius):
Don't check struct rtentry on FreeBSD; it is an internal kernel structure.
On other systems it may be an API structure for SIOCADDRT/SIOCDELRT.
Reviewed by: emaste, dim
MFC r314152 (by jkim):
Remove an assembler flag, which is redundant since r309124. Upstream
took care of it by introducing the NO_EXEC_STACK_DIRECTIVE macro.
http://llvm.org/viewvc/llvm-project?rev=273500&view=rev
Reviewed by: dim
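For context, that macro lets each assembly source mark its own stack as
non-executable, which is what made the global assembler flag redundant. A
rough sketch of its definition in compiler-rt's shared assembly header
(paraphrased; see the upstream revision above for the exact conditions):

    #if defined(__ELF__)
    #define NO_EXEC_STACK_DIRECTIVE .section .note.GNU-stack,"",%progbits
    #else
    #define NO_EXEC_STACK_DIRECTIVE
    #endif

Each .S file then ends with a bare NO_EXEC_STACK_DIRECTIVE line, which
expands to nothing on targets without the GNU-stack note.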
MFC r314564:
Upgrade our copies of clang, llvm, lld, lldb, compiler-rt and libc++ to
4.0.0 (branches/release_40 296509). The release will follow soon.
Please note that from 3.5.0 onwards, clang, llvm and lldb require C++11
support to build; see UPDATING for more information.
Also note that as of 4.0.0, lld should be able to link the base system
on amd64 and aarch64. See the WITH_LLD_IS_LD setting in src.conf(5).
Please be aware, though, that this is still a work in progress.
Release notes for llvm, clang and lld will be available here:
<http://releases.llvm.org/4.0.0/docs/ReleaseNotes.html>
<http://releases.llvm.org/4.0.0/tools/clang/docs/ReleaseNotes.html>
<http://releases.llvm.org/4.0.0/tools/lld/docs/ReleaseNotes.html>
Thanks to Ed Maste, Jan Beich, Antoine Brodin and Eric Fiselier for
their help.
Relnotes: yes
Exp-run: antoine
PR: 215969, 216008
MFC r314708:
For now, revert r287232 from upstream llvm trunk (by Daniil Fukalov):
[SCEV] limit recursion depth of CompareSCEVComplexity
Summary:
CompareSCEVComplexity goes too deep (50+ levels on quite a big unrolled
loop) and runs for an almost unbounded time.
Added a cache of "equal" SCEV pairs to cut off further estimation
earlier. A recursion depth limit was also introduced as a parameter.
Reviewers: sanjoy
Subscribers: mzolotukhin, tstellarAMD, llvm-commits
Differential Revision: https://reviews.llvm.org/D26389
This commit is the cause of excessive compile times on skein_block.c
(and possibly other files) during kernel builds on amd64.
We never saw the problematic behavior described in this upstream commit,
so for now it is better to revert it. An upstream bug has been filed
here: https://bugs.llvm.org/show_bug.cgi?id=32142
Reported by: mjg
MFC r314795:
Reapply r287232 from upstream llvm trunk (by Daniil Fukalov):
[SCEV] limit recursion depth of CompareSCEVComplexity
Summary:
CompareSCEVComplexity goes too deep (50+ levels on quite a big unrolled
loop) and runs for an almost unbounded time.
Added a cache of "equal" SCEV pairs to cut off further estimation
earlier. A recursion depth limit was also introduced as a parameter.
Reviewers: sanjoy
Subscribers: mzolotukhin, tstellarAMD, llvm-commits
Differential Revision: https://reviews.llvm.org/D26389
Pull in r296992 from upstream llvm trunk (by Sanjoy Das):
[SCEV] Decrease the recursion threshold for CompareValueComplexity
Fixes PR32142.
r287232 accidentally increased the recursion threshold for
CompareValueComplexity from 2 to 32. This change reverses that
change by introducing a separate flag for CompareValueComplexity's
threshold.
The latter revision fixes the excessive compile times for skein_block.c.
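The underlying technique is generic: bound the recursion depth, and
memoize operand pairs already proven equal so later comparisons
short-circuit. A simplified, self-contained C++ sketch of that shape
(hypothetical Expr type and names, not the actual SCEV code; r296992
additionally gave the value-level comparator its own, much smaller
threshold):

    #include <set>
    #include <utility>

    // Stand-in for a SCEV-like expression node: a kind tag plus operands.
    struct Expr {
      int Kind;
      const Expr *LHS = nullptr;
      const Expr *RHS = nullptr;
    };

    using ExprPair = std::pair<const Expr *, const Expr *>;

    // Three-way comparison with a depth cutoff and a cache of pairs already
    // proven equal; past MaxDepth, operands are conservatively treated as
    // equal instead of recursing further.
    static int compareComplexity(std::set<ExprPair> &EqCache, const Expr *A,
                                 const Expr *B, unsigned Depth,
                                 unsigned MaxDepth = 32) {
      if (A == B || Depth > MaxDepth || EqCache.count({A, B}))
        return 0;
      if (A->Kind != B->Kind)
        return A->Kind < B->Kind ? -1 : 1;
      for (ExprPair Ops : {ExprPair{A->LHS, B->LHS}, ExprPair{A->RHS, B->RHS}}) {
        if (!Ops.first || !Ops.second) {
          if (Ops.first != Ops.second)   // operand count mismatch
            return Ops.first ? 1 : -1;
          continue;                      // both absent: nothing to compare
        }
        if (int C = compareComplexity(EqCache, Ops.first, Ops.second,
                                      Depth + 1, MaxDepth))
          return C;
      }
      EqCache.insert({A, B});            // remember the proven-equal pair
      return 0;
    }

Without the depth bound, comparing two deeply nested expressions can take
time exponential in their depth, which is exactly the skein_block.c blowup
described above.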
MFC r314907 (by mmel):
Unbreak ARMv6 world.
The new compiler_rt library imported with clang 4.0.0 has several fatal
issues (a non-functional __udivsi3, for example) with ARM-specific
intrinsic functions. As a temporary workaround, until upstream solves
these problems, disable all thumb[1][2] related features.
MFC r315016:
Update clang, llvm, lld, lldb, compiler-rt and libc++ to 4.0.0 release.
We were already very close to the last release candidate, so this is a
pretty minor update.
Relnotes: yes
MFC r316005:
Revert r314907, and pull in r298713 from upstream compiler-rt trunk (by
Weiming Zhao):
builtins: Select correct code fragments when compiling for Thumb1/Thumb2/ARM ISA.
Summary:
The value of __ARM_ARCH_ISA_THUMB isn't based on the actual compilation
mode (-mthumb, -marm); it reflects the capability of the given CPU.
Due to this:
- use __thumb__ and __thumb2__ instead of __ARM_ARCH_ISA_THUMB
- use the '.thumb' directive consistently in all affected files
- decorate all thumb functions using
DEFINE_COMPILERRT_THUMB_FUNCTION()
---------
Note: this patch doesn't fix the broken Thumb1 variant of __udivsi3!
Reviewers: weimingz, rengolin, compnerd
Subscribers: aemerson, dim
Differential Revision: https://reviews.llvm.org/D30938
Discussed with: mmel
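The practical difference: __ARM_ARCH_ISA_THUMB is defined from the CPU's
capability, so it stays set even when a file is built with -marm, whereas
__thumb__ and __thumb2__ track the mode the current file is actually
compiled in. An illustrative fragment of a preprocessed compiler-rt .S
file after the change (DEFINE_COMPILERRT_THUMB_FUNCTION is the real macro
named above; the routine name and surrounding structure are simplified):

    #if defined(__thumb2__) || defined(__thumb__)
            .thumb  /* assemble this routine as Thumb */
    DEFINE_COMPILERRT_THUMB_FUNCTION(__example_builtin)
    #else
            .arm    /* plain ARM encoding */
    DEFINE_COMPILERRT_FUNCTION(__example_builtin)
    #endif

Guarding on the compilation mode instead of __ARM_ARCH_ISA_THUMB keeps a
-marm build from accidentally pulling in Thumb-encoded bodies.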
Diffstat (limited to 'contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp')
-rw-r--r--   contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | 1233
1 file changed, 847 insertions(+), 386 deletions(-)
diff --git a/contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp index 347c33f..a1ed5e8 100644 --- a/contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp +++ b/contrib/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp @@ -24,52 +24,11 @@ using namespace llvm; -static unsigned getMaxWaveCountPerSIMD(const MachineFunction &MF) { - const SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>(); - const SISubtarget &ST = MF.getSubtarget<SISubtarget>(); - unsigned SIMDPerCU = 4; - - unsigned MaxInvocationsPerWave = SIMDPerCU * ST.getWavefrontSize(); - return alignTo(MFI.getMaximumWorkGroupSize(MF), MaxInvocationsPerWave) / - MaxInvocationsPerWave; -} - -static unsigned getMaxWorkGroupSGPRCount(const MachineFunction &MF) { - const SISubtarget &ST = MF.getSubtarget<SISubtarget>(); - unsigned MaxWaveCountPerSIMD = getMaxWaveCountPerSIMD(MF); - - unsigned TotalSGPRCountPerSIMD, AddressableSGPRCount, SGPRUsageAlignment; - unsigned ReservedSGPRCount; - - if (ST.getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) { - TotalSGPRCountPerSIMD = 800; - AddressableSGPRCount = 102; - SGPRUsageAlignment = 16; - ReservedSGPRCount = 6; // VCC, FLAT_SCRATCH, XNACK - } else { - TotalSGPRCountPerSIMD = 512; - AddressableSGPRCount = 104; - SGPRUsageAlignment = 8; - ReservedSGPRCount = 2; // VCC - } +static cl::opt<bool> EnableSpillSGPRToSMEM( + "amdgpu-spill-sgpr-to-smem", + cl::desc("Use scalar stores to spill SGPRs if supported by subtarget"), + cl::init(false)); - unsigned MaxSGPRCount = (TotalSGPRCountPerSIMD / MaxWaveCountPerSIMD); - MaxSGPRCount = alignDown(MaxSGPRCount, SGPRUsageAlignment); - - if (ST.hasSGPRInitBug()) - MaxSGPRCount = SISubtarget::FIXED_SGPR_COUNT_FOR_INIT_BUG; - - return std::min(MaxSGPRCount - ReservedSGPRCount, AddressableSGPRCount); -} - -static unsigned getMaxWorkGroupVGPRCount(const MachineFunction &MF) { - unsigned MaxWaveCountPerSIMD = getMaxWaveCountPerSIMD(MF); - unsigned TotalVGPRCountPerSIMD = 256; - unsigned VGPRUsageAlignment = 4; - - return alignDown(TotalVGPRCountPerSIMD / MaxWaveCountPerSIMD, - VGPRUsageAlignment); -} static bool hasPressureSet(const int *PSets, unsigned PSetID) { for (unsigned i = 0; PSets[i] != -1; ++i) { @@ -95,19 +54,38 @@ SIRegisterInfo::SIRegisterInfo() : AMDGPURegisterInfo(), VGPRPressureSets(getNumRegPressureSets()) { unsigned NumRegPressureSets = getNumRegPressureSets(); - SGPR32SetID = NumRegPressureSets; - VGPR32SetID = NumRegPressureSets; - for (unsigned i = 0; i < NumRegPressureSets; ++i) { - if (strncmp("SGPR_32", getRegPressureSetName(i), 7) == 0) - SGPR32SetID = i; - else if (strncmp("VGPR_32", getRegPressureSetName(i), 7) == 0) - VGPR32SetID = i; + SGPRSetID = NumRegPressureSets; + VGPRSetID = NumRegPressureSets; + for (unsigned i = 0; i < NumRegPressureSets; ++i) { classifyPressureSet(i, AMDGPU::SGPR0, SGPRPressureSets); classifyPressureSet(i, AMDGPU::VGPR0, VGPRPressureSets); } - assert(SGPR32SetID < NumRegPressureSets && - VGPR32SetID < NumRegPressureSets); + + // Determine the number of reg units for each pressure set. 
+ std::vector<unsigned> PressureSetRegUnits(NumRegPressureSets, 0); + for (unsigned i = 0, e = getNumRegUnits(); i != e; ++i) { + const int *PSets = getRegUnitPressureSets(i); + for (unsigned j = 0; PSets[j] != -1; ++j) { + ++PressureSetRegUnits[PSets[j]]; + } + } + + unsigned VGPRMax = 0, SGPRMax = 0; + for (unsigned i = 0; i < NumRegPressureSets; ++i) { + if (isVGPRPressureSet(i) && PressureSetRegUnits[i] > VGPRMax) { + VGPRSetID = i; + VGPRMax = PressureSetRegUnits[i]; + continue; + } + if (isSGPRPressureSet(i) && PressureSetRegUnits[i] > SGPRMax) { + SGPRSetID = i; + SGPRMax = PressureSetRegUnits[i]; + } + } + + assert(SGPRSetID < NumRegPressureSets && + VGPRSetID < NumRegPressureSets); } void SIRegisterInfo::reserveRegisterTuples(BitVector &Reserved, unsigned Reg) const { @@ -119,14 +97,14 @@ void SIRegisterInfo::reserveRegisterTuples(BitVector &Reserved, unsigned Reg) co unsigned SIRegisterInfo::reservedPrivateSegmentBufferReg( const MachineFunction &MF) const { - unsigned BaseIdx = alignDown(getMaxWorkGroupSGPRCount(MF), 4) - 4; + unsigned BaseIdx = alignDown(getMaxNumSGPRs(MF), 4) - 4; unsigned BaseReg(AMDGPU::SGPR_32RegClass.getRegister(BaseIdx)); return getMatchingSuperReg(BaseReg, AMDGPU::sub0, &AMDGPU::SReg_128RegClass); } unsigned SIRegisterInfo::reservedPrivateSegmentWaveByteOffsetReg( const MachineFunction &MF) const { - unsigned RegCount = getMaxWorkGroupSGPRCount(MF); + unsigned RegCount = getMaxNumSGPRs(MF); unsigned Reg; // Try to place it in a hole after PrivateSegmentbufferReg. @@ -161,18 +139,16 @@ BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const { reserveRegisterTuples(Reserved, AMDGPU::TTMP8_TTMP9); reserveRegisterTuples(Reserved, AMDGPU::TTMP10_TTMP11); - unsigned MaxWorkGroupSGPRCount = getMaxWorkGroupSGPRCount(MF); - unsigned MaxWorkGroupVGPRCount = getMaxWorkGroupVGPRCount(MF); - - unsigned NumSGPRs = AMDGPU::SGPR_32RegClass.getNumRegs(); - unsigned NumVGPRs = AMDGPU::VGPR_32RegClass.getNumRegs(); - for (unsigned i = MaxWorkGroupSGPRCount; i < NumSGPRs; ++i) { + unsigned MaxNumSGPRs = getMaxNumSGPRs(MF); + unsigned TotalNumSGPRs = AMDGPU::SGPR_32RegClass.getNumRegs(); + for (unsigned i = MaxNumSGPRs; i < TotalNumSGPRs; ++i) { unsigned Reg = AMDGPU::SGPR_32RegClass.getRegister(i); reserveRegisterTuples(Reserved, Reg); } - - for (unsigned i = MaxWorkGroupVGPRCount; i < NumVGPRs; ++i) { + unsigned MaxNumVGPRs = getMaxNumVGPRs(MF); + unsigned TotalNumVGPRs = AMDGPU::VGPR_32RegClass.getNumRegs(); + for (unsigned i = MaxNumVGPRs; i < TotalNumVGPRs; ++i) { unsigned Reg = AMDGPU::VGPR_32RegClass.getRegister(i); reserveRegisterTuples(Reserved, Reg); } @@ -194,49 +170,26 @@ BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const { assert(!isSubRegister(ScratchRSrcReg, ScratchWaveOffsetReg)); } - // Reserve registers for debugger usage if "amdgpu-debugger-reserve-trap-regs" - // attribute was specified. - const SISubtarget &ST = MF.getSubtarget<SISubtarget>(); - if (ST.debuggerReserveRegs()) { - unsigned ReservedVGPRFirst = - MaxWorkGroupVGPRCount - MFI->getDebuggerReservedVGPRCount(); - for (unsigned i = ReservedVGPRFirst; i < MaxWorkGroupVGPRCount; ++i) { - unsigned Reg = AMDGPU::VGPR_32RegClass.getRegister(i); - reserveRegisterTuples(Reserved, Reg); - } - } - return Reserved; } -unsigned SIRegisterInfo::getRegPressureSetLimit(const MachineFunction &MF, - unsigned Idx) const { - const SISubtarget &STI = MF.getSubtarget<SISubtarget>(); - // FIXME: We should adjust the max number of waves based on LDS size. 
- unsigned SGPRLimit = getNumSGPRsAllowed(STI, STI.getMaxWavesPerCU()); - unsigned VGPRLimit = getNumVGPRsAllowed(STI.getMaxWavesPerCU()); - - unsigned VSLimit = SGPRLimit + VGPRLimit; - - if (SGPRPressureSets.test(Idx) && VGPRPressureSets.test(Idx)) { - // FIXME: This is a hack. We should never be considering the pressure of - // these since no virtual register should ever have this class. - return VSLimit; - } - - if (SGPRPressureSets.test(Idx)) - return SGPRLimit; - - return VGPRLimit; -} - bool SIRegisterInfo::requiresRegisterScavenging(const MachineFunction &Fn) const { - return Fn.getFrameInfo()->hasStackObjects(); + return Fn.getFrameInfo().hasStackObjects(); } bool SIRegisterInfo::requiresFrameIndexScavenging(const MachineFunction &MF) const { - return MF.getFrameInfo()->hasStackObjects(); + return MF.getFrameInfo().hasStackObjects(); +} + +bool SIRegisterInfo::requiresFrameIndexReplacementScavenging( + const MachineFunction &MF) const { + // m0 is needed for the scalar store offset. m0 is unallocatable, so we can't + // create a virtual register for it during frame index elimination, so the + // scavenger is directly needed. + return MF.getFrameInfo().hasStackObjects() && + MF.getSubtarget<SISubtarget>().hasScalarStores() && + MF.getInfo<SIMachineFunctionInfo>()->hasSpilledSGPRs(); } bool SIRegisterInfo::requiresVirtualBaseRegisters( @@ -250,6 +203,14 @@ bool SIRegisterInfo::trackLivenessAfterRegAlloc(const MachineFunction &MF) const return true; } +int64_t SIRegisterInfo::getMUBUFInstrOffset(const MachineInstr *MI) const { + assert(SIInstrInfo::isMUBUF(*MI)); + + int OffIdx = AMDGPU::getNamedOperandIdx(MI->getOpcode(), + AMDGPU::OpName::offset); + return MI->getOperand(OffIdx).getImm(); +} + int64_t SIRegisterInfo::getFrameIndexInstrOffset(const MachineInstr *MI, int Idx) const { if (!SIInstrInfo::isMUBUF(*MI)) @@ -259,13 +220,16 @@ int64_t SIRegisterInfo::getFrameIndexInstrOffset(const MachineInstr *MI, AMDGPU::OpName::vaddr) && "Should never see frame index on non-address operand"); - int OffIdx = AMDGPU::getNamedOperandIdx(MI->getOpcode(), - AMDGPU::OpName::offset); - return MI->getOperand(OffIdx).getImm(); + return getMUBUFInstrOffset(MI); } bool SIRegisterInfo::needsFrameBaseReg(MachineInstr *MI, int64_t Offset) const { - return MI->mayLoadOrStore(); + if (!MI->mayLoadOrStore()) + return false; + + int64_t FullOffset = Offset + getMUBUFInstrOffset(MI); + + return !isUInt<12>(FullOffset); } void SIRegisterInfo::materializeFrameBaseRegister(MachineBasicBlock *MBB, @@ -290,14 +254,19 @@ void SIRegisterInfo::materializeFrameBaseRegister(MachineBasicBlock *MBB, MachineRegisterInfo &MRI = MF->getRegInfo(); unsigned UnusedCarry = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass); - unsigned OffsetReg = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass); + unsigned OffsetReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass); + + unsigned FIReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg) .addImm(Offset); + BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_MOV_B32_e32), FIReg) + .addFrameIndex(FrameIdx); + BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_ADD_I32_e64), BaseReg) .addReg(UnusedCarry, RegState::Define | RegState::Dead) .addReg(OffsetReg, RegState::Kill) - .addFrameIndex(FrameIdx); + .addReg(FIReg); } void SIRegisterInfo::resolveFrameIndex(MachineInstr &MI, unsigned BaseReg, @@ -328,40 +297,21 @@ void SIRegisterInfo::resolveFrameIndex(MachineInstr &MI, unsigned BaseReg, MachineOperand *OffsetOp = 
TII->getNamedOperand(MI, AMDGPU::OpName::offset); int64_t NewOffset = OffsetOp->getImm() + Offset; - if (isUInt<12>(NewOffset)) { - // If we have a legal offset, fold it directly into the instruction. - FIOp->ChangeToRegister(BaseReg, false); - OffsetOp->setImm(NewOffset); - return; - } - - // The offset is not legal, so we must insert an add of the offset. - MachineRegisterInfo &MRI = MF->getRegInfo(); - unsigned NewReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); - DebugLoc DL = MI.getDebugLoc(); - - assert(Offset != 0 && "Non-zero offset expected"); - - unsigned UnusedCarry = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass); - unsigned OffsetReg = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass); + assert(isUInt<12>(NewOffset) && "offset should be legal"); - // In the case the instruction already had an immediate offset, here only - // the requested new offset is added because we are leaving the original - // immediate in place. - BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg) - .addImm(Offset); - BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_I32_e64), NewReg) - .addReg(UnusedCarry, RegState::Define | RegState::Dead) - .addReg(OffsetReg, RegState::Kill) - .addReg(BaseReg); - - FIOp->ChangeToRegister(NewReg, false); + FIOp->ChangeToRegister(BaseReg, false); + OffsetOp->setImm(NewOffset); } bool SIRegisterInfo::isFrameOffsetLegal(const MachineInstr *MI, unsigned BaseReg, int64_t Offset) const { - return SIInstrInfo::isMUBUF(*MI) && isUInt<12>(Offset); + if (!SIInstrInfo::isMUBUF(*MI)) + return false; + + int64_t NewOffset = Offset + getMUBUFInstrOffset(MI); + + return isUInt<12>(NewOffset); } const TargetRegisterClass *SIRegisterInfo::getPointerRegClass( @@ -407,31 +357,107 @@ static unsigned getNumSubRegsForSpillOp(unsigned Op) { } } -void SIRegisterInfo::buildScratchLoadStore(MachineBasicBlock::iterator MI, - unsigned LoadStoreOp, - const MachineOperand *SrcDst, - unsigned ScratchRsrcReg, - unsigned ScratchOffset, - int64_t Offset, - RegScavenger *RS) const { +static int getOffsetMUBUFStore(unsigned Opc) { + switch (Opc) { + case AMDGPU::BUFFER_STORE_DWORD_OFFEN: + return AMDGPU::BUFFER_STORE_DWORD_OFFSET; + case AMDGPU::BUFFER_STORE_BYTE_OFFEN: + return AMDGPU::BUFFER_STORE_BYTE_OFFSET; + case AMDGPU::BUFFER_STORE_SHORT_OFFEN: + return AMDGPU::BUFFER_STORE_SHORT_OFFSET; + case AMDGPU::BUFFER_STORE_DWORDX2_OFFEN: + return AMDGPU::BUFFER_STORE_DWORDX2_OFFSET; + case AMDGPU::BUFFER_STORE_DWORDX4_OFFEN: + return AMDGPU::BUFFER_STORE_DWORDX4_OFFSET; + default: + return -1; + } +} + +static int getOffsetMUBUFLoad(unsigned Opc) { + switch (Opc) { + case AMDGPU::BUFFER_LOAD_DWORD_OFFEN: + return AMDGPU::BUFFER_LOAD_DWORD_OFFSET; + case AMDGPU::BUFFER_LOAD_UBYTE_OFFEN: + return AMDGPU::BUFFER_LOAD_UBYTE_OFFSET; + case AMDGPU::BUFFER_LOAD_SBYTE_OFFEN: + return AMDGPU::BUFFER_LOAD_SBYTE_OFFSET; + case AMDGPU::BUFFER_LOAD_USHORT_OFFEN: + return AMDGPU::BUFFER_LOAD_USHORT_OFFSET; + case AMDGPU::BUFFER_LOAD_SSHORT_OFFEN: + return AMDGPU::BUFFER_LOAD_SSHORT_OFFSET; + case AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN: + return AMDGPU::BUFFER_LOAD_DWORDX2_OFFSET; + case AMDGPU::BUFFER_LOAD_DWORDX4_OFFEN: + return AMDGPU::BUFFER_LOAD_DWORDX4_OFFSET; + default: + return -1; + } +} - unsigned Value = SrcDst->getReg(); - bool IsKill = SrcDst->isKill(); +// This differs from buildSpillLoadStore by only scavenging a VGPR. It does not +// need to handle the case where an SGPR may need to be spilled while spilling. 
+static bool buildMUBUFOffsetLoadStore(const SIInstrInfo *TII, + MachineFrameInfo &MFI, + MachineBasicBlock::iterator MI, + int Index, + int64_t Offset) { + MachineBasicBlock *MBB = MI->getParent(); + const DebugLoc &DL = MI->getDebugLoc(); + bool IsStore = MI->mayStore(); + + unsigned Opc = MI->getOpcode(); + int LoadStoreOp = IsStore ? + getOffsetMUBUFStore(Opc) : getOffsetMUBUFLoad(Opc); + if (LoadStoreOp == -1) + return false; + + unsigned Reg = TII->getNamedOperand(*MI, AMDGPU::OpName::vdata)->getReg(); + + BuildMI(*MBB, MI, DL, TII->get(LoadStoreOp)) + .addReg(Reg, getDefRegState(!IsStore)) + .addOperand(*TII->getNamedOperand(*MI, AMDGPU::OpName::srsrc)) + .addOperand(*TII->getNamedOperand(*MI, AMDGPU::OpName::soffset)) + .addImm(Offset) + .addImm(0) // glc + .addImm(0) // slc + .addImm(0) // tfe + .setMemRefs(MI->memoperands_begin(), MI->memoperands_end()); + return true; +} + +void SIRegisterInfo::buildSpillLoadStore(MachineBasicBlock::iterator MI, + unsigned LoadStoreOp, + int Index, + unsigned ValueReg, + bool IsKill, + unsigned ScratchRsrcReg, + unsigned ScratchOffsetReg, + int64_t InstOffset, + MachineMemOperand *MMO, + RegScavenger *RS) const { MachineBasicBlock *MBB = MI->getParent(); MachineFunction *MF = MI->getParent()->getParent(); const SISubtarget &ST = MF->getSubtarget<SISubtarget>(); const SIInstrInfo *TII = ST.getInstrInfo(); + const MachineFrameInfo &MFI = MF->getFrameInfo(); - DebugLoc DL = MI->getDebugLoc(); - bool IsStore = MI->mayStore(); + const MCInstrDesc &Desc = TII->get(LoadStoreOp); + const DebugLoc &DL = MI->getDebugLoc(); + bool IsStore = Desc.mayStore(); bool RanOutOfSGPRs = false; bool Scavenged = false; - unsigned SOffset = ScratchOffset; - unsigned OriginalImmOffset = Offset; + unsigned SOffset = ScratchOffsetReg; - unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode()); + const TargetRegisterClass *RC = getRegClassForReg(MF->getRegInfo(), ValueReg); + unsigned NumSubRegs = AMDGPU::getRegBitWidth(RC->getID()) / 32; unsigned Size = NumSubRegs * 4; + int64_t Offset = InstOffset + MFI.getObjectOffset(Index); + const int64_t OriginalImmOffset = Offset; + + unsigned Align = MFI.getObjectAlignment(Index); + const MachinePointerInfo &BasePtrInfo = MMO->getPointerInfo(); if (!isUInt<12>(Offset + Size)) { SOffset = AMDGPU::NoRegister; @@ -450,20 +476,23 @@ void SIRegisterInfo::buildScratchLoadStore(MachineBasicBlock::iterator MI, // subtract the offset after the spill to return ScratchOffset to it's // original value. RanOutOfSGPRs = true; - SOffset = ScratchOffset; + SOffset = ScratchOffsetReg; } else { Scavenged = true; } + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), SOffset) - .addReg(ScratchOffset) - .addImm(Offset); + .addReg(ScratchOffsetReg) + .addImm(Offset); + Offset = 0; } - for (unsigned i = 0, e = NumSubRegs; i != e; ++i, Offset += 4) { - unsigned SubReg = NumSubRegs > 1 ? - getPhysRegSubReg(Value, &AMDGPU::VGPR_32RegClass, i) : - Value; + const unsigned EltSize = 4; + + for (unsigned i = 0, e = NumSubRegs; i != e; ++i, Offset += EltSize) { + unsigned SubReg = NumSubRegs == 1 ? 
+ ValueReg : getSubReg(ValueReg, getSubRegFromChannel(i)); unsigned SOffsetRegState = 0; unsigned SrcDstRegState = getDefRegState(!IsStore); @@ -473,23 +502,324 @@ void SIRegisterInfo::buildScratchLoadStore(MachineBasicBlock::iterator MI, SrcDstRegState |= getKillRegState(IsKill); } - BuildMI(*MBB, MI, DL, TII->get(LoadStoreOp)) - .addReg(SubReg, getDefRegState(!IsStore)) + MachinePointerInfo PInfo = BasePtrInfo.getWithOffset(EltSize * i); + MachineMemOperand *NewMMO + = MF->getMachineMemOperand(PInfo, MMO->getFlags(), + EltSize, MinAlign(Align, EltSize * i)); + + auto MIB = BuildMI(*MBB, MI, DL, Desc) + .addReg(SubReg, getDefRegState(!IsStore) | getKillRegState(IsKill)) .addReg(ScratchRsrcReg) .addReg(SOffset, SOffsetRegState) .addImm(Offset) .addImm(0) // glc .addImm(0) // slc .addImm(0) // tfe - .addReg(Value, RegState::Implicit | SrcDstRegState) - .setMemRefs(MI->memoperands_begin(), MI->memoperands_end()); + .addMemOperand(NewMMO); + + if (NumSubRegs > 1) + MIB.addReg(ValueReg, RegState::Implicit | SrcDstRegState); } + if (RanOutOfSGPRs) { // Subtract the offset we added to the ScratchOffset register. - BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_SUB_U32), ScratchOffset) - .addReg(ScratchOffset) - .addImm(OriginalImmOffset); + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_SUB_U32), ScratchOffsetReg) + .addReg(ScratchOffsetReg) + .addImm(OriginalImmOffset); + } +} + +static std::pair<unsigned, unsigned> getSpillEltSize(unsigned SuperRegSize, + bool Store) { + if (SuperRegSize % 16 == 0) { + return { 16, Store ? AMDGPU::S_BUFFER_STORE_DWORDX4_SGPR : + AMDGPU::S_BUFFER_LOAD_DWORDX4_SGPR }; + } + + if (SuperRegSize % 8 == 0) { + return { 8, Store ? AMDGPU::S_BUFFER_STORE_DWORDX2_SGPR : + AMDGPU::S_BUFFER_LOAD_DWORDX2_SGPR }; } + + return { 4, Store ? AMDGPU::S_BUFFER_STORE_DWORD_SGPR : + AMDGPU::S_BUFFER_LOAD_DWORD_SGPR}; +} + +void SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI, + int Index, + RegScavenger *RS) const { + MachineBasicBlock *MBB = MI->getParent(); + MachineFunction *MF = MBB->getParent(); + MachineRegisterInfo &MRI = MF->getRegInfo(); + const SISubtarget &ST = MF->getSubtarget<SISubtarget>(); + const SIInstrInfo *TII = ST.getInstrInfo(); + + unsigned SuperReg = MI->getOperand(0).getReg(); + bool IsKill = MI->getOperand(0).isKill(); + const DebugLoc &DL = MI->getDebugLoc(); + + SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>(); + MachineFrameInfo &FrameInfo = MF->getFrameInfo(); + + bool SpillToSMEM = ST.hasScalarStores() && EnableSpillSGPRToSMEM; + + assert(SuperReg != AMDGPU::M0 && "m0 should never spill"); + + unsigned OffsetReg = AMDGPU::M0; + unsigned M0CopyReg = AMDGPU::NoRegister; + + if (SpillToSMEM) { + if (RS->isRegUsed(AMDGPU::M0)) { + M0CopyReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass); + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), M0CopyReg) + .addReg(AMDGPU::M0); + } + } + + unsigned ScalarStoreOp; + unsigned EltSize = 4; + const TargetRegisterClass *RC = getPhysRegClass(SuperReg); + if (SpillToSMEM && isSGPRClass(RC)) { + // XXX - if private_element_size is larger than 4 it might be useful to be + // able to spill wider vmem spills. + std::tie(EltSize, ScalarStoreOp) = getSpillEltSize(RC->getSize(), true); + } + + ArrayRef<int16_t> SplitParts = getRegSplitParts(RC, EltSize); + unsigned NumSubRegs = SplitParts.empty() ? 1 : SplitParts.size(); + + // SubReg carries the "Kill" flag when SubReg == SuperReg. 
+ unsigned SubKillState = getKillRegState((NumSubRegs == 1) && IsKill); + for (unsigned i = 0, e = NumSubRegs; i < e; ++i) { + unsigned SubReg = NumSubRegs == 1 ? + SuperReg : getSubReg(SuperReg, SplitParts[i]); + + if (SpillToSMEM) { + int64_t FrOffset = FrameInfo.getObjectOffset(Index); + + // The allocated memory size is really the wavefront size * the frame + // index size. The widest register class is 64 bytes, so a 4-byte scratch + // allocation is enough to spill this in a single stack object. + // + // FIXME: Frame size/offsets are computed earlier than this, so the extra + // space is still unnecessarily allocated. + + unsigned Align = FrameInfo.getObjectAlignment(Index); + MachinePointerInfo PtrInfo + = MachinePointerInfo::getFixedStack(*MF, Index, EltSize * i); + MachineMemOperand *MMO + = MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOStore, + EltSize, MinAlign(Align, EltSize * i)); + + // SMEM instructions only support a single offset, so increment the wave + // offset. + + int64_t Offset = (ST.getWavefrontSize() * FrOffset) + (EltSize * i); + if (Offset != 0) { + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), OffsetReg) + .addReg(MFI->getScratchWaveOffsetReg()) + .addImm(Offset); + } else { + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg) + .addReg(MFI->getScratchWaveOffsetReg()); + } + + BuildMI(*MBB, MI, DL, TII->get(ScalarStoreOp)) + .addReg(SubReg, getKillRegState(IsKill)) // sdata + .addReg(MFI->getScratchRSrcReg()) // sbase + .addReg(OffsetReg, RegState::Kill) // soff + .addImm(0) // glc + .addMemOperand(MMO); + + continue; + } + + struct SIMachineFunctionInfo::SpilledReg Spill = + MFI->getSpilledReg(MF, Index, i); + if (Spill.hasReg()) { + BuildMI(*MBB, MI, DL, + TII->getMCOpcodeFromPseudo(AMDGPU::V_WRITELANE_B32), + Spill.VGPR) + .addReg(SubReg, getKillRegState(IsKill)) + .addImm(Spill.Lane); + + // FIXME: Since this spills to another register instead of an actual + // frame index, we should delete the frame index when all references to + // it are fixed. + } else { + // Spill SGPR to a frame index. + // TODO: Should VI try to spill to VGPR and then spill to SMEM? + unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); + // TODO: Should VI try to spill to VGPR and then spill to SMEM? + + MachineInstrBuilder Mov + = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), TmpReg) + .addReg(SubReg, SubKillState); + + + // There could be undef components of a spilled super register. + // TODO: Can we detect this and skip the spill? + if (NumSubRegs > 1) { + // The last implicit use of the SuperReg carries the "Kill" flag. 
+ unsigned SuperKillState = 0; + if (i + 1 == e) + SuperKillState |= getKillRegState(IsKill); + Mov.addReg(SuperReg, RegState::Implicit | SuperKillState); + } + + unsigned Align = FrameInfo.getObjectAlignment(Index); + MachinePointerInfo PtrInfo + = MachinePointerInfo::getFixedStack(*MF, Index, EltSize * i); + MachineMemOperand *MMO + = MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOStore, + EltSize, MinAlign(Align, EltSize * i)); + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::SI_SPILL_V32_SAVE)) + .addReg(TmpReg, RegState::Kill) // src + .addFrameIndex(Index) // vaddr + .addReg(MFI->getScratchRSrcReg()) // srrsrc + .addReg(MFI->getScratchWaveOffsetReg()) // soffset + .addImm(i * 4) // offset + .addMemOperand(MMO); + } + } + + if (M0CopyReg != AMDGPU::NoRegister) { + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), AMDGPU::M0) + .addReg(M0CopyReg, RegState::Kill); + } + + MI->eraseFromParent(); + MFI->addToSpilledSGPRs(NumSubRegs); +} + +void SIRegisterInfo::restoreSGPR(MachineBasicBlock::iterator MI, + int Index, + RegScavenger *RS) const { + MachineFunction *MF = MI->getParent()->getParent(); + MachineRegisterInfo &MRI = MF->getRegInfo(); + MachineBasicBlock *MBB = MI->getParent(); + SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>(); + MachineFrameInfo &FrameInfo = MF->getFrameInfo(); + const SISubtarget &ST = MF->getSubtarget<SISubtarget>(); + const SIInstrInfo *TII = ST.getInstrInfo(); + const DebugLoc &DL = MI->getDebugLoc(); + + unsigned SuperReg = MI->getOperand(0).getReg(); + bool SpillToSMEM = ST.hasScalarStores() && EnableSpillSGPRToSMEM; + + assert(SuperReg != AMDGPU::M0 && "m0 should never spill"); + + unsigned OffsetReg = AMDGPU::M0; + unsigned M0CopyReg = AMDGPU::NoRegister; + + if (SpillToSMEM) { + if (RS->isRegUsed(AMDGPU::M0)) { + M0CopyReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass); + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), M0CopyReg) + .addReg(AMDGPU::M0); + } + } + + unsigned EltSize = 4; + unsigned ScalarLoadOp; + + const TargetRegisterClass *RC = getPhysRegClass(SuperReg); + if (SpillToSMEM && isSGPRClass(RC)) { + // XXX - if private_element_size is larger than 4 it might be useful to be + // able to spill wider vmem spills. + std::tie(EltSize, ScalarLoadOp) = getSpillEltSize(RC->getSize(), false); + } + + ArrayRef<int16_t> SplitParts = getRegSplitParts(RC, EltSize); + unsigned NumSubRegs = SplitParts.empty() ? 1 : SplitParts.size(); + + // SubReg carries the "Kill" flag when SubReg == SuperReg. + int64_t FrOffset = FrameInfo.getObjectOffset(Index); + + for (unsigned i = 0, e = NumSubRegs; i < e; ++i) { + unsigned SubReg = NumSubRegs == 1 ? + SuperReg : getSubReg(SuperReg, SplitParts[i]); + + if (SpillToSMEM) { + // FIXME: Size may be > 4 but extra bytes wasted. 
+ unsigned Align = FrameInfo.getObjectAlignment(Index); + MachinePointerInfo PtrInfo + = MachinePointerInfo::getFixedStack(*MF, Index, EltSize * i); + MachineMemOperand *MMO + = MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOLoad, + EltSize, MinAlign(Align, EltSize * i)); + + // Add i * 4 offset + int64_t Offset = (ST.getWavefrontSize() * FrOffset) + (EltSize * i); + if (Offset != 0) { + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), OffsetReg) + .addReg(MFI->getScratchWaveOffsetReg()) + .addImm(Offset); + } else { + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg) + .addReg(MFI->getScratchWaveOffsetReg()); + } + + auto MIB = + BuildMI(*MBB, MI, DL, TII->get(ScalarLoadOp), SubReg) + .addReg(MFI->getScratchRSrcReg()) // sbase + .addReg(OffsetReg, RegState::Kill) // soff + .addImm(0) // glc + .addMemOperand(MMO); + + if (NumSubRegs > 1) + MIB.addReg(SuperReg, RegState::ImplicitDefine); + + continue; + } + + SIMachineFunctionInfo::SpilledReg Spill + = MFI->getSpilledReg(MF, Index, i); + + if (Spill.hasReg()) { + auto MIB = + BuildMI(*MBB, MI, DL, TII->getMCOpcodeFromPseudo(AMDGPU::V_READLANE_B32), + SubReg) + .addReg(Spill.VGPR) + .addImm(Spill.Lane); + + if (NumSubRegs > 1) + MIB.addReg(SuperReg, RegState::ImplicitDefine); + } else { + // Restore SGPR from a stack slot. + // FIXME: We should use S_LOAD_DWORD here for VI. + unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); + unsigned Align = FrameInfo.getObjectAlignment(Index); + + MachinePointerInfo PtrInfo + = MachinePointerInfo::getFixedStack(*MF, Index, EltSize * i); + + MachineMemOperand *MMO = MF->getMachineMemOperand(PtrInfo, + MachineMemOperand::MOLoad, EltSize, + MinAlign(Align, EltSize * i)); + + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::SI_SPILL_V32_RESTORE), TmpReg) + .addFrameIndex(Index) // vaddr + .addReg(MFI->getScratchRSrcReg()) // srsrc + .addReg(MFI->getScratchWaveOffsetReg()) // soffset + .addImm(i * 4) // offset + .addMemOperand(MMO); + + auto MIB = + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32), SubReg) + .addReg(TmpReg, RegState::Kill); + + if (NumSubRegs > 1) + MIB.addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine); + } + } + + if (M0CopyReg != AMDGPU::NoRegister) { + BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), AMDGPU::M0) + .addReg(M0CopyReg, RegState::Kill); + } + + MI->eraseFromParent(); } void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI, @@ -499,7 +829,7 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI, MachineRegisterInfo &MRI = MF->getRegInfo(); MachineBasicBlock *MBB = MI->getParent(); SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>(); - MachineFrameInfo *FrameInfo = MF->getFrameInfo(); + MachineFrameInfo &FrameInfo = MF->getFrameInfo(); const SISubtarget &ST = MF->getSubtarget<SISubtarget>(); const SIInstrInfo *TII = ST.getInstrInfo(); DebugLoc DL = MI->getDebugLoc(); @@ -514,66 +844,7 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI, case AMDGPU::SI_SPILL_S128_SAVE: case AMDGPU::SI_SPILL_S64_SAVE: case AMDGPU::SI_SPILL_S32_SAVE: { - unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode()); - unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); - - unsigned SuperReg = MI->getOperand(0).getReg(); - bool IsKill = MI->getOperand(0).isKill(); - // SubReg carries the "Kill" flag when SubReg == SuperReg. 
- unsigned SubKillState = getKillRegState((NumSubRegs == 1) && IsKill); - for (unsigned i = 0, e = NumSubRegs; i < e; ++i) { - unsigned SubReg = getPhysRegSubReg(SuperReg, - &AMDGPU::SGPR_32RegClass, i); - - struct SIMachineFunctionInfo::SpilledReg Spill = - MFI->getSpilledReg(MF, Index, i); - - if (Spill.hasReg()) { - BuildMI(*MBB, MI, DL, - TII->getMCOpcodeFromPseudo(AMDGPU::V_WRITELANE_B32), - Spill.VGPR) - .addReg(SubReg, getKillRegState(IsKill)) - .addImm(Spill.Lane); - - // FIXME: Since this spills to another register instead of an actual - // frame index, we should delete the frame index when all references to - // it are fixed. - } else { - // Spill SGPR to a frame index. - // FIXME we should use S_STORE_DWORD here for VI. - MachineInstrBuilder Mov - = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), TmpReg) - .addReg(SubReg, SubKillState); - - - // There could be undef components of a spilled super register. - // TODO: Can we detect this and skip the spill? - if (NumSubRegs > 1) { - // The last implicit use of the SuperReg carries the "Kill" flag. - unsigned SuperKillState = 0; - if (i + 1 == e) - SuperKillState |= getKillRegState(IsKill); - Mov.addReg(SuperReg, RegState::Implicit | SuperKillState); - } - - unsigned Size = FrameInfo->getObjectSize(Index); - unsigned Align = FrameInfo->getObjectAlignment(Index); - MachinePointerInfo PtrInfo - = MachinePointerInfo::getFixedStack(*MF, Index); - MachineMemOperand *MMO - = MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOStore, - Size, Align); - BuildMI(*MBB, MI, DL, TII->get(AMDGPU::SI_SPILL_V32_SAVE)) - .addReg(TmpReg, RegState::Kill) // src - .addFrameIndex(Index) // frame_idx - .addReg(MFI->getScratchRSrcReg()) // scratch_rsrc - .addReg(MFI->getScratchWaveOffsetReg()) // scratch_offset - .addImm(i * 4) // offset - .addMemOperand(MMO); - } - } - MI->eraseFromParent(); - MFI->addToSpilledSGPRs(NumSubRegs); + spillSGPR(MI, Index, RS); break; } @@ -583,49 +854,7 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI, case AMDGPU::SI_SPILL_S128_RESTORE: case AMDGPU::SI_SPILL_S64_RESTORE: case AMDGPU::SI_SPILL_S32_RESTORE: { - unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode()); - unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); - - for (unsigned i = 0, e = NumSubRegs; i < e; ++i) { - unsigned SubReg = getPhysRegSubReg(MI->getOperand(0).getReg(), - &AMDGPU::SGPR_32RegClass, i); - struct SIMachineFunctionInfo::SpilledReg Spill = - MFI->getSpilledReg(MF, Index, i); - - if (Spill.hasReg()) { - BuildMI(*MBB, MI, DL, - TII->getMCOpcodeFromPseudo(AMDGPU::V_READLANE_B32), - SubReg) - .addReg(Spill.VGPR) - .addImm(Spill.Lane) - .addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine); - } else { - // Restore SGPR from a stack slot. - // FIXME: We should use S_LOAD_DWORD here for VI. 
- - unsigned Align = FrameInfo->getObjectAlignment(Index); - unsigned Size = FrameInfo->getObjectSize(Index); - - MachinePointerInfo PtrInfo - = MachinePointerInfo::getFixedStack(*MF, Index); - - MachineMemOperand *MMO = MF->getMachineMemOperand( - PtrInfo, MachineMemOperand::MOLoad, Size, Align); - - BuildMI(*MBB, MI, DL, TII->get(AMDGPU::SI_SPILL_V32_RESTORE), TmpReg) - .addFrameIndex(Index) // frame_idx - .addReg(MFI->getScratchRSrcReg()) // scratch_rsrc - .addReg(MFI->getScratchWaveOffsetReg()) // scratch_offset - .addImm(i * 4) // offset - .addMemOperand(MMO); - BuildMI(*MBB, MI, DL, - TII->get(AMDGPU::V_READFIRSTLANE_B32), SubReg) - .addReg(TmpReg, RegState::Kill) - .addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine); - } - } - - MI->eraseFromParent(); + restoreSGPR(MI, Index, RS); break; } @@ -635,34 +864,62 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI, case AMDGPU::SI_SPILL_V128_SAVE: case AMDGPU::SI_SPILL_V96_SAVE: case AMDGPU::SI_SPILL_V64_SAVE: - case AMDGPU::SI_SPILL_V32_SAVE: - buildScratchLoadStore(MI, AMDGPU::BUFFER_STORE_DWORD_OFFSET, - TII->getNamedOperand(*MI, AMDGPU::OpName::src), - TII->getNamedOperand(*MI, AMDGPU::OpName::scratch_rsrc)->getReg(), - TII->getNamedOperand(*MI, AMDGPU::OpName::scratch_offset)->getReg(), - FrameInfo->getObjectOffset(Index) + - TII->getNamedOperand(*MI, AMDGPU::OpName::offset)->getImm(), RS); - MI->eraseFromParent(); + case AMDGPU::SI_SPILL_V32_SAVE: { + const MachineOperand *VData = TII->getNamedOperand(*MI, + AMDGPU::OpName::vdata); + buildSpillLoadStore(MI, AMDGPU::BUFFER_STORE_DWORD_OFFSET, + Index, + VData->getReg(), VData->isKill(), + TII->getNamedOperand(*MI, AMDGPU::OpName::srsrc)->getReg(), + TII->getNamedOperand(*MI, AMDGPU::OpName::soffset)->getReg(), + TII->getNamedOperand(*MI, AMDGPU::OpName::offset)->getImm(), + *MI->memoperands_begin(), + RS); MFI->addToSpilledVGPRs(getNumSubRegsForSpillOp(MI->getOpcode())); + MI->eraseFromParent(); break; + } case AMDGPU::SI_SPILL_V32_RESTORE: case AMDGPU::SI_SPILL_V64_RESTORE: case AMDGPU::SI_SPILL_V96_RESTORE: case AMDGPU::SI_SPILL_V128_RESTORE: case AMDGPU::SI_SPILL_V256_RESTORE: case AMDGPU::SI_SPILL_V512_RESTORE: { - buildScratchLoadStore(MI, AMDGPU::BUFFER_LOAD_DWORD_OFFSET, - TII->getNamedOperand(*MI, AMDGPU::OpName::dst), - TII->getNamedOperand(*MI, AMDGPU::OpName::scratch_rsrc)->getReg(), - TII->getNamedOperand(*MI, AMDGPU::OpName::scratch_offset)->getReg(), - FrameInfo->getObjectOffset(Index) + - TII->getNamedOperand(*MI, AMDGPU::OpName::offset)->getImm(), RS); + const MachineOperand *VData = TII->getNamedOperand(*MI, + AMDGPU::OpName::vdata); + + buildSpillLoadStore(MI, AMDGPU::BUFFER_LOAD_DWORD_OFFSET, + Index, + VData->getReg(), VData->isKill(), + TII->getNamedOperand(*MI, AMDGPU::OpName::srsrc)->getReg(), + TII->getNamedOperand(*MI, AMDGPU::OpName::soffset)->getReg(), + TII->getNamedOperand(*MI, AMDGPU::OpName::offset)->getImm(), + *MI->memoperands_begin(), + RS); MI->eraseFromParent(); break; } default: { - int64_t Offset = FrameInfo->getObjectOffset(Index); + if (TII->isMUBUF(*MI)) { + // Disable offen so we don't need a 0 vgpr base. 
+ assert(static_cast<int>(FIOperandNum) == + AMDGPU::getNamedOperandIdx(MI->getOpcode(), + AMDGPU::OpName::vaddr)); + + int64_t Offset = FrameInfo.getObjectOffset(Index); + int64_t OldImm + = TII->getNamedOperand(*MI, AMDGPU::OpName::offset)->getImm(); + int64_t NewOffset = OldImm + Offset; + + if (isUInt<12>(NewOffset) && + buildMUBUFOffsetLoadStore(TII, FrameInfo, MI, Index, NewOffset)) { + MI->eraseFromParent(); + break; + } + } + + int64_t Offset = FrameInfo.getObjectOffset(Index); FIOp.ChangeToImmediate(Offset); if (!TII->isImmOperandLegal(*MI, FIOperandNum, FIOp)) { unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass); @@ -770,7 +1027,8 @@ const TargetRegisterClass *SIRegisterInfo::getSubRegClass( return RC; // We can assume that each lane corresponds to one 32-bit register. - unsigned Count = countPopulation(getSubRegIndexLaneMask(SubIdx)); + LaneBitmask::Type Mask = getSubRegIndexLaneMask(SubIdx).getAsInteger(); + unsigned Count = countPopulation(Mask); if (isSGPRClass(RC)) { switch (Count) { case 1: @@ -812,7 +1070,7 @@ bool SIRegisterInfo::shouldRewriteCopySrc( // We want to prefer the smallest register class possible, so we don't want to // stop and rewrite on anything that looks like a subregister // extract. Operations mostly don't care about the super register class, so we - // only want to stop on the most basic of copies between the smae register + // only want to stop on the most basic of copies between the same register // class. // // e.g. if we have something like @@ -828,80 +1086,6 @@ bool SIRegisterInfo::shouldRewriteCopySrc( return getCommonSubClass(DefRC, SrcRC) != nullptr; } -unsigned SIRegisterInfo::getPhysRegSubReg(unsigned Reg, - const TargetRegisterClass *SubRC, - unsigned Channel) const { - - switch (Reg) { - case AMDGPU::VCC: - switch(Channel) { - case 0: return AMDGPU::VCC_LO; - case 1: return AMDGPU::VCC_HI; - default: llvm_unreachable("Invalid SubIdx for VCC"); break; - } - - case AMDGPU::TBA: - switch(Channel) { - case 0: return AMDGPU::TBA_LO; - case 1: return AMDGPU::TBA_HI; - default: llvm_unreachable("Invalid SubIdx for TBA"); break; - } - - case AMDGPU::TMA: - switch(Channel) { - case 0: return AMDGPU::TMA_LO; - case 1: return AMDGPU::TMA_HI; - default: llvm_unreachable("Invalid SubIdx for TMA"); break; - } - - case AMDGPU::FLAT_SCR: - switch (Channel) { - case 0: - return AMDGPU::FLAT_SCR_LO; - case 1: - return AMDGPU::FLAT_SCR_HI; - default: - llvm_unreachable("Invalid SubIdx for FLAT_SCR"); - } - break; - - case AMDGPU::EXEC: - switch (Channel) { - case 0: - return AMDGPU::EXEC_LO; - case 1: - return AMDGPU::EXEC_HI; - default: - llvm_unreachable("Invalid SubIdx for EXEC"); - } - break; - } - - const TargetRegisterClass *RC = getPhysRegClass(Reg); - // 32-bit registers don't have sub-registers, so we can just return the - // Reg. We need to have this check here, because the calculation below - // using getHWRegIndex() will fail with special 32-bit registers like - // VCC_LO, VCC_HI, EXEC_LO, EXEC_HI and M0. 
- if (RC->getSize() == 4) { - assert(Channel == 0); - return Reg; - } - - unsigned Index = getHWRegIndex(Reg); - return SubRC->getRegister(Index + Channel); -} - -bool SIRegisterInfo::opCanUseLiteralConstant(unsigned OpType) const { - return OpType == AMDGPU::OPERAND_REG_IMM32; -} - -bool SIRegisterInfo::opCanUseInlineConstant(unsigned OpType) const { - if (opCanUseLiteralConstant(OpType)) - return true; - - return OpType == AMDGPU::OPERAND_REG_INLINE_C; -} - // FIXME: Most of these are flexible with HSA and we don't need to reserve them // as input registers if unused. Whether the dispatch ptr is necessary should be // easy to detect from used intrinsics. Scratch setup is harder to know. @@ -924,14 +1108,18 @@ unsigned SIRegisterInfo::getPreloadedValue(const MachineFunction &MF, case SIRegisterInfo::PRIVATE_SEGMENT_WAVE_BYTE_OFFSET: return MFI->PrivateSegmentWaveByteOffsetSystemSGPR; case SIRegisterInfo::PRIVATE_SEGMENT_BUFFER: - assert(ST.isAmdHsaOS() && "Non-HSA ABI currently uses relocations"); - assert(MFI->hasPrivateSegmentBuffer()); - return MFI->PrivateSegmentBufferUserSGPR; + if (ST.isAmdCodeObjectV2(MF)) { + assert(MFI->hasPrivateSegmentBuffer()); + return MFI->PrivateSegmentBufferUserSGPR; + } + assert(MFI->hasPrivateMemoryInputPtr()); + return MFI->PrivateMemoryPtrUserSGPR; case SIRegisterInfo::KERNARG_SEGMENT_PTR: assert(MFI->hasKernargSegmentPtr()); return MFI->KernargSegmentPtrUserSGPR; case SIRegisterInfo::DISPATCH_ID: - llvm_unreachable("unimplemented"); + assert(MFI->hasDispatchID()); + return MFI->DispatchIDUserSGPR; case SIRegisterInfo::FLAT_SCRATCH_INIT: assert(MFI->hasFlatScratchInit()); return MFI->FlatScratchInitUserSGPR; @@ -968,50 +1156,323 @@ SIRegisterInfo::findUnusedRegister(const MachineRegisterInfo &MRI, return AMDGPU::NoRegister; } -unsigned SIRegisterInfo::getNumVGPRsAllowed(unsigned WaveCount) const { - switch(WaveCount) { - case 10: return 24; - case 9: return 28; - case 8: return 32; - case 7: return 36; - case 6: return 40; - case 5: return 48; - case 4: return 64; - case 3: return 84; - case 2: return 128; - default: return 256; +unsigned SIRegisterInfo::getTotalNumSGPRs(const SISubtarget &ST) const { + if (ST.getGeneration() >= AMDGPUSubtarget::VOLCANIC_ISLANDS) + return 800; + return 512; +} + +unsigned SIRegisterInfo::getNumAddressableSGPRs(const SISubtarget &ST) const { + if (ST.getGeneration() >= AMDGPUSubtarget::VOLCANIC_ISLANDS) + return 102; + return 104; +} + +unsigned SIRegisterInfo::getNumReservedSGPRs(const SISubtarget &ST, + const SIMachineFunctionInfo &MFI) const { + if (MFI.hasFlatScratchInit()) { + if (ST.getGeneration() >= AMDGPUSubtarget::VOLCANIC_ISLANDS) + return 6; // FLAT_SCRATCH, XNACK, VCC (in that order) + + if (ST.getGeneration() == AMDGPUSubtarget::SEA_ISLANDS) + return 4; // FLAT_SCRATCH, VCC (in that order) } + + if (ST.isXNACKEnabled()) + return 4; // XNACK, VCC (in that order) + + return 2; // VCC. 
} -unsigned SIRegisterInfo::getNumSGPRsAllowed(const SISubtarget &ST, - unsigned WaveCount) const { - if (ST.getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) { - switch (WaveCount) { +unsigned SIRegisterInfo::getMinNumSGPRs(const SISubtarget &ST, + unsigned WavesPerEU) const { + if (ST.getGeneration() >= AMDGPUSubtarget::VOLCANIC_ISLANDS) { + switch (WavesPerEU) { + case 0: return 0; + case 10: return 0; + case 9: return 0; + case 8: return 81; + default: return 97; + } + } else { + switch (WavesPerEU) { + case 0: return 0; + case 10: return 0; + case 9: return 49; + case 8: return 57; + case 7: return 65; + case 6: return 73; + case 5: return 81; + default: return 97; + } + } +} + +unsigned SIRegisterInfo::getMaxNumSGPRs(const SISubtarget &ST, + unsigned WavesPerEU, + bool Addressable) const { + if (ST.getGeneration() >= AMDGPUSubtarget::VOLCANIC_ISLANDS) { + switch (WavesPerEU) { + case 0: return 80; case 10: return 80; case 9: return 80; case 8: return 96; - default: return 102; + default: return Addressable ? getNumAddressableSGPRs(ST) : 112; } } else { - switch(WaveCount) { + switch (WavesPerEU) { + case 0: return 48; case 10: return 48; case 9: return 56; case 8: return 64; case 7: return 72; case 6: return 80; case 5: return 96; - default: return 103; + default: return getNumAddressableSGPRs(ST); } } } -bool SIRegisterInfo::isVGPR(const MachineRegisterInfo &MRI, - unsigned Reg) const { - const TargetRegisterClass *RC; +unsigned SIRegisterInfo::getMaxNumSGPRs(const MachineFunction &MF) const { + const Function &F = *MF.getFunction(); + + const SISubtarget &ST = MF.getSubtarget<SISubtarget>(); + const SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>(); + + // Compute maximum number of SGPRs function can use using default/requested + // minimum number of waves per execution unit. + std::pair<unsigned, unsigned> WavesPerEU = MFI.getWavesPerEU(); + unsigned MaxNumSGPRs = getMaxNumSGPRs(ST, WavesPerEU.first, false); + unsigned MaxNumAddressableSGPRs = getMaxNumSGPRs(ST, WavesPerEU.first, true); + + // Check if maximum number of SGPRs was explicitly requested using + // "amdgpu-num-sgpr" attribute. + if (F.hasFnAttribute("amdgpu-num-sgpr")) { + unsigned Requested = AMDGPU::getIntegerAttribute( + F, "amdgpu-num-sgpr", MaxNumSGPRs); + + // Make sure requested value does not violate subtarget's specifications. + if (Requested && (Requested <= getNumReservedSGPRs(ST, MFI))) + Requested = 0; + + // If more SGPRs are required to support the input user/system SGPRs, + // increase to accommodate them. + // + // FIXME: This really ends up using the requested number of SGPRs + number + // of reserved special registers in total. Theoretically you could re-use + // the last input registers for these special registers, but this would + // require a lot of complexity to deal with the weird aliasing. + unsigned NumInputSGPRs = MFI.getNumPreloadedSGPRs(); + if (Requested && Requested < NumInputSGPRs) + Requested = NumInputSGPRs; + + // Make sure requested value is compatible with values implied by + // default/requested minimum/maximum number of waves per execution unit. 
+ if (Requested && Requested > getMaxNumSGPRs(ST, WavesPerEU.first, false)) + Requested = 0; + if (WavesPerEU.second && + Requested && Requested < getMinNumSGPRs(ST, WavesPerEU.second)) + Requested = 0; + + if (Requested) + MaxNumSGPRs = Requested; + } + + if (ST.hasSGPRInitBug()) + MaxNumSGPRs = SISubtarget::FIXED_SGPR_COUNT_FOR_INIT_BUG; + + return std::min(MaxNumSGPRs - getNumReservedSGPRs(ST, MFI), + MaxNumAddressableSGPRs); +} + +unsigned SIRegisterInfo::getNumDebuggerReservedVGPRs( + const SISubtarget &ST) const { + if (ST.debuggerReserveRegs()) + return 4; + return 0; +} + +unsigned SIRegisterInfo::getMinNumVGPRs(unsigned WavesPerEU) const { + switch (WavesPerEU) { + case 0: return 0; + case 10: return 0; + case 9: return 25; + case 8: return 29; + case 7: return 33; + case 6: return 37; + case 5: return 41; + case 4: return 49; + case 3: return 65; + case 2: return 85; + default: return 129; + } +} + +unsigned SIRegisterInfo::getMaxNumVGPRs(unsigned WavesPerEU) const { + switch (WavesPerEU) { + case 0: return 24; + case 10: return 24; + case 9: return 28; + case 8: return 32; + case 7: return 36; + case 6: return 40; + case 5: return 48; + case 4: return 64; + case 3: return 84; + case 2: return 128; + default: return getTotalNumVGPRs(); + } +} + +unsigned SIRegisterInfo::getMaxNumVGPRs(const MachineFunction &MF) const { + const Function &F = *MF.getFunction(); + + const SISubtarget &ST = MF.getSubtarget<SISubtarget>(); + const SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>(); + + // Compute maximum number of VGPRs function can use using default/requested + // minimum number of waves per execution unit. + std::pair<unsigned, unsigned> WavesPerEU = MFI.getWavesPerEU(); + unsigned MaxNumVGPRs = getMaxNumVGPRs(WavesPerEU.first); + + // Check if maximum number of VGPRs was explicitly requested using + // "amdgpu-num-vgpr" attribute. + if (F.hasFnAttribute("amdgpu-num-vgpr")) { + unsigned Requested = AMDGPU::getIntegerAttribute( + F, "amdgpu-num-vgpr", MaxNumVGPRs); + + // Make sure requested value does not violate subtarget's specifications. + if (Requested && Requested <= getNumDebuggerReservedVGPRs(ST)) + Requested = 0; + + // Make sure requested value is compatible with values implied by + // default/requested minimum/maximum number of waves per execution unit. 
+ if (Requested && Requested > getMaxNumVGPRs(WavesPerEU.first)) + Requested = 0; + if (WavesPerEU.second && + Requested && Requested < getMinNumVGPRs(WavesPerEU.second)) + Requested = 0; + + if (Requested) + MaxNumVGPRs = Requested; + } + + return MaxNumVGPRs - getNumDebuggerReservedVGPRs(ST); +} + +ArrayRef<int16_t> SIRegisterInfo::getRegSplitParts(const TargetRegisterClass *RC, + unsigned EltSize) const { + if (EltSize == 4) { + static const int16_t Sub0_15[] = { + AMDGPU::sub0, AMDGPU::sub1, AMDGPU::sub2, AMDGPU::sub3, + AMDGPU::sub4, AMDGPU::sub5, AMDGPU::sub6, AMDGPU::sub7, + AMDGPU::sub8, AMDGPU::sub9, AMDGPU::sub10, AMDGPU::sub11, + AMDGPU::sub12, AMDGPU::sub13, AMDGPU::sub14, AMDGPU::sub15, + }; + + static const int16_t Sub0_7[] = { + AMDGPU::sub0, AMDGPU::sub1, AMDGPU::sub2, AMDGPU::sub3, + AMDGPU::sub4, AMDGPU::sub5, AMDGPU::sub6, AMDGPU::sub7, + }; + + static const int16_t Sub0_3[] = { + AMDGPU::sub0, AMDGPU::sub1, AMDGPU::sub2, AMDGPU::sub3, + }; + + static const int16_t Sub0_2[] = { + AMDGPU::sub0, AMDGPU::sub1, AMDGPU::sub2, + }; + + static const int16_t Sub0_1[] = { + AMDGPU::sub0, AMDGPU::sub1, + }; + + switch (AMDGPU::getRegBitWidth(*RC->MC)) { + case 32: + return {}; + case 64: + return makeArrayRef(Sub0_1); + case 96: + return makeArrayRef(Sub0_2); + case 128: + return makeArrayRef(Sub0_3); + case 256: + return makeArrayRef(Sub0_7); + case 512: + return makeArrayRef(Sub0_15); + default: + llvm_unreachable("unhandled register size"); + } + } + + if (EltSize == 8) { + static const int16_t Sub0_15_64[] = { + AMDGPU::sub0_sub1, AMDGPU::sub2_sub3, + AMDGPU::sub4_sub5, AMDGPU::sub6_sub7, + AMDGPU::sub8_sub9, AMDGPU::sub10_sub11, + AMDGPU::sub12_sub13, AMDGPU::sub14_sub15 + }; + + static const int16_t Sub0_7_64[] = { + AMDGPU::sub0_sub1, AMDGPU::sub2_sub3, + AMDGPU::sub4_sub5, AMDGPU::sub6_sub7 + }; + + + static const int16_t Sub0_3_64[] = { + AMDGPU::sub0_sub1, AMDGPU::sub2_sub3 + }; + + switch (AMDGPU::getRegBitWidth(*RC->MC)) { + case 64: + return {}; + case 128: + return makeArrayRef(Sub0_3_64); + case 256: + return makeArrayRef(Sub0_7_64); + case 512: + return makeArrayRef(Sub0_15_64); + default: + llvm_unreachable("unhandled register size"); + } + } + + assert(EltSize == 16 && "unhandled register spill split size"); + + static const int16_t Sub0_15_128[] = { + AMDGPU::sub0_sub1_sub2_sub3, + AMDGPU::sub4_sub5_sub6_sub7, + AMDGPU::sub8_sub9_sub10_sub11, + AMDGPU::sub12_sub13_sub14_sub15 + }; + + static const int16_t Sub0_7_128[] = { + AMDGPU::sub0_sub1_sub2_sub3, + AMDGPU::sub4_sub5_sub6_sub7 + }; + + switch (AMDGPU::getRegBitWidth(*RC->MC)) { + case 128: + return {}; + case 256: + return makeArrayRef(Sub0_7_128); + case 512: + return makeArrayRef(Sub0_15_128); + default: + llvm_unreachable("unhandled register size"); + } +} + +const TargetRegisterClass* +SIRegisterInfo::getRegClassForReg(const MachineRegisterInfo &MRI, + unsigned Reg) const { if (TargetRegisterInfo::isVirtualRegister(Reg)) - RC = MRI.getRegClass(Reg); - else - RC = getPhysRegClass(Reg); + return MRI.getRegClass(Reg); - return hasVGPRs(RC); + return getPhysRegClass(Reg); +} + +bool SIRegisterInfo::isVGPR(const MachineRegisterInfo &MRI, + unsigned Reg) const { + return hasVGPRs(getRegClassForReg(MRI, Reg)); } |