| Commit message | Author | Age | Files | Lines |
| |
The default is to limit them to what the hardware is capable of.
Add sysctl twiddles for both the non-RTS and RTS protected aggregate
generation.
Whilst here, add some comments about stuff that I've discovered during
my exploration of the TX aggregate / delimiter setup path from the
reference driver.
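For reference, hooking knobs like these up to sysctl generally looks like the sketch below; the softc fields and sysctl names (sc_aggr_limit, aggr_limit, rts_aggr_limit) are illustrative, not necessarily what the driver actually uses.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/bus.h>
#include <sys/sysctl.h>

/* Illustrative softc; the real fields live in if_athvar.h. */
struct foo_softc {
        device_t sc_dev;
        int sc_aggr_limit;      /* max aggregate bytes, no RTS */
        int sc_rts_aggr_limit;  /* max aggregate bytes, RTS protected */
};

static void
foo_sysctl_aggr_attach(struct foo_softc *sc)
{
        struct sysctl_ctx_list *ctx = device_get_sysctl_ctx(sc->sc_dev);
        struct sysctl_oid *tree = device_get_sysctl_tree(sc->sc_dev);

        SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
            "aggr_limit", CTLFLAG_RW, &sc->sc_aggr_limit, 0,
            "Maximum A-MPDU size (bytes) without RTS protection");
        SYSCTL_ADD_INT(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
            "rts_aggr_limit", CTLFLAG_RW, &sc->sc_rts_aggr_limit, 0,
            "Maximum A-MPDU size (bytes) with RTS protection");
}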
| |
adjustment code to now run.
Tested:
* AR5416, STA
TODO:
* Much more thorough testing on the other chips, AR5210 -> AR9287
| |
part of ts_status. Thus:
* make sure we decode them from ts_flags, rather than ts_status;
* make sure we decode them regardless of whether there's an error or not.
This correctly exposes descriptor configuration errors, TX delimiter
underruns and TX data underruns.
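For what it's worth, the decode in question boils down to something like the sketch below. It assumes the HAL_TX_* flag names and struct ath_tx_status (with its ts_flags field) from the HAL's descriptor headers are in scope; the counter names are made up.

/*
 * Bump TX diagnostics from ts_flags, regardless of whether ts_status
 * reported an overall error for the frame.
 */
struct foo_tx_stats {
        uint32_t st_desccfgerr;         /* descriptor configuration errors */
        uint32_t st_data_underrun;      /* TX data underruns */
        uint32_t st_delim_underrun;     /* TX delimiter underruns */
};

static void
foo_tx_flag_stats(struct foo_tx_stats *st, const struct ath_tx_status *ts)
{
        if (ts->ts_flags & HAL_TX_DESC_CFG_ERR)
                st->st_desccfgerr++;
        if (ts->ts_flags & HAL_TX_DATA_UNDERRUN)
                st->st_data_underrun++;
        if (ts->ts_flags & HAL_TX_DELIM_UNDERRUN)
                st->st_delim_underrun++;
}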
| |
My change had some rather significant behavioural effects on throughput.
The two issues I noticed:
* With if_start and the ifnet mbuf queue, any temporary latency
would get eaten up by some mbufs being queued. With ath_transmit()
queuing things to ath_buf's, I'd only get 512 TX buffers before I
couldn't queue any further frames.
* There's also some non-zero latency involved with TX being pushed
into a taskqueue via direct dispatch. Any time the scheduler didn't
immediately schedule the ath TX task would cause extra latency.
Various 1ge/10ge drivers implement both direct dispatch (if the TX
lock can be acquired) and deferred task transmission (if the TX lock
can't be acquired), with frames being pushed into a drbr queue.
I'll have to do this at some point, but until I figure out how to
deal with 802.11 fragments, I'll have to wait a while longer.
So what I saw:
* lots of extra latency, especially under load - if the taskqueue
wasn't immediately scheduled, things went pear shaped;
* any extra latency would result in TX ath_buf's taking their sweet time
being replenished, so any further calls to ath_transmit() would drop
mbufs.
* .. yes, there's no explicit backpressure here - things are just dropped.
Eek.
With this, the general performance has gone up, but those subtle if_start()
related race conditions are back. For some reason, this is doubly-obvious
with the AR5416 NIC and I don't quite understand why yet.
There's an unrelated issue with AR5416 performance in STA mode (it's
fine in AP mode when bridging frames, weirdly..) that requires a little
further investigation. Specifically - it works fine on a Lenovo T40
(single core CPU) running a March 2012 9-STABLE kernel, but a Lenovo T60
(dual core) running an early November 2012 kernel behaves very poorly.
The same hardware with an AR9160 or AR9280 behaves perfectly.
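For reference, the direct-dispatch-or-defer pattern mentioned above (as used by various 1ge/10ge drivers) is roughly the sketch below. It's generic - every foo_* name is invented - and it is not what ath(4) does as of this commit.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/taskqueue.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_softc {
        struct ifnet *sc_ifp;
        struct mtx sc_tx_mtx;
        struct buf_ring *sc_br;         /* drbr-managed TX ring */
        struct taskqueue *sc_tq;
        struct task sc_txtask;          /* runs the TX path when deferred */
};

/* Drain the ring onto the hardware; the TX lock must be held. */
static void
foo_tx_start_locked(struct foo_softc *sc)
{
        struct mbuf *m;

        mtx_assert(&sc->sc_tx_mtx, MA_OWNED);
        while ((m = drbr_dequeue(sc->sc_ifp, sc->sc_br)) != NULL) {
                /* ... encapsulate and hand the frame to the hardware ... */
                m_freem(m);             /* placeholder for the real TX */
        }
}

static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct foo_softc *sc = ifp->if_softc;
        int error;

        /* Always enqueue first; the ring soaks up short-term latency. */
        error = drbr_enqueue(ifp, sc->sc_br, m);
        if (error != 0)
                return (error);

        if (mtx_trylock(&sc->sc_tx_mtx)) {
                foo_tx_start_locked(sc);        /* direct dispatch */
                mtx_unlock(&sc->sc_tx_mtx);
        } else {
                /* TX lock contended; let the TX task pick it up. */
                taskqueue_enqueue(sc->sc_tq, &sc->sc_txtask);
        }
        return (0);
}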
| |
when they're being called from the TX completion handler.
Going (back) through the taskqueue is just adding extra locking and
latency to packet operations. This improves performance a little bit
on most NICs.
It still hasn't restored the original performance of the AR5416 NIC
but the AR9160, AR9280 and later NICs behave very well with this.
Tested:
* AR5416 STA (still tops out at ~ 70mbit TCP, rather than 150mbit TCP..)
* AR9160 hostap (good for both TX and RX)
* AR9280 hostap (good for both TX and RX)
| |
This now separates out the act of queuing frames from the act of running
TX and TX completion.
| |
Move it (for now) to the TX taskqueue.
| |
the separate ath0 TX taskq.
Whilst here, make sure that the TX software scheduler is also
running out of the TX task, rather than the ath0 taskqueue.
Make sure that the TX taskqueue is blocked/unblocked as necessary.
This allows for a little more parallelism on multi-core machines,
as well as (eventually) supporting a higher task priority for TX
tasks, allowing said TX task to preempt an already running RX or
TX completion task.
Tested:
* AR5416, AR9280 hostap and STA modes
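Blocking and unblocking a private TX taskqueue around a reset looks roughly like this sketch; the taskqueue(9) calls are real, the surrounding names are illustrative.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/taskqueue.h>

struct foo_softc {
        struct taskqueue *sc_tx_tq;     /* private TX taskqueue */
        struct task sc_txtask;
};

/*
 * Quiesce TX across a reset: taskqueue_block() keeps newly enqueued
 * tasks from running, taskqueue_drain() waits out an instance that is
 * already running.
 */
static void
foo_reset_quiesce_tx(struct foo_softc *sc)
{
        taskqueue_block(sc->sc_tx_tq);
        taskqueue_drain(sc->sc_tx_tq, &sc->sc_txtask);

        /* ... stop TX DMA, reset the chip, reprogram the TX queues ... */

        taskqueue_unblock(sc->sc_tx_tq);
        /* Re-kick TX in case frames were queued whilst blocked. */
        taskqueue_enqueue(sc->sc_tx_tq, &sc->sc_txtask);
}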
| |
chip hangs.
* Always do a reset in ath_bmiss_proc(), regardless of whether the
hardware is "hung" or not. Specifically, for spectral scan, there's
likely a whole bunch of potential hangs that we don't (yet) recognise
in the HAL. So to avoid an RX-deaf state persisting until the station
disassociates, just do a no-loss reset.
* Set sc_beacons=1 in STA mode. During a reset, the beacon programming
isn't done. (It's likely I need to set sc_syncbeacons during a hang
reset, but I digress.) Thus after a reset, there's no beacon timer
programming to send a BMISS interrupt if beacons aren't heard ..
thus if the AP disappears, you won't get notified and you'll have to
reset your interface.
This hasn't yet fixed all of the hangs that I've seen when debugging
spectral scan, but it's certainly reduced the hang frequency and it
should improve general STA stability in very noisy environments.
Tested:
* AR9280, STA mode, spectral scan off/on
PR: kern/175227
| |
if_start().
This stops the overlapping data path TX from occurring, which
solves quite a number of the potential TX queue races in ath(4).
It doesn't fix the net80211 layer TX queue races and it doesn't
fix the raw TX path yet, but it's an important step towards this.
This hasn't dropped the TX performance in my testing; primarily
because now the TX path can quickly queue frames and continue
along processing.
This involves a few rather deep changes:
* Use the ath_buf as a queue placeholder for now, as we need to be
able to support queuing a list of mbufs (ie, when transmitting
fragments) and m_nextpkt can't be used here (because it's what is
joining the fragments together)
* if_transmit() now simply allocates the ath_buf and queues it to
a driver TX staging queue.
* TX is now moved into a taskqueue function.
* The TX taskqueue function now dequeues and transmits frames.
* Fragments are handled correctly here - as the current API passes
the fragment list as one mbuf list (joined with m_nextpkt) through
to the driver if_transmit().
* For the couple of places where ath_start() may be called (mostly
from net80211 when starting the VAP up again), just reimplement
it using the new enqueue and taskqueue methods.
What I don't like (about this work and the TX code in general):
* I'm using the same lock for the staging TX queue management and the
actual TX. This isn't required; I'm just being slack.
* I haven't yet moved TX to a separate taskqueue (but the taskqueue is
created); it's easy enough to do this later if necessary. I just need
to make sure it's a higher priority queue, so TX has the same
behaviour as it used to (where it would preempt existing RX..)
* I need to re-review the TX path a little more and make sure that
ieee80211_node_*() functions aren't called within the TX lock.
When queueing, I should just push failed frames into a queue and
when I'm wrapping up the TX code, unlock the TX lock and
call ieee80211_free_node() on each.
* It would be nice if I could hold the TX lock for the entire
TX and TX completion, rather than this release/re-acquire behaviour.
But that requires that I shuffle around the TX completion code
to handle actual ath_buf free and net80211 callback/free outside
of the TX lock. That's one of my next projects.
* the ic_raw_xmit() path doesn't use this yet - so it still has
sequencing problems with parallel, overlapping calls to the
data path. I'll fix this later.
Tested:
* Hostap - AR9280, AR9220
* STA - AR5212, AR9280, AR5416
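Schematically the new path is the sketch below - grab a buffer, stage it, kick the task. All names are invented and a lot of detail (fragment accounting, error counters, buffer watermarks) is left out.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/queue.h>
#include <sys/taskqueue.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_buf {
        TAILQ_ENTRY(foo_buf) bf_list;
        struct mbuf *bf_m;              /* frame, or fragment list via m_nextpkt */
};

struct foo_softc {
        struct mtx sc_tx_mtx;           /* (currently) staging + TX lock */
        TAILQ_HEAD(, foo_buf) sc_txbuf;   /* free TX buffers */
        TAILQ_HEAD(, foo_buf) sc_txstage; /* staging queue for the TX task */
        struct taskqueue *sc_tq;
        struct task sc_txtask;          /* initialised at attach time */
};

static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct foo_softc *sc = ifp->if_softc;
        struct foo_buf *bf;

        mtx_lock(&sc->sc_tx_mtx);
        bf = TAILQ_FIRST(&sc->sc_txbuf);
        if (bf == NULL) {
                /* No explicit backpressure; the frame is simply dropped. */
                mtx_unlock(&sc->sc_tx_mtx);
                m_freem(m);
                return (ENOBUFS);
        }
        TAILQ_REMOVE(&sc->sc_txbuf, bf, bf_list);

        /*
         * The whole fragment list rides on one buffer: m_nextpkt is what
         * joins the fragments, so it can't double as a queue link.
         */
        bf->bf_m = m;
        TAILQ_INSERT_TAIL(&sc->sc_txstage, bf, bf_list);
        mtx_unlock(&sc->sc_tx_mtx);

        /* The TX task dequeues from sc_txstage and does the real work. */
        taskqueue_enqueue(sc->sc_tq, &sc->sc_txtask);
        return (0);
}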
| |
Submitted by: Christoph Mallon <christoph.mallon@gmx.de>
| |
I couldn't think of a way to maintain the hardware TXQ locks _and_ layer
on top of that per-TXQ software queuing and any other kind of fine-grained
locks (eg per-TID, or per-node locks.)
So for now, to facilitate some further code refactoring and development
as part of the final push to get software queue ps-poll and u-apsd handling
into this driver, just do away with them entirely.
I may eventually bring them back at some point, when doing so looks a
little more architecturally clean. But as it stands at present, it's
not really buying us much:
* in order to properly serialise things and not get bitten by scheduling
and locking interactions with things higher up in the stack, we need to
wrap the whole TX path in a long held lock. Otherwise we can end up
being pre-empted during frame handling, resulting in some out of order
frame handling between sequence number allocation and encryption handling
(ie, the seqno and the CCMP IV get out of sequence);
* .. so whilst that's the case, holding the lock for that long means that
we're acquiring and releasing the TXQ lock _inside_ that context;
* And we also acquire it per-frame during frame completion, but we currently
can't hold the lock for the duration of the TX completion as we need
to call net80211 layer things with the locks _unheld_ to avoid LOR.
* .. the other places where we grab that lock are reset/flush, which don't happen
often.
My eventual aim is to change the TX path so all rejected frame transmissions
and all frame completions result in any ieee80211_free_node() calls to occur
outside of the TX lock; then I can cut back on the amount of locking that
goes on here.
There may be some LORs that occur when ieee80211_free_node() is called when
the TX queue path fails; I'll begin to address these in follow-up commits.
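The "free the nodes outside the lock" idea amounts to something like the sketch below; the buffer layout is invented, ieee80211_free_node() is the real net80211 call.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/queue.h>

/* Prototype as in net80211/ieee80211_node.h; declared here to keep the sketch small. */
struct ieee80211_node;
void ieee80211_free_node(struct ieee80211_node *);

struct foo_buf {
        TAILQ_ENTRY(foo_buf) bf_list;
        struct mbuf *bf_m;
        struct ieee80211_node *bf_ni;
};
TAILQ_HEAD(foo_buf_head, foo_buf);

/*
 * Frames rejected inside the TX lock are collected on a caller-local
 * list; once the lock has been dropped, the node references are
 * released here, keeping net80211 calls (and their locks) out of the
 * TX lock.
 */
static void
foo_tx_flush_failed(struct mtx *tx_mtx, struct foo_buf_head *failed)
{
        struct foo_buf *bf;

        mtx_assert(tx_mtx, MA_NOTOWNED);

        while ((bf = TAILQ_FIRST(failed)) != NULL) {
                TAILQ_REMOVE(failed, bf, bf_list);
                if (bf->bf_ni != NULL) {
                        ieee80211_free_node(bf->bf_ni);
                        bf->bf_ni = NULL;
                }
                m_freem(bf->bf_m);
                bf->bf_m = NULL;
                /* ... return bf to the free list under its own lock ... */
        }
}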
| |
isn't NULL.
If the attach fails prematurely and there's no if_vnet context, calling
CURVNET_SET(ifp->if_vnet) is going to dereference a NULL pointer.
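A minimal sketch of the guard being described, using the real CURVNET macros from net/vnet.h (the actual teardown body is elided):

#include <sys/param.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/vnet.h>

/*
 * Only enter a vnet context on the error/detach path if the ifnet
 * actually has one; a half-attached device may not.
 */
static void
foo_detach_vnet_safe(struct ifnet *ifp)
{
        if (ifp != NULL && ifp->if_vnet != NULL) {
                CURVNET_SET(ifp->if_vnet);
                /* ... vnet-scoped teardown (bpfdetach, if_free, ...) ... */
                CURVNET_RESTORE();
        }
}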
| |
* upon setup, tell the alq code what the chip information is.
* add TX/RX path logging for legacy chips.
* populate the tx/rx descriptor length fields with a best-estimate.
It's overly big (96 bytes when AH_SUPPORT_AR5416 is enabled)
but it'll do for now.
Whilst I'm here, add CURVNET_RESTORE() during probe/attach as a
partial solution to fixing crashes during attach when the attach fails.
There are other attach failures that I have to deal with; those'll come
later.
| |
is defined.
This will unbreak ATH_DEBUG builds.
| |
ps-poll is totally broken in its current form.
This should unbreak things enough to let people use PS-POLL devices,
but leave it in place for me to finish PS-POLL handling.
| |
AR9300 HAL.
| |
I've tried serialising TX using queues and such but unfortunately
due to how this interacts with the locking going on elsewhere in the
networking stack, the TX task gets delayed, resulting in quite a
noticeable throughput loss:
* baseline TCP for 2x2 11n HT40 is ~ 170mbit/sec;
* TCP for TX task in the ath taskq, with the RX also going on - 80mbit/sec;
* TCP for TX task in a separate, second taskq - 100mbit/sec.
So for now I'm going with the Linux wireless stack approach - lock TX
early. The Linux code does this in the wireless stack, before the 802.11
state stuff happens and before it's punted to the driver.
But TX locking needs to also occur at the driver layer as the TX
completion code _also_ begins to drain the ifnet TX queue.
Whilst I'm here, add some KTR traces for the TX path.
Note:
* This really should be done at the net80211 layer (as well, at least.)
But that'll have to wait for a little more thought to happen.
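The KTR traces referred to are along these lines; this is a sketch, with an illustrative trace class and messages, but CTRn()/ktr(9) are the real interface (compiled in with 'options KTR', enabled via the debug.ktr.mask sysctl).

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/ktr.h>

static void
foo_tx_trace(void *bf, int qnum, int nframes)
{
        /* Cheap, lockless event record for later inspection with ktrdump. */
        CTR3(KTR_DEV, "foo TX: enqueue bf=%p qnum=%d nframes=%d",
            bf, qnum, nframes);
}

static void
foo_txdone_trace(void *bf, int status)
{
        CTR2(KTR_DEV, "foo TX: complete bf=%p status=%d", bf, status);
}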
| |
the power save queue.
* introduce some new ATH_NODE lock protected fields, tracking the
net80211 psq and TIM state;
* when doing buffer transitions - ie, when sending and completing
buffers - check the state of the SWQ and update the TIM appropriately.
* when clearing the TIM bit, if the SWQ is not empty then delay clearing
it.
This is racy, but it's no less racy than the current net80211 power
save queue management code. Specifically, with multiple TX threads,
it's quite plausible that parallel state updates will race and the
TIM will be left in an inconsistent state. I'll address that in
a follow-up commit.
| |
support with ath(4) and VIMAGE.
Right now the VIMAGE code doesn't supply a default vnet context during:
* hotplug attach;
* any device detach.
It special cases kldload/boot time probing (by setting the context to
vnet0) but that doesn't occur when probing devices during a bus rescan -
eg, adding a cardbus card.
These will eventually go away when the VIMAGE support extends to providing
default contexts to hotplug attach/detach.
| |
it run out of multiple concurrent contexts.
Right now the ath(4) TX processing is a bit hairy. Specifically:
* It was running out of ath_start(), which could occur from multiple
concurrent sending processes (as if_start() can be started from multiple
sending threads nowadays.. sigh)
* during RX if fast frames are enabled (so not really at the moment, not
until I fix this particular feature again..)
* during ath_reset() - so anything which calls that
* during ath_tx_proc*() in the ath taskqueue - ie, TX is attempted again
after TX completion, as there's now hopefully some ath_bufs available.
* Then, the ic_raw_xmit() method can queue raw frames for transmission
at any time, from any net80211 TX context. Ew.
This has caused packet ordering issues in the past - specifically,
there's absolutely no guarantee that preemption won't occur _during_
ath_start() by the TX completion processing, which will call ath_start()
again. It's a mess - 802.11 really, really wants things to be in
sequence or things go all kinds of loopy.
So:
* create a new task struct for TX'ing;
* make the if_start method simply queue the task on the ath taskqueue;
* make ath_start() just be called by the new TX task;
* make ath_tx_kick() just schedule the ath TX task, rather than directly
calling ath_start().
Now yes, this means that I've taken a step backwards in terms of
concurrency - TX -and- RX now occur in the same single-task taskqueue.
But there's nothing stopping me from separating out the TX / TX completion
code into a separate taskqueue which runs in parallel with the RX path,
if that ends up being appropriate for some platforms.
This fixes the CCMP/seqno concurrency issues that creep up when you
transmit large amounts of uni-directional UDP traffic (>200MBit) on a
FreeBSD STA -> AP, as now there's only one TX context no matter what's
going on (TX completion->retry/software queue,
userland->net80211->ath_start(), TX completion -> ath_start());
but it won't fix any concurrency issues between raw transmitted frames
and non-raw transmitted frames (eg EAPOL frames on TID 16 and any other
TID 16 multicast traffic that gets put on the CABQ.) That is going to
require a bunch more re-architecture before it's feasible to fix.
In any case, this is a big step towards making the majority of the TX
path locking irrelevant, as now almost all TX activity occurs in the
taskqueue.
Phew.
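In outline, the change is the sketch below; the taskqueue(9) and ifnet pieces are real, everything named foo_* is invented.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/taskqueue.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_softc {
        struct ifnet *sc_ifp;
        struct taskqueue *sc_tq;        /* the existing per-device taskqueue */
        struct task sc_txtask;
};

/* Stand-in for the existing ath_start()-style TX loop. */
static void
foo_start(struct ifnet *ifp)
{
        /* ... dequeue frames and program the hardware ... */
}

/* The TX task is now the only context that runs the TX path. */
static void
foo_tx_task(void *arg, int npending)
{
        struct foo_softc *sc = arg;

        foo_start(sc->sc_ifp);
}

/* if_start just schedules the task; it never touches the hardware. */
static void
foo_start_sched(struct ifnet *ifp)
{
        struct foo_softc *sc = ifp->if_softc;

        taskqueue_enqueue(sc->sc_tq, &sc->sc_txtask);
}

static void
foo_tx_attach(struct foo_softc *sc)
{
        TASK_INIT(&sc->sc_txtask, 0, foo_tx_task, sc);
        sc->sc_ifp->if_start = foo_start_sched;
}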
| |
net80211 node power save state.
* Add an ATH_NODE_UNLOCK_ASSERT() check
* Add a new node field - an_is_powersave
* Pause/unpause the queue based on the node state
* Attempt to handle net80211 concurrency issues so the queue
doesn't get paused/unpaused more than once at a time from
the net80211 power save code.
Whilst here (and breaking my usual rule), set CLRDMASK when a queue
is unpaused, regardless of whether the queue has some pending traffic.
This means the first frame from that TID (now or later) will have
CLRDMASK set.
Also whilst here, bump the swretrymax counters whenever the
filtered frames code expires a frame. Again, breaking my rule, but
this is just a statistics thing rather than a functional change.
This doesn't fix ps-poll (but it doesn't break it much worse than it
is at present), nor does it correct the TID updates.
That's next on the list.
Tested:
* AR9220 AP (Atheros AP96 reference design)
* Macbook Pro and LG Optimus 1 Android phone, both setting
and clearing power save state (but not using PS-POLL.)
| |
This should eventually be unified with ATH_DEBUG() so I can get both
from one macro; that may take some time.
Add some new probes for TX and TX completion.
| |
In fact, bus_dmamem_alloc() happily NULLs the dmat pointer passed in,
before replacing it with its own.
This fixes a MIPS crash when kldload'ing if_ath/if_ath_pci -
bus_dmamap_destroy() was passed in a NULL dmat pointer and was doing
all kinds of very bad things.
Reviewed by: scottl
| |
This will be used by the EDMA TX code to assign descriptor IDs in order
to provide some debugging.
| |
re-used by the upcoming EDMA TX completion code.
Make ath_stoptxdma() public, again so the EDMA TX code can use it.
Don't check for the TXQ bitmap in the ISR when doing EDMA work as it
doesn't apply for EDMA.
| |
necessary to "do" EDMA.
It was just using the TX completion status for logging information about
the descriptor completion. Since with EDMA we don't know this without
checking the TX completion FIFO, we can't provide this information.
So don't.
| |
Now that I understand what's going on with this, I've realised that
it's going to be quite difficult to implement a processq method in
the EDMA case. Because there's a separate TX status FIFO, I can't
just run processq() on each EDMA TXQ to see what's finished.
I have to actually run the TX status queue and handle individual
TXQs.
So:
* unmethodize ath_tx_processq();
* leave ath_tx_draintxq() as a method, as it only uses the completion status
for debugging rather than actively completing the frames (ie, all frames
here are failed);
* Methodize ath_draintxq().
The EDMA ath_draintxq() will have to take care of running the TX
completion FIFO before (potentially) freeing frames in the queue.
The only two places where ath_tx_draintxq() (on a single TXQ) is used are:
* ath_draintxq(); and
* the CABQ handling in the beacon setup code - it drains the CABQ before
populating the CABQ with frames for a new beacon (when doing multi-VAP
operation.)
So it's quite possible that once I methodize the CABQ and beacon handling,
I can just drop ath_tx_draintxq() in its entirety.
Finally, it's also quite possible that I can remove ath_tx_draintxq()
in the future and just "teach" it to not check the status when doing
EDMA.
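For context, "methodize" here just means a per-chip-generation function pointer hung off the softc and selected at attach time, along the lines of this sketch (all names invented):

struct foo_softc;

struct foo_tx_ops {
        void (*xmit_drain)(struct foo_softc *sc, int reset_type);
};

struct foo_softc {
        struct foo_tx_ops sc_tx_ops;
};

static void
foo_draintxq_legacy(struct foo_softc *sc, int reset_type)
{
        /* ... walk each hardware TXQ and complete/free what's queued ... */
}

static void
foo_draintxq_edma(struct foo_softc *sc, int reset_type)
{
        /* ... process the TX status FIFO first, then free what remains ... */
}

static void
foo_tx_ops_attach(struct foo_softc *sc, int is_edma)
{
        sc->sc_tx_ops.xmit_drain =
            is_edma ? foo_draintxq_edma : foo_draintxq_legacy;
}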
| |
EDMA HAL hardware.
* The EDMA HAL code assumes the nexttbtt and intval values are in TU/8
units, rather than TU. For now, just "hack" around that here, at least
until I code up something to translate it in the HAL.
* Setup some different TXQ flags for EDMA hardware.
* The EDMA HAL doesn't support setting the first rate series via
ath_hal_setuptxdesc() - instead, a call to ath_hal_set11nratescenario()
is always required. So for now, just do an 11n rate series setup
for EDMA beacon frames.
This allows my AR9380 to successfully transmit beacon frames.
However, CABQ TX and all normal data frame TX and TX completion is
still not functional and will require some more significant code churn
to make work.
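The TU vs TU/8 "hack" is just a scale-by-8 on the values before they're handed to the HAL; roughly the sketch below (helper name invented, and this reflects my reading of the commit text rather than the actual code):

#include <sys/types.h>

/*
 * 1 TU == 8 one-eighth-TU units, so scale nexttbtt/intval up by 8
 * (shift left by 3) before programming the EDMA HAL with them.
 */
static inline uint32_t
foo_tu_to_tu8(uint32_t tu)
{
        return (tu << 3);
}

/* Usage: nexttbtt = foo_tu_to_tu8(nexttbtt); intval = foo_tu_to_tu8(intval); */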
| |
enabled.
The legacy (pre-802.11n) hardware doesn't support this - although
the AR5212 era hardware supports MRR, it doesn't have all the bits
needed to support MRR + RTS/CTS. The AR5416 and later support
a packet duration and RTS/CTS flags per rate scenario, so we should
support it.
Tested:
* AR9280, STA
PR: kern/170302
| |
wrapping.
The previous code only handled the 4KiB wrap at descriptor "block"
boundaries rather than for individual descriptors. It sounds equivalent
but it isn't.
r238824 changed the descriptor allocation to enforce that an individual
descriptor doesn't wrap a 4KiB boundary rather than the whole block
of descriptors. Eg, for TX descriptors, they're allocated in blocks
of 10 descriptors for each ath_buf (for scatter/gather DMA.)
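The per-descriptor constraint reduces to a check like this when laying descriptors out within a DMA block (sketch; names invented):

#include <sys/param.h>
#include <machine/bus.h>

/*
 * A single descriptor must not straddle a 4096-byte boundary (the
 * Merlin 4KB WAR).  Checking this per descriptor catches descriptors
 * in the middle of a multi-descriptor block, which a per-block check
 * misses.
 */
static inline int
foo_desc_wraps_4k(bus_addr_t ds_paddr, size_t desclen)
{
        return (((ds_paddr & 0xfff) + desclen) > 4096);
}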
| |
buffers.
ath_descdma is now being used for things other than the classical
combination of ath_buf + ath_desc allocations. In this particular case,
don't try to free and blank out the ath_buf list if it's not passed in.
| |
of buffers, only the number of descriptors.
This involves:
* Change the allocation function to not use nbuf at all;
* When calling it, pass in "nbuf * ndesc" to correctly update how many
descriptors are being allocated.
Whilst here, fix the descriptor allocation code to correctly allocate
a larger buffer size if the Merlin 4KB WAR is required. It overallocates
descriptors when allocating a block that doesn't ever have a 4KB boundary
being crossed, but that can be fixed at a later stage.
| |
code.
The TX EDMA completion path is going to need descriptors allocated but
not any buffers. This code will form the basis for that.
| |
The AR9300 and later descriptors are 128 bytes; however, I'd like to make
sure that size isn't used for earlier chips.
* Populate the TX descriptor length field in the softc with
sizeof(ath_desc)
* Use this field when allocating the TX descriptors
* Pre-AR93xx TX/RX descriptors will use the ath_desc size; newer ones will
query the HAL for these sizes.
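In rough form (sketch; the field names are illustrative, and the caller is assumed to supply sizeof(struct ath_desc) for the legacy case and the HAL-reported size for EDMA):

#include <sys/types.h>

struct foo_softc {
        u_int sc_tx_desclen;
        u_int sc_rx_desclen;
};

/*
 * Pre-AR93xx parts use the fixed legacy descriptor size; EDMA parts
 * use whatever the HAL reports (e.g. 128 bytes on AR9300 and later).
 */
static void
foo_desclen_setup(struct foo_softc *sc, int is_edma,
    u_int legacy_desclen, u_int hal_desclen)
{
        if (!is_edma) {
                sc->sc_tx_desclen = legacy_desclen;
                sc->sc_rx_desclen = legacy_desclen;
        } else {
                sc->sc_tx_desclen = hal_desclen;
                sc->sc_rx_desclen = hal_desclen;
        }
}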
| |
* Introduce TX DMA setup/teardown methods, mirroring what's done in
the RX path.
Although the TX DMA descriptors are set up via ath_desc_alloc() /
ath_desc_free(), the TX status descriptor ring will be allocated
in this path.
* Remove some of the TX EDMA capability probing from the RX path and
push it into the new TX EDMA path.
| |
sized TX descriptor.
This is required for the AR93xx EDMA support, which requires 128 byte
TX descriptors (significantly larger than those of the earlier
hardware.)
| |
Noticed by: rui
| |
I missed this in my previous commit.
| |
I was setting up the RX EDMA buffer to be 4096 bytes rather than the
RX data buffer portion. The hardware was likely getting very confused
and DMAing descriptor portions into places it shouldn't, leading to
memory corruption and occasional panics.
Whilst here, don't bother allocating descriptors for the RX EDMA case.
We don't use those descriptors. Instead, just allocate ath_buf entries.
| |
Yes, this is in the legacy interrupt path. The NIC does support
MSI but I haven't yet sat down and written that code.