[PATCH] usb: host: xhci: fix HALTED endpoint handling

Post by Felipe Balbi
If the HW reports the endpoint as Halted, there's
no need to start any URB for that endpoint, we can
simply return -EPIPE and this will tell upper layers
that the endpoint is, indeed, halted.
This patch fixes usbtest's set/clear halt test #13
on xHCI-based hosts.

It looks like that test assumes that the xHCI driver should not allow
URBs to be enqueued to the endpoint when the driver needs to call
usb_clear_halt(). Or is it expecting that the host will attempt the
transfer, and the device will stall the transfer?

Post by Felipe Balbi
---
Hi Sarah,
seems like this has been broken for quite a long
time.
I tested the patch on an AM437x which has XHCI Host and
the same IP configured as device as well (two dwc3 instances,
basically).
It has been running for the past hour or so without any failures
when running testusb -t 13 -c 5000 -s 2048 -a in a loop, but
if you think there's a better way to fix this, let me know. I just
didn't understand why xhci-hcd still queues the URB even though
HW already told us the endpoint is halted. Any comments ?

The reason the driver still queues URBs is because the xHC halts the
endpoint for events that are outside of stalls. For instance, it will
halt the endpoint on a babble error, a transfer error, or a split
transaction error. The upper layers don't call usb_clear_halt() for
those cases, so the xHCI driver handles the halted endpoint internally.
Look for calls to xhci_requires_manual_halt_cleanup().

The xHC also halts the endpoint on a control transfer stall. The upper
layers don't ever call usb_clear_halt on the control endpoint, because
the bus spec says the next control transfer will clear the halted
condition. Therefore the xHCI driver has to handle the halt manually.

The way the driver does that is to set a flag for the endpoint
(EP_HALTED), issue a Reset Endpoint command, and then a Set TR Dequeue
command to move the dequeue pointer past the transfer that failed. The
driver still accepts URBs submitted to the endpoint (because from the
upper layer's perspective, we should in the cases where we manually
handle the halt), but it does not ring the doorbell for that endpoint.
Once the command to set the dequeue pointer completes, the EP_HALTED
flag is cleared, the doorbell is rung for the endpoint, and the queued
URBs are processed.

The end result is that the xHCI driver *needs* to allow URBs to be
queued when the endpoint is halted and it needs to manual cleanup the
ring. Otherwise, the driver might queue an URB while the two commands
are still being processed, the driver will reject the URBs, and the
behavior in these corner cases will be different than other host
controllers.

So, no, I don't think this is quite the right solution.

You could try to differentiate between an endpoint halt that requires a
manual cleanup, and ones that the upper layer should handle, perhaps by
adding a new EP_HALTED_MANUAL flag. The driver could accept URBs for
the manual cleanup case, and reject URBs with -EPIPE for the ones the
upper layers should handle.

Code that refuses to ring the endpoint doorbell would have to look for
both EP_HALTED and EP_HALTED_MANUAL. There might be other implications
as well; grep for EP_HALTED and see what code checks it.

Sarah Sharp

Post by Felipe Balbi
drivers/usb/host/xhci-ring.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 53c2e29..e9df61a 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2959,7 +2959,8 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
/* XXX not sure if this should be -ENOENT or not */
return -EINVAL;
- xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
+ xhci_dbg(xhci, "WARN halted endpoint.\n");
+ return -EPIPE;
break;
--
1.8.4.GIT

Felipe Balbi

2014-01-21 17:55:54 UTC

I suppose it doesn't make a difference, considering it
wait_for_completion_interruptible(). As long as the transfer actually
completes within 10 seconds, it doesn't matter.

What I have seen, though, is the transfer never completes and the 10
second timeout always kicks in when trying two consecutive
usb_submit_urb().

Here's, in a nut-shell, what happens:

-> SetFeature(ENDPOINT_HALT)
-> GetStatus()
-> usb_submit_urb()
-> this one completes with -EPIPE as it should
-> usb_submit_urb()
-> this one always times out :-(

alright.

Post by Sarah Sharp
The xHC also halts the endpoint on a control transfer stall. The upper
layers don't ever call usb_clear_halt on the control endpoint, because
the bus spec says the next control transfer will clear the halted
condition. Therefore the xHCI driver has to handle the halt manually.
The way the driver does that is to set a flag for the endpoint
(EP_HALTED), issue a Reset Endpoint command, and then a Set TR Dequeue
command to move the dequeue pointer past the transfer that failed. The
driver still accepts URBs submitted to the endpoint (because from the
upper layer's perspective, we should in the cases where we manually
handle the halt), but it does not ring the doorbell for that endpoint.
Once the command to set the dequeue pointer completes, the EP_HALTED
flag is cleared, the doorbell is rung for the endpoint, and the queued
URBs are processed.
The end result is that the xHCI driver *needs* to allow URBs to be
queued when the endpoint is halted and it needs to manual cleanup the
ring. Otherwise, the driver might queue an URB while the two commands
are still being processed, the driver will reject the URBs, and the
behavior in these corner cases will be different than other host
controllers.
So, no, I don't think this is quite the right solution.
You could try to differentiate between an endpoint halt that requires a
manual cleanup, and ones that the upper layer should handle, perhaps by
adding a new EP_HALTED_MANUAL flag. The driver could accept URBs for
the manual cleanup case, and reject URBs with -EPIPE for the ones the
upper layers should handle.
Code that refuses to ring the endpoint doorbell would have to look for
both EP_HALTED and EP_HALTED_MANUAL. There might be other implications
as well; grep for EP_HALTED and see what code checks it.

will do.

So what's supposed to actually happen with that test ? I'd expect the
URB to actually be started (meaning we should ring the ep doorbell ?)
and the device would reply with a Stall which, in turn, would return
-EPIPE to upper layers.

Currently, that URB seems like it's never started to begin with, so we
never see a Stall from the device side and the test fails.

Surprisingly, this only happens on the second usb_submit_urb(). From
what I could gather yesterday, EP_STATE_HALTED isn't yet set when the
first usb_submit_urb() is called. Likely that because of that, the
doorbell is rung and the device has a chance to Stall the transfer.

Or am I missing something ?

--
balbi

Alan Stern

2014-01-21 18:06:19 UTC

It expects that the computer will attempt the transfer and the device
will reply with a STALL.

Post by Sarah Sharp
The reason the driver still queues URBs is because the xHC halts the
endpoint for events that are outside of stalls. For instance, it will
halt the endpoint on a babble error, a transfer error, or a split
transaction error. The upper layers don't call usb_clear_halt() for
those cases, so the xHCI driver handles the halted endpoint internally.
Look for calls to xhci_requires_manual_halt_cleanup().
The xHC also halts the endpoint on a control transfer stall. The upper
layers don't ever call usb_clear_halt on the control endpoint, because
the bus spec says the next control transfer will clear the halted
condition. Therefore the xHCI driver has to handle the halt manually.

It sounds like there's a little confusion here. To say that the
endpoint is halted is somewhat ambiguous; it could refer to a halt on
the host side or a halt on the device side.

usb_clear_halt() is meant to clear a halt on the device side. It also
tells the HCD that it has done so, in case the HCD is keeping track,
but it is not meant for clearing a halt on the host side.
usb_reset_endpoint() does that (among other things).

Post by Sarah Sharp
You could try to differentiate between an endpoint halt that requires a
manual cleanup, and ones that the upper layer should handle, perhaps by
adding a new EP_HALTED_MANUAL flag. The driver could accept URBs for
the manual cleanup case, and reject URBs with -EPIPE for the ones the
upper layers should handle.

That should not be necessary. The HCD should always accept URB
submissions. Nothing will get sent to the device until the HCD clears
any host-side halts, but the upper layers don't worry about that
because the HCD takes care of it automatically.

Contrariwise, the upper layers are responsible for clearing device-side
halts. The HCD should ignore that end of things as much as it can.

Alan Stern

Felipe Balbi

2014-01-21 18:11:39 UTC

Hi,

That should not be necessary. The HCD should always accept URB
submissions. Nothing will get sent to the device until the HCD clears
any host-side halts, but the upper layers don't worry about that
because the HCD takes care of it automatically.
Contrariwise, the upper layers are responsible for clearing device-side
halts. The HCD should ignore that end of things as much as it can.

right, but there's still the bug that usb_submit_urb() times out with
xhci and it doesn't with ehci. So there _is_ a bug in xhci.

I'll try to debug more and figure out what's really going on, but one
thing is for sure. usbtest issued a SetFeature(ENDPOINT_HALT) as part of
the test and it tries to usb_submit_urb() as means to verify that the
device is really responding with Stalls.

cheers

--
balbi

Alan Stern

2014-01-21 19:37:45 UTC

Post by Felipe Balbi
Hi,

That should not be necessary. The HCD should always accept URB
submissions. Nothing will get sent to the device until the HCD clears
any host-side halts, but the upper layers don't worry about that
because the HCD takes care of it automatically.
Contrariwise, the upper layers are responsible for clearing device-side
halts. The HCD should ignore that end of things as much as it can.

right, but there's still the bug that usb_submit_urb() times out with
xhci and it doesn't with ehci. So there _is_ a bug in xhci.

Oh, definitely. I was just trying to help make sure that the bug
wasn't "fixed" incorrectly.

Alan Stern

Sarah Sharp

2014-01-21 22:00:13 UTC

Post by Felipe Balbi
Hi,

That should not be necessary. The HCD should always accept URB
submissions. Nothing will get sent to the device until the HCD clears
any host-side halts, but the upper layers don't worry about that
because the HCD takes care of it automatically.
Contrariwise, the upper layers are responsible for clearing device-side
halts. The HCD should ignore that end of things as much as it can.

right, but there's still the bug that usb_submit_urb() times out with
xhci and it doesn't with ehci. So there _is_ a bug in xhci.
I'll try to debug more and figure out what's really going on, but one
thing is for sure. usbtest issued a SetFeature(ENDPOINT_HALT) as part of
the test and it tries to usb_submit_urb() as means to verify that the
device is really responding with Stalls.

...

Post by Felipe Balbi
I suppose it doesn't make a difference, considering it
wait_for_completion_interruptible(). As long as the transfer actually
completes within 10 seconds, it doesn't matter.
What I have seen, though, is the transfer never completes and the 10
second timeout always kicks in when trying two consecutive
usb_submit_urb().
-> SetFeature(ENDPOINT_HALT)
-> GetStatus()
-> usb_submit_urb()
-> this one completes with -EPIPE as it should
-> usb_submit_urb()
-> this one always times out :-(

Ah, I think I see what's happening.

The xHCI driver is not designed to execute transfers after a stall on a
non-control endpoint until xhci_endpoint_reset is called and the driver
fixes up the endpoint ring. That will happen in usb_clear_halt().

I expect that if an URB is submitted to a stalled non-control endpoint
before usb_clear_halt is called, the xHCI driver should see EP_HALTED is
set and refuse to ring the doorbell in that case. Once usb_clear_halt
calls endpoint_reset, the xHCI driver will clean up the endpoint ring.

The way the code was designed doesn't account for the device driver
telling the endpoint to halt via a control transfer. The xHC doesn't
snoop that particular control transfer, so it allows the first URB to be
submitted. When that transfer stalls, the driver sets EP_HALTED. Then
the second URB hangs as I expect, because the xHCI driver is waiting for
usb_clear_halt() to be called.

If it really matters that a driver be able to get an immediate -EPIPE
response back from a device when the endpoint is halted, I can rip the
code out of xhci_endpoint_reset() and make the driver manually clean up
the ring at the time of the stall, like it does for control endpoints
and babble/transfer/split TX errors. If it doesn't matter, test #13
should be changed to only submit one URB and get back an -EPIPE
response.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Alan Stern

2014-01-21 22:19:13 UTC

Post by Sarah Sharp
Ah, I think I see what's happening.
The xHCI driver is not designed to execute transfers after a stall on a
non-control endpoint until xhci_endpoint_reset is called and the driver
fixes up the endpoint ring. That will happen in usb_clear_halt().
I expect that if an URB is submitted to a stalled non-control endpoint
before usb_clear_halt is called, the xHCI driver should see EP_HALTED is
set and refuse to ring the doorbell in that case. Once usb_clear_halt
calls endpoint_reset, the xHCI driver will clean up the endpoint ring.
The way the code was designed doesn't account for the device driver
telling the endpoint to halt via a control transfer. The xHC doesn't
snoop that particular control transfer, so it allows the first URB to be
submitted. When that transfer stalls, the driver sets EP_HALTED. Then
the second URB hangs as I expect, because the xHCI driver is waiting for
usb_clear_halt() to be called.
If it really matters that a driver be able to get an immediate -EPIPE
response back from a device when the endpoint is halted, I can rip the
code out of xhci_endpoint_reset() and make the driver manually clean up
the ring at the time of the stall, like it does for control endpoints
and babble/transfer/split TX errors. If it doesn't matter, test #13
should be changed to only submit one URB and get back an -EPIPE
response.

In theory, errors caused by the device sending a STALL should be
handled just like other sorts of errors. After the URB which got the
STALL has been given back, the HCD should automatically restart the
endpoint ring. It should not wait for usb_clear_halt.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Sarah Sharp

2014-01-23 00:13:51 UTC

Mathias, can you look into fixing this? I know you're working on the
code to use a configure endpoint command when we can't issue a reset
endpoint command, and this change is definitely related. It may even
simplify the code to make this change first.

Thread, for reference:

http://marc.info/?l=linux-usb&m=139025060301432&w=2

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Mathias Nyman

2014-01-23 15:51:54 UTC

Post by Sarah Sharp
If it really matters that a driver be able to get an immediate -EPIPE
response back from a device when the endpoint is halted, I can rip the
code out of xhci_endpoint_reset() and make the driver manually clean up
the ring at the time of the stall, like it does for control endpoints
and babble/transfer/split TX errors. If it doesn't matter, test #13
should be changed to only submit one URB and get back an -EPIPE
response.

I'll take a look at it, but it might take a few days before I get it done.

-Mathias

David Laight

2014-01-22 10:33:18 UTC