SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Sebastian Gottschall s.gottschall at dd-wrt.com
Wed Mar 25 12:57:36 AWST 2020


How can you make sure that no context switch is happening if the kernel 
uses NEON instructions itself? By stopping the kernel?

That is fairly impossible. Check whether this option 
(CONFIG_KERNEL_MODE_NEON, from my previous mail quoted below) is on, and 
disable it to make sure that the kernel does not make use of NEON 
instructions.
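
One way to check the option on a running system is sketched below. It assumes the kernel was built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists, and that a Python interpreter is available on the device; neither is confirmed anywhere in this thread. The function names are hypothetical.

```python
import gzip
import re


def kernel_option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...) or 'unset'.

    Options that are disabled appear as '# CONFIG_X is not set' and are
    reported as 'unset', the same as options that are absent entirely.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else "unset"


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        text = read_running_config()
    except OSError as exc:
        print("could not read kernel config:", exc)
    else:
        state = kernel_option_state(text, "CONFIG_KERNEL_MODE_NEON")
        print("CONFIG_KERNEL_MODE_NEON =", state)
```

If /proc/config.gz is not present, the same check can be run against the .config file of the firmware build tree by passing its contents to kernel_option_state().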


On 25.03.2020 at 05:25, Horshack wrote:
> I excluded context switches as a possible culprit by looping until a 
> corruption happened during which no context switches occurred while the 
> test was running (i.e., at the start of the test I would save the number 
> of involuntary/voluntary context switches from /proc/<pid>/status, then 
> check those counts again after the failure; if they were different I 
> restarted the test and kept looping until a failure happened in which 
> the context switch counts were the same).
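
The check described above can be sketched roughly as follows. This is not the code from the ipq8065-sqrbug utility; it only illustrates the idea using the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields of /proc/<pid>/status. The run_test and read_status callables are hypothetical placeholders for the actual test and for reading the status file.

```python
FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_ctxt_switches(status_text):
    """Extract the two context switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in FIELDS:
            counts[key] = int(value.strip())
    return counts


def read_self_status():
    """Read this process's status file (Linux only)."""
    with open("/proc/self/status") as handle:
        return handle.read()


def run_until_clean_failure(run_test, read_status, max_attempts=1000):
    """Repeat run_test() until it fails with unchanged switch counters.

    run_test() is a hypothetical callable returning True on corruption.
    Returns the attempt number of the first such failure, or None.
    """
    for attempt in range(1, max_attempts + 1):
        before = parse_ctxt_switches(read_status())
        failed = run_test()
        after = parse_ctxt_switches(read_status())
        if failed and before == after:
            return attempt
    return None
```

A failure reported by run_until_clean_failure(my_test, read_self_status) happened with both counters unchanged, which is the condition described above. As the reply at the top of this message points out, that does not rule out the kernel using NEON in other ways.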
>
> ------------------------------------------------------------------------
> *From:* dropbear-bounces+horshack=live.com at ucc.asn.au 
> <dropbear-bounces+horshack=live.com at ucc.asn.au> on behalf of Sebastian 
> Gottschall <s.gottschall at dd-wrt.com>
> *Sent:* Tuesday, March 24, 2020 9:13 PM
> *To:* dropbear at ucc.asn.au <dropbear at ucc.asn.au>
> *Subject:* Re: SSH key exchange fails 30-70% of the time on Netgear 
> X4S R7800
>
> If the corruption is caused by a context switch, the problem can be 
> caused by the kernel.
> Try disabling "CONFIG_KERNEL_MODE_NEON" in the kernel config. This will 
> disable some kernel crypto assembly code.
>
> On 24.03.2020 at 16:11, Matt Johnston wrote:
>> Good work narrowing down a test case there.
>> That's an interesting finding - I guess it might be worth posting on 
>> OpenWRT lists/forum to try to find other testers.
>> Could it be power related if the tight multiplication loop is 
>> stressing it somehow? It doesn't seem to be using Neon 
>> instructions for anything apart from loads/stores though - is there 
>> something that the compiler should be doing when mixing Neon and 
>> non-Neon operations?
>>
>> Cheers,
>> Matt
>>
>> (Your emails got held up being over 100kB, I've trimmed the reply 
>> below and let them through. Apologies to everyone for the stale old 
>> one that got let through with them just now, I wasn't looking closely)
>>
>>> On Tue 24/3/2020, at 11:23 am, Horshack <horshack at live.com> wrote:
>>>
>>> I was able to isolate the issue to just a handful of assembly 
>>> instructions within fast_s_mp_sqr(), related to the squaring loop. I 
>>> broke that code out into a separate utility that reproduces the 
>>> issue within a few seconds. The failure is somewhat sensitive to the 
>>> data pattern and very sensitive to timing, indicating a likely 
>>> memory/data path issue within my particular router. I'm guessing 
>>> it's the IPQ8065 and not the SDRAM because I can get it to fail with 
>>> a tiny data set that easily fits within DCACHE. I can alter the 
>>> frequency of the failure with a single ARM memory barrier instruction, 
>>> which at first implied a superscalar data ordering condition, but the 
>>> memory barrier also alters the timing through the DCACHE, so that is 
>>> likely the effect it's having. I was able to exclude VFP/Neon 
>>> register corruption as the cause with some test code. I also 
>>> excluded any context switch-specific issue by measuring the number of 
>>> context switches in /proc/<pid>/status and catching a failure where 
>>> no switches had occurred. I also modified the affinity so the 
>>> utility runs on just one processor to rule out a specific core 
>>> having the issue.
>>>
>>> I put the source and binary of my utility on github - if anyone on 
>>> this mailing list has this model router can you give it a try if 
>>> possible? You only need the ipq8065-sqrbug (binary) and 
>>> run-ipq8065-sqrbug.sh (script). Here's the link to the 
>>> repository: https://github.com/horshack-dpreview/ipq8065-sqrbug
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Horshack <horshack at live.com>
>>> *Sent:* Saturday, March 21, 2020 7:54 AM
>>> *To:* dropbear at ucc.asn.au
>>> *Subject:* SSH key exchange fails 30-70% of the time on Netgear X4S 
>>> R7800
>>>
>>> Including mailing list for my last two messages below...
>>>
>>> Begin forwarded message:
>>>
>>>> *From:* Horshack <horshack at live.com>
>>>> *Date:* March 21, 2020 at 7:35:18 AM PDT
>>>> *To:* Matt Johnston <matt at ucc.asn.au>
>>>> *Cc:* dropbear at ucc.asn.au
>>>> *Subject:* Re: SSH key exchange fails 30-70% of the time on 
>>>> Netgear X4S R7800
>>>>
>>>> 
>>>> Disassembly of fast_s_mp_sqr() and other libtommath functions 
>>>> reveals gcc is utilizing the ARM NEON SIMD instructions and 
>>>> registers for calculations involved with libtommath's mp_word 
>>>> scalar. Based on the 64-bit word corruption I see, I'm guessing the 
>>>> SIMD registers aren't being preserved/restored properly somewhere, 
>>>> probably during a context switch, specifically s16–s31 (d8–d15, 
>>>> q4–q7), which AAPCS says must be preserved and which I see being 
>>>> used in the disassembly of fast_s_mp_sqr(). I'll write some test 
>>>> code later today to see if this is the case, and if so, try to 
>>>> track down where and why the registers aren't being preserved.
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Horshack <horshack at live.com>
>>>> *Sent:* Saturday, March 21, 2020 1:11 AM
>>>> *To:* Matt Johnston <matt at ucc.asn.au>
>>>> *Cc:* dropbear at ucc.asn.au
>>>> *Subject:* Re: SSH key exchange fails 30-70% of the time on Netgear 
>>>> X4S R7800
>>>>
>>>> I have one of the failure paths isolated down to a single corrupt 
>>>> 64-bit word in memory, which required a significant amount of code 
>>>> instrumentation to achieve. I implemented a code execution history 
>>>> buffer that gets filled at various checkpoints within 
>>>> s_mp_exptmod() and some of the modules called by it. To facilitate 
>>>> this history mechanism I packaged all of s_mp_exptmod()'s local 
>>>> variables inside a structure. Each history entry saves the local 
>>>> scalar vars in addition to crc32's of all the mp_int data 
>>>> structures, with a separate crc32 of the mp_int.dp payload (data). 
>>>> When a failure occurs, i.e. one or more of the three back-to-back 
>>>> debug invocations of s_mp_exptmod() yields a mismatching signed key 
>>>> result, I dump out the history elements for each of the 
>>>> invocations to determine the first code checkpoint where the failing 
>>>> invocation departed from the known correct invocation.
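
The checkpoint-history idea described above can be illustrated with a small sketch. The original instrumentation was done in C inside libtommath and is not shown in this thread, so the record layout, the checkpoint labels and the function names here are assumptions made only for illustration.

```python
import zlib


def checkpoint(history, label, scalars, digits):
    """Append one history record for a checkpoint.

    label   -- hypothetical checkpoint name
    scalars -- local scalar variables at this point
    digits  -- digit payload (stand-in for mp_int.dp), each digit < 2**64
    """
    payload = b"".join(d.to_bytes(8, "little") for d in digits)
    history.append((label, tuple(scalars), zlib.crc32(payload)))


def first_divergence(good, bad):
    """Return (index, label) of the first checkpoint where two runs differ.

    Returns None if the compared records are all identical.
    """
    for index, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return index, g[0]
    return None


if __name__ == "__main__":
    good, bad = [], []
    checkpoint(good, "init", (0,), [1, 2])
    checkpoint(good, "sqr", (1,), [3, 4])
    checkpoint(bad, "init", (0,), [1, 2])
    checkpoint(bad, "sqr", (1,), [3, 5])  # one corrupted digit
    print(first_divergence(good, bad))
```

Storing only a CRC32 per checkpoint keeps the history small while still showing where a failing invocation first departs from a known correct one.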
>>
>> *snipped*
>>
>>