SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Sebastian Gottschall s.gottschall at dd-wrt.com
Wed Mar 25 12:13:16 AWST 2020


If the corruption is caused by a context switch, the problem may lie in
the kernel.
Try disabling "CONFIG_KERNEL_MODE_NEON" in the kernel config. This will
disable some kernel crypto assembly code.
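
For reference, a disabled option normally appears in the kernel .config as
an "is not set" comment line. A minimal fragment, assuming the usual kconfig
format (where the config file lives in a given firmware build tree depends
on the target and is not specified here):

```
# CONFIG_KERNEL_MODE_NEON is not set
```

The NEON-based crypto options depend on this symbol, so they should drop
out of the build once it is off.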

On 24.03.2020 at 16:11, Matt Johnston wrote:
> Good work narrowing down a test case there.
> That's an interesting finding - I guess it might be worth posting on 
> OpenWRT lists/forum to try to find other testers.
> Could it be power related if the tight multiplication loop is 
> stressing it somehow? It doesn't seem to be using the Neon instruction 
> for anything apart from loads/stores though - is there something that 
> the compiler should be doing mixing Neon and non-Neon operations?
>
> Cheers,
> Matt
>
> (Your emails got held up being over 100kB, I've trimmed the reply 
> below and let them through. Apologies to everyone for the stale old 
> one that got let through with them just now, I wasn't looking closely)
>
>> On Tue 24/3/2020, at 11:23 am, Horshack <horshack at live.com> wrote:
>>
>> I was able to isolate the issue to just a handful of assembly 
>> instructions within fast_s_mp_sqr(), related to the squaring loop. I 
>> broke that code out into a separate utility that reproduces the issue 
>> within a few seconds. The failure is somewhat sensitive to the data 
>> pattern and very sensitive to timing, indicating a likely memory/data 
>> path issue within my particular router. I'm guessing it's the IPQ8065 
>> and not the SDRAM because I can get it to fail with a tiny data set 
>> that easily fits within DCACHE. I can alter the frequency of the 
>> failure with a single ARM memory barrier instruction, which at first 
>> implied a superscalar data ordering condition, but the memory barrier 
>> also alters the timing through the DCACHE, so that is likely the 
>> effect it's having. I was able to exclude VFP/Neon register 
>> corruption as the cause with some test code. I also excluded any 
>> context switch-specific issue by measuring the number of context 
>> switches in /proc/<pid>/status and catching a failure where no 
>> switches had occurred. I also modified the affinity so the utility 
>> runs on just one processor to rule out a specific core having the 
>> issue.
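
The context-switch check described above could look roughly like this on
Linux. This is only a sketch and not the code from the ipq8065-sqrbug
repository: the helper names read_ctxt_switches() and ctxt_switch_delta()
are made up for illustration, and it assumes a single-threaded process
whose /proc/self/status exposes the voluntary_ctxt_switches and
nonvoluntary_ctxt_switches fields.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns voluntary + nonvoluntary context switches
 * of the calling process, or -1 if the fields cannot be read. */
static long read_ctxt_switches(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long total = 0, value;
    int found = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* The literal prefix must match from the start of the line, so the
         * "voluntary" pattern does not match the "nonvoluntary" line. */
        if (sscanf(line, "voluntary_ctxt_switches: %ld", &value) == 1 ||
            sscanf(line, "nonvoluntary_ctxt_switches: %ld", &value) == 1) {
            total += value;
            found++;
        }
    }
    fclose(f);
    return found == 2 ? total : -1;
}

/* Hypothetical helper: number of switches between two samples. */
static long ctxt_switch_delta(long before, long after)
{
    assert(before >= 0 && after >= before);
    return after - before;
}
```

Sampling the counter immediately before and after one iteration of the
test loop shows whether the process was switched out during that
iteration; a failing iteration with a delta of 0 is the case described
above. Pinning to a single core can be done separately, for example with
sched_setaffinity().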
>>
>> I put the source and binary of my utility on github - if anyone on 
>> this mailing list has this model router can you give it a try if 
>> possible? You only need the ipq8065-sqrbug (binary) and 
>> run-ipq8065-sqrbug.sh (script). Here's the link to the 
>> repository: https://github.com/horshack-dpreview/ipq8065-sqrbug
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Horshack <horshack at live.com>
>> *Sent:* Saturday, March 21, 2020 7:54 AM
>> *To:* dropbear at ucc.asn.au
>> *Subject:* SSH key exchange fails 30-70% of the time on Netgear X4S R7800
>>
>> Including mailing list for my last two messages below...
>>
>> Begin forwarded message:
>>
>>> *From:* Horshack <horshack at live.com>
>>> *Date:* March 21, 2020 at 7:35:18 AM PDT
>>> *To:* Matt Johnston <matt at ucc.asn.au>
>>> *Cc:* dropbear at ucc.asn.au
>>> *Subject:* Re: SSH key exchange fails 30-70% of the time on Netgear 
>>> X4S R7800
>>>
>>> 
>>> Disassembly of fast_s_mp_sqr() and other libtommath functions 
>>> reveals gcc is utilizing the ARM NEON SIMD instructions and 
>>> registers for calculations involving libtommath's mp_word 
>>> scalar. Based on the 64-bit word corruption I see, I'm guessing the 
>>> SIMD registers aren't being preserved/restored properly somewhere, 
>>> probably during a context switch, specifically s16–s31 (d8–d15, 
>>> q4–q7), which AAPCS says must be preserved and which I see being 
>>> used in the disassembly of fast_s_mp_sqr(). I'll write some test 
>>> code later today to see if this is the case, and if so, try to track 
>>> down where and why the registers aren't being preserved.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Horshack <horshack at live.com>
>>> *Sent:* Saturday, March 21, 2020 1:11 AM
>>> *To:* Matt Johnston <matt at ucc.asn.au>
>>> *Cc:* dropbear at ucc.asn.au
>>> *Subject:* Re: SSH key exchange fails 30-70% of the time on Netgear 
>>> X4S R7800
>>>
>>> I have one of the failure paths isolated down to a single corrupt 
>>> 64-bit word in memory, which required a significant amount of code 
>>> instrumentation to achieve. I implemented a code execution history 
>>> buffer that gets filled at various checkpoints within s_mp_exptmod() 
>>> and some of the modules called by it. To facilitate this history 
>>> mechanism I packaged all of s_mp_exptmod()'s local variables inside 
>>> a structure. Each history entry saves the local scalar vars along 
>>> with crc32s of all the mp_int data structures, plus a separate 
>>> crc32 of the mp_int.dp payload (data). When a failure occurs, i.e. 
>>> one or more of the three back-to-back debug invocations of 
>>> s_mp_exptmod() yields a mismatching signed key result, I dump out 
>>> the history elements for each of the invocations to determine the 
>>> first code checkpoint where the failing invocation departed from 
>>> the known correct invocation.
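
The checkpoint history described above can be sketched as follows. This is
not the instrumentation that was actually added to s_mp_exptmod(): the
names crc32_buf, hist_entry, history, hist_record and
hist_first_divergence are made up for illustration, and the checksum is a
plain bitwise CRC-32 (reflected polynomial 0xEDB88320), which may differ
from the crc32 used in the original debugging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_buf(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* One history entry: which checkpoint was hit and a CRC of the state. */
typedef struct {
    int      checkpoint_id;
    uint32_t state_crc;
} hist_entry;

#define HIST_MAX 64

typedef struct {
    hist_entry e[HIST_MAX];
    size_t     n;
} history;

/* Record a checkpoint; entries beyond HIST_MAX are silently dropped. */
static void hist_record(history *h, int id, const void *state, size_t len)
{
    assert(h != NULL && state != NULL);
    if (h->n < HIST_MAX) {
        h->e[h->n].checkpoint_id = id;
        h->e[h->n].state_crc = crc32_buf(state, len);
        h->n++;
    }
}

/* Index of the first entry where two runs differ, or -1 if identical. */
static long hist_first_divergence(const history *good, const history *bad)
{
    size_t n = good->n < bad->n ? good->n : bad->n;

    for (size_t i = 0; i < n; i++) {
        if (good->e[i].checkpoint_id != bad->e[i].checkpoint_id ||
            good->e[i].state_crc != bad->e[i].state_crc)
            return (long)i;
    }
    return good->n == bad->n ? -1 : (long)n;
}
```

Each invocation fills its own history, and comparing a failing run against
a known-good run gives the first checkpoint at which the state differed,
which narrows the search to the code between that checkpoint and the one
before it.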
>
> *snipped*
>
>