From kazuo at irixnet.org Mon Jan 27 15:13:38 2020
From: kazuo at irixnet.org (Kazuo Kuroi)
Date: Mon, 27 Jan 2020 02:13:38 -0500
Subject: Update: Bundled Libtommcrypt broken on IRIX
Message-ID: <35718f4a-5845-a489-4a18-059eddeedac3@irixnet.org>

Hi guys,

I was able to get past that point with the libtomcrypt/libtommath from their website. It's working fine here. I guess it's time to update the bundle.

Now to finish compilation (slow build system, to avoid too much noise/heat).

-Kazuo

From kazuo at irixnet.org Tue Jan 28 16:10:24 2020
From: kazuo at irixnet.org (Kazuo Kuroi)
Date: Tue, 28 Jan 2020 03:10:24 -0500
Subject: Update 2: Bundled Libtommcrypt broken on IRIX
Message-ID:

Well, I was able to confirm this is a bug specific to MIPSPro, not to GCC, and that it is upstream in libtomcrypt. I will stop bothering you guys about this.

Thanks anyways.

From kazuo at irixnet.org Fri Jan 31 03:32:32 2020
From: kazuo at irixnet.org (Kazuo Kuroi)
Date: Thu, 30 Jan 2020 14:32:32 -0500
Subject: SIGCLD issues with dropbear server on IRIX 6.5.x
Message-ID:

Hello,

I've recently compiled dropbear 2019.78, as most people here are aware from my mailing list messages. As an update, a build with GCC was successful, but there appears to be a problem with how dropbear is handling SIGCLD on IRIX, which causes forked processes to hang.

https://pastebin.com/04v74Yt5

Here is the output of attaching par (IRIX's version of truss) to the PID, then trying to log out of the session, which hangs the ssh client and that particular process of the server. I have to send SIGKILL to the process to stop it on the server side.

Additionally, I'd like to report a small patch needed to fix IOV_MAX for IRIX in netio.c:

    #ifndef IOV_MAX
        #if defined(__CYGWIN__) && !defined(UIO_MAXIOV)
            #define IOV_MAX 1024
        #elif defined(__sgi)
            #define IOV_MAX 512
        #else
            #define IOV_MAX UIO_MAXIOV
        #endif
    #endif

The #elif branch testing the __sgi macro had to be added.
This will fix it for both MIPSPro and GCC.

I've also confirmed with the libtom upstream that the version of libtomcrypt that I reported in my earlier messages isn't exhibiting that bug; so if someone could give insight into how to fix this:

https://pastebin.com/gnGxGmZH

then a build with MIPSPro may finally be possible.

Thank you very much.

- Kazuo Kuroi

From dropbear at ukku.uk Fri Feb 14 00:55:26 2020
From: dropbear at ukku.uk (Geoff Winkless)
Date: Thu, 13 Feb 2020 16:55:26 +0000
Subject: Dropbear processes getting into uninterruptible I/O process "D" state
In-Reply-To:
References:
Message-ID:

On Tue, 15 Oct 2019 at 15:30, Matt Johnston wrote:
> I think regardless of what Dropbear's doing with pipes (closed sessions etc), there is probably something wrong with the Linux kernel.
> As far as I know userspace can't trigger D state even intentionally (I'd be interested if anyone knows a way though).
> -K is unrelated, that just sends some SSH traffic at a certain interval.

Apologies for dropping in on an old thread. I'm not sure if this is still a problem, but I came across something similar with some of my own code and remembered this thread, so I figured I'd add what I found in case it is helpful (probably not, but you never know...)

Linux puts the parent of a vfork() into the D state until the child execs or exits. As a result, if you vfork() and then fail to exec() (or, more likely, the exec() call fails) without calling _exit() in the child, Linux will leave the parent in D state until the child quits.

Is DROPBEAR_VFORK enabled on the example build for some reason? It shouldn't be - as far as I can tell HAVE_FORK should always be true on Linux, but I guess it could be incorrectly configured somehow.
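For comparison with the test code in this message, here is a minimal sketch of the safe pattern: the child calls _exit() as soon as exec fails, so the parent is released instead of staying in D state. The helper name spawn_and_wait() and the exit code 127 are assumptions made for illustration (127 follows the shell convention for a command that could not be run); none of this is taken from the Dropbear sources.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* ask glibc to declare vfork() */
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Dropbear): run `path` with no arguments
 * and return its exit status, or -1 if the child could not be created or
 * did not exit normally.
 */
int spawn_and_wait(const char *path)
{
    int status;
    pid_t pid = vfork();

    if (pid < 0) {
        return -1;              /* vfork() itself failed */
    }
    if (pid == 0) {
        /* Child: after vfork() only exec or _exit() are safe. */
        execl(path, path, (char *)NULL);
        /* exec failed: leave immediately so the parent is released. */
        _exit(127);
    }
    /* Parent: resumes once the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

Calling spawn_and_wait("/bin/invalidbinary") should come straight back with 127 rather than leaving the caller blocked, assuming that path does not exist on the machine.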
Test code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main (int argc, char **argv) {
      char buff[1024];
      if (vfork()==0) { // child process
        snprintf(buff, sizeof(buff), "ps auxww | grep %s", argv[0]);
        system(buff); // shows D process
        execl("/bin/invalidbinary", "invalidbinary", (char *)NULL);
        // note no _exit() call
      }
      // if you do stuff here, the child will take over, while the parent is stuck in D
    }

Now it's not immediately obvious how this would happen in the dropbear code, although I might be missing something obvious, but it's one very easy way to get a D state process on Linux.

Geoff

From kazuo at irixnet.org Sat Feb 15 03:41:47 2020
From: kazuo at irixnet.org (Kazuo Kuroi)
Date: Fri, 14 Feb 2020 14:41:47 -0500
Subject: IRIX/MIPSPro patches for Dropbear 2019.78
Message-ID:

Hi guys,

I finally worked with another user on the forums of irixnet.org and we got all of the compilation issues on IRIX fixed, including the reported SIGCLD issues.

Here's our patch for it: http://irix.cc/raion/patches/dropbear-2019-irix.patch

I realize this isn't the highest-quality patch or the best hack in the industry, so please be kind. If we need to explain anything, let us know.

I hope this can be upstreamed, if Matt and the other users are still working on dropbear?

-Kazuo Kuroi

From rubonmtz at gmail.com Fri Feb 21 03:14:00 2020
From: rubonmtz at gmail.com (M Rubon)
Date: Thu, 20 Feb 2020 14:14:00 -0500
Subject: Can I disable duplicate public key check from dropbear client?
Message-ID:

When I use the dropbear client, it causes a duplicate public key check on the OpenSSH server. Is there any way of preventing this separate [preauth] check from happening? For my application it is very useful to have the authentication attempted just once.
A StackExchange comment (dave_thompson_085 on https://superuser.com/q/1116927 ) to a different SSH question notes that

> The [preauth] on those log lines and the Postponed result (both) mean
> the client sent authreq with method=publickey and boolean=FALSE to "query"
> whether the pubkey "would be acceptable" (see rfc4252 section 7).

This matches what I see in the OpenSSH logs below. I checked and do not see this [preauth] when I make the same connection from an OpenSSH client.

Mike

I am connecting from my router using dropbear. The command I use is

    ssh -i .ssh/id_rsa fast@cat

On the OpenSSH server I see:

Feb 19 15:41:09 cat sshd[17906]: Accepted key RSA SHA256:iJN19jufdHFey0pwLK70PqgV3rgT99iQaWVmY7M8qZ0 found at /home/fast/.ssh/authorized_keys:16
Feb 19 15:41:09 cat sshd[17906]: Postponed publickey for fast from 45.78.113.202 port 48930 ssh2 [preauth]
Feb 19 15:41:09 cat sshd[17906]: Accepted key RSA SHA256:iJN19jufdHFey0pwLK70PqgV3rgT99iQaWVmY7M8qZ0 found at /home/fast/.ssh/authorized_keys:16
Feb 19 15:41:09 cat sshd[17906]: Accepted publickey for fast from 45.78.113.202 port 48930 ssh2: RSA SHA256:iJN19jufdHFey0pwLK70PqgV3rgT99iQaWVmY7M8qZ0
Feb 19 15:41:09 cat sshd[17906]: pam_unix(sshd:session): session opened for user fast by (uid=0)

From themiron.ru at gmail.com Fri Mar 6 22:45:28 2020
From: themiron.ru at gmail.com (Vladislav Grishenko)
Date: Fri, 6 Mar 2020 19:45:28 +0500
Subject: [PATCH] Add Ed25519 keys support
Message-ID: <005b01d5f3c5$e322e740$a968b5c0$@gmail.com>

Hello,

Initially inspired by Péter Szabó's work from 2017, but made with a more general approach:

- Curve25519/Ed25519 implementation based on TweetNaCl version 20140427; the old Google curve25519_donna is dropped as unnecessary, which saves a lot of size.
- SHA512 reused from LibTomCrypt, no need to keep own copy
- Sign/Verify require no additional memory allocation
- Dropbear's API made ~similar to LibTomCrypt devel to ease a possible switch, if necessary. Anyway, LibTomCrypt is based on TweetNaCl as well.
- Default private key path is /etc/dropbear/dropbear_ed25519_host_key
- Implemented general import from / export to OpenSSH private keys, can be reused for other key types if necessary
- Implemented *25519 fuzzers, but still need corresponding data from dropbear-fuzzcorpus
- Man, license, comments updated to fit Ed25519

So far, DROPBEAR_CURVE25519 increases the dropbear binary by ~2.5 KB on x86-64, versus ~8 KB for the current curve25519_donna implementation. DROPBEAR_ED25519 adds ~7.5 KB to dropbear and ~1 KB to dropbearconvert for OpenSSH import/export.

The related PR against current sources is here: https://github.com/mkj/dropbear/pull/91 , and the patches are attached.
Review and/or any suggestions will be highly appreciated.

Thank you and
Best Regards, Vladislav Grishenko

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200306/383c92d2/attachment-0001.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Add-support-for-Ed25519-as-a-public-key-type.patch
Type: application/octet-stream
Size: 79914 bytes
Desc: not available
Url : https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200306/383c92d2/attachment-0003.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Add-curve25519-and-ed25519-fuzzers.patch
Type: application/octet-stream
Size: 6353 bytes
Desc: not available
Url : https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200306/383c92d2/attachment-0004.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Add-import-and-export-of-Ed25519-keys.patch
Type: application/octet-stream
Size: 7066 bytes
Desc: not available
Url : https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200306/383c92d2/attachment-0005.obj

From ruben at mrbrklyn.com Sun Mar 8 02:42:30 2020
From: ruben at mrbrklyn.com (Ruben Safir)
Date: Sat, 7 Mar 2020 13:42:30 -0500
Subject: android access
Message-ID: <1f652793-3a36-df29-83a7-228c3d0dfeb4@mrbrklyn.com>

Hello - I am sure this has been asked, but I couldn't find an answer with a web search.

Can one access org.galexander/Files on Android through an application? I wanted to sshfs a bunch of files to my tablet for a trip and it is so damn hard it is angering.

Ruben
--
So many immigrant groups have swept through our town
that Brooklyn, like Atlantis, reaches mythological
proportions in the mind of the world - RI Safir 1998
http://www.mrbrklyn.com
DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002

http://www.nylxs.com - Leadership Development in Free Software
http://www.brooklyn-living.com

Being so tracked is for FARM ANIMALS and extermination camps,
but incompatible with living as a free human being. -RI Safir 2013

From matt at ucc.asn.au Sun Mar 8 18:46:48 2020
From: matt at ucc.asn.au (Matt Johnston)
Date: Sun, 8 Mar 2020 18:46:48 +0800
Subject: android access
In-Reply-To: <1f652793-3a36-df29-83a7-228c3d0dfeb4@mrbrklyn.com>
References: <1f652793-3a36-df29-83a7-228c3d0dfeb4@mrbrklyn.com>
Message-ID:

Hi Ruben,

Not sure about that particular Android program, but Filezilla usually works as an alright sftp program.

Cheers,
Matt

> On Sun 8/3/2020, at 2:42 am, Ruben Safir wrote:
>
> Hello - I am sure this has been asked, but I couldn't find an answer with
> a web search.
>
> Can one access org.galexander/Files on Android through an application?
> I wanted to sshfs a bunch of files to my tablet for a trip and it is so
> damn hard it is angering.
>
> Ruben
> --
> So many immigrant groups have swept through our town
> that Brooklyn, like Atlantis, reaches mythological
> proportions in the mind of the world - RI Safir 1998
> http://www.mrbrklyn.com
> DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
>
> http://www.nylxs.com - Leadership Development in Free Software
> http://www.brooklyn-living.com
>
> Being so tracked is for FARM ANIMALS and extermination camps,
> but incompatible with living as a free human being. -RI Safir 2013

From ada at thorsis.com Tue Mar 10 22:53:05 2020
From: ada at thorsis.com (Alexander Dahl)
Date: Tue, 10 Mar 2020 15:53:05 +0100
Subject: [PATCH] Update remaining advise to edit options.h
Message-ID: <1fb7cd010d7da0d7deb8.1583851985@ada.ifak-system.com>

# HG changeset patch
# User Alexander Dahl
# Date 1583851118 -3600
#      Tue Mar 10 15:38:38 2020 +0100
# Node ID 1fb7cd010d7da0d7deb8b7571fd3d4a8af46fc86
# Parent  7402218141d4af3bec95929226ce5f0e435313a2
Update remaining advise to edit options.h

You should edit localoptions.h instead.

diff --git a/INSTALL b/INSTALL
--- a/INSTALL
+++ b/INSTALL
@@ -56,7 +56,7 @@
 uClibc toolchain compiler (ie export CC=i386-uclibc-gcc or whatever).
 You can use "make STATIC=1" to make statically linked binaries, and it is
 advisable to strip the binaries too. If you're looking to make a small binary,
-you should remove unneeded ciphers and MD5, by editing options.h
+you should remove unneeded ciphers and MD5, by editing localoptions.h
 
 It is possible to compile zlib in, by copying zlib.h and zconf.h into a
 subdirectory (ie zlibincludes), and
diff --git a/configure.ac b/configure.ac
--- a/configure.ac
+++ b/configure.ac
@@ -873,4 +873,4 @@
 fi
 
 AC_MSG_NOTICE()
-AC_MSG_NOTICE([Now edit options.h to choose features.])
+AC_MSG_NOTICE([Now edit localoptions.h to choose features.])

From ada at thorsis.com Tue Mar 10 22:55:02 2020
From: ada at thorsis.com (Alexander Dahl)
Date: Tue, 10 Mar 2020 15:55:02 +0100
Subject: [PATCH] Update remaining advise to edit options.h
Message-ID: <1fb7cd010d7da0d7deb8.1583852102@ada.ifak-system.com>

 INSTALL      |  2 +-
 configure.ac |  2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

# HG changeset patch
# User Alexander Dahl
# Date 1583851118 -3600
#      Tue Mar 10 15:38:38 2020 +0100
# Node ID 1fb7cd010d7da0d7deb8b7571fd3d4a8af46fc86
# Parent  7402218141d4af3bec95929226ce5f0e435313a2
Update remaining advise to edit options.h

You should edit localoptions.h instead.

diff --git a/INSTALL b/INSTALL
--- a/INSTALL
+++ b/INSTALL
@@ -56,7 +56,7 @@
 uClibc toolchain compiler (ie export CC=i386-uclibc-gcc or whatever).
 You can use "make STATIC=1" to make statically linked binaries, and it is
 advisable to strip the binaries too. If you're looking to make a small binary,
-you should remove unneeded ciphers and MD5, by editing options.h
+you should remove unneeded ciphers and MD5, by editing localoptions.h
 
 It is possible to compile zlib in, by copying zlib.h and zconf.h into a
 subdirectory (ie zlibincludes), and
diff --git a/configure.ac b/configure.ac
--- a/configure.ac
+++ b/configure.ac
@@ -873,4 +873,4 @@
 fi
 
 AC_MSG_NOTICE()
-AC_MSG_NOTICE([Now edit options.h to choose features.])
+AC_MSG_NOTICE([Now edit localoptions.h to choose features.])

From ada at thorsis.com Wed Mar 11 18:04:03 2020
From: ada at thorsis.com (Alexander Dahl)
Date: Wed, 11 Mar 2020 11:04:03 +0100
Subject: [PATCH 0 of 1] Fix build
Message-ID:

Hei hei,

I'm currently working on upgrading dropbear in the ptxdist embedded Linux build system. ptxdist has a quite strict policy of explicitly passing all available ./configure options in order to get a predictable build result. So I tried passing --disable-fuzz, which did not have the expected effect.

I confirmed the broken behaviour on my workstation (Debian 9 (stretch) on amd64) and this is the fix I came up with. It just mimics what the other options do. I have very little experience with autotools, so please review carefully.

And sorry for the messed up mails from yesterday, my Mercurial skills are a bit rusty and sending patches with hg is a little different from Git ;-)

Greets
Alex

From ada at thorsis.com Wed Mar 11 18:04:04 2020
From: ada at thorsis.com (Alexander Dahl)
Date: Wed, 11 Mar 2020 11:04:04 +0100
Subject: [PATCH 1 of 1] configure: Fix --disable-fuzz
In-Reply-To:
References:
Message-ID: <7bf75f196b4d28dffd42.1583921044@ada.ifak-system.com>

When explicitly passing --disable-fuzz to ./configure, fuzz was actually enabled.

Signed-off-by: Alexander Dahl

diff --git a/configure.ac b/configure.ac
--- a/configure.ac
+++ b/configure.ac
@@ -341,14 +341,21 @@
 AC_ARG_ENABLE(fuzz,
     [  --enable-fuzz           Build fuzzing. Not recommended for deployment.],
     [
-        AC_DEFINE(DROPBEAR_FUZZ, 1, Fuzzing)
-        AC_MSG_NOTICE(Enabling fuzzing)
-        DROPBEAR_FUZZ=1
-        # libfuzzer needs linking with c++ libraries
-        AC_PROG_CXX
+        if test "x$enableval" = "xyes"; then
+            AC_DEFINE(DROPBEAR_FUZZ, 1, Fuzzing)
+            AC_MSG_NOTICE(Enabling fuzzing)
+            DROPBEAR_FUZZ=1
+            # libfuzzer needs linking with c++ libraries
+            AC_PROG_CXX
+        else
+            AC_DEFINE(DROPBEAR_FUZZ, 0, Fuzzing)
+            AC_MSG_NOTICE(Disabling fuzzing)
+            DROPBEAR_FUZZ=0
+        fi
     ],
     [
         AC_DEFINE(DROPBEAR_FUZZ, 0, Fuzzing)
+        AC_MSG_NOTICE(Disabling fuzzing)
         DROPBEAR_FUZZ=0
     ]

From matt at ucc.asn.au Thu Mar 12 00:16:27 2020
From: matt at ucc.asn.au (Matt Johnston)
Date: Thu, 12 Mar 2020 00:16:27 +0800
Subject: [PATCH] Add Ed25519 keys support
In-Reply-To: <005b01d5f3c5$e322e740$a968b5c0$@gmail.com>
References: <005b01d5f3c5$e322e740$a968b5c0$@gmail.com>
Message-ID:

Thank you Vladislav, I've merged this now via github, https://secure.ucc.asn.au/hg/dropbear/rev/d32bcb5c557d

It's a nice, clean and thorough implementation.

Cheers,
Matt

> On Fri 6/3/2020, at 10:45 pm, Vladislav Grishenko wrote:
>
> Hello,
>
> Initially inspired by Péter Szabó's work from 2017, but made with a more general approach:
>
> - Curve25519/Ed25519 implementation based on TweetNaCl version 20140427; the old Google curve25519_donna is dropped as unnecessary, which saves a lot of size.
> - SHA512 reused from LibTomCrypt, no need to keep own copy
> - Sign/Verify require no additional memory allocation
> - Dropbear's API made ~similar to LibTomCrypt devel to ease a possible switch, if necessary. Anyway, LibTomCrypt is based on TweetNaCl as well.
> - Default private key path is /etc/dropbear/dropbear_ed25519_host_key
> - Implemented general import from / export to OpenSSH private keys, can be reused for other key types if necessary
> - Implemented *25519 fuzzers, but still need corresponding data from dropbear-fuzzcorpus
> - Man, license, comments updated to fit Ed25519
>
> So far, DROPBEAR_CURVE25519 increases the dropbear binary by ~2.5 KB on x86-64, versus ~8 KB for the current curve25519_donna implementation.
> DROPBEAR_ED25519 adds ~7.5 KB to dropbear and ~1 KB to dropbearconvert for OpenSSH import/export.
>
> The related PR against current sources is here: https://github.com/mkj/dropbear/pull/91 , and the patches are attached.
> Review and/or any suggestions will be highly appreciated.
>
> Thank you and
> Best Regards, Vladislav Grishenko
>
> <0001-Add-support-for-Ed25519-as-a-public-key-type.patch><0002-Add-curve25519-and-ed25519-fuzzers.patch><0003-Add-import-and-export-of-Ed25519-keys.patch>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200312/7b020295/attachment.htm

From fancsali at gmail.com Wed Mar 18 19:22:34 2020
From: fancsali at gmail.com (Dániel Fancsali)
Date: Wed, 18 Mar 2020 11:22:34 +0000
Subject: Timeout settings
Message-ID:

Hello,

First of all, let me just say this: awesome piece of software. Cheers!

I am, however, a bit confused about the idle/keepalive settings. I have been working with OpenSSH quite a bit, and do understand the concepts around ServerAlive and ClientAlive as well as the TCPKeepAlive settings. But I still struggle to wrap my head around -K and -I in dropbear. It's a tad bit unclear which one maps to which one; or in other words, which one happens on what layer.

Maybe my mistake here is trying to understand those in the context of the OpenSSH settings, but on some level, it's the same protocol.

So, looking at the code, I think this is what happens:
- Setting -Kx will send an SSH packet every x seconds, and if there's no answer 3 times in a row, it considers the connection to be dead. So this is essentially the ServerAlive/ClientAlive mechanism.
- Specifying -Iy would say, if there's no incoming or outgoing data for y seconds, it considers the connection dead.
So this is sort of the other side of the TCP keepalive coin.

Is my understanding correct? If not, can someone please shed some light on this for me?

Regards,
Daniel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200318/2496a3b9/attachment-0001.htm

From taniahagan at googlemail.com Wed Mar 18 20:09:14 2020
From: taniahagan at googlemail.com (Tania Hagan)
Date: Wed, 18 Mar 2020 12:09:14 +0000
Subject: Hiding dropbear output on boot up
Message-ID:

Hi Dropbear,

I have set up dropbear and busybox on an Ubuntu 18.04 desktop with LUKS encryption. This works wonderfully, except that the IP-Config output displays over the unlock disk prompt, causing confusion for users. Is there a way to either hide this output or have it display before the LUKS unlock disk prompt?

Thank you very much for any help.
Tania

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200318/538cee8b/attachment.htm

From matt at ucc.asn.au Wed Mar 18 22:57:59 2020
From: matt at ucc.asn.au (Matt Johnston)
Date: Wed, 18 Mar 2020 22:57:59 +0800
Subject: Timeout settings
In-Reply-To:
References:
Message-ID: <8F8D8547-39DB-4FFF-8A46-864E1FD3FEB5@ucc.asn.au>

Hi Daniel,

-K is equivalent to the OpenSSH ClientAliveInterval. The server will send traffic to check that the connection is open.

-I will disconnect if there is no traffic for a certain time interval. It won't try to send any traffic over the connection, it just passively looks at what traffic is being sent.

Note that currently -K messages seem to reset the -I idle timer, which isn't right; there's a pull request https://github.com/mkj/dropbear/pull/90 which I will merge soon.

Cheers,
Matt

> On Wed 18/3/2020, at 7:22 pm, Dániel Fancsali wrote:
>
> Hello,
>
> First of all, let me just say this: awesome piece of software. Cheers!
>
> I am, however, a bit confused about the idle/keepalive settings. I have been working with OpenSSH quite a bit, and do understand the concepts around ServerAlive and ClientAlive as well as the TCPKeepAlive settings. But I still struggle to wrap my head around -K and -I in dropbear. It's a tad bit unclear which one maps to which one; or in other words, which one happens on what layer.
>
> Maybe my mistake here is trying to understand those in the context of the OpenSSH settings, but on some level, it's the same protocol.
>
> So, looking at the code, I think this is what happens:
> - Setting -Kx will send an SSH packet every x seconds, and if there's no answer 3 times in a row, it considers the connection to be dead. So this is essentially the ServerAlive/ClientAlive mechanism.
> - Specifying -Iy would say, if there's no incoming or outgoing data for y seconds, it considers the connection dead. So this is sort of the other side of the TCP keepalive coin.
>
> Is my understanding correct? If not, can someone please shed some light on this for me?
>
> Regards,
> Daniel

From fancsali at gmail.com Wed Mar 18 23:06:36 2020
From: fancsali at gmail.com (Dániel Fancsali)
Date: Wed, 18 Mar 2020 15:06:36 +0000
Subject: Timeout settings
In-Reply-To: <8F8D8547-39DB-4FFF-8A46-864E1FD3FEB5@ucc.asn.au>
References: <8F8D8547-39DB-4FFF-8A46-864E1FD3FEB5@ucc.asn.au>
Message-ID:

Hello,

Thank you very much for the answer. That clears it up.

I reckon specifying '-K' on dbclient would then do the same as ServerAliveInterval.

Cheers,
Daniel

On Wed, 18 Mar 2020 at 14:58, Matt Johnston wrote:

> Hi Daniel,
>
> -K is equivalent to the OpenSSH ClientAliveInterval. The server will send traffic to check that the connection is open.
>
> -I will disconnect if there is no traffic for a certain time interval. It won't try to send any traffic over the connection, it just passively looks at what traffic is being sent.
>
> Note that currently -K messages seem to reset the -I idle timer, which isn't right; there's a pull request https://github.com/mkj/dropbear/pull/90 which I will merge soon.
>
> Cheers,
> Matt
>
> > On Wed 18/3/2020, at 7:22 pm, Dániel Fancsali wrote:
> >
> > Hello,
> >
> > First of all, let me just say this: awesome piece of software. Cheers!
> >
> > I am, however, a bit confused about the idle/keepalive settings. I have been working with OpenSSH quite a bit, and do understand the concepts around ServerAlive and ClientAlive as well as the TCPKeepAlive settings. But I still struggle to wrap my head around -K and -I in dropbear. It's a tad bit unclear which one maps to which one; or in other words, which one happens on what layer.
> >
> > Maybe my mistake here is trying to understand those in the context of the OpenSSH settings, but on some level, it's the same protocol.
> >
> > So, looking at the code, I think this is what happens:
> > - Setting -Kx will send an SSH packet every x seconds, and if there's no answer 3 times in a row, it considers the connection to be dead. So this is essentially the ServerAlive/ClientAlive mechanism.
> > - Specifying -Iy would say, if there's no incoming or outgoing data for y seconds, it considers the connection dead. So this is sort of the other side of the TCP keepalive coin.
> >
> > Is my understanding correct? If not, can someone please shed some light on this for me?
> >
> > Regards,
> > Daniel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200318/8935042b/attachment.htm

From matt at ucc.asn.au Wed Mar 18 23:18:26 2020
From: matt at ucc.asn.au (Matt Johnston)
Date: Wed, 18 Mar 2020 23:18:26 +0800
Subject: Hiding dropbear output on boot up
In-Reply-To:
References:
Message-ID: <4FE80C80-1A2F-465C-8578-EF5192AB48A2@ucc.asn.au>

Hi Tania,

I think you could probably add "> /dev/null 2> /dev/null" after one of the ipconfig commands in /usr/share/initramfs-tools/scripts/functions, though I'm not too familiar with how they all fit together. (Or if it's dhclient for ipv6 printing the output, get rid of the "-v" for dhclient).

You could report a Debian bug for the initramfs package; they might have a better idea of a fix for it. https://packages.debian.org/stretch/dropbear-initramfs

Cheers,
Matt

> On Wed 18/3/2020, at 8:09 pm, Tania Hagan wrote:
>
> Hi Dropbear,
>
> I have set up dropbear and busybox on an Ubuntu 18.04 desktop with LUKS encryption. This works wonderfully, except that the IP-Config output displays over the unlock disk prompt, causing confusion for users. Is there a way to either hide this output or have it display before the LUKS unlock disk prompt?
>
> Thank you very much for any help.
> Tania

From horshack at live.com Thu Mar 19 00:36:24 2020
From: horshack at live.com (Horshack)
Date: Wed, 18 Mar 2020 16:36:24 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
Message-ID:

Hi,

I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (ie, SSH'ing into the router from the router via an existing remote SSH session).

The failure appears to be at the tail end of the key exchange, before authentication. I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause.

Once an SSH login succeeds the connection is stable. However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues.

I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or a more remote possibility, an issue with the Dropbear implementation.

Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:
Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200318/b06350e6/attachment.htm

From horshack at live.com Thu Mar 19 15:42:39 2020
From: horshack at live.com (Horshack)
Date: Thu, 19 Mar 2020 07:42:39 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References:
Message-ID:

Update - I cloned and built the dbclient source so I could enable the debug tracing facility to get more information about the 'Bad hostkey signature'. The intermittent failure is detected in recv_msg_kexdh_reply() -> buf_rsa_verify() -> mp_cmp(). If I bypass the buf_rsa_verify() call then the session proceeds normally without issue, which indicates everything else in the key exchange is working 100% of the time. I'll dig deeper to see why the signed host key sent by the server is wrong.

________________________________
From: Horshack
Sent: Wednesday, March 18, 2020 9:36 AM
To: dropbear at ucc.asn.au
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Hi,

I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (ie, SSH'ing into the router from the router via an existing remote SSH session).

The failure appears to be at the tail end of the key exchange, before authentication. I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause.

Once an SSH login succeeds the connection is stable. However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues.

I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or a more remote possibility, an issue with the Dropbear implementation.

Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:
Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200319/f572f95e/attachment-0001.htm

From matt at ucc.asn.au Thu Mar 19 22:04:42 2020
From: matt at ucc.asn.au (Matt Johnston)
Date: Thu, 19 Mar 2020 22:04:42 +0800
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References:
Message-ID:

Hi,

The first thing I'd try would be to build with -O0 compilation flags to rule out compiler optimisations doing something strange.
Cheers,
Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200319/159f89a7/attachment.htm
From horshack at live.com Thu Mar 19 22:11:19 2020
From: horshack at live.com (=?utf-8?B?SG9yc2hhY2sg4oCq4oCs?=)
Date: Thu, 19 Mar 2020 14:11:19 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: ,
Message-ID:

Thanks Matt, I'll give that a shot when I get a build environment set up for the server-side/openwrt. I also plan to look at the RSA blinding logic in buf_put_rsa_sign(). Considering the intermittency of the issue I'm thinking the issue has some correlation or dependency to the random data generated or transformed by that logic. Crypto is well outside my core competency so it'll be slow-going.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200319/9d90244a/attachment-0001.htm
From horshack at live.com Fri Mar 20 15:28:12 2020
From: horshack at live.com (=?utf-8?B?SG9yc2hhY2sg4oCq4oCs?=)
Date: Fri, 20 Mar 2020 07:28:12 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: , ,
Message-ID:

Update - I have isolated the intermittent issue down to the interchangeable functions s_mp_exptmod_fast() and s_mp_exptmod(). By default s_mp_exptmod_fast() is compiled instead of s_mp_exptmod() [BN_MP_EXPTMOD_FAST_C], but both functions fail intermittently, and I decided to focus on s_mp_exptmod() because it's slightly simpler. s_mp_exptmod() is called indirectly by rsa.c::buf_put_rsa_sign()'s call to mp_exptmod(). In the intermittent failing case, if I immediately call mp_exptmod() / s_mp_exptmod() again with the same source mp_int structures, it yields the correct data. Example - debug code bolded:

    DEF_MP_INT(rsa_s_backup);
    DEF_MP_INT(rsa_s_backup_2);

    mp_copy (&rsa_s, &rsa_s_backup);
    mp_copy (&rsa_s, &rsa_s_backup_2);

    if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s) != MP_OKAY) {
        dropbear_exit("RSA error");
    }
    /* same inputs again, results go to the two backup mp_ints */
    if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup) != MP_OKAY) {
        dropbear_exit("RSA error");
    }
    if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup_2) != MP_OKAY) {
        dropbear_exit("RSA error");
    }

    printf("after mp_exptmod\n");
    dump_mp_int("rsa_s", &rsa_s);
    dump_mp_int("rsa_s_backup", &rsa_s_backup);
    dump_mp_int("rsa_s_backup_2", &rsa_s_backup_2);
    comp_mp_int("rsa_s", "rsa_s_backup", &rsa_s, &rsa_s_backup);
    comp_mp_int("rsa_s_backup", "rsa_s_backup_2", &rsa_s_backup, &rsa_s_backup_2);
    mp_clear(&rsa_s_backup);
    mp_clear(&rsa_s_backup_2);

Sample output from a failure, which contains the first portion of each mp_int->dp. Bolded text has wrong data:

after mp_exptmod
rsa_s [0xbef6c358]:
0000  4a 00 00 00 c0 00 00 00 00 00 00 00 30 e1 8f 00  J...........0...
rsa_s->dp [0x008fe130]:
0000  05 fb c0 0f 68 91 ff 0a 9f 05 57 0b 35 a2 bd 05  ....h.....W.5...
0010  57 ec a0 0b 34 3c b1 0f fa 8b b5 08 ed aa 9c 04  W...4<..........
0020  7e 88 bb 04 12 42 51 05 9a 6d 7d 0a 98 ef 12 0c  ~....BQ..m}.....
0030  76 e0 f4 0f ea 89 d7 0c 87 b0 76 03 12 a1 2d 0e  v.........v...-.
0040  d7 3c df 06 0f 54 92 04 23 90  .<...T..#.
rsa_s_backup [0xbef6c398]:
0000  4a 00 00 00 c0 00 00 00 00 00 00 00 00 d8 8f 00  J...............
rsa_s_backup->dp [0x008fd800]:
0000  ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06  ............Ea..
0010  a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d  ..Y..I$.P.....\.
0020  cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07  ..<.3...(....i..
0030  8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08  ............q...
0040  bc 4f c7 0d de 73 f9 06 0d bf  .O...s....
rsa_s_backup_2 [0xbef6c3a8]:
0000  4a 00 00 00 c0 00 00 00 00 00 00 00 e0 d1 8f 00  J...............
rsa_s_backup_2->dp [0x008fd1e0]:
0000  ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06  ............Ea..
0010  a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d  ..Y..I$.P.....\.
0020  cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07  ..<.3...(....i..
0030  8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08  ............q...
0040  bc 4f c7 0d de 73 f9 06 0d bf  .O...s....
rsa_s and rsa_s_backup differ

Sometimes it's the second or third call that yields the incorrect data. In this instance it was the second call:

after mp_exptmod
rsa_s [0xbe9a6358]:
0000  4a 00 00 00 c0 00 00 00 00 00 00 00 30 c1 40 02  J...........0.@.
rsa_s->dp [0x0240c130]:
0000  25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06  %....b...-......
0010  3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08  ?....]..-.K.l<r.
[...]
rsa_s_backup->dp [0x0240b800]:
0000  df 86 0c 0a 6c 2f 68 09 f9 a1 37 01 26 02 e7 0b  ....l/h...7.&...
0010  69 5c b8 0e 0b 95 3a 0d 26 24 00 0e 97 6f dc 0b  i\....:.&$...o..
0020  64 95 ed 0a c0 75 53 03 66 3d ff 0b 26 4b ce 09  d....uS.f=..&K..
0030  89 12 d2 03 9b 9b 0b 09 19 2c 5a 00 2c 99 fc 0b  .........,Z.,...
0040  ea ad 61 09 38 e1 6a 0a 49 a5  ..a.8.j.I.
rsa_s_backup_2 [0xbe9a63a8]:
0000  4a 00 00 00 c0 00 00 00 00 00 00 00 e0 b1 40 02  J.............@.
rsa_s_backup_2->dp [0x0240b1e0]:
0000  25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06  %....b...-......
0010  3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08  ?....]..-.K.l<r.
[...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200320/29b757ab/attachment-0001.htm
From horshack at live.com Tue Mar 24 11:23:29 2020
From: horshack at live.com (=?utf-8?B?SG9yc2hhY2sg4oCq4oCs?=)
Date: Tue, 24 Mar 2020 03:23:29 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: , <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au>, , ,
Message-ID:

I was able to isolate the issue to just a handful of assembly instructions within fast_s_mp_sqr(), related to the squaring loop. I broke that code out into a separate utility that reproduces the issue within a few seconds. The failure is somewhat sensitive to the data pattern and very sensitive to timing, indicating a likely memory/data path issue within my particular router. I'm guessing it's the IPQ8065 and not the SDRAM because I can get it to fail with a tiny data set that easily fits within DCACHE. I can alter the frequency of the failure with a single ARM memory barrier instruction, which at first implied a superscalar data ordering condition, but the memory barrier also alters the timing through the DCACHE, so that is likely the effect it's having. I was able to exclude VFP/Neon register corruption as the cause with some test code. I also excluded any context switch-specific issue by measuring the # of context switches in /proc/<pid>/status and catching a failure where no switches had occurred.
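A context-switch check like the one described above could look roughly like this. It is only a sketch, assuming a Linux /proc filesystem and a single-threaded test program; read_ctxt_switches() and ctxt_switches_during() are hypothetical names and are not taken from the ipq8065-sqrbug utility. The counters are sampled before and after the code under test, and a failure seen with a delta of zero cannot be blamed on a context switch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse the two context-switch counters that Linux
 * exposes in /proc/<pid>/status (voluntary_ctxt_switches and
 * nonvoluntary_ctxt_switches).
 * Returns 0 on success, -1 if the file or either field is missing. */
static int read_ctxt_switches(const char *status_path,
                              unsigned long *voluntary,
                              unsigned long *nonvoluntary)
{
    char line[256];
    int found = 0;
    FILE *f;

    assert(status_path != NULL && voluntary != NULL && nonvoluntary != NULL);

    f = fopen(status_path, "r");
    if (f == NULL) {
        return -1;
    }
    while (fgets(line, sizeof(line), f) != NULL) {
        if (sscanf(line, "voluntary_ctxt_switches: %lu", voluntary) == 1) {
            found |= 1;
        } else if (sscanf(line, "nonvoluntary_ctxt_switches: %lu", nonvoluntary) == 1) {
            found |= 2;
        }
    }
    fclose(f);
    return (found == 3) ? 0 : -1;
}

/* Run work() once and return the number of context switches (voluntary plus
 * involuntary) the calling process saw meanwhile, or -1 if the counters
 * could not be read. */
static long ctxt_switches_during(void (*work)(void))
{
    unsigned long v0, n0, v1, n1;

    if (read_ctxt_switches("/proc/self/status", &v0, &n0) != 0) {
        return -1;
    }
    work();
    if (read_ctxt_switches("/proc/self/status", &v1, &n1) != 0) {
        return -1;
    }
    return (long)((v1 - v0) + (n1 - n0));
}
```

In a test loop, ctxt_switches_during() would wrap one pass of the squaring code; a miscompare on a pass that returned 0 is the interesting case.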
I also modified the affinity so the utility runs on just one processor to rule out a specific core having the issue.

I put the source and binary of my utility on GitHub - if anyone on this mailing list has this model router, can you give it a try if possible? You only need the ipq8065-sqrbug (binary) and run-ipq8065-sqrbug.sh (script). Here's the link to the repository:

https://github.com/horshack-dpreview/ipq8065-sqrbug

________________________________
From: Horshack
Sent: Saturday, March 21, 2020 7:54 AM
To: dropbear at ucc.asn.au
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Including mailing list for my last two messages below...

Begin forwarded message:

From: Horshack
Date: March 21, 2020 at 7:35:18 AM PDT
To: Matt Johnston
Cc: "dropbear at ucc.asn.au"
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Disassembly of fast_s_mp_sqr() and other libtommath functions reveals gcc is utilizing the ARM NEON SIMD instructions and registers for calculations involved with libtommath's mp_word scalar. Based on the 64-bit word corruption I see, I'm guessing the SIMD registers aren't being preserved/restored properly somewhere, probably during a context switch, specifically s16-s31 (d8-d15, q4-q7), which AAPCS says must be preserved and which I see being used in the disassembly of fast_s_mp_sqr(). I'll write some test code later today to see if this is the case, and if so, try to track down where and why the registers aren't being preserved.

________________________________
From: Horshack
Sent: Saturday, March 21, 2020 1:11 AM
To: Matt Johnston
Cc: dropbear at ucc.asn.au
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

I have one of the failure paths isolated down to a single corrupt 64-bit word in memory, which required a significant amount of code instrumentation to achieve.
I implemented a code execution history buffer that gets filled at various checkpoints within s_mp_exptmod() and some of the modules called by it. To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local variables inside a structure. Each history element saves the local scalar vars in addition to crc32's of all the mp_int data structures, with a separate crc32 of the mp_int.dp payload (data). When a failure occurs, i.e. one or more of the three back-to-back debug invocations of s_mp_exptmod yields a mismatching signed key result, I dump out the history elements for each of the invocations to determine the first code checkpoint where the failing invocation departed from the known correct invocation.

Here's a sample capture demonstrating. Format is event #:, source code line #, crc32 of local scalars, crc32 of mp_int structures (minus dp field), and crc32 of all the mp_int dp data payloads. In this sample, the crc32 of the dp data payload is different, which causes all subsequent crc32's for the remainder of the invocation to differ, since the data propagates through all the subsequent calculations performed in the routine.
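The checkpoint idea described above can be sketched roughly as follows. This is not the actual instrumentation: history_t, CHECKPOINT, history_record() and first_divergence() are illustrative names, HISTORY_MAX is an arbitrary capacity, and the plain bitwise CRC-32 is an assumption about how the checksums might be computed. Each checkpoint stores the source line and a checksum of the tracked data, and two recorded invocations are then compared to find the first checkpoint at which they part ways.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY_MAX 1024            /* hypothetical capacity */

typedef struct {
    uint16_t line;                  /* source line of the checkpoint */
    uint32_t crc;                   /* CRC-32 of the tracked data at that point */
} history_entry_t;

typedef struct {
    size_t count;
    history_entry_t entries[HISTORY_MAX];
} history_t;

/* Plain bitwise CRC-32 (reflected form, polynomial 0xEDB88320). */
static uint32_t crc32_buf(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Record one checkpoint; entries beyond the capacity are silently dropped. */
static void history_record(history_t *h, int line, const void *data, size_t len)
{
    assert(h != NULL);
    if (h->count < HISTORY_MAX) {
        h->entries[h->count].line = (uint16_t)line;
        h->entries[h->count].crc = crc32_buf(data, len);
        h->count++;
    }
}

/* Convenience wrapper that tags the checkpoint with the current source line. */
#define CHECKPOINT(h, data, len) history_record((h), __LINE__, (data), (len))

/* Index of the first checkpoint where two runs differ, or -1 if identical. */
static long first_divergence(const history_t *a, const history_t *b)
{
    size_t n = (a->count < b->count) ? a->count : b->count;

    for (size_t i = 0; i < n; i++) {
        if (a->entries[i].line != b->entries[i].line ||
            a->entries[i].crc != b->entries[i].crc) {
            return (long)i;
        }
    }
    return (a->count == b->count) ? -1 : (long)n;
}
```

With one history_t per invocation, the index returned by first_divergence() plays the role of the event number at which the dumps below stop matching.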
1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001
1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001
1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001
1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=7691624d crcRes=6d1388bc, 0021 0005 0016 0002 0061 0005 0001

1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001
1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001
1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001
1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=5e3343d2 crcRes=517ed1b0, 0021 0005 0016 0002 0061 0005 0001

I initially found the failure occurs at seemingly random places, affected mostly by the variances of code/data placement between builds, which also affects the frequency of failure.
Through a lot of trial and error I was able to tease the failure down to one of the simplest code paths (fast_s_mp_sqr), which required balancing debug code placement to keep the movement of the failure in control. fast_s_mp_sqr() does only basic arithmetic and is easy to follow. I haven't yet determined if the corrupt data is pre-calculation or post-calculation due to the limits of how much data I can snapshot in the history buffer. Nevertheless I expanded the history mechanism to snapshot the specific mp_int that usually is corrupted via this path (s_mp_exptmod's local res structure). Here is correct vs corrupt mp_init at the specific execution point where it departs from the previous correction invocation. The data fields prefixed by : are the actual content of the mp_int - I've highlighted the mismatching crc32's and the mismatching 64-bit word: Correct invocation: 8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=e92a3e1f crcRes=02003870, 0018 0005 0020 0002 0016 0002 0000 : 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46 : 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a : 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515 : 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c : 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051 : 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140 : 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 
0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956 : 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb : 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 04f8c371 07daa886 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255 : 0e7c9e10 03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 Corrupt invocation: 8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=5a521526 crcRes=86bd8450, 0018 0005 0020 0002 0016 0002 0000 : 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46 : 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a : 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515 : 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c : 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051 : 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140 : 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956 : 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb : 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 07156229 072adcf7 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255 : 0e7c9e10 
03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 I don't see any immediate relationship between the corrupt vs expected data or any unique attributes of the corrupt data over multiple captures I've done. The above mp_int is post-execution of fast_s_mp_sqr(), so any corruption occurring within its execution will get folded in and propagated into a form that wont be immediately recognizable since it's undergone arithmetic operations within the routine. The fact the corruption is always a single 64-bit word is a good clue. fast_s_mp_sqr() uses 64-bit scalars (mp_word) in its carry arithmetic logic - I'll be looking into the disassembly of the routine to dig deeper. For reference here is the history structures used for the above dumps: typedef struct _LOCAL_VARS { // local vars of s_mp_exptmod() packaged into a struct mp_int *G; mp_int *X; mp_int *P; mp_int *Y; mp_int M[TAB_SIZE]; mp_int res; mp_int mu; mp_digit buf; int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize; } LOCAL_VARS; typedef struct _HISTORY_ELEMENT { ushort lineNumber; ushort pad; uint crcLocalVars; uint crcMpInt_WithoutDp; // mp_int structure excluding .dp uint crcMpIntDp; // all mp_int's in LOCAL_VARS uint crcRes; // just LOCAL_VARS.res uint resDp[160]; // content of LOCAL_VARS.res ushort bitbuf, bitcpy, bitcnt, mode, digidx, x, y; } HISTORY_ELEMENT; Here is the CPU info: root at OpenWrt:/tmp# cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 6.00 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 CPU implementer : 0x51 CPU architecture: 7 CPU variant : 0x2 CPU part : 0x04d CPU revision : 0 processor : 1 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 12.50 Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 CPU implementer : 0x51 CPU architecture: 7 CPU variant : 0x2 CPU part : 0x04d CPU revision : 0 Hardware : 
Generic DT based system Revision : 0000 Serial : 0000000000000000 And the first few messages of the kernel log showing version and detected CPU details: [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 4.14.171 (builder at buildhost) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r10947-65030d81f3)) #0 SMP Thu Feb 27 21:05:12 2020 [ 0.000000] CPU: ARMv7 Processor [512f04d0] revision 0 (ARMv7), cr=10c5787d [ 0.000000] CPU: div instructions available: patching division code [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache [ 0.000000] OF: fdt: Machine model: Netgear Nighthawk X4S R7800 [ 0.000000] Memory policy: Data cache writealloc [ 0.000000] On node 0 totalpages: 122880 [ 0.000000] free_area_init_node: node 0, pgdat c0a27880, node_mem_map dda39000 [ 0.000000] Normal zone: 960 pages used for memmap [ 0.000000] Normal zone: 0 pages reserved [ 0.000000] Normal zone: 122880 pages, LIFO batch:31 [ 0.000000] random: get_random_bytes called from 0xc09008dc with crng_init=0 [ 0.000000] percpu: Embedded 15 pages/cpu s29388 r8192 d23860 u61440 ________________________________ From: Matt Johnston Sent: Friday, March 20, 2020 3:50 AM To: Horshack ?? Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, That's an interesting failure. You should be able to disable SMP if you set maxcpus=1 as a kernel boot argument - not sure where you would set that for your device though. I guess the other option is that a kernel syscall somewhere is clobbering registers, disabling SMP wouldn't avoid that... Which kernel is it running, and what's the CPU (/proc/cpuinfo)? Cheers, Matt On Fri 20/3/2020, at 3:28 pm, Horshack ?? 
> wrote: Update - I have isolated the intermittent issue down to the interchangeable functions s_mp_exptmod_fast() and s_mp_exptmod() - by default s_mp_exptmod_fast() is compiled instead of s_mp_exptmod() [BN_MP_EXPTMOD_FAST_C] but both functions intermittently fail and I decided to use s_mp_exptmod() as my focus because it's slightly simpler. s_mp_exptmod() is called indirectly by rsa.c::buf_put_rsa_sign()'s call to mp_exptmod(). For the intermittent failing case if I call mp_exptmod() / s_mp_exptmod() immediately again with the same source mp_int structures it yields the correct data. Example - debug code bolded: DEF_MP_INT(rsa_s_backup); DEF_MP_INT(rsa_s_backup_2); mp_copy (&rsa_s, &rsa_s_backup); mp_copy (&rsa_s, &rsa_s_backup_2); if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup_2) != MP_OKAY) { dropbear_exit("RSA error"); } printf("after mp_exptmod\n"); dump_mp_int("rsa_s", &rsa_s); dump_mp_int("rsa_s_backup", &rsa_s_backup); dump_mp_int("rsa_s_backup_2", &rsa_s_backup_2); comp_mp_int("rsa_s", "rsa_s_backup", &rsa_s, &rsa_s_backup); comp_mp_int("rsa_s_backup", "rsa_s_backup_2", &rsa_s_backup, &rsa_s_backup_2); mp_clear(&rsa_s_backup); mp_clear(&rsa_s_backup_2); Sample output from a failure, which contains the first portion of each mp_int->dp. Bolded text has wrong data: after mp_exptmod rsa_s [0xbef6c358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 e1 8f 00 J...........0... rsa_s->dp [0x008fe130]: 0000 05 fb c0 0f 68 91 ff 0a 9f 05 57 0b 35 a2 bd 05 ....h.....W.5... 0010 57 ec a0 0b 34 3c b1 0f fa 8b b5 08 ed aa 9c 04 W...4<.......... 0020 7e 88 bb 04 12 42 51 05 9a 6d 7d 0a 98 ef 12 0c ~....BQ..m}..... 0030 76 e0 f4 0f ea 89 d7 0c 87 b0 76 03 12 a1 2d 0e v.........v...-. 0040 d7 3c df 06 0f 54 92 04 23 90 .<...T..#. 
rsa_s_backup [0xbef6c398]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 00 d8 8f 00 J............... rsa_s_backup->dp [0x008fd800]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s_backup_2 [0xbef6c3a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 d1 8f 00 J............... rsa_s_backup_2->dp [0x008fd1e0]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s and rsa_s_backup differ Sometimes it's the second or third call that yields the incorrect data. In this instance it was the second call: after mp_exptmod rsa_s [0xbe9a6358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 c1 40 02 J...........0. at . rsa_s->dp [0x0240c130]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.ldp [0x0240b800]: 0000 df 86 0c 0a 6c 2f 68 09 f9 a1 37 01 26 02 e7 0b ....l/h...7.&... 0010 69 5c b8 0e 0b 95 3a 0d 26 24 00 0e 97 6f dc 0b i\....:.&$...o.. 0020 64 95 ed 0a c0 75 53 03 66 3d ff 0b 26 4b ce 09 d....uS.f=..&K.. 0030 89 12 d2 03 9b 9b 0b 09 19 2c 5a 00 2c 99 fc 0b .........,Z.,... 0040 ea ad 61 09 38 e1 6a 0a 49 a5 ..a.8.j.I. rsa_s_backup_2 [0xbe9a63a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 b1 40 02 J............. at . rsa_s_backup_2->dp [0x0240b1e0]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 
0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.l
[...]
Sent: Thursday, March 19, 2020 7:11 AM
To: Matt Johnston
Cc: dropbear at ucc.asn.au
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Thanks Matt, I'll give that a shot when I get a build environment set up for the server-side/openwrt. I also plan to look at the RSA blinding logic in buf_put_rsa_sign(). Considering the intermittency of the issue I'm thinking the issue has some correlation with or dependency on the random data generated or transformed by that logic. Crypto is well outside my core competency so it'll be slow-going.

________________________________
From: Matt Johnston
Sent: Thursday, March 19, 2020 7:04 AM
To: Horshack
Cc: dropbear at ucc.asn.au
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Hi,

The first thing I'd try would be to build with -O0 compilation flags to rule out compiler optimisations doing something strange.

Cheers,
Matt

On Thu 19/3/2020, at 3:42 pm, Horshack wrote:

Update - I cloned and built the dbclient source so I could enable the debug tracing facility to get more information about the 'Bad hostkey signature'. The intermittent failure is detected in recv_msg_kexdh_reply() -> buf_rsa_verify() -> mp_cmp(). If I bypass the buf_rsa_verify() call then the session proceeds normally without issue, which indicates everything else in the key exchange is working 100% of the time. I'll dig deeper to see why the signed host key sent by the server is wrong.

________________________________
From: Horshack
Sent: Wednesday, March 18, 2020 9:36 AM
To: dropbear at ucc.asn.au
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Hi, I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto".
For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (i.e., SSH'ing into the router from the router via an existing remote SSH session). The failure appears to be at the tail end of the key exchange, before authentication. I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause. Once an SSH login succeeds the connection is stable. However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues. I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or, a more remote possibility, an issue with the Dropbear implementation.
Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:

Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!

From horshack at live.com Thu Mar 19 00:33:12 2020
From: horshack at live.com (Horshack)
Date: Wed, 18 Mar 2020 16:33:12 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
Message-ID:

Hi, I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (ie, SSH'ing into the router from the router via an existing remote SSH session). The failure appears to be at the tail end of the key exchange, before authentication. I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause. Once an SSH login succeeds the connection is stable.
However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues. I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or, a more remote possibility, an issue with the Dropbear implementation.

Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:

Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!

From horshack at live.com Sat Mar 21 16:11:57 2020
From: horshack at live.com (Horshack)
Date: Sat, 21 Mar 2020 08:11:57 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au>
References: , <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au>
Message-ID:

I have one of the failure paths isolated down to a single corrupt 64-bit word in memory, which required a significant amount of code instrumentation to achieve. I implemented a code execution history buffer that gets filled at various checkpoints within s_mp_exptmod() and some of the modules called by it.
To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local variables inside a structure. Each history element saves the local scalar vars in addition to crc32's of all the mp_int data structures, with a separate crc32 of the mp_int.dp payload (data). When a failure occurs, i.e. one or more of the three back-to-back debug invocations of s_mp_exptmod yields a mismatching signed key result, I dump out the history elements for each of the invocations to determine the first code checkpoint where the failing invocation departed from the known correct invocation. Here's a sample capture demonstrating this. Format is event #, source code line #, crc32 of local scalars, crc32 of mp_int structures (minus dp field), and crc32 of all the mp_int dp data payloads. In this sample, the crc32 of the dp data payload is different, which causes all subsequent crc32's for the remainder of the invocation to be different since the data propagates through all the subsequent calculations performed in the routine.
1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001
1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001
1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001
1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=7691624d crcRes=6d1388bc, 0021 0005 0016 0002 0061 0005 0001

1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001
1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001
1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001
1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=5e3343d2 crcRes=517ed1b0, 0021 0005 0016 0002 0061 0005 0001

I initially found the failure occurs at seemingly random places, affected mostly by the variances of code/data placement between builds, which also affects the frequency of failure.
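The checkpoint history and the "first point of departure" comparison described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the instrumentation actually used: the names crc32_buf, history_record, history_first_divergence and the HISTORY_MAX capacity are hypothetical, and each checkpoint stores just the source line and one CRC-32 of a digit payload instead of the full set of fields shown in the dumps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY_MAX 64              /* hypothetical capacity of the buffer */

typedef struct {
    unsigned line;                  /* source line of the checkpoint */
    uint32_t crc_dp;                /* CRC-32 of the digit payload at that point */
} checkpoint_t;

typedef struct {
    checkpoint_t ev[HISTORY_MAX];
    size_t count;
} history_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), table-free for brevity. */
uint32_t crc32_buf(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Record one checkpoint; events are dropped once the buffer is full. */
void history_record(history_t *h, unsigned line,
                    const uint32_t *dp, size_t ndigits)
{
    assert(h != NULL && dp != NULL);
    if (h->count < HISTORY_MAX) {
        h->ev[h->count].line = line;
        h->ev[h->count].crc_dp = crc32_buf(dp, ndigits * sizeof *dp);
        h->count++;
    }
}

/* Index of the first checkpoint where two runs disagree, or -1 if identical. */
long history_first_divergence(const history_t *a, const history_t *b)
{
    size_t n = a->count < b->count ? a->count : b->count;

    for (size_t i = 0; i < n; i++) {
        if (a->ev[i].line != b->ev[i].line ||
            a->ev[i].crc_dp != b->ev[i].crc_dp)
            return (long)i;
    }
    return a->count == b->count ? -1 : (long)n;
}
```

Each invocation would fill its own history_t at the same checkpoints; comparing a known-good run against a failing one with history_first_divergence() then points at the earliest checkpoint whose payload CRC differs, which is the step done by eye in the capture above.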
Through a lot of trial and error I was able to tease the failure down to one of the simplest code paths (fast_s_mp_sqr), which required balancing debug code placement to keep the movement of the failure under control. fast_s_mp_sqr() does only basic arithmetic and is easy to follow. I haven't yet determined if the corrupt data is pre-calculation or post-calculation due to the limits of how much data I can snapshot in the history buffer. Nevertheless I expanded the history mechanism to snapshot the specific mp_int that usually is corrupted via this path (s_mp_exptmod's local res structure). Here is the correct vs corrupt mp_int at the specific execution point where it departs from the previous correct invocation. The data fields prefixed by : are the actual content of the mp_int - I've highlighted the mismatching crc32's and the mismatching 64-bit word:

Correct invocation:
8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=e92a3e1f crcRes=02003870, 0018 0005 0020 0002 0016 0002 0000
: 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46
: 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a
: 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515
: 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c
: 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051
: 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140
: 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956
: 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb
: 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 04f8c371 07daa886 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255
: 0e7c9e10 03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Corrupt invocation:
8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=5a521526 crcRes=86bd8450, 0018 0005 0020 0002 0016 0002 0000
: 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46
: 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a
: 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515
: 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c
: 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051
: 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140
: 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956
: 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb
: 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 07156229 072adcf7 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255
: 0e7c9e10 03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

I don't see any immediate relationship between the corrupt vs expected data or any unique attributes of the corrupt data over multiple captures I've done. The above mp_int is post-execution of fast_s_mp_sqr(), so any corruption occurring within its execution will get folded in and propagated into a form that won't be immediately recognizable since it's undergone arithmetic operations within the routine. The fact the corruption is always a single 64-bit word is a good clue. fast_s_mp_sqr() uses 64-bit scalars (mp_word) in its carry arithmetic logic - I'll be looking into the disassembly of the routine to dig deeper.

For reference here are the history structures used for the above dumps:

typedef struct _LOCAL_VARS { // local vars of s_mp_exptmod() packaged into a struct
    mp_int *G;
    mp_int *X;
    mp_int *P;
    mp_int *Y;
    mp_int M[TAB_SIZE];
    mp_int res;
    mp_int mu;
    mp_digit buf;
    int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
} LOCAL_VARS;

typedef struct _HISTORY_ELEMENT {
    ushort lineNumber;
    ushort pad;
    uint crcLocalVars;
    uint crcMpInt_WithoutDp; // mp_int structure excluding .dp
    uint crcMpIntDp; // all mp_int's in LOCAL_VARS
    uint crcRes; // just LOCAL_VARS.res
    uint resDp[160]; // content of LOCAL_VARS.res
    ushort bitbuf, bitcpy, bitcnt, mode, digidx, x, y;
} HISTORY_ELEMENT;

Here is the CPU info:

root@OpenWrt:/tmp# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 6.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0

processor : 1
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 12.50
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0

Hardware : Generic DT based system
Revision : 0000
Serial : 0000000000000000

And the first few messages of the kernel log showing version and detected CPU details:

[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 4.14.171 (builder@buildhost) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r10947-65030d81f3)) #0 SMP Thu Feb 27 21:05:12 2020
[ 0.000000] CPU: ARMv7 Processor [512f04d0] revision 0 (ARMv7), cr=10c5787d
[ 0.000000] CPU: div instructions available: patching division code
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[ 0.000000] OF: fdt: Machine model: Netgear Nighthawk X4S R7800
[ 0.000000] Memory policy: Data cache writealloc
[ 0.000000] On node 0 totalpages: 122880
[ 0.000000] free_area_init_node: node 0, pgdat c0a27880, node_mem_map dda39000
[ 0.000000] Normal zone: 960 pages used for memmap
[ 0.000000] Normal zone: 0 pages reserved
[ 0.000000] Normal zone: 122880 pages, LIFO batch:31
[ 0.000000] random: get_random_bytes called from 0xc09008dc with crng_init=0
[ 0.000000] percpu: Embedded 15 pages/cpu s29388 r8192 d23860 u61440

________________________________
From: Matt Johnston
Sent: Friday, March 20, 2020 3:50 AM
To: Horshack
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Hi,

That's an interesting failure. You should be able to disable SMP if you set maxcpus=1 as a kernel boot argument - not sure where you would set that for your device though. I guess the other option is that a kernel syscall somewhere is clobbering registers; disabling SMP wouldn't avoid that... Which kernel is it running, and what's the CPU (/proc/cpuinfo)?

Cheers,
Matt

On Fri 20/3/2020, at 3:28 pm, Horshack
> wrote: Update - I have isolated the intermittent issue down to the interchangeable functions s_mp_exptmod_fast() and s_mp_exptmod() - by default s_mp_exptmod_fast() is compiled instead of s_mp_exptmod() [BN_MP_EXPTMOD_FAST_C] but both functions intermittently fail and I decided to use s_mp_exptmod() as my focus because it's slightly simpler. s_mp_exptmod() is called indirectly by rsa.c::buf_put_rsa_sign()'s call to mp_exptmod(). For the intermittent failing case if I call mp_exptmod() / s_mp_exptmod() immediately again with the same source mp_int structures it yields the correct data. Example - debug code bolded: DEF_MP_INT(rsa_s_backup); DEF_MP_INT(rsa_s_backup_2); mp_copy (&rsa_s, &rsa_s_backup); mp_copy (&rsa_s, &rsa_s_backup_2); if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup_2) != MP_OKAY) { dropbear_exit("RSA error"); } printf("after mp_exptmod\n"); dump_mp_int("rsa_s", &rsa_s); dump_mp_int("rsa_s_backup", &rsa_s_backup); dump_mp_int("rsa_s_backup_2", &rsa_s_backup_2); comp_mp_int("rsa_s", "rsa_s_backup", &rsa_s, &rsa_s_backup); comp_mp_int("rsa_s_backup", "rsa_s_backup_2", &rsa_s_backup, &rsa_s_backup_2); mp_clear(&rsa_s_backup); mp_clear(&rsa_s_backup_2); Sample output from a failure, which contains the first portion of each mp_int->dp. Bolded text has wrong data: after mp_exptmod rsa_s [0xbef6c358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 e1 8f 00 J...........0... rsa_s->dp [0x008fe130]: 0000 05 fb c0 0f 68 91 ff 0a 9f 05 57 0b 35 a2 bd 05 ....h.....W.5... 0010 57 ec a0 0b 34 3c b1 0f fa 8b b5 08 ed aa 9c 04 W...4<.......... 0020 7e 88 bb 04 12 42 51 05 9a 6d 7d 0a 98 ef 12 0c ~....BQ..m}..... 0030 76 e0 f4 0f ea 89 d7 0c 87 b0 76 03 12 a1 2d 0e v.........v...-. 0040 d7 3c df 06 0f 54 92 04 23 90 .<...T..#. 
rsa_s_backup [0xbef6c398]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 00 d8 8f 00 J............... rsa_s_backup->dp [0x008fd800]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s_backup_2 [0xbef6c3a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 d1 8f 00 J............... rsa_s_backup_2->dp [0x008fd1e0]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s and rsa_s_backup differ Sometimes it's the second or third call that yields the incorrect data. In this instance it was the second call: after mp_exptmod rsa_s [0xbe9a6358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 c1 40 02 J...........0. at . rsa_s->dp [0x0240c130]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.ldp [0x0240b800]: 0000 df 86 0c 0a 6c 2f 68 09 f9 a1 37 01 26 02 e7 0b ....l/h...7.&... 0010 69 5c b8 0e 0b 95 3a 0d 26 24 00 0e 97 6f dc 0b i\....:.&$...o.. 0020 64 95 ed 0a c0 75 53 03 66 3d ff 0b 26 4b ce 09 d....uS.f=..&K.. 0030 89 12 d2 03 9b 9b 0b 09 19 2c 5a 00 2c 99 fc 0b .........,Z.,... 0040 ea ad 61 09 38 e1 6a 0a 49 a5 ..a.8.j.I. rsa_s_backup_2 [0xbe9a63a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 b1 40 02 J............. at . rsa_s_backup_2->dp [0x0240b1e0]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 
0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.l> Sent: Thursday, March 19, 2020 7:11 AM To: Matt Johnston > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Thanks Matt, I'll give that a shot when I get a build environment set up for the server-side/openwrt. I also plan to look at the RSA blinding logic in buf_put_rsa_sign(). Considering the intermittency of the issue I'm thinking the issue has some correlation or dependency to the random data generated or transformed by that logic. Crypto is well outside my core competency so it'll be slow-going. ________________________________ From: Matt Johnston > Sent: Thursday, March 19, 2020 7:04 AM To: Horshack ?? > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, The first thing I'd try would be to build with -O0 compilation flags to rule out compiler optimisations doing something strange. Cheers, Matt On Thu 19/3/2020, at 3:42 pm, Horshack ?? > wrote: Update - I cloned and built the dbclient source so I could enable the debug tracing facility to get more information about the 'Bad hostkey signature'. The intermittent failure is detected in recv_msg_kexdh_reply() -> buf_rsa_verify() -> mp_cmd(). If I bypass the buf_rsa_verify() call then the session proceeds normally without issue, which indicates everything else in the key exchange is working 100% of the time. I'll dig deeper to see why the signed host key sent by the server is wrong. ________________________________ From: Horshack ?? Sent: Wednesday, March 18, 2020 9:36 AM To: dropbear at ucc.asn.au > Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". 
For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (i.e., SSH'ing into the router from the router via an existing remote SSH session). The failure appears to be at the tail end of the key exchange, before authentication. I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause. Once an SSH login succeeds the connection is stable. However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues. I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or, a more remote possibility, an issue with the Dropbear implementation.
Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:

Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!

From horshack at live.com Sat Mar 21 22:35:17 2020
From: horshack at live.com (Horshack)
Date: Sat, 21 Mar 2020 14:35:17 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: , <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au>,
Message-ID:

Disassembly of fast_s_mp_sqr() and other libtommath functions reveals gcc is utilizing the ARM NEON SIMD instructions and registers for calculations involving libtommath's mp_word scalar. Based on the 64-bit word corruption I see, I'm guessing the SIMD registers aren't being preserved/restored properly somewhere, probably during a context switch, specifically s16-s31 (d8-d15, q4-q7), which AAPCS says must be preserved and which I see being used in the disassembly of fast_s_mp_sqr(). I'll write some test code later today to see if this is the case, and if so, try to track down where and why the registers aren't being preserved.

________________________________
From: Horshack
Sent: Saturday, March 21, 2020 1:11 AM
To: Matt Johnston
Cc: dropbear at ucc.asn.au
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

I have one of the failure paths isolated down to a single corrupt 64-bit word in memory, which required a significant amount of code instrumentation to achieve.
I implemented a code execution history buffer that gets filled at various checkpoints within s_mp_exptmod() and some of the modules called by it. To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local variables inside a structure , which consists of saving the local scalar vars in addition to crc32's of all the mp_int data structures with a separate crc32 of the mp_int.dp payload (data). When a failure occurs, ie one or more of the three back-to-back debug invocations of s_mp_exptmod yields a mismatching signed key result, I dump out the history elements for each of the invocations to determine the first code checkpoint where failing invocation departed from the known correct invocation. Here's a sample capture demonstrating. Format is event #:, source code line #, crc32 of local scalars, crc32 of mp_int structures (minus dp field), and crc32 of all the mp_int dp data payloads. In this sample, the crc32 of the dp data payload is different, which causes all subsequent crc32's for the remainder of the invocation to be difference since the data propagates through all the subsequent calculations performed in the routine. 
1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001 1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001 1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001 1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001 1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001 1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=7691624d crcRes=6d1388bc, 0021 0005 0016 0002 0061 0005 0001 1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001 1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001 1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001 1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001 1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001 1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=5e3343d2 crcRes=517ed1b0, 0021 0005 0016 0002 0061 0005 0001 I initially found the failure occurs at seemingly random places, affected mostly by the variances of code/data placement between builds, which also affects the frequency of failure. 
Through a lot of trial and error I was able to tease the failure down to one of the simplest code paths (fast_s_mp_sqr), which required balancing debug code placement to keep the movement of the failure in control. fast_s_mp_sqr() does only basic arithmetic and is easy to follow. I haven't yet determined if the corrupt data is pre-calculation or post-calculation due to the limits of how much data I can snapshot in the history buffer. Nevertheless I expanded the history mechanism to snapshot the specific mp_int that usually is corrupted via this path (s_mp_exptmod's local res structure). Here is correct vs corrupt mp_init at the specific execution point where it departs from the previous correction invocation. The data fields prefixed by : are the actual content of the mp_int - I've highlighted the mismatching crc32's and the mismatching 64-bit word: Correct invocation: 8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=e92a3e1f crcRes=02003870, 0018 0005 0020 0002 0016 0002 0000 : 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46 : 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a : 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515 : 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c : 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051 : 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140 : 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 
0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956 : 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb : 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 04f8c371 07daa886 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255 : 0e7c9e10 03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 Corrupt invocation: 8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=5a521526 crcRes=86bd8450, 0018 0005 0020 0002 0016 0002 0000 : 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46 : 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a : 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515 : 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c : 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051 : 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140 : 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956 : 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb : 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 07156229 072adcf7 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255 : 0e7c9e10 
03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

I don't see any immediate relationship between the corrupt vs expected data or any unique attributes of the corrupt data over multiple captures I've done. The above mp_int is post-execution of fast_s_mp_sqr(), so any corruption occurring within its execution will get folded in and propagated into a form that won't be immediately recognizable since it's undergone arithmetic operations within the routine. The fact the corruption is always a single 64-bit word is a good clue. fast_s_mp_sqr() uses 64-bit scalars (mp_word) in its carry arithmetic logic - I'll be looking into the disassembly of the routine to dig deeper.

For reference here are the history structures used for the above dumps:

typedef struct _LOCAL_VARS { // local vars of s_mp_exptmod() packaged into a struct
    mp_int *G;
    mp_int *X;
    mp_int *P;
    mp_int *Y;
    mp_int M[TAB_SIZE];
    mp_int res;
    mp_int mu;
    mp_digit buf;
    int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
} LOCAL_VARS;

typedef struct _HISTORY_ELEMENT {
    ushort lineNumber;
    ushort pad;
    uint crcLocalVars;
    uint crcMpInt_WithoutDp; // mp_int structure excluding .dp
    uint crcMpIntDp; // all mp_int's in LOCAL_VARS
    uint crcRes; // just LOCAL_VARS.res
    uint resDp[160]; // content of LOCAL_VARS.res
    ushort bitbuf, bitcpy, bitcnt, mode, digidx, x, y;
} HISTORY_ELEMENT;

Here is the CPU info:

root@OpenWrt:/tmp# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 6.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0

processor : 1
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 12.50
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0

Hardware :
Generic DT based system Revision : 0000 Serial : 0000000000000000 And the first few messages of the kernel log showing version and detected CPU details: [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 4.14.171 (builder at buildhost) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r10947-65030d81f3)) #0 SMP Thu Feb 27 21:05:12 2020 [ 0.000000] CPU: ARMv7 Processor [512f04d0] revision 0 (ARMv7), cr=10c5787d [ 0.000000] CPU: div instructions available: patching division code [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache [ 0.000000] OF: fdt: Machine model: Netgear Nighthawk X4S R7800 [ 0.000000] Memory policy: Data cache writealloc [ 0.000000] On node 0 totalpages: 122880 [ 0.000000] free_area_init_node: node 0, pgdat c0a27880, node_mem_map dda39000 [ 0.000000] Normal zone: 960 pages used for memmap [ 0.000000] Normal zone: 0 pages reserved [ 0.000000] Normal zone: 122880 pages, LIFO batch:31 [ 0.000000] random: get_random_bytes called from 0xc09008dc with crng_init=0 [ 0.000000] percpu: Embedded 15 pages/cpu s29388 r8192 d23860 u61440 ________________________________ From: Matt Johnston Sent: Friday, March 20, 2020 3:50 AM To: Horshack ?? Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, That's an interesting failure. You should be able to disable SMP if you set maxcpus=1 as a kernel boot argument - not sure where you would set that for your device though. I guess the other option is that a kernel syscall somewhere is clobbering registers, disabling SMP wouldn't avoid that... Which kernel is it running, and what's the CPU (/proc/cpuinfo)? Cheers, Matt On Fri 20/3/2020, at 3:28 pm, Horshack ?? 
> wrote: Update - I have isolated the intermittent issue down to the interchangeable functions s_mp_exptmod_fast() and s_mp_exptmod() - by default s_mp_exptmod_fast() is compiled instead of s_mp_exptmod() [BN_MP_EXPTMOD_FAST_C] but both functions intermittently fail and I decided to use s_mp_exptmod() as my focus because it's slightly simpler. s_mp_exptmod() is called indirectly by rsa.c::buf_put_rsa_sign()'s call to mp_exptmod(). For the intermittent failing case if I call mp_exptmod() / s_mp_exptmod() immediately again with the same source mp_int structures it yields the correct data. Example - debug code bolded: DEF_MP_INT(rsa_s_backup); DEF_MP_INT(rsa_s_backup_2); mp_copy (&rsa_s, &rsa_s_backup); mp_copy (&rsa_s, &rsa_s_backup_2); if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup_2) != MP_OKAY) { dropbear_exit("RSA error"); } printf("after mp_exptmod\n"); dump_mp_int("rsa_s", &rsa_s); dump_mp_int("rsa_s_backup", &rsa_s_backup); dump_mp_int("rsa_s_backup_2", &rsa_s_backup_2); comp_mp_int("rsa_s", "rsa_s_backup", &rsa_s, &rsa_s_backup); comp_mp_int("rsa_s_backup", "rsa_s_backup_2", &rsa_s_backup, &rsa_s_backup_2); mp_clear(&rsa_s_backup); mp_clear(&rsa_s_backup_2); Sample output from a failure, which contains the first portion of each mp_int->dp. Bolded text has wrong data: after mp_exptmod rsa_s [0xbef6c358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 e1 8f 00 J...........0... rsa_s->dp [0x008fe130]: 0000 05 fb c0 0f 68 91 ff 0a 9f 05 57 0b 35 a2 bd 05 ....h.....W.5... 0010 57 ec a0 0b 34 3c b1 0f fa 8b b5 08 ed aa 9c 04 W...4<.......... 0020 7e 88 bb 04 12 42 51 05 9a 6d 7d 0a 98 ef 12 0c ~....BQ..m}..... 0030 76 e0 f4 0f ea 89 d7 0c 87 b0 76 03 12 a1 2d 0e v.........v...-. 0040 d7 3c df 06 0f 54 92 04 23 90 .<...T..#. 
rsa_s_backup [0xbef6c398]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 00 d8 8f 00 J............... rsa_s_backup->dp [0x008fd800]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s_backup_2 [0xbef6c3a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 d1 8f 00 J............... rsa_s_backup_2->dp [0x008fd1e0]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s and rsa_s_backup differ Sometimes it's the second or third call that yields the incorrect data. In this instance it was the second call: after mp_exptmod rsa_s [0xbe9a6358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 c1 40 02 J...........0. at . rsa_s->dp [0x0240c130]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.ldp [0x0240b800]: 0000 df 86 0c 0a 6c 2f 68 09 f9 a1 37 01 26 02 e7 0b ....l/h...7.&... 0010 69 5c b8 0e 0b 95 3a 0d 26 24 00 0e 97 6f dc 0b i\....:.&$...o.. 0020 64 95 ed 0a c0 75 53 03 66 3d ff 0b 26 4b ce 09 d....uS.f=..&K.. 0030 89 12 d2 03 9b 9b 0b 09 19 2c 5a 00 2c 99 fc 0b .........,Z.,... 0040 ea ad 61 09 38 e1 6a 0a 49 a5 ..a.8.j.I. rsa_s_backup_2 [0xbe9a63a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 b1 40 02 J............. at . rsa_s_backup_2->dp [0x0240b1e0]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 
0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.l> Sent: Thursday, March 19, 2020 7:11 AM To: Matt Johnston > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Thanks Matt, I'll give that a shot when I get a build environment set up for the server-side/openwrt. I also plan to look at the RSA blinding logic in buf_put_rsa_sign(). Considering the intermittency of the issue I'm thinking the issue has some correlation or dependency to the random data generated or transformed by that logic. Crypto is well outside my core competency so it'll be slow-going. ________________________________ From: Matt Johnston > Sent: Thursday, March 19, 2020 7:04 AM To: Horshack ?? > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, The first thing I'd try would be to build with -O0 compilation flags to rule out compiler optimisations doing something strange. Cheers, Matt On Thu 19/3/2020, at 3:42 pm, Horshack ?? > wrote: Update - I cloned and built the dbclient source so I could enable the debug tracing facility to get more information about the 'Bad hostkey signature'. The intermittent failure is detected in recv_msg_kexdh_reply() -> buf_rsa_verify() -> mp_cmd(). If I bypass the buf_rsa_verify() call then the session proceeds normally without issue, which indicates everything else in the key exchange is working 100% of the time. I'll dig deeper to see why the signed host key sent by the server is wrong. ________________________________ From: Horshack ?? Sent: Wednesday, March 18, 2020 9:36 AM To: dropbear at ucc.asn.au > Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". 
For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (i.e., SSH'ing into the router from the router via an existing remote SSH session). The failure appears to be at the tail end of the key exchange, before authentication.

I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause.

Once an SSH login succeeds the connection is stable. However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues.

I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or a more remote possibility, an issue with the Dropbear implementation.
Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:

Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200321/a6f633f7/attachment-0001.htm

From horshack at live.com Sat Mar 21 22:54:44 2020
From: horshack at live.com (Horshack)
Date: Sat, 21 Mar 2020 14:54:44 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: , <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au>, ,
Message-ID:

Including mailing list for my last two messages below...

Begin forwarded message:

From: Horshack
Date: March 21, 2020 at 7:35:18 AM PDT
To: Matt Johnston
Cc: "dropbear at ucc.asn.au"
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Disassembly of fast_s_mp_sqr() and other libtommath functions reveals gcc is utilizing the ARM NEON SIMD instructions and registers for calculations involved with libtommath's mp_word scalar. Based on the 64-bit word corruption I see, I'm guessing the SIMD registers aren't being preserved/restored properly somewhere, probably during a context switch, specifically s16-s31 (d8-d15, q4-q7), which AAPCS says must be preserved and which I see being used in the disassembly of fast_s_mp_sqr(). I'll write some test code later today to see if this is the case, and if so, try to track down where and why the registers aren't being preserved.

________________________________
From: Horshack
Sent: Saturday, March 21, 2020 1:11 AM
To: Matt Johnston
Cc: dropbear at ucc.asn.au
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

I have one of the failure paths isolated down to a single corrupt 64-bit word in memory, which required a significant amount of code instrumentation to achieve. I implemented a code execution history buffer that gets filled at various checkpoints within s_mp_exptmod() and some of the modules called by it. To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local variables inside a structure. Each history element saves the local scalar vars in addition to crc32's of all the mp_int data structures, with a separate crc32 of the mp_int.dp payload (data). When a failure occurs, i.e. one or more of the three back-to-back debug invocations of s_mp_exptmod yields a mismatching signed key result, I dump out the history elements for each of the invocations to determine the first code checkpoint where the failing invocation departed from the known correct invocation.

Here's a sample capture demonstrating. Format is event #:, source code line #, crc32 of local scalars, crc32 of mp_int structures (minus dp field), and crc32 of all the mp_int dp data payloads. In this sample, the crc32 of the dp data payload is different, which causes all subsequent crc32's for the remainder of the invocation to be different since the data propagates through all the subsequent calculations performed in the routine.
1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001
1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001
1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=21b13223 crcRes=a43fde70, 0021 0005 0016 0002 0061 0004 0001
1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=7691624d crcRes=6d1388bc, 0021 0005 0016 0002 0061 0005 0001

1554: line=0492, crcLocalVars=6a08573e, crcMpIntNoDp=ab967993, crcMpIntDp=ded4078e crcRes=2554be5b, 0021 0005 0016 0002 0061 0003 0001
1555: line=0488, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1556: line=2049, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ad3e197a, crcMpIntDp=e71d5c11 crcRes=5ef59250, 0021 0005 0016 0002 0061 0004 0001
1557: line=2062, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001
1558: line=0492, crcLocalVars=7dc8fe2c, crcMpIntNoDp=ab967993, crcMpIntDp=a2639ce4 crcRes=74f7dec6, 0021 0005 0016 0002 0061 0004 0001
1559: line=0501, crcLocalVars=7a3e1d2a, crcMpIntNoDp=ad3e197a, crcMpIntDp=5e3343d2 crcRes=517ed1b0, 0021 0005 0016 0002 0061 0005 0001

I initially found the failure occurs at seemingly random places, affected mostly by the variances of code/data placement between builds, which also affects the frequency of failure.
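The checkpoint comparison described above could be sketched roughly as follows. This is only a minimal illustration and not the instrumentation actually used in the thread: `checkpoint_t`, `crc32_bytes()` and `first_divergence()` are hypothetical names, the record is trimmed to three fields, and the CRC is a plain bitwise CRC-32 picked for brevity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down history record: one entry per checkpoint. */
typedef struct {
    uint16_t line;        /* source line number of the checkpoint */
    uint32_t crc_locals;  /* crc32 of the local scalar variables */
    uint32_t crc_dp;      /* crc32 of the mp_int digit payloads */
} checkpoint_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no lookup table. */
uint32_t crc32_bytes(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Compare the histories of a known-good and a failing invocation.
 * Returns the index of the first checkpoint whose record differs,
 * or -1 if all n records match.
 */
long first_divergence(const checkpoint_t *good, const checkpoint_t *bad, size_t n)
{
    assert(good != NULL && bad != NULL);

    for (size_t i = 0; i < n; i++) {
        if (good[i].line != bad[i].line ||
            good[i].crc_locals != bad[i].crc_locals ||
            good[i].crc_dp != bad[i].crc_dp)
            return (long)i;
    }
    return -1;
}
```

Applied to the two captures above, starting from event 1554, a comparison like this would stop at index 3 (event 1557), the first record whose crcMpIntDp differs.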
Through a lot of trial and error I was able to tease the failure down to one of the simplest code paths (fast_s_mp_sqr), which required balancing debug code placement to keep the movement of the failure in control. fast_s_mp_sqr() does only basic arithmetic and is easy to follow. I haven't yet determined if the corrupt data is pre-calculation or post-calculation due to the limits of how much data I can snapshot in the history buffer. Nevertheless I expanded the history mechanism to snapshot the specific mp_int that usually is corrupted via this path (s_mp_exptmod's local res structure). Here is correct vs corrupt mp_init at the specific execution point where it departs from the previous correction invocation. The data fields prefixed by : are the actual content of the mp_int - I've highlighted the mismatching crc32's and the mismatching 64-bit word: Correct invocation: 8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=e92a3e1f crcRes=02003870, 0018 0005 0020 0002 0016 0002 0000 : 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46 : 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a : 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515 : 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c : 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051 : 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140 : 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 
0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956 : 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb : 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 04f8c371 07daa886 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255 : 0e7c9e10 03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 Corrupt invocation: 8057: line=2062, crcLocalVars=1d8f10b6, crcMpIntNoDp=80a0f0a7, crcMpIntDp=5a521526 crcRes=86bd8450, 0018 0005 0020 0002 0016 0002 0000 : 05297100 04f1e4e6 0fb47d28 0ab5d584 00b2778c 08656465 02cc79bb 05e280c3 - 073117bc 037170a2 0603ef41 0a73c7af 0388c6cd 08b543fa 055d90c9 006afe46 : 0c4d0d2b 0b8753bf 0ba6b917 0dbc26af 0d5d541f 03cbd888 0a8b07bb 06ce141b - 0f2e2cdc 0d83829c 00b9e992 007a007e 0b35c3fa 0f97fa98 078b16e2 05681c5a : 09e81cad 0fcb1b35 0f017b34 0828f9c8 08253004 02f4139f 07b97efe 03a2c2c6 - 0baf31f0 038dc84d 0ec2028d 0a4d2163 0b3d8f14 03a5b8a1 07656722 0636f515 : 047c6a4e 0249e773 074fdaae 0c7affcb 025e144e 0e6e524b 0369a7e6 005e5b18 - 07359ab7 094aa102 06e091dc 048578b3 0f2023d6 09e16318 0fb25f70 091e7d0c : 00e038fe 01fe0be1 0c879fba 055feb36 05135c48 063ef5c4 062acf74 0e2ee213 - 0b32d4b4 01ac1beb 0df27135 0645d3a2 02f54fab 04524d06 0e21e0a0 01a58051 : 0d0dd311 0b10815a 08044871 0bec8042 0473b083 0d99e620 0db94b72 07398f84 - 06930d29 021f81cd 0e96625a 0ffa3c78 0c9908d6 0fd6f904 0f5dcfd9 0bd6e140 : 0357bd4b 0488f3a9 00ed811d 0c8a129f 0bde5ab5 0c61d340 042eea72 01fe06f5 - 018c9e3d 025ede93 0ce5786c 00c174de 0479c67d 06c711f5 052ebca1 093bf956 : 042b9b5e 06a62fce 0eef5130 0065890a 0ed4ef4d 0adc823d 0b7ab96f 04639d68 - 0484c7b5 0135f153 0818067f 00cffc19 0097dcba 016e355b 002e3d3e 051065cb : 0b41750c 049fb50f 0be87386 0d76e872 0de83a61 07156229 072adcf7 03a70e50 - 0c79ea89 016660c2 0963ebd6 09d9b469 0abd18ff 02c370ac 0ad5b8ba 04846255 : 0e7c9e10 
03662210 00000011 00000000 00000000 00000000 00000000 00000000 - 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

I don't see any immediate relationship between the corrupt vs expected data or any unique attributes of the corrupt data over multiple captures I've done. The above mp_int is post-execution of fast_s_mp_sqr(), so any corruption occurring within its execution will get folded in and propagated into a form that won't be immediately recognizable since it's undergone arithmetic operations within the routine. The fact the corruption is always a single 64-bit word is a good clue. fast_s_mp_sqr() uses 64-bit scalars (mp_word) in its carry arithmetic logic - I'll be looking into the disassembly of the routine to dig deeper.

For reference here are the history structures used for the above dumps:

typedef struct _LOCAL_VARS { // local vars of s_mp_exptmod() packaged into a struct
    mp_int *G;
    mp_int *X;
    mp_int *P;
    mp_int *Y;
    mp_int M[TAB_SIZE];
    mp_int res;
    mp_int mu;
    mp_digit buf;
    int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
} LOCAL_VARS;

typedef struct _HISTORY_ELEMENT {
    ushort lineNumber;
    ushort pad;
    uint crcLocalVars;
    uint crcMpInt_WithoutDp; // mp_int structure excluding .dp
    uint crcMpIntDp; // all mp_int's in LOCAL_VARS
    uint crcRes; // just LOCAL_VARS.res
    uint resDp[160]; // content of LOCAL_VARS.res
    ushort bitbuf, bitcpy, bitcnt, mode, digidx, x, y;
} HISTORY_ELEMENT;

Here is the CPU info:

root@OpenWrt:/tmp# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 6.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0

processor : 1
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 12.50
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0

Hardware :
Generic DT based system Revision : 0000 Serial : 0000000000000000 And the first few messages of the kernel log showing version and detected CPU details: [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 4.14.171 (builder at buildhost) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r10947-65030d81f3)) #0 SMP Thu Feb 27 21:05:12 2020 [ 0.000000] CPU: ARMv7 Processor [512f04d0] revision 0 (ARMv7), cr=10c5787d [ 0.000000] CPU: div instructions available: patching division code [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache [ 0.000000] OF: fdt: Machine model: Netgear Nighthawk X4S R7800 [ 0.000000] Memory policy: Data cache writealloc [ 0.000000] On node 0 totalpages: 122880 [ 0.000000] free_area_init_node: node 0, pgdat c0a27880, node_mem_map dda39000 [ 0.000000] Normal zone: 960 pages used for memmap [ 0.000000] Normal zone: 0 pages reserved [ 0.000000] Normal zone: 122880 pages, LIFO batch:31 [ 0.000000] random: get_random_bytes called from 0xc09008dc with crng_init=0 [ 0.000000] percpu: Embedded 15 pages/cpu s29388 r8192 d23860 u61440 ________________________________ From: Matt Johnston Sent: Friday, March 20, 2020 3:50 AM To: Horshack ?? Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, That's an interesting failure. You should be able to disable SMP if you set maxcpus=1 as a kernel boot argument - not sure where you would set that for your device though. I guess the other option is that a kernel syscall somewhere is clobbering registers, disabling SMP wouldn't avoid that... Which kernel is it running, and what's the CPU (/proc/cpuinfo)? Cheers, Matt On Fri 20/3/2020, at 3:28 pm, Horshack ?? 
> wrote: Update - I have isolated the intermittent issue down to the interchangeable functions s_mp_exptmod_fast() and s_mp_exptmod() - by default s_mp_exptmod_fast() is compiled instead of s_mp_exptmod() [BN_MP_EXPTMOD_FAST_C] but both functions intermittently fail and I decided to use s_mp_exptmod() as my focus because it's slightly simpler. s_mp_exptmod() is called indirectly by rsa.c::buf_put_rsa_sign()'s call to mp_exptmod(). For the intermittent failing case if I call mp_exptmod() / s_mp_exptmod() immediately again with the same source mp_int structures it yields the correct data. Example - debug code bolded: DEF_MP_INT(rsa_s_backup); DEF_MP_INT(rsa_s_backup_2); mp_copy (&rsa_s, &rsa_s_backup); mp_copy (&rsa_s, &rsa_s_backup_2); if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup) != MP_OKAY) { dropbear_exit("RSA error"); } if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup_2) != MP_OKAY) { dropbear_exit("RSA error"); } printf("after mp_exptmod\n"); dump_mp_int("rsa_s", &rsa_s); dump_mp_int("rsa_s_backup", &rsa_s_backup); dump_mp_int("rsa_s_backup_2", &rsa_s_backup_2); comp_mp_int("rsa_s", "rsa_s_backup", &rsa_s, &rsa_s_backup); comp_mp_int("rsa_s_backup", "rsa_s_backup_2", &rsa_s_backup, &rsa_s_backup_2); mp_clear(&rsa_s_backup); mp_clear(&rsa_s_backup_2); Sample output from a failure, which contains the first portion of each mp_int->dp. Bolded text has wrong data: after mp_exptmod rsa_s [0xbef6c358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 e1 8f 00 J...........0... rsa_s->dp [0x008fe130]: 0000 05 fb c0 0f 68 91 ff 0a 9f 05 57 0b 35 a2 bd 05 ....h.....W.5... 0010 57 ec a0 0b 34 3c b1 0f fa 8b b5 08 ed aa 9c 04 W...4<.......... 0020 7e 88 bb 04 12 42 51 05 9a 6d 7d 0a 98 ef 12 0c ~....BQ..m}..... 0030 76 e0 f4 0f ea 89 d7 0c 87 b0 76 03 12 a1 2d 0e v.........v...-. 0040 d7 3c df 06 0f 54 92 04 23 90 .<...T..#. 
rsa_s_backup [0xbef6c398]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 00 d8 8f 00 J............... rsa_s_backup->dp [0x008fd800]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s_backup_2 [0xbef6c3a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 d1 8f 00 J............... rsa_s_backup_2->dp [0x008fd1e0]: 0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06 ............Ea.. 0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d ..Y..I$.P.....\. 0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07 ..<.3...(....i.. 0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08 ............q... 0040 bc 4f c7 0d de 73 f9 06 0d bf .O...s.... rsa_s and rsa_s_backup differ Sometimes it's the second or third call that yields the incorrect data. In this instance it was the second call: after mp_exptmod rsa_s [0xbe9a6358]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 c1 40 02 J...........0. at . rsa_s->dp [0x0240c130]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.ldp [0x0240b800]: 0000 df 86 0c 0a 6c 2f 68 09 f9 a1 37 01 26 02 e7 0b ....l/h...7.&... 0010 69 5c b8 0e 0b 95 3a 0d 26 24 00 0e 97 6f dc 0b i\....:.&$...o.. 0020 64 95 ed 0a c0 75 53 03 66 3d ff 0b 26 4b ce 09 d....uS.f=..&K.. 0030 89 12 d2 03 9b 9b 0b 09 19 2c 5a 00 2c 99 fc 0b .........,Z.,... 0040 ea ad 61 09 38 e1 6a 0a 49 a5 ..a.8.j.I. rsa_s_backup_2 [0xbe9a63a8]: 0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 b1 40 02 J............. at . rsa_s_backup_2->dp [0x0240b1e0]: 0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06 %....b...-...... 
0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08 ?....]..-.K.l> Sent: Thursday, March 19, 2020 7:11 AM To: Matt Johnston > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Thanks Matt, I'll give that a shot when I get a build environment set up for the server-side/openwrt. I also plan to look at the RSA blinding logic in buf_put_rsa_sign(). Considering the intermittency of the issue I'm thinking the issue has some correlation or dependency to the random data generated or transformed by that logic. Crypto is well outside my core competency so it'll be slow-going. ________________________________ From: Matt Johnston > Sent: Thursday, March 19, 2020 7:04 AM To: Horshack ?? > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, The first thing I'd try would be to build with -O0 compilation flags to rule out compiler optimisations doing something strange. Cheers, Matt On Thu 19/3/2020, at 3:42 pm, Horshack ?? > wrote: Update - I cloned and built the dbclient source so I could enable the debug tracing facility to get more information about the 'Bad hostkey signature'. The intermittent failure is detected in recv_msg_kexdh_reply() -> buf_rsa_verify() -> mp_cmd(). If I bypass the buf_rsa_verify() call then the session proceeds normally without issue, which indicates everything else in the key exchange is working 100% of the time. I'll dig deeper to see why the signed host key sent by the server is wrong. ________________________________ From: Horshack ?? Sent: Wednesday, March 18, 2020 9:36 AM To: dropbear at ucc.asn.au > Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Hi, I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". 
For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (i.e., SSH'ing into the router from the router via an existing remote SSH session). The failure appears to be at the tail end of the key exchange, before authentication.

I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha256@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause.

Once an SSH login succeeds the connection is stable. However if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues.

I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent I'm thinking it's some variant in the key generation/exchange algorithm that's failing due to some issue with the router, or a more remote possibility, an issue with the Dropbear implementation.
Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases: Failure Case: https://pastebin.com/MS2BtFmW Success Case: https://pastebin.com/c4j66Ga9 The only message I see from dropbear for a failed connection attempt is: authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819 authpriv.info dropbear[15948]: Exit before auth: Disconnect received Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200321/64061c06/attachment-0001.htm From matt at ucc.asn.au Tue Mar 24 23:11:13 2020 From: matt at ucc.asn.au (Matt Johnston) Date: Tue, 24 Mar 2020 23:11:13 +0800 Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 In-Reply-To: References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> Message-ID: <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au> Good work narrowing down a test case there. That's an interesting finding - I guess it might be worth posting on OpenWRT lists/forum to try find other testers. Could it be power related if the tight multiplication loop is stressing it somehow? It doesn't seem to be using the Neon instruction for anything apart from loads/stores though - is there something that the compiler should be doing mixing Neon and non-Neon operations? Cheers, Matt (Your emails got held up being over 100kB, I've trimmed the reply below and let them through. Apologies to everyone for the stale old one that got let through with them just now, I wasn't looking closely) > On Tue 24/3/2020, at 11:23 am, Horshack ?? wrote: > > I was able to isolate the issue to just a handful of assembly instructions within fast_s_mp_sqr(), related to the squaring loop. I broke that code out into a separate utility that reproduces the issue within a few seconds. 
> The failure is somewhat sensitive to the data pattern and very sensitive to timing, indicating a likely memory/data path issue within my particular router. I'm guessing it's the IPQ8065 and not the SDRAM because I can get it to fail with a tiny data set that easily fits within DCACHE. I can alter the frequency of the failure with a single ARM memory barrier instruction, which at first implied a superscalar data ordering condition, but the memory barrier also alters the timing through the DCACHE so that is likely the effect it's having. I was able to exclude VFP/Neon register corruption as the cause with some test code. I also excluded any context switch-specific issue by measuring the # of context switches in /proc/<pid>/status and catching a failure where no switches had occurred. I also modified the affinity so the utility runs on just one processor to rule out a specific core having the issue.
>
> I put the source and binary of my utility on github - if anyone on this mailing list has this model router can you give it a try if possible? You only need the ipq8065-sqrbug (binary) and run-ipq8065-sqrbug.sh (script). Here's the link to the repository: https://github.com/horshack-dpreview/ipq8065-sqrbug
>
> From: Horshack
> Sent: Saturday, March 21, 2020 7:54 AM
> To: dropbear at ucc.asn.au
> Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
>
> Including mailing list for my last two messages below...
>
> Begin forwarded message:
>
>> From: Horshack
>> Date: March 21, 2020 at 7:35:18 AM PDT
>> To: Matt Johnston
>> Cc: "dropbear at ucc.asn.au"
>> Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
>>
>> Disassembly of fast_s_mp_sqr() and other libtommath functions reveals gcc is utilizing the ARM NEON SIMD instructions and registers for calculations involved with libtommath's mp_word scalar. Based on the 64-bit word corruption I see, I'm guessing the SIMD registers aren't being preserved/restored properly somewhere, probably during a context switch, specifically s16-s31 (d8-d15, q4-q7), which AAPCS says must be preserved and which I see being used in the disassembly of fast_s_mp_sqr(). I'll write some test code later today to see if this is the case, and if so, try to track down where and why the registers aren't being preserved.
>>
>> From: Horshack
>> Sent: Saturday, March 21, 2020 1:11 AM
>> To: Matt Johnston
>> Cc: dropbear at ucc.asn.au
>> Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
>>
>> I have one of the failure paths isolated down to a single corrupt 64-bit word in memory, which required a significant amount of code instrumentation to achieve. I implemented a code execution history buffer that gets filled at various checkpoints within s_mp_exptmod() and some of the modules called by it. To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local variables inside a structure, which saves the local scalar vars in addition to crc32's of all the mp_int data structures, with a separate crc32 of the mp_int.dp payload (data). When a failure occurs, i.e. one or more of the three back-to-back debug invocations of s_mp_exptmod yields a mismatching signed key result, I dump out the history elements for each of the invocations to determine the first code checkpoint where the failing invocation departed from the known correct invocation.

*snipped*

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200324/cc7b1c69/attachment.htm
From s.gottschall at dd-wrt.com Wed Mar 25 12:13:16 2020
From: s.gottschall at dd-wrt.com (Sebastian Gottschall)
Date: Wed, 25 Mar 2020 05:13:16 +0100
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To: <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au>
References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au>
Message-ID: <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com>

if the corruption is caused by a context switch the problem can be caused by the kernel. try the following and disable "CONFIG_KERNEL_MODE_NEON" in the kernel config. this will disable some kernel crypto assembly code

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200325/c4f3719f/attachment-0001.htm
From horshack at live.com Wed Mar 25 12:25:41 2020
From: horshack at live.com (Horshack)
Date: Wed, 25 Mar 2020 04:25:41 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To: <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com>
References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au>, <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com>
Message-ID:

I excluded context switches as a possible culprit by looping until a corruption happened for which no context switches occurred while the test was running. That is, at the start of the test I would save the # of involuntary/voluntary context switches from /proc/<pid>/status, then check those counts again after the failure - if they were different I restarted the test and kept looping until a failure happened in which the ctx switch counts were the same.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200325/e016930f/attachment-0001.htm
From s.gottschall at dd-wrt.com Wed Mar 25 12:57:36 2020
From: s.gottschall at dd-wrt.com (Sebastian Gottschall)
Date: Wed, 25 Mar 2020 05:57:36 +0100
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au> <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com>
Message-ID: <71fcd2e2-7ece-1f7c-aebc-6b7f948c8802@dd-wrt.com>

how can you make sure that no context switch is happening if the kernel uses neon instructions by itself? by stopping the kernel? this is fairly impossible. check if this option is on, and disable it to make sure that the kernel does not make use of neon instructions

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200325/cb03e753/attachment-0001.htm
From ada at thorsis.com Thu Mar 26 18:45:03 2020
From: ada at thorsis.com (Alexander Dahl)
Date: Thu, 26 Mar 2020 11:45:03 +0100
Subject: [PATCH 0 of 1] Fix build
In-Reply-To:
References:
Message-ID: <198158015.XBfoIZF8hE@ada>

Hello,

On Wednesday, 11 March 2020, 11:04:03 CET, Alexander Dahl wrote:
> I'm currently working on upgrading dropbear in the ptxdist embedded
> Linux build system. ptxdist has a quite strict policy to explicitly pass
> all available options of ./configure in order to have a predictable
> build result. So I tried passing --disable-fuzz which did not have the
> expected effect.
> I confirmed the broken behaviour on my workstation
> (Debian 9 (stretch) on amd64) and this is the fix I came up with. It's
> just mimicking what the other options do, I have only very little
> experience with autotools, so please review carefully.

Gentle ping on this patch.

> And sorry for the messed up mails from yesterday, my Mercurial skills
> are a bit rusty and sending patches with hg is a little different from
> Git ;-)

I noticed pull requests from the GitHub mirror were integrated, are patches sent to this mailing list still considered for inclusion?

Greets
Alex

From matt at ucc.asn.au Fri Mar 27 23:27:26 2020
From: matt at ucc.asn.au (Matt Johnston)
Date: Fri, 27 Mar 2020 23:27:26 +0800
Subject: [PATCH 0 of 1] Fix build
In-Reply-To: <198158015.XBfoIZF8hE@ada>
References: <198158015.XBfoIZF8hE@ada>
Message-ID:

> On Thu 26/3/2020, at 6:45 pm, Alexander Dahl wrote:
>
> Gentle ping on this patch.

Hi Alex,

Sorry for the delay, it's merged now.

Cheers,
Matt

From horshack at live.com Sun Mar 29 04:06:12 2020
From: horshack at live.com (Horshack)
Date: Sat, 28 Mar 2020 20:06:12 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au>, <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com>,
Message-ID:

As a postscript, I was able to refine the logic to produce the corrupted result almost instantaneously. I'm also able to get it to fail with an all-zero input dataset and a bitwise OR operation instead of the original squaring multiplication operations, which allows me to see what the actual corrupted loads are. The result is very interesting - sometimes the corrupted data is valid ARM instructions, other times valid kernel-space addresses, so it seems clear this is an addressing problem.
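A minimal sketch of the all-zero/OR variant described above, assuming a plain userspace C program. The names or_all_words and count_bad_passes are illustrative only and are not taken from the ipq8065-sqrbug repository. The point is that every load from a zeroed buffer must return 0, so any bit left set in the accumulated OR is itself the value that was wrongly loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* OR together every word of buf. The volatile qualifier forces a real
 * load for each element instead of letting the compiler fold the loop. */
uint32_t or_all_words(const volatile uint32_t *buf, size_t nwords)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < nwords; i++)
        acc |= buf[i];
    return acc;
}

/* Zero the buffer once, then scan it repeatedly. Returns how many passes
 * produced a non-zero OR, i.e. saw data that was never stored there. */
unsigned long count_bad_passes(uint32_t *buf, size_t nwords, unsigned long passes)
{
    unsigned long bad = 0;

    assert(buf != NULL);
    memset(buf, 0, nwords * sizeof *buf);
    for (unsigned long p = 0; p < passes; p++) {
        if (or_all_words(buf, nwords) != 0)
            bad++;
    }
    return bad;
}
```

On healthy hardware count_bad_passes() should return 0 for any number of passes. How the failure is actually provoked on the R7800 (data set size, timing, loop shape) is defined by the utility in the repository, not by this sketch.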
Also interesting is how I'll see just one or a few corrupted words, which implies the corruption is in the interface between DCACHE and the processor rather than errant fetch of a line into DCACHE from memory (otherwise the entire DCACHE line would hold corrupt data). You can see a sample of the failure output here: https://github.com/horshack-dpreview/ipq8065-sqrbug/blob/master/SampleFailures.txt Finally, to exclude any possibility the issue is related to possible kernel code running and corrupting register sets/memory (such as an interrupt routine), I ported the test to a kernel module and ran the logic within a local_irq_disable() block, which disables both preemption and interrupts on the core. Still fails. I created a separate repository for the kernel module version here: https://github.com/horshack-dpreview/ipq8065-sqrbug-driver ________________________________ From: Horshack ?? Sent: Tuesday, March 24, 2020 9:25 PM To: Sebastian Gottschall ; dropbear at ucc.asn.au Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 I excluded context switches as a possible culprit by looping until a corruption happened for which no context switches occurred while the test was running (ie, at the start of the test I would save the # of involuntary/voluntary context switches from /proc//status, then check those counts again after the failure - if they were different I restarted the test and kept looping until a failure happened in which the ctx switch counts were the same. ________________________________ From: dropbear-bounces+horshack=live.com at ucc.asn.au on behalf of Sebastian Gottschall Sent: Tuesday, March 24, 2020 9:13 PM To: dropbear at ucc.asn.au Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 if the corruption is caused by a context switch the problem can be caused by the kernel. try the following and disable "CONFIG_KERNEL_MODE_NEON" in the kernel config. 
this will disable some kernel crypto assembly code Am 24.03.2020 um 16:11 schrieb Matt Johnston: Good work narrowing down a test case there. That's an interesting finding - I guess it might be worth posting on OpenWRT lists/forum to try find other testers. Could it be power related if the tight multiplication loop is stressing it somehow? It doesn't seem to be using the Neon instruction for anything apart from loads/stores though - is there something that the compiler should be doing mixing Neon and non-Neon operations? Cheers, Matt (Your emails got held up being over 100kB, I've trimmed the reply below and let them through. Apologies to everyone for the stale old one that got let through with them just now, I wasn't looking closely) On Tue 24/3/2020, at 11:23 am, Horshack ?? > wrote: I was able to isolate the issue to just a handful of assembly instructions within fast_s_mp_sqr(), related to the squaring loop. I broke that code out into a separate utility that reproduces the issue within a few seconds. The failure is somewhat sensitive to the data pattern and very sensitive to timing, indicating a likely memory/data path issue within my particular router. I'm guessing it's the IPQ8065 and not the SDRAM because I can get it to fail with a tiny data set easily fits within DCACHE. I can alter the frequency of the failure with a single ARM memory barrier instruction, which at first implied a superscalar data ordering condition but the memory barrier also alters the timing through the DCACHE so that is likely the effect it's having. I was able to exclude the VFP/Neon register corruption as the cause with some test code. I also excluded any context switch-speciifc issue by measuring the # of context switches in /proc//status and catching a failure where no switches had occurred. I also modified the affinity so the utility runs on just one processor to rule out a specific core having the issue. 
I put the source and binary of my utility on github - if anyone on this mailing list has this model router can you give it a try if possible? You only need the ipq8065-sqrbug (binary) and run-ipq8065-sqrbug.sh (script). Here's the link to the repository: https://github.com/horshack-dpreview/ipq8065-sqrbug ________________________________ From: Horshack ?? > Sent: Saturday, March 21, 2020 7:54 AM To: dropbear at ucc.asn.au > Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 Including mailing list for my last two messages below... Begin forwarded message: From: Horshack ?? > Date: March 21, 2020 at 7:35:18 AM PDT To: Matt Johnston > Cc: "dropbear at ucc.asn.au" > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 ? Disassembly of fast_s_mp_sqr() and other libtommath functions reveals gcc is utilizing the arm NEON SIMD instructions and registers for calculations involved with libtommath's mp_word scalar. Based on the 64-bit word corruption I see I'm guessing the SIMD registers aren't being preserved/restored properly somewhere, probably during a context switch, specifically s16?s31 (d8?d15, q4?q7), which AAPCS says must be preserved and which I see being used in the disassembly of fast_s_mp_sqr(). I'lll write some test code later today to see if this is the case, and if so, try to track down where and why the registers aren't being preserved. ________________________________ From: Horshack ?? > Sent: Saturday, March 21, 2020 1:11 AM To: Matt Johnston > Cc: dropbear at ucc.asn.au > Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800 I have one of the failure paths isolated down to a single corrupt 64-bit word in memory, which required a significant amount of code instrumentation to achieve. I implemented a code execution history buffer that gets filled at various checkpoints within s_mp_exptmod() and some of the modules called by it. 
To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local variables inside a structure, which consists of saving the local scalar vars in addition to crc32's of all the mp_int data structures with a separate crc32 of the mp_int.dp payload (data). When a failure occurs, i.e. one or more of the three back-to-back debug invocations of s_mp_exptmod yields a mismatching signed key result, I dump out the history elements for each of the invocations to determine the first code checkpoint where the failing invocation departed from the known correct invocation.

*snipped*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200328/9c5a0351/attachment-0001.htm

From s.gottschall at dd-wrt.com Sun Mar 29 05:32:23 2020
From: s.gottschall at dd-wrt.com (Sebastian Gottschall)
Date: Sat, 28 Mar 2020 22:32:23 +0100
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To:
References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au> <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com>
Message-ID: <53c03edd-fa50-637e-e831-efde901da6f9@dd-wrt.com>

I can exclude the NEON code for dd-wrt in dropbear if it helps, but it would be better to nail down the problem. Otherwise other programs would likely be affected too.

On 28.03.2020 at 21:06, Horshack wrote:
> As a postscript, I was able to refine the logic to produce the
> corrupted result almost instantaneously. I'm also able to get it to
> fail with an all-zero input dataset and a bitwise OR operation instead
> of the original squaring multiplication operations, which allows me to
> see what the actual corrupted loads are. The result is very interesting -
> sometimes the corrupted data is valid ARM instructions, other times
> valid kernel-space addresses, so it seems clear this is an addressing
> problem. Also interesting is how I'll see just one or a few corrupted
> words, which implies the corruption is in the interface between DCACHE
> and the processor rather than an errant fetch of a line into DCACHE from
> memory (otherwise the entire DCACHE line would hold corrupt data). You
> can see a sample of the failure output here:
> https://github.com/horshack-dpreview/ipq8065-sqrbug/blob/master/SampleFailures.txt
>
> Finally, to exclude any possibility the issue is related to possible
> kernel code running and corrupting register sets/memory (such as an
> interrupt routine), I ported the test to a kernel module and ran the
> logic within a local_irq_disable() block, which disables both
> preemption and interrupts on the core. Still fails. I created a
> separate repository for the kernel module version here:
> https://github.com/horshack-dpreview/ipq8065-sqrbug-driver
>
> ------------------------------------------------------------------------
> *From:* Horshack
> *Sent:* Tuesday, March 24, 2020 9:25 PM
> *To:* Sebastian Gottschall; dropbear at ucc.asn.au
> *Subject:* Re: SSH key exchange fails 30-70% of the time on Netgear
> X4S R7800
> I excluded context switches as a possible culprit by looping until a
> corruption happened for which no context switches occurred while the
> test was running (i.e., at the start of the test I would save the # of
> involuntary/voluntary context switches from /proc//status, then
> check those counts again after the failure - if they were different I
> restarted the test and kept looping until a failure happened in which
> the ctx switch counts were the same).
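The context-switch check described in the quoted message can be sketched roughly as follows. This is a minimal illustration and not the code from the ipq8065-sqrbug repository: the function names `ctxt_switch_total` and `fail_without_ctxt_switch` are hypothetical, the sketch reads `/proc/self/status` (an assumption about which status file is meant), and it relies on the `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` fields that Linux reports there.

```c
#include <assert.h>
#include <stdio.h>

/* Sum of voluntary and nonvoluntary context switches for the calling
 * process, read from /proc/self/status (Linux only; path is an assumption).
 * Returns -1 if the file or either field cannot be read. */
long ctxt_switch_total(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long vol = -1, nonvol = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* sscanf leaves the target untouched when the prefix does not match */
        sscanf(line, "voluntary_ctxt_switches: %ld", &vol);
        sscanf(line, "nonvoluntary_ctxt_switches: %ld", &nonvol);
    }
    fclose(f);
    return (vol < 0 || nonvol < 0) ? -1 : vol + nonvol;
}

/* Hypothetical harness: call test() until it reports a failure (nonzero)
 * in a window during which the context-switch total did not change.
 * Returns the number of attempts used, or -1 if no such failure was
 * observed within max_attempts. */
int fail_without_ctxt_switch(int (*test)(void), int max_attempts)
{
    assert(test != NULL);
    for (int i = 1; i <= max_attempts; i++) {
        long before = ctxt_switch_total();
        int failed = test();
        long after = ctxt_switch_total();

        if (failed && before >= 0 && before == after)
            return i;   /* failure with no context switch in between */
    }
    return -1;
}
```

A failure that is caught this way cannot be blamed on register state being saved or restored incorrectly across a context switch, which is the conclusion the message draws.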
>
> ------------------------------------------------------------------------
> *From:* dropbear-bounces+horshack=live.com at ucc.asn.au on behalf of
> Sebastian Gottschall
> *Sent:* Tuesday, March 24, 2020 9:13 PM
> *To:* dropbear at ucc.asn.au
> *Subject:* Re: SSH key exchange fails 30-70% of the time on Netgear
> X4S R7800
>
> If the corruption is caused by a context switch, the problem may be
> caused by the kernel.
> Try disabling "CONFIG_KERNEL_MODE_NEON" in the kernel config.
> This will disable some kernel crypto assembly code.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200328/d2548b88/attachment-0001.htm

From horshack at live.com Sun Mar 29 06:21:46 2020
From: horshack at live.com (Horshack)
Date: Sat, 28 Mar 2020 22:21:46 +0000
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800
In-Reply-To: <53c03edd-fa50-637e-e831-efde901da6f9@dd-wrt.com>
References: <44E94890-EAC4-4E9B-89AA-CE81087B7EFA@ucc.asn.au> <1E0D44CA-5341-4C3A-924E-8C6BF850B64A@ucc.asn.au> <2ec5919a-a90b-6437-8fbd-4922fb6fb121@dd-wrt.com> <53c03edd-fa50-637e-e831-efde901da6f9@dd-wrt.com>
Message-ID:

Part of the refinement to get the test to fail faster was changing the _W operand from 64 to 32 bits, which also eliminated the use of the NEON/VFP registers. The revised disassembly is in sqr.c (https://github.com/horshack-dpreview/ipq8065-sqrbug/blob/master/sqr.c). The essential nature of what causes the corruption to manifest is the ping-ponging of the reads from both ends of the input array, collapsing inward in each iteration.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.ucc.gu.uwa.edu.au/pipermail/dropbear/attachments/20200328/64112516/attachment-0001.htm
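For readers who want a picture of the access pattern described in this thread, here is a minimal sketch of a ping-pong read loop over an all-zero buffer using a bitwise OR. It is an illustration only and not the code in sqr.c from the repository: the function name `pingpong_or`, the 32-bit word type, and the use of `volatile` to keep the compiler from eliding the loads are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OR together words read alternately from the low end and the high end of
 * buf, moving both indices inward on each iteration. Every element is read
 * exactly once. With an all-zero buffer, any nonzero return value means at
 * least one load delivered data that was never stored in the buffer. */
uint32_t pingpong_or(const volatile uint32_t *buf, size_t n)
{
    uint32_t acc = 0;
    size_t lo = 0, hi = n;

    assert(buf != NULL);
    while (lo < hi) {
        acc |= buf[lo++];        /* read from the front */
        if (lo < hi)
            acc |= buf[--hi];    /* then from the back */
    }
    return acc;
}
```

Because OR with zero leaves the accumulator unchanged, the returned value on a zeroed buffer is exactly the union of whatever bits the corrupted loads contained, which is why this variant makes the bad data directly visible. On correctly working hardware the function is expected to return 0 for a zeroed buffer every time.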