45s login delay

man at lundinova.se
Wed Mar 16 18:51:00 WST 2011


Hi Peter,

Thanks for your reply, that's very informative.

I tried OpenSSL and got down to around the 10 s you're talking about, but that
doesn't leave much room for anything else.
I haven't found any real specs for the CPU - the manufacturer lists the type
and size of the cache for their other CPUs but not for this one, so I fear the
worst.

I'll try your converted code and see where that brings me.

Kind regards/Magnus

Quoting Peter Turczak <peter at turczak.de>:

> Hi Magnus,
> hi Rob,
>
> a while ago I made the same observations you did. On a 166 MHz m68k-nommu
> the RSA exchange took close to forever. After some profiling I found that
> the comba multiply routine in libtommath was eating most of the time; gcc
> seems to produce quite inefficient code there. Libtommath also resizes its
> large integers during the calculation, which means extra work for user-space
> memory management. I therefore converted dropbear to use libtomsfastmath,
> which helped a lot at the expense of a larger memory footprint. After
> porting some parts to assembler (libtomsfastmath has special hooks for that)
> I cut the time down to about 10 s, which is IMHO much better.
>
> The version I did was more of a proof of concept - it is neither polished
> nor packaged, but it will compile - so maybe you could have a look at it.
> (http://peter.turczak.de/dropbear-tfm.tgz)
>
> Rob is right in a way, but OpenSSL uses assembler throughout. Furthermore,
> a missing L1 cache will help slow the key exchange to a crawl.
>
> Best regards,
>
> Peter
>
> On Mar 15, 2011, at 10:25 PM, Rob Landley wrote:
>
> > On 03/15/2011 08:02 AM, Magnus Nilsson wrote:
> >> Sorry, I was unclear - it's only 100% busy during those 45s.
> >>
> >> This is what it looks like if I first start the load monitor (-r outputs
> >> 1 sample/second), then start to log in from a remote ssh client:
> >> # cpu -r
> >> CPU:  busy 0%  (system=0% user=0% nice=0% idle=100%)
> >> CPU:  busy 24%  (system=4% user=19% nice=0% idle=75%)
> >> CPU:  busy 100%  (system=1% user=98% nice=0% idle=0%)
> >> CPU:  busy 100%  (system=0% user=100% nice=0% idle=0%)
> >> <39 repeats of the above busy 100%>
> >> CPU:  busy 100%  (system=2% user=97% nice=0% idle=0%)
> >> CPU:  busy 100%  (system=8% user=91% nice=0% idle=0%)
> >> CPU:  busy 100%  (system=22% user=77% nice=0% idle=0%)
> >> CPU:  busy 100%  (system=0% user=100% nice=0% idle=0%)
> >> CPU:  busy 100%  (system=0% user=100% nice=0% idle=0%)
> >> CPU:  busy 67%  (system=8% user=58% nice=0% idle=32%)
> >> CPU:  busy 0%  (system=0% user=0% nice=0% idle=100%)
> >>
> >> Thanks for the tip on a prebuilt busybox, Rob, but wouldn't I need it in
> >> flat format? I don't think arm-elf-elf2flt can produce that without
> >> relocation info, can it? And from the output above I don't think it
> >> would tell us much more.
> >>
> >> My question is:
> >> Is 45 s reasonable on a 192 MHz CPU,
> >
> > No.  I had a 200 MHz Celeron that did 3.2 ssh logins per second ten years
> > ago.  (I built a VPN on top of ssh, dynamic port forwarding, and netcat,
> > and had to benchmark it.)  Going from i686 to ARM could cost you some
> > performance (ever since the Pentium, x86 has had multiple execution
> > units, speculative execution, instruction reordering and such), but
> > there's no _way_ that's more than an order of magnitude.  I could see
> > 4 seconds, but 45 seconds is pathological.  Something is wrong.
> >
> > My next step would be "stick printfs in the source code and see where
> > the big delay is".
> >
> > Hmmm...  Do they still _make_ CPUs with no L1 cache?
> >
> > Rob
>
>

More information about the Dropbear mailing list