Using the DroboFS crypto hardware acceleration
10-17-2012, 10:10 AM (This post was last modified: 10-18-2012 12:34 AM by ricardo.)
Post: #1
Using the DroboFS crypto hardware acceleration
I always wondered about the "Cryptographic Engine" that ships with the DroboFS's motherboard (Feroceon-MV78200).

According to the functional specification, page 22 [1]:
Quote:
The Cryptographic Engine supports AES, DES, and 3DES encryption algorithms, and SHA1 and MD5 authentication algorithms.

So I wondered if this could be used in something like OpenSSL. It turns out it can [2].

To compile OpenSSL 1.0.1c with support for the hardware acceleration, two things are needed:
1) Copy the kernel header file ./crypto/ocf/cryptodev.h into the crypto/ folder under the OpenSSL source folder.
2) Use these arguments when configuring OpenSSL: -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS

I compiled versions with -O3 and -Os for both -marm and -mthumb and here are the before/after results.

Before for -Os -mthumb:
Code:
$ sudo -s chmod o-rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 1110658 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 318630 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 82762 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 20894 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 2624 aes-128-cbc's in 3.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       5903.83k     6774.86k     7038.89k     7108.12k     7141.46k

Before for -O3 -mthumb:
Code:
$ sudo -s chmod o-rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 1333854 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 396645 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 97872 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 26297 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 3295 aes-128-cbc's in 3.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       7090.25k     8433.65k     8822.26k     8946.22k     8967.65k

Before for -O3 -marm:
Code:
$ sudo -s chmod o-rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 1676376 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 498816 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 130130 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 33163 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 4165 aes-128-cbc's in 3.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       8910.97k    10606.05k    11067.53k    11282.03k    11335.44k

So far, the optimization level has only a moderate impact on the performance of OpenSSL: at 8192-byte blocks, throughput goes from about 7.1 MB/s (-Os -mthumb) to about 11.3 MB/s (-O3 -marm). What happens if we enable the hardware acceleration?

After for -Os -mthumb:
Code:
ricardo@DroboFS:~/tmp$ sudo -s chmod o+rw /dev/crypto                                                                  
ricardo@DroboFS:~/tmp$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 103967 aes-128-cbc's in 0.17s
Doing aes-128-cbc for 3s on 64 size blocks: 88522 aes-128-cbc's in 0.11s
Doing aes-128-cbc for 3s on 256 size blocks: 56152 aes-128-cbc's in 0.09s
Doing aes-128-cbc for 3s on 1024 size blocks: 22848 aes-128-cbc's in 0.03s
Doing aes-128-cbc for 3s on 8192 size blocks: 3290 aes-128-cbc's in 0.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       9785.13k    51503.71k   159721.24k   779878.40k  2695168.00k

After for -O3 -mthumb:
Code:
$ sudo -s chmod o+rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 105181 aes-128-cbc's in 0.12s
Doing aes-128-cbc for 3s on 64 size blocks: 89682 aes-128-cbc's in 0.08s
Doing aes-128-cbc for 3s on 256 size blocks: 56550 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 1024 size blocks: 22903 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 8192 size blocks: 3286 aes-128-cbc's in 0.00s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      14024.13k    71745.60k   289536.00k   469053.44k         infk

After for -O3 -marm:
Code:
$ sudo -s chmod o+rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 104514 aes-128-cbc's in 0.10s
Doing aes-128-cbc for 3s on 64 size blocks: 83879 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 256 size blocks: 56418 aes-128-cbc's in 0.02s
Doing aes-128-cbc for 3s on 1024 size blocks: 22909 aes-128-cbc's in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 3291 aes-128-cbc's in 0.00s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      16722.24k   107365.12k   722150.40k  2345881.60k         infk

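If you are saying "wow!" now, I'm right there with you. Reported CPU time drops from 3 seconds to practically 0.0 seconds. In terms of operations completed in the 3-second window at 8192-byte blocks, the software-only "-O3 -marm" build is still quite a bit ahead of the accelerator (4165 vs 3291, i.e., a 21% drop with acceleration), but "-Os -mthumb" gets beaten (2624 vs 3290, i.e., a 25% increase).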
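
A note on how to read the "numbers" rows: they are consistent with operations × block size ÷ measured time, printed in 1000s of bytes per second. The sketch below (Python, not part of the original benchmark; the function name throughput_k is made up for illustration) reproduces a few of the figures above and shows why a near-zero CPU time produces huge values or "infk". The zero-time branch is an assumption added so that Python returns infinity instead of raising an error.

```python
import math

def throughput_k(count: int, block_size: int, seconds: float) -> float:
    """Thousands of bytes per second, as in the 'numbers' rows above."""
    if seconds == 0:
        # A measured time of 0.00s is what shows up as 'infk' in the output.
        return math.inf
    return count * block_size / seconds / 1000

# Software only, -Os -mthumb, 16-byte blocks: 1110658 operations in 3.01s
print(f"{throughput_k(1110658, 16, 3.01):.2f}k")
# Accelerated, -Os -mthumb, 8192-byte blocks: 3290 operations in 0.01s
print(f"{throughput_k(3290, 8192, 0.01):.2f}k")
# Accelerated, -O3 -marm, 8192-byte blocks: 3291 operations in 0.00s
print(f"{throughput_k(3291, 8192, 0.0):.2f}k")
# The same 3290 operations divided by the nominal 3-second window instead
print(f"{throughput_k(3290, 8192, 3.0):.2f}k")
```

The last line assumes the run really lasted the nominal 3 seconds; under that assumption the accelerated 8192-byte figure is roughly 9 MB/s rather than gigabytes per second, so the very large values say more about CPU time saved than about bytes actually moved per second.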

Since it is a good idea to spare as much CPU time as we can on the FS, I'll be posting hardware-accelerated versions of OpenSSH, OpenVPN and other apps that depend on OpenSSL as they get updated.


[1]: http://www.marvell.com/embedded-processors/discovery-innovation/assets/FS_MV76100_78100_78200_Open
[2]: http://www.altechnative.net/2011/05/22/hardware-accelerated-ssl-on-marvell-kirkwood-arm-using-openssl-

Tap the full potential of the Drobo FS/5N with DroboApps.
To receive the latest updates about my DroboApps circle me on Google Plus.
10-17-2012, 12:57 PM
Post: #2
RE: Using the DroboFS crypto hardware acceleration
Ricardo, this latest achievement is possibly your greatest yet.
Your knowledge of the DroboFS, related hardware, and how to compile and tune software is a marvel. Your willingness and ability to communicate clearly is rare and deeply appreciated.
We're unreasonably lucky to have you around; Drobo Inc. is luckiest of all.

Thank you, and keep up the great work. If you're ever near San Diego, the meal and drinks are on me. :)

alter ego of rdo
DroboPro and various other stuff
10-17-2012, 08:52 PM
Post: #3
RE: Using the DroboFS crypto hardware acceleration
The stuff you accomplish, ricardo, is nothing short of amazing. :)

Drobo 5N | 250GB Samsung 850 EVO mSATA | 2 x 4TB Seagate, 3 x 4TB Hitachi | FS/EXT3 diskpack
Peak performance >100MBps read/write (based on FS disk pack, no jumbo frames, no SSD)
DroboPorts: Plex, Transmission, OpenSSH, NFS, nano, screen, bash
10-18-2012, 03:27 AM
Post: #4
RE: Using the DroboFS crypto hardware acceleration
Thank you for the kind words. :)

10-20-2012, 05:05 PM
Post: #5
RE: Using the DroboFS crypto hardware acceleration
Status update: in a word, conflicted.

I noticed today that I missed a very important configuration option for openssl that enables very efficient assembler implementations of the same algorithms that the hardware accelerator offers.

So I ran a complete benchmark on openssl with and without the hardware accelerator, this time using -elapsed so that the numbers are more comparable. I added RC4 as a common baseline, since it isn't accelerated.

Without the accelerator:
Quote:
$ ./openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 2325805 aes-128-cbc's in 3.02s
Doing aes-128-cbc for 3s on 64 size blocks: 755154 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 204931 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 52761 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 6378 aes-128-cbc's in 3.01s
OpenSSL 1.0.1c 10 May 2012
built on: Sat Oct 20 02:02:05 CEST 2012
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /usr/local/arm-2007q1/bin/arm-none-linux-gnueabi-gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -march=armv5te -mtune=arm926ej-s -ffunction-sections -fdata-sections -DHAVE_CRYPTODEV -DTERMIO -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      12322.15k    16056.43k    17429.35k    17949.26k    17358.33k

With the accelerator:
Quote:
$ ./openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 100807 aes-128-cbc's in 3.02s
Doing aes-128-cbc for 3s on 64 size blocks: 85552 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 54175 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 22418 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 3137 aes-128-cbc's in 3.01s
OpenSSL 1.0.1c 10 May 2012
built on: Sat Oct 20 02:02:05 CEST 2012
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /usr/local/arm-2007q1/bin/arm-none-linux-gnueabi-gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -march=armv5te -mtune=arm926ej-s -ffunction-sections -fdata-sections -DHAVE_CRYPTODEV -DTERMIO -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc        534.08k     1819.05k     4607.57k     7626.59k     8537.64k

With the accelerator, but without -elapsed:
Quote:
$ ./openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 100553 aes-128-cbc's in 0.09s
Doing aes-128-cbc for 3s on 64 size blocks: 85470 aes-128-cbc's in 0.15s
Doing aes-128-cbc for 3s on 256 size blocks: 54064 aes-128-cbc's in 0.14s
Doing aes-128-cbc for 3s on 1024 size blocks: 21661 aes-128-cbc's in 0.02s
Doing aes-128-cbc for 3s on 8192 size blocks: 3213 aes-128-cbc's in 0.02s
OpenSSL 1.0.1c 10 May 2012
built on: Sat Oct 20 02:02:05 CEST 2012
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /usr/local/arm-2007q1/bin/arm-none-linux-gnueabi-gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -march=armv5te -mtune=arm926ej-s -ffunction-sections -fdata-sections -DHAVE_CRYPTODEV -DTERMIO -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      17876.09k    36467.20k    98859.89k  1109043.20k  1316044.80k

So what to make of this? It seems that, yes, the hardware accelerator does reduce CPU usage a lot. By a lot, I mean up to 2 orders of magnitude: roughly 3 seconds of CPU time down to between 0.02 and 0.15 seconds.
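
To put a rough number on that, here is a minimal sketch (Python, not part of the original benchmark) that turns the CPU times from the run without -elapsed into a share of the measurement window. It assumes each window lasts the nominal 3 seconds, which is roughly what the -elapsed runs report; the names WINDOW_SECONDS, cpu_share and reduction_factor are made up for illustration.

```python
# User CPU seconds reported without -elapsed (accelerator enabled),
# keyed by block size in bytes. Values copied from the run above.
cpu_seconds = {16: 0.09, 64: 0.15, 256: 0.14, 1024: 0.02, 8192: 0.02}

# Assumption: each measurement window lasts the nominal 3 seconds,
# close to the 3.01s to 3.02s that the -elapsed runs report.
WINDOW_SECONDS = 3.0

def cpu_share(cpu: float, window: float = WINDOW_SECONDS) -> float:
    """Fraction of the measurement window spent as user CPU time."""
    return cpu / window

def reduction_factor(cpu: float, window: float = WINDOW_SECONDS) -> float:
    """How many times less user CPU time than a run that keeps the CPU busy."""
    return window / cpu

for size, cpu in cpu_seconds.items():
    print(f"{size:>5} bytes: {cpu_share(cpu):6.1%} user CPU, "
          f"about {reduction_factor(cpu):.0f}x less than a CPU-bound run")
```

Under that assumption the reduction ranges from about 20x to about 150x depending on block size. Keep in mind that openssl speed reports user CPU time, so any time spent inside the kernel driver would likely not show up in these figures, and the times themselves are only reported to 0.01s.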

On the other hand, the raw throughput of the CPU (with the -D*_ASM flags) is much higher than the accelerator's for small blocks (roughly 23x at 16 bytes, 9x at 64 bytes and 4x at 256 bytes), and even for more realistic block sizes (1k) it is still more than double the throughput of the hardware accelerator.

At this point, the only advantage of having hardware acceleration is to reduce CPU usage. But if the cost of that is less throughput, I'm not really sure it makes sense.
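
For anyone who wants to check the comparison, this is a small sketch (Python, not part of the original benchmark; the helper name speedup is made up) that divides the software figures by the accelerated ones from the two -elapsed runs above, block size by block size.

```python
block_sizes = [16, 64, 256, 1024, 8192]

# 1000s of bytes per second, both measured with -elapsed (copied from above).
software = [12322.15, 16056.43, 17429.35, 17949.26, 17358.33]
accelerated = [534.08, 1819.05, 4607.57, 7626.59, 8537.64]

def speedup(soft: list[float], hard: list[float]) -> list[float]:
    """Software throughput divided by accelerator throughput, per block size."""
    return [s / h for s, h in zip(soft, hard)]

ratios = speedup(software, accelerated)
for size, ratio in zip(block_sizes, ratios):
    print(f"{size:>5} bytes: software is {ratio:.1f}x the accelerator")
```

The gap narrows as the block size grows, presumably because the fixed per-request cost of going through /dev/crypto is spread over more data, but the software build stays ahead at every size measured here.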

In fact, the best throughput comes from using RC4 (Cipher arcfour):
Quote:
$ ./openssl speed -evp rc4 -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing rc4 for 3s on 16 size blocks: 5584794 rc4's in 3.01s
Doing rc4 for 3s on 64 size blocks: 1787197 rc4's in 3.02s
Doing rc4 for 3s on 256 size blocks: 499440 rc4's in 3.01s
Doing rc4 for 3s on 1024 size blocks: 126446 rc4's in 3.01s
Doing rc4 for 3s on 8192 size blocks: 15826 rc4's in 3.01s
OpenSSL 1.0.1c 10 May 2012
built on: Sat Oct 20 02:02:05 CEST 2012
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /usr/local/arm-2007q1/bin/arm-none-linux-gnueabi-gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -march=armv5te -mtune=arm926ej-s -ffunction-sections -fdata-sections -DHAVE_CRYPTODEV -DTERMIO -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4              29686.61k    37874.37k    42477.29k    43016.85k    43071.96k
It is at least twice as fast as AES (more like 2.5x).

And to add insult to injury, OpenSSH does not like the hardware accelerator in the FS. When I try to connect to the server using AES-128 with the hardware acceleration enabled, this happens on the server side:
Quote:
# sshd -ddd -e
(...)
debug1: SSH2_MSG_KEX_DH_GEX_REPLY sent [preauth]
debug2: kex_derive_keys [preauth]
debug2: set_newkeys: mode 1 [preauth]
cipher_init: EVP_CipherInit: set key failed for aes128-cbc [preauth]
debug1: do_cleanup [preauth]
debug1: monitor_read_log: child log fd closed
debug3: mm_request_receive entering
debug1: do_cleanup
debug1: Killing privsep child 30610
And it dies. If I remove hardware acceleration, then the connection goes through without any problems.

I've googled around to see if anyone has a workaround or patch for this, but so far I have found no solution. Worse than that, I've seen quite a few reports that the same kind of problem happens with other apps, such as lighttpd, openvpn, etc. My impression is that for this thing to work properly we would need a newer version of the kernel. You can guess what the odds of that happening are...

10-23-2012, 06:44 AM (This post was last modified: 10-23-2012 06:45 AM by Paul.)
Post: #6
RE: Using the DroboFS crypto hardware acceleration
it's probably closely guarded as.....
wait for it..
"the kernel's secret recipe" :D

(btw i have XP home SP2, a Drobo v1 with 2x 1TB/2x 1.5TB WD greens, & a bkp Drobo v2 with the same + a DroboShare: unused)
& a DroboS v2 with 3xWD15EADS &2x1TB in DDR mode on win7, & a drobo5D (all usb)
  • btw i did a sustained (write) operation for about 6 hours, and got 13.2MB / sec ...objection? "sustained" :)
    (16.7MB/s on a v2 & 47-96MB/s drobo-s)
10-23-2012, 06:59 AM
Post: #7
RE: Using the DroboFS crypto hardware acceleration
I know you are joking, but I have been investigating how I could upgrade the FS's kernel by myself. Yeah, I'm that crazy.

10-23-2012, 08:21 AM
Post: #8
RE: Using the DroboFS crypto hardware acceleration
You're a good kind of crazy, ricardo. :)

--Brandon | WHS2011+Drive Bender/2x Drobo v2/Drobo S G1/ Drobo S G2/Transporter
Drobo provides fault-tolerance, it's NOT a substitute for regular backups.
Drobo Best Practices - Official and Community-sourced.
10-23-2012, 09:32 AM (This post was last modified: 10-23-2012 09:33 AM by diamondsw.)
Post: #9
RE: Using the DroboFS crypto hardware acceleration
(10-23-2012 08:21 AM) bhiga Wrote: You're a good kind of crazy ricardo. :)

I have never before wanted to see a "Like" button on a web site.
