
Modulo whaaat?š±
Squaring modulo n is just a fancy way of saying, āsquare this number, and then take the modulus n of it. For example, 42mod3, gives us 16mod3, and we know that 16%3=1. Easy!š¤·āāļø
Whatās not so easy, is squaring this š number š
0011010001001100010010011000101111011010000100100010001010101000010001111110101111010001001001000100000010100010000000000110101001010000110110110111010111111111011100100011001001111101100000110100010111101111111111011000011001010110000000011001110110101101
Squaring very large numbers, modulo n, is a backbone operation of modern day cryptography.
Itās used absolutely everywhere, from enabling secure communication to mining bitcoins.
But this post is not about the inner mechanics of squaring large numbers.
This post is just an attempt of estimating how many squarings per second
my Mac can peform.
Letās proceed.
Four Important Assumptions
Letās do some āback of the envelopeā calculations and try to guesstimate how man squarings mod n that our machine can perform per second. Weāre making the following four assumptions:
- Non-Specialized CPU Architecture: Performance might be affected by specialized hardware such as SIMD, but weāre going to use simple, off-the shelf consumer hardware.
- Non-Specialized Software: Squaring operation can be sped up by using specialized
software. For example, GMP, the GNU
Multiple Precision
Arithmetic Library, has a function called
mpz_pow
which is optimized for squaring operations. Thereās also something calledMontgomery reduction
which is software that can further speed up modular multiplication. Anyway, in our case, we wonāt use any specilized software. - Operation Cost relates to Work Size: The number of CPU cycles required to perform one single squaring mod n operation depends on the size of the modulus. The bigger the mod the longer it takes to square it. Modern industry application typically use very large mods, in the range of 2000 to 4000 bits. Weāre going to use a slightly smaller mod of 256. After all, this post is just about trying to gain some insight into what affects squaring mod n.
- Canāt Leverage Multi-Core: Squaring mod n is a sequential operation. Two, or more, cores wonāt matter because we canāt parallelize the squaring operation. So, even though a 1.3 GHz Dual-Core machine has a combined CPU speed of .3 *2 = 2.6 GHz, only one core can work on the problem. In our case, this places an upper bound on the CPU clock speed of 1.3 GHz.
Target Machine Specification
Letās look at the squaring power of the machine weāll use for this example. Itās a 2013 Mac Air, with the following spec:
- 1.3 GHz Dual-Core Intel Core i5
- 4GB 1600 MHz DDR3 RAM
1.3 GHz means that our CPU can run 1.3^9 cycles per second, per core. Dual-Core means that two cores can execute operations concurrently. That puts and upper bound on the theoretical CPU work capacity at 2.6 GHz, but only if the work is perfectly parallellized.
4 GB DDR3 RAM can affect overall system performance. It can be challenging to measure how many actual cycles per second that is available for squaring operations, since hundreds of other processes are active at the same time.
To simplify things, we ignore memory and effects if might have on squaring performance. Instead we focus exclusively on clockspeed.
Squaring Power
Googling around, it seems that modular squaring with a 2048-bit modulus takes approximately 2000-3000 CPU cycles on modern hardware.
Letās use that estimation and try to adapt it to our machine. And please remember, weāre doing back of the envelope calculation here so it could be way off. Since weāll be using a 256-bit mod, which is a quarter of a 2048-bit mod, letās reduce the 2000 CPU cycles down to 500 CPU cycles, as the guesstimated number of cycles to square a 2048-bit moduli.
Next, since a modern Apple MacBook Pro has a clockspeed of around 3.2 GHz (compared to our old 1.3 GHz Mac), letās take our 500 CPU cycles and half it down to 250 CPU cycles. So, with some pretty wild non-scientific assumptions we end up with and estimation that it will take our 1.3 GHz Mac to square a 256-bit moduli.
We are guesstimating that it takes our 1.3 GHz Mac around 250 CPU cycles to perform a single squaring modular n, for a 256-bit moduli.
Back of the envelope calculation
Using the upper bound of available CPU cycles (1.3 GHz) and the assumed avearge needed cycles per squaring mod n (250 cycles), we get: Squarings per second=Total CPU cycles per secondAssumed cycles per squaring
With machine spec: Squarings per second=1.3Ć109cycles/sec250cycles/squaring
We simpify:
Squarings per second=5,200,000
Interpretation:
with our simple back of the envelope calculation, we guesstimate that our 1.3 GHz machine can perform an average
of 5 million squaring mod n operations per second.
How many squarings mod 256 per second can your system perform?
Enter your systems CPU Clock Speed in GHz (for example 1.3):
Now what?
So, after a few assumptions we got an estimated number of modular squarings per second on a 256-bit modulus with our 1.3 GHz Dual-Core Intel i5 machine working overtime, of 5 million squarings mod n per second. Letās see how our estimated number compares to actual work by the machine itself.
Letās write a program to measure how close to the upper theoretical bound that our 1.3 GHz Dual-Core machine can get on a 256-bit moduli. Iāll use Rust, but I assume that any compiled code should more or less give the same result.
// Cargo.toml
// We need a library for working with huge numbers.
[dependencies]
num-bigint = "0.4"
num-traits = "0.2"
//main.rs
use num_bigint::BigUint;
use num_traits::{One, Zero};
use std::time::Instant;
fn main() {
// Size matters. We're using a prime in the 256-bit space that is close
// to the upper bound: 2^256.
let modulus = BigUint::parse_bytes(
b"FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF43", 16,
).expect("Invalid modulus, Sir! Are you really supposed to be here?");
let mut value = BigUint::one();
let mut count = 0;
// Perform squaring mod n operations for 1000 millisecs (1 second).
let start = Instant::now();
while start.elapsed().as_secs() < 1 {
value = (&value * &value) % &modulus;
count += 1;
}
println!("Modular Squarings Per Second: {}", count);
}
Right, letās see what we getā¦
If I run the Rust program above on my trusty but dusty old 1.3 GHz Mac, I get the following result:
Modular Squarings Per Second: 1192311
Which equals about 1.2 million squarings per second. š¤ Hmm, thatās far below the 5 million squarings the guy at the computer store sold me on. Whatās going on? If I look at the Activity Monitor I see that the CPU is idling at around 80%+. Iāve been using the computer for a few hours this morning so there are lots of processes which are simply hanging around.
Letās reboot the machine, and run the squaring binary first thing when we log in.
Now I get the following result:
Modular Squarings Per Second: 3252342 š
Nice! š„³ Thatās much better. Still not anywhere close to 5 million squarings, but that was never going to happen with a machine running a bloated consumer Operating System.
To wrap this up, I wrote this post to learn a little bit about squaring modular n per second. Iāll be the first to admit that the steps taken here are far from scientific.
Cheers!
Post Scriptum _ This post is a Work In Progress. Here are planned additions:
- Briefly look at Mooreās Law vs CPU Cycle Speed
- Cover square modulo p in mathematical terms
- Closer look at Montgoemery reduction
- Briefly cover GMP
- More tests on different machines
I donāt have a lot of time and I consider the post to be in pretty descent shape so it might take a while for these additions to happen.
Clipboard for planned changes:
https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/