Squarings modulo n per second

Modulo whaaat?😱

Squaring modulo n is just a fancy way of saying, “square this number, and then take the modulus n of it. For example, $4^2\, \text{mod}\, 3$, gives us $16\, \text{mod}\,3$, and we know that $16\, \% \, 3 = 1$. Easy!🤷‍♂️

What’s not so easy, is squaring this 👇 number 😅

0011010001001100010010011000101111011010000100100010001010101000010001111110101111010001001001000100000010100010000000000110101001010000110110110111010111111111011100100011001001111101100000110100010111101111111111011000011001010110000000011001110110101101

Squaring very large numbers, modulo n, is a backbone operation of modern day cryptography. It’s used absolutely everywhere, from enabling secure communication to mining bitcoins.
But this post is not about the inner mechanics of squaring large numbers.

This post is just an attempt of estimating how many squarings per second my Mac can peform.
Let’s proceed.

Four Important Assumptions

Let’s do some “back of the envelope” calculations and try to guesstimate how man squarings mod n that our machine can perform per second. We’re making the following four assumptions:

Non-Specialized CPU Architecture: Performance might be affected by specialized hardware such as SIMD, but we’re going to use simple, off-the shelf consumer hardware.
Non-Specialized Software: Squaring operation can be sped up by using specialized software. For example, GMP, the GNU Multiple Precision Arithmetic Library, has a function called mpz_pow which is optimized for squaring operations. There’s also something called Montgomery reduction which is software that can further speed up modular multiplication. Anyway, in our case, we won’t use any specilized software.
Operation Cost relates to Work Size: The number of CPU cycles required to perform one single squaring mod n operation depends on the size of the modulus. The bigger the mod the longer it takes to square it. Modern industry application typically use very large mods, in the range of 2000 to 4000 bits. We’re going to use a slightly smaller mod of 256. After all, this post is just about trying to gain some insight into what affects squaring mod n.
Can’t Leverage Multi-Core: Squaring mod n is a sequential operation. Two, or more, cores won’t matter because we can’t parallelize the squaring operation. So, even though a 1.3 GHz Dual-Core machine has a combined CPU speed of .3 *2 = 2.6 GHz, only one core can work on the problem. In our case, this places an upper bound on the CPU clock speed of 1.3 GHz.

Target Machine Specification

Let’s look at the squaring power of the machine we’ll use for this example. It’s a 2013 Mac Air, with the following spec:

1.3 GHz Dual-Core Intel Core i5
4GB 1600 MHz DDR3 RAM

1.3 GHz means that our CPU can run 1.3^9 cycles per second, per core. Dual-Core means that two cores can execute operations concurrently. That puts and upper bound on the theoretical CPU work capacity at 2.6 GHz, but only if the work is perfectly parallellized.

4 GB DDR3 RAM can affect overall system performance. It can be challenging to measure how many actual cycles per second that is available for squaring operations, since hundreds of other processes are active at the same time.

To simplify things, we ignore memory and effects if might have on squaring performance. Instead we focus exclusively on clockspeed.

Squaring Power

Googling around, it seems that modular squaring with a 2048-bit modulus takes approximately 2000-3000 CPU cycles on modern hardware.

Let’s use that estimation and try to adapt it to our machine. And please remember, we’re doing back of the envelope calculation here so it could be way off. Since we’ll be using a 256-bit mod, which is a quarter of a 2048-bit mod, let’s reduce the 2000 CPU cycles down to 500 CPU cycles, as the guesstimated number of cycles to square a 2048-bit moduli.

Next, since a modern Apple MacBook Pro has a clockspeed of around 3.2 GHz (compared to our old 1.3 GHz Mac), let’s take our 500 CPU cycles and half it down to 250 CPU cycles. So, with some pretty wild non-scientific assumptions we end up with and estimation that it will take our 1.3 GHz Mac to square a 256-bit moduli.

We are guesstimating that it takes our 1.3 GHz Mac around 250 CPU cycles to perform a single squaring modular n, for a 256-bit moduli.

Back of the envelope calculation

Using the upper bound of available CPU cycles (1.3 GHz) and the assumed avearge needed cycles per squaring mod n (250 cycles), we get: $$ \mathsf{ \text{Squarings per second}} = \frac{ \mathsf{\text{Total CPU cycles per second}}}{\mathsf{\text{Assumed cycles per squaring}}}$$

With machine spec: $$ \mathsf{ \text{Squarings per second}} = \frac{ \mathsf{1.3\, \times \, 10^9\; \mathsf{\text{cycles/sec}}}}{250\; \text{cycles/squaring}}$$

We simpify:

$$\mathsf{ \text{Squarings per second} = 5,200,000}$$

Interpretation:
with our simple back of the envelope calculation, we guesstimate that our 1.3 GHz machine can perform an average of 5 million squaring mod n operations per second.

How many squarings mod 256 per second can your system perform?
Enter your systems CPU Clock Speed in GHz (for example 1.3):

Now what?

So, after a few assumptions we got an estimated number of modular squarings per second on a 256-bit modulus with our 1.3 GHz Dual-Core Intel i5 machine working overtime, of 5 million squarings mod n per second. Let’s see how our estimated number compares to actual work by the machine itself.

Let’s write a program to measure how close to the upper theoretical bound that our 1.3 GHz Dual-Core machine can get on a 256-bit moduli. I’ll use Rust, but I assume that any compiled code should more or less give the same result.

// Cargo.toml
//  We need a library for working with huge numbers.
[dependencies]
num-bigint = "0.4"
num-traits = "0.2"

//main.rs
use num_bigint::BigUint;
use num_traits::{One, Zero};
use std::time::Instant;

fn main() {
    // Size matters. We're using a prime in the 256-bit space that is close
    // to the upper bound: 2^256.
    let modulus = BigUint::parse_bytes(
        b"FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF43", 16,
    ).expect("Invalid modulus, Sir! Are you really supposed to be here?");

    let mut value = BigUint::one();
    let mut count = 0;

    // Perform squaring mod n operations for 1000 millisecs (1 second).
    let start = Instant::now();
    while start.elapsed().as_secs() < 1 {
        value = (&value * &value) % &modulus;
        count += 1;
    }

    println!("Modular Squarings Per Second: {}", count);
}

Right, let’s see what we get…

If I run the Rust program above on my trusty but dusty old 1.3 GHz Mac, I get the following result:

Modular Squarings Per Second: 1192311

Which equals about 1.2 million squarings per second. 🤔 Hmm, that’s far below the 5 million squarings the guy at the computer store sold me on. What’s going on? If I look at the Activity Monitor I see that the CPU is idling at around 80%+. I’ve been using the computer for a few hours this morning so there are lots of processes which are simply hanging around.

Let’s reboot the machine, and run the squaring binary first thing when we log in.
Now I get the following result:

Modular Squarings Per Second: 3252342 👀

Nice! 🥳 That’s much better. Still not anywhere close to 5 million squarings, but that was never going to happen with a machine running a bloated consumer Operating System.

To wrap this up, I wrote this post to learn a little bit about squaring modular n per second. I’ll be the first to admit that the steps taken here are far from scientific.
Cheers!

Post Scriptum _ This post is a Work In Progress. Here are planned additions:

Briefly look at Moore’s Law vs CPU Cycle Speed
Cover square modulo p in mathematical terms
Closer look at Montgoemery reduction
Briefly cover GMP
More tests on different machines

I don’t have a lot of time and I consider the post to be in pretty descent shape so it might take a while for these additions to happen.

Clipboard for planned changes:
https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

December 19, 2024 by CryptoPatrick