*“The problem of distinguishing prime numbers from composite numbers and of resolving the latter into their prime factors is known to be one of the mostf important and useful in arithmetic.”* − Carl Friedrich Gauss

In this article, I will give some background into some of the algorithmic aspects of primality testing, illustrate them using the Linux bc utility, and describe some of the advanced algorithms used in the famous Lucas-Lehmer (LL) primality test for Mersenne numbers and the author’s own implementation thereof in his Mlucas program. This software program is now available in a version optimized for the vector-arithmetic hardware functionality available in the ARMv8 processor family, specifically the ODROID-C2 SBC. Note however, that the software is also buildable on non-v8 ODROID SBC’s, but just not using the vector instructions. Since the Mlucas readme page (linked further down) provides detailed build instructions for a variety of hardware platforms including ODROID SBC’s, I shall focus here on the mathematics and algorithmics, at a level which should be understandable by anyone with a basic understanding in algebra and computer programming.

#### Primitive roots and primality

LL is example of what is referred to as a *nonfactorial primality test*, which refers to the fact that it requires no knowledge whatever about the factorization-into-primes of the input number *N*, or modulus, though we typically perform a pre-sieving “trial factorization” step to check such numbers for small prime factors before resorting to the LL test. Such tests rely on deep algebraic properties of number fields which are beyond the scope of this article, but in essence they amount to testing whether there exists a *primitive root* of the mathematical *group* defined by multiplication modulo *N*, which means a root of the maximal possible order *N*−1. (This sounds like very highbrow mathematics, but in fact reduces to very simple terms, as we shall illustrate.) Such a root exists if and only if (i.e. the converse holds) *N* is prime. For example, if we compute successive powers of 2 modulo *N* = 7, we get the sequence 2,4,1,… which repeats with length 3, meaning that either 7 is composite or 2 is not a primitive root (mod 7). If we instead try powers of 3 we get the sequence 3,2,6,4,5,1 which is of the maximal possible length *N*−1 = 6 for the multiplicative group (mod 7), thus we conclude that 7 is prime by way of having found 3 to be a primitive root. If we instead try the powering sequences resulting from the same two bases modulo the composite 15 we get 2^{k} (mod 15) = 2,4,8,1,… and 3^{k} (mod 15) = 3,9,-3,-9,…, for index k = 1,2,3,… . We note that both sequences repeat with periodicity 4, which is firstly less than *N*−1 and secondly does not divide N−1 (i.e. there is no chance of one of the recurring ones in the 2^{k} sequence landing in the *N*−1 slot), and that the 3^{k} sequence does not even contain a 1, whereas both of the (mod 7) sequences contain one or more ones, in particular both having 1 in the (*N*−1) slot. That is, for *N* = 7 we have a^{N-1} ≡ 1 (mod *N*) for all bases not a multiple of 7, where the triple-bar-equals is the symbol for modular equivalence, i.e. the 2 sides of the relation are equal when reduced modulo *N*. For the (mod 15) sequences, on the other hand, the non-occurrence of 1 as the *N*−1 power in either one suffices to prove 15 composite.

Computing such power-mod sequences for large moduli is of course not practical for very large N, so the idea of instead computing just the (*N*−1)th power of the base is crucial, because that requires the computation of a mere O(lg *N*) intermediates, where lg is the binary logarithm. The resulting test turns out to be rigorous only in the sense of correctly identifying composite numbers, being merely probabilistic for primes because it yields a false-positive ‘1’ result for a small percentage of composite moduli in addition to the prime ones, but can be modified to yield a deterministic (rigorous) test in for a usefully large fraction of these problematic false primes.

#### Fermat’s ‘little’ theorem and probable-primality testing

Both rigorous primality tests and the so-called probable-primality tests are based in one way or another on the property of existence of a primitive root, and it is useful to use the probable-prime variety to illustrate the kind of arithmetic required to implement such a test, especially for large moduli. The “father of number theory”, Pierre de Fermat, early in the 17th century had already observed and formalized into an actual theorem what we noted above, that for any prime p, if a is coprime to (has no factors in common with) *p* — since *p* prime this means *a* must not be a multiple of *p —* then

a^(p−1) ≡ 1 (mod p). [*]This is now referred to as Fermat’s “little theorem” to differentiate it from its more-famous (but of less practical importance) “last theorem”, about the solutions over the integers of the equation a

^{n}+b

^{n}= c

^{n}; the resulting test applied to numbers of unknown character is referred to, interchangeably, as a Fermat probable-prime or Fermat compositeness test. The first appellation refers to the fact that an integer

*N*satisfying a

^{N−1}≡ 1 (mod

*N*) for some base a coprime to

*N*is very likely to be prime, large-sample-statistically speaking, the second to the fact that numbers failing this criterion are certainly composite, even if an explicit factor of

*N*has not been found.

Note that the converse of [*] does not hold, that is, there are coprime integer pairs *a,N* for which a^{N−1} ≡ 1 (mod *N*) but where *N* is not a prime; for example, using the linux ‘bc’ utility one can see that the composite *N* = 341 = 11 × 31 satisfies the theorem for base a = 2, by invoking bc in a command shell and simply typing ‘2^340%341’. This underscores the importance of trying multiple bases if the first one yields a^{N−1} ≡ 1 (mod *N*): for prime *N*, every integer in 2,…,*N*-2† is coprime to *N*, and thus all these bases yield 1 for the powering result, whereas for composite *N*, trying a small number of bases nearly always suffices to reveal N as composite. We say “nearly always” because there exists a special class of integers known as *Carmichael numbers*, which pass the Fermat test to all bases which are coprime to *N*; the smallest such is 561 = 3×11×17. Carmichael numbers only reveal their composite character if we happen to choose a base a for the Fermat test which is not coprime to *N*, i.e. if we find a factor of *N*. In practice one should always first check *N* for small prime factors up to some bound (set by the cost of such trial factorization) before subjecting *N* to a Fermat-style probable-primality test.

† We skip both 1 and N−1 as potential bases because, being ≡±1 (mod N), these two “bookend” values of a both trivially yield a^{N−1} ≡ 1 for all odd N.

For base *a* = 2 there are precisely 10403 such Fermat pseudoprimes, a miniscule number compared to the nearly 200 million primes below that bound, so even using just this single base yields a remarkably effective way of determining if a number is likely to be prime. For example, one can combine such a Fermat base-2 pseudoprime test with a prestored table of the known composites less than 2^{32} which pass the test to produce a very efficient deterministic primality algorithm for numbers below that bound. In the more general context of testing numbers of arbitrary size, however, it is important to realize that there are certain classes of numbers, all of which are Fermat base-2 pseudoprimes, irrespective of whether they are prime or composite. The two best-known such classes are, firstly, the Mersenne numbers M(p) = 2^{p}−1 (for which we restrict the definition to prime exponents since that is required for a number of this form to have a chance of being prime); for example, 2^{11}−1 passes the test even though it factors as 23 × 89.

The second class of such numbers is the Fermat numbers F_{m} = 2^{2m}+1. It appears that the fact that the first five such numbers, F_{0} – F_{4} = 3,5,17,257,65537 are small enough to be amenable to pencil-and-paper trial division and are easily shown to be primes that way coupled with the fact that they all pass the test named in his honor may have led Fermat to make his famously incorrect conjecture that all such numbers are prime. This conjecture was refuted by Euler a few decades later, via his showing that 641 is a prime factor of F5, and subsequent work has led to a general belief that in all likelihood the smallest five such numbers are in fact the *only* primes in the sequence. Simply changing the base of the Fermat test to, say, 3, suffices to distinguish the primes from the composites in this number sequence, but it appears this idea never occurred to Fermat.

#### Efficient modular exponentiation

To perform the modular exponentiation, we use a general technique – or better, related set of techniques – known as a binary powering ladder. The name refers to the fact that the various approaches here all rely on parsing the bits of the exponent represented in binary form. By way of encouraging the reader to compute alongside reading, we make use of the POSIX built-in arbitrary precision calculator, bc, which is relatively slow compared to higher-end number-theoretic freeware programs such as Pari/GP and the Gnu multiprecision library, GMP, but can be exceedingly handy for this kind of basic algorithmic ‘rapid prototyping’. We invoke the calculator in default whole-number mode simply via ‘bc’ in a terminal; ‘bc -l’ invokes the calculator in floating-point mode, in which the precision can be adjusted to suit using the value of the ‘scale’ parameter (whose default is 20 decimal digits), and the ‘-l’ defines the standard math library, which contains a handful of useful functions including natural logarithm and exponent, trigonometric sine, cosine and arctangent (from which other things can be built, e.g. ‘4*a(1)’ computes π to the precision set by the current value of scale by using that arctan(1) = π/4), and the Bessel function of the first kind. The arithmetic base of bc’s inputs and outputs can be controlled by modifying the values of ibase and obase from their defaults of 10, for example to view 23 in binary, type ‘obase = 2; 23’ which dutifully outputs 10111; reset the output base back to its decimal default via ‘obase = 10’.

Note that for general – and in particular very large – moduli we cannot simply compute the power on the left-hand-side of [*] and reduce the result modulo *N*, since the numbers get large enough to overwhelm even our largest computing storage. Roughly speaking, the powers double in length for each bit in the exponent, so raising base 2 to a 64-bit exponent gives a result on the order of 2^{64}, or 18,446,744,073,709,551,616 bits, or over 2000 petabytes, nicely illustrating the famous wheat and chessboard problem in the mathematics of geometric series. To test a number of the size of the most-recently-discovered Mersenne prime, we need to do tens of millions of these kinds of size-doubling iterations, so how is that possible on available compute hardware? We again turn to the properties of modular arithmetic, one of the crucial ones of which is that in computing a large ‘powermod’ of this kind, we can do modular reduction at every step of the way, whenever it is convenient to do so. Thus in practice one uses a binary powering ladder to break the exponentiation into a series of squarings and multiplications, each step of which is followed by a modular reduction of the resulting intermediate.

We will compare and contrast two basic approaches to the binary-powering computation of a^{b} (mod *n*), which run through the bits of the exponent b in opposite directions. Both do one squaring per bit of b as well as some more general multiplications whose precise count depends on the bit pattern of *b*, but can never exceed that of the squarings. The right-to-left method initializes an accumulator y = 1 and a current-square *z* = *a*, then for each bit in *n* beginning with the rightmost (ones) bit, if the current bit = 1, we up-multiply the accumulator by the current square z, then again square z to prepare for the next bit to the left. Here is a simple user-defined bc function which illustrates this – for simplicity’s sake we have omitted some basic preprocessing which one would include for sanity-checking the inputs such as zero-modulus and nonnegative-exponent checks:

/* * right-to-left-to-right binary modpow, a^b (mod n): */ define modpow_rl(a,b,n) { auto y,z; y = 1; z = a%n; while (b) { if(b%2) y = (y*z)%n; z = (z*z)%n; b /= 2; } return (y); }We leave it as an exercise for the reader to implement a simple optimization which adds a lookahead inside the loop such that the current-square update is only performed if there is a next leftward bit to be processed, which is useful if the base a is large but the exponent is small. After pasting the above code into your bc shell, try a base-2 Fermat pseudoprime test of the known Mersenne prime M(2203): ‘n=2^2203-1; modpow_rl(2,n-1,n)’ (this will take a few seconds). Now retry with a composite exponent of similar size, say 2205, and note the non-unity result indicating that n = 2

^{2205}−1 is also composite. Since 2205 has small prime factors 3,5 and 7, we can further use the bc modulo function ‘%’ to verify that this number is exactly divisible by 7,31 and 127.

Of course, we already noted that a base-2 Fermat pseudoprime test returns ‘likely prime’ for all Mersenne numbers, that is, for all 2^{p}−1 with p prime, we can use the above modpow function to check this, now further modifying the return value to a binary 0 (composite) or 1 (pseudoprime to the given base): ‘n=2^2207-1; modpow_rl(2,n-1,n) == 1’ returns 1 even though the Mersenne number in question has a small factor, 123593 = 56×2207+1. Repeating the same Fermat pseudoprime test but now to base 3 correctly reveals this number to be composite. Note the associated jump in bc runtime – this appears to reflect some special-case optimizations in bc’s internal logic related to recognizing arguments which are powers of 2 – we see a similar speed disparity when repeating the pseudoprime test using bases 4 and 5, for example.

Our next powering algorithm processes the bits in the opposite direction, left-to-right. This method initializes the accumulator *y* = *a*, corresponding to the leftmost set bit, then for each rightward bit we square y, and if the current bit = 1, we up-multiply the accumulator by the powering base *a*. In a coding language like C we could – either via compiler intrinsics or via a small assembly-code macro – implement the bits() functions via a binary divide-and-conquer approach or by efficiently accessing whatever hardware instruction might be available for leading-zeros-counting, and our bit-reversal function reverse() could be efficiently implemented using a small table of 256 precomputed bit-reversed bytes, a loop to do bytewise swaps at the left and right ends of our exponent, and a final step to right-shift the result from 0-7 bits, depending on where in the leftmost set byte the leftmost set bit occurs. In bc we have no such bitwise functionality and so must roll out own inefficient emulation routines, but as our focus is on large-operand modpow, the time cost of such bitwise operations is negligible compared to that of the modular multiplications:

define bits(n) { auto ssave, r; ssave = scale; scale = 0; /* In case we're in floating-point mode */ r = length(n)*3321928095/1000000000; while ( 2^r > n ) { r -= 1; } scale = ssave; return(r+1); } define reverse(n,nbits) { auto tmp; tmp = 0; while(nbits) { tmp = 2*tmp + (n % 2); n /= 2; nbits -= 1; } return(tmp); } /* left-to-right binary modpow, a^b (mod n): */ define modpow_lr(a,b,n) { auto y,len; len = bits(b); b = reverse(b,len); y = a%n; b /= 2; while(--len) { y = (y*y)%n; if(b%2) y = (a*y)%n; b /= 2; } return(y); }The need for bit-reversal may also be avoided by implementing the algorithm recursively, but as bc is, shall we say, less than spiffy when it comes to efficient support for recursion, we prefer a nonrecursive algorithm in the left-to-right case. We urge readers to paste the above into their bc shell, use it to again try the base-2 and base-3 Fermat-pseudoprime tests on 2

^{2207}−1, and compare those runtimes to the ones for the right-to-left algorithm. In my bc shell the left-to-right method runs in roughly half the time on the aforementioned composite Mersenne number, despite the fact that the guts of the loops in our RL and LR powering functions look quite similar, each having one mod-square and one mod-multiply.

The reason for the speedup in the LR method becomes clear when we examine precisely what operands are involved in the two respective mod-multiply operations. In the RL powering, we multiply together the current power accumulator y and the current mod-square z, both of which are the size of the modulus n. In the LR powering, we multiply together the current power accumulator y and the base a, with the latter typically being order of unity, or O(1) in standard asymptotic-order notation.

In fact, for small bases, we can replace the the a×y product by a suitably chosen series of leftward bitshift-and-accumulates of y if that proves advantageous, but the bottom line – with a view to the underlying hardware implementation of such arbitrary-precision operations via multiword arithmetic – is that in the LR algorithm we multiply the vector y by the scalar base a, which is linear in the vector length in terms of cost. For general exponents whose bits are split roughly equally between 0 and 1 we only realize this cost savings for the 1-bits, but our particular numerical example involves a Mersenne number all of whose bits are 1, thus the exponent of the binary powering *n*−1 has just a single 0 bit in the lowest position, and if vector-times-vector mod-square and mod-multiply cost roughly the same (as is the case for bc), replacing the latter by a vector-times-scalar cuts the runtime roughly in half, as observed.

#### Deterministic primality proving

While the Fermat-pseudoprime test is an efficient way to identify if a given number is likely prime, our real interest is in rigorously establishing the character, prime or composite, of same. Thus it is important to supplement such probable-primality tests with deterministic alternatives whenever feasible. If said alternative can be performed for similar computational cost that is ideal, because computing a^{N−1} (mod N) using the modular left-to-right binary powering approach requires the computation of O(lg N) modular squarings of N-sized intermediates and a similar-sized number of modular multiplications of such intermediates by the base a, which is generally much smaller than N, meaning these supplementary scalar multiplies do not signify in terms of the overall large-N asymptotic algorithmic cost. There is reason to believe – though this has not been proven – that a cost of one modular squaring per bit of N is in fact the optimum achievable for a non-factorial primality test.

Alas, it seems that there is only a limited class of numbers of very special forms for which a deterministic primality test of similarly low computational cost exists – the most famous such are again the aforementioned two classes. For the Fermat numbers we use a generalization of the Fermat pseudoprime test due to Euler, in which one instead computes a^{(N−1)/2}(mod *N*) and compares the result to ± 1, with the proper sign depending on a particular algebraic property of N. For the Fermat numbers, it suffices to take *a* = 3 and check whether 3^{(Fm−1)/2} = 3^{22m−1} ≡ −1 (mod F_{m}), which requires precisely 2^{m}−1 mod-squarings of the initial seed 3. The sufficiency of this in determining the primality of the Fermat numbers is known as Pépin’s theorem. The reader can use either of the above bc modpow functions to perform the Pépin primality test on a Fermat number up to as many as several kilobits in size; for example to test the 2-kilobit F11, ‘n = 2^(2^11)+1; modpow_lr(3,(n-1)/2,n) == n-1’. Note that the LR and RL algorithms run in roughly the same time on the Fermat numbers, since the power computed in the modular exponentiation of Pépin test is a power of 2, thus has just a single 1-bit in the leftmost position.

For the Mersenne numbers M(*p*) = 2^{p}−1, the primality test is based on the algebraic properties of so-called Lucas sequences after the French mathematician Édouard Lucas, but was refined later by number theorist Derrick Lehmer into a simple algorithmic form known as the Lucas-Lehmer primality test: beginning with any one of an algebraically permitted initial seed values x (the most commonly used of which is 4), we perform precisely p−2 iterative updates of the form *x* = *x*^{2}−2 (mod M(p)); the Mersenne number in question is prime if and only if the result is 0. For example, for *p* = 5, we have unmodded iterates 4,14,194,37634; in modded form these are 4,14,8,0, indicating that the final iterate 37634 is divisible by the modulus 31 and thus that this modulus is prime. As with our probable-primality tests and the Pépin test, we have one such iteration per bit of the modulus, give or take one or two.

To give a sense of the relative efficiencies of such specialized-modulus tests compared to deterministic primality tests for general moduli, the fastest-known of the latter are based on the arithmetic of elliptic curves and have been used to prove primality of numbers having around 20,000 decimal digits, whereas the Lucas-Lehmer and Pépin tests have, as of this writing, been performed on numbers approaching 200 *million* digits. Since, as we shall show below, the overall cost of the specialized tests is slightly greater than quadratic in the length of the input, this factor-of-10,000 size disparity translates into a proportionally larger effective difference in test efficiency. In terms of the specialized-modulus tests, the speed difference is roughly equivalent to a hundred-millionfold disparity in testing efficiency based on the sizes of the current record-holders for both kinds of tests.

#### Fast modular multiplication

Despite the superficially different forms of the above two primality tests, we note that for both the crucial performance-gating operation, just as is the case in the Fermat pseudoprime test, is *modular multiplication*, which is taking the product of a pair of input numbers and reducing the result modulo a third. (The additional subtract-2 operation of the LL test is negligible with regard to these large-operand asymptotic work scalings). For modest-sized inputs we can use a standard digit-by-digit “grammar school” multiply algorithm, followed by a division-with-remainder by the modulus, but again, for large numbers we must be quite a bit more clever than this.

The key insight behind modern state-of-the-art large-integer multiplication algorithms is due to Schönhage and Strassen (see also http://numbers.computation.free.fr/Constants/Algorithms/fft.html for a more mathematical exposition of the technique), who recognized that multiplication of two integers amounts to *digitwise convolution* of the inputs. This insight allows any of the well-known high-efficiency convolution algorithms from the realm of digital signal processing to be brought to bear on the problem. Reflecting the evolution of the modern microprocessor, the fastest-known implementations of such algorithms use the Fast Fourier Transform (FFT) and thus make use of the floating-point hardware of the processor, despite the attendant roundoff errors which cause the convolution outputs to deviate from the pure-integer values they would have using exact arithmetic.

In spite of this inexactitude, high-quality FFT software allows one to be remarkably aggressive in how much roundoff error one can sustain without fatally corrupting the long multiplication chains involved in large-integer primality testing: for example, the Lucas-Lehmer test which discovered the most-recent record-sized Mersenne prime, 2^{77232917}−1, used a double-precision FFT which divided each of the 77232917-bit iterates into 2^{22} input words of either 18 or 19 bits in length (precisely speaking 1735445 = 77232917 (mod 2^{22}) of the larger and the remaining 2458859 words of the smaller size), thus used slightly greater than 18 bits per input ‘digit’ of the discrete convolution. This gave floating-point convolution outputs having fractional parts (i.e. accumulated roundoff errors) as large as nearly 0.4, which is remarkably close to the fatal “I don’t know whether to round this inexact floating-point convolution output up or down” 0.5 error level.

Nonetheless, multiple verify runs using independently developed FFT implementations at several different (and larger, hence having much smaller roundoff errors) transform lengths, on several different kinds of hardware, confirmed the correctness of the initial primality test. For an n-bit modulus, the large-n-asymptotic compute cost of such an FFT-multiply is O(*n* lg *n*), though realizing this in practice, especially when the operands get large enough to exceed the sizes of the various-level data caches used in modern microprocessors, is a far-from-insignificant algorithmic and data-movement challenge. Since our various primality tests require O(*n*) such multiply steps, the work estimate for an FFT-based primality test is O(*n*^{2} lg *n*), which is only a factor lg *n* larger than the quadratic cost of a **single** grammar-school multiply. The upshot is that to write a world-class primality-testing program, one must write (or make use of) a world-class transform-based convolution.

Roughly 20 years ago, I was on the faculty of engineering at Case Western Reserve University in Cleveland, Ohio, and was looking for an interesting way to motivate the teaching of the Fast Fourier Transform algorithm from signal processing to the students in my undergraduate computational methods class. Some online searching turned up the use of FFT in large-integer multiplication and the rest was history, as the saying goes.

The second major algorithmic speedup in modern primality testing came roughly a generation following the Schönhage-Strassen algorithm, and involved the required FFT length needed to multiply two integers of a given size. In order to perform a modular multiply of a pair of *n*-bit inputs, in general we must first exactly compute the product, which is twice as many bits in length, then reduce the product (i.e. compute the remainder) with respect to the modulus in question. For specialized ‘binary-friendly’ moduli such as the Mersenne and Fermat numbers the reduction can be done efficiently using bitwise arithmetic, but to compute the double-width product still imposes an overhead cost, since it requires us to zero-pad our FFT-convolution inputs with *n* 0 bits in the upper half and use an FFT length twice as large as our eventual reduced outputs otherwise would dictate.

This all changed in 1994, when the late Richard Crandall and Barry Fagin published a paper showing how for the special cases of Mersenne and Fermat-number moduli such zero-padding could be avoided by way of cleverly weighting the transform inputs in order to effect an “implicit mod” operation. This breakthrough yielded a more-than-factor-of-2 speedup in primality testing of these two classes of moduli, and the associated kinds of discrete weighted transforms have subsequently been extended to several other interesting classes of moduli. For a simple worked example of how an FFT can be used for such an implicit mod, we refer the reader to the author’s post at mersenneforum.org. Alas, the various other important algorithmic aspects involved in high-efficiency mod mul on modern processor hardware: non-power-of-2 transform length support, nonrecursive in-place FFTs requiring no cache-unfriendly bit-reversal reordering of the input vectors, optimized data movement to permit efficient multithreading, etc, are beyond the scope of the current article, but we assure the interested reader/programmer that there is more, much more, of interest involved. We now turn to the various considerations which led to the special effort of implementing assembly-code support for the ODROID platform and its 128-bit vector-arithmetic instruction set, and finally to the nuts and bolts of compiling and running the resulting LL-test code on the Odroid platform.

#### Why the ARM platform?

I spent most of my development time over the past 5 years first parallelizing my existing Mlucas FFT code schema using the POSIX pthreads framework plus some added core-affinity code based on the various affinity-supporting extensions provided in various major Linuxes and MacOS. Once I had a working parallel framework which supported both the scalar-double (generic C) and x86 SSE2 128-bit vector-SIMD builds, I upgraded my SIMD inline-assembly-macro libraries through successive updates to Intel’s vector instruction set: first 256-bit AVX, then AVX2 which firstly promoted the vector-integer math to the same footing as the floating-point by extending the former to the full 256-bit vector width and secondly added support for 3-operand fused multiply-add (FMA3) floating-point arithmetic instructions. While working on the most-recent such effort – that needed to support Intel’s 512-bit AVX512 instruction set (which first appeared in the market in a somewhat-barebones but still very usable ‘foundation instructions subset’ form in early 2017 in the Knights Landing workstation family) last year I was also considering a first foray into adding SIMD support to a non-x86 processor family. The major considerations for undertaking such an effort, which are typically 4-5 months’ worth of focused coding, debug and performance-tuning, were as follows:

- Large install base and active developer community;
- Support for standard Linux+GCC compiler/debugger toolflow and POSIX pthreads parallelization;
- Well-designed instruction set, preferably RISC-style, with at least as many vector and general-purpose registers as Intel AVX’s 16-vector/16-GPR, and preferably as many registers (32 of each kind) as Intel’s AVX512;
- Support for FMA instructions;
- Competitiveness with leading high-end processor families (e.g. Intel CPUs and nVidia GPUs) in performance-per-watt and per-hardware-dollar terms;
- Likelihood of continued improvements in processor implementations.

ARM (specifically the ARMv8 128-bit vector-SIMD instructions and the various CPUs implementing it) quickly emerged as the leading candidate. High power efficiency, a generous 32+32 register set, and an excellent instruction set design, much better than Intel SSE2 with its Frankensteinian bolted-together aspects (consider the almost-comically-constricted support of 64-bit vector integer instructions and the lack of a ROUND instruction in the first SSE2 iteration as just two of many examples here). Once I’d purchased my little ODROID C2 and worked through the expected growing pains of the new-to-me instruction mnemonics and inline-assembly syntax differences versus x86 SIMD, the development effort went very smoothly. One early worry, in the form of the somewhat-more-restricted FMA3 syntax versus Intel’s, proved unfounded, the impact of said restrictiveness being easily mitigated via some simple rearrangement of operand order, in which regard the generous set of 32 vector registers came in very handy. One hopes ARM has a 256-bit upgrade of the v8 vector instructions on their roadmap for the not-too-distant future!

#### Setting up the software

The software is designed to be as easy to build as possible under as wide variety of Linux distributions as possible. Having had too many bad and time-wasting experiences with the standard configure-and-make build paradigm for Linux freeware I deliberately chose to depart from it in somewhat-radical fashion, instead putting a lot of effort into crafting my header files to as-far-as-possible automate the process of identifying the key aspects of the build platform during preprocessing, taking any of the software’s supported hardware-specific identifiers (e.g. user wants to build for x86 CPUs which support the AVX2/FMA3 vector-instruction subset, or more pertinently for ODROID’ers, the ARMv8 128-bit SIMD vector-arithmetic instructions) from the user at compile time, in addition to choosing whether to build a single-threaded or a multithreaded binary.

Building thus reduces to the following 3-step sequence, which is laid out in the Mlucas homepage: Inserting any applicable architecture-specific flags, compile all files. On my ODROID-C2, I want to activate the inline assembler targeting ARMv8 and I want to be able to run in parallel on all 4 processor cores, so the compile command is:

$ gcc -c -O3 -DUSE_ARM_V8_SIMD -DUSE_THREADS ../src/*.c >& build.logUse grep on the build.log file resulting from the compile step for errors:

$ grep -i error build.logIf no build-log errors, link a binary:

$ gcc -o Mlucas *.o -lm -lpthread -lrtRun the standard series of self-tests for the ‘medium’ range of FFT lengths covering all current and not-too-distant-future GIMPS user assignments: ‘./Mlucas -s m’. On an ODROID that will spend several hours testing the various ways to combine the individual complex-arithmetic radices which make up each of the various FFT lengths covered by the self-test, and capturing the ones which give the best performance on the user’s particular hardware to a master configuration file called mlucas.cfg, which is plaintext despite the nonstandard suffix.

The most common kinds of build errors encountered in the above typically can be traced to users targeting an instruction set not supported by their particular CPU or occasionally some particularity of the build platform which needs some tweaks to the preprocessor logic in the crucial platform.h header file. Since ODROID SBC’s come preloaded with uniform Linux OS distributions, the only issues likely to arise relate to the precise instruction set (v8 SIMD or not) supported by the particular ODROID model being used.

On my ODROID-C2, running the 128-bit SIMD build of the software on all 4 cores gives total throughput roughly equivalent to that of a build using the x86 SSE2 128-bit vector instructions on a single core of a 2GHz Intel Core2 based laptop. This translates to roughly 1/20th the total throughput achievable on the author’s Intel Haswell quad using the 256-bit AVX2 instruction set, so the 2 kinds of platforms are not really comparable on the basis of per-core throughput. ARM-based platforms like ODROID rather shine in terms of per-watt-hour and per-hardware-cost throughput. In terms of an ODROID user of the software having a realistic chance at discovering the next Mersenne prime, well, we need to make up with numbers what the higher-end hardware platforms (Intel CPUs, nVidia GPUs) do with huge transistor counts and power-hungry cores – i.e., we need to be the army ants to their lions. And of course, speed upgrades such as that promised by the recently-announced ODROID-N1 can only help. It is an exciting time to be micro-computing!

## Be the first to comment