PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore lets say you willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point.

A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But lets assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsgalia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.

Another approach is to borrow an idea from rejection sampling. For discussion lets assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trails until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted.

Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems.

(Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).

PRNG optimizations

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112