close

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label fortran. Show all posts
Showing posts with label fortran. Show all posts

Tuesday, January 11, 2022

The Old Days -- ca. 2000 -- Empirical Tests of Random Numbers (Python and Chi-Square Testing)

See The Old Days -- ca. 1974 Random Numbers Before Python for some background.

We'll get to Python after reminiscing about the olden days. I want to provide some back story on why sympy has had a huge impact on ordinary hacks like myself.

What we're talking about is how we struggled with randomness before

  1. /dev/random
  2. The Mersenne Twister Pseudo-Random Number Generator (PRNG)

Pre-1997, we performed empirical tests of PRNG's to find one that was random enough for our application. Maybe we were doing random samples of data to compare statistical measures. Maybe we were writing a game. What was important was a way to create a sequence of values that passed a battery of statistical tests.

See https://link.springer.com/chapter/10.1007%2F978-1-4612-1690-2_7 for the kind of material we salivated over. 

While there are an infinite number of bad algorithms, some math reveals that the Linear Congruential Generator (LCG) is simple and effective. Each new number is based on the previous number: \(X_{n+1} = (X_n \times a + c) \bmod m\). There's a multiply and an add, modulo some big number. The actual samples are often a subset of the bits in \(X_{n}\). 

After the Mersenne Twister became widely used, we essentially stopped looking at alternative random number algorithms. Before then -- well -- things weren't so good.

Here are some classics that I tested.

  • The ACM Collected Algorithms (CALGO) number 294 is a random-number generator. This is so obsolete, I have trouble finding links to it. It was a 28-bit generator.
  • The ACM Collected Algorithms (CALGO) number 266 has code still available. See toms/266
  • The Cheney-Kincaid generator is available. See random.f plus dependencies.

These formed a kind of benchmark I used when looking at Python's built-in Mersenne Twister.

Nowadays, you can find a great list of LCM PRNG's at  https://en.wikipedia.org/wiki/Linear_congruential_generator

Python Empirical Testing

One of the early questions I had was whether or not the random module in Python stacked up against these older RNG's that I was a little more familiar with.

So, I wrote a big, fancy random number testing tool in Python. 

When? Around 2000. I started this in the Python 1.6 and 2.1 era. I have files showing results from Python 2.3 (#2, Jul 30 2003). This is about when I stopped fooling around with this and moved on to trusting that Python really did work and was -- perhaps -- the best approach to working with randomly-sampled data for statistical work. 

The OO design for the test classes was Lavish Over The Top (LOTT™) OO:

  • Too Many Methods
  • Too Many Superclasses
  • No Duck Typing

We won't look at that code. It's regrettable and stems from trying to make Python into C++.

What I do want to look at is the essential Chi-Squared test methodology. This is some cool stuff.

Comparing Expected and Actual

The chi-squared metric is a way to compare actual and expected distributions. You can read about it on your own time. It's a way to establish if data is random or there's something else going on that's not random. i.e., a trend or a bias. 

The empirical tests for PRNG's that Knuth defines all come with chi-squared values that bracket acceptable levels of randomness. For the purposes of writing a working set of tests the magic chi-squared values supplied by Knuth are fine. Magical. But fine. Really. Trust them.

If you make modifications, you'd use your statistics text-book. You'd open to the back where it had a Chi-Squared table. That table gave you chi-squared values for a given degree of freedom and a given probability of being random.

Or, You could look for the NIST handbook online. It has a section on chi-squared testing. See https://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm. Same drill. Degrees of freedom and probability map to a chi-squared threshold.

But.

Were do these magical Chi-Squared values come from? This gets interesting in a useless-sidebar kind of way.

Chi-Squared Values

There's a really, really terse summary of chi-squared numbers here: https://www.danielsoper.com/statcalc/formulas.aspx?id=11. This is all you need to know. It may be too terse to help you learn about it, but it's a handy reference.

We need to evaluate two functions: partial gamma and gamma. These are defined as integrals. And they're nasty levels of complexity. Nasty.

This kind of nasty:

\[\gamma (s,z)=\int _{0}^{z}t^{s-1} e^{-t} dt\].

\[\Gamma (z)=\int _{0}^{\infty} t^{z-1} e^{-t} dt\].

These are not easy things to evaluate. Back to the ACM Collected Algorithms (CALGO) to find ways to evaluate these integrals. There are algorithms in CALGO 435 and 654 that are expressed as Fortran for evaluating these. This ain't all, of course, we need Stirling Numbers and Bernoulli Numbers. So there's a lot going on here.

A lot of this can be transliterated from Fortran. The resulting code is frankly quite ugly, and requires extensive test cases. Fortran with GOTO's requires some cleverness to unwind the conceptual for/while/if constructs.

OR.

Enter Sympy

In the 20+ years since I implemented my empirical PRNG tests "the hard way," sympy has come of age.

Check this out

from sympy import Sum, rf
from sympy.abc import k, s, z
from sympy.functions import exp
from sympy import oo
Sum(z**s * exp(-z) * z**k / rf(s, k+1), (k, 0, oo)).simplify()

I could use this in Jupyter Lab to display a computation for the partial gamma function.

\[z^{s}e^{-z}\sum _{k=0}^{\infty }{\dfrac {z^{k}}{s^{\overline {k+1}}}}\]

This requires a fancy Rising Factorial computation, the \(s^{\overline {k+1}}\) term. This is available in sympy as the rf(s, k+1) expression.

It turns out that sympy offers lowergamma() and gamm() as first-class functions. I don't even need to work through the closed-form simplifications.

I could do this...

def gammap(s: float, z: float) -> float:
    return (z**s * exp(-z) * Sum(z**k / rf(s, k+1), (k, 0, oo))).evalf()

def gamma(z: float) -> float:
    return integrate(t**(z-1) * exp(-t), (t, 0, oo)).doit()

It works well. And it provides elegant documentation. But I don't need to. I can write this, instead,

def chi2P(chi2: float, degF: int) -> float:
   return lowergamma(degF/2, chi2/2) / gamma(degF/2)

This is used to compute the probability of seeing a chi-squared value. 

For the frequency test, as an example. We partition the random numbers into 16 bins. These gives us 15 degrees of freedom. We want chi-squared values between 7.2578125 and 25.0.

Or.

Given a chi-squared value of 6.0, we can say the probability of 0.02 is suspiciously low, less than 0.05 level that we've decided signifies mostly random. The data is "too random"; that is to say it's too close to the ideal distribution to be trusted.

The established practice was to lookup a chi-squared value because you couldn't easily compute the probability of that value. With sympy, we can compute the probability. It's slow, so we have to optimize this carefully and not compute probabilities more frequently than necessary.

We can, for example, compute chi-squared values for a number of seeds, take the max and min of these and compute the probability of those two boundary values. This will bracket the probability that the pseudo random number generator is producing suitably random numbers.

This also applies to any process we're measuring with results that might vary randomly or might indicate a consistent problem that requires evaluation.

Using sympy eliminates the complexity of understanding these beautifully hand-crafted antique algorithms. It acts as a kind of super-compiler. From Math to an intermediate AST to a concrete implementation.

Tuesday, January 4, 2022

The Old Days -- ca. 1974 -- Random Numbers before Python

See "The Old Old Days -- ca. 1968" for my first exposure to an actual computer. Nothing about Python there. But. It's what motivated me to get started learning to code -- I was fascinated by games that involved randomization. Games with cards or dice.

After filling in a little background, I will get to the Python part of this. First, however, I want to compare the olden days with what we have now.

From 1969 to 1974 I had access to the high school's IBM 1620. This means programming in IBM's SPS assembler, or using the NCE Load-and-Go Fortran compiler. See https://www.cs.utexas.edu/users/EWD/transcriptions/EWD00xx/EWD37.html for a scathing review of the problems with this machine.

See http://www.bitsavers.org/pdf/ibm/1620/GC20-1603-10_1620_Catalog_of_Programs_Jan71.pdf Page 36 has this:

BERJAYA

That's a quick overview of my earliest programming language. What's essential here is the NCE Fortran used 4-digit integers.

I'll repeat that for those skimming, and wondering what the Python connection is.

Four. Digit. Integers.

That's four decimal digits. Decimal digits required at least 4 hardware bits. IBM 1620 digits also had flags and signs, so, there were maybe 6 bits per digit. 24 bits of hardware used to represent just under 14 bits of useful information.

My interest is in simulation and randomness. So. I have this question of how to create random sequences of numbers limited to 4-digit integers.

PRNG Algorithms

There are a number of classic Pseudo-Random Number Generator (PRNG) algorithms from the early days before Mersenne Twister took over in 1997.

We used to be super-careful to emphasize the letter P in PRNG because the numbers we're really random. They just behaved randomish. This is contrasted with real randomness, also known as entropy. For example, /dev/random device driver has a fair amount of entropy. I think it's comparable to a person throwing dice across a table. I think it's as random as a noise-generating diode with a sample-and-hold circuit to pluck out random values from the noise.

Pre-Mersenne-Twister -- pre-1997 -- we worried a lot about random number generation. See Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms, Addison-Wesley, 1969. Section 3.3.2. covers empirical testing of random number generators. Section 3.3.1. covers the Chi-squared test for fit between actual and expected frequency distributions.

Back in the olden days, it was stylish to perform an empirical test (or ten) to confirm we really had "good" random numbers. The built-in libraries that came with your compiler or OS could not be trusted without evidence.

One of the classic (bad) PRNG's is the "Middle-Squared" method. See https://en.wikipedia.org/wiki/Middle-square_method. I learned about this in the 70's. And used it in the old NCE Fortran. 

With Four. Digit. Integers.

Did I mention that the Fortran compiler used four decimal digits for integers? That means plucking the middle two digits out of a four-digit number. How random can that be?

Not very. The longest possible sequence is 100 numbers. If, by some miracle, you found a seed number with the right properties and only two digits.

Nowadays I can, in Python, do a quick middle-squared analysis for all 100 seed values.

This kind of thing.

def csqr4(value: int) -> list[int]:
    """The 4 decimal digit center-squared PRNG."""
    sequence = []
    while value not in sequence:
        sequence.append(value)
        value = (((value**2) // 10) % 100)
    return sequence

Which you can run and see that all of my early attempts at games and simulations were doomed. The seed values of 76, 42, and 69 provided kind of long sequences of almost random-seeming numbers. Otherwise, pfft, this was junk computer science. 50% of the seeds provide 5 or fewer numbers before repeating. 

For blackjack, a few random numbers for shuffling might be enough. For other games, the lack of randomness made the outcomes trivially predictable. 

What's funny is how far the state of the art has moved since then.

  1. Hardware now has more than 20,000 decimal digits (about 10K bytes) of storage.
  2. Software with algorithms that are really, really clever.

It's hard to understate these two advances, particularly, the second one. I'll return to the algorithm thing a lot in the next few posts.

My focus was on games and randomization. Ideally, simple stuff. But... under the hood, it's not simple. I've spent some time (not much, and not in much depth) looking at the tip of this iceberg. 

It served me as an incentive to dive just a little more deeply into a topic, like math or a programming language or a statistical tool.

Tuesday, April 7, 2020

Why Isn't COBOL Dead? Or Why Didn't It Evolve?

Here's part of the question:
Why didn't COBOL evolve more successfully?
FORTRAN, OTOH, has survived precisely because it--and more importantly, related tools, esp compilers--has evolved to solve/overcome many (certainly not all!) of the sorts of pain-points you describe, while retaining the significant performance edge that (IMHO, ICBW) prevents challengers (e.g., Python) from dislodging it for tasks like (e.g.) running dynamical models (esp weather forecasting).
In short, why is FORTRAN still OK? Why is COBOL not still OK?

Actually, I'd venture to say the stories of these languages are essentially identical. They're both used because they have significant legacy implementations.

There's a distinction, that I think might be relevant to the "revulsion factor."

Folks don't find Fortran quite so revolting because it's sequestered into libraries where we don't really have to look at it. It's often wrapped into SciPy. The GCC compiler system handles it and we're happy.

COBOL, however, isn't sequestered into libraries with tidy Python wrappers and Conda installers. COBOL is the engine of enterprise applications.

Also. COBOL is used by organizations that suffer from high amounts of technical inertia, which makes the language a kind of bellwether for the rest of the organization. The organization changes slowly (or not at all) and the language changes at an even more tectonic pace.

This is a consequence of very large organizations with regulatory advantages. Governments, for example, regulate themselves into permanence. Other highly-regulated industries like banks and insurance companies can move slowly and tolerate the stickiness of COBOL.

Also.

For a FORTRAN library function that does something useful, it's not utterly mysterious. There's often a crisp mathematical definition, and a way to test the implementation. There are no quirks.

For a COBOL program that does something required by law, there can still be absolutely opaque mysteries and combinations of features without acceptable unit test cases. This isn't for lack of trying. It's the nature of "application" vs. "subroutine."

The special case and exceptions have to live somewhere. They live in the application.

For FORTRAN, the exceptions are in the Python wrapper using numpy using FORTRAN.

For COBOL, the exceptions are in the COBOL  Somewhere.