
Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy archive, and it will likely be dropped five years after the last post (Jan 2023).


Tuesday, November 22, 2022

Testing with PySpark

This isn't about details of pySpark. This is about the philosophy of testing when working with a large, complex framework, like pySpark, pandas, numpy, or whatever. 

BLUF

Use data subsets. 

Write unit tests for the functions that process the data.

Don't test pyspark itself. Test the code you write.

Some History

I've worked with folks -- data scientists specifically -- without a deep background in software engineering.

When we said their model-building applications needed a test case, they supplied the test case they used to validate the model.

Essentially, their test script ran the entire training set. Built the model. Did extensive statistical testing on the resulting decisions made by the model. The test case asserted that the stats were "good." In fact, they recapitulated the entire model review process that had gone on in the data science community to get the model from "someone's idea" to a "central piece of the business." 

The test case ran for hours and required a huge server loaded up with GPUs. It cost a fortune to run. And. It tended to time out the deployment pipeline.

This isn't what we mean by "test." Our mistake.

We had to explain that a unit test demonstrates the code works. That was all. It shouldn't involve the full training set of data and the full training process with all the hyperparameter tuning and hours of compute time. We don't need to revalidate your model. We want to know the code won't crash. We'd like 100% code coverage. But the objective is little more than showing it won't crash when we deploy it.

It was difficult to talk them down from full training sets. They couldn't see the value in testing code in isolation. A phrase like "just enough data to prove the thing could plausibly work with real data" seemed to resonate. 

A few folks complained that a numpy array with a few rows didn't really show very much. We had to explain (more than once) that we didn't really want to know all the algorithmic and performance nuances. We mostly wanted to know it wouldn't crash when we applied it to production data. We agreed with them the test case didn't show much. We weren't qualified to revalidate the model; we were only qualified to run their training process for them. If they had done enough work to be sure we *could* run it.

(It was a bank. Software deployments have rules. An AI model-building app is still an app. It still goes through the same CI/CD pipeline as demand deposit account software changes. It's a batch job, really, just a bit more internally sophisticated than the thing that clears checks.)

Some Structure

I lean toward the following tiers of testing:

  1. Unit tests of every class and function. 100% code coverage here. I suggest using the pytest and pytest-cov packages to track testing and make sure every line of code has some test case. For a few particularly tricky things, covering every logic path is better than simply covering lines of code. In some cases, covering every line of code will tend to touch every logic path anyway, and it's the less burdensome goal.
  2. Use hypothesis for the more sensitive numeric functions. In “data wrangling” applications there may not be too many of these. In the machine learning and model application software, there may be more sophisticated math that benefits from hypothesis testing.
  3. Write larger integration tests that mimic pyspark processing, using multiple functions or classes to be sure they work together correctly, but without the added complication of actually using pySpark. This means creating mocks for some of the libraries using unittest.mock objects. This is a fair bit of work, but it pays handsome dividends when debugging. For well-understood pyspark APIs, it should be easy to provide mocked results for the app components under test to use (see the sketch after this list). For the less well-understood parts, the time spent building a mock will often provide useful insight into how (and why) it works the way it does. In rare cases, building the mock suggests a better design that's easier to test.
  4. Finally. Write a few overall acceptance tests that use your modules and also start and run a small pyspark instance from the command line. For this, I really like using behave, and writing the acceptance testing cases using the Gherkin language. This enforces a very formal “Given-When-Then” structure on the test scenarios, and allows you to write in English. You can share the Gherkin with users and other stakeholders to be sure they agree on what the application should do.
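
Here's a minimal sketch of the tier-3 idea. The clean_rows() function is a made-up stand-in for application code that delegates filtering to a pyspark DataFrame; the DataFrame itself is replaced by a unittest.mock.Mock, so no Spark session is needed, and the test only confirms that our code calls the framework the way we intended.

from unittest.mock import Mock, sentinel

def clean_rows(df):
    """Hypothetical application code under test: delegate filtering to the framework."""
    return df.filter("value IS NOT NULL")

def test_clean_rows_filters_nulls():
    mock_df = Mock(name="DataFrame")
    mock_df.filter.return_value = sentinel.FILTERED
    result = clean_rows(mock_df)
    mock_df.filter.assert_called_once_with("value IS NOT NULL")
    assert result is sentinel.FILTERED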

Why?

Each tier of testing builds up a larger, and more complete picture of the overall application. 

More important, we don't emphasize running pySpark and testing it. It already works. It has its own tests. We need to test the stuff we wrote, not the framework.

We need to test our code in isolation.

We need to test integrated code with mocked pySpark.

Once we're sure our code is likely to work, the next step is confirmation that the important parts do work with pySpark. For life-critical applications, the integration tests will need to touch 100% of the logic paths. For data analytics, extensive integration testing is a lot of cost for relatively little benefit.

Even for data analytics, testing is a lot of work. The alternative is hope and prayer. I suggest starting with small unit tests, and expanding from there.

Tuesday, November 15, 2022

Generators as Stacks of Operations

See https://towardsdatascience.com/building-generator-pipelines-in-python-8931535792ff 

I'm delighted by this article. 

I was shown only the first, horrible, example. I think the idea was to push back on the idea of complex generators. I fumed.

Then I read the entire article.

Now I'm fuming at someone who posted the first example -- apparently having failed to read the rest of the post.

This idea of building a stack of iterators is very, very good.

The example (using simple operations) can be misleading. A follow-on example doing something like file parsing might be helpful. But, if you go too far, you wind up writing an entire book about Functional Programming in Python.
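
To make the idea concrete, here's a minimal sketch of a generator pipeline doing simple file parsing. The file layout and field names are invented for illustration; the point is that each stage consumes the previous one lazily, and nothing runs until the final sum().

from collections.abc import Iterable, Iterator

def non_blank(lines: Iterable[str]) -> Iterator[str]:
    """Drop blank lines."""
    return (line for line in lines if line.strip())

def parsed(lines: Iterable[str]) -> Iterator[list[str]]:
    """Split each comma-separated line into fields."""
    return (line.rstrip().split(",") for line in lines)

def errors_only(rows: Iterable[list[str]]) -> Iterator[list[str]]:
    """Keep rows whose first field is the severity 'ERROR'."""
    return (row for row in rows if row[0] == "ERROR")

def error_count(path: str) -> int:
    with open(path) as source:
        # The stack of generators: evaluation is driven by sum().
        return sum(1 for _ in errors_only(parsed(non_blank(source))))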

Tuesday, October 25, 2022

Some Functional Programming in Python material

This is bonus content for the forthcoming Functional Python Programming 3rd edition book. It didn't make it into the book because -- well -- it was just too much of the wrong kind of detail.

See this "Tough TCO" document for some thoughts on Tail-Call Optimization that can be particularly difficult. This isn't terribly original, but I think it's helpful for folks working through more complex problems from a functional perspective.

"Why a PDF?" I've been working with with LaTeX, and the switching to other ways of editing and presenting code seemed like too much work.

Tuesday, July 19, 2022

I've got a great Proof-of-Concept. How do I go forward with it?

This is the best part about Python -- you can build something quickly. And it really works.

But. 

What are the next steps?

While there are a *lot* of possibilities, I'm focused on an "enterprise work group" application that involves a clever web service/RESTful API built in Flask. Maybe with NLP.

Let me catalog a bunch of things you might want to think about to "productionize" your great idea. Here's a short list to get started.

  • File System Organization
  • Virtual Environments
  • Unit Testing
  • Integration Testing
  • Acceptance Testing
  • Static Analysis
  • Tool Chain
  • Documentation
Let's dive into each one of these. Then we'll look at Flask deployments.

File System Organization

When you've gotten something to work, the directory in which it works is sometimes not organized ideally. There are a lot of ways to do this, but what seems to work well is a structure like the following.

- Some parent directory. Often in Git
  - src -- your code is here
  - tests -- your tests are here
  - docs -- your documentation will be here
  - requirements.txt -- the list of packages to install, with exact, pinned version numbers
  - requirements-dev.txt -- the list of packages used for maintenance and development
  - environment.yml -- another list of packages, in conda format
  - pyproject.toml -- this has your tox setup in it
  - Makefile -- sometimes helpful

Note that a lot of packages you see have a setup.py. This is **only** needed if you're going to open-source your code. For enterprise projects, this is not the first thing you will focus on. Ignore it, for now.

Virtual Environments

When you're developing in Python you may not even worry about virtual environments. You have Python. It works. You downloaded NLP and Flask. You put things together and they work.

The trick here is the Python ecosystem is vast, and you have (without really observing it closely) likely downloaded a lot of projects. Projects that depend on projects. 

You can't trust your current environment to be reliable or repeatable. You'll need to use a virtual environment manager of some kind.

Python's built-in virtual environment manager venv is readily available and works nicely. See https://docs.python.org/3/library/venv.html  It's my second choice. 

My first choice is conda. Start with miniconda: https://docs.conda.io/en/latest/miniconda.html. Use this to assemble your environment and retest your application to be sure you've got everything.

You'll be creating (and destroying) virtual environments until you get it right. They're cheap. They don't impact your code in any way. Feel free to make mistakes.

When it works, build conda's environment.yml file and the requirements.txt files. These will let you rebuild the environment. You'll use them with tox for testing.

If you don't use conda, you'll omit the environment.yml.  Nothing else will change.

Unit Testing

Of course, you'll need automated unit tests. You'll want 100% code coverage. You *really* want 100% logic path coverage, but that's aspirational. 100% code coverage is a lot of work and uncovers enough problems that the extra testing for all logic paths seems unhelpful.

You have two built-in unit testing toolsets: doctest and unittest. I like doctest. https://docs.python.org/3/library/doctest.html
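
Here's the doctest style in miniature, with a made-up function; the examples in the docstring are the test cases, and doctest.testmod() runs them.

def mean(values: list[float]) -> float:
    """
    Average of a non-empty list of values.

    >>> mean([1.0, 2.0, 3.0])
    2.0
    """
    return sum(values) / len(values)

if __name__ == "__main__":
    import doctest
    doctest.testmod()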

You'll want to get pytest and the pytest-cov add-on package. https://docs.pytest.org/en/6.2.x/contents.html  https://pytest-cov.readthedocs.io/en/latest/.  

Your test modules go in the tests directory. You know you've done it right when you can use the pytest command at the command line and it finds (and runs) all your tests. 

This is part of your requirements-dev.txt file.

Integration Testing

This is unit testing without so many mocks. I recommend using pytest for this, also. The difference is that your "fixtures" will be much more complex. Files. Databases. Flask Clients. Certificates. Maybe starting multiple services. All kinds of things that have a complex setup and perhaps a complex teardown, also.

See https://docs.pytest.org/en/6.2.x/fixture.html#yield-fixtures-recommended for good ways to handle this more complex setup and teardown.
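
A small sketch of the yield-fixture pattern, using a throwaway SQLite database as the "complex" resource; the setup runs before the yield, the teardown after it. A real integration fixture would stand up something bigger, but the shape is the same.

import sqlite3
import pytest

@pytest.fixture
def database(tmp_path):
    """Set up a scratch database, hand it to the test, then tear it down."""
    connection = sqlite3.connect(str(tmp_path / "scratch.db"))
    connection.execute("CREATE TABLE events (name TEXT)")
    yield connection
    connection.close()

def test_insert(database):
    database.execute("INSERT INTO events VALUES ('started')")
    rows = list(database.execute("SELECT name FROM events"))
    assert rows == [("started",)]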

Acceptance Testing

Depending on the community of users, it may be necessary to provide automated acceptance tests. For this, I recommend behave. https://behave.readthedocs.io/en/stable/ You can write the test cases in the Gherkin language. This language is open-ended, and many stakeholders can contribute to the test cases. It's not easy to get consensus sometimes, and a more formal Gherkin test case lets people debate, come to an agreement, and prioritize the features and scenarios they need to see.

This is part of your requirements-dev.txt file.

Static Analysis

This is an extra layer of checking to be sure best practices are being followed. There are a variety of tools for this. You *always* want to process your code through black. https://black.readthedocs.io/en/stable/

Some folks love isort for putting the imports into a canonical order.  https://pycqa.github.io/isort/

Flake8 should be used to be sure there are no obviously bad programming practices. https://flake8.pycqa.org/en/latest/

I'm a huge fan of type hints. I consider mypy to be essential. https://mypy.readthedocs.io/en/stable/  I prefer "--strict" mode, but that can be a high bar. 

Tool Chain

You can try to manage this with make. But don't.

Download tox, instead.  https://tox.wiki/en/latest/index.html  

The point of tox is to combine virtual environment setup with testing in that virtual environment. You can -- without too much pain -- define multiple virtual environments. You can then test the various releases of the various packages your project depends on in various combinations. This is how to manage a clean upgrade. 

1. Figure out the new versions.

2. Setup tox to test existing and new.

3. Run tox.

I often set the tox commands to run black first, then unit testing, then static analysis, ending with mypy --strict.

When the code is reformatted by black, it's technically a build failure. (You should have run black manually before running tox.) When tox works cleanly, you're ready to commit and push and pull request and merge.

Documentation

Not an after-thought.

For human documents, use Sphinx. https://www.sphinx-doc.org/en/master/ 

Put docstrings in every package, every module, every class, every method, and every function. Summarize *what* and *why*. (Don't explain *how*: people can read your code.) 
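
A tiny, invented example of the kind of docstring this implies -- the *what* and the *why*, not the *how*:

def daily_digest(events: list[str]) -> str:
    """Combine one day's events into a single summary message.

    Why: stakeholders asked for one notification per day, not one per event.
    """
    return "\n".join(events)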

Use the autodoc feature to create the API reference documentation from the code. Start with this.

Later, you can write a README, and some explanations, and installation instructions, and all the things other people expect to see.

For a RESTful API, be sure to write an OpenAPI specification and be sure to test against that spec. https://www.openapis.org. While a lot of the examples are complicated, you can easily use a small subset to describe your documents, the validation rules, and the transactions. You can add the security details later. They're part of your web server, but they don't need extensive OpenAPI documentation at the beginning.

Flask Deployments

Some folks like to define a flask application that can be installed in the Python virtual environment. This means the components are on the default sys.path without any "extra" effort. (It's a fair amount of effort to begin with. I'm not sure it's worth it.)

When you run a flask app, you'll be using some kind of engine. NGINX, uWSGI, GUnicorn, etc. (GUnicorn is very nice. https://gunicorn.org). 

See https://flask.palletsprojects.com/en/2.0.x/deploying/wsgi-standalone/.

In all cases, these engines will "wrap" your Flask application. You'll want to make your application visible by setting the PYTHONPATH environment variable, naming your src directory. Do not run from your project's directory.

You will have the engine running in some distinct /opt/the_app or /Users/the_app or /usr/home/the_app or some such directory, unrelated to where the code lives. You'll use GUnicorn's command-line options to locate your app, wherever it lives on the filesystem. GUnicorn will use PYTHONPATH to find your app. Since web servers often run as nobody, you'll need to make sure your code base is readable. But. Not. Writable.

Tuesday, June 21, 2022

My Shifting Understanding and A Terrible Design Mistake

I've been fascinated by Literate Programming forever. 

I have two utterly divergent takes on this.

See https://github.com/slott56/PyLit-3 for one.

See https://github.com/slott56/py-web-tool for another.

And yet, I've still done a really bad design job. Before we get to the design, a little bit of back story.

Back Story

Why two separate literate programming projects? Because it's not clear what's best. It's a field without too many boundaries and a lot of questions about the value produced.

PyLit I found, forked, and upgraded to Python 3. I didn't design it. It's far more clever than something I'd design.

Py-Web-Tool is something I wrote based on using a whole bunch of tools that follow along behind the original WEB tools. Nothing to do with web servers or web.py.

The Problem Domain

The design problem is, in retrospect, pretty obvious. I set it out here as a cautionary tale.

I'm looking at the markup languages for doing literate programming. The idea is to have named blocks of code in your document, presented in an order that makes sense to your reader. A tool will "weave" a document from your source. It will also "tangle" source code by rearranging the code snippets from presentation order into compiler-friendly order.

This means you can present your core algorithm first, even though it's buried in the middle of some module in the middle of your package. 

The presentation order is *not* tied to the order needed by your language's toolchain.

For languages like C this is huge freedom. For Python, it's not such a gigantic win.

The source material is a "web" of code and information about the code. A web file may look like this:

Important insight.

@d core feature you need to know about first @{
    def somecode() -> None:
        pass
@}

And see how this fits into a larger context?

@d something more expansive @{
def this() -> None:
    pass
    
def that() -> None:
    pass
    
@<core feature you need to know about first@>
@}

See how that works?

This is easy to write and (relatively) easy to read. The @<core feature you need to know about first@> becomes a hyperlink in the published documentation. So you can flip between the sections. It's physically expanded inline to tangle the code, but you don't often need to look at the tangled code.

The Design Question

The essential Literate Programming tool is a compiler with two outputs:

  • The "woven" document with markup and such
  • The "tangled" code files which are code, largely untouched, but reordered.

We've got four related problems.

  1. Parsing the input
  2. An AST we can process
  3. Emitting tangled output from the AST
  4. Emitting woven output from the AST

Or, we can look at it as three classic problems: deserialization, AST representation, and serialization. Additionally, we have two distinct serialization alternatives.

What did I do?

I tackled serialization first. Came up with a cool bunch of classes and methods to serialize the two kinds of documents.

Then I wrote the deserialization (or parsing) of the source WEB file. This is pretty easy, since the markup is designed to be as trivial as possible. 

The representation is little more than glue between the two.

What a mistake.

A Wrong Answer

Focusing on serialization was an epic mistake.

I want to try using Jinja2 for the markup templates instead of string.Template.

However. 

My AST was such a bad hack job it was essentially impossible to use it. It was a quagmire of inconsistent ad-hoc methods to solve a specific serialization issue.

As I started down the Jinja road, I found a need to be able to build an AST without the overhead of parsing.

Which caused me to realize that the AST was -- while structurally sensible -- far from the simple ideal.

What's the ideal?

The Right Answer

This ideal AST is something that lets me build test fixtures like this:

example = Web(
   chunks=[
       TextChunk("\n"),
       NamedCodeChunk(name="core feature you need to know about first", lines=["def someconme() -> None: ...", "pass"])),
       TextChunk("\nAnd see how this fits into a larger context?\n"),
       NamedCodeChunk(name="something more expansive", lines=[etc. etc.])
   ]
)

Here's my test for usability: I can build the AST "manually" without a parser. 

The parser can build one, also, but I can build it as a sensible, readable, first-class Python object.

This has pointed me to a better design for the overall constructs of the WEB source document. Bonus. It's helping me define Jinja templates that can render this as a sensible woven document.

Tangling does not need Jinja. It's simpler. And -- by convention -- the tangled code does not have anything injected into it. The woven code is in a markup language (Markdown, RST, HTML, LaTeX, ASCII DOC, whatever) and some markup is required to create hyperlinks and code sections. Jinja is super helpful here. 

TL;DR

The essence of the problem is rarely serialization or deserialization.  It's the internal representation.


Tuesday, June 14, 2022

A LaTeX Thing I Did -- And A ToDo:

When writing about code in LaTeX, the essential strategy is to use an environment to format the code so it stands out from surrounding text. There are a few of these environments available as LaTeX add-on packages. The three popular ones are:

  • verbatim (built in, with the \verb command and verbatim environment)
  • listings (with \lstinline, \lstlisting, and \lstinputlisting)
  • minted

These are nice for making code readable and distinct from the surrounding text.

A common way to talk about the code is to use inline verbatim \verb|code| sections. I prefer inline \lstinline|code|, but, my editor prefers \verb. (I have trouble getting all the moving parts of minted installed properly, so I use listings.)

Also. And more important. 

There's the \lstinputlisting[language=Python, firstline=2, lastline=12]{some_module.py} command. This lets an author incorporate examples from working, tested modules. Minted doesn't seem to have this, but it might work with an \input command. Don't know. Haven't tried.

Let's talk about workflow.

Workflow

The idea behind these tools is you have code and after that, you write about the code. I call this code first.

Doing this means you can include code snippets from a file.

Which is okay, but, there's another point of view: you have a document that contains the code. This is closer to the Literate Programming POV. I call this document first. I've got all the code in the document you're reading, I've just broken it up and spread it around in an order to serve my purpose as a writer, not serve the limitations of a parser or compiler.

There is a development environment -- WEB -- to create code that can be run through the Weave and Tangle tools to create working code and usable documentation. This is appealing in many ways. 

For now, I'm settling for the following workflow:

  1. Write the document with code samples. Use \lstlisting environment with explicit unique labels for each snippet. The idea is to focus on the documentation with explanations.
  2. Write a Jinja template that references the code samples. This is a lot of {{extract['lst:listing_1']}} kind of references. There's a bit more that can go in here, we'll return to the templates in a moment.
  3. Run a tool to extract all the \lstlisting environments to a dictionary with the label as the key and the block of text as the value. This serializes nicely as a JSON (or TOML or YAML) file. It can even be pickled, but I prefer to be able to look at the file to see what's in it.
  4. The tool to populate the template is trivial: build a Jinja environment, load up the template, fill in the code samples, and write the result. (See the sketch after this list.)
  5. I can then use tox (and doctest and pytest and mypy) to test the resulting module to be sure it works.
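
Here's a rough sketch of steps 3 and 4. The file names, the label= convention, and the regular expression are simplifications of what a real tool would need, but the overall shape -- extract to a dict, serialize, render a Jinja template with it -- is the idea.

import json
import re
from pathlib import Path
from jinja2 import Environment

LISTING = re.compile(
    r"\\begin\{lstlisting\}\[[^]]*label=(?P<label>[^,\]]+)[^]]*\]\n"
    r"(?P<body>.*?)"
    r"\\end\{lstlisting\}",
    re.DOTALL,
)

def extract(source: Path) -> dict[str, str]:
    """Step 3: map each listing's label to its block of text."""
    return {
        match.group("label"): match.group("body")
        for match in LISTING.finditer(source.read_text())
    }

def tangle(template_path: Path, extracts: dict[str, str]) -> str:
    """Step 4: fill in the {{extract[...]}} references in a template."""
    template = Environment().from_string(template_path.read_text())
    return template.render(extract=extracts)

if __name__ == "__main__":
    extracts = extract(Path("ch01.tex"))
    Path("extracts.json").write_text(json.dumps(extracts, indent=2))
    print(tangle(Path("ch01_template.py.jinja"), extracts))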

This tangles code from a source document. There's no weave step, since the source is already designed for publication. This does require me to make changes to the LaTeX document I'm writing and run a make test command to extract, tangle, and test. This is not a huge burden. Indeed, it's easy to implement in PyCharm, because the latest release of PyCharm understands Makefiles and tox. Since each chapter is a distinct environment, I can use tox -e ch01 to limit the testing to only the chapter I'm working on.

I like this because it lets me focus on explanation, not implementation details. It helps me make sure that all the code in the book is fully tested. 

The Templates

The template files for an example module have these three kinds of code blocks:

  1. Ordinary Listings. These fall into two subclasses.
    1. Complete function or class definitions.
    2. Lines of code taken out of context.
  2. REPL Examples. 

These have three different testing requirements. We'll start with the "complete function or class definitions." For these, the template might look like the following:

{{extract['lst:listing_1']}}

def test_listing_1() -> None:
    assert listing_1(42)
    assert not listing_1(None)

This has both the reference to the code in the text of the book and a test case for the code.

For lines of code out of context, we have to be more careful. We might have this.

def some_example(arg: int) -> bool:
    {{extract['lst:listing_2']}}

def test_listing_2() -> None:
    assert listing_2(42)
    assert not listing_2(None)

This is similar to a complete definition, but it has a fiddly indentation that needs to be properly managed, also. Jinja's generally good about not inserting spaces. The template, however, is full of what could appear to be syntax errors, so the code editor could have a conniption with all those {} blocks of code. They happen to be valid Python set literals, so, they're tolerated. PyCharm's type checking hates them.

The REPL examples look like this.

REPL_listing_3 = """
{{extract['lst:listing_3']}}
"""

I collect these into a __test__ variable to make them easy for doctest to find. The extra fussiness of  a __test__ variable isn't needed, but it provides a handy audit for me to make sure everything has a home.

The following line of code is in most (not all) templates.

__test__ = {
    name: value
    for name, value in globals().items() 
    if name.startswith("REPL")
}

This will locate all of the global variables with names starting with REPL and put them in the __test__ mapping. The REPL names then become the test case names, making any test failures easier to spot.

My Goal

I do have some Literate Programming tools that I might be able to leverage to make myself a Weaver that produces useful LaTeX my publisher can work with. I should do this because it would be slightly simpler. The problem is my Web/Weave/Tangle tooling has a bunch of dumb assumptions about the weave and tangle outputs; a problem I really need to fix.

See py-web-tool.

The idea here is to mimic other WEB-based tooling. These are the two primary applications:

  • Weave. This makes documentation in a fairly transparent way from the source. There are a bunch of substitutions required to fill in HTML or LaTeX or Markdown or RST around the generic source. Right now, this is pretty inept and almost impossible to configure.
  • Tangle. This makes code from the source. The point here is the final source file is not necessarily built in any obvious order. It's a tangle of things from the documentation, put into the order required by parser or compiler or build system or whatever.

The weaving requires a better way to provide the various templates that fill in missing bits. Markdown, for example, works well with fenced blocks. RST uses a code directive that leads to an extra level of indentation that needs to be carefully excised. Further, most markup languages have a mountain of cruft that goes around the content. This is unpleasantly complex, and very much subject to odd little changes that don't track against the content, but are part of the evolution of the markup language.

My going-in assumption on tangling was the document contained all the code. All of it. Without question or exception. For C/C++ this means all the fiddly little pre-processor directives that add no semantic clarity yet must be in the code file. This means the preprocessor nonsense had to be relegated to an appendix of "yet more code that just has to be there."

After writing a tangler to pull code from a book into a variety of contexts, I'm thinking I need to have a tangler that works with a template engine. I think there would be the following two use cases:

  • No-Template Case. The WEB source is complete. This works well for a lot of languages that don't have the kind of cruft that C/C++ has. It generally means a WEB source document will contain definition(s) for the final code file(s) as a bunch of references to the previously-explained bits. For C/C++, this final presentation can include the fiddly bits of preprocessor cruft.
  • Template Case. A template is used with the source to create the tangled output. This is what I have now for pulling book content into a context where it is testable. For the most part, the template files are quite small because the book includes test cases in the form of REPL blocks. This presents a bit of a problem because it breaks the "all in one place" principle of a WEB project. I have a WEB source file with the visible content plus one or more templates with invisible content.

What I like about this is an attempt to reduce some of the cruftiness of the various tools. 

I think my py-web-tool might be expanded to handle my expanded understanding of literate programming. 

I have a book to finish, first, though. Then I can look at improving my workflow. (And yes, this is backwards from a properly Agile approach.)

Tuesday, April 12, 2022

Pelican and Static Web Content

In Static Site Blues I was wringing my hands over ways to convert a ton of content from two different proprietary tools (the very old iWeb, and the merely old Sandvox) into something I could work with.

After a bit of fiddling around, I'm delighted with Pelican.

First, of course, I had to extract all the iWeb and Sandvox content. This was emphatically not fun. While both used XML, they used it in subtly different ways. Apple's frameworks serialize internal state as XML in a way that preserves a lot of semantic details. It also preserves endless irrelevant details.

I wound up with a Markdown data structure definition, plus a higher-level "content model" with sites, pages, blogs, blog entries and images. Plus the iWeb extractor and the Sandvox extractor. It's a lot of code, much of which lacks solid unit test cases. It worked -- once -- and I was tolerant of the results.

I also wound up writing tools to walk the resulting tree of Markdown files doing some post-extraction cleanup. There's a lot of cleanup that should be done.

But.

I can now add to the blog with the state of my voyaging. I've been able to keep Team Red Cruising up to date.

Eventually (i.e., when the boat is laid up for Hurricane Season) I may make an effort to clean up the older content and make it more consistent. In particular, I need to add some annotations around anchorages to make it possible to locate all of the legs of all of the journeys. Since the HTML is what most people can see, that means a class identifier for lat-lon pairs. 

As it is, the blog entries are *mostly* markdown. Getting images and blockquotes even close to readable requires dropping to HTML to make direct use of the bootstrap CSS. This also requires some comprehensive cleanup to properly use the Bootstrap classes. (I think I may have introduced some misspelled CSS classes into the HTML that aren't doing anything.)

For now, however, it works. I'm still tweaking small things that require republishing *all* the HTML. 

Tuesday, February 8, 2022

Desktop Notifications and EPIC DESIGN FAIL

I was asked to review code that -- well -- was evil.

Not like "shabby" or "non-pythonic". Nothing so simple as that.

We'll get to the evil in a moment. First, we have to suffer two horrible indignities.

1. Busy Waiting

2. Undefined Post-Conditions.

We'll beat all three issues to death separately, starting with busy waiting.

Busy Waiting

The Busy Waiting is a sleep-loop. If you're not familiar, it's this:

while something has not happened yet AND we haven't timed out:
    time.sleep(2)
    

Which is often a dumb design. Busy waiting is polling. It's a lot of pointless doing something while waiting for something else.

There are dozens of message-passing and event-passing frameworks. Any of those is better than this.

Folks complain "Why install ZMQ when I could instead write a busy-waiting loop?"

Why indeed?

For me, the primary reason is to avoid polling at fixed intervals, and instead wait for the notification. 

The asyncio module, confusing as it is, is better than polling. Because it dispatches events properly.
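
For contrast, here's a minimal asyncio sketch (the names are invented). The waiter does nothing at all until the event is set, and the timeout is a distinct, explicit exception rather than a coin-toss between post-conditions.

import asyncio

async def waiter(event: asyncio.Event) -> str:
    try:
        await asyncio.wait_for(event.wait(), timeout=10.0)
        return "the thing happened"
    except asyncio.TimeoutError:
        return "timed out"

async def main() -> None:
    event = asyncio.Event()
    # Some other task (or callback) signals completion by calling event.set().
    asyncio.get_running_loop().call_later(1.0, event.set)
    print(await waiter(event))

asyncio.run(main())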

This is minor compared with the undefined post-conditions.

Undefined Post-Conditions

With this crap design, there are two events. There's a race between them. One will win. The other will be silently lost forever.

If "something has not happened" is false, the thing has happened. Yay. The while statement ends.

If "something has not happened" is true and the timeout occurs, then Boo. The while statement ends.

Note that there are two, unrelated post-conditions: the thing has happened OR the timeout occurred. Is it possible for both to happen? (hint: yes.)

Ideally, the timeout and the thing happening are well-separated in time.

Heh.

Otherwise, they're coincident, and it's a coin-toss as to which one will lead to completion of the while statement. 

The code I was asked to review made no provision for this unhappy coincidence. 

Which leads us to the pure evil.

Pure Evil

What's pure evil about this is the very clear statement that there are not enough desktop notification apps, and there's a need for another.

I asked for justification. Got a stony silence.

They might claim "It's only a little script that runs in the Terminal Window," which is garbage. There are already lots and lots of desktop apps looking for asynchronous notification of events.

Email is one of them.

Do we really need another email-like message queue?  

(Hint: "My email is a lot of junk I ignore" is a personal problem, not a software product description. Consider learning how to create filters before writing yet another desktop app.)

Some enterprises use Slack for notifications. 

What makes it even worse (I said it was pure evil) was a hint about the context. They were doing batch data prep for some kind of analytics/Machine Learning thing. 

They were writing this as if Luigi and related Workflow managers didn't exist.

Did they not know? If they were going to invent their own, they were off to a really bad start. Really bad.

Tuesday, January 25, 2022

No one wins at Code Golf vs. This is more noise than signal

Looking at code. Came to a 20-line block of code that did exactly this.

sorted(Path.cwd().glob("some_pattern[1-9]*.*"), reverse=True)

Twenty lines. Seriously. 

To be fair, 8 of the 20 lines were comments. 3 were blank. Which leaves 9 lines of code to perform the task of a one-liner.

I often say "no one wins at code golf" as a way to talk people out of trying to minimize Python code into vanishingly small black holes where no information about the code's design escapes.

However. Blowing a line of code into 9 lines seems to be just as bad. 

I'll spare you the 9 lines. I will say this, though, the author was blissfully ignorant that Path objects are comparable. So. There were needless conversions. And. Even after commenting on this, they seemed to somehow feel (without evidence of any kind) that Path objects were incomparable.

This is not the first time I've seen folks who like assembler-style code. There is at most one state-change or attribute reference on each line of code. The code has a very voluble verticality (VVV™).

This seems as wrong as code golf.  Neither style provides meaningful code. 

How can we measure "meaningful"?

Of the 8 lines of comments, the English summary -- the "reverse alphabetic order" phrase -- is only a few words. Therefore, the matching code can be an equally terse few symbols. I think code can parallel natural language.

Tuesday, January 18, 2022

How to Test a Random Number Generator

Nowadays, we don't have the same compelling reasons to test a random number generator. The intervening decades have seen a lot of fruitful research. Good algorithms.

Looking back to my 1968 self, however, I still feel a need to work out the solution to an old problem. See The Old Days -- ca. 1968 for some background on this.

What could I have done on that ancient NCE Fortran -- with four digit integers -- to create random numbers? Step 1 was to stop using the middle-squared generator. It doesn't work.

Step 2 is to find a Linear Congruential Generator that works. LCG's have a (relatively) simple form:

\[X_{n+1} = (X_n \times a + c) \bmod m\]

In this case, the modulo value, m, is 10,000. What's left is step 3: find a and c parameters.

To find suitable parameters, we need a battery of empirical tests. Most of them are extensions to the following class:

from collections import Counter
from typing import Hashable
from functools import cache

class Chi2Test:
    """The base class for empirical PRNG tests based on the Chi-2 testing."""
    
    #: The actual distribution, created by ``test()``.
    actual_fq : dict[Hashable, int]
    
    #: The expected distribution, created by ``__init__()``.
    expected_fq: dict[Hashable, int]
    
    #: The lower and upper bound on acceptable chi-squared values.
    expected_chi_2_range: tuple[float, float]
    
    def __init__(self):
        """
        A subclass will override this to call ``super().__init__()`` and then
        create the expected distribution.
        """
        self._chi2 = None
    
    def test(self):
        """
        A subclass will override this to call ``super().test()`` and then
        create an actual distribution, usually with a distinct seed value.
        """
        self._chi2 = None
        
    @property
    def chi2(self) -> float:
        """Return chi-squared metric between actual and expected observations."""
        if self._chi2 is None:
            a_e = (
                (self.actual_fq[k], self.expected_fq[k]) 
                for k in self.expected_fq 
                if self.expected_fq[k] > 0
            )
            v = sum((a-e)**2/e for a, e in a_e)
            self._chi2 = v
        return self._chi2

    @property
    def pass_test(self) -> bool:
        return self.expected_chi_2_range[0] <= self.chi2 <= self.expected_chi_2_range[1]

This defines the essence of a chi-squared test. There's another test that isn't based on chi-squared: the serial correlation test, where a correlation coefficient is computed between adjacent pairs of samples. We'll ignore this special case for now. Instead, we'll focus on the battery of chi-squared tests.

Linear Congruential Pseudo-Random Number Generator

We'll also need an LC PRNG that's constrained to 4 decimal digits.

It looks like this:

class LCM4:
    """Constrained by the NCE Fortran 4-digit integer type."""
    def __init__(self, a: int, c: int) -> None:
        self.a = a
        self.c = c
    def seed(self, v: int) -> None:
        self.v = v
    def random(self) -> int:
        self.v = (self.a*self.v % 10_000 + self.c) % 10_000
        return self.v

This mirrors the old NCE Fortran on the IBM 1620 computer. 4 decimal digits. No more. 

We can use this to generate a pile of samples that can be evaluated. I'm a fan of using generators because they're so efficient. The use of a set to create a list seems weird, but it's very fast.

from collections.abc import Callable, Hashable, Iterator

N_SAMPLES = 3_200  # default pile size; the tests below use piles of 3,200 samples

def lcg_samples(rng: LCM4, seed: int, n_samples: int = N_SAMPLES) -> list[int]:
    """
    Generate a bunch of sample values. A repeat implies a cycle, and we'll stop early.

    >>> lcg_samples(LCM4(1621, 3), 1234)[:12]
    [317, 3860, 7063, 9126, 3249, 6632, 475, 9978, 4341, 6764, 4447, 8590]

    """
    rng.seed(seed)
    def until_dup(f: Callable[..., Hashable], n_samples: int) -> Iterator[Hashable]:
        seen: set[Hashable] = set()
        while (v := f()) not in seen and len(seen) < n_samples:
            seen.add(v)
            yield v
    return list(until_dup(rng.random, n_samples))

This function builds a list of values for us. We can then subject the set of samples to a battery of tests. We'll look at one test as an example for the others. They're each devilishly clever, and require a little bit of coding smarts to get them to work correctly and quickly.

Frequency Test

Here's one of the tests in the battery of chi-squared tests. This is the frequency test that examines values to see if they have the right number of occurrences. We pick a domain, d, and parcel numbers out into this domain. We use \(\frac{d \times X_{n}}{10,000}\) because this tends to leverage the left-most digits which are somewhat more random than the right-most digits.

class FQTest(Chi2Test):
    expected_chi_2_range = (7.261, 25.00)

    def __init__(self, d: int = 16, size_samples: int = 6_400) -> None:
        super().__init__()
        #: Size of the domain
        self.d = d
        #: Number of samples expected
        self.size_samples = size_samples
        #: Frequency for Chi-squared comparison
        self.expected_fq = {e: int(self.size_samples/self.d) for e in range(self.d)}
    
    def test(self, sequence: list[int]) -> None:
        super().test()
        self.actual_fq = Counter(int(self.d*s/10_000) for s in sequence)

We can apply this test to some samples, compare with the expectation, and save the chi-squared value. This lets us look at LCM parameters to see if the generator creates suitably random values.

The essential test protocol is this:

samples = lcg_samples(LCM4(1621, 3), seed=1234)
fqt = FQTest()
fqt.test(samples)
fqt.chi2

This creates some samples and applies the frequency test. The next step is to examine the chi-squared value to see if it's in the allowable range, \(7.261 \leq \chi^2 < 25\).

The search space

Superficially, it seems like there could be 10,000 choices of a and 10,000 choices of c parameter values for this PRNG. That's 100 million combinations. It takes a bit of processing to look at all of those. 

Looking more deeply, the values of c are often small prime numbers. 1 or 11 or some such. That really cuts down on the search. The values of a have a number of other constraints with respect to the modulo value. Because 10,000 has factors of 4 and 5, this suggests values like \(20k + 1\) will work. Sensible combinations are defined by the following domain:

combinations = [
    (a, c)
    for c in (1, 3, 7, 11,)
    for a in range(21, 10_000, 20)
]

This is 2,000 distinct combinations, something we can compute on our laptop. 

The problem we have trying to evaluate these is that each combination's testing is compute-intensive. This means we want to use as many cores of our machine as we have available. We don't want to process each combination serially on a single core. A thread pool isn't going to help much because the GIL keeps compute-bound Python threads from using more than one core at a time.

Because the OS likes to scatter processes among all the cores, we need a process pool.

Here's how to spread the work among the cores:

    from concurrent.futures import ProcessPoolExecutor, as_completed
    from rich.progress import Progress  # assuming rich for the progress bars shown below

    combinations = [
        (a, c)
        for c in (1, 3, 7, 11)
        for a in range(21, 10_000, 20)
    ]

    with Progress() as progress:
        setup_task = progress.add_task("setup ...", total=len(combinations))
        finish_task = progress.add_task("finish...", total=len(combinations))

        with ProcessPoolExecutor(max_workers=8) as pool:
            futures = [
                pool.submit(evaluate, (a, c))
                for a, c in progress.track(combinations, task_id=setup_task, total=len(combinations))
            ]
            results = [
                f.result()
                for f in progress.track(as_completed(futures), task_id=finish_task, total=len(combinations))
            ]

This will occupy *all* the cores of the computer executing the `evaluate()` function. This function applies the battery of tests to each combination of a and c. We can then check the results for combinations where the chi-squared results for each test are in the acceptable ranges for the test.
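
The evaluate() function isn't shown above. Here's a sketch of what it might look like, using the LCM4, lcg_samples(), and FQTest pieces from earlier; a fuller version would run the whole battery of tests and report a pair of chi-squared values for each one.

def evaluate(params: tuple[int, int]) -> tuple[int, int, float, float]:
    """Run the frequency test over a couple dozen seeds; report min and max chi-squared."""
    a, c = params
    rng = LCM4(a, c)
    chi2_values = []
    for seed in range(1, 256, 11):
        samples = lcg_samples(rng, seed)
        fqt = FQTest(size_samples=len(samples))
        fqt.test(samples)
        chi2_values.append(fqt.chi2)
    return a, c, min(chi2_values), max(chi2_values)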

It's fun.

TL;DR

Using a=1621 and c=3 generates acceptable random numbers using only 4 decimal digits.

Here's some output using only a subset of the tests.

(rngtest2) % python lcmfinder.py
setup ... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
finish... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
2361  1  11.46  14.22  46.64  63.76   2.30  11.33   2.16   2.16 
 981  3  10.28  15.24  52.56  66.32   2.28  11.08  10.47  10.47 
1221  3  10.19  14.12  48.72  62.08   3.03  10.08   2.59   2.59 
1621  3  11.70  14.91  47.12  69.52   2.23   9.69   0.86   0.86 

The output shows the a and c values followed by the minimum and maximum chi-squared values for each test. The chi-squared values are in pairs for the frequency test, serial pairs test, gap test, and poker test. 

Each test uses about two dozen seed values to generate piles of 3,200 samples and subject each pile of samples to a battery of tests. The seed values, BTW, are range(1, 256, 11); kind of arbitrary. Once I find the short list of candidates, I can test with more seeds. There are only 10,000 seed values, so, this can be done in finite time.

For example, a=1621, c=3, had chi-squared values between 11.70 and 14.91 for the frequency test. Well within the 7.261 to 25.0 range required. The remaining numbers show that it passed the other tests, also.

For completeness, I intend to implement the remaining half-dozen or so tests. Then I need to make sure the sphinx-produced documentation looks good. I've done this before. http://slott.itmaybeahack.com/_static/rngtest/rngdoc.html It's kind of an obsession, I think.

Looking back to my 1968 self, this would have been better than the middle-squared nonsense that caused me to struggle with bad games that behaved badly.

Tuesday, January 11, 2022

The Old Days -- ca. 2000 -- Empirical Tests of Random Numbers (Python and Chi-Square Testing)

See The Old Days -- ca. 1974 Random Numbers Before Python for some background.

We'll get to Python after reminiscing about the olden days. I want to provide some back story on why sympy has had a huge impact on ordinary hacks like myself.

What we're talking about is how we struggled with randomness before

  1. /dev/random
  2. The Mersenne Twister Pseudo-Random Number Generator (PRNG)

Pre-1997, we performed empirical tests of PRNG's to find one that was random enough for our application. Maybe we were doing random samples of data to compare statistical measures. Maybe we were writing a game. What was important was a way to create a sequence of values that passed a battery of statistical tests.

See https://link.springer.com/chapter/10.1007%2F978-1-4612-1690-2_7 for the kind of material we salivated over. 

While there are an infinite number of bad algorithms, some math reveals that the Linear Congruential Generator (LCG) is simple and effective. Each new number is based on the previous number: \(X_{n+1} = (X_n \times a + c) \bmod m\). There's a multiply and an add, modulo some big number. The actual samples are often a subset of the bits in \(X_{n}\). 

After the Mersenne Twister became widely used, we essentially stopped looking at alternative random number algorithms. Before then -- well -- things weren't so good.

Here are some classics that I tested.

  • The ACM Collected Algorithms (CALGO) number 294 is a random-number generator. This is so obsolete, I have trouble finding links to it. It was a 28-bit generator.
  • The ACM Collected Algorithms (CALGO) number 266 has code still available. See toms/266
  • The Cheney-Kincaid generator is available. See random.f plus dependencies.

These formed a kind of benchmark I used when looking at Python's built-in Mersenne Twister.

Nowadays, you can find a great list of LCM PRNG's at  https://en.wikipedia.org/wiki/Linear_congruential_generator

Python Empirical Testing

One of the early questions I had was whether or not the random module in Python stacked up against these older RNG's that I was a little more familiar with.

So, I wrote a big, fancy random number testing tool in Python. 

When? Around 2000. I started this in the Python 1.6 and 2.1 era. I have files showing results from Python 2.3 (#2, Jul 30 2003). This is about when I stopped fooling around with this and moved on to trusting that Python really did work and was -- perhaps -- the best approach to working with randomly-sampled data for statistical work. 

The OO design for the test classes was Lavish Over The Top (LOTT™) OO:

  • Too Many Methods
  • Too Many Superclasses
  • No Duck Typing

We won't look at that code. It's regrettable and stems from trying to make Python into C++.

What I do want to look at is the essential Chi-Squared test methodology. This is some cool stuff.

Comparing Expected and Actual

The chi-squared metric is a way to compare actual and expected distributions. You can read about it on your own time. It's a way to establish if data is random or there's something else going on that's not random. i.e., a trend or a bias. 

The empirical tests for PRNG's that Knuth defines all come with chi-squared values that bracket acceptable levels of randomness. For the purposes of writing a working set of tests the magic chi-squared values supplied by Knuth are fine. Magical. But fine. Really. Trust them.

If you make modifications, you'd use your statistics text-book. You'd open to the back where it had a Chi-Squared table. That table gave you chi-squared values for a given degree of freedom and a given probability of being random.

Or, You could look for the NIST handbook online. It has a section on chi-squared testing. See https://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm. Same drill. Degrees of freedom and probability map to a chi-squared threshold.

But.

Where do these magical Chi-Squared values come from? This gets interesting in a useless-sidebar kind of way.

Chi-Squared Values

There's a really, really terse summary of chi-squared numbers here: https://www.danielsoper.com/statcalc/formulas.aspx?id=11. This is all you need to know. It may be too terse to help you learn about it, but it's a handy reference.

We need to evaluate two functions: partial gamma and gamma. These are defined as integrals. And they're nasty levels of complexity. Nasty.

This kind of nasty:

\[\gamma (s,z)=\int _{0}^{z}t^{s-1} e^{-t} dt\].

\[\Gamma (z)=\int _{0}^{\infty} t^{z-1} e^{-t} dt\].

These are not easy things to evaluate. Back to the ACM Collected Algorithms (CALGO) to find ways to evaluate these integrals. There are algorithms in CALGO 435 and 654 that are expressed as Fortran for evaluating these. This ain't all, of course, we need Stirling Numbers and Bernoulli Numbers. So there's a lot going on here.

A lot of this can be transliterated from Fortran. The resulting code is frankly quite ugly, and requires extensive test cases. Fortran with GOTO's requires some cleverness to unwind the conceptual for/while/if constructs.

OR.

Enter Sympy

In the 20+ years since I implemented my empirical PRNG tests "the hard way," sympy has come of age.

Check this out

from sympy import Sum, rf
from sympy.abc import k, s, z
from sympy.functions import exp
from sympy import oo
Sum(z**s * exp(-z) * z**k / rf(s, k+1), (k, 0, oo)).simplify()

I could use this in Jupyter Lab to display a computation for the partial gamma function.

\[z^{s}e^{-z}\sum _{k=0}^{\infty }{\dfrac {z^{k}}{s^{\overline {k+1}}}}\]

This requires a fancy Rising Factorial computation, the \(s^{\overline {k+1}}\) term. This is available in sympy as the rf(s, k+1) expression.

It turns out that sympy offers lowergamma() and gamma() as first-class functions. I don't even need to work through the closed-form simplifications.

I could do this...

from sympy import Sum, integrate, oo, rf
from sympy.abc import k, t
from sympy.functions import exp

def gammap(s: float, z: float) -> float:
    return (z**s * exp(-z) * Sum(z**k / rf(s, k+1), (k, 0, oo))).evalf()

def gamma(z: float) -> float:
    return integrate(t**(z-1) * exp(-t), (t, 0, oo)).doit()

It works well. And it provides elegant documentation. But I don't need to. I can write this, instead,

from sympy import lowergamma, gamma

def chi2P(chi2: float, degF: int) -> float:
    return lowergamma(degF/2, chi2/2) / gamma(degF/2)

This is used to compute the probability of seeing a chi-squared value. 

For the frequency test, as an example, we partition the random numbers into 16 bins. This gives us 15 degrees of freedom. We want chi-squared values between 7.2578125 and 25.0.

Or.

Given a chi-squared value of 6.0, the probability is about 0.02 -- suspiciously low, below the 0.05 level that we've decided signifies mostly random. The data is "too random"; that is to say, it's too close to the ideal distribution to be trusted.
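
A quick check of the bracketing values with the chi2P() function above; the numbers are approximate.

chi2P(7.2578125, 15)   # roughly 0.05 -- the lower bound on "acceptably random"
chi2P(25.0, 15)        # roughly 0.95 -- the upper bound
chi2P(6.0, 15)         # roughly 0.02 -- "too random" to be trusted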

The established practice was to look up a chi-squared value because you couldn't easily compute the probability of that value. With sympy, we can compute the probability. It's slow, so we have to optimize this carefully and not compute probabilities more frequently than necessary.

We can, for example, compute chi-squared values for a number of seeds, take the max and min of these and compute the probability of those two boundary values. This will bracket the probability that the pseudo random number generator is producing suitably random numbers.

This also applies to any process we're measuring with results that might vary randomly or might indicate a consistent problem that requires evaluation.

Using sympy eliminates the complexity of understanding these beautifully hand-crafted antique algorithms. It acts as a kind of super-compiler. From Math to an intermediate AST to a concrete implementation.

Tuesday, January 4, 2022

The Old Days -- ca. 1974 -- Random Numbers before Python

See "The Old Old Days -- ca. 1968" for my first exposure to an actual computer. Nothing about Python there. But. It's what motivated me to get started learning to code -- I was fascinated by games that involved randomization. Games with cards or dice.

After filling in a little background, I will get to the Python part of this. First, however, I want to compare the olden days with what we have now.

From 1969 to 1974 I had access to the high school's IBM 1620. This means programming in IBM's SPS assembler, or using the NCE Load-and-Go Fortran compiler. See https://www.cs.utexas.edu/users/EWD/transcriptions/EWD00xx/EWD37.html for a scathing review of the problems with this machine.

See http://www.bitsavers.org/pdf/ibm/1620/GC20-1603-10_1620_Catalog_of_Programs_Jan71.pdf Page 36 has this:

(Catalog entry describing the NCE Load-and-Go Fortran system.)

That's a quick overview of my earliest programming language. What's essential here is the NCE Fortran used 4-digit integers.

I'll repeat that for those skimming, and wondering what the Python connection is.

Four. Digit. Integers.

That's four decimal digits. Decimal digits required at least 4 hardware bits. IBM 1620 digits also had flags and signs, so, there were maybe 6 bits per digit. 24 bits of hardware used to represent just under 14 bits of useful information.

My interest is in simulation and randomness. So. I have this question of how to create random sequences of numbers limited to 4-digit integers.

PRNG Algorithms

There are a number of classic Pseudo-Random Number Generator (PRNG) algorithms from the early days before Mersenne Twister took over in 1997.

We used to be super-careful to emphasize the letter P in PRNG because the numbers weren't really random. They just behaved randomish. This is contrasted with real randomness, also known as entropy. For example, the /dev/random device driver has a fair amount of entropy. I think it's comparable to a person throwing dice across a table. I think it's as random as a noise-generating diode with a sample-and-hold circuit to pluck out random values from the noise.

Pre-Mersenne-Twister -- pre-1997 -- we worried a lot about random number generation. See Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms, Addison-Wesley, 1969. Section 3.3.2. covers empirical testing of random number generators. Section 3.3.1. covers the Chi-squared test for fit between actual and expected frequency distributions.

Back in the olden days, it was stylish to perform an empirical test (or ten) to confirm we really had "good" random numbers. The built-in libraries that came with your compiler or OS could not be trusted without evidence.

One of the classic (bad) PRNG's is the "Middle-Squared" method. See https://en.wikipedia.org/wiki/Middle-square_method. I learned about this in the 70's. And used it in the old NCE Fortran. 

With Four. Digit. Integers.

Did I mention that the Fortran compiler used four decimal digits for integers? That means plucking the middle two digits out of a four-digit number. How random can that be?

Not very. Since the values only have two digits, the longest possible sequence is 100 numbers -- and that only if, by some miracle, you found a seed number with the right properties.

Nowadays I can, in Python, do a quick middle-squared analysis for all 100 seed values.

This kind of thing.

def csqr4(value: int) -> list[int]:
    """The 4 decimal digit center-squared PRNG."""
    sequence = []
    while value not in sequence:
        sequence.append(value)
        # Square the two-digit value, then pluck the middle two digits of the 4-digit square.
        value = ((value**2) // 10) % 100
    return sequence

Which you can run and see that all of my early attempts at games and simulations were doomed. The seed values of 76, 42, and 69 provided kind of long sequences of almost random-seeming numbers. Otherwise, pfft, this was junk computer science. 50% of the seeds provide 5 or fewer numbers before repeating. 
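Here's a quick survey driver (my own sketch, not period code) that measures the cycle length for every two-digit seed:

lengths = {seed: len(csqr4(seed)) for seed in range(100)}
longest = sorted(lengths, key=lengths.get, reverse=True)[:5]
print("Longest runs:", [(seed, lengths[seed]) for seed in longest])
print("Seeds with 5 or fewer values:", sum(1 for n in lengths.values() if n <= 5))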

For blackjack, a few random numbers for shuffling might be enough. For other games, the lack of randomness made the outcomes trivially predictable. 

What's funny is how far the state of the art has moved since then.

  1. Hardware now has vastly more storage than the 1620's 20,000 decimal digits (about 10K bytes).
  2. Software with algorithms that are really, really clever.

It's hard to overstate these two advances, particularly the second one. I'll return to the algorithm thing a lot in the next few posts.

My focus was on games and randomization. Ideally, simple stuff. But... under the hood, it's not simple. I've spent some time (not much, and not in much depth) looking at the tip of this iceberg. 

It served me as an incentive to dive just a little more deeply into a topic, like math or a programming language or a statistical tool.

Tuesday, November 23, 2021

Processing Apple Numbers Files

Apple's freebie tools -- Pages, Numbers, Keynote, Garage Band, etc. -- are wonderful things. I really like Numbers. I'm tolerant of Pages. I've used Pages to write books and publish them to the Apple Bookstore. (Shameless Plug: Pivot to Python.)

These tools have a significant problem. Protobuf.

Some History

Once upon a time, Numbers used an XML-based format. This was back in '09, I think. At some point, version 10 of Numbers (2013?) switched to Protobuf.

I had already unwound XLSX and ODS files, which are XML. I had also unwound Numbers '09 in XML. I had a sense of what a spreadsheet needed to look like.

The switch to Protobuf also meant using Snappy compression. Back in 2014? I worked out my own version of the Snappy decompression algorithm in pure Python. I think I knew about python-snappy but didn't want the complex binary dependency. I wrote my own instead.

I found the iWorkFileFormat project. From this, and a lot of prior knowledge about the XML formats, I worked out a way to unpack the protobuf bytes into Python objects. I didn't leverage the formal protobuf definitions; instead I lazily mapped the objects to a dictionary of keys and bytes. If a field had a complex internal structure, I parsed the subset of bytes.

(I vaguely recall the Protobuf definitions are in XCode somewhere. But. I didn't want to write a protobuf compiler to make a pure-Python implementation. See the protobuf project for what I was looking for, but didn't have at the time.)

Which brings us to today's discovery.

State of the Art

Someone has taken the steps necessary to properly unpack Numbers files. See numbers-parser. This has first-class snappy and protobuf processing. It installs cleanly. It has an issue, and I may try to work on it.

I'm rewriting my own Stingray Reader with intent to dispose of my own XLSX, ODS, and Numbers processing. These can (and should) be imported separately. It's a huge simplification to stand on the shoulders of giants and write a dumb Facade over their work.

Ideally, all the various spreadsheet parsing folks would adopt some kind of standard API. This could be analogous to the database API used by SQL processing in Python. The folks with https://www.excelpython.org or http://www.python-excel.org might be a place to start, since they list a number of packages.

The bonus part? Seeing my name in the Credits for numbers-parser. That was delightful.

At some point, I need to make a coherent pitch for a common API which permits an external JSON Schema as part of extracting data from spreadsheets.

First. I need to get Stingray Reader into a more final form.

Tuesday, November 16, 2021

Reading Spreadsheets with Stingray Reader and Type Hinting

See Spreadsheets, COBOL, and schema-driven file processing, etc. for some history on this project.

Also, see Stingray-Reader for the current state of this effort.

(This started almost 20 years ago; I've been refining and revising a lot.)

Big Lesson Up Front

Python is very purely driven by the idea that everything you write is generic with respect to type. Adding type hints narrows the type domain, removing the concept of "generic".

Generally, this is good.

But not universally.

Duck Typing -- and Python's generic approach to types -- is made visible via Protocols and Generics.

An Ugly Type Hinting Problem

One type hint complication arises when writing code that really is generic. Decorators are a canonical example of functions that are generic with respect to the function being decorated. This, then, leads to kind of complicated-looking type hints.

See the mypy page on declaring decorators. The use of a TypeVar to show how a decorator's argument type matches the return type is a big help. Not all decorators follow the simple pattern, but many do.

from typing import Any, Callable, TypeVar

F = TypeVar('F', bound=Callable[..., Any])

def myDecorator(function: F) -> F:
    # Decoration logic goes here; returning the function unchanged keeps the example minimal.
    return function
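For instance (payment is a made-up function, purely for illustration), mypy then treats the decorated function as keeping its original signature:

@myDecorator
def payment(amount: float) -> float:
    return round(amount, 2)

# Under mypy, reveal_type(payment) still shows the original (float) -> float signature.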

The Stingray Reader problem is that a number of abstractions are generic with respect to an underlying instance object.

If we're working with CSV files, the instance is a tuple[str, ...].

If we're working with ND JSON objects, the instance is some JSON type.

If we're working with some Workbook (e.g., via xlrd, openpyxl, or pyexcel) then, the instance is defined by one of these external libraries.

If we're working with COBOL files, then the instances may be str or may be bytes. The typing.AnyStr type provides a useful generic definition.

Traditional OO Design Is The Problem

Once upon a time, in the dark days, we had exactly one design choice: inheritance. 

Actually, we had two, but so many writers get focused on "explaining" OO programming that they tend to gloss over composition. They focus on the sort-of novel concept of inheritance.

And this leads to folks arguing that inheritance shouldn't be thought of as central. Which is a kind of moot argument over what we're doing when we're writing about OO design. We have to cover both. Inheritance has more drama, so it becomes a bit more visible than composition. Indeed, inheritance creates a number of design constraints, and that's where the drama surfaces.

Any discussion of design patterns tends to be more balanced. Many patterns -- like Strategy and State -- are compositional patterns. Inheritance actually plays a relatively minor role until you reach interesting boundary cases.

Specifically.

What do you do when you have a Strategy class hierarchy and ONE of those strategies has a unique type hint for a parameter? Most of the classes use one type. One unique subclass needs a distinct type. For example, this outlier among the Strategy alternatives uses a str parameter instead of float.

Do you push that type distinction up to the top of the hierarchy? Maybe define it as edge_case: Optional[Union[str, float]] = None?

You can't simply change the parameter's value in one subclass with impunity. mypy will catch you at this, and tell you you have Liskov Substitution problems.

Traditionally, we would often take this to mean that we have a larger problem here. We have a leaky abstraction. Some implementation details are surfacing in a bad way and we need more abstract classes.

It's A Protocol ("Duck Typing")

When I started rewriting Stingray Reader, I started with a fair number of abstract classes. These classes were supposed to have widely varying implementations, but common semantics. 

Applying a schema definition to a CSV file means that data values can be converted from strings to something more useful.

Applying a schema to a JSON file means doing a validation pass to be sure the loaded object meets the schema's expectations.

Applying a schema to a Workbook file is a kind of hybrid between CSV processing and JSON processing. The workbook's values will have been unpacked by the interface module. Each row will look like a list[Any] that can be subject to JSON schema validation. 

Applying a schema to COBOL means using the schema details to figure out how to unpack the bytes. This is suddenly a lot more complex than the other cases.

The concepts of inheritance and composition aren't really applicable. 

This is something even more open-ended. It's a protocol. 

We want a common interface and common semantics. But. We're not really going to leverage any common code. 

This unwinds a lot of abstract superclasses, replacing them with Protocol class definitions.

import abc
from typing import Iterator, Union

class Workbook(abc.ABC):
    @abc.abstractmethod
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Union[str, bytes, int, float]]]:
        ...

The above is not universally useful. Liskov Substitution has to apply. In some cases, we don't have a tidy set of relationships. Here's the alternative:

from typing import Any, Iterator, Protocol

class Workbook(Protocol):
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Any]]:
        ...

This gives us the ability to define classes that adhere to the Workbook Protocol but don't have a simple, strict subclass-superclass-Liskov substitution relationship.

It's A Generic Protocol

It turns out, this isn't quite right. What's really required is a Generic[Instance], not the simple Protocol.

from typing import Generic, Iterator, TypeVar
Instance = TypeVar("Instance")

class Workbook(Generic[Instance]):
    def sheet(self, name: str, schema: Schema) -> Sheet:
        ...
    def row_iter(self) -> Iterator[list[Instance]]:
        ...

This lets us create Workbook variants that are highly type-specific, but not narrowly constrained by inheritance rules.

This type hinting technique describes Python code that really is generic with respect to implementation type details. It allows a single Facade to contain a number of implementations.
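To make that concrete, here's a small, self-contained sketch. Schema, Sheet, and CSVWorkbook are placeholder names for illustration only, not the Stingray Reader API.

import csv
from typing import Generic, Iterator, TypeVar

Instance = TypeVar("Instance")

class Schema: ...
class Sheet: ...

class Workbook(Generic[Instance]):
    def sheet(self, name: str, schema: Schema) -> Sheet: ...
    def row_iter(self) -> Iterator[list[Instance]]: ...

class CSVWorkbook(Workbook[str]):
    """Rows are list[str]; the Instance type variable is bound to str."""
    def __init__(self, path: str) -> None:
        self.path = path

    def sheet(self, name: str, schema: Schema) -> Sheet:
        return Sheet()

    def row_iter(self) -> Iterator[list[str]]:
        with open(self.path, newline="") as source:
            yield from csv.reader(source)

The type-specific variant gets checked by mypy against the generic definition, but there's no requirement for a tidy Liskov-style hierarchy of concrete classes.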

Tuesday, November 2, 2021

Welcome to Python: Some hints for ways to explain how truly bad the language is

As an author with many books on Python, I'm captivated by people's hot takes on why Python is so epically bad. Really Bad. Uselessly Bad. Profoundly Broken. etc.

I'll provide some hints on topics that get repeated a lot. If you really need to write a blog post about how bad Python is, please try to take a unique approach on any of these common complaints.  If you have a blog post half-written, skip to the tl;dr section to see if your ideas are truly unique.

Whitespace

Please don't waste time complaining about having to use whitespace in your code. I'm sure it's a burden on your soul to configure your editor to indent in groups of four spaces. I'm sorry it's so painful. But. Python isn't the only language with whitespace.

The shell scripting language has semantic whitespace. (It's not used for indentation, but please try cat$HOME/.bashrc (without any spaces) and tell me what happens.) Spaces matter in a lot of languages. 

Even in C, some whitespace is semantic. The rest of the whitespace is for humans to read your code.

If you're *sure* that indentation is a fatal problem, please provide an example in the language of your choice where the {}'s or the case/esac was *required* because ordinary, readable indentation didn't -- somehow -- express the nesting.

The example can be the basis for a Python Enhancement Proposal (PEP) to fix the whitespace problem you've identified.

The self Instance Variable

Using self everywhere is simpler than using this in those obscure special cases where it's ambiguous. Python developers are sure that being uniformly explicit is a terrible burden on your soul. If you really feel that obscure special cases are required, consider writing a pre-processor to sort out the special cases for us.

I'm sure there's a way to inject another level of name resolution into the local v. global choices. Maybe local-self-global or self-local-global could be introduced. 

Please include examples. From this a Python Enhancement Proposal can be drafted to clarify what the improvement is.

No Formal Constants

Python doesn't waste too much time on keywords, like const, to alter the behavior of assignment. Instead, we tend to rely on tools to check our code.

Other languages have compilers to look for assignment to consts. Python has tools like flake8, pyflakes, pylint, and others, to look for this kind of thing. Conventionally, variables at the module level with ALL_CAPS names are likely to be constants. Multiple assignment statements would be a problem. Got it.

"Why can't the language check?" you ask. Python doesn't normally have a separate compile pass to pre-check the code. But. As I said above, you can use tools to create a pre-checking pass. That's what most of us do.

"But what if someone accidentally overwrites a constant?" you insist. Many folks would suggest some better documentation to explain the consequences an clarify how unit tests will fail when this happens. 

"Why should I write unit tests to be sure a constant wasn't changed?" you demand. I'm not really insisting on it. But you said you had developers who would "accidentally" overwrite a constant in an assignment statement, and you couldn't use tools like pylint to check for it. So. I suggested another choice. If you don't like that, use enums. Or write documentation and explain which global items can be changed and which can't be changed.

BTW. If you have global variables that are NOT constants, consider this a code smell. 

If you really need a mixture of constants and variables as module globals, you can use the enum module to create named attribute values of a class definition. You get constants and a namespace. It's pretty sweet.
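A minimal sketch (the names and values are made up) of module-level constants via enum:

from enum import Enum

class Config(Enum):
    MAX_RETRIES = 3
    TIMEOUT_SECONDS = 30.0

print(Config.MAX_RETRIES.value)   # 3
# Config.MAX_RETRIES = 5 raises AttributeError: members cannot be reassigned.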

Lack of Privacy

It appears to be an article of faith that a private keyword is unconditionally required.

Looking at the history of OO languages, private seems to have been introduced with C++. Not every OO language has the same notion of private that C++ has. CLU has no concept of private. Smalltalk considers instance variables equivalent to C++ protected, not private. Eiffel has a particularly sophisticated feature export mechanism that doesn't involve a trivial private/public distinction.

Since many languages that aren't C++ or Java have a variety of approaches, it appears private isn't required. The next question, then, is it necessary?

It really helps to have a concrete example of a place where a private method or attribute was absolutely essential. And it helps to do this in a way that a leading _ on the variable name -- every time it's used -- is more confusing than a keyword like private somewhere else in the code.

It also helps when the example does not involve a hypothetical Idiot Developer who (a) doesn't read the documentation and (b) doesn't understand the _leading_underscore variable, and can still manage to use the class. It's not that this developer doesn't exist; it's questionable whether or not a complex language feature is better than a little time spent on a code review. 

It helps when the example does not include the mysterious Evil Genius Developer who (a) reads the documentation, and (b) leverages the _leading_underscore variable to format one of the OS disks or something. This is far-fetched because the Evil Genius Developer had access to the Python source, and didn't need a sophisticated subclassing subterfuge. They could simply edit the code to remove the magical privacy features.

No Declarations

Python is not the only language where variables don't have type declarations. In some languages, there are implied types associated with certain kinds of names. In other languages, there are naming conventions to help a reader understand what's going on.

It's an article of faith that variable declarations are essential. C programmers will insist that a void * pointer is still helpful even though the thing to which it points is left specifically undeclared. 

C (and C++) let you cast a pointer to -- well -- anything. With resulting spectacular run-time crashes. Java has some limitations on casting. Python doesn't have casting. An object is a member of a class and that's the end of that. There's no wiggle-room to push it up or down the class hierarchy.

Since Python isn't the only language without variable declarations, it raises the question: are they necessary?

It really helps to have a concrete example of a place where a variable declaration was absolutely essential for preventing some kind of behavior that could not be prevented with a pylint check or a unit test. While I think it's impossible to find a situation that's untestable and can only be detected by careful scrutiny of the source, I welcome the counter-example that proves me wrong.

And. Please avoid this example.

for data in some_list_of_int:
    if data == 42:
        print("data is int")
for data in some_list_of_str:
    if data == "bletch":
        print("data is str")

This requires reusing a variable name. Not really a good look for code. If you have an example where there's a problem that's not fixed by better variable names, I'm looking forward to it.

This will change the world of Python type annotations. It will become an epic PEP.

Murky Call-By-Value Semantics

Python doesn't have primitive types. There are no call-by-value semantics. It's not that the semantics are confusing: they don't exist. Everything is a reference. It seems simpler to avoid the special case of a few classes of objects that don't have classes.

The complex special cases surrounding unique semantics for bytes or ints or strings or something requires an example. Since this likely involves a lot of hand-waving about performance (e.g., primitive types are faster for certain things) then benchmarking is also required. Sorry to make you do all that work, but the layer of complexity requires some justification.

No Compiler (or All Errors are Runtime Errors)

This isn't completely true. Even without a "compiler" there are a lot of ways to check for errors prior to runtime. Tools like flake8, pyflakes, pylint, and mypy can check code for a number of common problems. Unit tests are another common way to look for problems. 

Code that passes a unit test suite and crashes at runtime doesn't seem to be a language problem. It seems to be a unit testing problem.

"I prefer the compiler/IDE/something else find my errors," you say. Think of pylint as the compiler. Many Python IDE's actually do some static analysis. If you think unit tests aren't appropriate for finding and preventing problems, perhaps programming isn't your calling.

tl;dr

You may have some unique insight. If you do, please share that.

If on the other hand, you're writing about these topics, please realize that Python has been around for over 30 years. These topics are not new. For the following, please try to provide something unique:

  • Whitespace
  • The self Instance Variable
  • No Formal Constants 
  • Lack of Privacy
  • No Declarations
  • Murky Call-By-Value Semantics
  • No Compiler (or All Errors Are Run-Time Errors)

It helps to provide a distinctive spin on these problems. It helps even more when you provide a concrete example. It really helps to write up a Python Enhancement Proposal. Otherwise, we can seem dismissive of Yet Another Repetitive Rant On Whitespace (YARROW).

Tuesday, October 26, 2021

Python is a Bad Programming Language. Wait, wut?

It may help to read Python is a Bad Programming Language, but it's not very useful. 

I shouldn't be tempted by click-bait headlines. But.  I am drawn in by bad articles on Python.

In particular, any post claiming Python is deficient causes me to look for the concrete PEP's that fix the problems.

Interestingly, there never seem to be any PEP's in any article that bashes Python. This post is yet another example of complaining without offering any practical solutions. 

BLUF

The article has a complaining tone, but, I can't figure out some of the complaints. It lifts up a confusing collection of features from other languages as if these features are somehow universally desirable. No justification is provided. The author seems to rely exclusively on Stack Overflow answers for information about Python. There are no PEP's proposed to fix Python. There aren't even any examples.

Point-by-Point

I will try to address each point. It's difficult, because some of the points are hard to discern. There's a lot of "Who thought that was a good idea?" which isn't really a specific point that can be refuted. It's a kind of rhetorical flourish that seems to work best with folks that already agree.

Let's start.

A Fragmented Language

This is the result of profound confusion. It's hard to find anyone recommending Python 2 anywhere. The supplied link is 9 years old, making this comment extremely misleading.  (I'm being charitable. A nine-year old link on Stack Overflow requires some curation. This is not a Python problem.)

Ugly Object-Orientation

The inconsistent use of this in C++ and Java is lifted up as somehow good. The consistent use of the self instance variable in Python is somehow less good; perhaps because it's consistent.

"See how I have to both declare and initialize them in the constructor? Another example of Python stupidity." Um. No, I don't actually see you declare them anywhere. I guess you're unaware of what declare means in languages like C++ and why declare isn't a thing in Python.

Somehow using the private keyword is better than __ name mangling. I'm unclear on why it's better; it's simply stated in a way that makes it sound like a long keyword used once is better because it's better. No additional reason or justification is offered. The idea of using __ to emphasize the privacy is somehow inconceivable.

The private and protected keywords are in C++, C#, and Java to optimize recompilation. To an extent, this also permits distribution of libraries in the form of "headers" and obfuscated binaries. None of this makes sense in a Python context. A single example of how the private keyword would be helpful in Python is missing from the original post. There are huge complications with the protected keyword as well; these make the keywords more trouble than they are worth, and any example needs to cover these issues, too.

"In general, when you point out any flaws in their language, Python developers will act hostile and condescending." Sorry, this complaint in the original post sounds hostile and condescending. I'll try to ignore the tone and stick to what content I can find.

Whitespace

"...how is using whitespace any better than curly braces?" has an answer. But. Somehow it can't be chased down and included in the original post. Whitespace (like name mangling) is described as wrong because it's wrong, with no further justification provided.

An example where braces seem to be essential for sorting out syntax would be nice. The entire Python community is waiting for any example where braces were necessary and the indentation wasn't already clear.  

"And only in Python will the difference between tabs and spaces cause the interpreter to have a heart attack." Um. A syntax error is a heart attack? I wish I was able to type code without syntax errors. I am humbled thinking about the idea of seeing syntax errors so rarely. I have my editor set up to use spaces instead of tabs, and haven't had a problem in 20 years of using Python. 

Dynamic Typing

The opening quote, "Dynamic typing is bad," is stated as if it's axiomatic. The rest of the paragraph seems like vitriol rather than justification. "Some Python programmers have realized that dynamic typing is bad" requires some justification; a link to some documentation to support the claim would be helpful. An example would be good.

I can only assume that code like this is important and needs to be flagged by the compiler or something.

for data in some_list:
    if data == 42:
        print("data is int")
for data in some_other_list:
    if data == "wait":
        print("see the type of data changed.")
        

This seems like poor programming to begin with. Expecting the compiler to reject this seems weak. It seems better to not reuse variable names in the first place.

Constants

Not sure what the point is here. There's no justification for demanding the inconsistent behavior of a one-time-only assignment statement. There's no reference to how folks can use enums to define constant-like names and values. 

The concluding paragraph "The Emperor Has No Clothes" is some kind of summary. It says "Python will only grow in popularity as more and more software is written in it," which does seem to be true. I think that might be the single most useful sentence.

What Have We Learned?

First, reading a few Stack Overflow posts can be misleading. Python now is not Python from nine years ago.

  1. Everyone says to use Python3. Really. If you have found a Python2 tutorial, stop now. Don't follow it. 
  2. The consistent use of the self variable seems simpler than trying to understand the rules for the this variable.
  3. Variables aren't declared, they're assigned values. It's as simple as it can be and avoids the clutter of variable declarations.
  4. We can read the source; the complexities of private (or protected) instance variables doesn't really help.
  5. Python's use of whitespace is very simple; most people can indent their code correctly. Anyone who's tried to debug C++ code that's correctly indented but missing a (nearly invisible) } will agree that the indentation is easier to get right.
  6. AFAICT, the reason dynamic typing might be bad is when a function or class reuses the same variable name for multiple different types of data. It seems wrong to reuse a variable name for multiple types. A small effort at inspecting the code can prevent this.
  7. Constants are easily implemented via enum. But. They appear to be useless in a dynamic language where the source is trivially available to be changed. I'm not sure why they seem important to people. And this article provides no help there.

Bottom line: Without concrete PEPs to fix things, or examples of what better might look like, this is click-bait whining. 

Starting from C# or Java to locate deficiencies is just as wrong as starting from Dartmouth Basic or FORTH as the standard against which Python is measured.

Tuesday, October 19, 2021

Code so bad it causes me physical pain

Here's the code.

def get_categories(file):
    """
    Get categories.
    """
    verify_file(file)

    categories = set()

    with open(file, "r") as cat_file:
        while line := cat_file.readline().rstrip():
            categories.add(line)

    return categories

To me this was terrible. Truly and deeply horrifying. Let me count the ways.

  1. The docstring repeats the name of the function providing no additional information. 
  2. The verify_file() function checks are pure, useless LBYL code. It seemed designed to map a lot of detailed exceptions to a RuntimeError. Which is misleading.
  3. The use of while and readline() to iterate through the lines of a file is -- I guess -- reasonable if we're working in Pascal or Modula-2. But we're not. Use of the walrus operator isn't really getting any bonus points because -- well -- this is terrible.
  4. While pathlib is used elsewhere in this module, it's not used here. This function works with a filename string, assigned to the file parameter.

Actually, taking a step back, it's not that the author is being malicious. They just missed all the features of files and sets. And -- somehow -- were able to learn about the walrus operator while never figuring out how files work.

This is something like:

from pathlib import Path

source = Path("some_file.txt")
with source.open() as source_file:
    categories = {line.rstrip() for line in source_file}

And that's it. 

It Gets Worse

This was part of some category mapping application. 

They've got a CSV file with some string values. And they want to map those string values to summary category values. 

Most folks think of a dictionary for a mapping from one string to another string.

The code I was sent -- I kid you not -- used a list of two-tuples. I'll repeat that for those who are skimming. It used A LIST OF TWO-TUPLES INSTEAD OF A DICTIONARY.  It used a colossally bad search through an unsorted list of tuples to find matches. (The only search that would have been worse was random probes instead of iteration.)

It really did.

I can't even show you that code; it's such a horrifyingly bad design.

They had a question. Was the looping over a list of two-tuples inefficient? That's why they asked for help. 

It was like they had never heard of a dictionary. Nor seen a tutorial with a dictionary. Nor read a book that mentioned dictionaries. They had managed to learn enough Python to see the walrus operator without hearing of dictionaries.

A list of two-tuples, when provided to the dict() function, will make a dictionary. They were ignorant of this.

A dictionary does O(1) lookups and avoids looping over a list of two-tuples. This was a mystery to them.
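For the record, a tiny sketch (the pairs are made up) of what they were missing:

pairs = [("apple", "fruit"), ("basil", "herb"), ("carrot", "vegetable")]

mapping = dict(pairs)        # a dictionary built directly from the list of two-tuples
print(mapping["carrot"])     # O(1) average-case lookup: 'vegetable'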

When someone doesn't know the Python dictionary exists, what is the appropriate response?

How can you politely say "Find another tutorial and do the ENTIRE thing all of it!"

That's Not All

There's this nugget of "You can't be serious."

category_counts = {element: 0 for element in categories}

And

category_counts[category] += 1

Yes. They used a dictionary to count instances of the categories. They did not understand collections.defaultdict or collections.Counter. But they understood a dictionary well enough to use it here. But not use it elsewhere for the central functionality of the app.
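Here's a sketch of what the counting could have been, with made-up category data:

from collections import Counter

categories = ["fruit", "herb", "fruit", "vegetable", "fruit"]   # made-up data
category_counts = Counter(categories)
print(category_counts["fruit"])       # 3
print(category_counts.most_common(1))  # [('fruit', 3)]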

So. They couldn't use a dictionary, but could use a dictionary.

They couldn't use the csv module, so they wrote their own (bad) CSV parser. 
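And, for completeness, a sketch of reading such a file with the standard library (the filename and the row handling are made up):

import csv
from pathlib import Path

with Path("values.csv").open(newline="") as source:
    for row in csv.reader(source):
        ...  # map each row's raw value to its summary category here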

It's almost impossible to write a polite code review.

Tuesday, October 12, 2021

Legacy Software is a Sticky Mess

I'll get to legacy software. First, however, some backstory on observability.

Sailors will sometimes create "Float Plans". Like aircraft flight plans, they have an itinerary to make it slightly easier to find us when something goes wrong. Unlike airspace, which is tightly controlled by the FAA, the seas are more-or-less chaos.

The practice, then, is to create a float plan and give it to a trusted shore-side party, go out sailing, check in periodically, and cancel the whole thing when you're done sailing. If you miss a check-in, they can call an appropriate Search-And-Rescue agency like the US Coast Guard or BASRA or local cops with jurisdiction over a lake or river.

How much detail should be in this plan? For a long or complex trip, it doesn't seem sensible to say "Going to the Bahamas" as your float plan. That's a little thin on details. The bare minimum is to provide an Estimated Time of Arrival (ETA). But. When you summarize 36 hours of sailing to a single ETA, you invite observability problems. It's a sailboat, and you could be becalmed. Things are fine, you're just going to be late. 

Late, of course is relative. Simply late means you missed your ETA. If you're becalmed to the point where you're running low on supplies, then, this can become a bit of a problem.

The general policy followed by SAR is to allow several hours past the ETA before activating SAR resources. (The US Coast Guard announces overdue mariners on the VHF radio so others can keep a lookout for them and render assistance.)

If you have a one-checkin-plan that summarizes 36 hours of sailing with a single ETA, you're going to be waiting for many hours after the ETA for help. So. Total systems failure after the first hour means 35 hours of drifting before someone will even alert SAR folks. And then the SAR folks will wait several hours after the ETA in case you're only slow.

What seems better is to have a sequence of waypoints with ETA's at each waypoint. That way you have incremental evidence of success or failure, and you're not waiting a LOOOONG time for your one-and-only ETA to pass without a check-in.

This leads us to software. And legacy software.

Creating the Plan

To create a sensible plan, you have waypoints as Latitude, Longitude pairs. These are angles on a sphere, not distances on a plane, so computing the length of a leg isn't a simple hypotenuse. 

It is a lot like a hypotenuse. For short distances, we can assume the earth is more-or-less flat. We can then use a relatively simple conversion (cosine of the latitude) to compress the longitudes toward the poles. We can convert lat and lon to distances and use a hypotenuse and get a really close answer.

import math  # LatLon, Angle, and the NM/KM/MI constants come from the surrounding navtools module.

def range_bearing(p1: LatLon, p2: LatLon, R: float = NM) -> tuple[float, Angle]:
    """Rhumb-line course from :py:data:`p1` to :py:data:`p2`.

    See :ref:`calc.range_bearing`.
    This is the equirectangular approximation.
    Without even the minimal corrections for non-spherical Earth.

    :param p1: a :py:class:`LatLon` starting point
    :param p2: a :py:class:`LatLon` ending point
    :param R: radius of the earth in appropriate units;
        default is nautical miles.
        Values include :py:data:`KM` for kilometers,
        :py:data:`MI` for statute miles and :py:data:`NM` for nautical miles.
    :returns: 2-tuple of range and bearing from p1 to p2.

    """
    d_NS = R * (p2.lat.radians - p1.lat.radians)
    d_EW = (
        R
        * math.cos((p2.lat.radians + p1.lat.radians) / 2)
        * (p2.lon.radians - p1.lon.radians)
    )
    d = math.hypot(d_NS, d_EW)
    tc = math.atan2(d_EW, d_NS) % (2 * math.pi)
    theta = Angle(tc)
    return d, theta

This means we can't trivially write down a list of waypoints. We need to do some fancy math to compute distances.

For years and years (since our first "big" trip in 2007), I've used spreadsheets in various forms to work out the waypoints, distances, estimated time enroute (ETE), and ETA.

The math isn't too far beyond what a spreadsheet can do. But. There's a complication.

Complications

File formats are a complication.

There are KML files, GPX files, and CSV files that are used by various pieces of software. This is only the tip of the iceberg, because some Navionics devices have an even more interesting USR file that contains everything in your chartplotter. It's cool. But complicated.

The file formats are -- clearly -- way outside the box for a spreadsheet.

Python to the rescue.

Since I'm a Python hack (and have been since well before 2007) I've got all kinds of file conversion tools. See https://github.com/slott56/navtools

But.

And here's where legacy enters the picture. (Music Cue.)

Fear that rattles in men's ears
And rears its hideous head
Dread ... Death ... in the wind ...

Spreadsheets.

Up until yesterday, the final planning tool was a spreadsheet with waypoints and times. Mac OS X Numbers is GREAT for this. I can pile in boat information, crew information, safety information, the itinerary, and SAR contact details in one spreadsheet, save it as a PDF, and email it to my shore-side contacts.

The BEST part of this was tinkering with the departure time while we waited for weather. We could plug in the day we're leaving, get revised ETA's for the waypoints, push the document, and take off. 

(We use an old Spot Navigator to provide notifications at midnight to show progress. We're going to upgrade to a SpotX so we can send messages a little more flexibly.)

The Legacy Spreadsheet

The legacy spreadsheet has a lot of good UX features. It's really adequate for some user stories. Save as PDF rocks.

However.

For the more advanced route planning, it isn't ideal. Specifically, spreadsheets can be weak on multiple "what-if" scenarios. 

The genesis of spreadsheets (I'm old, I was there, I remember VisiCalc) was "what-if" analysis. Change an assumption and follow the consequences through the lattice of dependent cells. These are hard to save. You can "Save As" to make a copy of the spreadsheet. You can save pages within a single spreadsheet. These are terrible because you can't really make a more fundamental change very easily. You have to make the same change to all the copies in your pile of "what-if" alternatives. 

To be very specific. I often need to plan for different boat speeds. We have a sailboat; wind and water matter a lot. Slow is about 5 knots. Fast is about 6 knots. Our theoretical top speed is 8 knots, but we've rarely seen that without a river flowing along with us. Sailing at that speed means a lot of sail wrestling, something we'd rather not do.

Fine. That's 3 scenarios, one for each speed: 5, 5.5, and 6. No big deal.

Until we add a waypoint. Or move a waypoint. Now we have to reset all three spreadsheets with a different itinerary. Since it's a different number of rows, we have the usual copy-and-paste problems in spreadsheets. 

What's Better?

Jupyter notebooks crush the life out of spreadsheets.

Here's the revised workflow.

  1. Create the route. Use tools like OpenCPN so the route can be exported as a GPX or CSV file.
  2. Use a notebook to parse the route file, creating an internal Route object.
  3. Manipulate the Route object, providing different ETA's and speed assumptions. These assumptions lead to multiple cells in the notebook. They can all share details so that one fundamental change leads to lots and lots of recomputation of itineraries. We can include all kinds of headings and markdown notes and thoughts and considerations. (There's a small sketch of this step after the list.)
  4. Finalize a route that's part of the plan. Still working in the confines of a longish notebook.
  5. Emit a Markdown file with Vessel Identification, Itinerary, Notes, and SAR Contact sections. Run pandoc to make a PDF. (This is the foundation for the nbconvert utility.)
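Here's a minimal sketch of the step 3 manipulation, using a made-up list of legs rather than the real navtools Route object:

from datetime import datetime, timedelta

# Made-up legs: (waypoint name, leg distance in nautical miles).
legs = [("Depart", 0.0), ("Mark 1", 18.5), ("Mark 2", 27.0), ("Harbor", 12.0)]

def itinerary(departure: datetime, speed_kts: float) -> list[tuple[str, datetime]]:
    """ETA at each waypoint for a given departure time and boat speed."""
    eta = departure
    plan = []
    for name, distance_nm in legs:
        eta += timedelta(hours=distance_nm / speed_kts)
        plan.append((name, eta))
    return plan

# Three what-if scenarios: slow, middling, fast.
for speed in (5.0, 5.5, 6.0):
    print(speed, itinerary(datetime(2021, 11, 12, 9, 0), speed))

One fundamental change -- a new waypoint, a different departure time -- is a one-line edit, and every scenario recomputes.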

This workflow creates two categories of results:

One result is a Notebook with all of the planning details and thoughts and contingencies and considerations. 

The other result(s) are float plan documents as PDF's that can be shared widely.

Why did this take so long?

I used spreadsheets from 2007 to 2021. Why switch now? Some reasons.

Legacy solutions are sticky. This has a lot of consequences. I built up "expertise" in making the legacy work. I had become an "expert" in working around the hinky little problems with multiple what-if scenarios and propagating changes from the route into the what-ifs. For example, I limited the number of what-if scenarios I would consider because more than two got confusing.

New solutions are sometimes invisible. I only learned about Jupyter Notebooks about three years ago. I did not realize how powerful they were. I've since rearranged my thinking.

I've known about RST and Markdown and Pandoc for years. But. Getting from spreadsheet-like flexibility to a Markdown document was never a clear step. Without something like Jupyter Lab.

Pulling it all together

Does it require some kind of catalyst to force change?

Is it a slow accretion of evidence that the legacy software isn't working?

I'm pretty sure I had a long, slow Aha! moment as I realized that the Numbers spreadsheet was a large pain in the ass and a notebook would be simpler. It took a few days of fiddling to become really, really sure Numbers was not working out.

I think one of the biggest issues was a third "what-if" scenario. It was helpful to visualize arrival times. But. It was a huge pain in the neck to fiddle with the spreadsheets to get the right waypoints in there and summarize the alternatives.

I think the lesson here is to avoid automating anything unless you actually are the user.

If an organization wants software, a developer needs to do the job manually to *really* understand what the pain points are. Users develop expertise in the wrong things. And they want automation where the benefits are minor. Automating the spreadsheet-to-PDF is wrong. Replacing the spreadsheet is right.