close

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label #python. Show all posts
Showing posts with label #python. Show all posts

Tuesday, August 23, 2022

Books! Books! More Channels!

I started with the Apple Books platform because it's an easy default for me. 

Pivot to Python

A Guide for professionals and skilled beginners

https://books.apple.com/us/book/pivot-to-python/id1586977675 

I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program, they want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail

https://books.apple.com/us/book/unlearning-sql/id6443164060

Many folks know some Python, but struggle with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems effectively.

Google Play

Now, I'm duplicating the books on Google Play. Here's Unlearning SQL:

https://play.google.com/store/books/details?id=23WAEAAAQBAJ

I've made a clone of Pivot to Python, also.

https://play.google.com/store/books/details/Steven_F_Lott_Unlearning_SQL?id=23WAEAAAQBAJ&hl=en_US&gl=US

Both books are (intentionally) short to help experts make rapid progress.

Tuesday, August 16, 2022

Enterprise Python -- Some initial thoughts

In the long run, I think there's a small book here. See 8 reasons Python will rule the enterprise — and 8 reasons it won’t | InfoWorld. The conclusion, "Teams need to migrate slowly into the future, and adopting more Python is a way to do that," seems to be sensible. Some of the cautionary tales along the way, however, don't make as much sense.

TL;DR. There are no reasons to avoid Python. Indeed, the 8 points suggest that Python is perhaps a smart decision. 

I want to focus on the negatives part of this because some of them are wrong. I think there's a "technology hegemony" viewpoint where everything in an enterprise must be exactly the same. This tends to prevent creative solutions to problems and mires an enterprise into fighting problems that are inherent in bad technology choices. Also, I think there's an enterprises are run by idiots subtext.

1. Popularity. Really this is about having polyglot software portfolio. The reasoning appears to be that a polyglot software portfolio is impossible to maintain because (1) no one can learn an old language, and (2) software will never be rewritten from an obsolete language to a modern language. If these are both true, it appears the  organization is full of idiots. The notion that a polyglot tech stack must devolve into chaos seems to ignore the endless chain of management decisions that are required to create chaos. Leaving obsolete tech in place isn't a consequence of the tech, or the tech's lack of compatibility, it's a management decision to enshrine bad ideas, frozen in amber, forever.

2. Scripting Languages. Specifically, the spreadsheet is already the de facto scripting language of choice, and nothing can be done about it. Nothing. No one can learn to use Jupyter Lab to do business analytics. If this is true, it appears that the organization is full of idiots. Python will not replace all spreadsheets. A pandas data frame will replace an opaque macro-filled nightmare with code that can be unit tested. Imagine unit testing a spreadsheet. Consider the possibilities of expanding business analysis work to include a few test cases; not 100% code coverage, but a few test cases to confirm the analytical process was implemented consistently. 

3. Dynamic Languages. Specifically, dynamic languages are useless for reliable software because there's no comprehensive type checking across some interfaces. Which begs the question of why there are software failures in statically typed languages. More importantly, complaining about dynamic languages raises important questions about integration and acceptance testing procedures in an organization in general. All languages require extensive test suites for all developed code. All languages benefit from static analysis. Sometimes the compiler does this, sometimes external tools do the linting. Sometimes folks use both the compiler and linters to check types. If we are sure dynamic languages will break, are we equally sure statically typed languages cannot break? Or, do we take steps to prevent problems? I think we tend to take a lot of steps to make sure software works.

4. Tooling. I can't figure this point out. But somehow C++ or Java have better tools for managing large source code bases. There are no details behind this claim, so I'm left to guess. I would suggest that the "incremental recompilation" problem of large C++ (and Java) code bases is its own nightmare. Folks go to great lengths to architect C++ so that an implementation change does not require recompilation of everything. While this could be seen as "evolved to handle the jobs that enterprise coders need done", I submit that there's a deeper problem here, and stepping away from the compiler is a better solution than complex architectures. See Lakos Large-Scale C++ Software Design for some architectural features that don't solve any enterprise problem, but solve the scaling problem of big C++ applications. This bumps into the micro-services/monolith discussion, and the question of carefully testing each interface. None of which has anything to do with Python specifically.

5. Machine Learning and Data Science. These are fads, apparently. I'm not sure I can respond to this, since it has little to do with Python. Of course, Python has one of the most complete data science toolsets, so perhaps avoiding data science makes it easier to avoid Python.

6. Rapid Growth. The growth of Python is rapid, and there's no promise of endless backwards compatibility. This is a consequence of active development and learning. I think it's better than the endless backwards compatibility that leads to JavaScript's list of WATs. Or the endless confusion between java.util.date and Joda-Time. The idea that no one will ever look at the Enterprise code base for Common Vulnerabilities and Exposures seems to indicate a lack of concern for reliability or security. Since the entire compiled code base has to be checked for vulnerabilities, why not also check the Python code base for ongoing upgrades and changes and enhancements? Is code really written once and never looked at again? If so, it sounds like an organization run by idiots.

7. Python Shipped With Some OS's. There's a long story of woe that stems from relying on the OS-supplied Python. The lesson learned here is Never Rely on the OS Python; Always Install Your Own. This doesn't seem like a reason to avoid Python in the enterprise. It seems like an important lesson learned for all software that's not part of the OS: always install your own. I've been using Miniconda to spin up Python environments and absolutely love it.

8. Open Source Software. Agreed. Nothing to do with Python specifically. Everything to do with tech stack and architecture. The question of using Open Source in the first place doesn't seem difficult. It's a well-established way to reduce start-up costs for software development.

Of the eight points, two seem to be completely generic issues. Yes, Machine Learning is new, and yes, choices must be made. The two questions around scripting and dynamic languages seem specious; all programming requires careful design and testing. The Python shipped with the OS is a non-concern; the lesson learned is clear.

We have have three remaining points:

  1. Polyglot Portfolio (From pursuing popularity.) This is already the case in most Enterprises, and needs to be managed through aggressive retiring of old software. I may have taken decades to build that old app, but it often takes months to rewrite it in a new language. The legacy app provides acceptance test cases; it's often filled with cruft and detritus of old decisions. 
  2. Tooling. Agreed. Tooling is important. Not sure that Java or C++ have a real edge here, but, tooling is important.
  3. Growth and Change. Python's rapid evolution requires active management. An enterprise must adopt a YBYO (You Build it You Own it) attitude so that every level of management is aware of the components they're responsible for. CVE's are checked, Python PEP's are checked. Tools like tox or nox are used to build (and rebuild) virtual environments.
If these seem like a high bar, perhaps there are deeper issues in the enterprise. If adding Yet Another Language is a problem, then it's time to start retiring some languages. If Adding Another Tool is a problem, it's worth examining the existing tool chain to see why it's such a burden. If the idea of change is terrifying, perhaps the ongoing change is not being watched carefully enough.

Tuesday, August 9, 2022

Tragedy Averted

I almost made a terrible blunder.

See https://github.com/slott56/py-web-tool for some background. This is a "Literate Programming" tool. I started fooling around with this kind of thing back in '05 (maybe even earlier.) This is not the blunder. The whole idea of literate programming is not very popular. I'm a fan of Jupyter{Book} as the state of the art in sophisticated literate programming, if you're interested in it.

In my case, I started this project so long ago, I used docutils. This was long before Sphinx arrived on the scene. I never updated my little project to use Sphinx. The point was to have a kind of pure literate programming tool that could work with a variety of markup languages, including (but not limited to) RST.

Recently, I learned about PlantUML. The idea of a text description of a diagram is appealing. I don't really need to draw it; I just need to specify what's in it and let graphviz do the rest. This tool is very, very cool. You can capture ideas quickly. You can refine and expand on ideas until you reach a point where code makes more sense than a picture of code. 

For some things, you can gather data and draw a picture of things *as they are*. This is particularly valuable for cloud-based infrastructure where a few queries leads to PlantUML source that is depicted very nicely.

Which leads to the idea of Literate Programming including UML diagrams. 

Doesn't sound too difficult. I can create an extension to docutils to introduce a UML directive. The resulting RST would look like this:

..  uml::

    left to right direction
    skinparam actorStyle awesome

    actor "Developer" as Dev
    rectangle PyWeb {
        usecase "Tangle Source" as UC_Tangle
        usecase "Weave Document" as UC_Weave
    }
    rectangle IDE {
        usecase "Create WEB" as UC_Create
        usecase "Run Tests" as UC_Test
    }
    Dev --> UC_Tangle
    Dev --> UC_Weave
    Dev --> UC_Create
    Dev --> UC_Test

    UC_Test --> UC_Tangle

This could be handy to have the diagrams as part of the documentation that tangles the working the code. One source for all of it. 

I started down the path of researching docutils extensions. Got pretty far. Far enough that I had an empty repository and everything. I was about ready to start creating spike solutions.

Then.

[music cue] *duh duh duuuuuuh*

I found that Sphinx already has an extension for PlantUML. I almost started reading the code to see how it worked.

Then I realized how dumb that was. It already works. Why read the code? Why not install it?

I had a choice to make.

  1. Continue building my own docutils plug-in.
  2. Switch to Sphinx.

Some complications:

  • My Literate Programming tool produces RST that *may* not be compatible with Sphinx.
  • It's yet another dependency in a tool that started out with zero dependencies. I've added pytest and tox. What next? 

What to do?

I have to say that Git is amazing. I can make a branch for the spike. If it works, pull request. If it doesn't work, delete the branch. This continues to be game-changing to me. I'm old. I remember when we had to back up the whole project directory tree before making this kind of change.

It worked. My tool's RST (with one exception) worked perfectly with Sphinx. The one exception was an obscure directive, .. class:: name, used to provide an HTML class name for the following block. This always should have been the docutils .. container:: name directive. With this fix, we're good to go.

I'm happy I avoided the trap of reimplementing something. Instead of that, I upgraded from "bare" docutils with my own CSS to Sphinx with it's sophisticated templates and HTML Themes.

Tuesday, August 2, 2022

Books! Books! Books!

First, there's 

Pivot to Python

A Guide for professionals and skilled beginners

https://books.apple.com/us/book/pivot-to-python/id1586977675 

I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program, they want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail

https://books.apple.com/us/book/unlearning-sql/id6443164060

This is all new. It's written for folks who know Python, and are struggling with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems.

Tuesday, July 12, 2022

The Enterprise COBOL Conundrum

Enterprise COBOL is both a liability and an asset. There's tangible value hidden in the code.

See https://github.com/slott56/looking-at-cobol  

I've tweaked the presentation a little. 

The essential ingredients in coping with COBOL are these:

  • Use something like Stingray Reader to parse COBOL DDE's and process the data in the native format.
  • Analyze the Job Control Language (JCL) to work out the directed acyclic graph (DAG) that leads to file and database updates. These "master" files and databases are the data artifacts that matter most. This is the value-creating processing. There aren't many of these files. 
  • Create a process to clone those files, and write Python data access modules to process the data. This is a two-way process. You'll be shipping files from your Z/OS world to another server running Python. In some cases, files will need to come back to Z/OS to permit legacy processing to continue. 
  • Work backwards through the DAG to understand the COBOL apps that update the master files. These can be rewritten as Python apps that consume transactions and update master files/databases. Transfer transaction files out of Z/OS to a server doing the Python processing. Either update a shared database or send updated master files back to Z/OS if there's further processing that needs an updated master. 
  • Continue working backwards through the DAG, replacing COBOL with Python until you've found source files for the transactions. Expect to find transaction validation programs as well as transaction analytics or reporting. The validations are useful; the analytics and reporting can be replaced with simpler, more modern tools.
  • When there's no more legacy processing that depends on a given master file or database, then the Z/OS can be formally decommissioned. Have a party.

This is relatively low risk work. It's high value. The COBOL code encodes enterprise knowledge. Preserving this knowledge in a more modern language is a value-maintaining exercise. Indeed, the improved clarity may be a value-creating exercise.

Tuesday, July 5, 2022

Revised Understanding --> Revised Data Structures --> Revised Type Hints

My literate programming tool, pyWeb, has moved to version 3.1 -- supporting modern Python.

Next up, version 3.2. This is a massive reworking of the data structures involved. The rework lets me use Jinja2 for templates. There's a lot of fiddliness to getting the end-of-line spacing right. Jinja has the following:

{% for construct in container -%}
{{construct}}
{%- endfor %} 

The easy-to-overlook hyphens suppress spacing, allowing the construct to be spread onto multiple lines without introducing extra newlines into the output. This makes it a little easier to debug the templates.

It now works. But. Until I get past strict type checks, there's no reason for calling it done.

Found 94 errors in 1 file (checked 3 source files)

The bulk of the remaining problems seem to be new methods where I forgot to include a type hint. The more pernicious problems are places where I have inconsistent hints and Liskov substitution problems. The worst a places where I had a last-minute change change and switched from str to int and did not actually follow-through and make required changes.

The biggest issue?

When building an AST, it's common to have a union of a wide variety of types. This union often has a discriminator value to separate NamedChunk from OutputChunk. This is "type narrowing" and there are a variety of approaches. I think my best choice is a TypeGuard declaration. This is new to me, so I've got to do some learning before I can properly define the required type guard function(s). (See https://mypy.readthedocs.io/en/stable/type_narrowing.html#user-defined-type-guards)

I'm looking forward (eagerly) to finishing the cleanup. 

The problem is that I'm -- also -- working on the updates to Functional Python Programming. The PyWeb project is a way to relax my brain from editing the book. 

Which means the pyWeb updates have to wait for Chapter 4 and 5 edits. (Sigh.)

Tuesday, June 28, 2022

Massive Rework of Data Structures

As noted in My Shifting Understanding and A Terrible Design Mistake, I had a design that focused on serialization instead of proper modeling of the objects in question.

Specifically, I didn't start with a suitable abstract syntax tree (AST) structure. I started with an algorithmic view of "weaving" and "tangling" to transform a WEB of definitions into documentation and code. The weaving and tangling are two of the three distinct serializations of a common AST. 

The third serialization is the common source format that underpins the WEB of definitions. Here's an example that contains a number of definitions and a tangled output file.

Fast Exponentiation
===================

A classic divide-and-conquer algorithm.

@d fast exp @{
def fast_exp(n: int, p: int) -> int:
    match p:
        case 0: 
            return 1
        case _ if p % 2 == 0:
            t = fast_exp(n, p // 2)
            return t * t
        case _ if p % 1 == 0:
            return n * fast_exp(n, p - 1)
@| fast_exp
@}

With a test case.

@d test case @{
>>> fast_exp(2, 30)
1073741824
@}

@o example.py @{
@< fast exp @>

__test__ = {
    "test 1": '''
@< test case @>
    '''
}
@| __test__
@}

Use ``python -m doctest`` to test.

Macros
------

@m

Names
-----

@u

This example uses RST as the markup language for the woven document. A tool can turn this simplified document into complete RST with appropriate wrappers around the code blocks. The tool can also weave the example.py file from the source document.

The author can focus on exposition, explaining the algorithm. The reader gets the key points without the clutter of programming language overheads and complications.

The compiler gets a tangled source.

The key point is to have a tool that's (mostly) agnostic with respect to programming language and markup language. Being fully agnostic isn't possible, of course. The @d name @{code@} constructs are transformed into markup blocks of some sophistication. The @<name@> becomes a hyperlink, with suitable markup. Similarly, the cross reference-generating commands, @m and @u, generate a fair amount of markup content. 

I now have Jinja templates to do this in RST. I'll also have to provide LaTeX and HTML. Further, I need to provide generic LaTeX along with LaTeX I can use with PacktPub's LaTeX publishing pipeline. But let's not look too far down the road. First things first.

TL;DR

Here's today's progress measurement.

==================== 67 failed, 13 passed, 1 error in 1.53s ====================

This comforts me a great deal. Some elements of the original structure still work. There are two kinds of failures: new test fixtures that require TestCase.setUp() methods, and tests for features that are no longer part of the design.

In order to get the refactoring to a place where it would even run, I had to incorporate some legacy methods that -- it appears -- will eventually become dead code. It's not totally dead, yet, because I'm still mid-way through the refactoring. 

But. I'm no longer beating back and forth trying to see if I've got a better design. I'm now on the downwind broad reach of finding and fixing the 67 test cases that are broken. 

Tuesday, May 25, 2021

Python's Protocol Annotation vs. Duck Typing

Let's talk about profound confusion.

I got an email with a subject of this, "Python's Protocol Reduces Reliance on Duck Typing". The resulting conversation led to this nugget: "... my current project could use protocols in Python, and thus I didn't need to rely on duck typing and instead could use that as my type."

I'm unclear on what "reliance" really meant here.

Python depends (heavily) on duck typing. Because type annotations are optional, this cannot change. It's unlikely to ever change.

Here's the bottom line: Duck Typing Won't Go Away.

Indeed, there's more: Duck Typing Isn't Bad.

Python doesn't "rely" on the type annotations. They're a bonus feature to make sure you aren't lying about the types and how they're used.

Protocols are how duck typing works. When we leverage duck typing among classes, we're implicitly relying on the classes all supporting a common protocol. Numbers, for example, implement a ton of methods; this collection of common methods (e.g., __add__(), etc.) define a protocol.

With mypy, we can create our own distinct protocols as named types.

I don't get the "reducing reliance" business when protocols make duck typing work. And. Sadly. I couldn't figure out where the confusion arose.  

Follow-up

I asked for clarification and got nothing useful in response. The person sending the email seemed to be working from a summary of another conversation, or something. I couldn't figure it out.

I can try to assume they used to have this.

class Something:
    def useful_method(self, x: str) -> int:
        # whatever
        
class CloselyRelated:
    def useful_method(self, x: str) -> int:
        # another polymorphic thing
        
Polymorphic = Union[Something, CloselyRelated]

# many classes and functions relying on Polymorphic

And they've realized that there may be a better way.

But. I haven't really got much to go on.

The better approach often involves something like this:

class Polymorphic(Protocol): 
    def useful_method(self, x: str) -> int:
        ...

We can define a protocol to help locate the essential features of a parameter or a result type.

But. I don't really know what was going on.

And I couldn't figure out why the word "Reliance" was used.

Tuesday, September 1, 2020

A Comprehensive Introduction to Python

Python 101, by Michael Driscoll. 545 pages, available from leanpub.com in a variety of formats. Available soon in hardcover.

The modern Python programming language is a large topic. A book on a programming language has to be seen as a collection of several large topics.

At its core, a book on a programming language has to cover the syntax of the language. What’s for more important is covering the underlying semantics of the various constructs. Software captures knowledge, and it’s essential for a book on a programming language to make it clear how the language expresses knowledge.

For a programming expert, a fifteen page technical report can be enough to get started with a new language. When I was first learning to program, that’s all there was. For the vast majority of people who come in contact with programming, there’s a lot more information required.

This leads to a number of interesting tradeoffs when writing about a programming language. How much of a book should be devoted to installing the language tools? How much should it cover the other tools required to create software? I think Python 101 makes good choices.

In the modern era of open-source software, the volume and sophistication of the available tools can be daunting. An author must consider how many words to invest in text editors, debuggers, performance measurement, testing, and documentation. These are all important parts of producing software, they’re often tied closely with a language, but these additional tools aren’t really the language itself.

A language like Python offers a rich collection of built-in data types. A book’s essential job is to cover the data structures (and algorithms) that are first-class parts of the Python language. A focus on data puts the various syntactic elements (like statements) into perspective. The break statement, for example, can’t really be discussed in isolation. It’s part of the conversation about for statements and conditional processing in if statements. Because Python 101 follows this data-first approach, I think it can help build comprehensive Python skills.

The coverage of built-in data structures in a modern language needs to include file objects. While Python reads strings and bytes, the standard library provides ways to read HTML, CSV, JSON, and XML documents. Additional packages provide access to Excel spreadsheet files. While, technically, not part of the language, these are essential parts of the problem domain a programming language like Python is designed to address. Because these are part of the book, a reader will be empowered to solve practical problems.

There was a time when a programming “paradigm” was part of a book’s theme. Functional programming, procedural programming, and object-oriented programming approaches spawned their own libraries. Some languages have a strong bias. Other languages, like Python, lack a strong bias. A developer can work with functions, using material from the first seventeen chapters of Python 101 and be happy and successful. Moving into class definitions can be helpful for simplifying certain kinds of programs, but it’s not required, and a good book on Python should treat classes as a sensible alternative to functions for handling more complex object state and bundle operations with the state.

Moving beyond the language itself, a book can only pick a few topics that can be called “advanced.” This book looks at some of the language internals, exposed via introspection. It touches on some of the standard library modules for managing subprocesses and threads. It covers tools like debuggers and profilers. It expands to cover development environments like the Jupyter Notebook, also. I’d prefer to reduce coverage of threading and switch to Jupyter Lab from Jupyter Notebook. These are small changes at the edges of large pool of important details.

I’m still waffling over one choice of advanced topics. Does unit testing count as an advanced topic? For software professionals, a testing framework is as important as the language itself. For amateur hackers, however, a testing framework may be a more advanced topic. The location of a chapter on unit testing is a telling indication of who the book’s audience is. 

The Python ecosystem includes the standard library and the vast collection of packages and applications available through the Python Package Index. These components can all be added to a Python environment. This means any book on the language must also cover parts of the standard library, as well as covering how to install  new packages from the larger ecosystem. Python 101 doesn’t disappoint. There are solid chapters in PIP and Virtual Environment management. I can quibble over their  place in Part II. The presence of chapters on tools is important; Python is more than a language; Python 101 makes it clear Python is a collection of tools for building on the work of others to solve problems collaboratively.

I’m not easily convinced that Part IV has the same focus on helping the new programmer as the earlier three parts. I think packaging and distribution considerations take the reader too far outside problem-solving with a programming language and tools. I’m not sure the audience who sees testing as an advanced topic is ready to distribute their code. I think there’s room for a Python 102 book to cover these more professionally-oriented topics.

The volume of material covered by this comprehensive book on Python seems to require something more elaborate than a simple, linear sequence of chapters. The sequence of chapters have jumps that seem a little awkward. For example, from an introduction run-time introduction introspection, we move to the PIP and virtual environment tools, then move back to ways to make best use of Python’s annotations and type hints. Calling this flow awkward is — admittedly — a highly nuanced consideration. I suspect few people will read this book sequentially; when each chapter is used more-or-less independently, the sequence of chapters becomes a minor side-bar consideration. Each chapter has generous examples and there are screen shots where necessary. 

The scope of this book covers the language and the features through Python 3.8 in a complete and intelligible way. The depth is appropriate for a beginning audience and the examples are focused on simple, concrete, easy-to-understand code. The presence of review questions in each chapter is a delight, making it easy to leverage the book for instructor-guided training. I can imagine covering a few chapters each week and quizzing students with the review questions. Some of the questions are nicely advanced and can lead to further exploration of the language.

If you’re new to Python, this should be part of your Python reading list. If you’ve just started and need more examples and help in using some of the common tools, this book will be very helpful. If you’re teaching or helping guide people deeper into Python, this may be a helpful resource. 

Driscoll’s colorful nature photos are a bonus. My Kindle is limited to black and white, and the pictures would have been disappointing. I’m glad I got the PDF version.

Tuesday, August 11, 2020

Modern Python Cookbook 2e -- Out with the old

 Most of the things that got cut were (to me) obviously obsolete. For example, replacing collections.namedtuple with typing.NamedTuple seemed like a clear example of obsolete. A reviewer really thought I should skip all NamedTuple and use frozen data classes. 

More important are some things that I learned about in my formative years. I think they're important because they'll little nuggets of cool algorithm. But. Pragmatically? They're too hard to explain and don't really capture interesting features of Python.

Back in '01. Yes. The turn of the millennium. 

(Pull up a chair. This is a long yarn.)

Back in '01, I was starting to look at ways to perfect my Python and literate programming skills.

(And yes, I was using Python on '01.)

I had a project that I'd learned about in the 80's. That's in the previous millennium. A thousand years ago. Computers were large, expensive, and rare.

And. Random Number Generators (RNG's) were a bit of a struggle. In the 80's, more sensitive statistical methods were uncovering biases in the RNG's of the day. Back in the 70's, Knuth's The Art of Computer Programming, Volume 2, Seminumerical Algorithms had covered this topic pretty well. But. Not quite well enough for language libraries or OS's to offer really solid RNG's.

(The popular Mersenne Twister algorithm dates from '97.)

One of my co-workers at the time showed me a technical report that I have no real bibliographic information for. I read it, captivated, because it described -- in detail -- Knuth's statistical tests for random number generators. 

This lead me to Knuth Volume 2. 

This lead me to implement *all* of this in Pascal (in the '80's.)

This lead me to implement *all* of this in Python (in the '00's.)

There were 10 tests. Each is a tidy little algorithm with a tidy little implementation that can run on a big collection of data to ascertain how random it is. 

  1. Frequency Test - develops frequency distribution of individual samples.
  2. Serial Test - develops frequency distribution of pairs of samples.
  3. Gap Test - develops frequency distribution of the length of gaps between groups samples in a given range.
  4. Poker Test - develops frequency distribution for 5-card "hands" of samples over a small (16-value) domain.
  5. Coupon Collector's Test - develops frequency distribution for lengths of subsets that contain a complete set of values from a small (8-value) domain.
  6. Permutation Test - develops frequency distribution for the permutations of ordering of 4-sample selections.
  7. Runs Up Test - develops frequency distribution for lengths of "runs up" where each value is larger than the previous value; one variation covers the case where runs are statistically dependent.
  8. Runs Up Test with independent runs and a relatively large domain.
  9. Runs Up Test with a "small domain", that has a slightly different expected distribution.
  10. Maximum of T - develops frequency distribution for the largest value in a group of T values.
  11. Serial Correlation - computes the correlation coefficient between adjacent pairs of values.

What's important here is that we're gaging the degree of randomness of a collection of samples. All of these are core data science. Finding a truly random random number generator is the same as looking at a variable and seeing that it's too random to have any predictive value. This is the Type I Error problem.

Doing this with RNG's means starting with a specific seed. Which means we need to run this for a large number of seed values and compare the results. Lots of computer cycles can be burned up examining random number generators.

Lots.

The frequency test, for example. We bin the numbers and compare the frequencies. They aren't the same; they're within a few standard deviations of each other. That means you don't use 5 bins. You use 128 bins so you can compare the bin sizes to the expected bin size and compute a deviation. The deviation for expected needs to pass a chi-squared test.

Back in the day, chi-squared values were looked up in the back of a handy statistics book. 

That seems weak. Can we compute the exact chi-squared values?  

(Spoiler alert, Yes.)

Computing expected chi-squared values means computing Sterling numbers, Bernoulli numbers, and evaluating the partial gamma function. Knuth gives details on Sterling numbers. I have no reference material on Bernoulli numbers. 

The Log Gamma function is ACM collected algorithms (CALGO) number 291 and 309. The incomplete gamma function is CALGO 435 and 654.

Fascinating stuff. 

To me.

Of this, only one thing ever saw the light of day.

The Coupon Collector's test. Given a long sequence composed of selections from a small pool of distinct values ("coupons"), how many samples from the overall sequence do you have to examine to collect one of each distinct coupon value? This yields a kind of Poisson distribution of the number of samples seen before getting a full set of coupons.

If there's eight kinds of coupons, the smallest number of samples we have to examine is eight. Lucky break. One of each and done. But. Pragmatically, we'll see a distribution that varies from a low of 8 to a high of -- well -- infinity. We'll see a peak at like 15 to 18 samples before collecting all eight coupons and a long, long tail. We can cut the tail at 40 samples and have a statistically useful distribution to discern of the source samples were randomly ordered.

Why did this -- of all things -- see the light of day?

It involves set manipulations.

def collect_coupons(samples: Iterable[int]) -> Iterator[int]:
    while True:
        coupons = set()
        count = 0
        for u in samples: 
            coupons |= u
            count += 1
            if len(set(coupons)) == 8:
                break
        yield count

I've used a number of variations on the above theme to use set manipulation to accumulate data.  There are a lot of ways to restate this using itertools, also. It can be viewed as a clever "reduce" algorithm.

But.

It's so hard to explain. And. It's not really used much by data scientists to reject type I errors because few things fit the coupon model very well.

But. 

It's a cool set processing example.

So.

It's safely out of the book. 

Thursday, July 30, 2020

Modern Python Cookbook Journey

For the author, a book is a journey.  

Writing something new, the author describes a path the reader can follow to get from -- well -- anywhere the reader might be to the author's suggested destination. Not everyone makes the whole trip. And not everyone arrives at the hoped-for destination.

Second editions? The idea is to update the directions to reflect the new terrain.  

I'm a sailor. Here's a view of the boat.

BERJAYA

What's important to me is the way the authorities produce revised nautical charts on a stable, regular cadence. There's no "final" chart, there's only the "current" chart. Kept up-to-date by the patient hard work of armies of cartographers. 

Is updating a book like updating the nautical charts? I don't think so. Charts have a variety of update cadences.  For sailors in the US, we start here: https://nauticalcharts.noaa.gov/charts/chart-updates.html. The changes can be frequent. See https://distribution.charts.noaa.gov/weekly_updates/ for the weekly chart updates. This is supplemented by the Notices to Mariners, here, too: https://msi.nga.mil/NTM. So, I think charts are much, much more complex than books.

Sailors have to integrate a lot of data.  This is no different from software developers having to keep abreast of language, library, and platform changes.

The author's journey is different from the reader's journey. A technical book isn't a memoir. 

The author may have crashed into all kinds of rocks and shoals. The author's panic, fear, and despair are not things the reader needs to know about. The reader needs to know the course to set, the waypoints, and hazards. The estimated distances and the places to anchor that provide shelter.

For me, creating a revision is possibly as difficult as the initial writing. I don't know how other authors approach subsequent editions, but the addition of type hints meant every example had to be re-examined.  And this meant discovering problems in code that I *thought* was exemplary. 

While many code examples can simply have type hints pasted in, some Python programming practices have type hints that can't be trivially introduced to the code. Instead some thinking is required.

Generics

Python code is always generic with respect to type. Expressions like a + b will work for a surprisingly wide variety of object classes. Of course, we expect any of the numbers to work. But lists, tuples, and strings all respond to the "+" operator. This is implemented by a sophisticated check of a's __add__() and b's __radd__() methods.

When we write hints, it's often intended to narrow the domain of potential types. Here's some starting code.

def fact(a):
   if a == 0:
       return 1
   return a*fact(a-1)

The implied type hint is Any. This means, any class of objects that defines __eq__(), __mul__() and __sub__() will work. There are a fair number of these classes.

When we write type hints, we narrow the domain. In this case, it should be integers. Like this:

def fact(a: int) -> int:
    if a == 0:
        return 1
    return a*fact(a-1)

This tells mypy (or other, similar analytic tools) to confirm that every place the fact() function is used, the arguments will be integers. Also, the result will be an integer.

What's important is there's no run-time consequence to this. Python runs the same whether we evaluate fact(2) or fact(3.0).  The integer-based computation clearly matches the intent stated in the code. The floating-point computation is clearly at odds with the stated intent.

And this brings us to the author's journey.

Shoal Water

Sometimes we have code that works. And will always work. But. The type hints are hard to express.

The most common examples?

Decorators.

Decorators can be utterly and amazingly generic. And this can make it very, very difficult to express the domain of types involved.

def make_a_log(some_function: Callable) -> Callable:
    @wraps(some_function)
    def concrete_function(*args, **kwargs):
        print(some_function, args, kwargs)
        result = some_function((*args, **kwargs)
        print(result)
    return concrete_function

This is legal, but very shady Python. The use of the Callable type hint is almost intentionally misleading. It could be anything. Indeed, because of the way Python works, it can truly be any kind of function or method. Even a lambda object can be decorated with this. 

The internal concrete_function doesn't have any type hints. This forces mypy to assume Any, and that will lead to a possibly valid application of this decorator when -- perhaps -- it wasn't really appropriate.

In the long run, this kind of misleading hinting is a bad policy.

In the short run, this code will pass every unit test you can throw at it.

What does the author do?
  1. Avoid the topic? Get something published and move on? It is simpler and quicker to ignore decorators when talking about type hints. Dropping the section from the outline would have been easy.
  2. Dig deeply into how we can create Protocols to express a narrower domain of candidates for this decorator? This is work. And it's new work, since the previous edition never touched on the subject. But. Is it part of this cookbook? Or do these deeper examples belong in a separate book?
  3. Find a better example? 
Spoiler Alert: It's all three.

I start by wishing I hadn't broached the topic in the first edition. Maybe I should pretend it wasn't there and leave it out of the second edition.

Then I dig deeply into the topic, overwriting the topic until I'm no longer sure I can write about it. There's enough, and there's too much. A journey requires incremental exposition, and the side-trip into Protocols may not be the appropriate path for any but a very few readers.

After this, I may decide to throw the example out and look for something better.  What's important is having an idea of what is appropriate for the reader's journey, and what is clutter.

The final result can be better because it can be:
  • Focused on something useful.
  • Any edge cases can be corrected to work with the latest language, library, and mypy release.
  • Where necessary, replaced by an alternative example that's clearer and simpler.
Unfortunately (for me) I examine everything. Every word. Every example.

Packt seems to be tolerant of my slow pace of delivery. For me, it simply takes a long time to rewrite -- essentially -- everything. I think the result is worth all the work.

Tuesday, July 28, 2020

Modern Python Cookbook 2nd ed -- Advance Copies -- DM me

This is your "why wait" invitation.

Advanced copies will be available.  

IF.

And this is a big "if".

You have to write a blurb. 

I'll be putting you in contact with Packt marketing folks who will get you your advanced copy so you can write blurbs and reviews and -- well -- actually use the content.

It's all updated to Python 3.8. Type hints almost everywhere. F-strings and the walrus operator. Bunches of devops and data science examples. Plus a few personal examples involving sailboat navigation and management.

See me at LinkedIn https://www.linkedin.com/in/steven-lott-029835/ and I'll hook you up with Packt marketing folks.

See https://www.amazon.com/Modern-Python-Cookbook-Updated-programmer/dp/180020745X for the official Amazon Book Link. This is for ordinary "no obligation to write a review" orders.

DM me directly slott56 at gmail to be put into the marketing spreadsheet.

Tuesday, June 30, 2020

Over-Solving or Solving Problems You Don't Have

Sometimes we call them "Belt and Braces" solutions. As a former suspenders person who switched to belts, the idea of wearing both is a little like over-engineering. In the unlikely event of catastrophic failure of one system, your pants can still remain properly hoist. There's a weird, but defensible reason for that. Most over-engineering lacks a coherent reason. 

Sometimes we call them "Bells and Whistles." The solution has both bells and whistles for signaling. This is usually used in a derogatory sense of useless noisemakers, there for show only. Again, there's a really low-value and dumb, but defensible reason for this. 

While colorful, none of this is helpful for describing over-engineered software. Over-engineered software is often over-engineered for incoherent and indefensible reasons.

Over-engineering generally means trying to solve a problem that no user actually has. This leads to throwing around irrelevant features.

Concrete Example

I lived on a boat. I spent a fair amount of time fretting over navigation. 

There are two big questions: 
  1. How far apart are two points, really. 
  2. What's the real bearing from one point to another.
These are -- in some cases -- easy to answer.

If you have a printed, paper chart at the right scale, you can use dividers to compute a distance. It's actually a very easy task. Similarly, you can read the bearing off the chart directly. There's a trick to comparing a course to a nearby compass rose, but it's easy to learn and very accurate.

Of course, we don't want to painstakingly copy our notes from a paper chart to a spreadsheet to add them up to get total distance. And then fold in speed to get time and fuel consumption. These summary computations are a pain.

What you want is to do all of this with a computer.
  1. Plot the points using a piece of software like OpenCPN (https://opencpn.org).
  2. Extract the GPX file.
  3. Compute distances, bearings, and durations to create a route.
"So?" you ask.

So. When I did this, I researched the math and got a grip on the haversine formula for doing the spherical geometry computation of distances between points on a sphere.

It's not too bad. The formula are big-ish. But manageable. See http://www.edwilliams.org/avform.htm#Dist for the great circle distance formula.

BERJAYA

For airplanes and powered freighters crossing oceans, this is perfect.

For a small sailboat going from Annapolis, Maryland, to the Bahamas, this level of complexity is craziness. While accurate, it doesn't really solve the problem I have. 

I don't actually need that much accuracy. 

I need this much accuracy.

BERJAYA

And no more. This is the essential hypotenuse distance using an R-factor to convert the difference between latitudes and the distance between longitudes into pretty-close distances. For nautical miles, R is 60×180÷Ï€. 

This is simpler and it solves the problem I actually have. Up to about 232 miles, the answer is within 1 mile of correct. The error grows quickly. Double the distance and the error seems to jump to 8 miles. A 464 mile sailing journey (at 6 knots) takes 3 days. Wind, weather, tides and currents will introduce more error than the simplifying assumptions.

What's important is this can be put into a spreadsheet without pain. I don't need to write sophisticated Python apps to apply haversine to sequences of way-points. I can do a simpler hypotenuse computation on waypoints converted to radians.

Is there a lesson learned?

I think there is.

There's the haversine a super-general solution. It handles great-circle routes elegantly. 

But it doesn't solve my actual problem. And that makes it over-engineering.

My problem is what we call rhumb-line sailing. Over short-enough distances the world may as well be flat. Over slightly longer distances, errors in the ship's compass and speedometer make a hyper-accurate great circle route moot. 

(Even with several fancy GPS-based navigation computers, a prudent mariner has paper backups. The list of waypoints, estimated times and directions are essential when the boat's GPS reciever fails.)

I don't really need the sophistication (and the potential for bugs) with haversine. It doesn't solve a problem I actually have.

Tuesday, April 14, 2020

The COBOL-to-SomeBetterLang Translator

Here's a popular idea.
... a COBOL-to-X translator, where X is a more-modern programming language ...
This is a noble aspiration.

In principle -- down deep -- all programming can be reduced to an idealized Turing Machine.

This means that we *should* be able to locate all the state changes in a given spaghetti-bowl of COBOL. Given the abstract state transitions, we can emit a version of that machine in any language.

Emphasis on the *should*.

There are road-blocks.

The first two are rarities. But. When confronted with these, we'll have significant problems.

  • The ALTER statement means the code can be changed at run-time. There are constraints, but still... When the code is not static, the possible domain of state changes moves outside working storage and into the procedure division itself.
  • A data structure with a RENAMES clause. This adds a layer of alternative naming, making the data states quite a bit more complex.
The next one is a huge complication: the GOTO statement. This makes state transitions extremely difficult to analyze. It's possible to untangle any GOTO of arbitrary complexity into properly tested IF and WHILE statements. 

However. The tangle of GOTO's may have been actually meaningful. It may have carried some suggestion of a business owner's intent. A COBOL elimination algorithm may turn tangled code into opaque code. (It's also possible that it clarifies age-old bad programming.)

The ordinary REDEFINES clause. This was heavily used as a storage optimization for the tiny, slow file systems we had back in the olden days. It's a union of distinct types. And. It's a "free" union. We do not know how to distinguish the various types that are being redefined. It's intimately tied to processing logic in the procedure division.

Just to make it even more horrifying...

File layouts evolve over time. It's entirely possible for a *working*, *valid*, *in-production* file to have content that does not match any working program's DDE. The data has flags or indicators or something that lets the app glide past the bad data. But the data is bad. It used to be good. Then something changed, and now it's almost uninterpretable. But the apps work because there are enough paths through the logic to make the row "work" without it matching any file layout in any obvious way.

I'm not sure an automated translation from COBOL is of any value. 

I think it's far better to start with file layouts, review the code, and then write new code from scratch in a modern language. This manual rewrite leads directly to small programs that -- in a modern language -- are little more than class definitions. In some cases, each legacy COBOL app would like becomes a Python module.

Given snapshots of legacy files, the Python can be tested to be sure it does the same things. The processing is not nuanced, or tricky, or even particularly opaque.

The biggest problem is the knowledge captured in COBOL code tends to be disorganized. The real work is disentangling it. A language that supports ruthless refactoring will be helpful.

Tuesday, March 17, 2020

70% of Modern Python Cookbook 2e...

At this point, we're closing in on 9/13 (70%) of the way through the 2nd edition rewrite.

Important changes.
  1. Type Hints
  2. Type Hints
  3. Type Hints
First. Every single class, method, or function has to be changed to add hints. Every. Single. One. This is kind of huge. The book is based on over 13,000 lines of example code in 157 files. A big bunch of rewrites.

Second. Some things were either wrong or at least sketchy. These rewrites are important consequences of using type hints in the first place. If you can't make mypy see things your way, then perhaps your way needs rework.

Third. dataclasses, frozen dataclasses, and NamedTuples have some nuanced overlapping use cases. Frequently, they differ only by small type hint changes.

I hate to provide useless non-advice like "try them and see which works for you." However, there's only so much room to try and beat out a detailed list of consequences of each alternative. Not every decision has a clear, prescriptive, "do this and you'll be happy." Further, I doubt any reader needs detailed explanations of *potential* performance consequences of mutable vs. immutable objects.

Also. I'm very happy cutting back on the overwrought, detailed explanations. This is (a) not the only book on Python, and (b) not my only book. When I started the first drafts 20 years ago, I wrote as though this was my magnum opus, a lifetime achievement.  A Very Bad Idea (VBI™).

This is a resource for people who want more depth. At work, I spend time coaching people who call themselves advanced beginners. The time spent with them has helped me understand my audience a lot better, and stuck to useful exposition of the language features.


Tuesday, August 13, 2019

Coping with Windows via AWS

For a training class, I needed to address The Windows Problem™. TWP is the my summary of all the non-standard features of Windows in its various inconsistent incarnations.

Any training class that involves "install Python" inevitably involves at least one Windows user who can't get their PATH set correctly. It's an eternal mystery to me, since the installers all seem to take care of this, but, some people are able to click the wrong thing somewhere in a simple installation.

The uninstall and start again sometimes helps. Having a Windows expert in the room sometimes helps.

(The hapless flailing I sometimes observe is my personal problem to deal with. I should not let myself get short-tempered with people who try things exactly once and then stop, unable to discern a single branch in the path they chose. It's as if the sequence of dialog boxes with different choices never existed for them, and I need to respect the fact that they did not read the messages and did not think of themselves as actively making choices.)

I have to say that being able to get a Windows machine in free tier of AWS is a wonderful resource.

I can screen capture the installation on Windows. 

I can narrate the sequence of choices I made. I'm hoping this prevents TWP from side-tracking a person who's struggling with Windows.

I haven't used the movies yet. But it's so handy to be able to spin up a Windows machine in the cloud and run the Conda install. It's a whole lot nicer than buying a throw-away Windows machine to do screen shots on.

Tuesday, June 4, 2019

Probabilistic Data Structures

Interesting data structures with O(n) performance. This can help to unscramble O(n²) problems allowing progress.

https://pdsa.readthedocs.io/en/latest/index.html

Tuesday, May 7, 2019

Fiction Writers and Query Letters

See http://flstevens.itmaybeahack.com/writing-world-building-and/ for some back-story on F. L. Stevens and the need to write a *lot* of query letters to agents for fiction. (The non-fiction industry is entirely different; there are acquisition editors who look for technical contributors.)

There's a tiny possibility of a Query Manager Tool (of some kind) on a writer's desktop.

Inputs include:
  • Template letter.
  • A table of customizations per Agent. The intent is to allow more than simple name and pronoun changes. This includes Agent-specific content requirements. These vary, and can include the synopsis, first chapter, first 10 pages, first 50 pages. They're not email attachments; they have to be part of the main body of the email, so they're easy to prepare.
  • Another table of variant pitches to plug into the template. There are a lot of common variations on this theme. Sizes vary from as few as 50 words to almost 300 words. Summaries of published works seem to have a median size of 140 words. A writer may have several (or several dozen) variants to try out.
This can't become a spam engine (agents don't like the idea of an impersonal letter.) 

Also. A stateful list of agents, queries, and responses is important. Some Agents don't necessarily respond to each query; they often offer a blanket apology along the lines of "if you haven't heard back in sixty days, consider it a rejection." So. You want to try again. Eventually. Without being rude. And if you requery, you have to send a different variant on the basic pitch.

Some Agents give a crisp "nope" and you can update your list to avoid requerying.

For new authors (like F. L. Stevens,) there's a kind of manual query tracking mess. It's not really horrible. But it is annoying. Keeping the database up-to-date with responses is about as hard as a tracking spreadsheet, so, there's little value to a lot of fancy Python software.

The csv, string.Temple, email and smtplib components of Python make this really easy to do.  While I think it would be fun to write, creating this would be work avoidance. 

Tuesday, March 12, 2019

Python's multi-threading and the GIL

Got this in an email.
"Python's multi-threading module seems not efficient because of the global interpreter lock?" 
Yep.
Is the trick is to use "Thread-Local Data"?
Nope.

It Gets Worse

Interestingly, there was no further ask. The questioner had decided on thread-local data because the questioner had decided to focus on threads. And they were done making choices at that point.

Sigh.

No question on "What was recommended?" or "What's a common solution?" or "What is Dask?" Nothing other than "confirm my assumptions."

This is swirling around a bunch of emails on trying to determine the maximum number of concurrent threads or processes based on the number of cores or CPU's or something.

Maximum.

I'll repeat that for those who skim.

They think there's a maximum number of concurrent threads or processes.

If you have some computation which (1) makes zero OS requests and (2) is never interrupted, I can imagine you'd like to have all of the cores fully committed to executing that theoretical stream of instructions. You might even be able to split that theoretical workload up based on the number of cores.

Practically, however, that stream of uninterrupted computing rarely exists.

Maybe. Maybe you've got some basin-hopping or gradient-following or random forest ML algorithm which is going to do a burst of computation on an in-memory data structure. In that (rare) case, Dask is still ideal for exploiting all of the cores on your processor.

The upper-bound idea bugs me a lot.

  • Any OS request leads to a context switch. Any context switch leads to waiting. Any waiting means you can have more threads than you have cores. 
  • AFAIK, any memory write outside the local cache will lead to a stall in the pipeline. Another thread can (and should) leap in to the core's processing stream. The only way you can create the "all-computing" sequence of instructions bounded by the number of cores is to *also* be sure the entire thing fits in cache. Hahahaha.

What's the maximum number of threads or processes? It depends on the wait times. It depends on memory writes. It depends on the size of the data structure, the size of cache, and the size of the instruction stream.

Because it depends on a lot of things, it's rather difficult to predict. And that makes it rather difficult to determine a maximum.

Replying about the uselessness of trying to establish a maximum, of course, does nothing. AFAIK, they're still assiduously trying to use os.cpu_count() and os.sched_getaffinity() to put an upper bound on the size of a thread pool.

Acting as if Dask doesn't exist.

Solution

Use Dask.

Or

Use a multiprocessing pool.

These are simple things. They don't require a lot of hand-wringing over the GIL and Thread Local Data. They're built. They work. They're simple and effective solutions.

Side-bar Nonsense

From "a really smart guy. He got his PhD in quantum mechanics and he got major money to actually go build … . He initially worked for ... and now he is working for .... So, when he says something or asks a question, I listen very carefully."
The laudatory blah-blah-blah doesn't really change the argument. It can be omitted. It is an "Appeal to Authority" fallacy, and the Highest Paid Person's Opinion (HIPPO) organizational pattern. Spare me.

Indeed. Asking for my confirmation of using Thread-Local Data to avoid the GIL is -- effectively -- yet another Appeal to Authority. Don't ask me if you have a good idea. An appeal to me as an authority is exactly as bad as appeal to some other authority to convince me you've found a corner case that no one has ever seen before.

Worse is to ask me and then blah-blah-blah Steve Lott says blah-blah-blah. Please don't.

I can be (and often am) wrong.

Write your code. Measure your code's performance. Tell me your results. Explore *all* the alternatives while you're at it.