S.Lott-Software Architect: open source

Showing posts with label open source. Show all posts

Tuesday, November 23, 2021

Processing Apple Numbers Files

Apple's freebie tools -- Pages, Numbers, Keynote, Garage Band, etc. -- are wonderful things. I really like Numbers. I'm tolerant of Pages. I've used Pages to write books and publish them to the Apple Bookstore. (Shameless Plug: Pivot to Python.)

These tools have a significant problem. Protobuf.

Some History

Once upon a time, Numbers used an XML-based format. This was back in '09, I think. At some point, version 10 of Numbers (2013?) switched to Protobuf.

I had already unwound XLSX and ODS files, which are XML. I had also unwound Numbers '09 in XML. I had a sense of what a spreadsheet needed to look like.

The switch to Protobuf also meant using Snappy compression. Back in 2014? I worked out my own version of the Snappy decompression algorithm in pure Python. I think I knew about python-snappy but didn't want the complex binary dependency. I wrote my own instead.

I found the iWorkFileFormat project. From this, and a lot of prior knowledge about the XML formats, I worked out a way to unpack the protobuf bytes into Python objects. I didn't leverage the formal protobuf definitions; instead I lazily mapped the objects to a dictionary of keys and bytes. If a field had a complex internal structure, I parsed the subset of bytes.

(I vaguely recall the Protobuf definitions are in XCode somewhere. But. I didn't want to write a protobuf compiler to make a pure-Python implementation. See the protobuf project for what I was looking for, but didn't have at the time.)

Which brings us to today's discovery.

State of the Art

Someone has taken the steps necessary to properly unpack Numbers files. See numbers-parser. This has first-class snappy and protobuf processing. It installs cleanly. It has an issue, and I may try to work on it.

I'm rewriting my own Stingray Reader with intent to dispose of my own XLSX, ODS, and Numbers processing. These can (and should) be imported separately. It's a huge simplification to stand on the shoulders of giants and write a dumb Facade over their work.

Ideally, all the various spreadsheet parsing folks would adopt some kind of standard API. This could be analogous to the database API used by SQL processing in Python. The folks with https://www.excelpython.org or http://www.python-excel.org might be a place to start, since they list a number of packages.

The bonus part? Seeing my name in the Credits for numbers-parser. That was delightful.

At some point, I need to make a coherent pitch for a common API with permits external JSON Schema as part of extracting data from spreadsheets.

First. I need to get Stingray Reader into a more final form.

Wednesday, September 9, 2020

Open Source Support Ideas

"... [I] am thinking of building an in house conda forge, or buying a solution, or paying someone to set something up."

The build v. Buy decision. This is always hard. Really hard.

We used to ask "What's your business? Is it building software or making widgets?"

And (for some) the business is making widgets.

This is short-sighted.

But. A lot of folks in senior positions were given this as a model back in the olden days. So, you need to address the "how much non-widget stuff are we going to take on?" question.

The "Our Business is Widgets" is short-sighted because it fails to recognize where the money is made. It's the ancillary things *around* the widgets. Things only software can do. Customer satisfaction. Supply-chain management.

So. Business development == Software development. They're inextricably bound.

With that background, lets' look at what you want to do.

Open Source software is not actually "free" in any sense. Someone has to support it. If you embrace open source, then, you have to support it in-house. Somehow. And that in-house work isn't small.

The in-house open-source support comes in degrees, starting with a distant "throw money at a maintainer" kind of action. You know. Support NumFocus and Anaconda and hope it trickles down appropriately (it sometimes does) to the real maintainers.

The next step is to build the tooling (and expertise) in-house. Conda forge (or maybe JFrog or something else) and have someone on staff who can grow to really understand how it fits together. They may not be up to external contributions, but they can do the installs, make sure things are running, handle updates, manage certificates, rotate keys, all the things that lead to smooth experience for users.

The top step is to hire one of the principles and let them do their open source thing but give them office space and a salary.

I'm big on the middle step. Do it in-house. It's *not* your core business (in a very narrow, legal and finance sense) but it *is* the backbone fo the information-centric value-add where the real money is made.

Folks in management (usually accouting) get frustrated with this approach. It seems like it should take a month or two and you're up and running. (The GAAP requires we plan like this. Make up a random date. Make up a random budget.)

But. Then. 13 weeks into the 8-week project, you still don't have a reliable, high-performance server. Accounting gets grumpy because the plan you have them months ago turns out to have been riddled with invalid assumptions and half-truths. (They get revenge by cancelling the project at the worst moment to be sure it's a huge loss in everyone's eyes.)

I think the mistake is failing to enumerate the lessons learned. A lot will be learned. A real lot. And some of it is small, but it still takes all day to figure it out. Other things are big and take failed roll-outs and screwed up backup-restore activities. It's essential to make a strong parallel between open source and open learning.

You don't know everything. (Indeed, you can't, much to the consternation of the accountants.) But. You are learning at a steady rate. The money is creating significant value.

And after 26 weeks, when things *really* seem to be working, there needs to be a very splashy list of "things we can do now that we couldn't do before." A demo of starting a new project. `conda create demo python=3.8.6 --file demo_env.yml` and watch it run, baby. A little dask. Maybe analyze some taxicab data.

Tuesday, October 30, 2018

The SourceForge vs. GitHub Conundrum

Or "When is it time to move?"

I've got https://sourceforge.net/projects/stingrayreader/ which has been on SourceForge since forever.

Really since about 2014. Not that long. But. Maybe long enough?

The velocity of change is relatively slow.

However.

(And this is a big however.) SourceForge seems kind of complicated when compared with Github.

It's not a completely fair comparison. SourceForge has a *lot* of features. I don't use very many of those features.

The troubling issues are these.

1. Documentation. SourceForge -- while it has a Git interface -- doesn't handle my documentation very well. Instead of a docs directory, I do a separate upload of the HTML. It's inelegant. SourceForge may handle this more smoothly nowadays. Or maybe I should switch to readthedocs?

2. The Literate Programming Workflow. There's an extra step (or two) in LP workflows. The PyLit3 synchronization to create the working Python from the RST source. This is followed by the ubiquitous steps creation of a release, creation of a distribution, and the upload to PyPI. I don't have an elegant handle on this because my velocity of change is so low. SourceForge imposed a "make your own ZIP file" mentality that could be replaced by a nicer "use PyPI" approach.

3. Clunky Design Issue. I've uncovered a clunky, stateful design problem in the StingrayReader. I really really really need to fix it. And while fixing it, why not move to Github?

4. Compatibility Testing. The StingrayReader seems to work with Python 3.5 and up. I don't have a formal Tox suite. I think it works with a number of versions of XLRD. And it *should* be amenable to other tools for Excel processing. Not sure. And (until I start using tox) can't tell.

5. Type Hints. See #3. The stateful design problem can be finessed into a much more elegant use of NamedTuples. And then mypy can be used.

6. Unit Tests. Currently, the testing is all unittest.TestCase. I really want to convert to pytest and simplify all of it.

7. Lack of a proper workflow in the first place. See #2. It's a more-or-less sitting in the master branch of a git repo that's part of SourceForge. That's kind of shabby.

8. Version Numbering Vagueness. When I was building my own Zip archives from the code manually (because that's the way SourceForge worked.) I wasn't super careful about semantic versioning, and I've been release patch-number versions for a while. Which is wrong. A few of those versions included new features. Minor, but features.

But. One tiny new feature. So. It will be release 4.5.

See https://sourceforge.net/p/stingrayreader/blog/2018/10/moving-to-github/ for status, also

Tuesday, October 4, 2016

Alternatives to PowerPoint (or Keynote) for Presentations

This: https://opensource.com/business/16/9/alternatives-powerpoint

Missing from the list? The S5-based slide-show tools that are part of docutils.

The only issue with S5 is that you need to carefully review each and every page to be sure you material fits. There's no autosizing of the fonts, or other trickery to pack too much trash onto a slide.

TIL that Lync/Skype doesn't politely handle Keynote on a Mac. You must mirror displays because Lync can only share one of your displays, and it's not the one Keynote chooses to display on.

I often use Keynote because it's expedient. I sometimes use S5 to show off an entirely open-source toolchain.

Tuesday, March 3, 2015

Let's all Build a Hat Rack

Wound up here: "A Place to Hang Your Hat" and the #LABHR hash tag.

H/T to this post: "Building a Hat Rack."

This is a huge idea. I follow some folks from the Code For America group. The +Mark Headd Twitter feed (@mheadd) is particularly helpful for understanding this movement. Also, follow +Kevin Curry (@kmcurry) for more insights.

Open Source contributions are often anonymous and the rewards are intangible. A little bit of tangibility is a huge thing.

My (old) open source Python books have a "donate" button. Once in a while I'll collect the comments that come in on the donate button. They're amazingly positive and encouraging. But also private. Since I have a paying gig writing about Python, I don't need any more encouragement than I already have. (Indeed, I probably need less encouragement.)

However.

There are unsung heroes at every hackathon and tech meetup who could benefit from some recognition. Perhaps they're looking for a new job or a transition in their existing job. Perhaps they're looking to break through one of the obscure social barriers that seem to appear in a community where everyone looks alike.

And. There's a tiny social plus to being the Recognizer in Chief. There's a lot to be said in praise of the talent spotters and friction eliminators.

Thursday, December 2, 2010

More Open Source and More Agile News

ComputerWorld, November 22, 2010, has this: "Open Source Grows Up". The news of the weird is "It's clear that open-source software has moved beyond the zealotry phase." I wasn't aware this phase existed. I hope to see the project plan with "zealotry" in it.

The real news is "More than two-thirds (69%) of the respondents said they expect to increase their investments in open source." That's cool.

Be sure to read the sidebar "Many Enterprises Aren't Giving Back." There's still a lot of concern over intellectual property. I've seen a lot of corporate software -- it's not that good. Most companies that are wringing their hands over losing control of their trade secrets should really be wringing their hands because their in-house software won't measure up to open-source standards.

I like this other quote: 'Five years ago, the South Carolina government was "considering writing a policy to prohibit or at least 'control' open source".' I like the "Must Control Open Source" feeling that IT leadership has. Without this mysterious "control", the organization could be swamped by software it didn't write. How's that different from being swamped by software products that involve contracts and fees? And requires Patch Tuesday?

Agility

SD Times has two articles on Agile methods. Both on the front page of a print publication. That's how you know the technique has "arrived".

First, there's "VersionOne survey finds agile knowledge and use on the rise". My favorite quote: "Interestingly, management support, the ability to change organizational culture and general resistance to change, remained at the forefronts of participants’ minds when indicating barriers to further agile adoption." I like the management barriers. I like it when management tries to exert more 'control' over a process (like software creation) that's so poorly understood.

Here's the companion piece, "For agile success, leaders must let teams loose". This is all good advice. Particularly, this: '"It’s hard to not command and control, but leadership is not about managing work. It’s about creating a capable organization that can manage work," [Rick Simmons] added.'

If you're micro-managing, you're not building an organization. Excellent advice. However, tell that to the financial control crowd.

Budgets and "Control"

Finally, be sure to read this by Frank Hayes in ComputerWorld: "Big Projects, Done Small". Here are the relevant quotes: "The logical conclusion: We should break up all IT projects into sub-million-dollar pieces." "The political reality: Everybody wants multimillion-dollar behemoths." "...huge projects get big political support."

In short, Agile is the right thing to do until you're trying to get approval. Bottom line: use Agile methods. But for purposes of pandering to executives who want to see large numbers with lots of zeroes, it's often necessary to write giant project "plans" that you don't actually use.

Go ahead, write waterfall plans. Don't feel guilty or conflicted. Some folks won't catch up with Agility because they think "Control" is better. Pander to them. It's okay.

Tuesday, November 23, 2010

Open-Source, moving from "when" to "how"

Interesting item in the November 1 eWeek: "Open-Source Software in the Enterprise".

Here's the key quote: "rather than asking if or when, organizations are increasingly focusing on how".

Interestingly, the article then goes on to talk about licensing and intellectual property management. I suppose those count, but they're fringe issues, only relevant to lawyers.

Here's the two real issues:

Configuration Management
Quality Assurance

Many organizations do things so poorly that open source software is unusable.

Configuration Management

Many organizations have non-existent or very primitive CM. They may have some source code control and some change management. But the configuration of the test and production technology stacks are absolutely mystifying. No one can positively say what versions of what products are in production or in test.

The funniest conversations center on the interconnectedness of open source projects. You don't just take a library and plug it in. It's not like neatly-stacked laundry, all washed and folded and ready to be used. Open Source software is more like a dryer full of a tangled collection of stuff that's tied in knots and suffers from major static cling.

"How do we upgrade [X]"? You don't simply replace a component. You create a new tech stack with the upgraded [X] and all of the stuff that's knotted together with [X].

Changing from Python 2.5 to 2.6 changes any binary-compiled libraries like PIL or MySQL_python, mod_wsgi, etc. These, in turn, may require OS library upgrades.

A tech stack must be a hallowed thing. Someone must actively manage change to be sure they're complete and consistent across the enterprise.

Quality Assurance

Many organizations have very weak QA. They have an organization, but it has no authority and developers are permitted to run rough-shod over QA any time they use the magic words "the user's demand it".

The truly funny conversations center on how the organization can be sure that open source software works, or is free of hidden malware. I've been asked how a client can vet an open source package to be sure that it is malware free. As if the client's Windows PC's are pristine works of art and the Apache POI project is just a logic bomb.

The idea that you might do acceptance testing on open source software always seems foreign to everyone involved. You test your in-house software. Why not test the downloaded software? Indeed, why not test commercial software for which you pay fees? Why does QA only seem to apply to in-house software?

Goals vs. Directions

I think one other thing that's endlessly confusing is "Architecture is a Direction not a Goal." I get the feeling that many organizations strive for a crazy level of stability where everything is fixed, unchanging and completely static (except for patches.)

The idea that we have systems on a new tech stack and systems on an old tech stack seems to lead to angry words and stalled projects. However, there's really no sensible alternative.

We have tech stack [X.1], [X.2] and [X.3] running in production. We have [X.4] in final quality assurance testing. We have [X.5] in development. The legacy servers running version 1 won't be upgraded, they'll be retired. The legacy servers running version 2 may be upgraded, depending on the value of the new features vs. the cost of upgrading. The data in the version 3 servers will be migrated to the version 4, and the old servers retired.

It can be complex. The architecture is a direction in which most (but not all) servers are heading. The architecture changes, and some servers catch up to the golden ideal and some servers never catch up. Sometimes the upgrade doesn't create enough value.

These are "how" questions that are more important than studying the various licensing provisions.

Wednesday, March 17, 2010

COBOL File Processing in Python (really)

Years ago (6? 7?) I did some data profiling in Python.

This required reading COBOL files with Python code.

Superficially, this is not really very hard.

Python slice syntax will pick fields on of the record. For example: data[12:14].
Python codecs will convert from EBCDIC to Unicode without pain. codecs.get('cp037').decode( someField ).

With some more finesse, one can handle COMP-3 fields. Right?

Maybe not.

Problems

There are three serious problems.

Computing the field offsets (and in some cases sizes) is a large, error-prone pain.
The string slice notation makes the COBOL record structure completely opaque.
COMP-3 conversion is both ubiquitous and tricky.

Okay, what's the solution?

COBOL DDE Parsing

What I did was write a simple parser that read the COBOL "copybook" -- the COBOL source that defined the file layout. Given this Data Definition Entry (DDE) it's easy to work out offset, size and type conversion requirements.

It was way cool, so I delivered the results -- but not the code -- to the customer. I posted parts of the code on my personal site.

Over the years, a few people have found it and asked pointed questions.

Recently, however, I got a patch kit because of a serious bug.

Unit Tests

The code was written in Python 2.2 style -- very primitive. I cleaned it up, added unit tests, and -- most importantly -- corrected a few serious bugs.

And, I posted the whole thing to SourceForge, so others can -- in principle -- fix the remaining bugs. The project is here: https://sourceforge.net/projects/cobol-dde/.

Thursday, March 4, 2010

Literate Programming

About a decade ago, I discovered the concept of Literate Programming. It's seductive. The idea is to write elegant documentation that embeds the actual working code.

For tricky, complex, high-visibility components, a literate programming approach can give people confidence that the software actually works as advertised.

I actually wrote my own Literate Programming tool. Amazingly, someone actually cared deeply enough to send me a patch to fix some long-standing errors in the LaTeX output. What do I do with a patch kit?

Forward and Reverse LP

There are two schools of literate programming: Forward and Reverse. Forward literate programming starts with a source text and generates the documentation plus the source code files required by the compilers or interpreters.

Reverse literate programming generates documentation from the source files. Tools like Sphinx do this very nicely. With a little bit of work, one can create a documentation tree with uses Sphinx's autodoc extension to create great documentation from the source.

Reverse LP, however, tends to focus on the API's of the code as written. Sometimes it's hard to figure out why it's written that way without further, deeper explanation. And keeping a separate documentation tree in Sphinx means that the code and the documentation can disagree.

My pyWeb Tool

The gold standard in Literate Programming is Knuth's Web. This is available as CWEB which generates TeX output. It's quite sophisticated, allowing very rich markup and formatting of the code.

There are numerous imitators, each less and less sophisticated. When you get to nuweb and noweb, you're getting down to the bare bones of what the core use cases are.

For reasons I can't recall, I wrote one, too. I wrote (and used) pyWeb for a few small projects. I posted some code as an experiment on the Zope site, since I was a Zope user for a while. I went to move it and got emails from a couple of folks who are serious Literate Programmers and where concerned when their links broke. Cool.

I moved the code to my own personal site, where it sat between 2002 and today. It was hard-to-find; but there are some hard-core Literate Programmers who are willing to chase down tools and play with them to see how they work at producing elegant, readable code. Way cool.

Patch Kit

Recently, I received a patch kit for pyWeb. This says several things.

It's at least good enough that folks can use it and find the errors in the LaTeX markup it produced
Some folks care enough about good software to help correct the errors.
Hosting it on my personal web site is a bad idea.

So, I created a SourceForge project, pyWeb Literate Programming Tool, to make it easier for folks to find and correct any problems.

I expect the number of downloads to hover right around zero forever. But at least it's now fixable by someone other than me.

Saturday, October 31, 2009

Open Source in the News

Whitehouse.gov is publicly using open source tools.

See Boing Boing blog entry. Plus Huffington Post blog entry.

Most importantly, read this from O'Reilly.

Many places are using open source in stealth mode. Some even deny it. Ask your CIO what the policy on open source is, then check to see if you're using Apache. Often, this is an oops -- policy says "no", practice says "yes".

For a weird perspective on open source, read Binstock's Integration Watch piece in SD Times: From Open Source to Commercial Quality. "rigor is the quality often missing from OSS projects". "Often" missing? I guess Binstock travels in wider circles and sees more bad open source software than I do. The stuff I work with is very, very high quality. Python, Django, Sphinx, the LaTeX stack, PIL, Docutils -- all seem to be outstandingly good.

I guess "rigor" isn't an obvious tangible feature of the software. Indeed, I'm not sure what "commercial quality" means if open source lacks this, also.

All of the commercial software I've seen has been in-house developed stuff. Because there's so much of it, it must have these elusive "rigor" and "commercial quality" features that Binstock values so highly. Yet, the software is really bad: it barely works and they can't maintain it.

My experience is different from Binstock's. Also, most of the key points in his article are process points, not software quality issues. My experience is that the open source product exceeds commercial quality. Since there's no money to support a help desk or marketing or product ownership, the open source process doesn't offer all the features of a commercial operation.

Friday, May 22, 2009

Open Source Use Rising

Or so claims SD Times...

http://www.sdtimes.com/link/33430

The decision process includes: "find a low-cost solution". More importantly, it includes "justify the fees to purchase and for support."

This drives down the cost of software and support for commercial products. It also rationalizes what your buy when you buy a license and pay for support.

In the olden days, you just paid. Now, you debate the merits of support and determine what you're getting and if the value is commensurate with the cost.

S.Lott-Software Architect

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.