
Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label HTML. Show all posts

Tuesday, September 18, 2018

Data Modeling Nightmare -- XML, HTML, and Markdown

Here's a particularly tangled and difficult problem. It arises because I have another blog. Specifically this: Team Red Cruising. And it's an unholy mess.

There are two important features of the Team Red Cruising blog.
  1. It's managed with off-line editor(s) so I can write posts from the boat and then upload them when I get connectivity. Welcome to being a technomad -- I don't always have a web-based blog editor available.
  2. It was actually created with two different off-line editors over a period of years: iWeb and Sandvox. iWeb is long dead. Sandvox hasn't seen many updates recently, and I think I'd like to move on to something newer and "better". 
(In this case, "better" means iOS-friendly, e.g., Blogo or BlogPad Pro. Also, Blogo's support site seems to be a right mess. Not a good look. They're working on it.)

The blog isn't the unholy mess. We'll get to the mess below. First, however, some background on the overall strategy. I want to move my content. What's involved? There are several things in play: the hosting, the target, and the source. So. Essentially. Everything.

Changing the Hosting Platform

Both of my legacy tools would export and upload the changes to my hosting service directly, avoiding the overheads of having any complex hosting software. The site was static and served simply from the filesystem via Apache httpd. Publishing was an SFTP transfer to the server. Nothing more. The "platform" was almost nothing.

(I could switch to using an Amazon S3 bucket and a DNS entry and it would work nicely.)

Both of the newer offline editing tools (Blogo and BlogPad Pro) have a tiny bias toward working with hosting services like WordPress. Blogo claims it can also work with Medium and Blogger.

This means running WordPress on top of my default SFTP/Apache configuration. I use A2 Hosting, so this is really easy to do.

So. The hosting is more-or-less settled. I'll do very little. (Dealing with breaking links is a separate hand-wringing exercise.)

In order to move from iWeb and Sandvox to another tool, and start using WordPress, I have two strategies for converting the content.
  1. Ignore my legacy content. Leave it where it is, more-or-less uneditable. The tool(s) are gone, all that's left is the static HTML output from the tool. 
  2. Gather the legacy content and migrate it to WordPress and then pick an offline tool that works with WordPress. 
I've already done strategy #1, when I converted from iWeb to Sandvox. I left the old iWeb stuff out there, and moved to a new URL path with new content. While a clever menu structure can make it look like it's all one multi-year blog, the pages themselves are vastly different in the way they look. There's no comprehensive search. And, of course, I can't easily maintain the old iWeb stuff.

Having done #1, I'm now sure that's a bad idea.

An advantage of moving to WordPress is the ability to have all of the content in one, uniform database. WordPress has export functionality, so the next tool is a distinct possibility.

Note that Sandvox seems to have a distinct problem trying to import iWeb's published content. They have a cool HTML scraper, but iWeb relies on JavaScript, and the scraper doesn't do well.

Getting to WordPress

What we're looking at is a fairly complex data structure. While I'd like to look at this from a vast and reserved distance (i.e., in the abstract) I have a very concrete problem. So, we're forced to consider this from the WordPress POV.

We have a WordPress "Site" with a long series of posts and some pages.


The essence here is that the content can -- to an extent -- be converted to Markdown. The titles and dates are easy to preserve. The body? Not so much.

We can, as an alternative to Markdown, use some kind of skinny HTML that WordPress supports. I think WP can handle a structure free of class names, and using most of the available HTML tags.

Most of the blog content is relatively flat. The block structure is generally limited to images, block quotes, paragraphs, ordered and unordered lists. The inline tags in use seem to be a, img, strong, em, and a few span tags for font changes.

The complexity, then, is building a useful content model from the source. There are a few AST's for Markdown. commonmark.py might have a useful AST.  It's not complex, so it may be simpler to define my own.

It's hard to understand the inline blocks in mistletoe. The python-markdown project uses ElementTree objects to build the AST. I'm not a fan of this because I'm not parsing Markdown.

Starting From -- Well, it's Complicated

There are -- as noted above -- two sources:
  • Sandvox.
  • iWeb.
The Sandvox desktop "database" structure is opaque. The media is easy to find. The content is some kind of binary-encoded data with headers that tell me a little about the XCode environment, but nothing else.

To read this, I have to scrape the HTML using Beautiful Soup. It involves processing like this:

    content = soup.html.body.find("div", id="main-content")
    article = content.find(class_="article-content").find(class_="RichTextElement").div

Find a nested <div> with a target ID. Inside that <div> is where the article can be found.

This seems to work out pretty well. Almost everything I want to preserve can be -- sort of -- mushed into Markdown.

The iWeb desktop "database" is XML. The published HTML depends on Javascript and is hard to work with. The XML is -- of course -- densely wordy and convoluted as can be. But the words and markup are there.  I can use ElementTree to walk down through XML to locate the right tags.

There's a lot of code like this

    main_layer = child_root.find('ns0:site-page/ns0:drawables/ns0:main-layer', ns)

This example digs into site pages, and nested drawables, and main layers of content.  Eventually, we wind up looking at <p>, <span>, <attachment-ref>, and <link> tags in the XML to build the relevant content.
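Here's a minimal, self-contained sketch of that kind of namespace-qualified lookup. The namespace URI, the sample document, and its text are all invented for illustration; the real iWeb schema isn't shown in this post. Only the path expression and the idea of an `ns` prefix mapping come from the line above.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and document; not the real iWeb schema.
ns = {"ns0": "http://example.com/hypothetical-iweb"}

SAMPLE = """
<ns0:root xmlns:ns0="http://example.com/hypothetical-iweb">
  <ns0:site-page>
    <ns0:drawables>
      <ns0:main-layer>
        <ns0:p>First paragraph.</ns0:p>
        <ns0:p>Second <ns0:span>styled</ns0:span> paragraph.</ns0:p>
      </ns0:main-layer>
    </ns0:drawables>
  </ns0:site-page>
</ns0:root>
"""

child_root = ET.fromstring(SAMPLE)

# The prefix in the path is resolved through the ns mapping.
main_layer = child_root.find('ns0:site-page/ns0:drawables/ns0:main-layer', ns)

# itertext() gathers the text of an element and all of its descendants.
paragraphs = ["".join(p.itertext()) for p in main_layer.findall('ns0:p', ns)]
print(paragraphs)
```

The `ns0` prefix in the path only has to match the key in the `ns` dictionary; it doesn't have to match whatever prefix the XML file itself uses.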

The nuance is styles. They're not part of the inline markup. They're stored separately, and included by reference. Each of the four tags that seem to be in use has a style attribute that references styles defined within the posting. Once these references are resolved, I think Markdown can be generated.

The Unholy Mess

The hateful part of this is the disconnect between HTML (and XML) and Markdown. The source data permits indefinite nesting of tags. Semantically meaningless <p><p>words</p></p> are legal. The "flattening" from HTML/XML to Markdown is worrisome: what if I trash an entry by missing something important?

Ideally, it's this:

[Diagram: the idealized block and inline content model]


Pragmatically, HTML/XML can be more complex. This diagram assumes we won't have paragraphs inside list items. HTML permits it. It's redundant in Markdown.

Worse, of course, are the inline tags. HTML has a kabillion of them. The software I've been using seems to limit me to <img>, <strong>, <em>, and <a>. HTML/XML allows nesting. Markdown doesn't.

Ideally, I can reframe the inline tags to create a flat sequence of styled-text objects within any of the tags.
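One way to picture that flat sequence is the sketch below. It's an assumption about what the model could look like, not the code I ended up with: the `Run` class, the `InlineFlattener` class, and the choice of the standard library's `html.parser` are all made up for this illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import FrozenSet, List


@dataclass(frozen=True)
class Run:
    """A piece of text plus the set of inline styles that apply to it."""
    text: str
    styles: FrozenSet[str]


class InlineFlattener(HTMLParser):
    """Collapse nested inline tags into a flat list of Run objects."""
    INLINE = {"strong", "em", "a"}

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []   # inline tags currently open
        self.runs: List[Run] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recently opened matching tag, if there is one.
        if tag in self.stack:
            index = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[index]

    def handle_data(self, data):
        if data:
            self.runs.append(Run(data, frozenset(self.stack)))


def flatten(fragment: str) -> List[Run]:
    parser = InlineFlattener()
    parser.feed(fragment)
    parser.close()   # flush any text still buffered at the end
    return parser.runs


runs = flatten("plain <strong>bold <em>both</em></strong> tail")
for run in runs:
    print(sorted(run.styles), repr(run.text))
```

Nesting turns into a set of styles on each run, so `<strong><em>x</em></strong>` and `<em><strong>x</strong></em>` come out the same. This sketch throws away attributes; a usable version would have to carry the `href` for `a` and handle `img` separately.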

Right now. Headaches.

Working on the code. It's not a general solution to anyone else's problem. But. I'm hoping -- as I beat the problem into submission -- to find a way to make some useful tutorial materials on mapping between complex, and different, data structures.

Tuesday, July 10, 2012

Cool stuff I saw at MADExpo

A random list of cool things.

  1. HTML5.  Start now.  It's supported (more or less) by all browsers if you add appropriate shims.  Start with sites like http://www.html5rocks.com/en/ and continue reading.  It's largely arrived and there's no compelling reason to delay.
  2. JavaScript.  I'm not a fan of the language.  However.  It's clearly here to stay. It's part of many important technologies (like CouchDB).  HTML5 shims (or shivs) are necessary.  Flex (and other browser plug-in languages) are effectively dead.  JavaScript is all that's left.
  3. Redis.  Wow! Is that cool?  Rather than fart around trying to get outstanding performance out of a clunky old RDBMS, use a simple, high performance data store.
  4. MongoDB.  Now that I've seen it, I have a vague notion of places where Mongo is better and places where Couch might be better.  In 80% of the application space, both are fine.  But there's a 20% where we might be able to split a hair and leverage the slightly different feature sets.
I don't build mobile apps.  But if I did, I think I'd start with Appcelerator Titanium before switching to a "native" development environment.  The idea that several standard features are readily accessible with a convenient development environment is pleasant.

I finally laid eyes on an Arduino.  Let me simply say that it looks like dangerous, dangerous fun.  I think I'll be hanging around at Radio Shack a lot.

Tuesday, December 20, 2011

Color Schemes

I worked with this a few years ago to tweak up some web pages.

http://colorschemedesigner.com/

I just rediscovered it.  It's a cool toy.  You get some colors that all "go" together.  If you're careful with your .CSS definitions, you can give people this page and let them fuss around until they're positively silly with color palettes.

Monday, April 5, 2010

Getting Started Creating Web Pages

Got this question recently.
I’m looking for an HTML editor that fits into my price range (free of course). I don’t need to do anything fancy, just vanilla HTML to run on an Apache server ..., and maybe some PHP down the line. Can you recommend any open source or shareware software that would run on Windows?
What to do?

First, civilized folks don’t edit HTML any more. That’s so 1999.

You have a spectrum of choices if you want to try and edit HTML.
  • General-purpose text editors. Good ones do HTML syntax coloring. This is the hardy, forge-through-the-forest way to go. Raw text editing. Like when we were kids. http://en.wikipedia.org/wiki/List_of_text_editors. In Windows world, I use Notepad++.
  • HTML-specific editors. http://en.wikipedia.org/wiki/List_of_HTML_editors. Note that WYSIWYG HTML Editing is more trouble than you’d believe possible. It’s always fun for the first few months, but then you try to do something that confuses the GUI interface and you wind up with an entire paragraph in italics and can’t figure out why. Or you want to move a punctuation outside a link and discover that the editor just can’t figure out where the tag is supposed to fall and puts everything inside it. Most of us do not try to use WYSIWYG HTML editors because it slowly becomes annoying once you get beyond the trivial basics.
  • IDE’s. To produce HTML sensibly, you have to also write .CSS style sheets, and you often have a number of related pages. Essentially, a “project”. An IDE is usually a better choice than an editor. All the good IDE’s are free: Eclipse, NetBeans and Komodo Edit. I use ActiveState Komodo Edit heavily.
While NetBeans or Komodo Edit seems like overkill, it will (eventually) pay out as you move into developing more than static HTML pages.

Better Than HTML

Instead of creating HTML, many of us use “Lightweight Markup” which is much, much easier to cope with and simple tools to produce HTML from the markup. http://en.wikipedia.org/wiki/Lightweight_markup_language

I use reStructuredText instead of HTML. I use the DocUtils project, which has an rst2html.py tool that converts my RST into HTML for me. I also use rst2s5.py to create power-point-like presentations from my reStructuredText. If you want to see the power of RST, you can look at my personal site and my books. 100% RST. No manual HTML anywhere. I use Sphinx to create really complex documents like the books.

For some tasks, I use HTML templates and simple scripts to process data and create static HTML from the data. You’d be surprised how effective this is. Few things require up-to-the-second web applications. Many things can be done as nightly batch programs that emit static HTML and FTP the HTML up to the web page. No PHP.
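A minimal sketch of the template-plus-script idea, using only the standard library. The page layout, the `render` function, and the sample data are invented for illustration; a nightly job would write the result to a file and upload it.

```python
from html import escape
from string import Template

# A hypothetical page template; $title and $rows are filled in by the script.
PAGE = Template("""<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<ul>
$rows
</ul>
</body>
</html>
""")


def render(title, items):
    """Build one static page from a title and a list of (name, value) pairs."""
    rows = "\n".join(
        "<li>{}: {}</li>".format(escape(name), escape(str(value)))
        for name, value in items
    )
    return PAGE.substitute(title=escape(title), rows=rows)


html_text = render("Nightly Report", [("orders", 42), ("returns < 5", 3)])
print(html_text)
```

Escaping the data before it goes into the template keeps a stray `<` or `&` in the data from breaking the page.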

Application Development

For web development, PHP is fine. It will – before long – create holes in your head because it’s so badly thought out. But for getting started, it’s fun. Real companies (like Google) don’t waste their time with it because of the numerous problems PHP causes.

“Problems?” you say. “What problems?”

PHP’s world view (HTML + code in a single package) is a terrible architecture. It’s horribly slow and leads to very muddled, inflexible designs. Everyone who tries to make a global change to their site's “look and feel” finds that PHP is inflexible and a regrettable platform. Even folks who simply want consistency among several different pages within their site find that the PHP world view is more headache than solution.

But it’s fun when you first build a site that works.

Frameworks

Generally, most folks find that a “framework” is absolutely essential for debugging, consistency and separating Content, Processing and Presentation. Even a simple Blog or Forum or Visitor Registration has separate Content, Processing and Presentation; PHP muddles these. A framework can help unmuddle them.

I use Django as framework and Python as programming language. Your hosting site may not support this, in which case you may be in trouble.

The Web Frameworks list on Wikipedia is good. Zend and CodeIgniter are highly recommended in places like StackOverflow. However, here's a good Django vs. PHP comparison: The Onion Uses Django, And Why It Matters To Us.

"Because
Cleaner. Much cleaner. Proper unit testing. Real reusable components across applications. An ORM rather than a just a series of functional query helpers...."

Summary
  1. Get an IDE to edit your pages. Komodo Edit.
  2. Consider using RST and tools instead of raw HTML. Installing Python + DocUtils and using rst2html.py is easier than learning HTML.
  3. Try to avoid PHP’s numerous pitfalls; ideally by avoiding PHP. Use Django + Python and create a real application that clearly separates the content (data model) from processing (view functions) from presentation (HTML templates)

Saturday, March 20, 2010

Obsolescence

My old Citizen Pro-Master watch died. It needs batteries. It's a dive watch, so it also needs to be opened by professionals, have the gaskets replaced, and get pressure tested to be sure it works.

I tried sending it to the Citizen Watch Service facility in Dallas. Their web site has the advantage of being search-engine friendly, a real plus. It has some significant problems, also.
  1. Their address is an image, not text. How do I copy and paste to create a shipping label?
  2. They have spelling mistakes.
  3. Overall, it has an amateurish look, leaving one uncomfortable mailing an expensive watch to them.
They do respond promptly, however, and told me not to mail them the watch. They did not have parts. They suggested the "Torrance" facility.

Okay, try and find the "Torrance" facility on line.



Not so easy, is it? You can find a lot of peripheral information about the Torrance facility and its location. But not any real contact information directly from a Citizen-branded web site.

The Citizen site is slick, but appears to be totally search-engine unfriendly. No spelling mistakes, but amazingly hard to find the "Torrance facility" via the Citizen site.

Further, without actually seeing the watch, email #3 said "We hate for you to send the watch only to find out that we too cannot fix it."
  1. It doesn't work. Why would you hate to have me send it out? If you can't fix it, I haven't lost anything by trying, have I? I don't understand that comment.
  2. Dallas never looked at it, so the "we, too" part doesn't make sense, either.
I tried one last time to explain that I just wanted batteries.

After four emails simply trying to figure out if they would look at it, I guess I have to give up trying to get it fixed.

Sad that a solidly built watch isn't even good for 20 years of service. Sad, too, that the web sites are collectively so bad: either they're slick and search-engine proof or amateurish.

Tuesday, November 10, 2009

Another HTML Cleanup

Browsers are required to skip over bad HTML and render something.

Consequently, many web sites have significant HTML errors that don't show up until you try to scrape their content.

Beautiful Soup has a handy hook for doing markup massage prior to parsing. This is a way of fixing site-specific bugs when necessary.

Here's a two-part massage I wrote recently that corrects two common (and show-stopping) HTML issues with quoted attributes values in a tag.

    import re

    # Fix style="background-image:url("url")" by replacing the inner quotes
    # with the &quot; entity.
    background_image = re.compile(r'background-image:url\("([^"]+)"\)')
    def fix_background_image( match ):
        return 'background-image:url(&quot;%s&quot;)' % ( match.group(1) )

    # Fix src="url name="name"" by closing each attribute value properly.
    bad_img = re.compile( r'src="([^ ]+) name="([^"]+)""' )
    def fix_bad_img( match ):
        return 'src="%s" name="%s"' % ( match.group(1), match.group(2) )

    # Pairs of (compiled pattern, replacement function), applied in order.
    fix_style_quotes = [
        (background_image, fix_background_image),
        (bad_img, fix_bad_img),
    ]

The "fix_style_quotes" sequence is provided to the BeautifulSoup constructor as the markupMassage value.
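To see what a list of (pattern, function) pairs does to the markup, here's a standard-library sketch that applies the pairs by hand. The `apply_massage` helper and the sample tag are made up for this illustration, and the assumption is that the parser does something equivalent to calling each pattern's `sub()` in order before it parses.

```python
import re

# One (pattern, replacement function) pair, in the same shape as fix_style_quotes.
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')


def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))


fixes = [(bad_img, fix_bad_img)]


def apply_massage(markup, fixes):
    """Run every (pattern, function) pair over the markup, in order."""
    for pattern, function in fixes:
        markup = pattern.sub(function, markup)
    return markup


broken = '<img src="http://example.com/a.png name="logo"">'
print(apply_massage(broken, fixes))
```

Testing the fixes this way, without a parser in the loop, makes it easier to confirm that a regexp repairs the bad markup and leaves good markup alone.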


Wednesday, November 4, 2009

Parsing HTML from Microsoft Products (Like Front Page, etc.)

Ugh. When you try to parse MS-generated HTML, you find some extension syntax that is completely befuddling.

I've tried a few things in the past, none were particularly good.

In reading a file recently, I found that even Beautiful Soup was unable to prettify or parse it.
The document was filled with <!--[if...]>...<![endif]--> constructs that looked vaguely directive or comment-like, but still managed to stump the parser.

The BeautifulSoup parser has a markupMassage parameter that applies a sequence of regexps to the source document to cleanup things that are baffling. Some things, however, are too complex for simple regexp's. Specifically, these nested comment-like things were totally confusing.

Here's what I did. I wrote a simple generator which emitted the text that was unguarded by these things. The resulting sequence of text blocks could be assembled into a document that BeautifulSoup could parse.

    import re

    def clean_directives( page ):
        """
        Stupid Microsoft "Directive"-like comments!
        Must remove all <!--[if...]>...<![endif]--> sequences. Which can be nested.
        Must remove all <![if...]>...<![endif]> sequences. Which appear to be the nested version.
        """
        if_endif_pat= re.compile( r"(\<!-*\[if .*?\]\>)|(<!\[endif\]-*\>)" )
        context= []   # stack of currently open [if...] matches
        start= 0      # where the current unguarded text begins; None while inside an [if...]
        for m in if_endif_pat.finditer( page ):
            if "[if" in m.group(0):
                if start is not None:
                    yield page[start:m.start()]
                context.append(m)
                start= None
            elif "[endif" in m.group(0):
                context.pop(-1)
                if len(context) == 0:
                    start= m.end()   # m.end() is already the first character after the match
        if start is not None:
            yield page[start:]
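Here's a condensed, self-contained variant of the same idea, showing how the yielded blocks get joined back into a document. It tracks nesting with a depth counter instead of a list of matches; the name `unguarded_text` and the sample page are invented for this illustration.

```python
import re

IF_ENDIF = re.compile(r"(<!-*\[if .*?\]>)|(<!\[endif\]-*>)")


def unguarded_text(page):
    """Yield the stretches of page that sit outside every [if]...[endif] block."""
    depth = 0
    start = 0
    for m in IF_ENDIF.finditer(page):
        if m.group(1) is not None:      # an opening [if ...]
            if depth == 0:
                yield page[start:m.start()]
            depth += 1
        elif depth > 0:                 # a closing [endif]
            depth -= 1
            if depth == 0:
                start = m.end()
    if depth == 0:
        yield page[start:]


page = (
    "<p>keep</p>"
    "<!--[if gte mso 9]><xml><![if !supportLists]>x<![endif]></xml><![endif]-->"
    "<p>also keep</p>"
)
cleaned = "".join(unguarded_text(page))
print(cleaned)
```

The joined text is what gets handed to the parser. If an `[if...]` is never closed, everything after it is dropped, which matches the behavior of the generator above.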

Monday, May 25, 2009

ReStructured Text markup and Content Management

I can't say enough good things about ReStructuredText (RST).  I've used all of the available markup languages (SGML, HTML and XML).  They have their place, but they all fall short of being truly usable.

In "This sounds complicated, because it is" I reviewed some of my history of cheap content management.

In looking at content of all kinds, I'm finding that RST is much, much easier to work with than SGML, HTML or XML.  In short, I think that RST makes the file system into a really good content management system (CMS).  Unstructured content is a big win.  Structured content is a "don't care".  But there's a middle ground of semi-structured content that requires sophisticated semantic markup.

SGML At The Dawn Of Time

When the web started its ascent (back in the 90's), I was lucky.  I had already been working with folks that did military contracting, and folks there had introduced me to SGML.   When I moved from SGML to HTML, I saw it as a pleasant simplification because it had a more-or-less fixed DTD.  

My first personal web pages were lovingly hand-crafted HTML masterpieces.  (Okay, they were lovingly hand-crafted.)   There was  a lot of work involved in markup, cross-references, and presentation. 

HTML via a Class Hierarchy

My first templating was via proper Python classes.  I created class hierarchies that embodied the page template and filled in required data.  The heart of each class was an emit method that wrote the final HTML.

Variant page layouts and special cases were easily handled by Python simple inheritance.  
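A minimal sketch of what such a hierarchy might look like. The class names, the layout, and the sample data are invented for this illustration (the original code isn't shown here), and this `emit` returns the HTML as a string rather than writing it to a file.

```python
from html import escape


class Page:
    """Base page template: subclasses override the pieces that vary."""
    title = "Untitled"

    def body(self):
        return "<p>No content yet.</p>"

    def emit(self):
        """Return the final HTML for this page."""
        return (
            "<html><head><title>{title}</title></head>"
            "<body><h1>{title}</h1>{body}</body></html>"
        ).format(title=escape(self.title), body=self.body())


class BookListPage(Page):
    """A variant layout, handled by ordinary inheritance."""
    title = "Books"

    def __init__(self, books):
        self.books = books

    def body(self):
        items = "".join("<li>{}</li>".format(escape(b)) for b in self.books)
        return "<ul>{}</ul>".format(items)


page_html = BookListPage(["Building Skills in Python"]).emit()
print(page_html)
```

The superclass owns the overall layout; each subclass only supplies the parts that differ.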

Of course, the big problem is that HTML is just representation.  There's often some bleed-through between the problem domain model and the HTML representation of that underlying model.  You don't want your problem domain objects to encode any HTML.  You can have a generic Tag class, but the Page class is specific to your problem domain.

The Python class structure is nice, but it's only suitable for structured content management.  When you have semi-structured and unstructured data -- the strong suit of HTML -- you find the class hierarchy to be too rigid.

Some time in the early 00's, I discovered Cheetah.

HTML via Templates

Cheetah (and template engines like Mako, Jinja, and numerous others) did what I wanted.  A base template was -- effectively -- a superclass.  Each block in that template could be overridden by a subclass.

The content, then, becomes a relatively simple template file that extends a page layout.  You can handle unstructured and semi-structured content very nicely.  I changed my ways of working with HTML to leverage this elegant, extensible view of the world.  I redid my personal web site: the content became a collection of Cheetah templates that contained all the content.

Note that I've *added* a markup language.  In addition to HTML, I also have some Cheetah markup on each page.  While this got me consistency and flexibility (and a reduction in the volume of stuff on each page) it did make things slightly more complex.

Look at http://cadesignquilts.com/ for another example of an all-Cheetah static site.  I did several sites like this.  The workflow involved (1) designing the overall page, (2) getting the data into a usable form, (3) generating the page-level template files, and (4) running Cheetah to emit HTML from the templates.  All static content.  Runs like lightning.  

The JSP Distraction

Eventually, I started doing development with Struts, which depends heavily on JSP.  You have HTML commingled with Java code.  Plus, you've got custom actions via a tag library to extend JSP processing.  You can create page-level templates with a reasonably smart JSP tag library.

This template solution doesn't work well for unstructured or semi-structured data.  It's a pure programming solution.

DocBook XML and Semantic Markup

I wrote Building Skills in Python entirely in Appleworks.  That was pretty well unmaintainable and unpublishable in that form.

I converted the text to DocBook XML.  I used the Leo outliner to manage the document as a whole.  I wrote my own publishing workflow to transform the XML to HTML and PDF.   It worked reasonably well.

More important, using DocBook reinforced the importance of semantic markup.  It took me back to my SGML days.  It also showed why HTML presentation markup has to be moved out of the document and into the stylesheet.

This was a very nice way to handle the semi-structured and unstructured content in a book.  Direct use of XML is a pain in the neck.  XML has a lot of syntax.  It's much nicer to do your thinking with something lighter weight.  

ReStructured Text (RST) for Unstructured Content

Somewhere in the late 00's, I found Python's docutils and RST.  I can't figure out when I started -- precisely -- but using RST as part of content management didn't fully click at first.

After reworking my personal site, which includes a lot of really unstructured ("random" might be a better word) content, I'm seeing the value in RST + Filesystem as a CMS.  I think the Sphinx folks are right.  If you have a simple markup system and all the filesystem tools that have evolved over the past few decades, you're covered.

Further, on larger projects, I've found that I can pop out a nice template documentation tree with a simple .. toctree:: directive on the index.rst page and generate a tidy, complete documentation package without much pain.

Structured Content

For structured data, you have ordinary classes and programs.  You have SQL databases, ORM to map to classes; all of that technology.  It's easy to write applications that emit RST which you can then publish.  

Most structured content can be boiled down to tables and charts.  The .. csv-table:: directive makes it easy to have an application emit data that you fold into a more elegant-looking report.
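A small sketch of an application emitting a csv-table directive. The `csv_table` function and the sample data are invented for this illustration; the directive name and its `:header:` option are standard docutils.

```python
import csv
import io


def csv_table(title, header, rows):
    """Return reStructuredText for a csv-table directive holding the rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    header_line = ", ".join('"{}"'.format(name) for name in header)
    lines = [
        ".. csv-table:: {}".format(title),
        "   :header: {}".format(header_line),
        "",
    ]
    # The directive body must be indented under the directive line.
    lines.extend("   " + line for line in buffer.getvalue().splitlines())
    return "\n".join(lines) + "\n"


rst = csv_table("Monthly Totals", ["Month", "Orders"], [["Jan", 120], ["Feb", 95]])
print(rst)
```

The application only has to produce rows; the csv module handles quoting, and the RST toolchain handles the presentation.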

The Nuance -- Semi-Structured Data

My worst-case scenarios are my résumés: sailing, programming and writing.  The data has deep semantic meaning:  it isn't just words.  On the other hand, the data has lots of special-cases and exceptions: it isn't totally amenable to a database.

The absolute best part of docutils is that the parser's output is available for processing.  You can -- easily -- add directives and text roles to create semantic meaning.

I experimented with XML and YAML for my résumés.  The XML is cumbersome.  The YAML requires a fairly sophisticated class model to make use of the information.  

RST with a few text roles, however, rocks.  The .. role:: directive makes it easy to throw roles into a document for later use by applications.

Thursday, May 21, 2009

Applet Not Inited; the "Red X" problem

I haven't done Applet stuff in years.

I do -- intensely -- like embedding functionality in web pages.  RIA/Ajax and what-not are something I have trouble with because I'm not a graphic designer.  Javascript and Applets fall into three clear categories:
  1. Basic usability.  Javascript offers lots of little enhancements to HTML presentation that make sense.  Emphasis on "little".
  2. Client-side features.  Many things are simple calculators or other processing that makes sense on the client side -- the relevant factors can be downloaded and used by an applet or javascript script.  
  3. Junk.  There are lots of graphical effects that vary from gratuitous to irritating.  Too many folks in marketing see some "pop-up" technique and think it's cool.  Worse, they'll take an application that lacks solid use cases and try to add flashing to scrolling to emphasize something instead of reducing clutter and distraction.  Sigh.
The Common Problems

In all web-based software development, the number one problem is always permissions.  Always.  In the case of applet development, this is always hurdle number one.  The file isn't owned by the right person or doesn't have the right permissions.  You see the "applet not inited" and "red X icon" as symptoms of the applet not being downloaded at all.

The number two problem is access to resources.  Usually this is a CLASSPATH issue, but it can also be an HTML page with a wrong URI for the applet's code.  You see the "applet not inited" and "red X icon" as symptoms of the applet not being referenced correctly, or not being able to locate all of its parts.

[Technically, the basic access comes before permissions, but you usually don't get access wrong first.  Usually, you get permissions wrong; later, you discover you have a subtle access issue.]

One of the more subtle manifestations is the case-matching issue.  Your Java class definitions are usually UpperCase.  The source file and resulting class file will have this same UpperCase format.  But if you get the case wrong in your HTML, you just get an applet not inited error.  Arrgh.

When you don't work with applets all that often, the "applet not inited" is baffling.

Misdirection

I wasted hours on Google and Stack Overflow looking up "applet not inited" and "Red X icon" and similar stuff.

Then I looked at the HTML I was testing.

Surprise.  No one had moved the .jar file into the proper directory.

There's a lot of stuff on the applet not inited error.  Most of it misses the usual culprits: permissions and access to the resources.  

Wednesday, May 20, 2009

This sounds complicated, because it is

For a while, I generated documentation with Cheetah. I wrote bodies as a fragment of HTML and used Cheetah to wrap those bodies in standard templates with navigation and branding.

To write my books, I learned DocBook markup and used DocBook XSL tools to create HTML and PDF versions of the book's text. Even though XML is hard to work with, I managed to muddle through. It's painful -- at times -- but doable.  

[Eventually, I found XMLMind's XML Editor.  It rocks.  But that's off-topic.]

Then, I found RST and RST2HTML.  For a while, I wrote my documentation in RST and used a simple script to create the HTML version of the documentation from RST source.

Why ReStructuredText?

From their site: "reStructuredText is an easy-to-read, what-you-see-is-what-you-get plaintext markup syntax".  
  • Easy-to-Read.  The markup is very, very simple.  Mostly spacing and simple quoting.  Yet, for edge cases, there is enough richness to approach DocBook XML.
  • WYSIWYG.  The markup doesn't get in the way; you write the text with a few conventions for spacing and quoting.
  • Plain Text.  A few spacing and quoting rules are used to distinguish structure from content.  Presentation is a limited part of RST (like HTML, where some presentation is present in the structural markup, but can be avoided).
RST led me, eventually, to Sphinx.

The Secret of Sphinx

Sphinx is RST-based markup.  You write in plaintext (plus some quoting and spacing) and you get an elegant HTML web site with inter-document references all resolved correctly, contents, indexes, auto-generated API documentation for your Python software, syntax coloring, everything.  Wow.

I can't stop myself from doing everything in Sphinx.  You create a development structure for your source files.  You use a series of toctree directives to build the resulting documentation structure that people will see and use.

I've decided to convert some ancient Cheetah-based stuff to Sphinx.  

Unmarking Up

Revising HTML-based document bodies to RST is annoying.  It can be done with Beautiful Soup.  The HTML is pretty regular (and pretty simple) so it wouldn't be too bad.  Except for a bunch of edge cases that have significant complexity.

The original Cheetah-based site wasn't purely documentation.  It doesn't fit the Sphinx use cases perfectly.  A fairly significant percentage of the Cheetah-based pages are HTML pages with complex, embedded applets to do calculations.

These pages are not -- strictly speaking -- documentation.  They're an application.  They contain markup (<embed> mostly) that RST can't generate.  Further, they have to be unit tested prior to running Sphinx to build the documentation, since the HTML is actually part of the application.

Raw HTML?

The applet pages are -- more or less -- raw HTML pages that need to be folded in with the Sphinx-generated documentation.  Sphinx has an html_static_path configuration parameter that can copy these applications from project folders into destination directories.

But this leaves me with dozens of Cheetah-generated pages as part of this application.  The presence of Cheetah in the midst of this Sphinx operation makes things complicated.

Or, perhaps it doesn't.

It turns out that Sphinx is built on Jinja.  There's a template engine under the hood!  That's handy.  That lets me build the application HTML with a slightly different template engine; one that's compatible with the rest of the Sphinx-generated site.

I think I've got a clean, RST-based replacement for my lovingly hand-crafted HTML.  It's a lot of rework, but the simplification is of immense value.