S.Lott-Software Architect: wsgi


Tuesday, September 21, 2021

Found an ancient CGI script -- part IV -- OpenAPI specification

See the previous sections, starting with the first on finding an ancient CGI script.

We don't need an OpenAPI specification. But, it is so helpful to formalize the behavior of a web site that it's hard for me to imagine working without it.

In this case, the legacy script only has a few paths, so the OpenAPI specification is relatively small.

openapi: 3.0.1
info:
  title: CGI Conversion
  version: 1.0.0
paths:
  /resources/{type}/:
    get:
      summary: Query Form
      operationId: form
      parameters:
      - name: type
        in: path
        required: true
        schema:
          type: string
      responses:
        200:
          description: Form
          content: {}
    post:
      summary: Add a document
      operationId: update
      requestBody:
        description: document
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Document'
        required: true
      parameters:
      - name: type
        in: path
        required: true
        schema:
          type: string
      responses:
        201:
          description: Created
          content: 
            text/html:
              {}
        405:
          description: Invalid input
          content: 
            text/html:
              {}
  /resources/{type}/{guid}:
    get:
      summary: Find documents
      operationId: find
      parameters:
      - name: type
        in: path
        required: true
        schema:
          type: string
      - name: guid
        in: path
        required: true
        schema:
          type: string
      responses:
        200:
          description: successful operation
          content:
            text/html:
              {}
        404:
          description: Not Found
          content:
            text/html:
              {}

components:
  schemas:
    Document:
      type: object
      properties:
        fname:
          type: string
        lname:
          type: string

This shows the rudiments of the paths and the responses. There are three "successful" kinds of responses, plus two additional error responses that are formally defined.

There is a lot of space in this document for additional documentation and details. Every opportunity should be taken to capture details about the application, what it does now, and what it should do when it's rewritten.

In our example, the form (and resulting data structure) is a degenerate class with a pair of fields. We simply write the repr() string to a file. In a practical application, this will often be a bit more complex. There may be validation rules, some of which are obscure, hidden in odd places in the application code.

What's essential here is continuing the refactoring process to more fully understand the underlying data model and state processing. These features need to be disentangled from HTML output and CGI input.

The OpenAPI spec serves as an important part of the definition of done. It supplements the context diagram with implementation details. In a very real and practical way, this drives the integration test suite. We can transform OpenAPI to Gherkin and use this to test the overall web site. See https://medium.com/capital-one-tech/spec-to-gherkin-to-code-902e346bb9aa for more on this topic.
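As a rough illustration of the spec-to-test idea, here's a sketch that walks the paths section of a spec and emits one Gherkin-style scenario per documented response. The SPEC dictionary is a hand-copied subset of the specification above. The scenarios() function and the wording of the steps are assumptions made for illustration; they are not the approach described in the linked article.

```python
"""Sketch: derive Gherkin-style scenarios from an OpenAPI paths section."""

# Hand-copied subset of the specification; a real tool would load the YAML file.
SPEC = {
    "paths": {
        "/resources/{type}/": {
            "get": {"operationId": "form", "responses": {"200": {}}},
            "post": {"operationId": "update", "responses": {"201": {}, "405": {}}},
        },
        "/resources/{type}/{guid}": {
            "get": {"operationId": "find", "responses": {"200": {}, "404": {}}},
        },
    }
}


def scenarios(spec):
    """Yield one scenario (as text) for each path, method, and status code."""
    for path, operations in spec["paths"].items():
        for method, operation in operations.items():
            for status in operation["responses"]:
                yield "\n".join([
                    f"Scenario: {operation['operationId']} returns {status}",
                    f"  When the client sends {method.upper()} {path}",
                    f"  Then the response status is {status}",
                ])


if __name__ == "__main__":
    print("\n\n".join(scenarios(SPEC)))
```

Each scenario still needs step definitions and concrete example values before it can run against the web site; the sketch only shows that the spec enumerates the behaviors to test.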

Tuesday, September 14, 2021

Found an ancient cgi script -- part III -- refactoring

Be sure to see the original script and the test cases in the prior posts.

We need to understand a little about what a web request is. This can help us do the refactoring.

It can help to think of a web server as a function that maps a request to a response. The request is really a composite object with headers, $h$, method verb, $v$, and URL, $u$. Similarly, the response is a composite with headers, $h$, and content, $c$.

$$h, c = s(h, v, u)$$

The above is true for idempotent requests; usually, the method verb is GET.

Some requests make a state change, however, and use method verbs like POST, PUT, PATCH, or DELETE.

$$h, c; \hat S = s(h, v, u; S)$$

There's a state, $S$, which is transformed to a new state, $\hat S$, as part of making the request.

For the most part, CGI scripts are limited to GET and POST methods. The GET method is (ideally) for idempotent, no-state-change requests. The POST should be limited to making state changes. In some cases, there will be an explicit GET-after-POST sequence of operations using an intermediate redirection so the browser's "back" button works properly.

In too many cases, the rules aren't followed well and there will be state transitions on GET and idempotent POST operations. Sigh.

Multiple Resources

Most web servers will provide content for a number of resource instances. Often they will work with a number of instances of a variety of resource types. The degenerate case is a server providing content for a single instance of a single type.

Each resource comes from the server's universe of resources, $R$.

$$r \in R$$

Each resource type, $t(r )$, is part of some overall collection of types that describe the various resources. In some cases we'll identify resources with a path that includes the type of the resource, $t(r )$, and an identifier within that type, $i(r )$, $\langle t( r ), i( r ) \rangle$. This often maps to a character string "type/name" that's part of a URL's path.

We can think of a response's content as the HTML markup, $m_h$, around a resource, $r$, managed by the web server.

$$ c = m_h( r )$$

This is a representation of the resource's state. The HTML representation can have both semantic and style components. We might, for example, have a number of HTML structure elements like <p>, as well as CSS styles. Ideally, the styles don't convey semantic information, but the HTML tags do.

Multiple Services

There are often multiple, closely-related services within a web server. A common design pattern is to have services that vary based on a path item, $p(u)$, within the URL.

$$ h, m_h(r); \hat S = s(h, v, u; S) = \begin{cases} s_x(h, v, u; S) & \text{if } p(u) = x \\ s_y(h, v, u; S) & \text{if } p(u) = y \end{cases} $$

There isn't, of course, any formal requirement for a tidy mapping from some element of the path, $p(u)$, to a type, $t ( r ) $, that characterizes a resource, $r$. Utter chaos is allowed. Thankfully, it's not common.

While there may not be a tidy type-based mapping, there must be a mapping from a triple and a state, $\langle h, u, v; S \rangle $ to a resource, $r$. This mapping can be considered a database or filesystem query, $q(\langle h, u, v; S \rangle)$. The request may also involve state change. It can help to think of the state as a function that can emit a new state for a request. This implies two low-level processing concepts:

$$ \{ r \in R | q(\langle h, u, v; S \rangle, r) \} $$

And

$$ \hat S = S(\langle h, u, v \rangle) $$

The query processing to locate resources is one aspect of the underlying model. The state change for the universe of resources is another aspect of the underlying model. Each request must return a resource; it may also make a state change.
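The two low-level concepts can be sketched in Python. This is a minimal illustration, assuming the state is a dictionary keyed by (type, identifier) and that the identifier arrives in the URL. The Request tuple and the query() and next_state() names are hypothetical; none of them come from the legacy script.

```python
from typing import NamedTuple, Optional


class Request(NamedTuple):
    headers: dict
    verb: str
    url: str
    body: Optional[str] = None


def resource_key(url):
    """Split '/resources/<type>/<id>' into a (type, id) pair.

    Raises ValueError for URLs that do not have this shape.
    """
    _, _, type_name, ident = url.split("/", 3)
    return type_name, ident


def query(request, state):
    """The set {r in R | q(<h, u, v; S>, r)}: resources matching the request."""
    key = resource_key(request.url)
    return [state[key]] if key in state else []


def next_state(request, state):
    """S-hat: a new state for state-changing verbs; GET leaves S untouched."""
    if request.verb == "POST":
        return {**state, resource_key(request.url): request.body}
    return state
```

Because next_state() builds a new dictionary instead of mutating the old one, a test can compare the state before and after a request.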

What's essential, then, is to see how these various $s_x$ functions are related to the original code. The $m_h(r)$ function, the $p( u )$ mappings, and the $s_{t(u)}(h, v, u; S)$ functions are all separate features that can be disentangled from each other.

Why All The Math?

We need to be utterly ruthless about separating several things that are often jumbled together.

A web server works with a universe of resources. These can be filesystem objects, database rows, external web services, anything.
Resources have an internal state. Resources may also have internal types (or classes) to define common features.
There's at least one function to create an HTML representation of state. This may be partial or ambiguous. It may also be complete and unambiguous.
There is at least one function to map a URL to zero or more resources. This can (and often does) result in 404 errors because a resource cannot be found.
There may be a function to create a server state from the existing server state and a request. This can result in 403 errors because an operation is forbidden.

Additionally, there can be user authentication and authorization rules. The users are simply resources. Authentication is simply a query to locate a user. It may involve using the password as part of the user lookup. Users can have roles. Authorization is a property of a user's role required by a specific query or state change (or both.)

As we noted in the overview, the HTML representation of state is handled (entirely) by Jinja. HTML templates are used. Any non-Jinja HTML processing in legacy CGI code can be deleted.

The mapping from URL to resource may involve several steps. In Flask, some of these steps are handled by the mapping from a URL to a view function. This is often used to partition resources by type. Within a view function, individual resources will be located based on URL mapping.

What do we do?

In our example code, we have a great deal of redundant HTML processing. One sensible option is to separate all of the HTML printing into one or more functions that emit the various kinds of pages.

In our example, the parsing of the path is a single, long nested bunch of if-elif processing. This should be refactored into individual functions. A single, top-level function can decide what the URL pattern and verb mean, and then delegate the processing to a view function. The view function can then use an HTML rendering function to build the resulting page.
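Here's a minimal sketch of that top-level function in Python 3. The view function names and the ROUTES table are hypothetical, and the views return a (status, text) pair as a stand-in for real page rendering. It also simplifies the legacy path handling: leading and trailing slashes are stripped before splitting.

```python
"""Sketch: one top-level function routes a request to a view function."""


def form_view(type_name):
    return 200, f"form for {type_name}"


def create_view(type_name):
    return 201, f"created a new {type_name}"


def find_view(type_name, resource_name):
    return 200, f"{type_name}/{resource_name}"


# (verb, number of path elements after "resources") -> view function
ROUTES = {
    ("GET", 1): form_view,
    ("POST", 1): create_view,
    ("GET", 2): find_view,
}


def dispatch(verb, path_info):
    """Decide what the URL pattern and verb mean, then delegate."""
    elements = path_info.strip("/").split("/")
    if elements[0] != "resources":
        return 404, "not found"
    args = elements[1:]
    view = ROUTES.get((verb, len(args)))
    if view is None:
        return 403, "forbidden"
    return view(*args)
```

The nested if-elif chain becomes a table lookup, and each view function can be tested without touching os.environ.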

One family of URLs results in presentation of a form. Another family of URLs processes the form input. The form data leads to a resource with internal state. The form content should be used to define a Python class. A separate class should read and write files with these Python objects. The forms should be defined at a high level using a module like WTForms.

When rewriting, I find it helps to keep several things separated:

A class for the individual resource objects.
A form that is one kind of serialization of the resource objects.
An HTML page that is another kind of serialization of the resource objects.

While these things are related very closely, they are not isomorphic to each other. Objects may have implementation details or derived values that should not be trivially shown on a form or HTML page.

In our example, the form only has two fields. These should be properly described in a class. The field objects have different types. The types should also be modeled more strictly, not treated casually as a piece of a file path. (What happens if we use a type name of "this/that"?)

Persistent state change is handled with filesystem updates. These, too, are treated informally, without a class to encapsulate the valid operations, and reject invalid operations.
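Here's a minimal sketch of what those classes might look like in Python 3. The Document fields come from the form. The DocumentStore name, the JSON file format (instead of the legacy repr() text), and the rule for valid names are all assumptions made for illustration.

```python
"""Sketch: a resource class plus a class that owns the filesystem operations."""
import json
import re
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Document:
    fname: str
    lname: str


class DocumentStore:
    """Encapsulates the valid file operations and rejects invalid names."""

    # Assumed rule: names are letters, digits, "_" or "-"; no "/" or "..".
    NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")

    def __init__(self, base):
        self.base = Path(base)

    def _path(self, type_name, name):
        for part in (type_name, name):
            if not self.NAME_PATTERN.fullmatch(part):
                raise ValueError(f"invalid name: {part!r}")
        return self.base / type_name / name

    def save(self, type_name, document):
        """Write a new document; return the generated identifier."""
        name = str(uuid.uuid4())
        path = self._path(type_name, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(document)), encoding="utf-8")
        return name

    def load(self, type_name, name):
        """Read a document back; raises FileNotFoundError if it is missing."""
        text = self._path(type_name, name).read_text(encoding="utf-8")
        return Document(**json.loads(text))
```

With this in place, a type name like "this/that" or ".." is rejected with a ValueError before any file is touched.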

Some Examples

Here is one of the HTML output functions.

def html_post_response(type_name, name, data):
    print "Status: 201 CREATED"
    print "Content-Type: text/html"
    print
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Created New %s</title></head>" % type_name
    print "<body>"
    print "<h1>Created New %s</h1>" % type_name
    print "<p>Path: %s/%s</p>" % (type_name, name)
    print "<p>Content: </p><pre>"
    print data
    print "</pre>"
    # cgi.print_environ()
    print "</body>"
    print "</html>"

There are several functions like this. We aren't wasting any time optimizing all these functions. We're simply segregating them from the rest of the processing. There's a huge amount of redundancy; we'll fix this when we start using Jinja templates.
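To give a sense of how much redundancy a template removes, here's a minimal Python 3 sketch. It uses the standard library's string.Template purely as a stand-in for Jinja, and the render() and post_response() names are hypothetical. The HTML escaping is an addition; the legacy code does none.

```python
"""Sketch: one parameterized page replaces the many print-based functions."""
from html import escape
from string import Template

PAGE = Template(
    "Status: $status\n"
    "Content-Type: text/html\n"
    "\n"
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head><title>$title</title></head>\n"
    "<body>\n"
    "<h1>$title</h1>\n"
    "$body\n"
    "</body>\n"
    "</html>\n"
)


def render(status, title, body):
    """Build a complete CGI response; the title is escaped, the body is not."""
    return PAGE.substitute(status=status, title=escape(title), body=body)


def post_response(type_name, name, data):
    """The 201 page: only the page-specific body is written out here."""
    body = (
        f"<p>Path: {escape(type_name)}/{escape(name)}</p>\n"
        f"<p>Content: </p><pre>\n{escape(repr(data))}\n</pre>"
    )
    return render("201 CREATED", f"Created New {type_name}", body)
```

Each of the other pages would reduce to a status, a title, and a body in the same way.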

Here's the revised main() function.

def main():
    try:
        os.mkdir("data")
    except OSError:
        pass

    path_elements = os.environ["PATH_INFO"].split("/")
    if path_elements[0] == "" and path_elements[1] == "resources":
        if os.environ["REQUEST_METHOD"] == "POST":
            type_name = path_elements[2]
            base = os.path.join("data", type_name)
            try:
                os.mkdir(base)
            except OSError:
                pass
            name = str(uuid.uuid4())
            full_name = os.path.join(base, name)
            data = cgi.parse(sys.stdin)
            output_file = open(full_name, 'w')
            output_file.write(repr(data))
            output_file.write('\n')
            output_file.close()
            html_post_response(type_name, name, data)

        elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
            type_name = path_elements[2]
            html_get_form_response(type_name)

        elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
            type_name = path_elements[2]
            resource_name = path_elements[3]
            full_name = os.path.join("data", type_name, resource_name)
            input_file = open(full_name, 'r')
            content = input_file.read()
            input_file.close()
            html_get_response(type_name, resource_name, content)

        else:
            html_error_403_response(path_elements)
    else:
        html_error_404_response(path_elements)

This has the HTML output fully segregated from the rest of the processing. We can now see the request parsing and the model processing more clearly. This lets us move further and refactor into yet smaller and more focused functions. We can see file system updates and file path creation as part of the underlying model.

These examples are contrived; the processing is essentially a repr() function call. Not too interesting, but the point is to identify this clearly by refactoring the application to expose it.

Summary

When we start to define the classes to properly model the persistent objects and their state, we'll see that there are zero lines of legacy code that we can keep.

Zero lines of legacy code have enduring value.

This is not unusual. Indeed, I think it's remarkably common.

Reworking a CGI application should not be called a "migration."

There is no "migration" of code from Python 2 to Python 3. The Python 2 code is (almost) entirely useless except to explain the use cases.
There is no "migration" of code from CGI to some better framework. Flask (and any of the other web frameworks) are nothing like CGI scripts.

The functionality should be completely rewritten into Python 3 and Flask. The processing concept is preserved. The data is preserved. The code is not preserved.

In some projects, where there are proper classes defined, there may be some code that can be preserved. However, a Python dataclass may do everything a more complex Python2 class definition does with a lot less code. The Python2 code is not sacred. Code should not be preserved because someone thinks it might reduce cost or risk.

The old code is useful for three things.

Define the unit test cases.
Define the integration test cases.
Answer questions about edge cases when writing new code.

This means we won't be using the 2to3 tool to convert any of the code.

It also means the unit test cases are the new definition of the project. These are the single most valuable part of the work. Given test cases that describe the old application, writing the new app using Flask is relatively easy.

Tuesday, September 7, 2021

Found an ancient cgi script -- part II -- testing

See "We have an ancient Python2 CGI script -- what do we do?" The previous post in this series provides an overview of the process of getting rid of legacy code.

Here's some code. I know it's painfully long; the point is to provide a super-specific, very concrete example of what to keep and what to discard. (I've omitted the module docstring and the imports.)

try:
    os.mkdir("data")
except OSError:
    pass

path_elements = os.environ["PATH_INFO"].split("/")
if path_elements[0] == "" and path_elements[1] == "resources":
    if os.environ["REQUEST_METHOD"] == "POST":
        type_name = path_elements[2]
        base = os.path.join("data", type_name)
        try:
            os.mkdir(base)
        except OSError:
            pass
        name = str(uuid.uuid4())
        full_name = os.path.join(base, name)
        data = cgi.parse(sys.stdin)
        output_file = open(full_name, 'w')
        output_file.write(repr(data))
        output_file.write('\n')
        output_file.close()

        print "Status: 201 CREATED"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Created New %s</title></head>" % type_name
        print "<body>"
        print "<h1>Created New %s</h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, name)
        print "<p>Content: </p><pre>"
        print data
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
        type_name = path_elements[2]
        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Query %s</title></head>" % (type_name,)
        print "<body><h1>Create new instance of <tt>%s</tt></h1>" % type_name
        print '<form action="/cgi-bin/example.py/resources/%s" method="POST">' % (type_name,)
        print """
          <label for="fname">First name:</label>
          <input type="text" id="fname" name="fname"><br><br>
          <label for="lname">Last name:</label>
          <input type="text" id="lname" name="lname"><br><br>
          <input type="submit" value="Submit">
        """
        print "</form>"
        # cgi.print_environ()
        print "</body>"
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
        type_name = path_elements[2]
        resource_name = path_elements[3]
        full_name = os.path.join("data", type_name, resource_name)
        input_file = open(full_name, 'r')
        content = input_file.read()
        input_file.close()

        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Document %s -- %s</title></head>" % (type_name, resource_name)
        print "<body><h1>Instance of <tt>%s</tt></h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, resource_name)
        print "<p>Content: </p><pre>"
        print content
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    else:
        print "Status: 403 Forbidden"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Forbidden: %s to %s</title></head>"  % (os.environ["REQUEST_METHOD"], path_elements)
        cgi.print_environ()
        print "</html>"
else:
    print "Status: 404 Not Found"
    print "Content-Type: text/html"
    print                               # blank line, end of headers
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Not Found: %s</title></head>" % (os.environ["PATH_INFO"], )
    print "<h1>Error</h1>"
    print "<b>Resource <tt>%s</tt> not found</b>" % (os.environ["PATH_INFO"], )
    cgi.print_environ()
    print "</html>"

At first glance you might notice (1) there are several resource types located on the URL path, and (2) there are several HTTP methods, also. These features aren't always obvious in a CGI script, and it's one of the reasons why CGI is simply horrible.

It's not clear from this what -- exactly -- the underlying data model is and what processing is done and what parts are merely CGI and HTML overheads.

This is why refactoring this code is absolutely essential to replacing it.

And.

We can't refactor without test cases.

And (bonus).

We can't have test cases without some vague idea of what this thing purports to do.

Let's tackle this in order. Starting with test cases.

Unit Test Cases

We can't unit test this.

As written, it's a top-level script without so much as a single def or class. This style of programming -- while legitimate Python -- is an epic fail when it comes to testing.

Step 1, then, is to refactor a script file into a module with function(s) or class(es) that can be tested.

def main():
    ... the original script ... 

if __name__ == "__main__":  # pragma: no cover
    main()

For proper testability, there can be at most these two lines of code that are not easily tested. These two (and only these two) are marked with a special comment (# pragma: no cover) so the coverage tool can politely ignore the fact that we won't try to test these two lines.

We can now provide os.environ values that look like CGI requests, and exercise this script with concrete unit test cases.
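For comparison, here's how the same idea might look in Python 3, which is where the rewrite is headed. This is a sketch under assumptions: the main() function below is a tiny hypothetical stand-in, not the legacy script, and run_cgi() is a helper name invented for illustration.

```python
"""Sketch: exercise a CGI-style main() by faking os.environ and stdout."""
import contextlib
import io
import os
import unittest
from unittest import mock


def main():
    """Hypothetical stand-in for the refactored script's main()."""
    if os.environ["PATH_INFO"].startswith("/resources"):
        print("Status: 200 OK")
    else:
        print("Status: 404 Not Found")


def run_cgi(path_info, method):
    """Call main() with a fake CGI environment; return what it printed."""
    fake_env = {"PATH_INFO": path_info, "REQUEST_METHOD": method}
    output = io.StringIO()
    with mock.patch.dict(os.environ, fake_env), contextlib.redirect_stdout(output):
        main()
    return output.getvalue()


class TestMain(unittest.TestCase):
    def test_not_found(self):
        first_line = run_cgi("/not/valid", "GET").splitlines()[0]
        self.assertEqual(first_line, "Status: 404 Not Found")

    def test_found(self):
        first_line = run_cgi("/resources/example", "GET").splitlines()[0]
        self.assertEqual(first_line, "Status: 200 OK")
```

Both mock.patch.dict() and redirect_stdout() undo their changes when the with block ends, so no tearDown() is needed to restore os.environ or sys.stdout. Saved as a module, this runs with python -m unittest.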

How many things does it do?

Reading the code is headache-inducing, so, a fall-back plan is to count the number of logic paths. Look at if/elif blocks and count those without thinking too deeply about why the code looks the way it looks.

There appear to be five distinct behaviors. Since there are possibilities of unhandled exceptions, there may be as many as 10 things this will do in production.

This leads to a unit test that looks like the following:

import unittest
import urllib
import example_2
import os
import io
import sys

class MyTestCase(unittest.TestCase):
    def setUp(self):
        self.cwd = os.getcwd()
        try:
            os.mkdir("test_path")
        except OSError:
            pass
        os.chdir("test_path")
        self.output = io.BytesIO()
        sys.stdout = self.output
    def tearDown(self):
        sys.stdout = sys.__stdout__
        sys.stdin = sys.__stdin__
        os.chdir(self.cwd)
    def test_path_1(self):
        """No /resources in path"""
        os.environ["PATH_INFO"] = "/not/valid"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 404 Not Found")
    def test_path_2(self):
        """Path /resources but bad method"""
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 403 Forbidden")
    def test_path_3(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "GET"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 200 OK")
        self.assertIn("<form ", out)
    def test_path_5(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "POST"
        os.environ["CONTENT_TYPE"] = "application/x-www-form-urlencoded"
        content = urllib.urlencode({"field1": "value1", "field2": "value2"})
        form_data = io.BytesIO(content)
        os.environ["CONTENT_LENGTH"] = str(len(content))
        sys.stdin = form_data
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 201 CREATED")
        self.assertIn("'field2': ['value2']", out)
        self.assertIn("'field1': ['value1']", out)


if __name__ == '__main__':
    unittest.main()

Does this have 100% code coverage? I'll leave it to the reader to copy-and-paste, add the coverage run command and look at the output. What else is required?

Integration Test Case

We can (barely) do an integration test on this. It's tricky because we don't want to run Apache httpd (or some other server.) We want to run a small Python script to be sure this works.

This means we need to (1) start a server as a separate process, and (2) use urllib to send requests to that separate process. This isn't too difficult. Right now, it's not obviously required. The test cases above run the entire script from end to end, providing what we think are appropriate mock values. Emphasis on "what we think." To be sure, we'll need to actually fire up a separate process.

As with the unit tests, we need to enumerate all of the expected behaviors.

Unlike the unit tests, there are (generally) fewer edge cases.

It looks like this.

import unittest
import subprocess
import time
import urllib2

class TestExample_2(unittest.TestCase):
    def setUp(self):
        self.proc = subprocess.Popen(
            ["python2.7", "mock_httpd.py"],
            cwd="previous"
        )
        time.sleep(0.25)
    def tearDown(self):
        self.proc.kill()
        time.sleep(0.1)
    def test(self):
        req = urllib2.Request("http://localhost:8000/cgi-bin/example.py/resources/example")
        result = urllib2.urlopen(req)
        self.assertEqual(result.getcode(), 200)
        self.assertEqual(set(result.info().keys()), set(['date', 'status', 'content-type', 'server']))
        content = result.read()
        self.assertEqual(content.splitlines()[0], "<!DOCTYPE html>")
        self.assertIn("<form ", content)

if __name__ == '__main__':
    unittest.main()

This will start a separate process and then make a request from that process. After the request, it kills the subprocess.

We've only covered one of the behaviors. A bunch more test cases are required. They're all going to be reasonably similar to the test() method.

Note the mock_httpd.py script. It's a tiny thing that invokes CGI's.

import CGIHTTPServer
import BaseHTTPServer

server_class = BaseHTTPServer.HTTPServer
handler_class = CGIHTTPServer.CGIHTTPRequestHandler

server_address = ('', 8000)
httpd = server_class(server_address, handler_class)
httpd.serve_forever()

This will run any script file in the cgi-bin directory, acting as a kind of mock for Apache httpd or other CGI servers.

Tests Pass, Now What?

We need to formalize our knowledge with some diagrams. This is a Context diagram in PlantUML. It draws a picture that we can use to discuss what this app does and who actually uses it.

@startuml
actor user
usecase post
usecase query
usecase retrieve
user --> post
user --> query
user --> retrieve

usecase 404_not_found
usecase 403_not_permitted
user --> 404_not_found
user --> 403_not_permitted

retrieve <|-- 404_not_found
@enduml

We can also update the Container diagram. There's an "as-is" version and a "to-be" version.

Here's the as-is diagram of any CGI.

@startuml
interface HTTP

node "web server" {
    component httpd as  "Apache httpd"
    interface cgi
    component app
    component python
    python --> app
    folder data
    app --> data
}

HTTP --> httpd
httpd -> cgi
cgi -> python
@enduml

Here's a to-be diagram of a typical (small) Flask application.

@startuml
interface HTTP

node "web server" {
    component httpd as  "nginx"
    component uwsgi
    interface wsgi
    component python
    component app
    component model
    component flask
    component jinja
    folder data
    folder static
    httpd --> static
    python --> wsgi
    wsgi --> app
    app --> flask
    app --> jinja
    app -> model
    model --> data
}

HTTP --> httpd
httpd -> uwsgi
uwsgi -> python
@enduml

These diagrams can help to clarify how the CGI will be restructured. A complex CGI might have a database or external web services involved. These should be correctly depicted.

The previous post on this subject said we can now refactor this code. The unit tests are required before making any real changes. (Yes, we made one change to promote testability by repackaging a script to be a function.)

We're aiming to start disentangling the HTML and CGI overheads from the application and narrowing our focus onto the useful things it does.

Tuesday, August 31, 2021

We have an ancient Python2 CGI script -- what do we do?

This was a shocking email: the people have a Python 2 CGI script. They needed advice on Python 2 to 3 migration.

Here's my advice on a Python 2 CGI script: Throw It Away.

A great deal of the CGI processing is part of the wsgi module, as well as tools like jinja and flask. This means that the ancient Python 2 CGI script has to be disentangled into two parts.

All the stuff that deals with CGI and HTML. This isn't valuable and must be deleted.
Whatever additional, useful, interesting processing it does for the various user communities.

The second part -- the useful work -- needs to be preserved. The rest is junk.

With web services there are often at least three communities: the "interactive users", "analysts", and the administrators who keep it running. The names vary a lot with the problem domain. The interactive users may further decompose into anonymous visitors, people with privileges to make changes, and administrators to manage the privileges. There may be multiple flavors of analytical work based on the web transactions that are logged. A lot can go on, and each of these communities has a feature set they require.

The idea here is to look at the project as a rewrite where some of the legacy code may be preserved. It's better to proceed as though this is new development with the legacy code providing examples and test cases. If we look at this as new, we'll start with some diagrams to provide a definition of done.

Step One

Understand the user communities. Create a 4C Context Diagram to show who the users are and what they expect. Ideally, it's small with "users" and "administrators." It may turn out to be big with complex privilege rules to segregate users.

It's hard to get this right. Everyone wants the code "converted". But no one really knows all the things the code does. There's a lot of pressure to ignore this step.

This step creates the definition of done. Without this, there's no way to do anything with the CGI code and make sure that the original features still work.

Step Two

Create a 4C Container Diagram showing the Apache HTTPD (or whatever server you're using) that fires the CGI. Document all other ancillary things that are going on. Ideally, there's nothing. Ideally, this is a minor, stand-alone server that no one noticed until today. Label this picture "As Is." It will change, but you need a checklist of what's running right now.

(This should be very quick to produce. If it's not, go back to step one and make sure you really understand the context.)

Step Three

Create a 4C Component Diagram, and label it "As Is". This has all the parts of your code base. Be sure you locate all the things in the local site-packages directory that were added onto Python. Ideally, there isn't much, but -- of course -- there could be dozens of add-on libraries.

You will have several lists. One list has all the things in site-packages. If the PYTHONPATH environment variable is used, a second list has all the things in the directories named in this environment variable. A final list has all the things named in import statements.

These lists should overlap. Of course someone can install a package that's not used, so the site-packages list should be a superset of the import list.
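Building the import list by hand is tedious, so here's a rough sketch of a scanner. It uses a regular expression rather than Python 3's ast module, because ast cannot parse Python 2 print statements. The imported_modules() name is hypothetical, and the pattern is an approximation: it can be fooled by import-like text inside a multi-line string, and it does not follow parenthesized or continued import lines.

```python
"""Sketch: list the top-level modules named in import statements."""
import re

# Matches "import a.b, c" and "from a.b import c" at the start of a line.
IMPORT_LINE = re.compile(
    r"^\s*(?:import\s+(?P<names>[\w.,\s]+)"
    r"|from\s+(?P<module>[\w.]+)\s+import\b)"
)


def imported_modules(source):
    """Return the set of top-level module names imported by the source text."""
    found = set()
    for line in source.splitlines():
        match = IMPORT_LINE.match(line)
        if not match:
            continue
        if match.group("module"):
            names = [match.group("module")]
        else:
            names = match.group("names").split(",")
        for name in names:
            # "os.path as p" -> "os"; a relative "from . import x" yields "".
            top_level = name.strip().split(" ")[0].split(".")[0]
            if top_level:
                found.add(top_level)
    return found
```

Given a set of names gathered from site-packages and the standard library, the difference imported_modules(text) - installed shows any imports that still need to be tracked down.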

This is a checklist of things that must be read (and possibly converted) to build the new features.

Step Four?

You'll need two suites of fully automated tests.

Unit tests for the Python code. This must have 100% code coverage and will not be easy.
Integration tests for the CGI. You will be using the WSGI module instead of Apache HTTPD (or whatever the server was) for this testing. You will NOT integrate with the original web server, because, that interface is no longer supported and is a security nightmare.

Let's break this into two steps.

Step Four

You need automated unit tests. You need to reach 100% code coverage for the unit tests. This is going to be difficult for two reasons. First, the legacy code may not be easy to read or test. Second, Python 2 testing tools are no longer well supported. Many of them still work, but if you encounter problems, the tool will never be fixed.

If you can find a Python 2 version of coverage, and a Python 2 version of pytest, I suggest using this combination to write a test suite, and make sure you have 100% code coverage.

This is a lot of work, and there's no way around it. Without automated testing, there's no way to prove that you're done and the software can be trusted in production.

You will find bugs. Don't fix them now. Log them by marking the test case with the proper answer different from the answer you're getting.
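Here is one way to log a bug in the test suite without fixing it. This sketch uses the standard library's unittest.expectedFailure decorator so it stays self-contained; with pytest, the xfail marker plays the same role. The legacy_total function is a made-up stand-in for legacy code with a known defect.

```python
import unittest


def legacy_total(prices):
    """Hypothetical legacy function with a known bug: it skips the last item."""
    return sum(prices[:-1])


class TestLegacyTotal(unittest.TestCase):
    def test_empty(self):
        self.assertEqual(legacy_total([]), 0)

    @unittest.expectedFailure
    def test_total_includes_last_item(self):
        # The proper answer is 6. The legacy code returns 3.
        # Logged here as a known bug; not fixed yet.
        self.assertEqual(legacy_total([1, 2, 3]), 6)
```

The test records the proper answer, the suite still passes, and the decorator is easy to search for when it's time to fix things.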

Step Five

Python has a built-in CGI server you can use. See https://docs.python.org/3/library/http.server.html#http.server.CGIHTTPRequestHandler for a handler that will provide core CGI features from a Python script allowing you to test without the overhead of Apache httpd or some other server.

You need an integration test suite for each user story in the context you created in Step One. No exceptions. Each User. Each Story. A test to show that it works.

You'll likely want to use the CGIHTTPRequestHandler class in the http.server module to create a test server. You'll then create a pytest fixture that starts the web server before a test and then kills the process after the test. It's very important to use subprocess.Popen() to start and stop the target server to be sure the CGI interface works correctly.
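The start-and-stop part can be sketched as a context manager around subprocess.Popen(). This is a sketch under assumptions: the running_server name, the port, and the fixed startup delay are illustrative choices. The command line shown is the Python 3 form; a Python 2 installation would run its CGIHTTPServer module instead.

```python
import contextlib
import subprocess
import sys
import time


@contextlib.contextmanager
def running_server(command, startup_delay=0.5):
    """Start a server as a separate process and stop it when the block ends."""
    process = subprocess.Popen(command)
    try:
        # Crude wait for the server to start listening.
        # Polling the port until it answers is more robust.
        time.sleep(startup_delay)
        yield process
    finally:
        process.terminate()
        process.wait(timeout=10)


# Assumed command line: serve CGI scripts from ./cgi-bin on port 8000.
CGI_SERVER = [sys.executable, "-m", "http.server", "--cgi", "8000"]
```

A pytest fixture can wrap this by placing a `yield` inside `with running_server(CGI_SERVER):`, so each test gets a fresh server process that is killed afterwards.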

It is common to find bugs. Don't fix them now. Log them by marking the test case with the proper answer different from the answer you're getting.

Step Six

Refactor. Now that you have automated tests to prove the legacy CGI script really works, you need to disentangle the Python code into three distinct components.

A Component to parse the request: the methods, cookies, headers, and URL.
A Component that does useful work. This corresponds to the "model" and "control" part of the MVC design pattern.
A Component that builds the response: the status, headers, and content.

In many CGI scripts, there is often a hopeless jumble of bad code. Because you have tests in Step Four and Step Five, you can refactor and confirm the tests still pass.

If the code is already nicely structured, this step is easy. Don't plan on it being easy.
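As a rough illustration of the target shape, here are the three components as three functions. The names, the dictionary of documents, and the path layout are all hypothetical. It's written in Python 3 syntax for brevity, so the legacy code base would need the Python 2 equivalents at this stage.

```python
import html
from collections import namedtuple
from urllib.parse import parse_qs

Request = namedtuple("Request", ["method", "path", "query"])
Response = namedtuple("Response", ["status", "headers", "body"])


def parse_request(environ):
    """Component 1: turn the CGI environment variables into a Request."""
    return Request(
        method=environ.get("REQUEST_METHOD", "GET"),
        path=environ.get("PATH_INFO", "/"),
        query=parse_qs(environ.get("QUERY_STRING", "")),
    )


def find_document(request, documents):
    """Component 2: the useful work. No HTTP or HTML details belong here."""
    guid = request.path.rstrip("/").split("/")[-1]
    return documents.get(guid)


def build_response(document):
    """Component 3: build the status, headers, and content."""
    headers = [("Content-Type", "text/html")]
    if document is None:
        return Response("404 Not Found", headers, "<p>Not Found</p>")
    body = "<p>{}</p>".format(html.escape(document))
    return Response("200 OK", headers, body)
```

The middle function can be unit tested with plain Python objects. The first and third are the parts that get replaced later.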

One goal is to eventually replace HTML page output creation with Jinja. Similarly, another goal is to eventually replace parsing the request with Flask. All of the remaining CGI-related features get pushed into a WSGI-compatible plug-in to a web server.

The component that does the useful work will have some underlying data model (resources, files, downloads, computations, something) and some control (post, get, different paths, queries.) We'd like to clean this up, too. For now, it can be one module.

After refactoring, you'll have a new working application. You'll have a new top-level CGI script that uses the built-in wsgiref module to do request and response processing. This is temporary, but is required to pass the integration test suite.
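A temporary top-level script might look something like this. It's a minimal sketch that assumes the standard library's wsgiref.handlers.CGIHandler is used to bridge CGI to a WSGI callable. The application function is a stub standing in for the refactored components.

```python
import html
from wsgiref.handlers import CGIHandler


def application(environ, start_response):
    """WSGI callable; a stub where the refactored components would be called."""
    path = environ.get("PATH_INFO", "/")
    body = "<p>Requested {}</p>".format(html.escape(path)).encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def main():
    # CGIHandler reads the CGI environment variables and stdin,
    # then writes the status, headers, and body to stdout.
    CGIHandler().run(application)

# The real CGI script would end by calling main().
```

Because application is an ordinary WSGI callable, the same function can be handed to a WSGI server later without going through CGI at all.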

You may want to create an intermediate Component diagram to describe the new structure of the code.

Step Seven

Write an OpenAPI specification for the revised application. See https://swagger.io/specification/ for more information. Add the path processing so openapi.json (or openapi.yaml) will produce the specification. This means updating unit and integration tests to add this feature.
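The path processing for the specification can be quite small. This sketch assumes the specification is held as a Python dictionary and that the response-building component works with (status, headers, body) triples. The openapi_response name is hypothetical, and the paths section is left empty here.

```python
import json

# Abbreviated specification; the real "paths" section goes here.
SPEC = {
    "openapi": "3.0.1",
    "info": {"title": "CGI Conversion", "version": "1.0.0"},
    "paths": {},
}


def openapi_response(path, spec=SPEC):
    """Return (status, headers, body) for /openapi.json, or None for other paths."""
    if path != "/openapi.json":
        return None
    body = json.dumps(spec, indent=2).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    return "200 OK", headers, body
```

Returning None for other paths lets the existing dispatch logic carry on unchanged. The new unit and integration tests only need to cover this one path.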

While this is new development, it is absolutely essential for building any kind of web service. It will implement the Context diagram, and most of the Container diagram. It will describe significant portions of the Component diagram, also. It is not optional. It's very likely this was not part of the legacy application.

Some of the document structures described in the OpenAPI specification will be based on the data model and control components factored out of the legacy code. It's essential to get these details right in the OpenAPI specification and the unit tests.

This may expose problems in the CGI's legacy behavior. Don't fix it now. Instead, document the features that don't fit with modern APIs. Don't be afraid to use # TODO comments to show what should be fixed.

Step Eight

Use the 2to3 tool to convert ONLY the model and control components. Do not convert request parsing and response processing components; they will be discarded. This may involve additional redesign and rewrites depending on how bad the old code was.

Convert the unit tests for ONLY the model and control components.

Get the unit tests for the model and control to work in Python 3. This is the foundation for the new web site. Update the C4 container, component, and code diagrams. Since there's no request handling or HTML processing, don't worry about code coverage for the project as a whole. Only get the model and control to have 100% coverage.

Do not start writing view functions or HTML templates until the underlying model and control module works. This is the foundation of the application. It is not tied to HTTP, but must exist and be tested independently.

Step Nine

Using Flask as a framework and the OpenAPI specification for the web application, build the view functions to exercise all the features of the application. Build Jinja templates for the HTML output. Use proper cookie management from Flask, discarding any legacy cookie management from the CGI. Use proper header parsing rules in Flask, discarding any legacy header processing.

Rewrite the remaining unit tests manually. These unit tests will now use the Flask test client. The goal is to get back to 100% code coverage.

Update the C4 container, component, and code diagrams.

Step Ten

There are an untold number of ways to deploy a Flask application. Pick something simple and secure. Do some test deployments to be sure you understand how this works. As one example, you can continue to use Apache httpd. As another example, some people prefer Gunicorn, others prefer to use NGINX. There's lots of advice in the Flask project on ways to deploy Flask applications.

Do not reuse the Apache httpd and CGI interface. This was terrible.

Step Eleven

Create a pyproject.toml file that includes a tox section so that you have a fully-automated integration capability. You can automate the CI/CD pipeline. Once the new app is in production, you can archive the old code and never use it again for anything. Ever.
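As a starting point, the tox section might look like the following fragment. This is a sketch under assumptions: it uses tox's legacy_tox_ini key to embed INI-style settings in pyproject.toml, the py310 environment name is a placeholder for whatever Python 3 version you deploy on, and it assumes pytest and coverage are the test tools.

```toml
[tool.tox]
legacy_tox_ini = """
[tox]
envlist = py310

[testenv]
deps =
    pytest
    coverage
commands =
    coverage run -m pytest
    coverage report --fail-under=100
"""
```

The --fail-under=100 option makes the run fail when coverage slips below the 100% target, so the pipeline enforces it.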

Step Twelve

Fix the bugs you found in Steps Four, Five, and Seven. You will be creating a new release with new, improved features.

tl;dr

This is a lot of work. There's no real alternative. CGI scripts need a lot of rework.

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.