
Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy site, and it will likely be dropped five years after the last post in Jan 2023.

Showing posts with label database design.

Tuesday, June 12, 2018

Coping with a Spreadsheet Database

A common way to save persistent, important data is a spreadsheet. It provides a handy, potentially normalized store that's readily accessible with minimal tooling. It has a UI usable by people with a spectrum of skills.

Sadly.

There's a core conflict:
  • The advantages of spreadsheets-as-database are numerous. 
  • The disadvantage is the lack of any strict, formal control over the schema.
At the very best, the steward of the data has some discipline: they include column headers and assure the headers are used consistently throughout the rows of data.

It goes downhill rapidly from that ideal.

Let's look at some scenarios. And. How to cope. And. Python to the Rescue.

Outliers, Special Cases, Anomalies, and other Irregularities

The whole point of a "normalized" view of the data is to identify a pattern, assign the lofty title of "Schema" to the pattern, and assure all of the data fits the schema. In rare cases, all of the data fits a simple schema. These cases are so rare they only exist in examples of SQL code in tutorials.

A far more common case is to have several subtypes which are so similar that optional attributes (or "nullable columns" in SQL parlance) allow one schema description to encompass all of the cases. If you're a JSON Schema person, this is the "oneOf" or "anyOf" type definition.

Some folks will try to argue that optional attributes don't always mean that there are several subtypes. They'll ramble on for a while and eventually land on "state change" as a reason for optional attributes. The distinct states are distinct subtypes. Read up on the State design pattern for OO programming. Optional attributes are the definition of a subtype.

The hoped-for simple case is a superclass extended by subclasses used to add new attributes. In this case, they're all polymorphic with respect to the superclass. In a spreadsheet page, the column names reflect the union of all of the various attributes. There are two minor variants in the way people use this:

  • An attribute value is a discriminator among the subtypes. We like this in SQL processing because it's fast. It also allows for some validation of the discriminator value and the pattern of attributes present vs. attributes omitted. Of course, the pattern of empty cells may disagree with the discriminator value provided.
  • The pattern of attributes provided versus omitted is used to identify the subtype. This is a more reliable way to detect subtypes. There can, of course, be problems here with values provided accidentally, or omitted accidentally.
The less desirable case is disjoint classes with few common attributes. Worse, the common attributes are not part of the problem domain, but are things that feel databasey, like made-up surrogate keys. There's an "ID" in column A or some other such implementation detail. Some of the rows use column A and columns B to G. The other rows use column A and columns H to L. The only common attributes are the surrogate keys, perhaps mixed with foreign key references to rows in other spreadsheet tables or pages.

This is a collection of disjoint types, slapped together for no good reason. SQL folks like to call it "multiple master-detail relationships". The master record has children of multiple types. In some cases, the only thing the children have in common is the foreign key relationship with the parent. If you want a concrete example, think of customer contact information: multiple email addresses, multiple phone numbers. The two contacts have nothing in common except belonging to one customer. 

These don't belong in a single spreadsheet table. But. There they are. Our code must disentangle the subtypes.
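
A minimal sketch of the second approach -- using the pattern of populated cells to pick the subtype. The column groups and subtype names here are hypothetical placeholders, not the actual spreadsheet headings:

def classify(row: dict) -> str:
    """Pick a subtype from the pattern of populated columns.

    EMAIL_COLUMNS and PHONE_COLUMNS stand in for whatever headings
    the real spreadsheet uses for columns B-G and H-L.
    """
    EMAIL_COLUMNS = ("address", "verified")
    PHONE_COLUMNS = ("number", "extension")
    has_email = any(row.get(name) for name in EMAIL_COLUMNS)
    has_phone = any(row.get(name) for name in PHONE_COLUMNS)
    if has_email and not has_phone:
        return "email_contact"
    if has_phone and not has_email:
        return "phone_contact"
    return "unknown"   # candidate for an exception report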

Arrays

A lot of spreadsheet data is a two-dimensional grid. Budgets, for example, might have categories down the page and months across the page. 

This is handy for visualization. But. It's not the right way to process the data at all. 

This extends, of course, to higher orders. Each tab of a spreadsheet may be a dimension of visualization. There may be groups of tabs with a complex naming convention to include multiple dimensions into tab names. Rows may have multiple-part names, or use bullets and indentation to show a hierarchy.

All of these techniques are ways to provide a number of dimensions around a fact that's crammed into a cell. The budget amount is the fact. The category and the month information are the two dimensions of that cell. In many cases, Star-Schema techniques are helpful for understanding the underlying data, separate from the visualization as a spreadsheet.

Our code must disentangle the dimensions of the meaningful facts. 
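
A minimal sketch of that disentangling for the budget example, assuming the grid is exported as CSV with a "Category" column and one column per month (the headings are placeholders):

import csv

MONTHS = ["Jan", "Feb", "Mar"]   # ...and so on; assumed column headings

def budget_facts(path):
    """Flatten the category-by-month grid into (category, month, amount) facts."""
    with open(path, newline="") as source:
        for row in csv.DictReader(source):
            for month in MONTHS:
                yield row["Category"], month, float(row[month])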

Normalization

There are tiers of normalization. The normalization described above is part of First Normal Form (1NF): all rows are the same and all data items are atomic. Pragmatically, it's rare that all spreadsheet rows are the same, because it's common to bundle multiple subtypes into a single table.

Sidebar Rant. Yes, the presence of nullable columns in a SQL table *is* a normalization error. There, I said it. Error. We can always partition the rows of a table into a number of separate tables; in each of those tables, all columns are required. We can rebuild the original table (with optional fields) via a union of the various decompositions (none of which have optional fields). The SQL folks prefer nullable columns and 1NF violations over unions and 1NF absolutism. I'm a fan of 1NF absolutism to understand each and every nullable attribute because casual abuse of nulls is a common design error.

The other part of 1NF is that each value is atomic: there's no internal structure to the value. In manually-prepared spreadsheet data, this is difficult to insist on. Stuff gets combined into a single cell because -- well -- it seemed helpful to the people entering it. They put all the lines of an address into a single cell because they like to see it that way.

Third Normal Form (3NF) forbids derived data (and transitive dependencies). In a spreadsheet, we might have a row-level computation. It helps the person confirm the data is correct. It's not "essential". It breaks the 3NF rule because the computed attribute depends on other field values; a change to one attribute will also change the derived attribute.

When we first encounter spreadsheet data, this isn't always obvious. In some cases, the derived data is computed "off-line" -- i.e., manually -- and entered into the spreadsheet. Really. People pull up a calculator app (or whip out their phone), compute a value, and type it in. In other cases, they look something up manually and enter it.

These kinds of data entry weirdnesses require code to normalize the manually-prepared data. We'll have to decompose non-atomic fields. And we'll have to handle derived data gracefully. (Reject it? Fix it? Warn them about it? Handle it as an exception?)
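
Two tiny sketches of that normalization code -- one to decompose a non-atomic address cell, one to check a manually entered derived value. The column names are hypothetical:

def split_address(cell: str) -> list:
    """Decompose an all-the-lines-in-one-cell address into its separate lines."""
    return [line.strip() for line in cell.splitlines() if line.strip()]

def check_total(row: dict) -> list:
    """Report rows where a hand-computed total disagrees with the recomputed value."""
    computed = row["quantity"] * row["unit_price"]
    if abs(computed - row["total"]) > 0.005:
        return [f"total {row['total']} != computed {computed}"]
    return []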

Relationships

Let's talk about Second Normal Form (2NF). We really want to have a row in a table represent a single thing. The SQL folks require all of the attributes to be dependent on the whole of the row's key. In spreadsheet world, we may have a jumble of attributes with a jumble of dependencies. We may have multiple relationships in a single row. Look at the Second Normal Form page on Wikipedia for examples of multiple relationships mashed together into a single row.

When a spreadsheet has 2NF problems, there will be situations where some collection of attributes is repeated -- verbatim -- in multiple places. The most common example in US-based data is City-State-ZIP Code. These three *always* form a consistent triple of data, and should be repeated as part of an address. In SQL terms, City and State have a functional dependency on the ZIP Code. In an Object-Oriented database, we might have a separate City-State-Zip class definition. In a document datastore, we might combine these items into a sub-document.

In any 2NF problem area, we're forced to write code which normalizes this internal relationship.

And. When we do that we'll find the kinds of problems we find with derived data: The ZIP code 22102 might be McLean or Tysons Corner. One of them is "right" and the other is "wrong." Or perhaps there needs to be an exception to handle this. Or perhaps a correction applied to coerce the wrong values to be right.
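
A minimal sketch of detecting that 2NF problem, assuming "ZIP", "City", and "State" headings:

from collections import defaultdict

def zip_conflicts(rows):
    """Map each ZIP to the City-State pairs seen with it; more than one pair
    means an exception to report or a correction to apply."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["ZIP"]].add((row["City"], row["State"]))
    return {zip_code: pairs for zip_code, pairs in seen.items() if len(pairs) > 1}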

The "Association" Table

There's a SQL design pattern called an association table. This is used to handle a many-to-many relationship between two entities. Consider Boats and Owners. A boat will have multiple owners. An owner may have multiple boats. In SQL world, this requires a special table with two foreign keys. In the degenerate case, there are no other attributes. In the boat-owner relationship case, however, there's often a range of dates that specifies when an owner was associated with a boat. The range of dates applies to the relationship itself, not to boat nor to owner.

In a spreadsheet there are numerous ways to represent this. Numerous. A list of boat rows after each owner.  A list of owner rows after each boat. A number of owner columns for each boat.  A block of text with a list of owner names in a single cell. Creative people will create many creative solutions to this data representation problem.

Note that the association table is a SQL hack. It's an implementation detail, not an essential feature of the problem domain. In Python, for example, we'll need to use weakref objects to handle this cleanly. 

When Owner O1 refers to Vessel V1 it's easy to have a list of vessel references under the owner. When the Owner O1 object is no longer needed, it can be removed from memory. This decrements the reference count for Vessel V1 to zero, and it will be removed from memory, too.

When we have mutual references, we have a problem, solved by weakrefs.

If Owner O1 refers to Vessel V1 and we also have Vessel V1 referring to Owner O1, we have mutual references. O1 has a list that includes V1. V1 also has a list that includes O1. This means there are two strong references to O1: some variable, owner, and the reference from Vessel V1. When the variable owner is no longer needed, the reference count for O1 is decremented from two to one. And the object can't be deleted yet.

If V1 has a weak reference to O1, then the strong reference count -- based on the variable owner -- is only one. The weak reference from V1 doesn't count for memory management purposes. O1 can be removed from memory, references to V1 will be decremented, and it, too, can be removed.
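
Here's a minimal sketch of that weakref technique. The class and attribute names are illustrative, not the application's actual model:

import weakref

class Vessel:
    def __init__(self, name):
        self.name = name
        self.owners = []            # weak references back to the owners

class Owner:
    def __init__(self, name):
        self.name = name
        self.vessels = []           # strong references: the owner keeps vessels alive

def associate(owner, vessel):
    owner.vessels.append(vessel)
    vessel.owners.append(weakref.ref(owner))   # does not add a strong reference

o1 = Owner("O1")
v1 = Vessel("V1")
associate(o1, v1)
print(v1.owners[0]())    # the Owner object, while o1 is alive
del o1
print(v1.owners[0]())    # None -- the weak reference did not keep O1 alive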

Our code will have to parse and populate the relationships. And we'll need to use weakref to be sure we can cleanly remove objects.

Coping Strategies

As noted above, we have to cope with manually-prepared spreadsheet data. It looks like this:
  1. Figure out what the likely data structure is. This isn't simple. We'll look at Pythonic techniques below. When starting, it helps to draw UML class diagrams (or ER diagrams) over and over again to try and depict the data. I'm a fan of using https://yuml.me to draw the pictures because they have a super-handy text notation for the relationships and attributes.
  2. Leverage the Extract-Transform-Load design pattern.

    • The "extract" reads the source spreadsheet data. A first version will be a trivial use of the xlrd or csv module. Or any of the modules listed here: http://www.python-excel.org
    • The "transform" should be implemented as a function to transform source to the target model. Pragmatically, this single function will leverage a number of other functions to validate, cleanse, convert, and normalize the data.
    • The "load" may not be anything more than creating instances of the underlying model classes. In some cases, the instances of the model classes may wind up in an in-memory dictionary. In other cases, the "load" might be a simple use of pickle or shelve to persist the useful data.

  3. Separate Model, ETL, and "Real Work" from each other. The model should evolve very slowly. It's the essential problem we're solving. The ETL may vary with each major revision to the spreadsheet database. Users add columns, they change meanings, their understanding evolves. The final work is based on the model -- and only the model -- ignoring the vagaries of ETL.
  4. Plan for change. Each manually-prepared spreadsheet is a unique snowflake, precious and distinct. This leads to an important lesson based on the Open/Closed Principle: Code Must Be Closed To Modification and Open To Extension. Each version of the source data means adding new functions or classes to cope with each bizarre new spreadsheet issue. When the source data changes, don't modify any old code; Always Be Adding. This means planning for multiple versions of functions: validate_1(), validate_2(), validate_3(). It's essential to be able to process *all* old versions of the data and get meaningful, useful results for regression testing.

Python To The Rescue

Data modeling must be done slowly and reluctantly. Don't overfit the model to the first spreadsheet.

Here's the place to start

from types import SimpleNamespace
class Model(SimpleNamespace):
    pass

This is *enough* modeling to get started. Don't over-engineer the model. We can then do things like this.

class Owner(Model):
    pass

This defines the class Owner as a subclass of the generic Model class. The SimpleNamespace allows us to have any attributes we think we need.

owner = Owner(vessel=some_id, name=row['name'])

We can leverage the SimpleNamespace to build useful objects with minimal code. This can be replaced with a typing.NamedTuple or a @dataclass class definition when the definition is more mature.
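
For example, once the attributes settle down, the Owner model might graduate to a @dataclass along these lines. This is a sketch; the field names follow the transform example shown later:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Owner:
    last_name: str
    first_name: str
    display_name: Optional[str] = None
    website: Optional[str] = None
    owner_vessel: List["Vessel"] = field(default_factory=list)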

The "extract" code needs to gather row-like objects. Ideally, this is a generator function. Because normalization and dereferencing may require multiple passes through the data, a list can be slightly easier to deal with. We'll come back to normalization and dereferencing below.

For some background in the classes used here, see https://sourceforge.net/projects/stingrayreader/. (Yes, this is old; I'm thinking of moving it to GitHub and updating it to Python 3.7.)

def load_live_rows(workbook, sheet_name):
    sheet1 = sheet.EmbeddedSchemaSheet(workbook, sheet_name, schema.loader.HeadingRowSchemaLoader)
    dict_rows = sheet1.schema.rows_as_dict_iter(sheet1)
    clean_data = filter(lambda row: not row['Hull No.'].is_empty(), dict_rows)
    initial_data = take_until(lambda row: row['Hull No.'].to_str() == 'Definitely WB Owners:', clean_data)
    return list(initial_data)

Step-by-step.
  1. We're working with a sheet that has the schema embedded in it. That means using the heading rows as column information. The HeadingRowSchemaLoader will be grabbing the first few rows from the EmbeddedSchemaSheet. Sometimes we need more complex loaders to read multiple rows. If the schema is separate from the sheet, then the loader doesn't interact with the source of data. 
  2. Each row is modeled as a simple dictionary in this example code.
  3. A filter locates rows that have hull numbers. Other rows are quietly discarded.
  4. The take_until() function reads rows until the matching row is found, then stops. This chops off the bottom of the spreadsheet where manual notes were kept.
The resulting list of rows can be validated, cleansed, and normalized to create the useful instances of the various Model subclasses.
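
The take_until() function isn't a built-in; here's a minimal sketch consistent with the way it's used above:

from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def take_until(stop: Callable[[T], bool], rows: Iterable[T]) -> Iterator[T]:
    """Yield rows until the stop condition matches; the matching row is not yielded."""
    for row in rows:
        if stop(row):
            return
        yield row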

Here's the "transform" portion.

def make_owner_1(row: Dict[str, Cell]) -> Owner:
    return Owner(
        last_name=null_strip(row["Owner's Last Name"].to_str()),
        first_name=null_strip(row["Owner's First Name"].to_str()),
        display_name=null_strip(row["Display Name"].to_str()),
        website=null_strip(row["Website"].to_str()),
        owner_vessel=[],
    )

We've built an instance of the Owner subclass of Model by extracting a number of attributes from the row. There are other columns not extracted; they are part of various normalizations and dereferencing.

The owner_vessel attribute is a parent-child relationship that can't be trivially populated from the row. The SQL folks would include a foreign key in each child that refers to the parent. The vessel page of the spreadsheet has this information, and it's used to populate the owner's details. This is one of the dereferencing activities that needs to be done as part of "loading".

The to_str() method is a feature of the Stingray Reader's cell definitions. Conversion methods like this are not typical of idiomatic Python code. If we were only creating built-in str, float, or int, the bunch of conversion methods would be A Bad Idea. To be useful, we also need to create Decimal objects, and that leads us to embracing a grid of conversion methods for each cell source to desired resulting objects. We could use Decimal(str(cell)), but it seems cleaner to use cell.to_decimal().

Multiple Passes

We often touch the source more than once.
  1. There's a "validate and load" pass to get rows that are sensible to process. A generator might make sense here. 
  2. There may be a "cleanse and convert" pass to reformat the source data, perhaps parsing complex cells into components or combining multiple source rows into a single entity description. This, too, might involve a generator to restructure the spreadsheet rows into something sensible.
  3. There will be multiple "normalization" passes. Any 2NF relationships need to be extracted to create model objects. Any restructuring of complex dimensions should be handled via restructuring source data from grid to rows, or from multiple sheets to a single, long, sequence of rows with the various dimensions as explicit attributes of each row.
  4. There may be multiple "load" passes to build final objects from the source rows. This will often lead to including the built objects as part of the source data.
  5. There will be some final "dereferencing" passes where foreign key relationships are turned into proper references among the objects. These should be weakref references to permit proper garbage collection.
At this point, the application will have tidy collections of Python objects that can be used for the real work.

What's essential is finding a balance between end-user visualization of the data in a spreadsheet and schema validation in Python. It's often helpful to be flexible when trying to automate processing of complex, irregular, manually gathered data.

Letting candidate users work with spreadsheets lowers the barrier to automation.

Coping with irregularity gets the process started.

As the work matures, some schema controls will tend to evolve. People tend to recognize the cost and complexity of irregular data. They will try to identify the patterns and impose some order on those patterns. As they uncover patterns in the data, the "schema" will evolve. This is a good thing, and Python lets this proceed at a human pace.

We can -- easily -- create flexible tools that let people understand and organize their data.

Tuesday, January 30, 2018

The SQL-based relational database isn't perfection? Whoa if true

Yes, there are people for whom document databases (and the file system) are confusing and weird.

I was sent this: Relational Algebra Is the Root of SQL Problems, which is really brilliant and provides some helpful concrete examples of stuff SQL is really bad at.

The accompanying email was filled with nonsense about how important and world-changing SQL was.

I can't disagree. Back when disk was very expensive and very small, the SQL-based join strategies were essential for micro-managing every bit of data. Literally. Every Bit.

And then we would denormalize the structure for performance reasons. Because we always knew the SQL was terrible at a fairly large number of things.

Those days are behind us. We can now choose to use a document database, and make our lives simpler. Storage is relatively inexpensive, and the labor to normalize and denormalize data doesn't create significant value. The need to write stored procedures to turn a single conceptual operation into a bunch of inserts and updates was a symptom that this wasn't the best approach.

I've had many "But what about..." conversations regarding document databases.

"What about ad-hoc queries in SQL?"

- Do you really do these without writing a Python script or creating a Pandas dataframe? I doubt it. But. If you really think you'll do this, most document stores support either a modified SQL or Javascript. And yes, you hate Javascript, duly noted. I hate SQL, so we're even there.

"What about joins?"

- It's a space-saving technique. We don't need the overheads to save the space. The "update anomalies" still require careful design, and may lead to some decomposition of data into multiple documents. But the ruthless normalization shouldn't be seen as a requirement.

"What about the schema?"

- It's brittle and schema migration creates a lot of low-value labor. We can use Python JSONSchema to validate documents. See NoSQL Database doesn't Mean No Schema.
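
A minimal sketch of that kind of validation with the jsonschema package; the schema itself is purely illustrative:

from jsonschema import validate, ValidationError

CUSTOMER_SCHEMA = {
    "type": "object",
    "required": ["name", "contacts"],
    "properties": {
        "name": {"type": "string"},
        "contacts": {
            "type": "array",
            "items": {"type": "object", "required": ["kind", "value"]},
        },
    },
}

def is_valid(document: dict) -> bool:
    """Validate one document against the schema; report pass/fail."""
    try:
        validate(instance=document, schema=CUSTOMER_SCHEMA)
        return True
    except ValidationError:
        return False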

Transactional v. Analytical

It requires some care to understand the distinction between "transactional" and "analytical" uses for data. While folks try to leverage this distinction, it's a spectrum, not a sharp distinction.

A lot of data collection is a simple sequence of event documents. These have no sensible state change, so they're not really transactional. They are often created by concurrent processes where locking prevents corruption, so transactions *seem* helpful. Except, of course, the file system writes can be trivially sharded by process ID and then unified later. And all document databases serialize document writes from multiple client processes, so there's no added value in using a relational database here.

Some data operations are properly stateful. By normalizing our tables, moving from consistent state to consistent state is made complex. Which requires a defined transaction as a work-around. And don't get me started on replication and two-phase commit as yet another layer of complexity on top of transactions.

A document database allows us to skip over 1NF. We can think of a document as being a row in a table where the data types are complex data structures involving mappings, sequences, strings, numbers, booleans, and nulls. (See JSON Schema.) A lot of multi-step SQL transactions are operations on several children of a common parent. If the parent was persisted as a single document, there wouldn't be multiple operations; an atomic MongoDB update operation can make complex rewrites to a complex document.

We can contrive a design where state changes must be coordinated and the data cannot be colocated in a single document. It's not difficult to stipulate enough requirements to make single documents difficult. The presence of these contrived requirements, however, doesn't suddenly invalidate document datastores for transactional data. In the SQL world, the idea of long-running, reversible transactions has always been a horrible problem. Allowing stacked "undo" for the user means either creating a chain of Memento objects that can recover previous state, or having numerous flags and indicators on each record, allowing the state to be reversed. Some design problems are really hard. And the SQL model seems to make them harder.

The core ACID concept of "always consistent" is -- in practice -- nonsense. As soon as we have to consider "isolation levels" and "read consistency" it becomes clear that there is no consistent state unless all transactions and queries are serialized via exclusive "whole database" locking. Competent DBA's know that long-running analytic queries performed concurrently with transactional updates can't use locking, and must tolerate inconsistencies in the database.

It's common practice to do data extracts so that analytic queries aren't working against the (inconsistent) transactional data. In this case, the frequency of extracts is the timing of "eventual consistency" promised by the BASE concept.

Bottom Line: Relational ACID rules are almost always broken in practice by read consistency rules and extracts to analytic databases. Analytical data is always based on eventual consistency expectations. The batch extracts mean "eventually" is measured in hours. A document data store can often create consistency in milliseconds. (MongoDB primary failure, voting, and secondary promotion to primary rely on a 10-second heartbeat, so it takes time to discover and repair.)

Also

A second email detailed their amazement (Amazing! Wow! Unbelievable! You Must Inform The World Of This!) that analytic processing of data is actually faster and simpler using the file system. The very idea of HDFS was so amazing that they were amazed.

Somehow, the idea of the raw filesystem as being really, really fast was the source of much amazement.

I'm glad they're making an effort to catch up. I'm glad they're seeing the relational model as a bad choice that has a limited number of use cases. Mostly, relational databases are useful for an organization that can't write APIs to handle the integrity issues.

To SQL or NoSQL? That's the database question | Ars Technica

Tuesday, June 6, 2017

An Epic Fail Example

What's the most Epic Fail I've ever seen?

I was a traveling consultant for almost 35 years. I saw a lot. I did learn from epic fail scenarios. But. I haven't really spent a lot of time thinking about the lessons learned there. I never have a glib answer to this question. Mostly because the stories are incomplete: I came in during an awful mess and left and it was still an awful mess. No arc. No third act. No punchline.

These aren't really stories as much as they're vignettes -- just sad fragments of some larger tragedy. Consequently, they don't leap to front of mind quickly.

One example is a smallish company that had built some pretty cool software in MS-Access. They had created something that was narrowly focused on a business problem and they were clever, so it worked. And worked well.

They leveraged this success, solving another major business problem. In MS-Access. Clever. Focused on real user's real needs. It doesn't get any better than that.

Well, of course, it does get better than that.

They replicated their success seven times. Seven interlocking MS-Access databases. They had subsumed essentially all of the company's information and data processing. Really. It's not that hard to do. Companies buy General Ledger software that doesn't really do very much. You can write a perfectly serviceable ledger application yourself. (Many people ask "why bother?")

When I talked with them they had finally been swamped by the inevitable scalability problem. They had done all the hackarounds they could do. Their network of MS-Access servers and interlocked cluster of databases had reached its limit of growth.

Unsurprisingly.

Questions I did not ask at the time: Who let this happen? Who closed their eyes to the scalability problem and let this go forward? Who avoided the idea of contingency planning? How do you back this up and restore it to a consistent state?

They were in a world of trouble. I told them what they had to do and never saw them again. End of vignette.

(In case you want to know... I told them to get a real server, install SQL-Server, and migrate each individual MS-Access table to that central SQL-Server database, replacing the Access table with an ODBC connection to the central DB. This would take months. Once every single database was expunged from MS-Access, they could start to look at a web-based front-end to replace the Access front-end.)

There are others. I'll have to ransack my brain to see if I've got other examples.

Tuesday, September 27, 2016

Database Schema Migration

Some thoughts: http://workingwithdevs.com/delivering-databases-migrations-vs-state/

This covers a lot of ground on the Declarative vs. Procedural question. It explains a lot of the considerations that lead to choosing a procedural schema evolution vs. a declarative schema with an implied change sequence to migrate to each new declared state.

The article calls the declarative "state-based" and procedural approach "migration-based".

My 2¢ are focused on this point: 
When using a state-based solution you will most often be using a diff tool like those provided by Redgate or Visual Studio to examine the differences and generate an upgrade script. While this is a very efficient solution for most changes, with table renames and a few other types of table refactoring they can do bad things, ...
This point about table refactoring is, for me, the show-stopper. Relational theory tells me that I can map any schema to any other schema using selection, projection, and join. I can denormalize data and I can normalize again via group-by clauses. I can reduce the original schema to a sequence of object-attribute-value triples, and restructure this into any desired new schema. 

Given enough time, a change tracking tool should be able to find a minimal-cost transformation from schema to schema. This might involve a complex search over a large state space, and it certainly involves creating costs for each alternative query plan. 

Pragmatically, I'm not sold on this being a good idea. And I'm rarely sure I even want to get involved in a fully automated solution. While a tool might be able to detect and automate a variety of simple changes, I think that developers must always vet those change scripts.

In particular, the search space is emphatically not limited to select, project, and join. There are also database unload-reload, index create and drop. There are even more complex operations like creating intermediate results which aren't part of the final database structure. With proper indices, these might actually be beneficial.

In some cases, the continuous operation requirements are such that we might have two copies of a database: one being used and the other being transformed. A logger tracks transactions in the older copy and a synchronizer replicates those transactions in the new copy. After the data is moved, the customer access is moved via a feature toggle from the old database to the new database.

Semantic Drift

Also important is the issue of semantic drift. When we're making structural changes where the "before" column names match the "after" column names, then there's little chance for semantic drift. There's still some possibility, though. We can (and sometimes do) repurpose columns, preserving the original name. In some cases, we might change a database constraint without renaming the column.

In the larger case, of course, it doesn't require "‘hot-fix’ changes to QA or even production databases" to create profound semantic changes. All it takes is an app developer deciding that a column should be repurposed. There may be no structural change on the schema overall.

A non-structural change in some past release could have implications for structural change in a future release. Imagine three columns in three tables with the same names. Two started out life as simple foreign keys to the third. But one became optional, and now the semantics don't match but the names do. Automated tools are unlikely to discern the intent here. 

Conclusion?

It's all procedural migration. I'm not convinced declarative ("state") tools can be trusted beyond discerning the changes and suggesting a possible migration.

Tuesday, June 14, 2016

Continuous Data Migration

See http://slott-softwarearchitect.blogspot.com/2013/07/database-conversion-or-schema-migration.html

People talk about CI/CD (Continuous Integration/Continuous Deployment).

They also need to talk about CM (Continuous Migration).

"Wait, what?" you ask.

When we roll out a new version of the software (CD) there are three common situations.
  1. The new software uses the existing data model with no changes. This is a "minor version change": from v3.2 to v3.3.
  2. The new software requires a tweak to the schema, but it's backward compatible. This, too, is a minor version change. In a SQL context, we might have used an ALTER TABLE to add a nullable column. If there are no SELECT * statements in the code, this change is essentially transparent to legacy code.
  3. The new software involves a new schema that's not backwards compatible. This is a major version change. From v3.2 to v4.0. This is difficult. Really difficult.
Clearly, the first two can be done with the data in place. New software is installed, the servers are restarted, and away we go. In a big environment, there may be a rolling deployment. There may be a canary release that will get converted first, then others will be brought online.

A change of the Second Kind does involve the one-time database transformation script. This may lead to some down-time. Or it may lead to a feature toggle so that the new software can work with the old database until the script is run.

In a NoSQL context, a change of the Second Kind doesn't require the one-time script. The new documents have new fields that old documents don't have. NoSQL apps -- in general -- must be able to cope with data model variations.

A change of the Third Kind is trouble.

Big trouble.

We have two schema: the v3 schema and the v4 schema. We have two sets of software: the v3.2 release and the v4.0 release. We'd like to have just one valid set of data. How do we deal with this?

How can we do schema migration badly?

We can't easily have a single software release that includes one set of data in both schema. It's technically possible. Anything that doesn't involve time travel, anti-gravity or perpetual motion is technically possible. But it rapidly becomes so complex that we have to set this uber-version idea aside.

We have to do more deployment work to have both v3.2 and v4.0 installed in parallel. v3.2 will use data in the v3 schema, v4.0 will use data in the new schema.

How do we migrate the data from the old schema to the new schema?

This can be tricky. There are proven bad ideas out there. Really epically bad ideas.

One Very Bad Idea (VBI™) is the one-time-only data migration. Back in the olden days, we couldn't afford enough storage to have two copies of the database. Seriously. When a company owned exactly one computer (before PC's -- a Very Long Time Ago) the conversion had to be done by making special backups and restoring the backups into the new schema.

This VBI is still with us today. Lots of places want to do one-time-only data migrations because it's the traditional approach. If they can't do a one-time conversion (over a long weekend) they complain. Loudly.

BTW. This never worked well. The one-time-only conversion software was never tested carefully, and therefore rarely worked the one time it was needed. Also, data profiling was never done, so edge and corner cases were found during conversion. These often called the new software's features into question, leading to larger and larger problems.

Continuous Migration

The ideas behind continuous migration are these.
  1. We're always going to be migrating the data. Always.
  2. Storage is cheaper than labor. When in doubt, buy more storage.
  3. Data outside the database (in CSV files or YAML documents) is smaller than data inside the database. Don't be afraid to export.
  4. Data outside the database is inaccessible. Be cautious of the implied down-time during exports and imports.
  5. ABP. Always Be Profiling. If you don't have a data profiler in place right now, that's the first thing to build. There are schema definition tools and schema checking tools. Look at JSON-Schema.org. Write schema definitions and use a data profiler to examine all rows and check all rules. All. Seriously. All. In a SQL DB, actually check the foreign keys to be sure the referenced row exists; you'll be surprised.
  6. We're moving forward. We're not milling around; we're not supporting the old version except for the purposes of a parallel test or a fall-back in the event the next version doesn't work. There's no long-term coexistence strategy. Preserve the data; upgrade the software.
Here's the central data migration requirement:

Be able to migrate to the new schema as many times as needed.

I'll repeat that. As. Many. Times. As. Needed.

Migration is not a one-time thing. You do it all the time.
  1. Migrating (and possibly sanitizing or subsetting) production data into the development environment.
  2. Migrating production data for QA testing.
  3. Migrating production data for integration testing.
  4. Migrating production data for performance testing.
  5. Migrating production data for the production upgrade.
These are all the same activity. 

I'll repeat that. The. Same. Activity. Sometimes with mappings. Sometimes with filters. 

Since you'll do many, many migrations, your data migration programming is as important as your application programming. Perhaps more important than the application code because it's what preserves the data, and the data is the only thing of value. Applications come and go. Data is forever.

Having real data available permits seamless, silent, and automatic parallel testing. We can easily do a parallel test with v3.2 and v4.0 release candidates by simply running the migration (or migration with subset filter) to gather some data for the parallel test. If the release candidate has problems, we can fix v4.0 to create the next release candidate, re-migrate the data, and try the parallel test again.

At some point the v4.0 release is final, and we need to migrate all of the data. This (usually) involves some feature toggles to put v3.2 into a special "end-of-life" mode where the keys for records which change are logged separately. After turning off v3.2 and turning on v4.0, a second phase of migration will process these end-of-life rows through the migration mill.

Software and Schema Design Consequences

This has an important consequence.

Your software must be explicitly bound to a specific schema by major version number.

Explicitly bound. In a SQL context, you can use the "schema" construct and include the version number in the schema name. "myapp_v3" vs. "myapp_v4". This becomes a ubiquitous qualifier on all table names. SELECT col FROM myapp_v4.some_table AS st.

Yes. Do this Everywhere. Do it Now. 

If you're using mybatis or SQLAlchemy to get the SQL out of your application, then this kind of thing is a trivial change. If you have SQL in your application code, well, you have two problems to solve. First, get the SQL out of your application. Then make the schema version explicit.
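
With SQLAlchemy, for example, the version-qualified schema can be attached once to the MetaData, and every table definition picks it up. A sketch, with placeholder table and column names:

from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData(schema="myapp_v4")    # the major version lives here, once

some_table = Table(
    "some_table",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("col", String),
)
# Statements built from some_table render as myapp_v4.some_table.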

In a NoSQL context, you can include the schema version as part of a collection name. "collection_v3" or "collection_v4".

This should be present everywhere.

Then, you'll need data validation apps and data migration apps. The validation apps will use your favorite schema definition and schema validation framework. Start running this as soon as you think you might need to make a major version change.

Finally, you'll need the data migration tool set. This will involve filter rules and sanitizing rules. These are not sophisticated "rules engine" kind of things with unbounded complexity. They're usually if statements and simple computations. But they come and go pretty freely, so design the software in a way that makes the filter and sanitizing code obvious.

Now you can -- trivially -- migrate data between schema versions inside the same database. You can have v3.2 and v4.0 running side-by-side. You can migrate the data early and often. You can profile and validate the data. You have a formal schema for the data validation. 

Tuesday, January 19, 2016

SQL Hegemony -- the "Pivot Table" problem

As far as I can tell, the Pivot Table Problem™ only exists for people who have actively put on blinders so that they can only see data one way.

This leads to the following.

The context appears to be millions of rows of data. Hundreds of columns.  It appears that someone we'll call DesKtop tried to load a spreadsheet to "pivot" the data. And the spreadsheet -- of course -- breaks because it's too much data.

DesKtop then calls the DBA.

DesKtop: "We need to load a table and use the database to create a spreadsheet-like pivot table."

DBA: "Okay. Cool. It's complex SQL, though. Check this out..." They look at the Oracle 11g PIVOT and UNPIVOT.

DesKtop: "Oh my. That's really complex. Okay, I guess that's the only choice, right?"

DBA: "Right, it is our only possible choice."

Wrong.

[I heard about this from DBA who sent me a "humble brag" about something that can only be done with hyper-complex SQL query. DBA had found a Python tutorial on Pandas that mentioned pivoting. The humble brag point was this: the Python stuff was just as hyper-complex as the SQL. Apparently, DBA conflated the entire tutorial with the one line of code that was the pivot example.]

If you restructure the data into (row key, column key, cell value) triples, you don't have a Pivot Table Problem™ any more. You have a SELECT reduction() GROUP BY row vs. SELECT reduction() GROUP BY column kind of query. There's no "pivot". Maybe it's a conceptual pivot but there's no hyper-complex SQL.

It requires a non-trivial loader to transform data that's in row order and explode it into triples. This isn't the kind of thing a program like Oracle's SQL*Loader or other bulk loader does particularly well. In Python (without using Pandas) we can expand the data into triples like this:

# Assuming reader is a csv.DictReader and column_names lists the columns to pivot.
for row in reader:
    for column in column_names:
        new_row = row['key'], column, row[column]
        insert...

The idea here is that we're using something like a csv DictReader. We have a list of column names we'd like to pivot. In many row-oriented data sets, there are columns we might like to ignore. For example, the row key column itself shouldn't be exploded into a (row key, column key, row key) triple.

This restructuring idea applies in full force to doing Python-based reduction of the data. Forget loading a database in the first place.

from collections import defaultdict

by_column = defaultdict(list)
for row in reader:
    for column in column_names:
        by_column[column].append(float(row[column]))  # csv cells are strings; convert here

We've summarized each column's data into a list of values. This is the "GROUP BY" part of the SQL. Now we can do reductions on the values in each column-based list.

from statistics import mean
for column in by_column:
    print(column, sum(by_column[column]), mean(by_column[column]))

We've done sum and mean reductions on the values in each column. We can -- of course -- layer in mapping and filtering if that's required.

This works well for millions of individual cells of data. We can comfortably hold several hundred million individual values in memory in a 32 GB desktop computer. You may notice the fan kicks on when this is running.

If this turns out to require too much storage, then the reductions can be computed item-by-item rather than simply accumulating a list of values. This is a hair more complex, but not in an interesting way.

sum_by_column = defaultdict(float)
count_by_column = defaultdict(int)

for row in reader:
    for column in column_names:
        sum_by_column[column] += float(row[column])
        count_by_column[column] += 1

Minima and Maxima are a trifle trickier. We don't want to initialize them to None and have an if current_min is None statement executed millions of times. We have to create an iterator and process the first row specially, using it to initialize all of the values. The remaining rows can then be processed free of any initialization question.

min_by_column = {}
max_by_column = {}

row_iter = iter(reader)
first = next(row_iter)
for column in column_names:
    min_by_column[column] = float(first[column])
    max_by_column[column] = float(first[column])
for row in row_iter:
    for column in column_names:
        min_by_column[column] = min(min_by_column[column], float(row[column]))
        max_by_column[column] = max(max_by_column[column], float(row[column]))

I like to call this the Head-Tail design pattern.

The DBA and DesKtop appear to be married to SQL. Even when it appears to be an ineffective solution to their problem.
     


Tuesday, October 6, 2015

Today's Milestone: Refactoring and Django Migrations

Once upon a time, when today's old folks were young, we'd debate the two project strategies: Hard Part Do Later (HPDL) vs. Hard Part First (HPF).

The HPDL folks argued that you could pick away at the hard part until -- eventually -- it wasn't hard any more. This doesn't often work out well in practice, but a lot of people like it. Sometimes the attempt to avoid the hard part makes it harder.

The HPF folks, on the other hand, recognized that solving the hard problem correctly may make the easy problems even easier. It may not, but either way, the hard part was done.

The debate would shift to what -- exactly -- constituted the hard part. Generally, what one person finds hard, another person has already done several times before. It's the part that no one has done before that eventually surfaces as being truly hard.

Young kids today (get off my lawn!) often try to make the case that an Agile approach finesses the "hard part" problem. We define a Minimally Viable Product (MVP) and we (magically) don't have to worry about doing the hard part first or last.

They're wrong.

If the MVP happens to include the hard part, we're back at HPF. If the MVP tries to avoid the hard part, we're looking at HPDL.

The Novelty Factor

Agile methods don't change things. We still have to Confront the Novelty (CTN™). Either it's new technology or it's a new problem domain or a new solution to an existing problem domain. Something must be novel, or we wouldn't be writing software, we'd be downloading it.

I'm a HPF person. If you set the hard part aside to do later, all the things you do instead become constraints, limiting your choices for solving the hard part that comes later. In some rare cases, you can decompose the hard part and solve it in pieces. The decomposition is simply Hard Part First through Decomposition (HPFtD™) followed by Prioritize the Pieces (PtP™) and another round of Hard Part First.

Today, we're at a big milestone in the HPF journey.

The application's data model is simple. However.

The application has a complex pipeline of processing to get from source data to the useful data model.

A strict (and dumb) MVP approach would skip building the complex pipeline and assume that it was magically implemented somehow.

A slightly smarter MVP approach uses some kind of technical spike solution to handle the complex pipeline. We do that manually until we get past MVP and decide to implement the pipeline in something more final and complete.

My HPF strategy tackles the complex pipeline because we have to build it anyway and it's hard. We don't have to build all of it. Just enough to lay out the happy path.

The milestone?

It's time to totally refactor because -- even doing the hard part first -- we have the wrong things in the wrong places. Django application boundaries generally follow the "resources". It's a lot like designing a RESTful API. Define the resources, cluster them together in some kind of ontology that provides a meaningful hierarchy.

Until -- of course -- you get past the problem domain novelty and realize that some portion of the hierarchy is going to become really lopsided. It needs to be restructured so we have a flat group of applications.

Wait. What?

Flatten?

Yes.

When we have a Django application model that's got eleventy-kabillion classes, it's too big. Think the magic number 7±2: there's a limit to our ability to grasp a complex model.

Originally, we thought we'd have apps "A", "B", and "C". However. "A" turned out to be more complex than it seemed when we initially partitioned the apps. Based on the way the classes are named and clustered in the model file, it's clear that we have an internal structure that is struggling to emerge. There are too many comments and high-level organizational hints in the docstrings.

It looks like this might be the model that's emerging:
  • Former A
    • A1
    • Conceptual A2
      • A2a
      • A2b
    • A3
  • B
  • C
This means that there will be classes in A3 that depend on separate apps A2a and A2b. Further, A2 is really just a concept that unifies the design; it doesn't need to be implemented as a proper app. Both A2a and A2b depend on A1. A3 depends on A2a, A2b, and A1.  

Ugh. Refactoring. And the associated migrations. 

Django allows us to have nested apps. But. Do we really want to go there? Is a nested collection of packages really all that helpful? 

Or.

Would it be better to flatten the whole thing, and simply annotate the dependencies among apps?

The Zen Of Python suggests that Flat is Better than Nested.

The hidden benefit of Flat is that the Liskov Substitution Principle is actually a bit easier to exploit. Yes, we have a tangled web of dependencies, but we're slightly less constrained when all of the Django apps are peers. Yes, many things will depend on the A1 app, but that will be less of a problem than the current pile of classes is.

The important part here is to start again. This means I need to discard the spike database and discard the history of migrations to date. I always hate disrupting my development database, since it has test cases I know and remember.

That's the disruptive milestone for me: discarding the old database and starting again.

Tuesday, June 9, 2015

On Waiting to Write "Serious Code"

Someone told me they weren't yet ready to write "serious code." They needed to spend more time doing something that's not coding.

I'm unclear on what they were doing. It appears they have some barriers that I can't see.

They had sample data. They had a problem statement. They had an existing solution that was not very good. I couldn't see any reason for waiting. Indeed, I can't figure out what "serious" code is. Does that mean there's frivolous code?

Because there was a previous solution, they had a minimum viable product already defined: it has to do what the previous version did, only be better in some way. One could trivially transform the previous product into unit test cases and an acceptance test case. Few things could be more amenable to coding than having test cases.

Since everything necessary seemed to be in place, I had a complete brain cramp when they mentioned they weren't yet ready to write "serious" code. "Serious?" Seriously?

It appears that this developer suffers from a bad case of Fear of Code™. I know some common sources of this fear.
  1. Waterfall Project Experience (WPE™.) Old people (like me,) who started in Waterfall World, were told that we had to produce mountains of design before we produced any code. No one knew why in any precise way. Indeed, there's ample evidence that too much design is simply a way to introduce noise into the process. In spite of real questions, some folks think that you can write a design so detailed that a coder can just type in the code from the design. (This level of design is isomorphic to code; to avoid ambiguity it must be written as code.)
  2. Relational Database Hegemony (RDH™.) Folks (like me,) who were DBA's, know that databases require a lot of design and a lot of review before they can be created. Writing stored procedures requires even more design and review time. You don't just slap an SP out there. It might be "bad" or "create problems." Also, when you insist on DBA's writing application code, it takes super-detailed, code-level design details. In effect, you must write the code for the DBA to write your code back to you. 
  3. One and Done (OAD™.) Some people like to feel that they can write code once and it can be a thing of beauty and a joy forever. The idea of a rewrite is anathema to these people. While this is obviously silly, people still like the conceit that they can produce some prototype code that will be a proper part of every future release forever and always. It's not possible to make all of the decisions the first time regarding adoption and scaling and user preferences. Your prototype code will get replaced eventually: get over it. Write the prototype, get funding, move forward. Don't dither trying to make a bunch of future-oriented decisions based on a future you cannot actually foresee. You can't "future-proof" your code.
  4. Learnings are Expensive (LAE™.) You can find people that think that the sequence of (spike, POC, version 0, version 1) is too expensive. They are sure that learning is a project drag, since no "tangible" results are created by learning. This means that they don't value intellectual property or knowledge work, either; an attitude that is actually destructive to the organization. Knowledge is everything: software captures knowledge; a spike followed by a POC followed by version zero will arrive on the scene more quickly than any alternative strategy. Don't waste time trying to write version 1 from a position of ignorance.
  5. Tools are Expensive (TAE™.) Some people feel that -- since tools are expensive -- they should be used rarely. Back in the olden days, when a compiler took many minutes to produce an error report, you had to be sure the code was good. (I'm old enough that I remember when compiles took hours. Really.) Those days are gone. Most compilers today work at the "speed of light" -- if they were any faster, you couldn't tell, because you can't click any faster. For dynamic languages, like Python, the speed with which code can be emitted makes all tool considerations quaint and silly.
  6. Diagram it to Death (DTD™.) Rather than write code, some folks would rather talk about writing code. To them, email, powerpoint, and whiteboard are cheaper than coding. This is a false economy. Nothing is saved by avoiding code. Time is wasted drawing diagrams of things at a level of detail that mirrors the code. Pictures aren't bad in general. Detailed pictures are simply a stalling tactic.
I find it frustrating when people search for excuses to avoid simply creating code. While I see a number of sources, there are many counter-arguments available. 
  1. Waterfall is dead. Make something minimal that works for this sprint. Call it a "spike" if that makes you happier. Clean it up in the next sprint. Create value early. Expand on the features later.
  2. Databases are free now. SQLite and similar products mean that we can prototype a database without waiting around for DBA's to give us permission to make progress. Build the database now, get something that works. Rework the database as your understanding of the problem matures. Rework the database as the problem itself matures and morphs. Nothing is static; the universe is expanding; do something now.
  3. No code lasts forever. Waiting around to create some kind of perfect value one time only is perfect silliness. Create value early and often. Discarding code means you're making progress. If you think it's important, write "draft" on every electronic document which might get changed. (Hint: version numbers are smarter than putting "draft" everywhere.)
  4. A spike and code happens more quickly than code. It's a matter of technical risk: unfinished work is an "exposure" -- an unrealized investment. Failing soon is better than researching extensively in an effort to prevent a failure that could have been found quickly.
  5. Use a dynamic language and avoid all overheads.
  6. Keep the diagrams high-level. Code is the only way to meaningfully capture details. Code endures better than some out-of-date Visio file that's in Sharepoint completely disconnected from GitHub.
It's imperative to break down the roadblocks. All "pre-coding" activities are little more than emotional props: knock them down and start coding.

Tuesday, January 7, 2014

Wrestling with the Python shelve module

While wrestling with Python's shelve module, I ran across ACID. Interesting thoughts.

Plus what appears to be the related Tumblr blog: python sweetness. Also interesting.

Not sure I can make heavy use of it right now, but it's helpful to see the thought process.

I find the subject of shelve (or pickle) and schema change endlessly fascinating.  I have no good ideas to contribute, but it helps to read about ways to track schema evolution against data that's as highly class-specific as shelve data is.

Versioning class definitions and doing data migration to upgrade a database is -- right now -- a fascinating problem.

Tuesday, June 12, 2012

The Universal Data Element Framework (UDEF)

Okay.  This is seriously cool.

The Universal Data Element Framework (UDEF)  provides a controlled vocabulary that should be used to seed a project's data model.

See http://www.udef.com/

See http://www.opengroup.org//udef/

We're looking at applying UDEF retroactively to an existing schema.  What a pain in the neck!

Step 1.  Parse the table names.  In our case, they're simply CONTIGUOUSSTRINGSOFCHARS, so we have to work out a quick lexicon and use that to break the names into words.  Then we can find the obvious aliases, spelling mistakes and noise words.  'spec', 'quanitity' and 'for' are examples of each.
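
Here's a minimal sketch of that first step, assuming a small hand-built LEXICON of known words; a real table inventory needs a much larger word list.

LEXICON = {"customer", "invoice", "detail", "spec", "quantity", "for"}

def split_table_name(name, lexicon=LEXICON):
    """Greedy longest-match split of an all-caps table name."""
    name = name.lower()
    words = []
    while name:
        for size in range(len(name), 0, -1):
            if name[:size] in lexicon:
                words.append(name[:size])
                name = name[size:]
                break
        else:
            # No lexicon entry matched; peel off one character and keep going.
            words.append(name[0])
            name = name[1:]
    return words

# split_table_name("CUSTOMERINVOICEDETAIL") == ['customer', 'invoice', 'detail']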

Step 2.  Look up the various words in the UDEF vocabulary to create candidate matches.   Since each individual word is matched, each table will have multiple candidate matches to seed the analyst's thinking.
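
A minimal sketch of the lookup, with a tiny hypothetical word-to-UDEF-term mapping standing in for the real vocabulary:

UDEF_CANDIDATES = {
    "customer": ["Customer.Identifier", "Customer.Name"],
    "invoice": ["Invoice.Document.Identifier"],
}    # hypothetical entries; the real UDEF vocabulary is far larger

def candidate_matches(words, vocabulary=UDEF_CANDIDATES):
    """Every recognized word contributes its candidates to seed the analysis."""
    candidates = []
    for word in words:
        candidates.extend(vocabulary.get(word, []))
    return candidates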

Step 3.  Manually pick UDEF standard names or create internal extensions to the standard for the problem domain or enterprise unique features.

Do a similar thing for the column names.  In that case, they're CamelCaseWithSomeACRONYMS.  This is slightly easier to parse, but not much.

Eventually, we have to apply real human business analyst grey matter to locating standard names which might fit with the host of legacy names.

Here's the column name parser.

def property_word_iter(prop_name):
    """Find words via case changes.

    -   Lower to upper ends a word.
    -   Upper to lower ends a word.  However.
        Sometimes the Upper is an acronym that was all caps.
        A lookahead is required to disambiguate.
    """
    cc_iter = iter(prop_name)
    first = next(cc_iter, '')
    if not first:
        return
    word = [first]
    for c in cc_iter:
        if c.isdigit():
            # A digit ends the current word and is emitted by itself.
            yield ''.join(word)
            yield c
            nxt = next(cc_iter, '')
            if not nxt:
                return
            word = [nxt]
        elif word[-1].islower() and c.islower():
            word.append(c)
        elif word[-1].isupper() and c.isupper():
            word.append(c)
        elif word[-1].islower() and c.isupper():
            # Lower-to-upper ends the word; look ahead one character
            # to see whether an all-caps acronym starts here.
            yield ''.join(word)
            c2 = next(cc_iter, '')
            if c2.isupper():
                word = [c, c2]
            elif c2:
                word = [c.lower(), c2]
            else:
                word = [c.lower()]
        elif word[-1].isupper() and c.islower():
            # Upper-to-lower: the preceding uppers were an acronym,
            # except the last one, which starts this word.
            if word[:-1]:
                yield ''.join(word[:-1])
            word = [word[-1].lower(), c]
        else:
            raise ValueError("What? {0!r} {1!r}".format(word[-1], c))
    if word:
        yield ''.join(word)
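
For example, applied to an illustrative column name, the parser keeps acronyms intact and lowercases the rest:

>>> list(property_word_iter("CustomerIDNumber"))
['customer', 'ID', 'number']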

Thursday, May 26, 2011

Code Kata : "Simple" Database Design

Here's a pretty simple set of use cases for a code-kata database application.

This is largely transactional, not analytical.

It's a simple inventory of ingredients, recipes and locations.

Context
  • 42' sailboat.
  • Lots of places to keep stuff. Lots.
Stuff gets lost or misplaced. It's helpful to marry recipes with ingredients to use up the last of something before it goes bad and stinks up the boat.

Actor is essentially the cook.

Use Cases
  • Perishables to be eaten soon?
  • Shopping list for specific recipes.
  • Where did I put that?
Model


  • Ingredient. A generic description: "lime", "coconut". Not too much more is needed. A "food safety" notation (refrigeration required, etc.) is a helpful attribute. Maybe a "food group" or other nutrition information.
  • Location. A text description of where things can be stored. This shouldn't have too many attributes, because boats aren't big grids. Phrases like "port saloon upper cabinet", or "galley outer cooler" make sense to folks who live on the boat.
  • On Hand. This is simply ingredient, location and a measurement of some kind. Example: 3 limes in the starboard galley center cooler. There's a lot of magic around units and unit conversion that can be fun. But that strays outside the database domain.
  • Recipe. Example: "One of sour, two of sweet, three of strong, and four of weak.", lime, simple syrup, rum, water. Plain text using a lightweight markup is what's required here. Along with a many-to-many relationship with ingredients. This is not carefully defined above because it should be done as a "more advanced" exercise.
I think this has the right amount of complexity and isn't very abstract. Since the use cases are pretty obvious to anyone who's cooked or been to a grocery store, use case details aren't essential.
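
Here's a minimal sketch of the model as a SQLite schema, driven from Python's sqlite3. The table and column names are illustrations only, not part of the kata.

import sqlite3

conn = sqlite3.connect("galley.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS ingredient (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    food_safety TEXT                 -- e.g., 'refrigeration required'
);
CREATE TABLE IF NOT EXISTS location (
    id INTEGER PRIMARY KEY,
    description TEXT NOT NULL        -- e.g., 'port saloon upper cabinet'
);
CREATE TABLE IF NOT EXISTS on_hand (
    ingredient_id INTEGER REFERENCES ingredient(id),
    location_id INTEGER REFERENCES location(id),
    quantity TEXT                    -- '3 limes'; unit conversion stays out of scope
);
CREATE TABLE IF NOT EXISTS recipe (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    body TEXT                        -- lightweight markup
);
CREATE TABLE IF NOT EXISTS recipe_ingredient (
    recipe_id INTEGER REFERENCES recipe(id),
    ingredient_id INTEGER REFERENCES ingredient(id)
);
""")
conn.commit()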

Wednesday, February 2, 2011

Escaping the Relational Schema Trap

We're struggling with our Relational Schema. We're not alone, of course, everyone struggles with the relational model. The technology imposes difficult limitations and we work around them.

There's kind of a 4-step process through which the relational schema erodes into irrelevance. The concept of a schema is not irrelevant. It's the rigid relational schema that's a problem.

Many DBA's will say that the relational model is the ultimate in flexibility. They're right, but they're missing the point. The relational database clearly separates the physical storage from the logical model as seen in tables and columns. It's flexible, but the presence of a rigid relational schema limits the pace of business change.

"Clearly," the DBA says, "you don't know how to use ALTER." I beg to differ. I can use ALTER; however, it doesn't permit the broad, sweeping scope of change that the business demands.

In order to attempt to match the pace of business change, we're using an ORM layer. This allows us to fabricate methods and properties left, right and center. We can tackle some pretty big problems with simple code changes. This, however, is no longer helping.

Straws and Camels

When designing a database, we have to be cognizant of the nature and tempo of change. In highly-regulated, very settled business applications (back-office accounting, for example) the data model is well known. Changes are mostly distinctive reporting changes and the tempo is pretty lethargic. It's the back office. Sorry, but innovation rarely happens there.

Each change is just another handful of straw thrown on the camel's back. It happens fairly slowly. And there aren't many surprises. Hacks, workarounds and technical debt accumulate slowly.

In innovative, novel, experimental businesses, however, the nature and tempo are very different. The changes are disruptive, "what are you saying?" kinds of changes. They are "throw out the bathwater, the babies, the cribs and fire the nursemaid" kinds of changes. The tempo is a semi-annual reinvention of everything. Hacks, workarounds and technical debt get out of control suddenly.

Important Lesson Learned. When the customer misunderstands the offering and asks for something completely senseless, it's good to listen and try to build that -- even if it wasn't what you were offering. In some cases, the original offering was too complex or contrived. In other cases, the offering didn't create enough value. But when you offer [X] and the customer asks how much it will cost for [Y], you have disruptive, sudden, and surprising database changes.
This is bales of hay thrown onto an unprepared camel. Backs can get broken.

Coping

One common coping strategy is SQL ALTER statements to fiddle with the logical model. This has to be coupled with CREATE TABLE AS SELECT scripts to do open-heart surgery on the logical model. Married with modified ORM definitions. This requires some careful "schema versioning" techniques.

Another coping strategy is lots of "Expansion" columns in the tables. These can be renamed and repurposed without physical storage changes. The rows haven't physically changed, but the column name morphed from "EXPANSION_INT_01" to "Some_Real_Attribute". This doesn't eliminate the need for CREATE TABLE AS SELECT scripts to do open-heart surgery. It still requires some careful "schema versioning" techniques to be sure that the ORM layer matches the logical schema.

A third -- and perhaps most popular -- coping strategy is manpower. Just having dedicated DBA's and maintenance programmers is a common way to handle this. Some folks object, saying that a large staff isn't a way to "cope with change" but is a basic "cost of doing business".

It's false, by the way, to claim that dedicated DBA's are essential. A solo developer can design and implement a database and application software with no help at all. Indeed, in most organizations, developers design and build databases, then turn them over to DBA's for operational support. If the nature of change is minor and tempo of change is slow, a solo developer can deal perfectly well with the database. A dedicated DBA is someone we add when the developer gets swamped by too much change.

(Some DBA's like to claim that the developers never get normalization or indexing correct. I counter with the observation that some DBA's don't get this right, either. DBA's aren't essential. They're a popular way to cope with the nature and tempo of change.)

In the ORM world, there are schema migration toolkits. Projects like Storm, this list for Django, Embarcadero Change Manager for Oracle, and numerous others attempt to support the schema evolution and change management problem. All of this is a clever way to cope with a problem inherent in our choice of technology.

Chaos Theory

Rather than invent clever coping mechanisms, let's take a step back. If we're inventing technology to work around the fixed relational schema, it might be time to rethink the relational schema.

"Oh noes," DBA's cry, "we must have a fixed logical model otherwise chaos ensues."

Really? How come we're always altering that schema? How come we're always adding tables and restructuring the tables?

"Oh that? That's 'controlled change'," the DBA responds.

No, that's slow chaos.

Here's how it plays out. We have a disruptive change. We negotiate with the DBA's to restructure the database. And the test database. And the QA database. We do the development database without any help from the DBA's. We fix the ORM layers. We unit test the changes.

Then we plan and coordinate the production rollout of this change with the DBA's. Note. We already made the change in development. We're not allowed to make the change in production. The DBA's then suggest design alternatives. Normalization isn't "right". Or there are physical changes that need to be declared in the table definitions. We redo the development database. And the ORM layer. And rerun the unit tests.

Because the production database couldn't be touched -- and we had paying customers -- we copied production data into a development database and started doing "production" in development. Now that we're about to make the official production change, we have two databases. The official database content is out-of-date. The development database is a mixture of live production and test data. Sigh.

Rethinking Schema

If the schema is a problem, perhaps we can live without it. Enter NoSQL databases.

Here's how you start down the slippery slope.

Phase I. You need a fairly radical database change. Rather than wait weeks for the DBA's, you ask for a single "BLOB" column. You take the extra data elements for the radical change, JSON encode them, and store the JSON representation in the BLOB field. Now you have a "subschema" buried inside a single BLOB column.

Since this is a simple ALTER, the DBA's will do it without a lot of negotiation or delay. You have a hybrid database with a mixture of schema and noSQL.
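
A minimal sketch of the Phase I hybrid, assuming an illustrative invoice table that picked up an extra_data BLOB column via that simple ALTER; the "?" placeholders assume a DB-API driver like sqlite3.

import json

def save_extras(cursor, invoice_id, extras):
    """Tuck the new, not-yet-modeled attributes into the BLOB column."""
    cursor.execute(
        "UPDATE invoice SET extra_data = ? WHERE id = ?",
        (json.dumps(extras), invoice_id),
    )

def load_extras(cursor, invoice_id):
    """Pull the subschema back out of the BLOB column."""
    cursor.execute("SELECT extra_data FROM invoice WHERE id = ?", (invoice_id,))
    row = cursor.fetchone()
    return json.loads(row[0]) if row and row[0] else {}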

Phase II. You need an even more radical change. Rather than wait weeks for the DBA's, you ask for a few tables that have just a primary key and a BLOB column. You've basically invented a document-structured database inside SQL, bypassing the SQL schema entirely.

Phase III. While waiting for the Phase II changes to be implemented, you convert the customer data from their obscure, stupid format into a simple sequential file of JSON documents and write your own simple map-reduce algorithms in Python. Sure, performance is poor, but you're up and running without any database overheads.
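
A minimal sketch of that Phase III stopgap, assuming one JSON document per line and an illustrative "invoice totals by customer" question:

import json
from collections import defaultdict

def map_reduce(path, key_fn, value_fn, reduce_fn=sum):
    """Tiny single-process map-reduce over a file with one JSON document per line."""
    groups = defaultdict(list)
    with open(path) as source:
        for line in source:
            doc = json.loads(line)
            groups[key_fn(doc)].append(value_fn(doc))
    return {key: reduce_fn(values) for key, values in groups.items()}

# Illustrative use:
# totals = map_reduce("invoices.json", lambda d: d["customer"], lambda d: d["amount"])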

Phase IV. Start looking for alternatives.


This MongoDB looks really nice. PyMongo offers lots of hints and guidance.

At least one person is looking at mango, a MongoDB database adapter for Django. For us, this isn't the best idea. We use OpenAM for identity management, so our Users and Sessions are simply cloned from OpenAM by an authentication backend that gets the user from OpenAM. SQLite works fine for this.

We think we can use Django's ORM and a relational database for User and Session. For everything else, we need to look closely at MongoDB.

Wins and Losses

The big win is the ability to handle disruptive change a little bit more gracefully.

The big loss in switching away from the Django ORM is we lose the built-in admin pages. We have to build admin Forms and view functions. While this is a bit of a burden, we've already customized every model form heavily. Switching from ModelForm to Form and adding the missing fields isn't much additional work.

The biggest issue with document-oriented data models is assuring that the documents comply with some essential or core schema. Schemas are inescapable. The question is more a matter of how the schema limits change. Having a Django Form to validate JSON documents for the "essential" features is far more flexible than having a Django Model class and a mapping to a relational database.
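
As a sketch, assume a hypothetical document type with two "essential" fields; everything else in the JSON document passes through untouched.

from django import forms

class EssentialForm(forms.Form):
    customer = forms.CharField(max_length=64)
    period = forms.CharField(max_length=16)

def validate_document(document):
    """Check a decoded JSON document against the essential schema."""
    form = EssentialForm(document)
    if form.is_valid():
        return form.cleaned_data
    raise ValueError(form.errors)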

Schema migration becomes a non-issue until we have to expand the essential schema, which changes the validation rules, and may render old documents retroactively invalid. This is not a new problem -- Relational folks cope with this, also -- but if it's the only problem, then we may have streamlined the process of making disruptive business changes.

Thursday, January 27, 2011

FAERIE DUST™

Here's how to recognize a Faerie Dust request:

  1. We have identified a problem. It can be with almost anything: scalability, reliability, auditability, any Quality Measure.
  2. We're pursuing a specific technology. Typically, something that has the lowest impact on our architecture.
  3. We can't address anything other than this specific technology variation -- we can't change the application software or buy hardware.

Once we're in the Faerie Dust realm, what can we do?

Laughing doesn't help. They have a serious problem, they need a solution. The fact that they won't address the cause isn't completely relevant -- we have to work on the denial, anger, negotiation, depression cycle first. Hopefully skipping past the anger, or assuring the anger is directed elsewhere.

Helping doesn't help. If we join the quest for their Faerie Dust, what will we accomplish? We'll burn billable hours to -- eventually -- reach an equivocal non-solution with a complex write-up and recommendations that won't be implemented.

Not helping doesn't help. If we obstinately refuse to join the quest for the Faerie Dust... well... then we've done nothing. We haven't advanced their understanding of their problem.

What's left? Is there a middle road that allows us to join the Faerie Dust quest, but still point out the side roads, other monsters and other treasures along the way?

Perhaps there is, but it would require a kind of saintly patient persistence. We would have to start with an enumeration of problem causes, prioritize them, and then focus on their selected bit of Faerie Dust. My idea is that enumerating the possible causes allows us to identify the missed opportunities, and the possible magnitude of fixing something essential (algorithm or data structure) instead of throwing up window-dressing to cover problems in something inessential (reducing the time required for a table scan).

Example

Here's a concrete example of Faerie Dust.

  1. Picked a data model that didn't fit the use cases, i.e., lumped many discrete details into a single text field that has "rich semantic content". Worked around this mistake by using wild-card matches.
  2. Complained about performance and dug into nuanced details of the LIKE clause and full-text search. Lots of study time spent on LIKE clause processing and how to improve performance.
  3. Refused to discuss the actual use case or the mismatch between data structures and requirements.

The design didn't match the use cases. Faerie Dust won't help.

Tuesday, January 25, 2011

Wild-Card (LIKE-clause) searches are slow. What to do?

Patient: "Doctor, doctor, it hurts when I do this."
Doctor: "Then don't do that."

I got an email with hundreds of words of content. This part made sense: "...doing wild card searches using Oracle's database engine and are wondering why is it so slow and how do they make it go faster."

The rest made very little sense at all. The programmer in question immediately dove into nuances of indexing, Oracle pattern matching, Oracle Text Query and other technical questions. The entire focus was on the technical ins-and-outs.

Not a single word on why wildcards were even being used in the first place. Wildcards appear to solve a business problem; the business problem was never mentioned.

Use Case for Wildcards

After some back-and-forth, the use case emerged. We'll address it below. Essentially, the invoices have names (really) that have "rich semantic content". These invoice names have the form "{customer} {time period} {offering}".

Apparently, the use case is "slice-and-dice" queries. All invoices for a given customer; all invoices in a given time period; all invoices for a given offering; various combinations.

Really. Rather than provide discrete dimensions and use a star schema, they've (a) combined all attributes into a single free-text field and (b) used wild-card searches and now (c) want to complain about it.

We'll return to this use case below.

Basic Rules

Here are the two rules.

Wild Cards Are The Last Resort For Human-Friendly Search.

Outside Human-Friendly Search, Wild Cards Are Useless.

Let's look at rule 1: Wild Cards Are The Last Resort For Human-Friendly Search.

When a person enters a search string on a web page, we have two choices.
  1. Trust them to enter the exact field as it appears in the database
  2. Presume that people are fallible and cannot be trusted to enter the exact field.
In case #1 (exact match) we might be using an account number, shipping number, an invoice number or some kind of surrogate key. In this case, we do simple equality checks. If the user can't get it right, bummer. In many cases, this is appropriate to prevent snooping.

In case #2 (partial match), we're forced to use some kind of SQL LIKE clause for the human-friendly search. We have several implementation choices, some in the database, some out of the database. Some in-the-database solutions benefit from clever indexing. Many in-the-database solutions are pretty slow.

Yes, an out-of-the-database solution may actually be faster. Until we benchmark, we can't know. There's no trivial rule that says the database always does search faster. For real speed, we may have to resort to a hybrid solution.

Search Optimization

We might create a small RESTful server for our searchable text fields. This is a cache; the server should handle CRUD rules to assure cache coherence. This search server can use a Regular Expression engine, or perhaps compute Levenshtein distances or whatever makes sense to optimize user-oriented search.

If we're searching in larger chunks of text, we might want to use a commercial full-text search.

What's essential about this plan is that we're looking at application-specific optimizations. People need flexibility for specific reasons. It's important to look at the actual use cases where a person cannot make an exact match lookup. What problems do they have?

An application may have to deal with customer names. These are often difficult to spell consistently. (Is it "AT&T" or "ATT"?) For this kind of thing Levenshtein Distance might make more sense than wild-card searches.
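
Here's a minimal sketch of that kind of match, using the standard library's difflib; the candidate list and cutoff are illustrative.

import difflib

KNOWN_CUSTOMERS = ["AT&T", "Verizon", "Comcast"]    # illustrative data

def best_customer_match(entered, candidates=KNOWN_CUSTOMERS, cutoff=0.6):
    """Return the closest known customer name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(entered, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# best_customer_match("ATT") returns "AT&T"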

An application may have to deal with time periods. "2010", "2Q 2010", "July 2010", etc. This is best handled by decomposing time periods into discrete fields and doing appropriate exact match on the specific, relevant fields. The issue is that there are a lot of formulations and some text parsing can be better than a form with a million drop-downs.

An application may have to deal with oddly-named offerings. Marketing calls it one thing. Sales folks call it another. The customer's invoice may call it a third, and the help desk may not use any of those phrases. This may benefit from wild-cards.

Note that we're looking at the business issues. Not the technology issues.

Design Errors

The proper use for LIKE is only to optimize the human-friendly search. Nothing else. Which brings us to rule 2, Outside Human-Friendly Search, Wild Cards are Useless.

Outside human search, every wild-card in a SQL statement indicates a serious database design error. Serious? Error? Yes.

LIKE clauses outside human search indicate a failure to create a design in first normal form (1NF). A field which is used in a LIKE clause has multiple parts, and should have been decomposed into pieces.

Decomposing a multi-part attribute isn't always trivial. There are two cases.
  1. Simple, regular format or punctuation. For example, SSN, US Phones or ZIP codes: 123-45-6789 or (123)555-1234 or 12345-1234.
  2. Complex, irregular format or punctuation. In this case, we have disjoint subtypes in a single table. Most manufacturing part numbers suffer from this.
In case 1, we have two choices: fully decompose or denormalize. In case 2, we can only denormalize because the rules are irregular.

The decomposition solution does not have to lead to a hideous user interface. We can have a web page with a single text field for phone numbers. We can parse that string and decompose the phone number into area code, exchange and number for purposes of database storage. We don't have to thoughtlessly force the users to decompose a field that they don't see as being in three parts.
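
For example, here's a minimal sketch of that parse-then-decompose step for US phone numbers; the format rules are simplified illustrations.

import re

def decompose_phone(text):
    """Parse one user-entered phone string into (area_code, exchange, number)."""
    digits = re.sub(r"\D", "", text)     # "(123)555-1234" -> "1235551234"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]              # drop a leading country code
    if len(digits) != 10:
        raise ValueError("not a 10-digit US phone number: {0!r}".format(text))
    return digits[:3], digits[3:6], digits[6:]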

The denormalization solution means that we have to do some calculation when we accept the input value. We save the full field, plus we extract the various sub-fields based on whatever hellish, complex rules we're faced with.

Implementation Choices

Whenever we have a single text field with "rich semantic content" (i.e., combines multiple disjoint attributes like customer, time period and offering) what we're seeing is a clever way to push database design onto the users. The expectation is that IT will (1) understand the use cases, (2) provide a proper design and (3) optimize performance around that design.

A big text field and wild-card search (and the attendant email traffic) indicates an explicit unwillingness to discuss the real use cases, unwillingness to do design, and a lame hope that somehow wild-card searches can magically be made faster through magical indexing or other super-natural techniques.

The "rich semantic content" field can be decomposed one of two ways.
  • In the GUI. Add drop-downs so users pick the customer, time period, and product offering information.
  • In the Application. Parse the big text field into smaller text fields that don't require wild-card search.
There isn't any magic. If wild-card searches are too slow, they have to be replaced.

Benefits?

The benefit of decomposing (or denormalizing) a complex field is that we can eliminate LIKE processing and wild-cards. Instead of "LONG_TEXT_FIELD LIKE '%2Q 2010%'", we can do "DATE.QUARTER=2 AND DATE.YEAR=2010".

All the technical folderol related to indexing and full-text search and database regular expression engines goes right out the window.

The cost is that we have to "wrap" the INSERT and UPDATE processing in a class definition that does the denormalization. That's what a data model layer is for: these kinds of business rules. The insert/update cost, BTW, will be microscopic compared to the number of SELECTs. The extra time spent at INSERT will be handsomely amortized over all the simplified SELECT operations.
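
A minimal sketch of that wrapper, with illustrative column names and a hypothetical parse_invoice_name() helper standing in for the real (messier) rules:

import re

def parse_invoice_name(name):
    """Hypothetical parser for '{customer} {time period} {offering}'.
    Assumes a '{n}Q {yyyy}' period; real names would need richer rules."""
    match = re.match(r"(?P<customer>.+?)\s+(?P<period>\dQ \d{4})\s+(?P<offering>.+)", name)
    if not match:
        raise ValueError("cannot decompose {0!r}".format(name))
    return match.group("customer"), match.group("period"), match.group("offering")

def insert_invoice(cursor, invoice_name):
    """Save the original text plus the decomposed, exact-match columns."""
    customer, period, offering = parse_invoice_name(invoice_name)
    cursor.execute(
        "INSERT INTO invoice (name, customer, period, offering) VALUES (?, ?, ?, ?)",
        (invoice_name, customer, period, offering),
    )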

Thursday, October 21, 2010

Code Base Fragmentation

Here's what I love -- an argument that can only add cost and complexity to a project.

It sounds like this to me: "We need to fragment the code base into several different languages. Some of the application programming simply must be written in a language that's poorly-understood, with tools that are not widely available, and supported by a select few individuals that have exclusive access to this code. We haven't benchmarked the technical benefit."

Further, we'll create complex organizational roadblocks in every single project around this obscure, specialized, hard-to-support language.

Perhaps I'm wrong, but database triggers always seem to create more problems than they solve.

They Totally Solve a Problem

The most common argument boils down to application-specific cross-cutting concerns. The claim is that these concerns (logging, validation, data model integrity, whatever) can only be solved with triggers. For some reason, though, these cross-cutting concerns can't be solved through ordinary software design. I'm not sure why triggers are the only solution when simple OO design would be far simpler.

Some folks like to adopt the "multiple application programming languages" argument. That is, that ordinary OO design won't work because the code would have to be repeated in each language. This is largely bunk. It's mostly folks scent-marking territory and refusing to cooperate.

Step 1. Write a library and share it. It's hard to find a language that can't be used to write a sharable library. It's easy to find an organization where the Visual C# programmers are not on speaking terms with the Java programmers and the isolated Python folks are pariahs. This isn't technology. Any one of the languages can create the necessary shared library. A modicum of cooperation would be simpler than creating triggers.

Step 2. Get over it. "Duplicated" business logic is rampant in most organizations. Now that you know about it, you can manage it. You don't need to add Yet Another Language to the problem. Just cooperate to propagate the changes.

They're Totally Essential To The Database

The silly argument is that some business rules are "closer to" or "essential to" the database. The reason I can call this silly is because when the data is converted to another database (or extracted to the data warehouse) the triggers aren't relevant or even needed. If the triggers aren't part of "interpreting" or "using" the data, they aren't essential. They're just convenient.

The data really is separate from the processing. And the data is far, far more valuable than the processing. The processing really is mostly application-specific. Any processing that isn't specific to the application really is a cross-cutting concern (see above). There is no "essential" processing that's magically part of the data.

What If...

Life is simpler if all application programming is done in application programming languages. And all triggers are just methods in classes. And everyone just uses the class library they're supposed to use.
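
As a minimal sketch of "triggers are just methods," assume an illustrative Invoice class and audit_log table; the names are made up for this example.

import datetime

class Invoice:
    """Illustrative class: the 'trigger' work is just part of save()."""
    def __init__(self, number, amount):
        self.number = number
        self.amount = amount

    def save(self, cursor):
        # Application-level stand-in for an ON INSERT trigger: audit first, then write.
        cursor.execute(
            "INSERT INTO audit_log (event, at) VALUES (?, ?)",
            ("invoice {0} saved".format(self.number), datetime.datetime.now().isoformat()),
        )
        cursor.execute(
            "INSERT INTO invoice (number, amount) VALUES (?, ?)",
            (self.number, self.amount),
        )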

"But what if someone doesn't use the proper library? A trigger would magically prevent problems."

If someone refuses to use the application libraries, they need career coaching. As in "find another job where breaking the rules is tolerated."