S.Lott-Software Architect: RDBMS

Showing posts with label RDBMS. Show all posts

Thursday, July 18, 2013

NoSQL Befuddlement: DML and Persistence

It may be helpful to look back at 'How Managers Say "No"' which is about breaking the RDBMS Hegemony.

I got an email in which the simple concepts of "data manipulation" and "persistence" had become entangled with SQL DML to a degree that the conversation failed to make sense to me.

They had been studying Pandas and had started to realize that the RDBMS and SQL were not an essential feature of all data processing software.

I'll repeat that with some emphasis to show what I found alarming.

They had started to realize that the RDBMS and SQL were not an essential feature of all data processing.

Started to realize.

They were so entrenched in RDBMS thinking that the very idea of persistent data outside the RDBMS was novel to them.

They asked me about extending their growing realization to encompass other SQL DML operations: INSERT, UPDATE and DELETE. Clearly, these four verbs were all the data manipulation they could conceive of.

This request meant several things, all of which are unnerving.

They were sure—absolutely sure—that SQL DML was essential for all persistent data. They couldn't consider read-only data? After all, a tool like Pandas is clearly focused on read-only processing. What part of that was confusing to them?
They couldn't discuss persistence outside the narrow framework of SQL DML. It appears that they had forgotten about the file system entirely.
They conflated data manipulation and persistence, seeing them as one thing.

After some back-and-forth it appeared that they were looking for something so strange that I was unable to proceed. We'll turn to this, below.

Persistence and Manipulation

We have lots of persistent data and lots of manipulation. Lots. So many that it's hard to understand what they were asking for.

Here's some places to start looking for hints on persistence.

http://docs.python.org/3/library/persistence.html

http://docs.python.org/3/library/archiving.html

http://docs.python.org/3/library/fileformats.html

http://docs.python.org/3/library/netdata.html

http://docs.python.org/3/library/markup.html

http://docs.python.org/3/library/mm.html

This list might provide some utterly random hints as to how persistent data is processed outside of the narrow confines of the RDBMS.

For manipulation... Well... Almost the entire Python library is about data manipulation. Everything except itertools is about stateful objects and how to change state ("manipulate the data.")

Since the above lists are random, I asked them for any hint as to what their proper use cases might be. It's very difficult to provide generic hand-waving answers to questions about concepts as fundamental as state and persistence. State and persistence pervade all of data processing. Failure to grasp the idea of persistence outside the database almost seems like a failure to grasp persistence in the first place.

The Crazy Request

Their original request was—to me—incomprehensible. As fair as I can tell, they appeared to want the following.

I'm guessing they were hoping for some kind of matrix showing how DML or CRUD mapped to other non-RDBMS persistence libraries.

So, it would be something like this.

SQL	OS	Pandas	JSON	CSV
CREATE	file()	some pandas request	json.dump()	csv.writer()
INSERT	file.write()	depends on the requirements	could be anything	csv.writerow()
UPDATE	file.seek(); file.write()	doesn't make sense	not something that generalizes well	depends on the requirements
DELETE	file.seek(); file.write()	inappropriate for analysis	depends on the requirements	hard to make this up without more details
APPEND -- not part of DML	file.write()	depends on requirements	could be anything	csv.writerow()

The point here is that data manipulation, state and persistence is intimately tied to the application's requirements and processing.

All of which presumes you are persisting stateful objects. It is entirely possible that you're persisting immutable objects, and state change comes from appending new relationships, not changing any objects.

The SQL reductionist view isn't really all that helpful. Indeed, it appears to have been deeply misleading.

The Log File

Here's an example that seems to completely violate the spirit of their request. This is ordinary processing that doesn't fit the SQL DML mold very well at all.

Let's look at log file processing.

Logs can be persisted as simple files in simple directories. Compressed archives are even better than simple files.
For DML, a log file is append-only. There is no insert, update or delete.
For retrieval, a query-like algorithm can be elegantly simple.

Without any brain-cramping, one can create simple map-reduce style processing for log files. See "Map Reduce -- How Cool is that?" for a snippet of Python code that turns each row of an Apache log file into a record-like tuple. It also shows how to scan multiple files and directories in simple map-reduce style loops.

Interestingly, we would probably loose considerable performance if we tried to load a log file into an RDBMS table. Why? The RDBMS file for a table that represents a given log file is much, much larger than the original file. Reading a log file directly involves far fewer physical I/O operations than the table.

Here's something that I can't answer for them without digging into their requirements.

A "filter" could be considered as a DELETE. Or a DELETE can be used to implement a filter. Indeed, the SQL DELETE may work by changing a row's status, meaning the the SQL DELETE operation is actually a filter that rejects deleted records from future queries.

Which is it? Filter or Delete? This little conundrum seems to violate the spirit of their request, also.

Python Code

Here's an example of using persistence to filter the "raw" log files. We keep the relevant events and write these in a more regular, easier-to-parse format. Or, perhaps, we delete the irrelevant records. In this case, we'll use CSV file (with quotes and commas) to speed up future parsing.

We might have something like this:

log_row_pat= re.compile( r'(\d+\.\d+\.\d+\.\d+) (\S+?) (\S+?) (\[[^\]]+?]) ("[^"]*?") (\S+?) (\S+?) ("[^"]*?") ("[^"]*?")' )

def log_reader( row_source ):
for row in row_source:
m= log_row_pat.match( row )
if m is not None:
yield m.groups()

def some_filter( source ):
for row in source:
if some_condition(row):
yield row

with open( subset_file, "w" ) as target:
with open( source_file ) as source:
rdr= log_reader( source )
wtr= csv.writer( target )
wtr.writerows( some_filter( rdr ) )

This is a amazingly fast and very simple. It uses minimal memory and results in a subset file that can be used for further analysis.

Is the filter operation really a DELETE?

This should not be new; it should not even be interesting.

As far as I can tell, they were asking me to show them how is data processing can be done outside a relational database. This seems obvious beyond repeating. Obvious to the point where it's hard to imagine what knowledge gap needs to be filled.

Conclusion

Persistence is not a thing you haphazardly laminate onto an application as an afterthought.

Data Manipulation is not a reductionist thing that has exactly four verbs and no more.

Persistence—like security, auditability, testability, maintainability—and all the quality attributes—is not a checklist item that you install or decline.

Without tangible, specific use cases, it's impossible to engage in general hand-waving about data manipulation and persistence. The answers don't generalize well and depend in a very specific way on the nature of the problem and the use cases.

Tuesday, July 16, 2013

How Managers Say "No": The RDBMS Hegemony Example

Got an email looking for help in attempting break through the RDBMS Hegemony. It's a little confusing, but this is the important part of how management says "no".

"Their response was nice but can you flush [sic] it out more"

[First: the word is "flesh": "flesh it out." Repeat after me: "Flesh it out," "Flesh it out," "Flesh it out." Flesh. Put flesh on the bones. No wonder your presentation went nowhere, either you or the manager or both need help. English as a second language is only an excuse if you never read anything in English.]

There's a specific suggestion for this "more". But it indicates a profound failure to grasp the true nature of the problem. It amounts to a drowning person asking us to throw them a different colored brick. It's a brick! You want a life preserver! "No," they insist, "I want a brick to build steps to climb out."

Yes, RDBMS Hegemony is a real problem. I've talked about it before "Hadoop and SQL/Relational Hegemony". Others have noted it: "NoSQL and NewSQL overturning the relational database hegemony". You can read more concrete details in articles like this: "Introduction to Non-Relational Data Storage using Hbase".

RDBMS Hegemony is most visible when every single in-house project seems to involve the database. And some of those uses of the database are clearly inappropriate.

For example, trying to mash relatively free-form "documents" into an RDBMS is simple craziness. Documents—you know, the stuff created by word processors—are largely unstructured or at best semi-structured. For most RDBMS's, they're represented as Binary Large Objects (BLOBs). To make it possible to process them, you can decorate each document with "metadata" or tags and populate a bunch of RDBMS attributes. Which is fine for the first few queries. Then you realize you need more metadata. Then you need more flexible metadata. Then you need interrelated metadata to properly reflect the interrelationships among the documents. Maybe you flirt with a formal ontology. Then you eventually realize you really should have started with document storage, not a BLOB in an RDBMS.

Yes, some companies offer combo products that do both. The point is this: avoiding the RDBMS pitfall in the first place would have been a big time and money saver. Google Exists. The RDBMS is not the best choice for all problems.

The problem is this:

Getting away from RDBMS Hegemony requires management thinking and action.
Management thinking is a form of pain.
Management action is a form of pain.
Managers hate pain.

In short, the only way to make progress away from the RDBMS is to create or expose existing pain. Or make it possible for the manager to avoid pain entirely.

Let's look at the various approaches.

Doing A "Presentation"

The email hinted at a conversation or presentation on the problem of RDBMS Hegemony.

"I finally convinced my current client that RDBMS's are expensive in terms of adding another layer to the archtiecture [sic] and then trying to maintain it."

It's not clear from the email what the details of this conversation or presentation were, but it clearly involved the two key technical points (1) the RDBMS has specific use cases, and (2) not all applications fit those use cases.

However. Those two key technical points involve no real management pain.

Real pain comes from cost. And since the RDBMS license is usually site-wide, there's no obvious cost to the technology.

The labor cost for DBA support, similarly, is side-wide and already in the budget. So there's no obvious cost to the labor.

No cost means no pain. No pain means no change.

Asking a manger to think, however, causes actual pain. Managers want technical people to do the thinking for them.

Asking a manager to consider the future means they may have to take action in the future. That's potential pain.

Either way, a management presentation on database hegemony is pure pain. No useful action will ever come from a simple, direct encapsulation of how the RDBMS is not really the universal data tool. Management said "no" by asking for more information.

We'll return to the "more information" part below.

It was good to start the conversation.

It's good to continue the conversation. But the specific request was silliness.

Exposing the Existing Pain

What's more important than a hypothetical conversation is showing how the RDBMS is causing pain right now. It's easier to convince managers of the hidden cost of the RDBMS by exposing existing actual pain in the current environment. And it has to be a level of pain that exceeds the pain of thinking and taking action.

What's most clear is a specific and avoidable labor cost. Ideally, this specific—and avoidable—labor cost will obviously be associated with something obviously database-related. It must be obvious or it won't yield a technology-related management understanding. If it's not obvious, management will say "no", by asking for more data; they'll claim it's people or process or measurement error.

The best place to look for avoidable labor is break-fix problem reports, bugs and enhancements. Another good source of avoidable costs are schema migrations: waiting for the DBA's to add columns to a table, or add tables to a database.

If you can point to specific trouble tickets that come from wrong use of an RDBMS, then you might be able to get a manager to think about it.

The Airtight Case

Your goal on breaking RDBMS Hegemony is to have a case that is "airtight". Ideally, so airtight that the manager in question sits up, takes notice, and demands that a project be created to rip out the database and save the company all that cost. Ideally, their action at the end of the presentation is to ask how long it will take to realize the savings.

Ideally.

It is actually pretty easy to make an airtight case. There are often a lot of trouble tickets and project delays due to overuse and misuse of the RDBMS.

However.

Few managers will actually agree to remove the RDBMS from an application that's limping along. Your case may be airtight, and compelling, and backed with solid financials, but that's rarely going to result in actual action.

"If it ain't broke, don't fix it," is often applied to projects with very high thresholds for broken. Very high.

This is another way management says "no". By claiming that the costs are acceptable or the risk of change is unacceptable. Even more farcical claims will often be made in favor of the status quo. They may ask for more cost data, but it's just an elaborate "no".

It's important to make the airtight case.

It's important to accept the "no" gracefully.

Management Rewards

When you look at the management reward structure, project managers and their ilk are happiest when they have a backlog of huge, long-running projects that involve no thinking and no action. Giant development efforts with stable requirements, unchallenging users, mature technology and staff who don't mind multiple-hour status meetings.

A manager with a huge long-running project feels valuable. When the requirements, people and technology are stable, then thinking is effectively prevented.

Suggesting that technology choices are not stable introduces thinking. Thinking is pain. The first response to pain is "no". Usually in the form of "get more data."

Making a technology choice may require that a manager facilitate a conversation which selects among competing technology choices. That involves action. And possible thinking.

Real Management Pain. The response? Some form of "no".

Worse. (And it does get worse.)

Technology selection often becomes highly political. The out-of-favor project managers won't get projects approved because of "risky technology." More Management Pain.

War story. Years ago, I watched the Big Strategic Initiative shot down in flames because it didn't have OS/370 as the platform. The "HIPPO" (Highest Paid Person's Opinion) was that Unix was "too new" and that meant risk. Unix predates OS/370 by many years. When it comes to politics, facts are secondary.

Since no manager wants to think about potential future pain, no manager is going to look outside the box. Indeed, they're often unwilling to look at the edge of the box. The worst are unwilling to admit there is a box.

The "risk" claim is usually used to say "no" to new technology. Or. To say "no" to going back to existing, well-established technology. Switching from database BLOBs to the underlying OS file system can turn into a bizzaro-world conversation where management is sure that the underlying OS file system is somehow less trustworthy than RDBMS BLOBs. The idea that the RDBMS is using the underlying file system for persistence isn't a compelling argument.

It's important to challenge technology choices for every new project every time.

It's necessary to accept the "no" gracefully.

The "stop using the database for everything" idea takes a while to sink in.

Proof Of Concept

The only way to avoid management pain (and the inaction that comes from pain avoidance) is to make the technology choice a fait accompli.

You have to actually build something that actually works and passes unit tests and everything.

Once you have something which works, the RDBMS "question" will have been answered. But—and this is very important—it will involve no management thought or action. By avoiding pain, you also default into a kind of management buy-in.

War Story

The vendors send us archives of spreadsheets. (Really.) We could unpack them and load them into the RDBMS. But. Sadly. The spreadsheets aren't consistent. We either have a constant schema migration problem adding yet another column for each spreadsheet, or we have to get rid of the RDBMS notion of a maximalist schema. We don't want the schema to be an "at most" definition; we'd need the schema be an "at least" that tolerates irregularity.

It turns out that the RDBMS is utterly useless anyway. We're barely using any SQL features. The vendor data is read-only. We can't UPDATE, INSERT or DELETE under any circumstances. The delete action is really a ROLLBACK when we reject their file and a CREATE when they send us a new one.

We're not using any RDBMS features, either. We're not using long-running locks for our transactions; we're using low-level OS locks when creating and removing files. We're not auditing database actions; we're doing our own application logging on several levels.

All that's left are backups and restores. File system backups and restores. It turns out that a simple directory tree handles the vendor-supplied spreadsheet issue gracefully. No RDBMS used.

We had—of course—originally designed a lot of fancy RDBMS tables for loading up the vendor-supplied spreadsheets. Until we were confronted with reality and the inconsistent data formats.

We quietly stopped using the RDBMS for the vendor-supplied data. We wrote some libraries to read the spreadsheets directly. We wrote application code that had methods with names like "query" and "select" and "fetch" to give a SQL-like feel to the code.

Management didn't need to say "no" by asking for more information. They couldn't say no because (a) it was the right thing to do and (b) it was already done. It was cheaper to do it than to talk about doing it.

Failure To See The Problem

The original email continued to say this:

"how you can achieve RDBMS like behavior w/out an actual RDBMS"

What? Or perhaps: Why?

If you need RDBMS-like behavior, then you need an RDBMS. That request makes precious little sense as written. So. Let's dig around in the email for context clues to see what they really meant.

"consider limting [sic] it to
1) CREATE TABLE
2) INSERT
3) UPDATE
An update requires a unique key. Let's limit the key to contain only 1 column.
4) DELETE
A delete requires a unique key. Let's limit the key to contain only 1 column."

Oh. Apparently they really are totally fixated on SQL DML.

It appears that they're unable to conceive of anything outside the SQL DML box.

As noted in the above example, INSERT, UPDATE and DELETE are not generic, universal, always-present use cases. For a fairly broad number of "big data" applications, they're not really part of the problem.

The idea that SQL DML CRUD processing forms a core or foundational set of generic, universal, always-present use cases is part of their conceptual confusion. They're deeply inside the SQL box wondering how they can get rid of SQL.

Back to the drowning person metaphor.

It's actually not like a drowning person asking for a different colored brick because they're building steps to walk out.

It's like a person who fell face down in a puddle claiming they're drowning in the first place. The brick vs. life preserver question isn't relevant. They need to stand up and look around. They're not drowning. They're not even in very deep water.

They've been laying face-down in the puddle so long, they think it's as wide as the ocean and as deep as a well. They've been down so long it looks like up.

Outside the SQL Box

To get outside the SQL box means to actually stop using SQL even for metaphoric conversations about data manipulation, persistence, transactions, auditing, security and anything that seems relevant to data processing.

To FLESH OUT ["flesh", the word is "flesh"] the conversation on breaking the SQL Hegemony, you can't use hypothetical hand-waving. You need tangible real-world requirements. You need something concrete, finite and specific so that you can have a head-to-head benchmark shootout (in principle) between an RDBMS and something not an RDBMS.

You may never actually build the RDBMS version for comparison. But you need to create good logging and measurement hooks around your first noSQL application. The kind of logging and measurement you'd use for a benchmark. The kind of logging and measurement that will prove it actually works outside the RDBMS. And it works well: reliably and inexpensively.

This is entirely about asking for forgiveness instead of asking for permission.

Managers can't give permission, it involves too much pain.

They can offer forgiveness because it requires neither thinking nor action.

Thursday, November 11, 2010

Hadoop and SQL/Relational Hegemony

Here's a nice article on why Facebook, Yahoo and eBay use Hadoop: "Asking Any Question Of All Your Data".

The article has one tiny element of pandering to the SQL hegemonists.

Yes, it sounds like a conspiracy theory, but it seems like there really are folks who will tell you that the relational database is effectively perfect for all data processing and should not be questioned. To bolster their point, they often have to conflate all data processing into one amorphous void. Relational transactions aren't central to all processing, just certain elements of data processing. There, I said it.

Here's the pandering quote: "But this only works if the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner".

What?

Is he saying that relational databases do not impose the same constraint?

Clearly, the RDBMS has the same "catch". The relational database only works if "...the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner."

Pandering? Really?

Here's why it seems like the article is pandering. Because it worked. It totally appealed to the target audience. I saw this piece because a DBA -- a card-carrying member of the SQL Hegemony cabal -- sent me the link, and highlighted two things. The DBA highlighted the "powerful enough" quote.

As if to say, "See, it won't happen any time soon, Hadoop is too resource intensive to displace the RDBMS."

Which appears to assume that the RDBMS isn't resource intensive.

Further, the DBA had to add the following. "The other catch which is not stated is the skill level required of the people doing the work."

As if to say, "It won't happen any time soon, ordinary programmers can't understand it."

Which appears to assume that ordinary programmers totally understand SQL and the relational model. If they did understand SQL and the relational model perfectly, why would we have DBA's? Why would we have performance tuning? Why would we have DBA's adjusting normalization to correct application design problems?

Weaknesses

So the weaknesses of Hadoop are that it (a) demands resources and (b) requires specialized skills. Okay. But isn't that the exact same weakness as the relational database?

Which causes me to ask why an article like this has to pander to the SQL cabal by suggesting that Hadoop requires a big compute engine? Or is this just my own conspiracy theory?

Thursday, June 24, 2010

TDD and Python

First, let me say that TDD rocks.

Few things are as much fun as (1) writing a test script for a feature, and then (2) debugging the feature incrementally until it passes the test. It's fun because a great deal of hand-wringing and over-thinking is taken off the table.

To paraphrase Obi-Wan Kenobi:

Use The Test, Luke.

The essence of TDD is a pleasant two-step process: write tests, write code.

However, leaving things at this simplistic level isn't appropriate.

Code Quality

Most folks describe TDD as a 3-step process. I like to call this "red-green-gold" (The Lithuanian Flag version of TDD.)

Tests don't pass (red).
Tests pass (green).
Refactor the code until things look good (gold).

The point here is that once you have tests that pass, you can trivially engage in refactoring and other engineering tasks to improve the overall quality of the code. You can optimize or make it more readable or more reusable without breaking it.

Even this isn't quite right.

Test Quality

The issue with a too-simplistic view TDD is that we walk a fine line.

Over-engineering the tests.
Under-engineering the tests.

We can -- trivially -- fall into the trap of wringing our hands over every potential nuance of our new piece of code. We can be stalled writing tests. Often we hear complaints from folks who fall into this trap. They spend too much time writing tests and indict all of TDD because they dove into details too early in the process.

We can -- equally easily -- fall into the trap of failing to write suitably robust tests for our software.

TDD is really a 3+1 step process.

Write tests, which don't pass (Red).
Write code until tests pass (Green).
(a) Clean up code to improve quality features. (b) Expand tests to add an appropriate level of robustness.

The operating word here is "appropriate".

Costs and Benefits

Some modules -- because of risk or complexity or visibility -- require extensive testing. Some modules don't require this.

Interestingly, portability -- even in Python -- requires some care in testing. It turns out that MySQL and SQLite are not completely identical in their behavior.

Omitting an order-by in a query can "work by accident" in one database and fail in another. So we need appropriate testing to ferret out these RDBMS-specific issues. Until we have the appropriate level of testing we have an application that works in SQLite but fails in MySQL.

The initial gut reaction can sometimes be "TDD failed us".

But this isn't true. TDD actually helped us by (1) identifying code which passed on one platform and failed on another, and (2) leading us to beef up all tests which depend on ordering. Pleasantly, there aren't many.

Friday, September 4, 2009

RDBMS Issues and Concerns

Check out this blog post: http://cacm.acm.org/browse-by-subject/data-storage-and-retrieval/32212-the-end-of-a-dbms-era-might-be-upon-us/fulltext

The first issue is that the RDBMS code base in ancient. The second issue is that we keep pushing the envelope on the RDBMS model; examples include OLAP and RDF triple-stores.

Some folks want to say "reports of the death of the RDBMS are premature."

Like COBOL, the relational model, and words like "DASD", some technologies will be with us for decades after their useful life.

The decline of COBOL and the Relational Database will be protracted, painful, inevitable and asymptotic with actual death. The old one-size-fits-all COBOL is being replaced by many other languages. Similarly, the one-size-fits-all RDBMS will be fragmented into more specialized data stores. Further, legacy technology never completely goes away.

S.Lott-Software Architect

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.