close

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label star-schema. Show all posts
Showing posts with label star-schema. Show all posts

Tuesday, May 29, 2012

Mid-Atlantic Design Expo (MADExpo)


220x250_4.jpg

I'm looking forward to this.  I'll be talking about Python.  

  • A Python 3.2 tutorial.  I did a dry run with the help of the 757 Python group.  Made a bunch of changes based on their input.
  • A more in-depth presentation on a good architecture of Database Schema Migration scripts.  You know, the "open heart surgery" of a database where the structure changes in some way that's not a trivial ALTER statement.

Thursday, September 1, 2011

Data Warehousing and SQL -- Tread Carefully


"Are you implying that a scalable Data Warehouse solution could be implemented using Python and serialised files?"

Not "implying".  I'm trying to state it as clearly as I can.

A scalable data warehouse solution involves a lot of flat file processing.

ETL, for example, is mostly a flat-file pipeline.  It starts with source application extract (to create a flat file) and proceeds through a number of transformation steps to filter, cleanse, recode, conform dimensions, and eventually relate facts to dimensions.  This is generally very, very fast when done with simple flat files and considerably slower when done with a database.

This is the "Data Warehouse Bus" that Kimball describes in chapter 9 of The Data Warehouse Lifecycle Toolkit.

Ultimately, the cleansed, conformed files will lay around in a "staging area" forever.  When a datamart is built, then a subset of these files can be (rapidly) loaded into an RDBMS for query processing.

Doing this in Python is no different from doing it in Java, C++ or (for that matter) Syncsort.  Yes.  You can build a data warehouse using processing steps written around Syncsort and be quite successful.

The important part of this is to recognize the following.

When trying to do data warehouse flat-file processing in C++ (or Java) you have the ongoing schema maintenance issue.  The source data changes.  You must tweak the schema mapping from source to warehouse.  You can encode this schema mapping as property files or some such, or you can simply use an interpreted language like Python and encode the mappings as Python code.

The "Data Warehouse Bus" is a lot of applications that are trivially written as simple, parallel, multi-processing, small, read-match-write programs.  Forget threads.  Simply use heavy-weight, OS-level processes so that you can maximize the I/O bandwidth.  (Remember: when one thread makes an I/O request, the entire process waits; an I/O-bound application isn't helped by multi-threading.)

    with open('some_data','rb') as source:
        rdr= csv.DictReader( source )
        wtr= csv.DictWriter( sys.stdout, some_schema )
        for row in rdr:
            if exclude( row ): continue
            clean = cleanse( row )
            wtr.writerow( clean )

This example writes to stdout so that it can be connected in a pipeline with other steps in the processing.  Programs running in an OS pipeline run concurrently.  They tie up all the cores available without any real programming effort other than decomposing the problem into discrete parallel steps that apply to each row being touched.

Simple file processing is much, much faster than SQL processing.  Why?  No overheads for locking or buffer pooling or rollback segments, or logging, or after-image journaling or deadlock detection, etc.

Note that a data warehouse database has no need for sophisticated locking.  All of the "updates" are bulk loads.  80% of the activity is "insert".  With some Slowly Changing Dimension (SCD) operations there is a trivial status-change update, but this can be handled with a single database-wide lock during insert.

The primary reason for using SQL is to handle "SELECT something ... GROUP BY" queries.  SQL does this reasonably well most of the time.  Python does it pretty well, also.

    sum_col1 = defaultdict( float )
    count_group = defaultdict( int )
    with connection.cursor() as c:
        c.execute( "SELECT COL1, GROUP FROM..." )
        for row in c.fetchall():
            sum_col1[row.group] += col1
            count_group[row.group] += 1
    print( sum_col1, count_group )

That's clearly wordier than SQL.  But not much wordier.  The SELECT statement embedded in the Python is simpler because it omits the GROUP BY clause.  Since it's simpler, it's more likely to benefit from being reused in the RDBMS.

The Python may actually run faster than a pure SQL query because it avoids the (potentially expensive) RDBMS sort step.  The Python defaultdict (or Java HashMap) is how we avoid sorting.  If we need to present the keys in some kind of user-friendly order, we have limited the sort to just the distinct key values, not the entire join result.

Because of the huge cost of group by, there are two hack-arounds.  One is "materialized views".  The idea is that a group-by view is updated when the base tables are updated to avoid the painful cost of sorting at query time.  In addition to this, there are reporting tools which are "aggregate aware".  They can leverage the materialized view to avoid the sort.

How about we avoid all the conceptual overhead of materialized views and aggregate aware reporting. Instead we can write simple Python procedures that do the processing we want.

Bottom Line

Data Warehouse does not imply SQL.  Indeed, it doesn't even suggest SQL except for datamart processing of flexible ad-hoc queries where there's enough horsepower to endure all the sorting.

Monday, October 12, 2009

Sometimes the universe appears multidimensional -- but isn't

Had a knock-down drag-out fight with another architect recently over "status" and "priority".

She claimed that the backlog priority and the status where the same thing. I claimed that you can easily have this.

Priority: 1, Status: Not Started
Priority: 2, Status: In Process
Priority: 3, Status: Completed

See? It's obvious that they're independent dimensions.

She said that it's just as obvious that you're doing something wrong.

Here's her point:
  • If you have priority 1 items that aren't in process now, then they're really priority 2. Fix them to honestly say priority 2.
  • If you have priority 2 items that "somehow" jumped ahead of priority 1 items, they were really priority 1. Fix them to say priority 1. And don't hand her that "in the real world, you have managers or customers that invert the priorities". Don't invert the priorities, just change them and be honest about it.
  • The only items that are done must have been priority 1, passed through an "in-process" state and then got finished. Once they're done, they're not priority 1 any more. They're just done.
  • Things that hang around in "in-process, not done" have two parts. The part that's done, and some other part that's in the backlog and not priority 1.
She says that priority and status are one thing with the following values.
  • Done.
  • Priority 1 = in process right now.
  • Priority 2 = will be in process next. Not eventually. Next.
  • Priority 3 through ∞ = eventually, in order by priority.
Any more complex scheme is simply misleading (Priority 1 not being done right now? Is it a resource issue? A priority issue? Why aren't you doing it?)