close

Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label hadoop. Show all posts
Showing posts with label hadoop. Show all posts

Tuesday, July 21, 2015

A Surprising Confusion

Well, it was surprising to me.

And it should not have been a surprise.

This is something I need to recognize as a standard confusion. And rewrite some training material to better address this.

The question comes up when SQL hackers move beyond simple queries and canned desktop tools into "Big Data" analytics. The pure SQL lifestyle (using spreadsheets, or Business Objects, or SAS) leads to an understanding of data that's biased toward working with collections in an autonomous way.

Outside the SELECT clause, everything's a group or a set or some kind of collection. Even in spreadsheet world, a lot of Big Data folks slap summary expressions on the top of a column to show a sum or a count without too much worry or concern.

But when they start wrestling with Python for loops, and the atomic elements which comprise a set (or list or dict), then there's a bit of confusion that creeps in.

An important skill is building a list (or set or dict) from atomic elements. We'll often have code that looks like this:

some_list = []
for source_object in some_source_of_objects:
    if some_filter(source_object):
        useful_object = transform(source_object)
        some_list.append(useful_object)

This is, of course, simply a list comprehension. In some case, we might have a process that breaks one of the rules of using a generator and doesn't work out perfectly cleanly as a comprehension. This is somewhat more advanced topic.

The transformation step is what seems to causes confusion. Or -- more properly -- it's the disconnect between the transformation calculations on atomic items and the group-level processing to accumulate a collection from individual items.

The use of some_list.append() and some_list[index] and some_list is something that folks can't -- trivially -- articulate. The course material isn't clarifying this for them. And (worse) leaping into list comprehensions doesn't seem to help.

These are particularly difficult to explain if the long version isn't clear.

some_list = [transform(source_object) for source_object in some_source_of_objects if some_filter(source_object)]

and

some_list = list( map(transform, filter(some_filter, some_source_of_objects)) )

I'm going to have to build some revised course material that zeroes in on the atomic vs. collection concepts. What we do with an item (singular) and what we do with a list of items (plural).

I've been working with highly experienced programmers too long. I've lost sight of the n00b questions.

The goal is to get to the small data map-reduce. We have some folks who can make big data work, but the big data Hadoop architecture isn't ideal for all problems. We have to help people distinguish between big data and small data, and switch gears when appropriate. Since Python does both very nicely, we think we can build enough education to school up business experts to also make a more informed technology choice.


Thursday, November 11, 2010

Hadoop and SQL/Relational Hegemony

Here's a nice article on why Facebook, Yahoo and eBay use Hadoop: "Asking Any Question Of All Your Data".

The article has one tiny element of pandering to the SQL hegemonists.

Yes, it sounds like a conspiracy theory, but it seems like there really are folks who will tell you that the relational database is effectively perfect for all data processing and should not be questioned. To bolster their point, they often have to conflate all data processing into one amorphous void. Relational transactions aren't central to all processing, just certain elements of data processing. There, I said it.

Here's the pandering quote: "But this only works if the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner".

What?

Is he saying that relational databases do not impose the same constraint?

Clearly, the RDBMS has the same "catch". The relational database only works if "...the underlying data storage and compute engine is powerful enough to operate on a large dataset in a time-efficient manner."

Pandering? Really?

Here's why it seems like the article is pandering. Because it worked. It totally appealed to the target audience. I saw this piece because a DBA -- a card-carrying member of the SQL Hegemony cabal -- sent me the link, and highlighted two things. The DBA highlighted the "powerful enough" quote.

As if to say, "See, it won't happen any time soon, Hadoop is too resource intensive to displace the RDBMS."

Which appears to assume that the RDBMS isn't resource intensive.

Further, the DBA had to add the following. "The other catch which is not stated is the skill level required of the people doing the work."

As if to say, "It won't happen any time soon, ordinary programmers can't understand it."

Which appears to assume that ordinary programmers totally understand SQL and the relational model. If they did understand SQL and the relational model perfectly, why would we have DBA's? Why would we have performance tuning? Why would we have DBA's adjusting normalization to correct application design problems?

Weaknesses

So the weaknesses of Hadoop are that it (a) demands resources and (b) requires specialized skills. Okay. But isn't that the exact same weakness as the relational database?

Which causes me to ask why an article like this has to pander to the SQL cabal by suggesting that Hadoop requires a big compute engine? Or is this just my own conspiracy theory?