
Tuesday, January 05, 2010

Game-Changing Architectural Advances Take Data Analytics to New Performance Heights

Transcript of a BriefingsDirect podcast on how new architectural advances in collocating applications with data provide analytics performance breakthroughs.

Listen to the podcast. Find it on iTunes/iPod and Podcast.com. Download the transcript. Learn more. Sponsor: Aster Data Systems.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect.

Today, we present a sponsored podcast discussion on how new architectures for data and logic processing are ushering in a game-changing era of advanced analytics.

These new approaches support massive data sets to produce powerful insights and analysis, yet with unprecedented price-performance. As we enter 2010, enterprises are including more forms of diverse data into their business intelligence (BI) activities. They're also diversifying the types of analysis that they expect from these investments.

We're also seeing more kinds and sizes of companies and government agencies seeking to deliver ever more data-driven analysis for their employees, partners, users, and citizens. It boils down to giving more communities of participants what they need to excel at whatever they're doing. By putting analytics into the hands of more decision makers, huge productivity wins across entire economies become far more likely.

But such improvements won’t happen if the data can't effectively reach the application's logic, if the systems can't handle the massive processing scale involved, or if the total costs and complexity are too high.

In this discussion we examine how convergence of data and logic, of parallelism and MapReduce -- and of a hunger for precise analysis with a flood of raw new data -- all are setting the stage for powerful advanced analytics outcomes.

Here to help us learn how to attain advanced analytics, and to uncover the benefits from these new architectural activities for ubiquitous BI, is Jim Kobielus, senior analyst at Forrester Research. Welcome, Jim.

Jim Kobielus: Hi, Dana. Hi, everybody.

Gardner: We're also joined by Sharmila Mulligan, executive vice president of marketing at Aster Data. Welcome, Sharmila.

Sharmila Mulligan: Thank you. Hello, everyone.

Gardner: Jim, let me start with you. We're looking at a shift now, as I have mentioned, in response to oceans of data and the need for analysis across different types of applications and activities. What needs to change? The demands are there, but what needs to change in terms of how we provide the solution around these advanced analytical undertakings?

Rethinking platforms

Kobielus: First, Dana, we need to rethink the platforms with which we're doing analytical processing. Data mining is traditionally thought of as being the core of advanced analytics. Generally, you pull data from various sources into an analytical data mart.

That analytical data mart is usually on a database that's specific to a given predictive modeling project, let's say a customer analytics project. It may be a very fast server with a lot of compute power for a single server, but quite often what we call the analytical data mart is not the highest performance database you have in your company. Usually, that high performance database is your data warehouse.

As you build larger and more complex predictive models -- and you have a broad range of models and a broad range of statisticians and others building, scoring, and preparing data for these models -- you quickly run into resource constraints on your existing data-mining platform, really. So, you have to look for where you can find the CPU power, the data storage, and the I/O bandwidth to scale up your predictive modeling efforts. That's the number one thing. The data warehouse is the likely suspect.

Also, you need to think about the fact that these oceans of data need to be prepared, transformed, cleansed, meshed, merged, and so forth before they can be brought into your analytical data mart for data mining and the like.

Quite frankly, the people who do predictive modeling are not specialists at data preparation. They have to learn it and they sometimes get very good at it, but they have to spend a lot of time on data mining projects, involved in the grunt work of getting data in the right format just to begin to develop the models.

As you start to rethink your whole advanced analytics environment, you have to think through how you can automate to a greater degree all these data preparation, data loading chores, so that the advanced analytics specialists can do what they're supposed to do, which is build and tune models of various problem spaces. Those are key challenges that we face.

But there is a third challenge: advanced analytics produces predictive models, and those predictive models increasingly are deployed in-line to transactional applications, like your call center, to provide some basic logic and rules that will drive such important functions as "next best offer" being made to customers based on a broad variety of historical and current information.

How do you inject predictive logic into your transactional applications in a fairly seamless way? You have to think through that, because, right now, quite often analytical data models, predictive models, in many ways are not built for optimal embedding within your transactional applications. You have to think through how to converge all these analytical models with the transactional logic that drives your business.

Gardner: Okay. Sharmila, are your users or the people that you talk to in the market aware that this shift is under way? Do they recognize that the same old way of doing things is not going to sustain them going forward?

New data platform

Mulligan: What we see with customers is that the advanced analytics needs and the new generation of analytics that they are trying to do are driving the need for a new data platform.

Previously, the choice of a data management platform was based primarily on price-performance, being able to effectively store lots of data, and get very good performance out of those systems. What we're seeing right now is that, although price performance continues to be a critical factor, it's not necessarily the only factor or the primary thing driving their need for a new platform.

What's driving the need now, and one of the most important criteria in the selection process, is the ability of this new platform to support very advanced analytics.

Customers are very precise in terms of the type of analytics that they want to do. So, it's not that a vendor needs to tell them what they are missing. They are very clear on the type of data analysis they want to do, the granularity of data analysis, the volume of data that they want to be able to analyze, and the speed that they expect when they analyze that data.

They are very clear on what their requirements are, and those requirements are coming from the top. Those new requirements, as they relate to data analysis and advanced analytics, are driving the selection process for a new data management platform.

There is a big shift in the market, where customers have realized that their preexisting platforms are not necessarily suitable for the new generation of analytics that they're trying to do.

Gardner: Let's take a pause and see if we can't define these advanced analytics a little better. Jim, what do we mean nowadays when we say "advanced analytics?"

Kobielus: Different people have their definitions, but I'll give you Forrester's definition, because I'm with Forrester. And, it makes sense to break it down into basic analytics versus advanced analytics.

What is basic analytics? Well, that's BI. It's the core of BI that you build your decision support environment on. That's reporting, query, online analytical processing, dashboarding, and so forth. It's fairly clear what's in the core scope of BI.

Traditional basic analytics is all about analytics against deep historical datasets and being able to answer questions about the past, including the past up to the last five seconds. It's the past that's the core focus of basic analytics.

What's likely to happen

Advanced analytics is focused on how to answer questions about the future. It's what's likely to happen -- forecast, trend, what-if analysis -- as well as what I like to call the deep present, really current streams for complex event processing. What's streaming in now? And how can you analyze the great gushing streams of information that are emanating from all your applications, your workflows, and from social networks?

Advanced analytics is all about answering future-oriented, proactive, or predictive questions, as well as current streaming, real-time questions about what's going on now. Advanced analytics leverages the same core features that you find in basic analytics -- all the reports, visualizations, and dashboarding -- but then takes it several steps further.

First and foremost, it's all about amassing a data warehouse or a data mart full of structured and unstructured information and being able to do both data mining against the structured information, and text analytics or content analytics against the unstructured content.

Then, in the unstructured content, it's being able to do some important things, like natural language processing to look for entities and relationships and sentiments and the voice of the customer, so you can then extrapolate or predict what might happen in the future. What might happen if you make a given offer to a given customer at a given time? How are they likely to respond? Are they likely to jump to the competition? Are they likely to purchase whatever you're offering? All those kinds of questions.

Gardner: Sharmila, do you have anything to offer further on defining advanced analytics in this market?

Mulligan: Before I go into advanced analytics, I'd like to add to what Jim just talked about on basic analytics. The query and reporting aspect continues to be very important, but the difference now is that the size of the data set is far larger than what the customer has been running with before.

What you've got is a situation where they want to be able to do more scalable reporting on massive data sets with very, very fast response times. On the reporting side, in terms of the end result to the customer, it is similar to the type of report they are trying to achieve, but the difference is that the quantity of data that they're trying to get at, and the amount of data that these reports are filling up is far greater than what they had before.

That's what's driving a need for a new platform underneath some of the preexisting BI tools that are, in themselves, good at reporting, but what the BI tools need is a data platform beneath them that allows them to do more scalable reporting than you could do before.

Kobielus: I just want to underline that, Sharmila. What Forrester is seeing is that, although the average data warehouse today is in the 1-10 terabyte range for most companies, we foresee the average warehouse size going, in the middle of the coming decade, into the hundreds of terabytes.

In 10 years or so, we think it's possible, and increasingly likely, that petabyte-scale data warehouses or content warehouses will become common. It's all about unstructured information, deep history, and historical information. A lot of trends are pushing enterprises in the direction of big data.

Managing big data

Mulligan: Absolutely. That is obviously the big topic here, which is, how do you manage big data? And, big data could be structured or it could be unstructured. How do you assimilate all this in one platform and then be able to run advanced analytics on this very big data set?

Going back to what Jim discussed on advanced analytics, we see two big themes. One is the real-time nature of what our customers want to do. There are particular use cases, where what they need is to be able to analyze this data in near real-time, because that's critical to being able to get the insights that they're looking for.

Fraud analytics is a good example of that. Customers have been able to do fraud analytics, but they're running fraud checks after the fact and discovering where fraud took place after the event has happened. Then, they have to go back and recover from that situation. Now, what customers want, is to be able to run fraud analytics in near real-time, so they can catch fraud while it's happening.

What you see is everything from cases in financial services companies related to product fraud to, for example, online gaming sites, where users of the system are collaborating on the site and trying to commit fraud. Those types of scenarios demand a system that can return the fraud analysis data in near real-time, so it can block these users from conducting fraud while it's happening.
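To make the contrast between after-the-fact checks and near real-time blocking concrete, here is a minimal Python sketch of one very simple kind of check, a sliding-window "velocity" rule that rejects an event as it arrives. It is an illustration only, not how any product discussed here works: the `VelocityCheck` class, the thresholds, and the sample event stream are all assumptions made up for the example.

```python
from collections import defaultdict, deque

# Hypothetical thresholds, chosen only for illustration.
WINDOW_SECONDS = 60
MAX_EVENTS_PER_WINDOW = 3


class VelocityCheck:
    """Flags an account that produces too many events within a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_EVENTS_PER_WINDOW):
        self.window = window
        self.limit = limit
        # account id -> timestamps of accepted events still inside the window
        self.recent = defaultdict(deque)

    def allow(self, account, timestamp):
        """Return True if the event may proceed, False if it should be blocked."""
        events = self.recent[account]
        # Drop timestamps that have aged out of the window.
        while events and timestamp - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False  # blocked as the event arrives, not discovered afterwards
        events.append(timestamp)
        return True


check = VelocityCheck()
# Invented sample stream of (account, timestamp in seconds) pairs.
stream = [("acct-1", 0), ("acct-1", 5), ("acct-2", 6),
          ("acct-1", 10), ("acct-1", 12), ("acct-1", 75)]
decisions = [check.allow(account, ts) for account, ts in stream]
print(decisions)
```

The decision is made per event, using only state already held in memory, which is the property that distinguishes this style of check from a batch job that scans the day's transactions afterwards.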

The other big thing we see is the predictive nature of what customers are trying to do. Jim talked about predictive analytics and modeling analytics. Again, that's a big area where we see massive new opportunity and a lot of new demand. What customers are trying to do there is look at their own customer base to be able to analyze data, so that they can predict trends in the future.

For example, what are the buying trends going to be, let's say at Christmas, for consumers who live in a certain area? There is a lot around behavior analysis. In the telco space, we see a lot of deep analysis around trying to model behavior of customers on voice usage of their mobile devices versus data usage.

By understanding some of these patterns and the behavior of the users in more depth, these organizations are now able to better service their customers and offer them new product offerings, new packages, and a higher level of personalization.

Predictive analytics is a term that's existed for a while, and is something that customers have been doing, but it's really reaching new levels in terms of the amount of data that they're trying to analyze for predictive analytics, and in the granularity of the analytics itself in being able to deliver deeper predictive insight and models.

As I said, the other big theme we see is the push toward analysis that's really more near real time than what they were able to do before. This is not a trivial thing to do when it comes to very large data sets, because what you are asking for is the ability to get very, very quick response times and incredibly high performance on terabytes and terabytes of data to be able to get these kinds of results in real-time.

Gardner: Jim, these examples that Sharmila has shared aren't just rounding errors. This isn't a movement toward higher efficiency. These are game changers. These are going to make or break your business. This is going to allow you to adjust to a changing economy and to shifting preferences by your customers. We're talking about business fundamentals here.

Social network analysis

Kobielus: We certainly are. Sharmila was discussing behavioral analysis, for example, and talking about carrier services. Let's look at what's going to be a true game changer, not just for business, but for the global society. It's a thing called social network analysis.

It's predictive models, fundamentally, but it's predictive models that are applied to analyzing the behaviors of networks of people on the web, on the Internet, Facebook, and Twitter, in your company, and in various social network groupings, to determine classification and clustering of people around common affinities, buying patterns, interests, and so forth.
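As a rough illustration of clustering people around common affinities, the Python sketch below groups users whose interest sets overlap strongly. It is a deliberately tiny stand-in for real social network analysis: the user names, interests, the Jaccard similarity measure, and the 0.5 threshold are all assumptions chosen for the example, not anything described in this discussion.

```python
def jaccard(a, b):
    """Similarity of two interest sets: shared items divided by all items."""
    return len(a & b) / len(a | b)


# Invented sample data; real inputs would come from observed interactions.
interests = {
    "ana": {"running", "cycling", "travel"},
    "ben": {"running", "cycling", "cooking"},
    "cho": {"gaming", "film"},
    "dev": {"gaming", "film", "music"},
}

THRESHOLD = 0.5  # assumed cut-off for "similar enough"

clusters = []  # each cluster is a list of user names
for user in sorted(interests):
    for cluster in clusters:
        # Join the first cluster whose founding member is similar enough.
        if jaccard(interests[user], interests[cluster[0]]) >= THRESHOLD:
            cluster.append(user)
            break
    else:
        # No existing cluster matched, so this user starts a new one.
        clusters.append([user])

print(clusters)
```

A production model would weigh many more signals (history, network connections, sentiment), but the basic step of turning observed behavior into segments has the same shape.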

As social networks weave their way into not just our consumer lives, but our work lives, our life lives, social network analysis -- leveraging all the core advanced analytics of data mining and text analytics -- will take the place of the focus group. In an online world, everything is virtual. As a company, you're not going to be able, in any meaningful way, to bring together your users into a single room and ask them what they want you to do or provide for them.

What you're going to do, though, is listen to them. You're going to listen to all their tweets and their Facebook updates and you're going to look at their interactions online through your portal and your call center. Then, you're going to take all that huge stream of event information -- we're talking about complex event processing (CEP) -- you're going to bring it into your data warehousing grid or cloud.

You're also going to bring historical information on those customers and their needs. You're going to apply various social network behavioral analytics models to it to cluster people into the categories that make us all kind of squirm when we hear them, things like yuppie and Generation X and so forth. Professionals in the behavioral or marketing world are very good at creating segmentation of customers, based on a broad range of patterns.

Social network analysis becomes more powerful as you bring more history into it -- last year, two years, five years, 10 years' worth of interactions -- to get a sense for how people will likely respond to new offers, bundles, packages, campaigns, and programs that are thrown at them through social networks.

It comes down to things like Sharmila was getting at, simple things in marketing and sales, such as a Hollywood studio determining how a movie is being perceived by the marketplace, by people who go out to the theater and then come out and start tweeting, or even tweeting while they are in the theater -- "Oh, this movie is terrible" or "This movie rocks."

They can get a sense of how a product or service is being perceived in real-time, so that the provider of that product or service can then turn around and tweak that marketing campaign, the pricing, and incentives in real-time to maximize the yield, the revenue, or profit of that event or product. That is seriously powerful and that's what big data architectures allow you to do.

If you can push not just the analytic models, but to some degree bring transactional applications, such as workflow, into this environment to be triggered by all of the data being developed or being sifted by these models, that is very powerful.

Gardner: We know that things are shifting and changing. We know that we want to get access to the data and analytics. And, we know what powerful things those analytics can do for us. Now, we need to look at how we get there and what's in place that prevents us.

Let's look at this architecture. I'm looking into MapReduce more and more. I am even hearing that people are starting to write MapReduce into their requests for proposals (RFPs), as they're looking to expand and improve their situation. Sharmila, what's wrong with the current environment and why do we need to move into something a bit different?

Moving the data

Mulligan: One of the biggest issues that the preexisting data pipeline faces is that the data lives in a repository that's removed from where the analytics take place. Today, with the existing solutions, you need to move terabytes and terabytes of data through the data pipeline to the analytics application, before you can do your analysis.

There's a fundamental issue here. You can't move boulders and boulders of data to an application. It's too slow, it's too cumbersome, and you're not factoring in all your fresh data in your analysis, because of the latency involved.

One of the biggest shifts is that we need to bring the analytics logic close to the data itself. Having it live in a completely different tier, separate from where the data lives, is problematic. This is not a price-performance issue in itself. It is a massive architectural shift that requires bringing analytics logic to the data itself, so that data is collocated with the analytics itself.

MapReduce, which you brought up earlier, plays a critical role in this. It is a very powerful technology for advanced analytics and it brings capabilities like parallelization to an application, which then allows for very high-performance scalability.
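For readers unfamiliar with the programming model, the following Python sketch shows the general map-then-reduce pattern on toy data. It assumes nothing about any vendor's implementation: the partitions, the `map_partition` and `merge_counts` helpers, and the use of a thread pool as a stand-in for separate worker nodes are all illustrative choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Toy data: each inner list stands in for one partition of a larger data set.
partitions = [
    ["voice", "data", "data"],
    ["data", "voice"],
    ["sms", "data"],
]


def map_partition(records):
    """Map step: count usage types within a single partition."""
    return Counter(records)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Each partition is mapped independently of the others, so the map step can be
# spread across workers. Threads here only stand in for separate nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_partition, partitions))

totals = reduce(merge_counts, partial_counts, Counter())
print(dict(totals))
```

Because no map call depends on another partition's data, adding partitions and workers scales the map step without changing the application logic, which is the source of the parallelism described above.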

What we see in the market these days are terms like "in-database analytics," "applications inside data," and all this is really talking about the same thing. It's the notion of bringing analytics logic to the data itself.

I'll let Jim add a lot more to that since he has developed a lot of expertise in this area.

Gardner: Jim, are we in a perfect world here, where we can take the existing BI applications and apply them to this new architecture of joining logic and data in proximity, or do we have to come up with whole new applications in order to enjoy this architectural benefit?

Kobielus: Let me articulate in a little bit more detail what MapReduce is and is not. MapReduce is, among other things, a set of extensions to SQL -- SQL/MapReduce (SQL/MR). So, you can build advanced analytic logic using SQL/MR that can essentially do the data prep, the data transformations, the regression analyses, the scoring, and so forth, against both structured data in your relational databases and unstructured data, such as content that you may source from RSS feeds and the like.

To the extent that we always, or for a very long time, have been programming database applications and accessing the data through standard SQL, SQL/MR isn't radically different from how BI applications have traditionally been written.

Maximum parallelization

But, these are extensions and they are extensions that are geared towards enabling maximum parallelization of these analytic processes, so that these processes can then be pushed out and be executed, not just in databases, but in file systems, such as the Hadoop Distributed File System, or in cloud data warehouses.

MapReduce, as a programming model and as a language, in many ways, is agnostic as to the underlying analytic database, file system, or cloud environment where the information, as a whole, lives and how it's processed.

But no, you can't take your existing BI applications, in terms of the reporting, query, dashboarding, and the like, transparently move them, and use MapReduce without a whole lot of rewriting of these applications.

You can't just port your existing BI applications to MapReduce and database analytics. You're going to have to do some conversions, and you're going to have to rewrite your applications to take advantage of the parallelism that SQL/MR enables.

MapReduce, in many ways, is geared not so much for basic analytics. It's geared for advanced analytics. It's data mining and text mining. In many ways, MapReduce is the first open framework that the industry has ever had for programming the logic for both data mining and text mining in a seamless way, so that those two types of advanced analytic applications can live and breathe and access a common pool of complex data.

MapReduce is an open standard that Aster clearly supports, as do a number of other database and data warehousing vendors. In the coming year and the coming decade, MapReduce and Hadoop -- and I won't go to town on what Hadoop is -- will become fairly ubiquitous within the analytics arena. And, that’s a good thing.

So, any advanced analytic logic that you build in one tool, in theory, you can deploy and have it optimized for execution in any MapReduce-enabled platform. That’s the promise. It’s not there yet. There are a lot of glitches, but that’s the strong promise.

Mulligan: I'd like to add a little bit to that, Dana. In the marriage of SQL with MapReduce, the real intent is to bring the power of MapReduce to the enterprise, so that SQL programmers can now use that technology. MapReduce alone does require some sophistication in terms of programming skills to be able to utilize it. You may typically find that skill set in Web 2.0 companies, but often you don’t find developers who can work with that in the enterprise.

What you do find in enterprise organizations is that there are people who are very proficient at SQL. By bringing SQL together with MapReduce what enterprise organizations have is the familiarity of SQL and the ease of using SQL, but with the power of MapReduce analytics underneath that. So, it’s really letting SQL programmers leverage skills they already have, but to be able to use MapReduce for analytics.

Important marriage

Over time, of course, it’s possible that there will be more expertise developed within enterprise organizations to use MapReduce natively, but at this time and, we think, in the next couple of years, the SQL/MapReduce marriage is going to be very important to help bring MapReduce across the enterprise.

Hadoop, itself, obviously is an interesting platform too in being able to store lots of data cost effectively. However, often customers will also want some of the other characteristics of a data warehouse, like workload management, failover, backup recovery, etc., that the technology may not necessarily provide.

MapReduce, as available right now in the new generation of massively parallel processing (MPP) data warehouses, does bring kind of the best of both worlds. It brings what companies need in terms of the enterprise data warehouse capabilities. It lets you put application logic near data, as we talked about earlier. And, it brings MapReduce, but through the SQL/MapReduce framework, which really primarily is designed to ease adoption and use of MapReduce within the enterprise.

Gardner: Jim, we are on a journey. It’s going to be several years before we get to where we want to go, but there is more maturity in some areas than others. And, there is an opportunity to take technologies that are available now and produce some really strong business outcomes.

Give me a sense of where you see the maturity of the architecture, of the SQL, and the tools and making these technologies converge? Who is mature? How is this shaking out a little bit?

Kobielus: Maturity is a best practice, in this case in-database analytics. As I said, it’s widely supported through proprietary approaches by many vendors.

In terms of the maturity, it's judged by adoption of an open industry framework with cross-vendor interoperability. It's not mature yet, in terms of MapReduce and Hadoop. There are pioneering vendors like Aster, but there are a significant number of established big data warehousing vendors that have varying degrees of support now or in the near future for these frameworks. We're seeing strong indications. In fact, Teradata already is rolling out MapReduce and Hadoop support in their data warehousing offerings.

We're not yet seeing a big push from Oracle, or from Microsoft for that matter, in the direction of support for MapReduce or Hadoop, but we at Forrester believe that both of those vendors, in particular, will come around in 2010 with greater support.

IBM has made significant progress with its support for Hadoop and MapReduce, but it hasn’t yet been fully integrated into that particular vendor's platform.

Looking to 2010, 2011

If we look at a broad range of other data warehousing vendors like Sybase, Greenplum, and others, most vendors have it on their roadmap. To some degree, various vendors have these frameworks in development right now. I think 2010 and 2011 are the years when most of the data warehousing and also data mining vendors will begin to provide mature, interoperable implementations of these standards.

There is a growing realization in the industry that advanced analytics is more than just being able to mine information at rest, which is what MapReduce and Hadoop are geared to doing. You also need to be able to mine and do predictive analytics against data in motion. That’s CEP. MapReduce and Hadoop are not really geared to CEP applications of predictive modeling.

There needs to be, and there will be over the next five years or so, a push in the industry to embed MapReduce and Hadoop. There are few vendors that are showing some progress toward CEP predictive modeling, but it’s not widely supported yet, and it’s in proprietary approaches.

In this coming decade, we're going to see predictive logic deployed into all application environments, be they databases, clouds, distributed file systems, CEP environments, business process management (BPM) systems, and the like. Open frameworks will be used and developed under more of a service-oriented architecture (SOA) umbrella, to enable predictive logic that’s built in any tool to be deployed eventually into any production, transaction, or analytic environment.

It will take at least 3 to 10 years for a really mature interoperability framework to be developed, for industry to adopt it, and for the interoperability issues to be worked out. It’s critically important that everybody recognizes that big data, at rest and in motion, needs to be processed by powerful predictive models that can be deployed into the full range of transactional applications, which is where the convergence of big data, analytics, and transactions comes in.

Data warehouses, as the core of your analytics environment, need to evolve to become, in their own right, application servers that can handle both the analytic applications of traditional data warehousing, BI, and data mining, as well as the transactional logic, and really handle it all with full security, workload isolation, failover, and so forth in a way that’s seamless.

I'm really excited, for example, by what Aster has rolled out with their latest generation, 4.0 of the Data-Application Server. I see a little bit of progress by Oracle on the Exadata V2. I'm looking forward to seeing if other vendors follow suit and provide a cloud-based platform for a broad range of transactional analytics.

Gardner: Sharmila, Jim has painted a very nice picture of where he expects things to go. He mentioned Aster Data 4.0. Tell us a little bit about that, and where you see the stepping stones lining up.

Mulligan: As I mentioned earlier, one of the biggest requirements in order to be able to do very advanced analytics on terabyte- and petabyte-level data sets, is to bring the application logic to the data itself. Earlier, I described why you need to do this. You want to eliminate as much data movement as possible, and you want to be able to do this analysis in as near real-time as possible.

What we did in Aster Data 4.0 is just that. We're allowing companies to push their analytics applications inside of Aster’s MPP database, where now you can run your application logic next to the data itself, so they are both collocated in the same system. By doing so, you've eliminated all the data movement. What that gives you is very, very quick and efficient access to data, which is what's required in some of these advanced analytics application examples we talked about.

Pushing the code

What kind of applications can you push down into the system? It can be any app written in Java, C, C++, Perl, Python, or .NET. It could be an existing custom application that an organization has written and needs to scale to work on much larger data sets. That code can be pushed down into the Aster database.

It could be a new application that a customer is looking to write to do a level of analysis that they could not do before, like real-time fraud analytics, or very deep customer behavior analysis. If you're trying to deliver these new generations of advanced analytics apps, you would write that application in the programming language of your choice.

You would push that application down into the Aster system, all your data would live inside of the Aster MPP database, and the application would run inside of the same system collocated with the data.

In addition to that, it could be a packaged application. So, it could be an application like SAS that you want to scale to be able to analyze very large data sets. So, you could push a packaged application inside the system as well.

One of the fundamental things that we leverage to allow you to do more powerful analytics with these applications is MapReduce. You don’t have to MapReduce-enable an application when you push it down into the Aster system, but you could choose to and, by doing so, you automatically parallelize the application, which gives you very high performance and scalability when it comes to accessing large data sets. You also then leverage some of the analytics capabilities of MapReduce that are not necessarily inherent in something like SQL.
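As a rough illustration of what MapReduce-enabling existing logic can look like, here is a minimal Python sketch. It is not Aster's API: the function names, the fraud threshold, and the sample records are all hypothetical, and the partitions are processed sequentially rather than on separate nodes. The point is only that the existing logic (`is_suspicious`) stays unchanged while a map step and a reduce step are wrapped around it.

```python
from collections import Counter
from functools import reduce


def is_suspicious(txn):
    """Hypothetical existing application logic, left unchanged."""
    return txn["amount"] > 1000  # assumed threshold, for illustration only


def map_partition(rows):
    """Map step: run the existing logic over one partition of the data."""
    counts = Counter()
    for txn in rows:
        if is_suspicious(txn):
            counts[txn["customer"]] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the partial counts from two partitions."""
    return left + right


def run(partitions):
    # In an MPP system each partition could be processed on the node that
    # stores it; this sketch simply loops over the partitions in order.
    partials = [map_partition(rows) for rows in partitions]
    return reduce(reduce_counts, partials, Counter())


# Made-up sample data, split into two partitions.
partitions = [
    [{"customer": "a", "amount": 1500}, {"customer": "b", "amount": 20}],
    [{"customer": "a", "amount": 2500}, {"customer": "c", "amount": 1200}],
]
result = run(partitions)
print(dict(result))
```

Because each map call touches only its own partition, the work can be spread across however many nodes hold the data, which is the parallelization benefit described above.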

The key components of 4.0 give you a platform that can efficiently and cost-effectively store massive amounts of data, and that also allows you to do very advanced and sophisticated analytics. To run through the key things that we've done in 4.0, the first is the ability to push applications inside the system, so apps are collocated with the data.

We also offer SQL/MapReduce as the interface. Business analysts who are working with this application on a regular basis don’t have to learn MapReduce. They can use SQL/MR and leverage their existing SQL skills to work with that app. So, it makes it very easy for any number of business analysts in the organization to leverage their preexisting SQL skills and work with this app that's pushed down into the system.
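The idea of exposing pushed-down application logic through SQL can be illustrated by analogy. The sketch below uses Python's built-in `sqlite3` module, not Aster's SQL/MR, whose syntax is not shown in this discussion. The `risk_band` function, the `txns` table, and the sample rows are hypothetical. It shows application code registered inside the database engine so that an analyst can call it from an ordinary SQL query.

```python
import sqlite3


def risk_band(amount):
    """Hypothetical application logic: bucket an amount into a risk band."""
    return "high" if amount > 1000 else "low"


conn = sqlite3.connect(":memory:")

# Register the Python function so SQL statements can call it by name.
conn.create_function("risk_band", 1, risk_band)

conn.execute("CREATE TABLE txns (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?)",
    [("a", 1500.0), ("b", 20.0), ("c", 1200.0)],
)

# The analyst writes plain SQL; the application logic runs inside the engine.
rows = conn.execute(
    "SELECT risk_band(amount) AS band, COUNT(*) "
    "FROM txns GROUP BY band ORDER BY band"
).fetchall()
print(rows)
conn.close()
```

The analyst never sees the function body, only a name that can be used in a query, which is the division of labor described above.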

Finally, in order to support the ability to run applications inside the database, which as I said earlier is nontrivial, we added fundamental new capabilities like Dynamic Mix Workload Management. Workload Management in the Aster system works not just on data queries, but on the application processes as well, so you can balance workloads when you have a system that's managing data and applications.

Kobielus: Sharmila, I think the greatest feature of 4.0 is simply the ability to run predictive models developed in SAS or other tools in their native code, without necessarily converting them to SQL/MR. That means that your customers can leverage that huge installed pool of intellectual property, all those models, bring it in, and execute it natively within your distributed grid or cloud, as a way of avoiding that rewrite. Or, if they wish, they can migrate or convert them over to SQL/MR. It's up to them.

That's a very attractive feature, because fundamentally the data warehousing cloud is an analytic application server. Essentially, you want that ability to be able to run disparate legacy models in parallel. That's just a feature that needs to be adopted by the industry as a whole.

The customer decides

Mulligan: Absolutely. I do want to clarify that the Aster 4.0 solution can be deployed in the cloud, or it can be installed in a standard implementation on-premise, or it could be adopted in an appliance mode. We support all three. It's up to the customer which of those deployment models they need or prefer.

To talk in a little bit more detail about what Jim is referring to, the ability to take an existing app, have to do absolutely no rewrite, and push that application down is, of course, very powerful to customers. It means that they can immediately take an analytics app they already have and have it operate on much larger data sets by simply taking that code and pushing it down.

That can be done literally within a day or two. You get the Aster system, you install it, and then, by the second day, you could be pushing your application down.

If you choose to leverage the MapReduce analytics capabilities, then as I said earlier, you would MapReduce-enable an app. This simply means you take your existing application and, again, you don’t have to do any rewrite of that logic. You just add MapReduce functions to it and, by doing so, you have now MapReduce-enabled it. Then, you push it down and you have SQL/MR as an interface to that app.

The process of MapReduce-enabling an app is also very simple. It takes a couple of days. This is not something that takes weeks and weeks to do. It literally can be done in a couple of days.

We had a retailer recently who took an existing app that they had already written, a new type of analytics application that they wanted to deploy. They simply added MapReduce capabilities to it and pushed it down into the Aster system, and it's now operating on very, very large data sets, and performing analytics that they weren't able to originally do.

The ease of application push-down and the ease of MapReduce-enabling are definitely key to what we have done in 4.0, and they allow companies to realize the value of this new type of platform right away.

Gardner: I know it's fairly early in the rollout. Do you have any sense of metrics from some of these users? What do they get back? We talked earlier in the examples about what could be done and what should be done nowadays with analysis. Do you have any sense of what they have been able to do with 4.0?

Reducing processing times

Mulligan: For example, we have talked about customers like comScore who are processing 1.6 billion rows of data on a regular basis, and their data volumes continue to grow. They have many business analysts who operate the system and run reports on a daily basis, and they are able to get results very quickly on a large data set.

We have customers who have gone from 5-10 minute processing times on their data set, to 5 seconds, as a result of putting the application inside of the system.

We have had fraud applications that would take 60-90 minutes to run in the traditional approach, where the app was running outside the database, and now those applications run in 60-90 seconds.

Literally, by collocating your application logic next to the data itself, you can see that you are immediately able to go from many minutes of processing time, down to seconds, because you have eliminated all the data movement altogether. You don’t have to move terabytes of data.
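The figures quoted above can be turned into a quick back-of-the-envelope calculation. The speedup ratios below come directly from the numbers in this discussion. The transfer-time estimate is an assumption added for illustration: it supposes a 1 Gbit/s network link and a decimal terabyte, neither of which was stated in the conversation.

```python
def speedup(before_seconds, after_seconds):
    """Ratio of the old processing time to the new one."""
    return before_seconds / after_seconds


def transfer_seconds(size_bytes, link_bits_per_second):
    """Time to move a data set over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


# Fraud applications: 60-90 minutes down to 60-90 seconds.
fraud_speedup = speedup(60 * 60, 60)

# Reports: 5-10 minutes down to 5 seconds.
report_speedup_low = speedup(5 * 60, 5)
report_speedup_high = speedup(10 * 60, 5)

# Assumed values, not from the discussion: 1 TB = 10**12 bytes, 1 Gbit/s link.
ONE_TERABYTE = 10**12
GIGABIT_LINK = 10**9
move_one_tb = transfer_seconds(ONE_TERABYTE, GIGABIT_LINK)

print(fraud_speedup, report_speedup_low, report_speedup_high, move_one_tb)
```

Under those assumptions, moving a single terabyte to an external application server would take on the order of hours before any analysis begins, which is consistent with the claim that eliminating data movement accounts for much of the gain.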

Add to that the fact that you can now access terabyte-sized data sets, versus what customers have traditionally been left with, which is only the ability to process data sets on the order of several tens or hundreds of gigabytes. Now, we have telcos, for example, processing four- or five-terabyte data sets with very fast response times.

It's the volume of data, the speed, the acceleration, and response time that really provide the fundamental value here. MapReduce, over and above that, allows you to bring in more analytics power.

Gardner: A final word to you, Jim Kobielus. This really is a good example of how convergence is taking place at a number of different levels. Maybe you could give us an insight into where you see convergence happening, and then we'll have to leave it there.

Kobielus: First of all, with convergence the flip side is collision. I just want to point out a few issues that enterprises and users will have to deal with, as they move toward this best practice called in-database analytics and convergence of the transactions and analytics.

We're talking about a collision of two cultures, or more than two cultures. Data warehousing professionals and data mining professionals live in different worlds, as it were. They quite often have an arm's length relationship to each other. The data warehouse traditionally is a source of data for advanced analytics.

This new approach will require a convergence, rapprochement, or a dialog to be developed between these two groups, because ultimately the data warehouse is where the data mining must live. That's going to have to take place, that coming together of the tribes. That's one of the best emerging practices that we're recommending to Forrester clients in that area.

Common framework

Also, transaction systems -- enterprise resource planning (ERP) and customer relationship management (CRM) -- and analytic systems -- BI and data warehousing -- are again two separate tribes within your company. You need to bring together these groups to work out a common framework for convergence to be able to take advantage of this powerful new architecture that Sharmila has sketched out here.

Much of your transactional logic will continue to live on source systems, the ERP, CRM, supply chain management, and the like. But, it will behoove you, as an organization, as a user to move some transactional logic, such as workflow, in particular, into the data warehousing cloud to be driven by real-time analytics and KPIs, metrics, and messages that are generated by inline models built with MapReduce, and so forth, and pushed down into the warehousing grid or cloud.

Increasingly, we will find workflow, and especially rules engines, tightly integrated with or brought into a warehousing or analytics cloud that's got inline logic.

Another key trend for convergence is that data mining and text mining are coming together as a single discipline. When you have structured and unstructured sources of information or you have unstructured information from new sources like social networks and Twitter, Facebook, and blogs, it's critically important to bring it together into your data mining environment. A key convergence also is that data at rest and data in motion are converging, and so a lot of this will be real-time event processing.

Those are the key convergence and collision avenues that we are looking at going forward.

Gardner: Very good. We've been discussing how new architectures for data and logic processing are ushering in this game-changing era of advanced analytics. We've been joined by Jim Kobielus, senior analyst at Forrester Research. Thanks so much, Jim.

Kobielus: No problem. I enjoyed it.

Gardner: Also, we have been talking with Sharmila Mulligan, executive vice president of marketing at Aster Data. Thank you, Sharmila.

Mulligan: Thanks so much, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You've been listening to a sponsored BriefingsDirect podcast. Thanks for listening, and come back next time.

Transcript of a BriefingsDirect podcast on how new advances in collocating applications with data architecturally provides analytics performance breakthroughs. Copyright Interarbor Solutions, LLC, 2005-2010. All rights reserved.