How WibiData Works
Over the past year, we've had the privilege of helping a lot of great customers get more value out of their data by using WibiData to store, analyze and serve information about their users. What makes WibiData such a powerful tool? Why is it more effective than just cobbling together your own solution on top of Hadoop and HBase? In this post, we'll peel back the covers and look at how some key components of WibiData work to give you more leverage over your data.
Data about users has challenges associated with it that you don't necessarily see with other large-scale data.
- To analyze users, you need to digest large volumes of log-oriented transactional data as well as more concise profile data
- You need to serve recommendations and other derived data interactively
- There's a mix of batch (offline) and on-the-fly calculations required to deliver recommendations at web speed
WibiData is designed to store this transactional data side-by-side with profile data and other derived attributes. Keeping data logically and physically close enables high-performance analysis of the entire data picture surrounding a user. By building on Apache HBase, we retain the ability to update these records on the fly, continuously recording new information about users as they interact with a web site, product or service. Furthermore, the ability to add ad-hoc columns to a table enables more flexible analysis: the output of one analytic pipeline is stored adjacent to its input data, so you can easily use it as input to second- or third-order derived data as well. Every cell records a "first class" fact about a user.

WibiData's table layout management system, powered by Apache Avro, gracefully adapts to the changing data collection and analysis needs of your organization. And WibiData's analysis and serving framework allows you to seamlessly compute over your user data in both offline and online capacities.

WibiData works in a broad variety of domains: in addition to classic web-based contexts like advertising and content recommendation, it also has applications in mobile, online gaming, healthcare, finance, and several other industries. For purposes of illustration, though, this post draws its examples from a web-based application scenario, describing how WibiData's many components work together to deliver focused content recommendations to users.
WibiData's storage layer is built around Apache HBase. Its three-dimensional storage model (row, column, timestamp) offers a unique capability: information that would ordinarily require separate tables can be organized in a single condensed view, sliced along a common characteristic.

Each row in a WibiData table represents an individual user. Individual columns in this table can be multi-valued; the "info:web_hits" and "derived:recs" columns store multiple timestamped values. The "web_hits" column stores all the data about how a user has visited a web site over time. The "recs" column holds the most current content recommendations for the user; these change over time as we learn more about the user's preferences. This multi-valued data sits alongside more traditional data about the user: name, email address, and other properties that have only a single value at a time.

One of HBase's strengths is its flexibility: rows can contain as many columns as you'd like, with any names you want. The "cells" in these columns can contain arbitrary bytes -- no data type information is tracked by HBase itself. This "schema free" approach to data makes HBase a great platform on which to build such a system. But HBase is only one component in a rich ecosystem of open source and proprietary components.
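The row/column/timestamp model described above can be sketched as a toy in-memory structure. This is a conceptual illustration only, not the HBase or WibiData API; the column names follow the post's examples, and everything else (class and method names) is hypothetical.

```python
import time

# Toy model of three-dimensional storage: row key -> column -> timestamped cells.
# The real system stores these cells in HBase; this sketch only illustrates the shape.

class UserTable:
    def __init__(self):
        # row key -> {column -> [(timestamp, value), ...] newest first}
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((ts, value))
        cells.sort(key=lambda c: c[0], reverse=True)  # keep newest first

    def get_latest(self, row, column):
        """The most recent cell value, like a single-valued profile field."""
        cells = self.rows.get(row, {}).get(column, [])
        return cells[0][1] if cells else None

    def get_all(self, row, column):
        """Every timestamped value, like the full 'web_hits' history."""
        return [v for _, v in self.rows.get(row, {}).get(column, [])]

table = UserTable()
table.put("user123", "info:email", "user@example.com", timestamp=1)
table.put("user123", "info:web_hits", "/home", timestamp=10)
table.put("user123", "info:web_hits", "/products/42", timestamp=20)

print(table.get_latest("user123", "info:web_hits"))  # /products/42
print(table.get_all("user123", "info:web_hits"))     # ['/products/42', '/home']
```

The point of the sketch: a single row holds both the full multi-valued history ("info:web_hits") and single-valued attributes ("info:email"), side by side.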
Schema freedom: not the whole story
But "schema free" isn't what you really want. Your actual data has a schema: a set of conventions that you understand about the information and how it interacts with your business process. Whether this schema is enforced by the program code that reads and writes this data, or by an underlying datastore, it still applies to your data. After all, you don't think about your users in terms of bytes; you model them with records and fields: string-based email addresses, integer-based dates and timestamps, and other useful data types. Whenever you serialize and deserialize data, a schema comes into play. What is most important is the ability to be flexible in how you apply a schema to your data. WibiData uses Apache Avro for serialization. This framework is designed with two important criteria in mind:
- The programs that read records and the programs that write them may need to be updated independently of one another
- The programs that access your records may not have access to precompiled classes and other custom libraries
Avro schemas can vary over time: you can easily add a field to a record, or delete one. As your data collection practices evolve, Avro keeps pace. Does your web site track a new cookie? It can be added as a new field. Even as you start collecting that new data, your existing analysis pipelines can treat records as they always did; programs that don't yet know about the new cookie remain compatible with both the old records already collected and the new records carrying the additional field. New programs fill in default values for old data recorded before a field was added, applying the new schema at read time.

Many other serialization systems require that you precompile your schemas (or "protocols") into classes that interpret the data at runtime. Avro's "generic" library allows you to work with data from any source in a program that doesn't know about your specific classes -- a program like WibiData. WibiData uses Avro's generic deserialization capabilities to load data into Pig or export records into a SQL database so analysts can explore it ad hoc. Furthermore, the schemas for every column are stored in a data dictionary that matches column names with their schemas, as well as human-readable descriptions of the data. Clients of the system do not need to read the source code of every application updating a table to discover all the data at their disposal. WibiData's graphical data explorer lets everyone who works with your data understand what information can be combined for further analysis.
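The read-time resolution described above can be illustrated with a toy example. This is not the Avro API (Avro's generic readers perform this resolution for you, driven by writer and reader schemas); the field names and the `read_record` helper below are hypothetical stand-ins to show the idea of defaults filling in old records.

```python
# Toy illustration of read-time schema evolution in the spirit of Avro:
# the reader schema lists fields with defaults, and records written before
# a field existed are filled in with that default when read.

READER_SCHEMA = [
    # (field name, default value)
    ("email", None),
    ("visit_count", 0),
    ("new_cookie", ""),  # field added after some records were already written
]

def read_record(raw, schema=READER_SCHEMA):
    """Apply the reader schema, supplying defaults for missing fields."""
    return {name: raw.get(name, default) for name, default in schema}

# Written before "new_cookie" was added to the schema:
old_record = {"email": "a@example.com", "visit_count": 7}
# Written after:
new_record = {"email": "b@example.com", "visit_count": 3, "new_cookie": "xyz"}

print(read_record(old_record))
# {'email': 'a@example.com', 'visit_count': 7, 'new_cookie': ''}
print(read_record(new_record)["new_cookie"])
# xyz
```

Old and new records flow through the same pipeline; readers never need to branch on which version of the schema wrote a record.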
A new analysis calculus
MapReduce provides a generic framework for computing over large volumes of data. But the programming model of map and reduce functions is often cumbersome to adapt to your analysis needs. WibiData introduces new operators that help you think about your users in a more row-oriented data model.

Producers are computation functions that update a row in a WibiData table. Unlike mappers or reducers, which create new (key, value)-pair datasets by reading separate input datasets, producers run a function (like a classifier or content recommendation routine) over individual rows (users) in a table, updating an output column (like "profile") by applying the function to some of the input columns -- for example, the web browsing history stored for a user in the "web_hits" column.

Of course, you still need to aggregate data over all your users. How many ads did you serve today, and what was the average conversion rate? How many monthly active users have you had over the past quarter? Which machine learning model best recommends new content to users, based on the expressed and implicit preferences demonstrated by your user activity history this week? Gatherers close the gap between the row-oriented data stored in WibiData and the key-value pairs of MapReduce. With a gatherer, you can extract a projection of your dataset to work with in a more conventional MapReduce pipeline. The results of that aggregate process can then be applied on a row-by-row basis with a producer.
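The two operators can be sketched over a toy table of rows. This is a conceptual model, not the WibiData API: the column names follow the post's examples, while the `recommend` logic and function signatures are hypothetical stand-ins.

```python
from collections import Counter

# Toy table: each row is one user, with columns keyed by "family:qualifier".
rows = [
    {"user": "alice", "info:web_hits": ["/home", "/sports", "/sports/scores"]},
    {"user": "bob",   "info:web_hits": ["/home", "/finance"]},
]

def recommend(web_hits):
    """Stand-in recommender: pick the site section visited most often."""
    sections = [hit.split("/")[1] for hit in web_hits if len(hit) > 1]
    return Counter(sections).most_common(1)[0][0] if sections else None

def produce(row):
    # A producer reads input columns of one row and writes a derived column
    # back to that same row -- no shuffle, no separate output dataset.
    row["derived:recs"] = recommend(row["info:web_hits"])

def gather(row):
    # A gatherer projects a row into (key, value) pairs for a downstream
    # MapReduce aggregation -- here, page -> 1 for site-wide hit counts.
    for hit in row["info:web_hits"]:
        yield (hit, 1)

# Row-by-row derivation:
for row in rows:
    produce(row)

# Aggregation over all users (the sum a reducer would perform):
site_counts = Counter()
for row in rows:
    for page, n in gather(row):
        site_counts[page] += n

print(rows[0]["derived:recs"])  # sports
print(site_counts["/home"])     # 2
```

Notice the division of labor: the producer stays within one user's row, while the gatherer emits the key-value projection that conventional MapReduce machinery aggregates across users.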
Combining batch and real-time analysis
Producers and gatherers can be run within Hadoop MapReduce, guaranteeing fault-tolerant execution of your analytic pipelines across millions of users and terabytes of data. But sometimes MapReduce falls short of your needs. As users continue to interact with a live site, the recommendations, advertisements, content, and other personalization characteristics need to continually adapt to their input.

WibiData's framework creates a layer of abstraction between your user analysis routines and the underlying execution model. Because your producers are not tied to MapReduce, they can be executed in either a batch or an on-the-fly context. New data in the "info:web_hits" column renders the recommendations in the "derived:recs" column stale; but as the application fetches a user's recommendations, producers can be executed to bring them up to date, without requiring MapReduce. WibiData's Data Access Server provides a wrapper around the underlying data, storing and serving information via REST or Apache Thrift APIs and running producers as needed to keep results fresh on an event-driven basis.
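The staleness check at read time can be sketched as follows. This is an illustrative model of the idea only, not the Data Access Server's behavior or API; the timestamps, column names, and `produce_recs` stand-in are all hypothetical.

```python
# Sketch of read-time freshening: if the newest recommendation is older than
# the newest web hit, re-run the producer before serving, instead of waiting
# for the next batch MapReduce job.

row = {
    # (timestamp, value) cells, newest first, as in the storage model above.
    "info:web_hits": [(20, "/finance"), (10, "/home")],
    "derived:recs":  [(15, ["/finance/markets"])],
}

def newest_ts(cells):
    return cells[0][0] if cells else 0

def produce_recs(row, now):
    # Hypothetical stand-in for a real recommendation producer.
    latest_hit = row["info:web_hits"][0][1]
    row["derived:recs"].insert(0, (now, [latest_hit + "/related"]))

def read_recs_fresh(row, now):
    """Serve recommendations, freshening them first if they are stale."""
    if newest_ts(row["derived:recs"]) < newest_ts(row["info:web_hits"]):
        produce_recs(row, now)  # on-the-fly execution, no MapReduce job
    return row["derived:recs"][0][1]

print(read_recs_fresh(row, now=21))  # ['/finance/related']
```

Because the same producer function runs in both contexts, the batch pipeline and the serving path cannot drift apart: one routine, two execution models.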
Putting it together
WibiData's library of analytic producers and gatherers gives you a head start on building a powerful analytic pipeline to make your product more dynamic. The integration of Hadoop, HBase, and Avro forms a powerful and flexible basis for performing analysis at scale. And WibiData's extensions to this data model, like locality groups, ensure that computation and serving run smoothly and with high performance. Curt Monash, author of DBMS2, also does a great job of summarizing some of the highlights of WibiData, as well as clarifying how WibiData fits into the taxonomy of investigative, operational, and other analytic purposes. We've helped a lot of clients get further with their analysis goals than they would have without WibiData, and we're excited at the prospect of helping more great organizations wrangle their data. Does WibiData sound interesting to you? Let us know what you think!