Companies like Facebook and Google have been tracking their users' behavioral events for decades now, generally using cookies in combination with some sort of tracking pixel. These user behavioral events, which I will hereafter call "beacons" for brevity, are generally tracked by loading a 1x1 pixel into the user's web page with additional parameters of interest populated into the URL string. There are dozens of applications for beacon data, but I assume you already know why you want to track behavior (i.e., you already know how you want to apply this knowledge).

What does a Tracking Pixel look like?

Let's say we wish to track a user's unique ID, the product they interacted with, the interaction type (event), and their self-provided postal code (ZIP code for the Americans out there). You would generally wish to collect the following data, which together make up the beacon data.

Data:
userId=A49FGD
productId=738444
postalCode=M7N1J3
event=click

The tracking pixel is simply the mechanism that sends the beacon data to the backend. It may look like the following:

http://path/to/trackingpixel.gif?event=click&productId=738444&userId=A49FGD&postalCode=M7N1J3

This is known as a Plain-Text Beacon, as the contents are sent in plain text.

This event would be received over HTTP by a company's web server, with the data then parsed and loaded according to its data handling frameworks. Some companies may apply encryption to the data being sent over the wire, whereas others may leave it in the clear if it's nothing confidential.
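To make that concrete, here is a minimal sketch of what the receiving side could look like, using only the Python standard library. The path and parameter names simply mirror the example URL above; a real collector would hand the beacon to a logging or streaming pipeline rather than printing it.

# A minimal sketch of a beacon-collecting endpoint (Python standard library only).
# The parameter names mirror the example URL above.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# A 43-byte transparent 1x1 GIF, returned so the embedding <img> tag renders harmlessly.
PIXEL = bytes.fromhex(
    "47494638396101000100800000000000ffffff21f9040100000000"
    "2c00000000010001000002024401003b"
)

class TrackingPixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        # Flatten single-valued query parameters into a plain dict: the beacon.
        beacon = {key: values[0] for key, values in query.items()}
        print(beacon)  # stand-in for handing the event to the data pipeline
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Content-Length", str(len(PIXEL)))
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    HTTPServer(("", 8080), TrackingPixelHandler).serve_forever()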

What is beacon data used for?

Here are some common applications of beacon data:

Advertising and Billing

One of the most common use cases is to embed a pixel into a website to track engagements with the property. Many companies generate revenue by getting users to click through their ads into an underlying property that is owned by another company (e.g., Facebook to a Shopify store). These companies operate on a pay-per-click or pay-per-engagement model, and the events that drive that billing can be easily collected using these tracking pixels.

Machine Learning and Recommendations

Nearly every web company today is using machine learning to generate product recommendations. Beacons provide a rich source of data in the form of a user’s product interests, demographics and geolocation, all of which are then provided as inputs to generate more accurate machine learning predictions.

Campaign Health Monitoring

Beacons can act like a heartbeat. They arrive in real time and provide us with an up-to-date picture of how any given advertising campaign is going. If engagements appear to be lagging behind projected values, we can react early and accordingly. We can review the content itself to ensure that advertisements and content are rendered properly, that they are functional, and that there are no bugs in our system code that would prevent us from tracking these beacons.

Retailer Reporting

At Flipp, we have an extensive reporting framework built on top of our beacon data. In particular, we have a large number of retail partners who wish to track things like cost-per-click, cost-per-engagement, time-on-flyer, top-X-flyer-items, and a myriad of other metrics, including custom ones. These reports provide extremely important insight for the retailers into their campaign performance at Flipp and directly influence budget growth. The most important part here is that a retailer must be able to trust the data that we are providing them.

As you can imagine, organizing this data and ensuring its integrity is crucial to keeping it trustworthy. There are a few alternatives for structuring the data, so I'll explain what we do at Flipp, and why:

The Pros and Cons of Unstructured Plain-Text Beacons

First, a reminder of what the data in a plain text beacon may look like, as part of the URL of the tracking pixel:

userId=A49FGD
productId=738444
postalCode=M7N1J3
event=click

This data is referred to as unstructured because there is no enforcement of names, types, or domains. The data can be structured however the sender of the beacon desires, for better or worse. There are a number of benefits and drawbacks to this approach.

The Pros:

  • Easy to add new content. Simply add the new key and value.
  • Human readable, which makes it easier to test

The Cons:

  1. Typos
    Even with rigorous testing, keys and values can suffer from typos, especially when multiple platforms are trying to emit the same format (see the sketch after this list).

    • event=click -> evnet=click
    • event=click -> event=cilck
  2. Drifting Value Domains
    Your team might start off only sending true/false, but a future release begins to send 1/0… or "yes"/"no". Perhaps the underlying logic has changed and the new values could be considered valid, but older data still follows the older format.

  3. Deprecation of Data
    Let's say a feature is removed, and we can now remove a corresponding beacon, a parameter from another beacon, or both. How do we do this without causing a downstream failure, perhaps in reporting code? How do we know the impact of the removal?

  4. Inconsistent Evolutionary Strategies
    Let's say you add a new product type to engage with. For a Flipp-specific example, this could be the addition of coupons alongside flyers. Do you use the existing "event=engagement" type to track this event? Do you create a new event type? Do you specify a subtype? What are the impacts on the downstream consumers?
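As a quick illustration of how the first two cons surface downstream (the sketch promised above), here is a toy example of a consumer counting clicks from raw beacons. Nothing fails loudly; the typo'd key and the typo'd value both parse cleanly and simply vanish from the count.

# Toy illustration: both typos parse without error and silently disappear
# from the downstream aggregate.
from urllib.parse import parse_qs

raw_beacons = [
    "event=click&productId=738444",
    "evnet=click&productId=738444",  # typo in the key
    "event=cilck&productId=738444",  # typo in the value
]

parsed = [{k: v[0] for k, v in parse_qs(q).items()} for q in raw_beacons]
click_count = sum(1 for beacon in parsed if beacon.get("event") == "click")
print(click_count)  # 1, even though three click beacons were sent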

Impacts of Unstructured Plain-Text Data

At Flipp, we currently use a mixture of Apache Hive and Apache Spark for a large portion of our regular report-generating pipelines. We use Hive to define a schema on top of the plain-text data for querying by downstream analytics processes, and many of our legacy pipelines use this approach.

Our newer batch pipelines are implemented in Spark, using DataFrames to access the data directly from HDFS. Along the way we have also ported the storage of our data from plain text to Parquet, but the aforementioned downsides of using plain-text beacons still follow us into this implementation.
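For context, a schema-on-read setup along these lines might look like the simplified sketch below (the HDFS path is hypothetical, and the columns come from the earlier example). The important point is that declaring the schema only names and types the columns; it does nothing to validate what the beacons actually contain.

# Simplified sketch of schema-on-read over beacon data with PySpark.
# The HDFS path is hypothetical; the columns come from the earlier example.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("beacon-reporting").getOrCreate()

beacon_schema = StructType([
    StructField("userId", StringType()),
    StructField("productId", StringType()),
    StructField("postalCode", StringType()),
    StructField("event", StringType()),
])

# The schema names and types the columns, but a typo'd key or an
# out-of-domain value is still accepted without complaint.
beacons = spark.read.schema(beacon_schema).parquet("hdfs:///data/beacons/")
beacons.createOrReplaceTempView("beacons")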

Say we receive a batch of events like the ones above, some of which contain potentially erroneous data, and we wish to insert them into our data store.

What do you do with potentially bad data, or data that does not conform to the underlying schema? Do you discard it? Do you transform it? The unfortunate part is that there is no good answer here, or at least none that will satisfy all of your downstream data consumers.

Example Consumer 1: I want to know all of our unique user IDs to determine weekly active users (unique count on userId).

Example Consumer 2: I want to know which products are clicked on the most in my Flyer (unique count on productId grouping).

Example Consumer 3: I want to know which postal codes have the highest activity (Geo Operations on postal/zip codes).
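Expressed against the hypothetical beacons DataFrame from the earlier sketch, those three consumers might look something like this:

from pyspark.sql import functions as F

# Consumer 1: weekly active users
weekly_active_users = beacons.agg(F.countDistinct("userId").alias("wau"))

# Consumer 2: most-clicked products
top_products = (beacons.filter(F.col("event") == "click")
                       .groupBy("productId").count()
                       .orderBy(F.col("count").desc()))

# Consumer 3: activity by postal code
activity_by_postal = beacons.groupBy("postalCode").count()

A row with a malformed postalCode is irrelevant to the first consumer but skews the third, while a typo'd event key only hurts the second.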

The issue is that we can't tell for certain what a business unit wants to report on, nor can we predict what they will wish to report on in the future. If we drop any rows that we deem malformed, reporting queries that should be unaffected by the malformed fields end up being affected anyway.

On the other hand, if we leave malformed data in the tables, then we leave it up to each consumer of the data to repair it, which leads to costly duplication of logic across consumer applications. Each application hardens around its own strategy for handling bad data, and with time these strategies tend to diverge and inconsistencies show up.

A particularly bad way in which these inconsistencies can show up is in divergent customer reporting. A retailer could obtain a set of reports from us where the billing data does not match the engagement data.

Trust is easy to lose and hard to gain back, so it is in everyone’s interest to keep the data clean.

What about a Clean Table?

One approach is to take the malformed data from the input source, clean up any malformations, and output the result as a clean data source for downstream consumption. At first glance this seems like a fine idea, and in practice it will work for any malformations that you know about ahead of time (a sketch of such a cleaning job follows the list below). However, this approach falls short in a few areas:

  1. Doesn’t account for removed fields
    • If data is missing, do you populate it? How do you know it’s missing and not simply deprecated?
  2. Can’t eliminate all typos
    • What decision do you make if you're collecting both "browsertype" and "browser_type"? Are they the same values or were they renamed for some reason?
  3. Doesn't know the range of valid domains for each field, especially as the data evolves.
    • What if true/false becomes true/false/unknown, or true/false/null? Is that a bug or expected change?
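As promised above, a "clean table" job usually ends up being a set of hand-maintained rules along these lines (hypothetical rules, not our actual cleaning logic); anything the rules don't anticipate passes straight through untouched.

from pyspark.sql import functions as F

# Hypothetical cleaning rules over the beacons DataFrame from the earlier sketch.
# Each rule encodes a malformation we already know about; new typos, new value
# domains, or removed fields sail through unchanged.
cleaned = (beacons
    .withColumn("event", F.when(F.col("event") == "cilck", "click")
                          .otherwise(F.col("event")))
    .withColumn("postalCode", F.upper(F.regexp_replace("postalCode", r"\s+", "")))
    .filter(F.col("userId").isNotNull()))

cleaned.write.mode("overwrite").parquet("hdfs:///data/beacons_clean/")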

What ends up happening in practice is that the clean table becomes a reactionary system. The Data Engineering team finds out about a data failure or inconsistency either from batch data quality reports (kinda slow), from real-time monitoring (better, but by the time we find out, our data is already becoming inconsistent), or when an external client notifies us that something doesn't look right (worst of all).

In each of these scenarios we only detect a failure once it has already occurred, at which point we can only react, improve the cleaning process, and reprocess the data. When dealing with large amounts of data, this reprocessing can become expensive.

So, now what?

Most of the problem here stems from using plain-text, freely defined beacons. While this approach gives developers plenty of freedom to collect whatever parameters they like, it is tightly coupled to application release cycles, extremely error-prone, and simply defers the resolution of data types and schemas to downstream consumers, introducing yet more errors.

I've covered the first part of how we manage our data at Flipp. In the second part, I'll unpack how we use Avro to organize our data within an event-driven microservice architecture (a pattern also used by LinkedIn and Airbnb). That's a whole other can of worms, so it deserves a post of its own!

For now, I recommend taking a look at how your own team's data is structured, the kinds of problems you're running into, and how you plan to clean things up or move to a new structure.