Thoughts on Building Relational Data Sets: The Reverse Collector/Appraiser Algorithm

Ian Marquis
Published in DataDrivenInvestor
21 min read · Jun 22, 2018

--

Previously, I wrote about potential applications of AI to the food industry. This time, taking a deeper dive into a narrower (but closely related) topic, I’d like to explore a single idea for identifying meaningful matches in data:

The Reverse Collector/Appraiser is a theoretical algorithm for determining associations between entities in a data set. It’s a tool for querying a data set and generating a pool of matches based on the criteria provided.

On the surface, that might not sound all that interesting (there are, after all, many existing models for executing a search) — but when applied recursively or sequentially, and/or when given access to external libraries or other sources of relational data, I think it has the potential to become something much bigger and ultimately more useful. More specifically, I believe it has significant potential in the field of machine learning.

If you’re curious about the name, I’ll give you the 30-second TL;DR:

  • The “Appraiser” algorithm determines the value of the items in the data set
  • The “Collector” identifies the items it is interested in based on that valuation
  • It’s “reverse” because the Appraiser acts before the Collector (determining relevance so the Collector can select), rather than the Collector selecting and then looking to the Appraiser for a valuation. I toyed with other names (Reverse Roadshow, Antique Store, etc.), but I was getting too far into the weeds and forced myself to settle on this one. It’s less important than the actual functionality, even if the creative in me still wants to noodle on it.

Also, full disclosure: this exploration will be long, it will bury itself in the weeds, and it may at times be difficult to follow. For that reason, I’m going to include a concise summary at the end, followed by what I think are some interesting applications of the idea. Look for “Now, about that concise review I promised…” if that’s more your speed.

First, let’s take a quick look at the Collector portion of the algorithm, which controls the selection of results from among the pool of potential matches. I know that’s a bit backwards (since the Appraiser actually performs its work first), but I think it helps set the stage.

The Collector’s behavior exists as points on a continuum (ranging in selection strategy from broad/unfocused to narrow/focused), but I think there are three significant points on that continuum. Let’s call those points Strategy 1, Strategy 2, and Strategy 3:

Strategy 1 gathers all entities present in a data set which have a minimal number of attributes in common (as few as one). These attributes need not have been previously identified as key. The selection of a particular seed entity is largely irrelevant — only the selection of an attribute/value pair matters.

  • In this mode, the Collector algorithm will select a broad grouping of matches. There will be little specificity in the result; the function is close to that of a one- or two-dimensional filter. (When I initially sketched out this idea, I was going to call Strategy 1 “The Hoarder”, because its goal is to cast as wide a net as possible while maintaining some degree of initial relevance)

Strategy 2 gathers all entities present in a data set which have a moderate number of attributes in common. One of these attributes must have previously been identified as key to the selected seed entity (more on this later in the exploration).

  • In this mode, the Collector algorithm will select for an increasingly specific result. (When I initially sketched the idea, this strategy was simply called “The Collector”, because it falls in the middle)

Strategy 3 gathers all entities present in a data set that have a high number of attributes in common. One or more of these attributes will have previously been identified as key. One or more entities will have been selected as a seed.

  • In this mode, the Collector algorithm will select for a highly specific result, modeling the selection based on refined criteria. (When I initially sketched the concept, I was going to call Strategy 3 “The Curator”, because it acts with a high degree of selectiveness)

The “blend” between these three strategies is determined by the following properties (which could either be specified by a human performing or guiding the search, or a process/application executing the search as part of a larger routine — the latter speaks to the potential for recursion that I mentioned in the introduction):

  • Attribute(s) selected for target model. (Are the attributes key or non-key?)
  • Entity (or entities) selected for target model (the attribute(s) being matched can be supplied directly as input, passed from a seed entity, or blended from a cohort of entities)
  • What constitutes a “match” (the Appraiser provides association scores, but the Collector ultimately determines what qualifies)
  • The “Curiosity” setting of the algorithm (more on this later)
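
To make the blend of those settings a bit more concrete, here is a minimal sketch of how a single query to the Collector might be parameterized. Nothing here is a real library; the field names (target_attributes, seed_entities, match_threshold, curiosity) are placeholders of my own.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CollectorQuery:
    """Hypothetical bundle of parameters handed to the Collector."""
    target_attributes: dict[str, Any] = field(default_factory=dict)  # attribute/value pairs to match on
    seed_entities: list[str] = field(default_factory=list)           # optional seed entity IDs
    match_threshold: float = 0.8                                     # minimum unity/diversity score that qualifies
    curiosity: int = 0                                               # number of follow-up queries allowed

# A Curator-leaning query: several attribute/value pairs, a seed entity, a strict threshold.
query = CollectorQuery(
    target_attributes={"cuisine": "thai", "spice_level": "hot"},
    seed_entities=["entity-042"],
    match_threshold=0.9,
    curiosity=0,
)
```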

The Collector algorithm itself is a “dumb” process, in that it doesn’t actually know anything about the data set (or even perform any analysis on it at all). Instead, the Collector acts on an abstraction of the data set. This allows it to be completely agnostic to the nature of the entities or the “meaning” of the data set, which ultimately makes it more flexible and adaptable to much more than a straightforward search scenario.

The actual assessment of the entities in a data set, according to their attributes, is performed by the Appraiser algorithm. The Appraiser crawls the data set, analyzes it to determine the relationships between entities and their attributes, and serves the result back to the Collector as an accessible abstraction.

A high-level overview of how the Appraiser algorithm generates matches. I’ll dig deeper into each aspect.

The Appraiser algorithm operates in three “passes”, each building on the foundation laid by the one before it:

Pass 1: Prepare the Data

In this first pass, the algorithm takes stock of the data set, performs some clean-up, and begins to trace paths that will later be explored by subsequent passes:

  1. Determine the total number of entities in the data set
  2. Determine the total number of attributes per entity in the data set
  3. Determine the data type for each attribute (e.g. integer, string, boolean)
  • This can be determined algorithmically using heuristics and reference data, or through human input (e.g. tagging the attributes prior to import)

4. Based on the data type, identify and flag null or treat-as-null values for attributes at the entity level

  • Entities with null values for certain attributes can either be excluded from comparisons for those particular attributes (default behavior), purged from the data set altogether (in scenarios requiring exceptional rigor), or augmented using other outside sources of data (with the donor entity selected as a match candidate using the Appraiser algorithm itself, even)

5. Determine the model for comparison of each attribute (This can be determined algorithmically or through human input provided in advance), choosing from the following:

  • By comparison of the literal contents of the attribute field (e.g. exact string match, Boolean match). This could be an inflexible match (e.g. all entities where Attribute A = TRUE), or a flexible match (e.g. all entities where Attribute B contains “chicken”)
  • By comparison of the numerical value of the attribute field (e.g. all values falling within a specified range)
  • By comparison of the meaning of the contents of an attribute field through the use of an outside data source (e.g. thesaurus or dictionary synonym lookup, comparison of postal codes against seed list representing a particular geographic location)
  • The unity of an attribute cohort will eventually be determined based on the matching strategy selected, with the Appraiser process calculating spread/variance of attribute/value pairs.

6. “Normalize” attribute values to improve the quality of match results (This action can be automatic or human-approved to allow for flexibility). Normalization includes the following (see the sketch after this list):

  • Flattening case variants in strings (e.g. upper, lower, mixed)
  • Standardizing punctuation or markers in attributes where format can be assumed (e.g. phone, email)
  • Removing non-functioning or overly specific decimal variations in numerical values (e.g. 4.00, 4, 4.012 = 4.0)
  • Converting non-standard or non-uniform boolean values to standard values (e.g. “true”, “TRUE”, “checked”, “1”)

7. Assign a unique ID value for each entity in the data set
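
To give a feel for the clean-up described in step 6, here is a minimal normalization sketch. The rules mirror the bullets above (case flattening, decimal rounding, boolean mapping); the treat-as-null list and the function name are illustrative rather than prescriptive.

```python
TREAT_AS_NULL = {"", "n/a", "na", "none", "null", "-"}
BOOLEAN_ALIASES = {"true": True, "yes": True, "checked": True, "1": True,
                   "false": False, "no": False, "unchecked": False, "0": False}

def normalize_value(raw, data_type):
    """Normalize a single attribute value according to its declared data type."""
    if isinstance(raw, str) and raw.strip().lower() in TREAT_AS_NULL:
        return None                                # flag as null; comparisons will skip it
    if data_type == "string":
        return raw.strip().lower()                 # flatten case variants
    if data_type == "number":
        return round(float(raw), 1)                # drop overly specific decimals: 4.012 -> 4.0
    if data_type == "boolean":
        return BOOLEAN_ALIASES.get(str(raw).strip().lower())
    return raw

# normalize_value("TRUE", "boolean") -> True; normalize_value("4.012", "number") -> 4.0
```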

The output of Pass 1 will be:

  • A normalized copy of the data set for use by all future processes, with the unique ID added to each entity’s attribute set
  • A simplified layout file that represents the attributes attached to entities in the data set, with each connected to its data type and model for comparison.
  • These two files will serve as the heads of the relational data set that is produced by the Appraiser in subsequent passes.

Pass 2: Score the Data

In this second pass, the algorithm analyzes and scores the data set to establish attribute ranges and identify key/non-key attributes:

  1. Determine the number of unique values present for each attribute in the data set
  2. Based on a strict equivalence match, determine the number of entities possessing each of the unique attribute/value pairs for each attribute (strict cohort size)
  3. Based on the number of unique attribute/value pairs for a given attribute, flag that attribute as Key, Non-Key, or Possible Key.
  • Key Attributes are attributes whose values display a high degree of uniqueness in the data set. That is to say, these attribute/value pairs belong to small cohorts. The level of uniqueness that qualifies as “key” is relative to the specific data set being analyzed — but a target threshold can also be set.
  • Non-Key Attributes are attributes whose values display a high degree of commonality in the data set. The corresponding attribute/value pairs belong to large cohorts within the overall data set.

4. Determine the uniqueness/diversity of the data set at the entity and attribute levels

  • The diversity calculation for an attribute factors in the total number of entities in the set and the total number of unique values for the selected attribute (uniques). This relationship also represents the average cluster size of the attribute.
  • The uniqueness calculation for a specific attribute/value pair factors in the total number of entities where the attribute/value pair is present (cluster size), the total number of unique values for the attribute, the total number of entities in the data set, and the size of the specific value-match cluster relative to other clusters in the data set for that particular attribute.

5. The uniqueness scores of a single entity’s attributes can then be averaged to determine the overall uniqueness of that entity relative to the data set.

6. The relative cluster size of each of an entity’s attribute/value pairs can be averaged to determine its Relative Peer Match score. This serves as a rough indicator of the size of the cohorts associated with each attribute/value pair.
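
The passes above don't pin down exact formulas, so here is one illustrative way the scoring in steps 4 through 6 could be computed: diversity as unique values over total entities, a cohort's uniqueness as one minus its share of the data set, and an entity's score as a simple average over its attributes. Treat the math as a stand-in, not the definitive Appraiser behavior.

```python
from collections import Counter
from statistics import mean

def score_attribute(values):
    """Score one attribute column. `values` holds this attribute's normalized
    value for every entity (nulls already excluded). Formulas are stand-ins."""
    total = len(values)
    clusters = Counter(values)                     # value -> strict cohort size
    uniques = len(clusters)
    diversity = uniques / total                    # near 1.0 = highly unique (likely Key)
    avg_cluster_size = total / uniques
    uniqueness = {v: 1 - count / total for v, count in clusters.items()}
    return diversity, avg_cluster_size, uniqueness

def score_entity(entity, per_attribute_uniqueness):
    """Average the uniqueness of an entity's attribute/value pairs (step 5)."""
    return mean(per_attribute_uniqueness[attr][value]
                for attr, value in entity.items()
                if value in per_attribute_uniqueness.get(attr, {}))

d, avg, uniq = score_attribute(["red", "red", "blue", "green", "red"])
# d = 0.6, avg ≈ 1.67, uniq == {"red": 0.4, "blue": 0.8, "green": 0.8}
```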

The output of Pass 2 will be a set of relational data files that capture:

Linked to the Attribute (layout file):

  • The total number of unique values for the attribute present in the data set (this also functions as Cohort count)
  • The average cluster size associated with the attribute
  • A score that represents the likelihood of the attribute being Key, Non-Key, or Possible Key relative to the entire data set (hinging on the diversity/uniqueness of the attribute in the data set)

Linked to the Cohort:

  • A unique ID that connects the cohort back to its associated Attribute
  • The value (field contents) associated with the cohort, tagged with a unique ID
  • A count of the number of entities belonging to the cohort (cluster size)
  • An array of unique IDs for each entity that belongs to the cohort for this particular attribute/value pair.
  • A score that represents the specific cohort’s uniqueness for the attribute

Linked to the Entity:

  • (Through the Attribute) The ID of the cohort associated with each of the entity’s attributes
  • The average cluster size across all of the entity’s attributes
  • The average uniqueness across all the entity’s attributes

Pass 3: Group the Data

Now that the Appraiser has (in essence) built a relational database that ties together entities and their attributes with the value cohorts those attributes fall into, as well as the relative distribution of those values throughout the data set, it needs to perform a more detailed analysis of the relationships within that high-level data:

  1. Based on the model for comparison associated with an attribute, analyze the value of each value cohort relative to every other cohort present for that attribute.
  2. Because comparing every cohort to every other cohort produces a relational table that grows quadratically with the number of cohorts, there will be significant efficiency gains if the algorithm is able to build an abstraction layer — that is, to cluster similar cohorts and reference relationships between them, rather than between each individual cohort, whenever possible.
  3. Using the relational scoring produced in the first step, build Cohort Clusters that connect cohorts whose values are “close” (this value is relative to the data set as a whole, but can be tweaked as necessary).
  • Each cohort cluster will contain a Unity score that indicates the variance displayed in the range of clustered attribute/value pairs (diversity)

4. If the depth of the data set (or the amount of granular variance in attribute/value pairs) supports it, the Appraiser algorithm can be called to create clusters of clusters, building additional layers of abstraction.

  • This function does not need to be used symmetrically across all attributes in the data set; some attribute/value cohorts may support additional clustering, while others may be too shallow.
  • A cluster can contain a single cohort if its attribute is “out of cohorts” and unable to build additional groupings. This avoids the need to process comparisons between unclustered cohorts and cohort clusters falling under the same attribute.
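
As a deliberately simple illustration of steps 1 through 3 for a numeric attribute: cohorts whose values sit within a tolerance of their neighbor are chained into one cluster, and the cluster's Unity score shrinks as its spread grows. String, synonym, or geographic comparison models would need their own distance functions; this only sketches the shape of the step.

```python
def cluster_numeric_cohorts(cohort_values, closeness=1.0):
    """Group sorted numeric cohort values into clusters of "close" neighbors.

    Returns (clusters, unity_scores). Both the chaining rule and the unity
    formula are illustrative, not the definitive Appraiser behavior.
    """
    clusters = []
    for value in sorted(cohort_values):
        if clusters and value - clusters[-1][-1] <= closeness:
            clusters[-1].append(value)        # close enough: extend the current cluster
        else:
            clusters.append([value])          # too far: start a new cluster
    unity = [1.0 if len(c) == 1 else 1 / (1 + (max(c) - min(c)))
             for c in clusters]               # wider spread -> lower unity
    return clusters, unity

clusters, unity = cluster_numeric_cohorts([1.0, 1.5, 2.0, 7.0, 7.5], closeness=1.0)
# clusters == [[1.0, 1.5, 2.0], [7.0, 7.5]], unity ≈ [0.5, 0.67]
```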

The output of Pass 3 will be a set of relational data files that capture:

Linked to the Cohort:

  • A set of granular numerical comparisons (match scores) that compare the cohort to all other cohorts associated with the attribute
  • A Cohort Cluster ID (if needed) that connects the cohort and its similar peers “one level up” to each other
  • An array of the entities that belong to the cohort for that attribute/value pair

Linked to the Cohort Cluster:

  • A set of granular numerical comparisons (match scores) that compare the cohort cluster to all other cohort clusters associated with the attribute
  • An array of the IDs of the cohorts associated with the cluster
  • An array of the entities that belong to the cohorts in the cluster
  • A Unity score that indicates the diversity in the cohorts included in the cluster
  • A Cohort Cluster ID (if needed) that connects the cohort cluster and its similar peers “one level up” to each other. The algorithm can in essence operate “blind” to the depth of the clustering present in the data set — only delving a layer deeper if the Unity score prompts it (based on the preference of the Collector)

Important to note here is that the Appraiser builds a relational data file that makes it easy to move hierarchically through the relationship tree. Once it has crunched the data set, there is no need to perform additional calculations to move from Cohort to Cohort Cluster, to Entity, to Attribute, etc.

Now that we’ve explored in detail the functionality of the Appraiser, we can return to the Collector and take a look at how it actually executes a request.

Step 1: Define the parameters for the query

The Collector cannot begin its search until it knows where to start. Since the algorithm depends on a previously scored data set, that “where” is a uniquely identified node in the data set. The node could be:

  • An attribute/value pair drawn from the data set.
  • A specific entity drawn from the data set
  • An attribute/value pair that originates externally (a “blind node”)

Once the node has been selected, we then need to get a bit more information about the strategy the Collector will be using to determine its matches:

  • The acceptable unity/diversity score that qualifies a match. This will determine whether the Collector accepts results at the Cohort Cluster level or digs deeper to identify increasingly specific cohorts.
  • The Curiosity setting being used for the query

The curiosity setting, in essence, allows the Collector to define a new node at the end of a query, triggering a second query whose results will be appended to the first.

  • With Curiosity set to 0, the Collector will explore only the initial node.
  • With Curiosity set to N, the Collector will query a total of N+1 nodes (the initial plus N subsequent passes).

Now, let’s take another look at the three strategies I outlined at the start of the piece, given the functionality we’ve covered since then (again bearing in mind that these are simplified points on a continuum of behavior):

Strategy 1 (The “Hoarder”) for identifying broad matches

Strategy 1 (Hoarder) selects a large pool of broadly associated matches. It might be achieved by directing the Collector to:

  • Set its target node to an attribute/value pair
  • Set the acceptable unity/diversity score to a low value (e.g. 50%)
  • Set the curiosity value to a number greater than 0

Strategy 2 (The “Collector”) for identifying closely associated matches

Strategy 2 (Collector) selects a medium-sized pool of more closely associated matches. It might be achieved by directing the Collector to:

  • Set its target node(s) to multiple attribute/value pairs
  • Set the acceptable unity/diversity score to a modest value (e.g. 75%)
  • Set the curiosity value to 0 or 1

Strategy 3 (The “Curator”) for identifying highly specific matches

Strategy 3 (Curator) selects a small pool of closely related matches. It might be achieved by directing the Collector to:

  • Set its target node(s) to multiple attribute/value pairs
  • Additionally, specify an entity from the data set to serve as a seed
  • Set the acceptable unity/diversity score to a high value (e.g. 90%)
  • Set the curiosity value to 0
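
Reusing the hypothetical CollectorQuery fields from the earlier sketch, the three strategies reduce to parameter presets along these lines (the specific attributes and thresholds are only examples):

```python
# Illustrative presets for the three Collector strategies.
HOARDER = dict(target_attributes={"color": "red"},                      # a single attribute/value pair
               match_threshold=0.50, curiosity=2)
COLLECTOR = dict(target_attributes={"color": "red", "size": "medium"},  # multiple pairs
                 match_threshold=0.75, curiosity=1)
CURATOR = dict(target_attributes={"color": "red", "size": "medium"},
               seed_entities=["entity-042"],                            # plus a seed entity from the data set
               match_threshold=0.90, curiosity=0)
```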

Step 2: Execute the query

At this point, much of the “work” has already been done. The Collector’s only real task is to “walk” through the database, tracing relationships already established by the Appraiser, until it arrives at a result that it finds acceptable; then, it is done.

The Collector algorithm as it moves “up and out” through the data set towards its final match pool.

Let’s step through an example to get a better feel for how this would work. For this example, the Collector will be executing the following strategy:

  • The target node is set to a single attribute/value pair from the data set (Attribute B, drawn from Entity A)
  • The acceptable unity/diversity score is set to 80%
  • The curiosity value is set to 0

First, the Collector needs to orient itself:

  • Set initial position to Entity A
  • Referencing Attribute B, determine its value is associated with Cohort C
  • Move to Cohort C

Second, the Collector assesses the scoring of the cohort:

  • Compare Unity score for Cohort C to acceptable value (Cohort C has 95% unity, vs. 80% acceptable. Because we are over the threshold, the Collector is free to move up a level)
  • Referencing Cohort C, determine that it belongs to Cohort Cluster D
  • Compare Unity score for Cohort Cluster D to acceptable value (Cohort Cluster D has 85% unity, vs. 80% acceptable. Multiply the unities of each level to arrive at calculated unity for the pool. In this case, 95% and 85% produce 80.75%. Because we are over the threshold, the Collector is free to move up a level)
  • Referencing Cohort Cluster D, determine that it belongs to Cohort Cluster E
  • Compare Unity score for Cohort Cluster E to acceptable value (Cohort Cluster E has 70% unity, or 56.53% when multiplied out, vs. 80% acceptable. Because we are under the threshold, the Collector will not move up a level)
  • Fall back to Cohort Cluster D
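
Before the final step, it may help to see that orient-climb-fall-back pattern as code. This is a minimal sketch, assuming each node in the Appraiser's output exposes a unity score and a parent pointer (names of my own choosing):

```python
class Node:
    """A cohort or cohort cluster in the Appraiser's hierarchy (assumed shape)."""
    def __init__(self, name, unity, parent=None):
        self.name, self.unity, self.parent = name, unity, parent

def walk_up(start_cohort, threshold):
    """Climb toward broader clusters while cumulative unity stays at or above
    the threshold; return the last acceptable node (or None if even the seed
    cohort is too diverse)."""
    node, cumulative = start_cohort, start_cohort.unity
    if cumulative < threshold:
        return None
    while node.parent is not None and cumulative * node.parent.unity >= threshold:
        node = node.parent
        cumulative *= node.unity              # multiply unities as we ascend
    return node

# The worked example above: 0.95 cohort, 0.85 and 0.70 clusters, 0.80 threshold.
e = Node("Cohort Cluster E", 0.70)
d = Node("Cohort Cluster D", 0.85, parent=e)
c = Node("Cohort C", 0.95, parent=d)
walk_up(c, threshold=0.80).name               # -> "Cohort Cluster D" (cumulative unity 0.8075)
```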

Finally, the Collector selects its results:

  1. Referencing the entity ID array, pull all entities that belong to any of the cohorts belonging to Cohort Cluster D
  2. Write the results to a relational data file that contains:
  • A unique ID for the query
  • The IDs for the Attribute, Entity, and Cohort used as reference for the query
  • The IDs for the Entities in the results pool
  • The Unity score and Curiosity setting used for the query
  • The calculated unity score for the entities included in the results pool
  • A date/time stamp of when the query was executed
  • An optional field to record the ID of a linked query triggered by Curiosity

3. Serve the results file back to the process or user
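
Pulled together, the results file described above might look something like this. The field names and IDs are placeholders, and I'm expressing it as a Python dict rather than committing to any particular file format:

```python
results = {
    "query_id": "q-0001",
    "reference": {"attribute_id": "attr-B", "entity_id": "entity-A", "cohort_id": "cohort-C"},
    "result_entity_ids": ["entity-F", "entity-G", "entity-H"],
    "unity_threshold": 0.80,          # the acceptable score supplied to the query
    "curiosity": 0,
    "calculated_unity": 0.8075,       # unity of the pool actually returned
    "executed_at": "2018-06-22T08:30:00Z",
    "linked_query_id": None,          # populated if Curiosity triggers a follow-up query
}
```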

But what about Curiosity? How does the Collector identify its selection of a secondary node after it has chosen its initial results?

Curiosity, to put it lightly, is complicated. Because the intent of the setting is to allow the Collector to identify initially-out-of-scope matches that are still meaningfully connected to the original query (and because “meaningfully connected” is ambiguous and only means something in context of the scenario), there is an opportunity for machine learning to drive the recommendation of additional nodes.

However, let’s step through one possible tactic, just to get a feel for what Curiosity might allow the Collector to do:

Select a new node based on a likely Key Attribute

  1. Reference the data layout file to determine the most likely (or, in the event that a key attribute was already selected for, the next most likely) candidate for Key Attribute in the data file
  2. Identify the Cohort for that Key Attribute that is best represented in the set of original matches
  3. Move the algorithm to that Cohort, and perform the same set of calculations as with the initial query (compare cohort unity, ascend cohort levels until the unity falls below threshold, fall back)
  4. Write the results to a relational data file, appending the unique ID of the new query back onto the link field in the prior query
  5. Repeat (if necessary)
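
Here is a rough sketch of steps 1 and 2 of that tactic, assuming the layout file carries a key-likelihood score per attribute and each entity is a simple dict of attribute values (both shapes are shorthand of mine, not a fixed schema):

```python
from collections import Counter

def pick_curiosity_node(layout_scores, entities, initial_matches, already_used=()):
    """Choose the next node for a Curiosity pass: the best-represented cohort
    of the most key-like attribute not yet queried.

    `layout_scores` maps attribute -> key-likelihood (0.0-1.0);
    `entities` maps entity ID -> {attribute: value}."""
    candidates = {a: s for a, s in layout_scores.items() if a not in already_used}
    key_attr = max(candidates, key=candidates.get)             # most likely Key Attribute (step 1)
    counts = Counter(entities[e][key_attr] for e in initial_matches
                     if key_attr in entities[e])
    best_value = counts.most_common(1)[0][0]                   # best-represented cohort (step 2)
    return key_attr, best_value
```

The attribute/value pair it returns becomes the target node for the linked query, whose ID is then written back onto the originating query's link field (step 4).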

Key to the above is the link field that connects queries to their originating (or originated) queries. This allows the Collector to determine which entities in the set of results were primary matches, which were secondary, etc. Speaking to the usefulness of machine learning (mentioned above) in refining the Curiosity portion of the algorithm, the performance of these secondary/linked queries could be fed back into the data structure as a “usefulness” score, reinforcing the strategy and encouraging its future use.

Now, about that concise review I promised…

Strip away the database definitions and field types. Blur the step-by-step movements through the algorithm as it produces the data set. Cut the detailed settings for strategies and the wide ranges of behavior. Ignore the under-explored possibilities for fringe data, and the redundancies inherent in a rambling first-draft written early in the morning, at night, occasionally during lunch, and mumbled under my breath during my morning and evening commutes.

Let’s just take a high-level look at what this search technique does. I’ll include the visual aids from before, for reference.

The Appraiser portion of the algorithm

The Appraiser:

  1. It cleans and catalogs the data set. That’s important because even when we try our hardest to avoid it, data is often messy and inconsistent.
  2. It determines what the available values for each attribute are. How many, how many of each, what are they like?
  3. It groups all the entities that share a value together, and it links them so we can get to them quickly.
  4. Then, it groups the groups. This is done to speed things up, and also to make the data more useful. It can also group groups of groups, and so on.
  5. Finally, it saves all of its work in a format that makes it easy to “retrace its steps” (because that’s just what the Collector does).

The Collector portion of the algorithm

The Collector:

  1. It picks something that it would like to find more of. This could be an entity, a specific value from an entity, several values, etc.
  2. It lets you define how specific or generic you’d like the matches it gathers to be.
  3. Starting at the thing you picked, it gathers everything that’s close enough to it, using the groups the Appraiser has already created.
  4. It broadens its search until the results are too generic, then stops searching.
  5. It saves the result to a file, along with information about what it did, so that later on the process can be tightened up.

Even as I was writing this, I kept finding myself getting stuck on considering this a “search” algorithm. That’s not inaccurate, but in the context of my own day-to-day (searches for products and terms executed by customers in an e-commerce environment), it feels like that’s too limiting. Also, it masks just how interesting the Reverse Collector/Appraiser could be.

For a moment, let’s think big — as big as we can. What conditions or assumptions can we remove, that might limit how we consider this algorithm?

I’d like to propose the following “rules” for conceptualizing search queries for the purposes of this piece (none of what I’m about to write is new — but I think it’s important to cover all the same):

Search does not always mean something typed into a box by a human using an interface.

  • Search can be driven automatically by an application to display relevant information (even if the user is unaware that a search has been executed).
  • Search can be called by a process within an application as part of a larger piece of functionality
  • Search can be executed silently, recursively, and repetitively — perhaps simply to build a store of data for later use.

Search results are not always shown to a human in a manner that presents them as literal search results.

  • Search results can be served to another algorithm in an application, acting as data for additional processes to act on.
  • Even when served to a human, search results can be further filtered before presentation, even to the extent that only a single item is presented (perhaps in a context where its origin as search-driven data is completely obscured).

Just because it’s common for searchable data to be stored tabularly does not mean that all data to be searched originates in tables (or even ever lives in a table — though its analysis may).

  • Images, chunks of text, and audio files are all searchable entities that don’t exist as strict tables of attributes containing values — but again, they can be parsed into something more digestible before being analyzed for search.
  • This, of course, happens every day. Without it, none of the advances in machine learning as applied to difficult, abstract data types would be possible.

Ultimately, rather than referring to Reverse Collector/Appraiser as a “search” algorithm, I would rather call this a discovery algorithm. Sure, it can search. It probably would spend a lot of its time searching. But the creative applications of it are even more interesting.

So, what can we do with an algorithm like the Reverse Collector/Appraiser?

I don’t want this piece to go on forever, so I’m going to settle on three examples that are a departure from the abstract view of the algorithm that I’ve been framing so far:

Parse subject lines, headlines, and other marketing copy into discrete, relevant values (e.g. number of words, number of characters, number of syllables, parts of speech used, character/symbol set, presence/absence of punctuation, language(s) used, etc.)

  • Identify other pieces of copy from the data set that are similar based on the target values identified, for further use or study. For example, testing response rates using a continuous A/B program.
  • Identify individual elements in a piece of copy that we would like to identify alternatives for (e.g. replace a word, change punctuation, etc.), calculate acceptable matches, and serve up alternative copy blocks for review.
  • Thinking recursively, you could build a machine-aided piece of copy from the bottom up, building a model for words, then phrases, then clauses, sentences, paragraphs, sections, etc. — abstracting the previous level each time you move up
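
A back-of-the-envelope version of that first parsing step might look like the following. Word and character counts, punctuation, and casing come straight from the list above; syllable counts and parts of speech would need a proper NLP library, so I've left them out.

```python
import string

def copy_attributes(text):
    """Break a subject line or headline into discrete, comparable attribute values."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "has_punctuation": any(ch in string.punctuation for ch in text),
        "has_digits": any(ch.isdigit() for ch in text),
        "all_caps_words": sum(1 for w in words if w.isupper()),
    }

copy_attributes("Last chance: 20% off everything!")
# {'char_count': 32, 'word_count': 5, 'has_punctuation': True, 'has_digits': True, 'all_caps_words': 0}
```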

Perform a visual analysis of an image to identify basic attributes (color space, dimensions, resolution, pixel count, gamut, relative contrast, file size), extended/calculated attributes (e.g. presence of JPG grid distortion, identification of dominant color(s), overall temperature/hue of image), scenario specific elements (e.g. optical character recognition, fingerprint features), or machine-learning-assisted elements (e.g. identification of people in the photo, presence of smile or closed eyes, object identification, logo or brand identification). Use a subset of these attributes to:

  • Identify when an image is low-resolution (number of pixels, physical dimension, context of usage)
  • Identify when an image is likely to be poor quality (presence of JPG grid artifacts, small file size relative to the average for images of a similar dimension, color space, file type, etc.)
  • Identify products in a photo for tagging, to enable e-commerce integration, smarter image search, etc.
  • Recommend an alternative photo that is like the original but different in desired ways (warmer color palette, fewer people, larger file, higher contrast, different color space).
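
Picking on the first bullet, here is a trivial sketch of flagging a low-resolution image once its basic attributes have been extracted. The per-usage minimums are invented thresholds, and the attribute dict stands in for whatever the visual analysis actually produces.

```python
def flag_low_resolution(image_attrs, usage, minimums=None):
    """Return True if the image is too small for its usage context."""
    minimums = minimums or {"web_hero": (1600, 900), "thumbnail": (200, 200)}
    min_w, min_h = minimums[usage]
    return image_attrs["width"] < min_w or image_attrs["height"] < min_h

flag_low_resolution({"width": 800, "height": 600}, usage="web_hero")   # -> True
```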

Analyze a recipe to determine its ingredients, flavor profile, cost, etc. Note that a recipe could be instructional, the ingredients list or bill of materials for a food product, etc.

  • Target one or more ingredients in the recipe, along with the attributes that are key to their character, and generate likely matches for other ingredients that could generate interest, regional flair, uniqueness, cost savings, caloric reduction, or any number of other outcomes.
  • Identify a selection of recipes that make use of a high number of ingredients similar to (or identical to) the seed recipe. Use these to build a model of common pairings for key ingredients in the recipe. From there, identify other recipes that possess these commonly paired ingredients but not the key ingredient(s) in the original recipe. Generate variants of the resulting recipes using the “missing” key ingredients.
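
One crude way to express "a high number of ingredients similar to the seed recipe" is plain set overlap. A real implementation would lean on the Appraiser's cohorts (so that galangal and ginger, say, could count as near matches rather than misses), but this conveys the idea:

```python
def ingredient_overlap(seed, candidate):
    """Jaccard overlap between two recipes' ingredient sets."""
    seed, candidate = set(seed), set(candidate)
    return len(seed & candidate) / len(seed | candidate)

seed = {"chicken", "lemongrass", "coconut milk", "chili", "fish sauce"}
candidates = {
    "tom kha gai": {"chicken", "galangal", "coconut milk", "chili", "fish sauce"},
    "beef stew": {"beef", "carrot", "potato", "onion"},
}
ranked = sorted(candidates, key=lambda name: ingredient_overlap(seed, candidates[name]),
                reverse=True)
# ranked == ['tom kha gai', 'beef stew']  (overlaps ≈ 0.67 and 0.0)
```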

And those are just a few examples. If an entity can be broken down into defining characteristics, and if those characteristics can be compared with the characteristics of other entities from the same data set, I think there’s potential for the Reverse Collector/Appraiser to define and select meaningful matches based on them.

The applications of the model are limited only by our ability to conceptualize the data set (which machine learning can assist with), the completeness or availability of the data in the set (again, which machine learning can help us build out), and our ability to integrate the strategies and outcomes into our computing tasks (I think we are collectively game for that).

The data is out there. Let’s use it.

--

Creative marketing and ecommerce professional. Passionate technology nerd. Multidisciplinary artist. Unfettered foodie. Detail-driven lover of the built world.