Data Discovery & Search Infrastructure: Part 1
From Keywords to Smart Examples
How do we find the right data in a world drowning in datasets?
Imagine you're looking for information about schools in your city. You could Google "Chicago schools," but what if you want something very specific - like schools similar to "Ogden International High School" with "International Baccalaureate programs" and "Level 1 ratings"? This is where Data Discovery & Search Infrastructure makes these targeted searches possible.
The Two-Layer Foundation: Index + Interface
In Spring 2025, Professor Sainyam’s “Data to Decisions” course introduced two core components of any data search system.
Think of it like a library: the Data Index is like the organized shelves of books, while the Search Interface is like the librarian who helps you find exactly what you need.
But here's the challenge - we're not just dealing with a few hundred books, we're dealing with millions of datasets scattered across the internet! Real-world implementations therefore demand highly scalable architectures.
The Keyword Search Revolution: Enter the Inverted Index
The Problem
Let's say you have these datasets:
- D₁ = {cornell, university}
- D₂ = {ithaca}
- D₃ = {cornell, ithaca}
- D₄ = {Ithaca}
- D₅ = {university}
If someone searches for "cornell," the naive approach would check each dataset individually. With millions of datasets, this becomes impossibly slow - like reading every single book in the library to find mentions of "cornell."
The Solution: Inverted Index
Instead, we flip the problem on its head and create an inverted index:
{
cornell: [D₁, D₃],
ithaca: [D₂, D₃, D₄],
university: [D₁, D₅]
}
Now when someone searches for "cornell," we instantly know it appears in datasets D₁ and D₃. The lookup is a single hash-table access - average-case O(1), fast!
Inverted index construction (Python):
inverted_index = {}
for dataset_id, words in all_datasets.items():  # all_datasets maps dataset IDs to their word sets
    for w in words:
        key = w.lower()  # normalize case so "Ithaca" and "ithaca" share one entry
        if key not in inverted_index:
            inverted_index[key] = []
        inverted_index[key].append(dataset_id)
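With the five toy datasets above, answering a keyword query is then a single dictionary lookup (the dataset IDs and outputs shown are simply what this sketch would produce):
all_datasets = {
    "D1": {"cornell", "university"},
    "D2": {"ithaca"},
    "D3": {"cornell", "ithaca"},
    "D4": {"Ithaca"},
    "D5": {"university"},
}
# After running the construction loop above:
# inverted_index.get("cornell")  ->  ['D1', 'D3']
# inverted_index.get("ithaca")   ->  ['D2', 'D3', 'D4']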
The Ranking Dilemma: Quality vs. Quantity
But wait - what if your search returns 1000 datasets? Let's look at two approaches:
Approach 1: Rank by Coverage
Simply list datasets by how much they cover your search topic. Like getting search results ranked by relevance score.
Problem: Information overload! You get everything but can't process it all.
Approach 2: Diversify by Coverage
This is where the mathematical elegance comes in. We want to maximize coverage while minimizing redundancy.
The Max-Cover Problem:
- Input: Datasets D₁, D₂, ..., Dₙ
- Goal: Choose a set S of k datasets that maximizes Cov(S) = |⋃ᵢ∈S Dᵢ|
- Translation: Pick k datasets that collectively cover the most unique topics
Simple Example:
If you search for "Obama" and want datasets covering different aspects:
- D₁ covers topics: {politics, speeches, healthcare}
- D₂ covers topics: {politics, foreign policy}
- D₃ covers topics: {healthcare, economics}
Choosing D₁ and D₂ gives coverage = {politics, speeches, healthcare, foreign policy} = 4 topics
Choosing D₁ and D₃ gives coverage = {politics, speeches, healthcare, economics} = 4 topics
Both are equally good mathematically, but provide different perspectives!
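Max-cover is NP-hard in general, but the classic greedy strategy - repeatedly pick the dataset that adds the most not-yet-covered topics - gives a well-known (1 - 1/e) approximation. Here is a minimal Python sketch of that greedy selection on the Obama example (the function name and data layout are illustrative):
def greedy_max_cover(datasets, k):
    """Greedily choose k datasets that maximize the number of covered topics."""
    covered = set()
    chosen = []
    remaining = dict(datasets)
    for _ in range(min(k, len(remaining))):
        # Pick the dataset contributing the most new topics.
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

datasets = {
    "D1": {"politics", "speeches", "healthcare"},
    "D2": {"politics", "foreign policy"},
    "D3": {"healthcare", "economics"},
}
print(greedy_max_cover(datasets, k=2))
# Ties break arbitrarily; one possible result:
# (['D1', 'D2'], {'politics', 'speeches', 'healthcare', 'foreign policy'})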
Query-by-Example: The FlashFill Revolution
What if instead of typing keywords, you could just show examples of what you want? Query-by-Example reframes search: provide sample records, and the system infers the underlying query pattern from those examples.
The FlashFill Demo Concept
Traditional approach: "Find me datasets about Chicago schools with ratings"
Query-by-Example approach:
School Name | Rating
---|---
Ogden International High School | Level 1 |
Hyde Park High School | Level 2 |
System Output:
long_name | program_group | overall_rating
---|---|---
Ogden International High School | International Baccalaureate | Level 1 |
Hyde Park Academy High School | General Education | Level 2 |
Chicago Academy High School | Honors | Level 1 |
Chicago High School for the Arts | Fine & Performing Arts | Level 1 |
When is this powerful?
- You have a specific information need
- The desired output format can be expressed through examples
- You know what you want but can't easily describe it in keywords
The Technical Challenge
The system needs to:
- Understand the pattern from your examples
- Search the inverted index to find datasets containing similar information
- Combine multiple datasets if needed (we'll get to this!)
Pseudocode for Query-by-Example:
function QueryByExample(examples):
    # Infer the schema and value pattern implied by the user's example rows
    pattern = ExtractPattern(examples)
    # Retrieve candidate datasets from the inverted index using the pattern's keywords
    candidateDatasets = SearchInvertedIndex(pattern.keywords)
    results = []
    for dataset in candidateDatasets:
        if MatchesPattern(dataset, pattern):
            results.append(dataset)
    return RankByRelevance(results)
Dataset Combination: When One Dataset Isn't Enough
Here's the caveat: "Relevant information is rarely present in a single table."
Consider this scenario - you want school ratings, but the data is split:
School Details Table:
Id | Name | Code | Program
---|---|---|---
1 | Ogden International High School | OIH123 | International Baccalaureate
2 | Hyde Park Academy High School | HPA12 | General Education
School Performance Table:
Id | Code | Rating
---|---|---
1 | OIH123 | Level 1
2 | GHA124 | Level 2
Join Operations
We need to join these tables on the Code field:
Hash join (Python):
def hash_join(table1, table2, join_key):
    # Build phase: hash every row of the first table by its join key.
    # Keeping a list per key handles duplicate join values correctly.
    hash_table = {}
    for row in table1:
        hash_table.setdefault(row[join_key], []).append(row)
    # Probe phase: look up each row of the second table and emit matches.
    results = []
    for row in table2:
        for match in hash_table.get(row[join_key], []):
            results.append({**match, **row})  # merge the two matching rows
    return results
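Running this sketch on the two school tables above (encoded here as lists of dicts) shows what the join produces - only the Code that appears in both tables, OIH123, yields a combined row:
details = [
    {"Id": 1, "Name": "Ogden International High School", "Code": "OIH123",
     "Program": "International Baccalaureate"},
    {"Id": 2, "Name": "Hyde Park Academy High School", "Code": "HPA12",
     "Program": "General Education"},
]
performance = [
    {"Id": 1, "Code": "OIH123", "Rating": "Level 1"},
    {"Id": 2, "Code": "GHA124", "Rating": "Level 2"},
]
joined = hash_join(details, performance, "Code")
# joined contains a single row combining Ogden's name, program, and its Level 1 rating.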
Union Operations
Sometimes you need to combine rows from different sources that have similar but not identical structures. This requires matching column headers and merging complementary information.
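As a rough illustration (the column names and mapping below are hypothetical), a union with header matching might look like this:
# Two sources describe schools under slightly different column names.
source_a = [{"School": "Ogden International High School", "Rating": "Level 1"}]
source_b = [{"school_name": "Chicago Academy High School", "overall_rating": "Level 1"}]

# Map the second source's headers onto the first source's schema, then stack the rows.
column_map_b = {"school_name": "School", "overall_rating": "Rating"}

unioned = list(source_a)
for row in source_b:
    unioned.append({column_map_b.get(col, col): val for col, val in row.items()})
# unioned now holds rows from both sources under the shared columns "School" and "Rating".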
Optimizing with Bloom Filters
Let's look at a clever optimization technique: Bloom Filters for faster joins.
The Problem
When joining large tables, we waste time processing rows that won't match. Can we quickly check if a key exists without storing the entire dataset in memory?
The Bloom Filter Solution
A Bloom Filter is a space-efficient probabilistic data structure that tells us:
- Definitely NOT in the set (100% accurate)
- Probably in the set (small chance of false positives)
Construction:
- Initialize bit array of length m, all zeros
- Choose k hash functions h₁, h₂, ..., hₖ
- For each element r in dataset R:
- Set bits h₁(r), h₂(r), ..., hₖ(r) to 1
Lookup:
- Return YES if ALL bits h₁(s), h₂(s), ..., hₖ(s) are 1
- Return NO otherwise
Example:
Bit Array: [0,0,0,1,0,0,1,0,1,0]
Positions: 0 1 2 3 4 5 6 7 8 9
To check if "Anna" exists:
h₁(Anna) = 6, h₂(Anna) = 8
Both positions 6 and 8 are 1 → "Probably exists"
False Positive Rate: (1 - e^(-k|R|/m))^k
Where:
- k = number of hash functions
- |R| = number of elements in set
- m = length of bit array
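To make this concrete, here is a minimal Bloom filter sketch in Python. Deriving the k hash functions by combining two digests (double hashing) is just one illustrative choice, and the parameters below are arbitrary:
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # length of the bit array
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions from two independent digests (double hashing).
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

# Build the filter over the join keys of one table, then use it to skip
# probe rows that cannot possibly match before doing the expensive join work.
bf = BloomFilter(m=1000, k=3)
for code in ["OIH123", "HPA12"]:
    bf.add(code)
print(bf.might_contain("OIH123"))   # True
print(bf.might_contain("ZZZ999"))   # almost certainly False

# Expected false-positive rate for |R| = 2 inserted keys: (1 - e^(-k|R|/m))^k
print((1 - math.exp(-3 * 2 / 1000)) ** 3)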
Federated Data Marketplaces: The Future
Let's envision a world where data discovery happens across federated marketplaces - imagine multiple organizations sharing dataset metadata profiles rather than raw data, facilitating cross-domain search while preserving confidentiality.
The Arrow Information Paradox
- Buyer: "I want to know what the data contains so I can decide how much to pay"
- Seller: "I won't share my data unless I get guaranteed payment"
The Solution: Metadata Profiles
Instead of sharing raw data, organizations share dataset profiles:
Dataset Profile:
├── Name = "Census Data 2023"
├── Rows = 3014
├── Columns = {id, name, age, income, ...}
├── Description = "Population census data..."
├── Synopses = "50% under age 60, 30% income >30K"
└── Privacy Level = "Anonymized"
This enables search without revealing sensitive information!
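Such a profile is straightforward to represent and exchange as structured metadata. The sketch below is a hypothetical encoding, not a standard format:
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    name: str
    rows: int
    columns: list
    description: str
    synopses: str
    privacy_level: str

profile = DatasetProfile(
    name="Census Data 2023",
    rows=3014,
    columns=["id", "name", "age", "income"],
    description="Population census data...",
    synopses="50% under age 60, 30% income >30K",
    privacy_level="Anonymized",
)
# A marketplace can index and search these profiles without exposing the raw rows.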
Key Takeaways
- Inverted indexes make keyword search fast - average-case O(1) lookup time
- Coverage-based ranking prevents information overload through smart selection
- Query-by-example lets users show what they want instead of describing it
- Dataset combination through joins and unions handles real-world data fragmentation
- Bloom filters optimize operations with probabilistic efficiency
- Federated marketplaces enable privacy-preserving data discovery
The next time you search for data, remember: behind that simple search box lies a sophisticated infrastructure making sense of our data-rich world. As Professor Sainyam often says, the key isn't just finding data, but finding the right data efficiently and intelligently.