Data Discovery & Search Infrastructure: Part 1
From Keywords to Smart Examples
How do we find the right data in a world drowning in datasets?
Imagine you're looking for information about schools in your city. You could Google "Chicago schools," but what if you want something very specific - like schools similar to "Ogden International High School" with "International Baccalaureate programs" and "Level 1 ratings"? This is where Data Discovery & Search Infrastructure makes these targeted searches possible.
The Two-Layer Foundation: Index + Interface
In Spring 2025, Professor Sainyam’s “Data to Decisions” course introduced two core components of any data search system.
Think of it like a library: the Data Index is like the organized shelves of books, while the Search Interface is like the librarian who helps you find exactly what you need.
But here's the challenge - we're not just dealing with a few hundred books, we're dealing with millions of datasets scattered across the internet! Real-world implementations therefore demand highly scalable architectures.
The Keyword Search Revolution: Enter the Inverted Index
The Problem
Let's say you have these datasets:
- D₁ = {cornell, university}
- D₂ = {ithaca}
- D₃ = {cornell, ithaca}
- D₄ = {Ithaca}
- D₅ = {university}
If someone searches for "cornell," the naive approach would check each dataset individually. With millions of datasets, this becomes impossibly slow - like reading every single book in the library to find mentions of "cornell."
The Solution: Inverted Index
Instead, we flip the problem on its head and create an inverted index:
{
cornell: [D₁, D₃],
ithaca: [D₂, D₃, D₄],
university: [D₁, D₅]
}
Now when someone searches for "cornell," we instantly know it appears in datasets D₁ and D₃. The lookup is a single hash-table access - average-case O(1), fast!
Inverted index construction (Python):
inverted_index = {}
for dataset_id, words in all_datasets.items():  # all_datasets maps dataset IDs to their word sets
    for w in words:
        key = w.lower()  # normalize case so "Ithaca" and "ithaca" share one entry
        if key not in inverted_index:
            inverted_index[key] = []
        inverted_index[key].append(dataset_id)
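With the five toy datasets above, answering a keyword query is then a single dictionary lookup (the dataset IDs and outputs shown are simply what this sketch would produce):
all_datasets = {
    "D1": {"cornell", "university"},
    "D2": {"ithaca"},
    "D3": {"cornell", "ithaca"},
    "D4": {"Ithaca"},
    "D5": {"university"},
}
# After running the construction loop above:
# inverted_index.get("cornell")  ->  ['D1', 'D3']
# inverted_index.get("ithaca")   ->  ['D2', 'D3', 'D4']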
The Ranking Dilemma: Quality vs. Quantity
But wait - what if your search returns 1000 datasets? Let's look at two approaches:
Approach 1: Rank by Coverage
Simply list datasets by how much they cover your search topic. Like getting search results ranked by relevance score.
Problem: Information overload! You get everything but can't process it all.
Approach 2: Diversify by Coverage
This is where the mathematical elegance comes in. We want to maximize coverage while minimizing redundancy.
The Max-Cover Problem:
- Input: Datasets D₁, D₂, ..., Dₙ
- Goal: Choose a set S of k datasets that maximizes Cov(S) = |⋃ᵢ∈S Dᵢ|
- Translation: Pick k datasets that collectively cover the most unique topics
Simple Example:
If you search for "Obama" and want datasets covering different aspects:
- D₁ covers topics: {politics, speeches, healthcare}
- D₂ covers topics: {politics, foreign policy}
- D₃ covers topics: {healthcare, economics}
Choosing D₁ and D₂ gives coverage = {politics, speeches, healthcare, foreign policy} = 4 topics
Choosing D₁ and D₃ gives coverage = {politics, speeches, healthcare, economics} = 4 topics
Both are equally good mathematically, but provide different perspectives!
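Max-cover is NP-hard in general, but the classic greedy strategy - repeatedly pick the dataset that adds the most not-yet-covered topics - gives a well-known (1 - 1/e) approximation. Here is a minimal Python sketch of that greedy selection on the Obama example (the function name and data layout are illustrative):
def greedy_max_cover(datasets, k):
    """Greedily choose k datasets that maximize the number of covered topics."""
    covered = set()
    chosen = []
    remaining = dict(datasets)
    for _ in range(min(k, len(remaining))):
        # Pick the dataset contributing the most new topics.
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

datasets = {
    "D1": {"politics", "speeches", "healthcare"},
    "D2": {"politics", "foreign policy"},
    "D3": {"healthcare", "economics"},
}
print(greedy_max_cover(datasets, k=2))
# Ties break arbitrarily; one possible result:
# (['D1', 'D2'], {'politics', 'speeches', 'healthcare', 'foreign policy'})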
Query-by-Example: The FlashFill Revolution
What if instead of typing keywords, you could just show examples of what you want? Query-by-Example reframes search: provide sample records, and the system infers the underlying query pattern from those examples.
The FlashFill Demo Concept
Traditional approach: "Find me datasets about Chicago schools with ratings"
Query-by-Example approach:
School Name | Rating
---|---
Ogden International High School | Level 1 |
Hyde Park High School | Level 2 |
System Output:
long_name | program_group | overall_rating
---|---|---
Ogden International High School | International Baccalaureate | Level 1 |
Hyde Park Academy High School | General Education | Level 2 |
Chicago Academy High School | Honors | Level 1 |
Chicago High School for the Arts | Fine & Performing Arts | Level 1 |
When is this powerful?
- You have a specific information need
- The desired output format can be expressed through examples
- You know what you want but can't easily describe it in keywords
The Technical Challenge
The system needs to:
- Understand the pattern from your examples
- Search the inverted index to find datasets containing similar information
- Combine multiple datasets if needed (we'll get to this!)
Pseudocode for Query-by-Example:
function QueryByExample(examples):
    # Infer the schema and value pattern implied by the user's example rows
    pattern = ExtractPattern(examples)
    # Retrieve candidate datasets from the inverted index using the pattern's keywords
    candidateDatasets = SearchInvertedIndex(pattern.keywords)
    results = []
    for dataset in candidateDatasets:
        if MatchesPattern(dataset, pattern):
            results.append(dataset)
    return RankByRelevance(results)
Dataset Combination: When One Dataset Isn't Enough
Here's the caveat: "Relevant information is rarely present in a single table."
Consider this scenario - you want school ratings, but the data is split:
School Details Table:
Id | Name | Code | Program
---|---|---|---
1 | Ogden International High School | OIH123 | International Baccalaureate
2 | Hyde Park Academy High School | HPA12 | General Education
School Performance Table:
Id | Code | Rating
---|---|---
1 | OIH123 | Level 1
2 | GHA124 | Level 2
Join Operations
We need to join these tables on the Code field:
Hash join (Python):
def hash_join(table1, table2, join_key):
    # Build phase: hash every row of the first table by its join key.
    # Keeping a list per key handles duplicate join values correctly.
    hash_table = {}
    for row in table1:
        hash_table.setdefault(row[join_key], []).append(row)
    # Probe phase: look up each row of the second table and emit matches.
    results = []
    for row in table2:
        for match in hash_table.get(row[join_key], []):
            results.append({**match, **row})  # merge the two matching rows
    return results
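Running this sketch on the two school tables above (encoded here as lists of dicts) shows what the join produces - only the Code that appears in both tables, OIH123, yields a combined row:
details = [
    {"Id": 1, "Name": "Ogden International High School", "Code": "OIH123",
     "Program": "International Baccalaureate"},
    {"Id": 2, "Name": "Hyde Park Academy High School", "Code": "HPA12",
     "Program": "General Education"},
]
performance = [
    {"Id": 1, "Code": "OIH123", "Rating": "Level 1"},
    {"Id": 2, "Code": "GHA124", "Rating": "Level 2"},
]
joined = hash_join(details, performance, "Code")
# joined contains a single row combining Ogden's name, program, and its Level 1 rating.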
Union Operations
Sometimes you need to combine rows from different sources that have similar but not identical structures. This requires matching column headers and merging complementary information.
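As a rough illustration (the column names and mapping below are hypothetical), a union with header matching might look like this:
# Two sources describe schools under slightly different column names.
source_a = [{"School": "Ogden International High School", "Rating": "Level 1"}]
source_b = [{"school_name": "Chicago Academy High School", "overall_rating": "Level 1"}]

# Map the second source's headers onto the first source's schema, then stack the rows.
column_map_b = {"school_name": "School", "overall_rating": "Rating"}

unioned = list(source_a)
for row in source_b:
    unioned.append({column_map_b.get(col, col): val for col, val in row.items()})
# unioned now holds rows from both sources under the shared columns "School" and "Rating".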
Optimizing with Bloom Filters
Let's look at a clever optimization technique: Bloom Filters for faster joins.
The Problem
When joining large tables, we waste time processing rows that won't match. Can we quickly check if a key exists without storing the entire dataset in memory?
The Bloom Filter Solution
A Bloom Filter is a space-efficient probabilistic data structure that tells us:
- Definitely NOT in the set (100% accurate)
- Probably in the set (small chance of false positives)
Construction:
- Initialize bit array of length m, all zeros
- Choose k hash functions h₁, h₂, ..., hₖ
- For each element r in dataset R:
- Set bits h₁(r), h₂(r), ..., hₖ(r) to 1
Lookup:
- Return YES if ALL bits h₁(s), h₂(s), ..., hₖ(s) are 1
- Return NO otherwise
Example:
Bit Array: [0,0,0,1,0,0,1,0,1,0]
Positions: 0 1 2 3 4 5 6 7 8 9
To check if "Anna" exists:
h₁(Anna) = 6, h₂(Anna) = 8
Both positions 6 and 8 are 1 → "Probably exists"
False Positive Rate: (1 - e^(-k|R|/m))^k
Where:
- k = number of hash functions
- |R| = number of elements in set
- m = length of bit array
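To make this concrete, here is a minimal Bloom filter sketch in Python. Deriving the k hash functions by combining two digests (double hashing) is just one illustrative choice, and the parameters below are arbitrary:
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # length of the bit array
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions from two independent digests (double hashing).
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

# Build the filter over the join keys of one table, then use it to skip
# probe rows that cannot possibly match before doing the expensive join work.
bf = BloomFilter(m=1000, k=3)
for code in ["OIH123", "HPA12"]:
    bf.add(code)
print(bf.might_contain("OIH123"))   # True
print(bf.might_contain("ZZZ999"))   # almost certainly False

# Expected false-positive rate for |R| = 2 inserted keys: (1 - e^(-k|R|/m))^k
print((1 - math.exp(-3 * 2 / 1000)) ** 3)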
Federated Data Marketplaces: The Future
Let's envision a world where data discovery happens across federated marketplaces - imagine multiple organizations sharing dataset metadata profiles rather than raw data, facilitating cross-domain search while preserving confidentiality.
The Arrow Information Paradox
- Buyer: "I want to know what the data contains so I can decide how much to pay"
- Seller: "I won't share my data unless I get guaranteed payment"
The Solution: Metadata Profiles
Instead of sharing raw data, organizations share dataset profiles:
Dataset Profile:
├── Name = "Census Data 2023"
├── Rows = 3014
├── Columns = {id, name, age, income, ...}
├── Description = "Population census data..."
├── Synopses = "50% under age 60, 30% income >30K"
└── Privacy Level = "Anonymized"
This enables search without revealing sensitive information!
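Such a profile is straightforward to represent and exchange as structured metadata. The sketch below is a hypothetical encoding, not a standard format:
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    name: str
    rows: int
    columns: list
    description: str
    synopses: str
    privacy_level: str

profile = DatasetProfile(
    name="Census Data 2023",
    rows=3014,
    columns=["id", "name", "age", "income"],
    description="Population census data...",
    synopses="50% under age 60, 30% income >30K",
    privacy_level="Anonymized",
)
# A marketplace can index and search these profiles without exposing the raw rows.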
Key Takeaways
- Inverted indexes make keyword search fast - average-case O(1) lookup time
- Coverage-based ranking prevents information overload through smart selection
- Query-by-example lets users show what they want instead of describing it
- Dataset combination through joins and unions handles real-world data fragmentation
- Bloom filters optimize operations with probabilistic efficiency
- Federated marketplaces enable privacy-preserving data discovery
The next time you search for data, remember: behind that simple search box lies a sophisticated infrastructure making sense of our data-rich world. As Professor Sainyam often says, the key isn't just finding data, but finding the right data efficiently and intelligently.