What if data management worked like a democracy? In a world drowning in information, not all data deserves eternal storage. This is the story of Selective Analytics - a democratic approach where data “campaigns” for permanent storage, and only the most valuable information gets “elected” to stay long-term.
The Challenge: 500GB Daily, 20GB Storage Limit
Picture this scenario: your system generates 500GB of web traffic logs daily. Millions of requests flowing through your infrastructure every minute. Storage costs are exploding. Query performance is degrading over time. But you still need real-time analytics and can only afford 20GB of total storage.
Sounds impossible? Selective Analytics treats data like citizens in a democracy, where information must earn its right to permanent storage through consistent value and activity.
The Democratic Data Solution
The core idea behind Selective Analytics is beautifully simple: treat data management like a democracy. Not all data is created equal, so why store it all equally?
Instead of keeping everything forever or arbitrarily deleting based on age, we create a system where:
- Data “campaigns” for long-term storage through activity and relevance
- Regular “elections” determine which information deserves permanent seats
- Storage limits are enforced democratically - when capacity is reached, the least active data loses its place
- The system automatically adapts to changing patterns and priorities
Foundation: The Hypertable
First, we need a foundation that can handle massive data ingestion. TimescaleDB’s hypertables provide automatic partitioning by time:
-- Simple table structure for web logs
CREATE TABLE "logs" (
"time" timestamp with time zone NOT NULL,
"domain" text NOT NULL
);
-- Convert to hypertable with 1-minute chunks
SELECT create_hypertable('logs', 'time',
chunk_time_interval => INTERVAL '1 minute',
if_not_exists => TRUE
);
-- Automatic retention: only keep raw logs for 5 minutes
SELECT add_retention_policy('logs', INTERVAL '5 minutes',
schedule_interval => INTERVAL '1 minute');
This gives us a table that can ingest millions of rows per minute while automatically cleaning up old data. The raw logs exist just long enough to be processed into our aggregation pipeline.
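If you want to sanity-check the chunking and retention behavior, TimescaleDB's informational views show what exists at any moment. A quick look using the standard timescaledb_information.chunks view:
-- Inspect the most recent chunks backing the logs hypertable
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'logs'
ORDER BY range_start DESC
LIMIT 5;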
The Aggregation Pipeline
The magic happens with the aggregation system:
Real-time Aggregation (1-minute buckets)
CREATE MATERIALIZED VIEW website_stats_1m
WITH (timescaledb.continuous, timescaledb.materialized_only=true) AS
SELECT
time_bucket('1 minute', time) as bucket,
domain,
count(*) as total
FROM logs
GROUP BY 1,2
WITH NO DATA;
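With the 1-minute view in place, real-time dashboards can query it directly instead of touching the raw logs. A minimal example against the columns defined above:
-- Top 10 busiest domains over the last 5 minutes
SELECT domain, sum(total) AS hits
FROM website_stats_1m
WHERE bucket >= now() - INTERVAL '5 minutes'
GROUP BY domain
ORDER BY hits DESC
LIMIT 10;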
Hourly Summaries with Statistics
Hierarchical aggregation lets us reuse the same timescaledb.continuous strategy, but instead of reading the raw data, the next level reads the minute-level aggregates. In this case we'll build hourly summaries, enriching them with stats_agg for statistics such as slope, average, and standard deviation, and with percentile_agg, which exposes the value distribution, for example approx_percentile(0.5, percentile_agg) as the median, among other statistics.
CREATE MATERIALIZED VIEW website_stats_1h
WITH (timescaledb.continuous, timescaledb.materialized_only=false) AS
SELECT
time_bucket('1 hour', bucket) as bucket,
domain,
sum(total) as total,
stats_agg(total) as stats_agg,
percentile_agg(total) as percentile_agg
FROM website_stats_1m
GROUP BY 1,2
WITH NO DATA;
Percentiles are available in native PostgreSQL, but Timescale provides optimized methods for time-series data through the Toolkit hyperfunctions. Note this guidance:
Think about the types of percentiles you’re most interested in. tdigest is optimized for more accurate estimates at the extremes, and less accurate estimates near the median. If your workflow involves estimating ninety-ninth percentiles, then choose tdigest. If you’re more concerned about getting highly accurate median estimates, choose uddsketch.
See more on the official Tiger Data Docs.
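As a sketch of how these aggregates are read back, assuming the Toolkit accessor functions average(), stddev(), and approx_percentile() are installed alongside the hyperfunctions used above:
-- Hourly traffic profile for one domain over the last day
SELECT
  bucket,
  total,
  average(stats_agg) AS avg_per_minute,
  stddev(stats_agg) AS stddev_per_minute,
  approx_percentile(0.5, percentile_agg) AS median_per_minute
FROM website_stats_1h
WHERE domain = 'google.com'
  AND bucket >= now() - INTERVAL '24 hours'
ORDER BY bucket;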
Daily Analytics
CREATE MATERIALIZED VIEW website_stats_1d
WITH (timescaledb.continuous, timescaledb.materialized_only=false) AS
SELECT
time_bucket('1 day', bucket) as bucket,
domain,
sum(total) as total,
rollup(stats_agg) as stats_agg,
rollup(percentile_agg) as percentile_agg
FROM website_stats_1h
GROUP BY 1,2
WITH NO DATA;
The rollup merges the underlying statistical summaries, so you get the true average across the whole period rather than an average of averages. This avoids the distortion you would get from a plain avg(avg_column), which weights every hour equally regardless of how many data points it contains.
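As a quick illustration, and assuming the same accessor functions as above, reading the rolled-up daily summary yields the true per-minute statistics for the entire day:
-- Daily per-minute statistics computed from the rolled-up aggregates
SELECT
  bucket,
  domain,
  total,
  average(stats_agg) AS true_avg_per_minute,
  approx_percentile(0.99, percentile_agg) AS p99_per_minute
FROM website_stats_1d
WHERE bucket >= now() - INTERVAL '7 days'
ORDER BY bucket, domain;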
Continuous Aggregation Policies
The system runs automatically with these policies:
-- Refresh 1-minute aggregates every 30 seconds
SELECT add_continuous_aggregate_policy('website_stats_1m',
start_offset => INTERVAL '5 minutes',
end_offset => INTERVAL '30 seconds',
schedule_interval => INTERVAL '30 seconds');
-- Refresh hourly aggregates every 5 minutes
SELECT add_continuous_aggregate_policy('website_stats_1h',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '5 minutes',
schedule_interval => INTERVAL '5 minutes');
-- Refresh daily aggregates every hour
SELECT add_continuous_aggregate_policy('website_stats_1d',
start_offset => INTERVAL '7 days',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
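Once created, the policies run as background jobs, and you can confirm they were registered through the standard timescaledb_information.jobs view:
-- List all background jobs, including the three refresh policies added above
SELECT job_id, application_name, schedule_interval, next_start
FROM timescaledb_information.jobs
ORDER BY job_id;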
The Election System
Here’s how the democratic “election” process works:
- Raw logs arrive (500GB daily) and get processed immediately
- Candidates emerge - websites that meet minimum traffic thresholds
- Elections happen regularly - the most active websites get “voted” into long-term storage
- Winners get permanent seats while others are cycled out
- Storage stays constant at around 20GB regardless of input volume
The Candidate Election Background Workers
The selective analytics system runs on two coordinated background jobs that automate the entire election process:
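Both jobs read and write two small supporting tables next to the config table. Their exact DDL lives in the repository; a minimal sketch, inferred from how the procedures below use them, could look like this:
-- Candidate pool: one row per (domain, bucket time) that met the threshold
CREATE TABLE top_websites_candidates (
  domain text NOT NULL,
  time timestamptz NOT NULL,
  hits bigint NOT NULL,
  PRIMARY KEY (domain, time) -- required by the ON CONFLICT (domain, time) upsert
);
-- Elected winners: one row per domain with its election history
CREATE TABLE top_websites_longterm (
  domain text PRIMARY KEY, -- required by the ON CONFLICT (domain) upsert
  first_seen timestamptz NOT NULL,
  last_seen timestamptz NOT NULL,
  total_hits bigint NOT NULL,
  last_election_hits bigint NOT NULL,
  times_elected integer NOT NULL DEFAULT 1
);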
Job 1: Candidate Population Worker
This job runs every 6 minutes and identifies potential candidates:
CREATE OR REPLACE PROCEDURE populate_website_candidates_job(job_id INTEGER, config JSONB)
LANGUAGE plpgsql AS $$
DECLARE
min_hits INTEGER;
v_election_window INTERVAL;
candidate_count INTEGER;
BEGIN
-- Get configuration from our config table
SELECT min_hits_threshold, election_window
INTO min_hits, v_election_window
FROM top_websites_config
LIMIT 1;
-- Find domains that qualify as candidates
INSERT INTO top_websites_candidates (domain, time, hits)
SELECT
domain,
MAX(bucket) AS time,
SUM(total) AS hits
FROM website_stats_1m
WHERE bucket >= NOW() - v_election_window -- Look at recent activity within the configured election window
GROUP BY domain
HAVING SUM(total) >= min_hits -- Must meet minimum threshold (default: 100 hits)
ON CONFLICT (domain, time) DO UPDATE SET hits = EXCLUDED.hits;
GET DIAGNOSTICS candidate_count = ROW_COUNT;
RAISE NOTICE 'Populated % website candidates.', candidate_count;
END;
$$;
This is very efficient because election decisions are made against the 1-minute aggregations rather than the raw logs. Once candidates are identified, the raw data can be safely discarded.
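You can also exercise the procedure by hand before wiring it into the scheduler; the job_id and config arguments only matter to the scheduler, so placeholder values are fine (this assumes a row already exists in top_websites_config):
-- Run one candidate-population pass manually and inspect the result
CALL populate_website_candidates_job(0, '{}'::jsonb);
SELECT domain, time, hits
FROM top_websites_candidates
ORDER BY hits DESC
LIMIT 10;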
Job 2: Election Worker
This job runs every hour and promotes the best candidates to long-term storage:
CREATE OR REPLACE PROCEDURE elect_top_websites(job_id INTEGER, config JSONB)
LANGUAGE plpgsql AS $$
DECLARE
max_entries INTEGER;
retention_period INTERVAL;
BEGIN
-- Get configuration
SELECT max_longterm_entries, longterm_retention_period
INTO max_entries, retention_period
FROM top_websites_config
LIMIT 1;
-- The actual "election" - promote candidates to long-term storage
INSERT INTO top_websites_longterm (
domain, first_seen, last_seen, total_hits, last_election_hits, times_elected
)
SELECT
domain,
time AS first_seen,
time AS last_seen,
hits AS total_hits,
hits AS last_election_hits,
1 AS times_elected
FROM top_websites_candidates tc
WHERE tc.time = (SELECT MAX(time) FROM top_websites_candidates WHERE domain = tc.domain)
ON CONFLICT (domain) DO UPDATE SET
last_seen = EXCLUDED.last_seen,
total_hits = top_websites_longterm.total_hits + EXCLUDED.total_hits,
last_election_hits = EXCLUDED.last_election_hits,
times_elected = top_websites_longterm.times_elected + 1;
-- Enforce storage limits - remove oldest entries if we exceed max_entries
DELETE FROM top_websites_longterm
WHERE domain IN (
SELECT domain FROM top_websites_longterm
ORDER BY last_seen DESC -- keep the most recently active domains; anything beyond max_entries is removed
OFFSET max_entries
);
-- Clean up old candidates
DELETE FROM top_websites_candidates
WHERE time < now() - retention_period;
RAISE NOTICE 'Election complete. % websites in long-term storage.',
(SELECT COUNT(*) FROM top_websites_longterm);
END;
$$;
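After a few election cycles, the long-term table tells the story at a glance:
-- Current "parliament": domains ranked by how often they were re-elected
SELECT domain, times_elected, total_hits, first_seen, last_seen
FROM top_websites_longterm
ORDER BY times_elected DESC, total_hits DESC
LIMIT 20;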
Scheduling the Democratic Process
The beauty is in the automated scheduling:
CREATE OR REPLACE PROCEDURE schedule_candidates_election()
LANGUAGE plpgsql AS $$
DECLARE
config_record RECORD;
BEGIN
-- Get our election configuration
SELECT * INTO config_record FROM top_websites_config LIMIT 1;
-- Schedule candidate population (every 6 minutes)
PERFORM add_job('populate_website_candidates_job',
(config_record.election_schedule_interval / 10)::interval,
initial_start => NOW() + INTERVAL '10 seconds'
);
-- Schedule elections (every hour)
PERFORM add_job('elect_top_websites',
config_record.election_schedule_interval,
initial_start => NOW() + INTERVAL '1 minute'
);
END;
$$;
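Starting the whole process is a single call, and the resulting jobs appear in the same jobs view used earlier:
-- Kick off the democratic process
CALL schedule_candidates_election();
-- Verify that both jobs were registered
SELECT job_id, proc_name, schedule_interval, next_start
FROM timescaledb_information.jobs
WHERE proc_name IN ('populate_website_candidates_job', 'elect_top_websites');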
The Democratic Configuration
Everything is configurable through a simple config table:
CREATE TABLE top_websites_config (
min_hits_threshold integer DEFAULT 100, -- Votes needed to become candidate
election_window interval DEFAULT '1 hour', -- How far back to look for votes
max_longterm_entries integer DEFAULT 1000, -- Maximum "senate seats"
longterm_retention_period interval DEFAULT '7 days', -- How long candidate rows are kept (read by elect_top_websites; default here is illustrative)
election_schedule_interval interval DEFAULT '1 hour' -- How often elections run
);
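The jobs read their settings from this table, so it needs exactly one row. Seeding it with the defaults (or your own values) is enough, for example:
-- Seed the election rules: 100 hits to become a candidate, hourly elections,
-- at most 1000 long-term seats
INSERT INTO top_websites_config (
  min_hits_threshold, election_window, max_longterm_entries,
  longterm_retention_period, election_schedule_interval
) VALUES (100, INTERVAL '1 hour', 1000, INTERVAL '7 days', INTERVAL '1 hour');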
This creates a completely autonomous system where:
- Candidates are nominated every 6 minutes based on recent activity
- Elections happen every hour to promote the most active domains
- Term limits exist - inactive domains eventually lose their seats
- Storage is capped - only the top 1000 domains get permanent storage
The Three-Tier Strategy in Action
Level 1: Raw Logs (5 minutes) - Everything comes in, gets processed for immediate analytics, then gets discarded. Think of it as election day counting.
Level 2: Aggregated Data (7-90 days) - Useful statistics and trends, kept for medium-term analysis. These are our “campaign records.”
Level 3: Elected Champions (permanent) - Only the top performers make it here. These domains proved their worth through consistent traffic and earned their permanent seat.
Generating Realistic Test Data
One of the most powerful aspects of this system is the ability to test it with realistic data using TimescaleDB’s background job scheduler:
-- Procedure to generate synthetic log data every second
CREATE OR REPLACE PROCEDURE generate_log_data_job(job_id_param int, config_param jsonb)
LANGUAGE plpgsql AS $$
DECLARE
amount INTEGER;
domain TEXT;
use_specific_domain BOOLEAN;
BEGIN
-- Get configuration from JSONB
amount := (config_param->>'amount')::INTEGER;
domain := config_param->>'domain';
use_specific_domain := COALESCE((config_param->>'use_specific_domain')::BOOLEAN, false);
-- Generate logs based on configuration
IF use_specific_domain THEN
-- Generate logs for specific domain (like google.com)
INSERT INTO logs (time, domain)
SELECT
now() - (random() * interval '1 minute'),
domain
FROM generate_series(1, amount);
RAISE NOTICE 'Generated % log entries for % at %', amount, domain, now();
ELSE
-- Generate logs for random domains
INSERT INTO logs (time, domain)
SELECT
now() - (random() * interval '1 minute'),
'random-domain-' || floor(random() * 1000)::text
FROM generate_series(1, amount);
RAISE NOTICE 'Generated % random log entries at %', amount, now();
END IF;
END;
$$;
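For background noise you can schedule the same procedure without a specific domain, so it takes the random-domain branch (the volume here is illustrative):
-- Generate ~500 random-domain hits every second
SELECT add_job(
  'generate_log_data_job',
  INTERVAL '1 second',
  config => jsonb_build_object('amount', 500)
);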
You can simulate realistic traffic patterns:
-- Schedule data generation for top websites (every 5 seconds)
CREATE OR REPLACE PROCEDURE schedule_top_websites_generation(avg_hits_per_minute int DEFAULT 30)
LANGUAGE plpgsql AS $$
DECLARE
top_websites text[] := ARRAY['google.com', 'facebook.com', 'youtube.com', 'twitter.com', 'instagram.com'];
website text;
amount int;
BEGIN
FOREACH website IN ARRAY top_websites LOOP
amount := avg_hits_per_minute + (random() * avg_hits_per_minute * 0.2)::integer;
-- Add a background job for each website
PERFORM add_job(
'generate_log_data_job',
INTERVAL '5 seconds',
config => jsonb_build_object(
'use_specific_domain', true,
'domain', website,
'amount', amount
)
);
END LOOP;
RAISE NOTICE 'Scheduled generation for % top websites', array_length(top_websites, 1);
END;
$$;
Want to simulate 500GB daily? Just call:
-- This will generate approximately 1 million hits per minute!
CALL schedule_additional_volume_generation(1000000);
The Remarkable Results
The numbers speak for themselves:
- 99.6% storage reduction (500GB → 20GB)
- Real-time analytics with <5ms query latency
- Automatic identification of top-performing websites
- Scalable regardless of input volume
- Cost-controlled storage that doesn’t grow exponentially
Beyond Web Analytics
The beauty of Selective Analytics isn’t limited to web traffic. This democratic approach can be applied to any high-volume data scenario:
IoT Sensor Data
- Millions of sensor readings per hour
- Only anomalies and trend changes get permanent storage
- Normal readings are aggregated and eventually discarded
Financial Transactions
- High-frequency trading data
- Significant price movements earn long-term storage
- Routine transactions are summarized and archived
Application Logs
- Massive log volumes from distributed systems
- Error patterns and performance anomalies get preserved
- Routine debug information is temporary
The Technical Implementation
For those interested in the complete implementation, the system leverages:
- TimescaleDB hypertables for automatic partitioning and scalability
- Continuous aggregates for real-time summary generation
- Retention policies for automatic cleanup of raw data
- Background jobs for the democratic election process
- Materialized results for lightning-fast analytics queries
The elegance lies in the automation - once configured, the system self-manages, automatically identifying what deserves long-term storage while maintaining consistent performance and predictable costs.
Philosophical Implications
What excites me most about Selective Analytics isn’t just the technical solution, but the paradigm it represents. In our data-driven world, we need democratic approaches to decide what’s worth preserving.
This principle challenges us to think differently about data:
- Not everything deserves eternal storage
- Value should be demonstrated, not assumed
- Systems should adapt to changing patterns
- Storage decisions should be data-driven, not arbitrary
Open Source Implementation
True to the spirit of democratic systems, the complete implementation is open source and available in the timescaledb-templates repository under the webtop
folder.
You can adapt these concepts to your own data challenges, whether you’re dealing with:
- Web analytics at scale
- IoT sensor networks
- Financial time series
- Application monitoring
- Social media analytics
Getting Started
Ready to implement Selective Analytics in your system? Here’s how to begin:
- Assess your data volumes - How much data are you generating vs. storing?
- Identify value patterns - What makes some data more valuable than others?
- Design your tiers - What aggregation levels make sense for your use case?
- Configure the elections - How often should the system reevaluate what to keep?
- Test with realistic data - Use background jobs to simulate your actual traffic patterns
The next election is always around the corner - make sure your data gets the vote it deserves! 🗳️
Facing your own data volume challenges? Want to implement selective analytics in your system? I’d love to hear about your use case! Connect with me on LinkedIn or check out my other articles on data engineering and analytics.