---
name: clickhouse-io
description: ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
author: affaan-m
version: "1.0"
---

# ClickHouse Analytics Patterns

ClickHouse-specific patterns for high-performance analytics and data engineering.

## Overview

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP). It's optimized for fast analytical queries on large datasets.

**Key Features:**
- Column-oriented storage
- Data compression
- Parallel query execution
- Distributed queries
- Real-time analytics

## Table Design Patterns

### MergeTree Engine (Most Common)

```sql
CREATE TABLE markets_analytics (
    date Date,
    market_id String,
    market_name String,
    volume UInt64,
    trades UInt32,
    unique_traders UInt32,
    avg_trade_size Float64,
    created_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, market_id)
SETTINGS index_granularity = 8192;
```

### ReplacingMergeTree (Deduplication)

```sql
-- For data that may have duplicates (e.g., from multiple sources).
-- Rows are deduplicated by the ORDER BY key, and only at merge time,
-- so keep volatile columns (like timestamps) out of the key.
CREATE TABLE user_events (
    event_id String,
    user_id String,
    event_type String,
    timestamp DateTime,
    properties String
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, event_id)
PRIMARY KEY (user_id, event_id);
```

### AggregatingMergeTree (Pre-aggregation)

```sql
-- For maintaining aggregated metrics
CREATE TABLE market_stats_hourly (
    hour DateTime,
    market_id String,
    total_volume AggregateFunction(sum, UInt64),
    total_trades AggregateFunction(count),
    unique_users AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, market_id);

-- Query aggregated data
SELECT
    hour,
    market_id,
    sumMerge(total_volume) AS volume,
    countMerge(total_trades) AS trades,
    uniqMerge(unique_users) AS users
FROM market_stats_hourly
WHERE hour >= toStartOfHour(now() - INTERVAL 24 HOUR)
GROUP BY hour, market_id
ORDER BY hour DESC;
```
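For readers new to partial aggregate states, here is a plain-TypeScript analogy (not ClickHouse code): a `uniq` state behaves like a set of values, and `-Merge` is set union. This is why per-bucket unique counts cannot simply be summed, and why the table stores states instead of finished numbers.

```typescript
// Stand-in for uniqState/uniqMerge: per-bucket Sets are the "partial
// states"; merging them (set union) gives exact distinct counts, while
// summing per-bucket sizes overcounts users active in several buckets.
function uniqState(values: string[]): Set<string> {
  return new Set(values)
}

function uniqMerge(states: Set<string>[]): number {
  const merged = new Set<string>()
  for (const s of states) for (const v of s) merged.add(v)
  return merged.size
}

const hour1 = uniqState(['alice', 'bob'])
const hour2 = uniqState(['bob', 'carol'])

const summed = hour1.size + hour2.size   // 4 — wrong as a daily unique count
const merged = uniqMerge([hour1, hour2]) // 3 — what uniqMerge would return
```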
## Query Optimization Patterns

### Efficient Filtering

```sql
-- ✅ GOOD: Filter on the ordering-key columns (date, market_id)
-- so the primary index can prune granules
SELECT *
FROM markets_analytics
WHERE date >= '2025-01-01'
  AND market_id = 'market-123'
  AND volume > 1000
ORDER BY date DESC
LIMIT 100;

-- ❌ BAD: No filter on market_id, and a leading-wildcard LIKE
-- cannot use the primary index, forcing a scan of matching partitions
SELECT *
FROM markets_analytics
WHERE volume > 1000
  AND market_name LIKE '%election%'
  AND date >= '2025-01-01';
```

### Aggregations

```sql
-- ✅ GOOD: Use ClickHouse-specific aggregation functions
SELECT
    toStartOfDay(created_at) AS day,
    market_id,
    sum(volume) AS total_volume,
    count() AS total_trades,
    uniq(trader_id) AS unique_traders,
    avg(trade_size) AS avg_size
FROM trades
WHERE created_at >= today() - INTERVAL 7 DAY
GROUP BY day, market_id
ORDER BY day DESC, total_volume DESC;

-- ✅ Use quantile for approximate percentiles
-- (fast; use quantileExact when precision matters)
SELECT
    quantile(0.50)(trade_size) AS median,
    quantile(0.95)(trade_size) AS p95,
    quantile(0.99)(trade_size) AS p99
FROM trades
WHERE created_at >= now() - INTERVAL 1 HOUR;
```

### Window Functions

```sql
-- Calculate running totals
SELECT
    date,
    market_id,
    volume,
    sum(volume) OVER (
        PARTITION BY market_id
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_volume
FROM markets_analytics
WHERE date >= today() - INTERVAL 30 DAY
ORDER BY market_id, date;
```
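For sanity-checking running-total results in application tests, the same computation can be done client-side. A minimal sketch — the `Row` shape mirrors the example table above and is not a client-library type:

```typescript
interface Row { date: string; market_id: string; volume: number }

// Client-side equivalent of the running-total window function:
// sort by (market_id, date), then accumulate a per-market total.
function cumulativeVolume(rows: Row[]): (Row & { cumulative_volume: number })[] {
  const totals = new Map<string, number>()
  return [...rows]
    .sort((a, b) =>
      a.market_id.localeCompare(b.market_id) || a.date.localeCompare(b.date))
    .map(r => {
      const next = (totals.get(r.market_id) ?? 0) + r.volume
      totals.set(r.market_id, next)
      return { ...r, cumulative_volume: next }
    })
}
```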
## Data Insertion Patterns

### Bulk Insert (Recommended)

```typescript
import { ClickHouse } from 'clickhouse'

interface Trade {
  id: string
  market_id: string
  user_id: string
  amount: number
  timestamp: Date
}

const clickhouse = new ClickHouse({
  url: process.env.CLICKHOUSE_URL,
  port: 8123,
  basicAuth: {
    username: process.env.CLICKHOUSE_USER,
    password: process.env.CLICKHOUSE_PASSWORD
  }
})

// ✅ Batch insert (efficient)
// NOTE: interpolating values into SQL is only safe for trusted, validated
// data — prefer your client's parameterized or JSONEachRow insert APIs
async function bulkInsertTrades(trades: Trade[]) {
  const values = trades.map(trade => `(
    '${trade.id}',
    '${trade.market_id}',
    '${trade.user_id}',
    ${trade.amount},
    '${trade.timestamp.toISOString()}'
  )`).join(',')

  await clickhouse.query(`
    INSERT INTO trades (id, market_id, user_id, amount, timestamp)
    VALUES ${values}
  `).toPromise()
}

// ❌ Individual inserts (slow)
async function insertTrade(trade: Trade) {
  // Don't do this in a loop!
  await clickhouse.query(`
    INSERT INTO trades VALUES ('${trade.id}', ...)
  `).toPromise()
}
```

### Streaming Insert

```typescript
// For continuous data ingestion
async function streamInserts() {
  const stream = clickhouse.insert('INSERT INTO trades').stream()

  // dataSource: any async iterable of row batches (defined elsewhere)
  for await (const batch of dataSource) {
    stream.write(batch)
  }

  await stream.end()
}
```
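Since the interpolated `INSERT` above is only safe for trusted data, a small value formatter can at least centralize the quoting rules. This is an illustrative sketch, not part of any client library — parameterized inserts remain preferable. ClickHouse string literals escape backslash and single quote with a backslash:

```typescript
// Minimal value formatter for hand-built INSERT statements.
// Escapes backslash and single quote; rejects non-finite numbers.
function formatValue(v: string | number | Date | null): string {
  if (v === null) return 'NULL'
  if (typeof v === 'number') {
    if (!Number.isFinite(v)) throw new Error('non-finite number')
    return String(v)
  }
  const s = v instanceof Date ? v.toISOString() : v
  return `'${s.replace(/\\/g, '\\\\').replace(/'/g, "\\'")}'`
}
```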
## Materialized Views

### Real-time Aggregations

```sql
-- Create materialized view for hourly stats
CREATE MATERIALIZED VIEW market_stats_hourly_mv
TO market_stats_hourly
AS SELECT
    toStartOfHour(timestamp) AS hour,
    market_id,
    sumState(amount) AS total_volume,
    countState() AS total_trades,
    uniqState(user_id) AS unique_users
FROM trades
GROUP BY hour, market_id;

-- Query the materialized view's target table
SELECT
    hour,
    market_id,
    sumMerge(total_volume) AS volume,
    countMerge(total_trades) AS trades,
    uniqMerge(unique_users) AS users
FROM market_stats_hourly
WHERE hour >= now() - INTERVAL 24 HOUR
GROUP BY hour, market_id;
```

## Performance Monitoring

### Query Performance

```sql
-- Check slow queries
SELECT
    query_id,
    user,
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 1000
  AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```

### Table Statistics

```sql
-- Check table sizes
SELECT
    database,
    table,
    formatReadableSize(sum(bytes)) AS size,
    sum(rows) AS rows,
    max(modification_time) AS latest_modification
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes) DESC;
```

## Common Analytics Queries

### Time Series Analysis

```sql
-- Daily active users
SELECT
    toDate(timestamp) AS date,
    uniq(user_id) AS daily_active_users
FROM events
WHERE timestamp >= today() - INTERVAL 30 DAY
GROUP BY date
ORDER BY date;

-- Retention analysis: signup date is each user's earliest activity date
SELECT
    signup_date,
    uniqIf(user_id, days_since_signup = 0) AS day_0,
    uniqIf(user_id, days_since_signup = 1) AS day_1,
    uniqIf(user_id, days_since_signup = 7) AS day_7,
    uniqIf(user_id, days_since_signup = 30) AS day_30
FROM (
    SELECT
        user_id,
        activity_date,
        min(activity_date) OVER (PARTITION BY user_id) AS signup_date,
        dateDiff('day', signup_date, activity_date) AS days_since_signup
    FROM (
        SELECT DISTINCT user_id, toDate(timestamp) AS activity_date
        FROM events
    )
)
GROUP BY signup_date
ORDER BY signup_date DESC;
```

### Funnel Analysis

```sql
-- Conversion funnel: distinct sessions reaching each step
SELECT
    uniqIf(session_id, event_type = 'viewed_market') AS viewed,
    uniqIf(session_id, event_type = 'clicked_trade') AS clicked,
    uniqIf(session_id, event_type = 'completed_trade') AS completed,
    round(clicked / viewed * 100, 2) AS view_to_click_rate,
    round(completed / clicked * 100, 2) AS click_to_completion_rate
FROM events
WHERE event_date = today();
```

### Cohort Analysis

```sql
-- User cohorts by signup month
SELECT
    toStartOfMonth(signup_date) AS cohort,
    toStartOfMonth(activity_date) AS month,
    dateDiff('month', cohort, month) AS months_since_signup,
    uniq(user_id) AS active_users
FROM (
    SELECT
        user_id,
        min(toDate(timestamp)) OVER (PARTITION BY user_id) AS signup_date,
        toDate(timestamp) AS activity_date
    FROM events
)
GROUP BY cohort, month, months_since_signup
ORDER BY cohort, months_since_signup;
```
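The funnel logic above is easy to mirror client-side when validating query results. A sketch with illustrative event shapes (not a client API):

```typescript
interface FunnelEvent { session_id: string; event_type: string }

// Count distinct sessions reaching each step, in funnel order.
function funnel(events: FunnelEvent[], steps: string[]): number[] {
  return steps.map(step =>
    new Set(
      events.filter(e => e.event_type === step).map(e => e.session_id)
    ).size)
}

// rate[i] = counts[i+1] / counts[i], as a percentage rounded to 2 decimals
function conversionRates(counts: number[]): number[] {
  return counts.slice(1).map((c, i) =>
    counts[i] === 0 ? 0 : Math.round((c / counts[i]) * 10000) / 100)
}
```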
## Data Pipeline Patterns

### ETL Pattern

```typescript
// Extract, Transform, Load
// (extractFromPostgres and bulkInsertToClickHouse are app-specific helpers)
async function etlPipeline() {
  // 1. Extract from source
  const rawData = await extractFromPostgres()

  // 2. Transform
  const transformed = rawData.map(row => ({
    date: new Date(row.created_at).toISOString().split('T')[0],
    market_id: row.market_slug,
    volume: parseFloat(row.total_volume),
    trades: parseInt(row.trade_count)
  }))

  // 3. Load to ClickHouse
  await bulkInsertToClickHouse(transformed)
}

// Run periodically
setInterval(etlPipeline, 60 * 60 * 1000) // Every hour
```

### Change Data Capture (CDC)

```typescript
// Listen to PostgreSQL changes and sync to ClickHouse
import { Client } from 'pg'

const pgClient = new Client({ connectionString: process.env.DATABASE_URL })

pgClient.connect().then(() => pgClient.query('LISTEN market_updates'))

pgClient.on('notification', async (msg) => {
  if (!msg.payload) return
  const update = JSON.parse(msg.payload)

  await clickhouse.insert('INSERT INTO market_updates', [
    {
      market_id: update.id,
      event_type: update.operation, // INSERT, UPDATE, DELETE
      timestamp: new Date(),
      data: JSON.stringify(update.new_data)
    }
  ]).toPromise()
})
```
## Best Practices

### 1. Partitioning Strategy
- Partition by time (usually month or day)
- Avoid too many partitions (performance impact)
- Use a Date column for the partition key

### 2. Ordering Key
- Put the most frequently filtered columns first
- Consider cardinality (low-cardinality columns first usually compress and prune better)
- Column order impacts compression

### 3. Data Types
- Use the smallest appropriate type (UInt32 vs UInt64)
- Use LowCardinality(String) for repeated strings
- Use Enum for categorical data

### 4. Avoid
- SELECT * (specify columns)
- FINAL (merge data before query instead)
- Too many JOINs (denormalize for analytics)
- Small frequent inserts (batch instead)

### 5. Monitoring
- Track query performance
- Monitor disk usage
- Check merge operations
- Review the slow query log

**Remember**: ClickHouse excels at analytical workloads. Design tables for your query patterns, batch inserts, and leverage materialized views for real-time aggregations.