ClickHouse is a fast, open-source, column-oriented OLAP database management system that generates analytical reports from SQL queries in real time.
Before we dig into ClickHouse, why a columnar database? You can read more about the reasons in this article.
Download the Dataset
Run the command:
wget -c \
http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
The download takes about two minutes on a good internet connection.
Running the Scenario with MySQL
Before importing the data into ClickHouse and playing with it, I want to show an alternative approach using MySQL. It is a different kind of database, but the comparison is worth it.
Create the Table
CREATE TABLE `uk_price_paid` (
  `price` int DEFAULT '0',
  `date` date DEFAULT NULL,
  `postcode` varchar(100) DEFAULT NULL,
  `type` varchar(10) DEFAULT NULL,
  `is_new` varchar(10) DEFAULT NULL,
  `duration` varchar(10) DEFAULT NULL,
  `addr1` varchar(255) DEFAULT NULL,
  `addr2` varchar(255) DEFAULT NULL,
  `street` varchar(255) DEFAULT NULL,
  `locality` varchar(255) DEFAULT NULL,
  `town` varchar(255) DEFAULT NULL,
  `district` varchar(255) DEFAULT NULL,
  `county` varchar(255) DEFAULT NULL,
  `category` varchar(10) DEFAULT NULL,
  `id` int NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Import the Data
Add the following setting to your MySQL cnf file:
secure-file-priv = "$HOME/user_files"
Replace $HOME with your home path, and create the user_files directory if it does not exist.
This prevents the following error when loading the file:
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '--secure-file-priv' at line 1
Run this query:
LOAD DATA LOCAL INFILE 'pp-complete.csv'
INTO TABLE demo.uk_price_paid
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(@dummy, price, date, postcode, type, is_new, duration, addr1, addr2, street, locality, town, district, county, category, @dummy);
The @dummy variables skip the CSV columns we do not need.
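The same column-skipping idea can be sketched outside MySQL. Below is a minimal Python sketch that mirrors the column list in the LOAD DATA statement; the function name and the sample row are mine, not from the dataset:

```python
import csv
import io

# Column names in the order used by the LOAD DATA statement above.
# Index 0 (transaction UUID) and index 15 (record status) are the
# columns mapped to @dummy, so we drop them here.
FIELDS = ["price", "date", "postcode", "type", "is_new", "duration",
          "addr1", "addr2", "street", "locality", "town", "district",
          "county", "category"]

def parse_row(row):
    """Drop the first and last CSV columns and name the rest."""
    kept = row[1:15]  # skip UUID (index 0) and trailing status (index 15)
    return dict(zip(FIELDS, kept))

# A made-up sample line with the same 16-column shape as pp-complete.csv.
sample = ('"{UUID}","67000","1995-01-31 00:00","SW1A 1AA","T","N","F",'
          '"1","","STREET","LOC","TOWN","DIST","COUNTY","A","A"')
row = next(csv.reader(io.StringIO(sample)))
parsed = parse_row(row)
print(parsed["price"], parsed["postcode"])
```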
Check the data:
mysql> SELECT count(*) FROM uk_price_paid;
+----------+
| count(*) |
+----------+
| 27359803 |
+----------+
1 row in set (11.43 sec)
To be clear: using a conventional row-oriented RDBMS to store a large dataset and run counts directly against it is not really recommended. If you still decide to use an RDBMS, keep a separate summary table just for counting.
Trying ClickHouse
After installing ClickHouse on my desktop, the next step is connecting to the database.
You can connect to ClickHouse using the built-in UI or the ClickHouse client.
Use the Built-in UI
The ClickHouse server listens for HTTP clients on port 8123 by default. There is a built-in UI for running SQL queries at http://127.0.0.1:8123/play
(change the hostname accordingly).
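The same HTTP endpoint also accepts queries directly, without the Play UI. A minimal sketch using only Python's standard library; the host and port are the defaults mentioned above, and the helper name is mine (uncomment the `urlopen` line only with a running server):

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # used only if you actually send the query

def clickhouse_url(query, host="127.0.0.1", port=8123):
    """Build a URL for ClickHouse's HTTP interface (default port 8123)."""
    return f"http://{host}:{port}/?{urlencode({'query': query})}"

url = clickhouse_url("SELECT 1")
print(url)
# To actually run it (requires a running ClickHouse server):
#   print(urlopen(url).read().decode())
```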
Play UI
Notice that in the Play UI the username field is pre-populated with default and the password field is left empty. If you assigned a password to the default user, enter it into the password field.
Try running a query. For example, the following returns the names of the predefined databases:
SHOW DATABASES
Click the RUN button and the response is displayed in the lower portion of the Play UI:
Show Databases result
Create the Database
Use the CREATE DATABASE command to create a new database in ClickHouse:
CREATE DATABASE IF NOT EXISTS demo
Create the Table
CREATE TABLE demo.uk_price_paid
(
    price UInt32,
    date Date,
    postcode LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String),
    category UInt8
) ENGINE = MergeTree ORDER BY (postcode, addr1, addr2);
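The schema trades flexibility for size: `Enum8` stores each `type` and `duration` value as a single byte, and `LowCardinality(String)` dictionary-encodes repetitive strings. A back-of-the-envelope sketch of the Enum8 saving for the `type` column; the row count is the dataset's ~27 million rows, and the estimate ignores compression entirely, so treat it as an illustration only:

```python
# Rough, uncompressed size estimate: plain String vs Enum8 for `type`.
# Real on-disk sizes are far smaller after columnar compression.
ROWS = 27_359_802                     # approximate dataset row count
avg_label_len = len("semi-detached")  # pessimistic: the longest label

string_bytes = ROWS * avg_label_len   # store the full label per row
enum_bytes = ROWS * 1                 # Enum8: one byte per row

print(f"String: ~{string_bytes / 1024**2:.0f} MiB, "
      f"Enum8: ~{enum_bytes / 1024**2:.0f} MiB")
```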
Preprocess and Import Data
We will use the Play UI for data preprocessing.
In this example, we read the source data straight from the CSV file and preprocess it with a query in the Play UI.
The preprocessing steps are:
converting the time field to a date, as it only ever contains a 00:00 time component;
ignoring the UUID field, because we do not need it for analysis;
transforming the type and duration codes into more readable Enum fields with the transform function;
transforming the is_new and category fields from single-character strings (Y/N and A/B) into UInt8 fields holding 0 and 1.
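The per-field transforms listed above can be sketched in Python; the mappings mirror the `transform` calls in the SQL that follows, while the function name and sample input are mine:

```python
from datetime import datetime

# Code letter -> label, mirroring transform(...) in the SQL below.
TYPE_MAP = {"T": "terraced", "S": "semi-detached", "D": "detached",
            "F": "flat", "O": "other"}
DURATION_MAP = {"F": "freehold", "L": "leasehold", "U": "unknown"}

def preprocess(row):
    """Apply the same per-field transforms described above."""
    return {
        "price": int(row["price"]),
        # 'YYYY-MM-DD HH:MM' -> date; the time part is always 00:00
        "date": datetime.strptime(row["date"], "%Y-%m-%d %H:%M").date(),
        "type": TYPE_MAP.get(row["type"], "other"),
        "is_new": 1 if row["is_new"] == "Y" else 0,
        "duration": DURATION_MAP.get(row["duration"], "unknown"),
        "category": 1 if row["category"] == "B" else 0,
    }

out = preprocess({"price": "67000", "date": "1995-01-31 00:00",
                  "type": "T", "is_new": "N", "duration": "F",
                  "category": "A"})
print(out)
```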
Copy the CSV file to the $HOME/user_files directory, then run this SQL in the Play UI:
SELECT
    c2,
    toDate(toDateTime(CONCAT(c3, ':00'))),
    c4,
    transform(c5, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
    c6 = 'Y' AS is_new,
    transform(c7, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
    c8, c9, c10, c11, c12, c13, c14,
    c15 = 'B' AS category
FROM file('pp-complete.csv')
Result Query
Notice that ClickHouse can query the contents of a huge CSV file directly, without loading it into a table first.
Now run this SQL in the Play UI to load the data:
INSERT INTO demo.uk_price_paid
SELECT
    c2,
    toDate(toDateTime(CONCAT(c3, ':00'))),
    c4,
    transform(c5, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
    c6 = 'Y' AS is_new,
    transform(c7, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
    c8, c9, c10, c11, c12, c13, c14,
    c15 = 'B' AS category
FROM file('pp-complete.csv')
It will take about 40 seconds.
Validate the Data
Query:
SELECT count() FROM demo.uk_price_paid;
Result:
┌──count()─┐
│ 27359802 │
└──────────┘
1 row in set. Elapsed: 0.009 sec.
The dataset takes only 317 MiB in ClickHouse. Check it yourself.
Query:
SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'uk_price_paid';
Result:
┌─formatReadableSize(total_bytes)─┐
│ 317.41 MiB                      │
└─────────────────────────────────┘
1 row in set. Elapsed: 0.003 sec.
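formatReadableSize can be approximated in a few lines of Python. The function name and the fixed two-decimal format follow the output above; this is a sketch, not ClickHouse's exact implementation:

```python
def format_readable_size(n):
    """Render a byte count like ClickHouse's formatReadableSize."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(n)
    for unit in units:
        # Stop at the first unit that keeps the value under 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024

print(format_readable_size(332828631))
```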
Run Some Queries
Query Average Price Per Year
Query:
SELECT
    toYear(date) AS year,
    round(avg(price)) AS price,
    bar(price, 0, 1000000, 80)
FROM demo.uk_price_paid
GROUP BY year
ORDER BY year;
Result:
┌─year─┬──price─┬─bar(round(avg(price)), 0, 1000000, 80)─┐
│ 1995 │  67933 │ ██████                                 │
│ 1996 │  71508 │ ██████                                 │
│ 1997 │  78536 │ ███████                                │
│ 1998 │  85440 │ ███████                                │
│ 1999 │  96038 │ ████████                               │
│ 2000 │ 107487 │ █████████                              │
│ 2001 │ 118888 │ ██████████                             │
│ 2002 │ 137947 │ ███████████                            │
│ 2003 │ 155893 │ █████████████                          │
│ 2004 │ 178888 │ ███████████████                        │
│ 2005 │ 189359 │ ████████████████                       │
│ 2006 │ 203532 │ █████████████████                      │
│ 2007 │ 219375 │ ██████████████████                     │
│ 2008 │ 217055 │ ██████████████████                     │
│ 2009 │ 213419 │ █████████████████                      │
│ 2010 │ 236107 │ ███████████████████                    │
│ 2011 │ 232802 │ ███████████████████                    │
│ 2012 │ 238381 │ ███████████████████                    │
│ 2013 │ 256928 │ █████████████████████                  │
│ 2014 │ 279982 │ ███████████████████████                │
│ 2015 │ 297264 │ ████████████████████████               │
│ 2016 │ 313494 │ █████████████████████████              │
│ 2017 │ 346348 │ ████████████████████████████           │
│ 2018 │ 350547 │ ████████████████████████████           │
│ 2019 │ 352105 │ █████████████████████████████          │
│ 2020 │ 375843 │ ██████████████████████████████         │
│ 2021 │ 380525 │ ███████████████████████████████        │
│ 2022 │ 354501 │ █████████████████████████████          │
└──────┴────────┴────────────────────────────────────────┘
28 rows in set. Elapsed: 0.111 sec. Processed 27.36 million rows, 164.16 MB (245.43 million rows/s., 1.47 GB/s.)
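The bar(x, min, max, width) function used above simply scales x into a bar of at most width characters. A rough Python equivalent, using plain block characters rather than ClickHouse's fractional eighth-block glyphs (so its bars can look one character shorter):

```python
def bar(x, lo, hi, width=80):
    """Scale x from [lo, hi] into a bar of at most `width` block chars."""
    filled = int(round((x - lo) / (hi - lo) * width))
    return "█" * max(0, min(width, filled))

# The 1995 and 2021 averages from the result above.
print(f"1995 {bar(67933, 0, 1_000_000, 80)}")
print(f"2021 {bar(380525, 0, 1_000_000, 80)}")
```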
Summary
Pros:
You can analyze a large CSV or text dataset without even creating a table
Efficient storage and (soft) real-time queries over analytical data
Incredibly fast
Can integrate with other databases
Designed for analytical purposes
Real-time
Open source
SQL
Cons:
Optimized for inserts: updating or deleting existing records is awkward and only supported through heavyweight mutations