Load sample data into ClickHouse®#
The official ClickHouse® website offers a list of example datasets to get you started. Each dataset has a description on how to download, upload, and transform the data samples as needed.
This article takes a closer look at how to use the
Anonymized Web Analytics Data example dataset. This contains two tables:
hits_v1, with data for every user action
visits_v1, with information about every user session
The steps below show you how to download the dataset, set up a connection with the server, and load the data into your cluster. ClickHouse already offers detailed instructions on setting up this dataset, but these steps add some more details on how to run commands by using a ClickHouse client running in Docker.
Download the dataset#
Download the original dataset directly from the dataset documentation page. You can do this using cURL, where the generic command looks like this:
curl address_to_file_in_format_tsv_xz | unxz --threads=`nproc` > file-name.tsv
nproc Linux command, which prints the number of processing units, is not available on macOS. To use the above command, add an alias for
nproc into your
alias nproc="sysctl -n hw.logicalcpu".
This command allows you to download and extract data from the URLs specified in the ClickHouse documentation.
Once done, you should have two files available:
Set up the service and database#
If you don’t yet have an Aiven for ClickHouse service, follow the steps in our getting started guide to create one.
When you create a service, a default database was already added. However, you can create separate databases specific to your use case. We will create a database with the name
datasets, keeping it the same as in the ClickHouse documentation.
To create the new database, take the following steps:
Log in to the Aiven web console, and select your service from the Services page.
In your service’s page, select Databases and tables from the sidebar.
In the Databases and tables page, select Create database > ClickHouse database.
In the Create ClickHouse database window, enter name
datasetsfor your database and select Create database.
Connect to the ClickHouse database#
We will be using the ClickHouse client to connect to the server. Follow the separate guide to familiarize yourself with how to set up and start using the ClickHouse client.
To connect to the server, use the connection details that you can find in the Connection information section of the Overview page in the Aiven web console. You will need Host, Port, User, and Password.
docker run --interactive \ --rm clickhouse/clickhouse-server clickhouse-client \ --user USERNAME \ --password PASSWORD \ --host HOST \ --port PORT \ --secure
Once you’re connected, you can run queries from within the ClickHouse client.
The next step is to add new tables to your newly created database. The ClickHouse documentation includes sample
CREATE TABLE commands with the recommended table structure, use it to create the tables for both
CREATE TABLE datasets.hits_v1 [...]
CREATE TABLE datasets.visits_v1 [...]
If no database is specified, the default one is used.
Now that you have a dataset with two empty tables, we’ll load data into each of the tables. However, because we need to access files outside the docker container, we’ll run the command specifying
--query parameter. To do this:
Go to the folder where you stored the downloaded files for
Run the following command:
cat hits_v1.tsv | docker run \ --interactive \ --rm clickhouse/clickhouse-server clickhouse-client \ --user USERNAME \ --password PASSWORD \ --host HOST \ --port PORT \ --secure \ --max_insert_block_size=100000 \ --query="INSERT INTO datasets.hits_v1 FORMAT TSV"
hits_v1.tsvcontains approximately 7Gb of data. Depending on your internet connection, it can take some time to load all the items.
Run the corresponding command for
cat visits_v1.tsv | docker run \ --interactive \ --rm clickhouse/clickhouse-server clickhouse-client \ --user USERNAME \ --password PASSWORD \ --host HOST \ --port PORT \ --secure \ --max_insert_block_size=100000 \ --query="INSERT INTO datasets.visits_v1 FORMAT TSV"
You should now see the two tables in your database and you are ready to try out some queries.
Once the data is loaded, you can run queries against the sample data you imported. For example, here is a command to query the number of items in the hits_v1 table:
SELECT COUNT(*) FROM datasets.hits_v1
Another example uses some additional query features to find the longest lasting sessions:
SELECT StartURL AS URL, MAX(Duration) AS MaxDuration FROM datasets.visits_v1 GROUP BY URL ORDER BY MaxDuration DESC LIMIT 10