The 1 billion row challenge was put forth by Gunnar Morling in January 2024. He put it out specifically to see how fast Java could read 1 billion lines and do some simple calculations. The goal was to push Java and see how everyone optimizes things.
I thought it would be fun to try this in Pick. The goal is to load the data into a hash file and then do a query to get the answer. Someone already did this in SQL with DuckDB, so there is something to compare Pick against.
It was pretty easy to get started with the challenge as it uses Java to generate the test data, but that was pretty much all the Java was needed for.
The GitHub repository for the 1 billion row challenge is here.
I needed to install Java under Arch first. This really highlighted why I want to use [nix]. I'm installing Java for a one-off project; I'm not using Java regularly and there is no reason it should be cluttering up my system. It would be nice to install Java with nix-shell and then forget it ever existed once I'm done with it.
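Something like this would give a throwaway shell with Java in it, using the jdk package from nixpkgs:
nix-shell -p jdk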
Until I get on nix though...
sudo pacman -Syu jdk-openjdk
Now we can set up the challenge and start generating data:
git clone https://github.com/gunnarmorling/1brc.git
cd 1brc
./mvnw clean verify
I created 3 measurement files so that I could test without running against the full billion records.
./create_measurements.sh 1000
./create_measurements.sh 1000000
./create_measurements.sh 1000000000
Each run generates measurements.txt, so I renamed the 1,000 and 1 million versions to small and med.
An example of a file:
Johannesburg;7.7
Saint-Pierre;11.7
Murmansk;5.6
Reykjavík;-2.8
Hobart;7.0
San Juan;43.5
Wellington;24.1
Villahermosa;29.5
Rabat;11.4
Antsiranana;30.6
A very simple data set: each line is just a weather station and a temperature. We need to track the temperatures per station and then get the min, max and average for each station. This is something a database should excel at.
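My first pass was roughly this shape (a minimal sketch in UniVerse/jBASE-style BASIC with illustrative file and variable names, not my exact code):

* Sketch: read measurements.txt line by line and accumulate
* min/max/sum/count per station into a hash file called STATIONS.
OPENSEQ "measurements.txt" TO F.MEASURE ELSE STOP "Cannot open measurements.txt"
OPEN "STATIONS" TO F.STATIONS ELSE STOP "Cannot open STATIONS"
LOOP
   READSEQ LINE FROM F.MEASURE ELSE EXIT
   STATION = FIELD(LINE, ";", 1)
   TEMP = FIELD(LINE, ";", 2)
   READ REC FROM F.STATIONS, STATION ELSE REC = ""
   IF REC = "" THEN
      REC<1> = TEMP   ;* min
      REC<2> = TEMP   ;* max
      REC<3> = TEMP   ;* running total
      REC<4> = 1      ;* count
   END ELSE
      IF TEMP < REC<1> THEN REC<1> = TEMP
      IF TEMP > REC<2> THEN REC<2> = TEMP
      REC<3> = REC<3> + TEMP
      REC<4> = REC<4> + 1
   END
   WRITE REC TO F.STATIONS, STATION
REPEAT
* The "query" is then a second pass over the hash file.
SELECT F.STATIONS
LOOP
   READNEXT ID ELSE EXIT
   READ REC FROM F.STATIONS, ID ELSE CONTINUE
   CRT ID : "=" : REC<1> : "/" : (REC<3> / REC<4>) : "/" : REC<2>   ;* rounding the mean to one decimal is left out here
REPEAT

A read-modify-write against the hash file for every input line keeps the program short, but it does a lot of file I/O.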
Unfortunately my machine is extremely underpowered, such that even 1 million records took a hot minute. There are probably some gains I could find, but I'll need to do some testing. I think it could be quite fun to see how far I can push it on my computer until I can test it somewhere that can load the entire dataset into memory. Though possibly, because I'm using hash files, it will be slow regardless. Perhaps an SSD would help things instead of loading into RAM.
The current code is in a gist and I've posted it to the Pick Google Group, so hopefully I'll see some other ideas over time.
My results for 1 million:
100000: 7.746s
200000: 15.4179s
300000: 23.4662s
400000: 31.8649s
500000: 40.67s
600000: 49.9114s
700000: 60.1258s
800000: 70.3226s
900000: 81.1328s
1000000: 92.3357s
Loading: 92.3359s
Calculate: 76.7499s
This is on an Intel Xeon E5520 with 16GB of RAM and 16 CPUs.
I tried a few things:
Modifying the internal hashmaps to be specific to the problem was the best modification. Splitting the setup of the data apart from writing it out was also a big bonus. After some major rework it's about 3x faster now. This isn't terrible, but it would still take a hot minute for 1 billion records. Reporting is still terrible and I'm not sure if that can be improved.
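As a rough sketch of the shape, with a plain LOCATE lookup standing in for the real hashmap:

* Sketch: accumulate per-station stats in memory, then write the
* hash file out in a single pass at the end.
KEYS = "" ; MINS = "" ; MAXS = "" ; SUMS = "" ; COUNTS = ""
OPENSEQ "measurements.txt" TO F.MEASURE ELSE STOP "Cannot open measurements.txt"
OPEN "STATIONS" TO F.STATIONS ELSE STOP "Cannot open STATIONS"
LOOP
   READSEQ LINE FROM F.MEASURE ELSE EXIT
   STATION = FIELD(LINE, ";", 1)
   TEMP = FIELD(LINE, ";", 2)
   LOCATE STATION IN KEYS<1> SETTING POS THEN
      IF TEMP < MINS<POS> THEN MINS<POS> = TEMP
      IF TEMP > MAXS<POS> THEN MAXS<POS> = TEMP
      SUMS<POS> = SUMS<POS> + TEMP
      COUNTS<POS> = COUNTS<POS> + 1
   END ELSE
      INS STATION BEFORE KEYS<POS>
      INS TEMP BEFORE MINS<POS>
      INS TEMP BEFORE MAXS<POS>
      INS TEMP BEFORE SUMS<POS>
      INS 1 BEFORE COUNTS<POS>
   END
REPEAT
* Writing out is now a separate, much smaller loop: one write per
* station instead of a read-modify-write per input line.
N.STATIONS = DCOUNT(KEYS, @AM)
FOR I = 1 TO N.STATIONS
   REC = ""
   REC<1> = MINS<I>
   REC<2> = MAXS<I>
   REC<3> = SUMS<I>
   REC<4> = COUNTS<I>
   WRITE REC TO F.STATIONS, KEYS<I>
NEXT I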
My results for 1 million:
100000: 2.7016s
200000: 5.0639s
300000: 7.4481s
400000: 9.8667s
500000: 12.318s
600000: 14.8079s
700000: 17.3236s
800000: 19.8695s
900000: 22.4393s
1000000: 25.0427s
Loading: 25.0933s
Calculate: 80.3474s
Still on an Intel Xeon E5520 with 16GB of RAM and 16 CPUs.
I simplified my hashing algorithm and inlined it for a relatively big gain. The hashing function is suspect though.
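The hash itself is along these lines (a sketch with illustrative constants, not my exact function):

* Sketch: a simple multiplicative string hash of the kind that gets
* inlined straight into the read loop instead of called per line.
TABLE.SIZE = 4099                     ;* illustrative prime table size
STATION = "Hobart"
HASH = 0
KLEN = LEN(STATION)
FOR I = 1 TO KLEN
   HASH = MOD(HASH * 31 + SEQ(STATION[I, 1]), TABLE.SIZE)
NEXT I
SLOT = HASH + 1                       ;* dimensioned arrays are 1-based
CRT "slot for " : STATION : " = " : SLOT

Inlining it avoids a subroutine call per input line; the trade-off is that a hash this simple can cluster stations into the same buckets.
My results for 1 million: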
100000: 0.9489s
200000: 1.686s
300000: 2.4405s
400000: 3.2122s
500000: 3.9982s
600000: 4.7998s
700000: 5.6146s
800000: 6.4472s
900000: 7.2924s
1000000: 8.1559s
Loading: 8.1961s
Still on an Intel Xeon E5520 with 16GB of RAM and 16 CPUs.
I changed my READSEQ to READBLK and picked up some more speed; loading is down to about 5 seconds now.
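The difference is reading the file in big chunks and splitting out the lines myself instead of making one READSEQ call per line. Roughly (a sketch; READBLK's exact end-of-file behaviour varies between flavours, so check your platform's docs):

* Sketch: read measurements.txt in 1MB chunks with READBLK and split
* the buffer on newlines, carrying any partial line over to the next chunk.
OPENSEQ "measurements.txt" TO F.MEASURE ELSE STOP "Cannot open measurements.txt"
BLOCK.SIZE = 1048576                  ;* 1MB per read, an illustrative size
NL = CHAR(10)
LEFTOVER = ""
LOOP
   READBLK CHUNK FROM F.MEASURE, BLOCK.SIZE ELSE EXIT   ;* assumes the file ends with a newline
   BUFFER = LEFTOVER : CHUNK
   NL.COUNT = COUNT(BUFFER, NL)
   IF NL.COUNT = 0 THEN
      LEFTOVER = BUFFER
      CONTINUE
   END
   LAST.NL = INDEX(BUFFER, NL, NL.COUNT)
   LINES = BUFFER[1, LAST.NL - 1]                       ;* whole lines only
   LEFTOVER = BUFFER[LAST.NL + 1, LEN(BUFFER)]
   CONVERT NL TO @AM IN LINES
   LOOP
      REMOVE LINE FROM LINES SETTING MORE
      IF LINE # "" THEN
         STATION = FIELD(LINE, ";", 1)
         TEMP = FIELD(LINE, ";", 2)
         * ...same per-station accumulation as before...
      END
   WHILE MORE DO REPEAT
REPEAT

My results for 1 million: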
100000: 0.714s
200000: 1.1275s
300000: 1.5513s
400000: 1.9863s
500000: 2.433s
600000: 2.8977s
700000: 3.3733s
800000: 3.859s
900000: 4.3582s
1000000: 4.8729s
Loading: 4.9124s