All kinds of things too precious not to share, in short articles. Constantly under construction or destruction.
Fabian Kurz → Ham Radio → Stuff
This article is a work in progress - it will describe the technical side of the Reverse Beacon Network (RBN) activity charts available at https://foc.dj1yfk.de/activity/DJ1YFK including source code and the rationale behind the design of the system.
The goal of the RBN activity charts is to get a very quick overview of the activity of a callsign, as reported by the RBN. This includes all modes supported by the RBN receivers, currently this is mainly CW with some Skimmers also supporting RTTY, PSK31 and other digital modes.
A desired representation includes a heat-map style calendar view for at least one year back, where the reported RBN activity per day is mapped to a color intensity, very similar to the commit activity maps over at at github. Activity points are awarded for each hour of the day, and within each hour one point for each band on which the RBN reported a spot, plus the number of continents on which the station was heard during that hour. This gives a hopefully more meaningful picture of the activity than just counting the raw number of spots, which can (with a decent signal and suitable propagation) nowadays easily reach 50 for a single CQ.
Additionally, a band breakdown of the reported activity is desired, to get a quick idea of the activity focus of a certain callsign. To round it off, a graph showing the activity versus the hours of the day gives an interesting insight into typical operating hours, and with a little extra work, this chart can also include the bands on which the activity took place.
Finally, a summary active hours per day, total for the last year, and the number of hours in which the callsign was detected by a skimmer on each of the continents rounds off the information presented.
The finished "product" can be seen on the URL mentioned above. Just for illustration, you may find some of the following exemplary overviews interesting:
The raw aggregated data from the RBN is available at reversebeacon.net in CSV format, one file a day. After unzipping the data, typical file sizes range between about 10 MB (150k spots, week days) and over 150 MB (2.5M spots, contest weekend) for a single day, i. e. more than 1 GB per month, and consequently over 12 GB or something in the range of 50 million spots a year. This really is "big data".
The naive way to handle this kind of data would be to pay a lot of money for a high performance server or rent something at AWS, throw each line into a database with a fancy name, install all the frameworks that currently are trendy, and then for each query to generate an activity chart, search through millions of database rows and let the CPU melt away a bit. "Cloud developers" love this approach.
The smart way, however, is to process the data once, convert it into a suitable format, and save it efficiently. That's the way I chose, because I love small and scalable solutions, much more than throwing money and ressources at problems until they disappear (which they often won't).
For the type of activity report we want to generate, there's a simple data relationship, which is fundamentally different from the "table" view of a CSV file with all the spots in single rows: What we need is a key → value relationship between a callsign, and all data that is associated with this callsign.
Furthermore, in order to generate the kind of report described above, we can do away with a lot of redundancy that is contained in the RBN raw data. If we only like to know if DJ1YFK was heard by the RBN on 20m in the 23:00 UTC hour on January 1st, 2015, we don't need to save 50 spots (each with 50 or so bytes) for that, because it's only a few bits of actual information. It doesn't matter which precise skimmers received the call, the continents are sufficient. The same is true for the precise frequency, SNR, CW speed or any of the other information that's available in a fine grained form in the raw data. In the end, the whole information that is saved for one hour in which a callsign was active, is a bit field with 16 bits (which conveniently fits into two bytes), as tabulated below.
The next question was: How to save this bitfield in an efficient manner, and not only for one single hour, but for any hour and date. And it needs to be saved in a way that makes is very easy not only to retrieve this data, but also to update any particular hour or day without much overhead and moving data around.
After considering and rejecting formats like JSON, the best solution was infinitely simpler and faster: All data was stuffed into an array of bytes, where a certain starting point (in this case January 1st, 2015), was byte 0, and for each hour from that point two bytes were allocated. The appropriate bits in this array are then filled day by day with the new raw data.
The nice thing about this way of representing the data is that you can immediately access any hour without doing a lot of math (calculating the hours from a certain starting point is something that library functions do perfectly), and since all hours are pre-allocated, there is no shifting around when adding new data (possibly in a non-chronological way). A year of this data takes 365 * 24 * 2 = 17520 bytes.
With more than 500k callsigns appearing on the RBN within a year's frame, this is a lot of data, more than 8 GB. If you want to save two consecutive years, it's twice that, and with ten years, it's ten times that. Not practical! Fortunately, the data typically contains a lot of zeroes, simply because the activity of most radio amateurs is not spread evenly over the hours of the day, and most hams are not active every single day. This means you can compress the data very well, which - in my case simply with gzip - results in an average size of just above 110 bytes per callsign, a compression factor of about 160. The penalty in speed and additional CPU load introduced by gzip is a small price to pay for that.
With the data representation solved, the next issue was how to store it. In a first version, this was done simply in the filesystem: The data for each callsign was saved in a single file - in order to avoid a single folder with half a million files, instead there were a number of folders with the first two letters of the MD5 hash of the call, over which the files were spread out.
This was very ineffective, with every file, no matter how small it is, taking 4 kB on the file system, making the best compression useless. It's also a nightmare to backup so many small files.
So a "real" database was needed. While this sounds like a perfect use case for one of the many key-value store databases, I took a short cut and used MySQL (because it was there already) with a very simple table that had two fields, the callsign (indexed) and a blob data field for the gzip compressed array described above. Without taking a benchmark against anything else this solution "felt" absolutely fine, with very short query times and a total database size of less than 150 MB, I stopped to look for anything better. No need to over-engineer things!
Now everything was in place, the last thing that was needed was the actual software filling the database with the daily batch of RBN data. With all design questions answered, this was written as a small Perl script that is started by a cron job every night. The source files are listed below.
Added: 07-Jul-2016. Last modified: 07-Jul-2016.
Fabian Kurz → Ham Radio → Stuff