- Building a key-value database in Ruby using TDD: Part 01
- Building a key-value database in Ruby using TDD: Part 02
- Building a key-value database in Ruby using TDD: Part 03
- Building a key-value database in Ruby using TDD: Part 04
- Building a key-value database in Ruby using TDD: Part 05
Table of Contents
- Serializer
- Preventing name mangling
- Creating a CRC-32-compliant checksum
- Rerouting
- Thinking about the epoch timestamp
- Rerouting one more time
- Understanding the key size
- Renaming serialize to pack
- Renaming deserialize to unpack
- Refactoring to avoid name confusions between key and value
- Handling errors when trying to pack invalid types
- Handling errors when trying to unpack invalid types
- Thinking about how serialize should actually work
- Making the test for serialize pass (no, not implementing serialize)
Serializer
Preventing name mangling
The last thing I did in the first part was to develop serialize and deserialize and implement some local constants that map data types to integers, symbols or formats. Now, I've had bad experiences with constants that are objects after building an application in JS (yeah, these maps are objects in my book; they map stuff, but they're good-ol'-fashioned objects with keys and values). Creating a build involves minifying the code, and it turns out that the names of those constants can change because of name mangling. Mangling means that these constants would lose their original names and get some human-unreadable name, while the references to them would be preserved using the name I gave them. This, of course, makes the code no longer work because I'm referencing something that doesn't exist. On a development server it would work for me, but the production-minified code wouldn't.
This is why I like to freeze the objects that are constant, so they cannot be mangled and they cannot be mutated, meaning that I cannot add, edit or delete values. This gives me the confidence that I won't get any unexpected behavior. In Ruby, this is as simple as just calling the freeze
method on the object. I need to freeze all these constant objects, so I'll just apply it to all of them:
DATA_TYPE_SYMBOL = {
  Integer: :Integer,
  Float: :Float,
  String: :String
}.freeze

DATA_TYPE_INTEGER = {
  DATA_TYPE_SYMBOL[:Integer] => 1,
  DATA_TYPE_SYMBOL[:Float] => 2,
  DATA_TYPE_SYMBOL[:String] => 3
}.freeze
DATA_TYPE_FORMAT = {
  DATA_TYPE_SYMBOL[:Integer] => "q<", # 64-bit signed integer, little-endian
  DATA_TYPE_SYMBOL[:Float] => "E"     # double-precision float, little-endian
}.freeze
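As a quick sanity check (not something I'll add to the test suite), trying to mutate one of these frozen hashes now raises an error instead of silently changing the map:

DATA_TYPE_SYMBOL[:Symbol] = :Symbol
# => FrozenError: can't modify frozen Hash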
The tests still pass, of course. Let's remember that I did all of this so I could store the type of the key and the value as an integer.
Creating a CRC-32-compliant checksum
Going back to what I need to store in a record:
- A CRC-32-compliant checksum
- An epoch timestamp
- The key size
- The value size
- The key type
- The value type
- Key
- Value
Now I have all I need to store everything, except the CRC-32-compliant checksum. I could focus on what I already know I can serialize and deserialize, but if the data needs to be stored following the order above (as an array of bytes), then I think it's better to focus on creating a test for the CRC-32 checksum, find a way to make it pass, and fill in that missing piece. So I will create a test for it:
it "creates a CRC-32-compliant checksum" do
expected = 123
expect(KVDatabase::Serializer.crc32("Hello, world!")).to eq(expected)
end
The test will fail because crc32
isn't implemented in the Serializer
module. To make the test pass, I'll fake the implementation and just return what's expected.
def self.crc32(value)
  return 123
end
No keyword parameter because this function will only receive one argument. After this, the test will pass. For the real implementation, as happened with the binary data packing and unpacking, Ruby provides a library called Zlib that includes a crc32 method. Again, these functions might seem like simple wrappers, but they help me see the path I should be following.
require "zlib" # Don't forget to import
def self.crc32(value)
return Zlib.crc32(value)
end
Doing this will break the test, because the expected value is not actually the result of applying the algorithm to create the checksum (unless I was pretty lucky). There's a tool to generate CRC-32-compliant checksums, so I can find the expected value for a simple string in there. Ruby uses the CRC-32/ISO-HDLC
algorithm, so that's the result value I'll pick.
it "creates a CRC-32-compliant checksum" do
expected = 0xEBE6C6E6
expect(KVDatabase::Serializer.crc32("Hello, world!")).to eq(expected)
end
Now the test passes. The expected value is a number expressed in hexadecimal, because that's how these checksums are usually written, but if I converted it to decimal, it would work as well:
it "creates a CRC-32-compliant checksum" do
expected = 3957769958
expect(KVDatabase::Serializer.crc32("Hello, world!")).to eq(expected)
end
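A quick irb check (not a test) confirms both literals are the same number coming out of Zlib:

require "zlib"

Zlib.crc32("Hello, world!")               # => 3957769958
Zlib.crc32("Hello, world!") == 0xEBE6C6E6 # => true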
However, I like to keep it as hexadecimal because it looks less messy. Once this test passes, I can start thinking about the actual goal of creating this checksum. What's its purpose? To maintain the integrity of the data by checking whether the checksum stored in the record matches the expected checksum for that data. But how can I create a checksum that represents the record itself? I think I can just use the data itself to generate the checksum. But if I have to use the data itself, then I first need to know what that data will look like. So I should've started with the data instead of the checksum, but I only reached this conclusion because the test let me understand where I was going.
Rerouting
I'll focus on creating the data of the record and then come back to the checksum. Remembering how data should be stored in a record following the Bitcask model (another shoutout to Dinesh Gowda's article):
I modified the original image because, on one hand, it showed 4 bytes each for the key and value types but a header total of 16 instead of 20, and on the other hand, if the types will be small integers (with what I have now, just 1 to 3), I don't need 4 bytes, I just need 1 byte. So I need a 14-byte header: 4 bytes for the epoch timestamp, 4 bytes for the key size, 4 bytes for the value size, 1 byte for the key type and 1 byte for the value type. Now let's go bit by bit (no pun intended, because they're not actual bits).
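Putting that together, the record layout I'm aiming for looks like this (sizes in bytes):

| crc32 | epoch | key_size | value_size | key_type | value_type | key | value |
|   4   |   4   |    4     |     4      |    1     |     1      |  ?  |   ?   |
          <------------------- 14-byte header ------------------>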
Thinking about the epoch timestamp
Let's create a test for the epoch timestamp:
it "creates an epoch timestamp" do
expected = 1746984608
expect(KVDatabase::Serializer.epoch_timestamp).to eq(expected)
end
It will fail because epoch_timestamp doesn't exist. Here I can skip the fake implementation because I know the obvious one, which is just using Time.now.to_i. Time.now returns a Time object with the current date and time, and to_i converts it into the Unix timestamp in seconds that we expect.
def self.epoch_timestamp
  return Time.now.to_i
end
And now the test... fails. Why? Because the time has changed. Time.now is dynamic, and expected is, of course, not. This test won't be useful then, because I would have to keep fixing the timestamp and it wouldn't actually tell me a thing. So I won't create this test now; when I have a better understanding of the problem, I'll see what needs to be tested in terms of time. The good thing is that this experiment allowed me to understand that I need to use Time.now.to_i to create the timestamp.
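For the record, if I ever do need a deterministic test around time, a common approach is stubbing Time.now with RSpec (just a sketch, not something I'm adding now):

it "creates an epoch timestamp" do
  # Pin "now" to a fixed moment so the expectation is stable.
  allow(Time).to receive(:now).and_return(Time.at(1_746_984_608))

  expect(KVDatabase::Serializer.epoch_timestamp).to eq(1_746_984_608)
end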
Rerouting one more time
I feel like I'm hitting a dead end here. What can I do next? Do I need to start thinking about the implementations instead of tests? No, that would defeat the purpose of TDD. If I want to move to the next part, the key size, what would I need? A function to generate the key size? What is the key size? I know the key size will be measured in bytes, but what will the key be? If the key were "café", would the size be 4 because it has 4 characters? No, because "é" is more than 1 byte, so the number of bytes is greater than the number of characters. Therefore, the size is the length of the array of bytes that represents the key. The same goes for the value size. So I think I can start by creating a test that generates the size from a given key.
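A quick irb check makes the difference between characters and bytes obvious:

"café".length       # => 4 (characters)
"café".bytes.length # => 5 (bytes, because "é" takes 2 bytes in UTF-8)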
Understanding the key size
it "generates the size of a key" do
expected = 5
expect(KVDatabase::Serializer.key_size("café")).to eq(expected)
end
This will fail because key_size doesn't exist. The expected value comes from the previous test for serializing strings ("café" serializes into 5 bytes). I could do a fake implementation, but one thing I know is that if I encode "café" in UTF-8 and take its bytes, I can call .bytes.length on it. That would only work for strings, though, and I already have a serialize method that handles different types. But it needs a type, so I would need to do this:
def self.key_size(key:, type:)
  return serialize(value: key, type: type).bytes.length
end
That seems messy because, why would key_size
depend on serialize
? If I want to create the whole record, the size would be calculated from the data that I want to serialize, so if anything, serialize
would call key_size
, not the other way around.
Renaming serialize to pack
This makes me see a fundamental issue with serialize: this method should be responsible for converting the whole thing, not just one part, so right now it has a specific scope, but it should have a broader one. It only packs data into an array of bytes, but from the outside I would assume serialize is the one that takes the data of the record to be stored and creates the whole thing with the CRC, the header, the key and the value. It's not that the implementation is wrong, it's just that the name is wrong for what it does. I will comment out the test and implementation for key_size first.
And now I will rename serialize
to a more suitable name, pack
, because that's literally what it does.
def self.pack(value:, type:)
  if type == :String
    return value.encode(Encoding::UTF_8)
  end

  return [value].pack(DATA_TYPE_FORMAT[type])
end
This will break the tests, so let's fix them:
it "packs integers" do
expected = "\x14\x00\x00\x00\x00\x00\x00\x00"
expect(KVDatabase::Serializer.pack(value: 20, type: :Integer)).to eq(expected)
end
it "packs floats" do
expected = "\x8f\xc2\xf5\x28\x5c\x8f\x2c\x40".b
expect(KVDatabase::Serializer.pack(value: 14.28, type: :Float)).to eq(expected)
end
it "packs strings" do
expected = "\x63\x61\x66\xc3\xa9"
expect(KVDatabase::Serializer.pack(value: "café", type: :String)).to eq(expected)
end
Renaming deserialize to unpack
The same thing applies to deserialize: it should be a method that takes the stored data, identifies each part and returns the value, but right now it just unpacks some value given its type.
def self.unpack(value:, type:)
  if type == :String
    return value
  end

  return value.unpack1(DATA_TYPE_FORMAT[type])
end
Let's fix the tests:
it "unpacks integers" do
expected = 20
expect(KVDatabase::Serializer.unpack(value: "\x14\x00\x00\x00\x00\x00\x00\x00", type: :Integer)).to eq(expected)
end
it "unpacks floats" do
expected = 14.28
expect(KVDatabase::Serializer.unpack(value: "\x8f\xc2\xf5\x28\x5c\x8f\x2c\x40", type: :Float)).to eq(expected)
end
it "unpacks strings" do
expected = "café"
expect(KVDatabase::Serializer.unpack(value: "\x63\x61\x66\xc3\xa9", type: :String)).to eq(expected)
end
Refactoring to avoid name confusions between key and value
Now the names reflect their purposes better and the tests pass. But thinking about key and value, and the fact that key would be passed as value, the names are a bit confusing: value is just a general name, so of course it could be the key, but I don't want that confusion, so I'll rename value to data. pack first:
def self.pack(data:, type:)
  if type == :String
    return data.encode(Encoding::UTF_8)
  end

  return [data].pack(DATA_TYPE_FORMAT[type])
end
Fix the tests:
it "packs integers" do
expected = "\x14\x00\x00\x00\x00\x00\x00\x00"
expect(KVDatabase::Serializer.pack(data: 20, type: :Integer)).to eq(expected)
end
it "packs floats" do
expected = "\x8f\xc2\xf5\x28\x5c\x8f\x2c\x40".b
expect(KVDatabase::Serializer.pack(data: 14.28, type: :Float)).to eq(expected)
end
it "packs strings" do
expected = "\x63\x61\x66\xc3\xa9"
expect(KVDatabase::Serializer.pack(data: "café", type: :String)).to eq(expected)
end
Now unpack
:
def self.unpack(data:, type:)
  if type == :String
    return data
  end

  return data.unpack1(DATA_TYPE_FORMAT[type])
end
Fix the tests:
it "unpacks integers" do
expected = 20
expect(KVDatabase::Serializer.unpack(data: "\x14\x00\x00\x00\x00\x00\x00\x00", type: :Integer)).to eq(expected)
end
it "unpacks floats" do
expected = 14.28
expect(KVDatabase::Serializer.unpack(data: "\x8f\xc2\xf5\x28\x5c\x8f\x2c\x40", type: :Float)).to eq(expected)
end
it "unpacks strings" do
expected = "café"
expect(KVDatabase::Serializer.unpack(data: "\x63\x61\x66\xc3\xa9", type: :String)).to eq(expected)
end
For crc32 there's no keyword parameter, but internally the parameter is called value, so to better reflect that what will be used for the checksum is the whole data, let's rename it as well:
def self.crc32(data)
  return Zlib.crc32(data)
end
Of course the test still passes because no behavior exposed to the outside was modified.
Handling errors when trying to pack invalid types
I want to pause here because, after making these changes, I realized pack and unpack assume the type will be valid, which is not necessarily true. So I'll first create a new test for pack that checks that an error is raised when the type is not valid.
it "throws an error when packing data of an invalid type" do
expect{KVDatabase::Serializer.pack(data: :symbol, type: :Symbol)}.to raise_error(StandardError, "Invalid type")
end
Apparently I have to use {}
instead of ()
because otherwise the code will not be "catchable".
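The reason is that the block form hands RSpec a piece of code to run, so it can rescue the error itself; with parentheses, the expression is evaluated before expect is even called and the error escapes first:

expect { KVDatabase::Serializer.pack(data: :symbol, type: :Symbol) }.to raise_error(StandardError) # works
# expect(KVDatabase::Serializer.pack(data: :symbol, type: :Symbol)).to raise_error(StandardError)  # raises before RSpec can catch it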
To make the test pass, I can do an ugly (but not that much) thing:
def self.pack(data:, type:)
  if type != :String && type != :Integer && type != :Float
    raise StandardError, "Invalid type"
  end

  if type == :String
    return data.encode(Encoding::UTF_8)
  end

  return [data].pack(DATA_TYPE_FORMAT[type])
end
The test will now pass. This is not the best, but it's not that bad either. I can use a case (Ruby's version of a switch), but instead of hardcoding the symbols for the types, I can use my DATA_TYPE_SYMBOL map.
def self.pack(data:, type:)
  case type
  when DATA_TYPE_SYMBOL[:Integer], DATA_TYPE_SYMBOL[:Float]
    return [data].pack(DATA_TYPE_FORMAT[type])
  when DATA_TYPE_SYMBOL[:String]
    return data.encode(Encoding::UTF_8)
  else
    raise StandardError, "Invalid type"
  end
end
That way I keep my single source of truth and the code stays easy to understand (I think). I know I don't need an explicit return in Ruby, but I feel uncomfortable not using it; to me, it's more readable with it.
Of course, the test will still pass.
Handling errors when trying to unpack invalid types
Now let's create the test for unpack
:
it "throws an error when unpacking data of an invalid type" do
expect{KVDatabase::Serializer.unpack(data: :symbol, type: :Symbol)}.to raise_error(StandardError, "Invalid type")
end
Of course it will fail. Now let's make a similar implementation to make it pass:
def self.unpack(data:, type:)
  case type
  when DATA_TYPE_SYMBOL[:Integer], DATA_TYPE_SYMBOL[:Float]
    return data.unpack1(DATA_TYPE_FORMAT[type])
  when DATA_TYPE_SYMBOL[:String]
    return data
  else
    raise StandardError, "Invalid type"
  end
end
And... the test is fine now. Okay, after addressing this missing case, let's keep going.
Thinking about how serialize should actually work
Going back to key_size, I would still need to pass it the type, which seems like unnecessary coupling. When I feel like there's unnecessary coupling, I ask myself how the thing is supposed to be used and whether it's worth it. If the key size will only be calculated when serializing something, and key_size is dead simple, do I really need an auxiliary function for it? I don't think so. Besides, I would also need a value_size if I'm going to keep key_size, but they do the same thing; they're actually two names for the same operation: getting the size of some packed data. If my tests already check that the data is properly packed, and I'm not using my own implementation but Ruby's library for this, then a test for checking the size wouldn't really add to the level of confidence of the test suite. The current tests check that my understanding of how data is packed and unpacked is right, so I can trust them.
So, what's next? After failed attempts to create more tests, I gained knowledge about how things should work internally, which is of course useful. I think I can now start thinking about the actual serialization process and see what I'm missing in my current implementation. What would this test look like? What information is needed to serialize in the format expected by the Bitcask model? If I were using a key-value database, to store a record I would just want to pass it a key and a value. So that's what the interface will look like: serialize(key:, value:). But... what if I want a custom timestamp for some reason? I would like the possibility to specify it, but if it's not provided, default to now. Internally I can handle creating the checksum, generating the timestamp and getting the type and size of the key and the value. It should return something that allows me to test that it worked. Perhaps it could be the size of the record, so I can check that it equals the size of the CRC + the header + the key + the value. And it could also return the packed key + value, so I can check that they're not empty (because an empty array of bytes wouldn't make sense).
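So the call I'm designing toward looks something like this (a sketch of the intended usage, not the implementation):

size, data = KVDatabase::Serializer.serialize(key: "café", value: 14.28)
# or, with an explicit timestamp:
size, data = KVDatabase::Serializer.serialize(key: "café", value: 14.28, epoch: 1_746_984_608)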
Making the test for serialize pass (no, not implementing serialize)
I see that Ruby has a Faker implementation, so I will use it to generate a random key and a value with a very long sentence. Since these two fake values will be expensive to compute, I'll memoize them and put them outside the test, inside the suite (require 'faker' is needed in spec/spec_helper.rb and the gem must be added to the Gemfile, but I'll skip those details from now on):
RSpec.describe KVDatabase::Serializer do
  describe "#serialize" do
    let(:key) { Faker::Lorem.word }
    let(:value) { Faker::Lorem.sentence(word_count: 5_000) }

    # tests
  end
end
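For reference, the wiring I said I'd skip is just the gem plus the require (assuming a standard RSpec setup):

# Gemfile
gem "faker"

# spec/spec_helper.rb
require "faker"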
And now the test:
it "serializes" do
crc_size = 4
header_size = 14
crc_and_header_size = crc_size + header_size
size, data = KVDatabase::Serializer.serialize(key: key, value: value)
key_size = KVDatabase::Serializer.size(data: key, type: :String)
value_size = KVDatabase::Serializer.size(data: value, type: :String)
expected_size = crc_and_header_size + key_size + value_size
expect(size).to eq(expected_size)
expect(data).not_to be_empty
end
It will fail, of course, because neither size nor serialize is implemented. The implementation for size is obvious, since we've seen it before.
def self.size(data:, type:)
  return pack(data: data, type: type).bytes.length
end
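And it behaves as expected for the packed examples from earlier:

KVDatabase::Serializer.size(data: "café", type: :String) # => 5
KVDatabase::Serializer.size(data: 20, type: :Integer)    # => 8 ("q<" packs into 8 bytes)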
Now things get interesting. How can I make the test pass? I know the size the CRC should have, and I know the size the header should have, so I can have constants for them inside Serializer as well:
CRC32_SIZE = 4
HEADER_SIZE = 14
And return the sum in the first position of serialize
:
def self.serialize(key:, value:, epoch: Time.now.to_i)
  return [CRC32_SIZE + HEADER_SIZE, 0]
end
Now that first part is missing the size of the key and the value. So I'll compute them using size
and add them to the sum:
def self.serialize(key:, value:, epoch: Time.now.to_i)
  key_size = size(data: key)
  value_size = size(data: value)

  return [CRC32_SIZE + HEADER_SIZE + key_size + value_size, 0]
end
Wait, but that won't work because size needs the type of key and value. How do I get them? Having to pass them to serialize would be awful. Isn't there a nicer way? If the types are symbols using the actual name of the type in Ruby, can't I just get the type of key and value and convert it into a symbol? Since these types are classes in Ruby, I could do something like key.class and value.class, but that would give me the class object itself, and I want them as symbols. I cannot get a symbol directly from there, so I need to first convert it into a string and then convert that string into a symbol, so I would have key.class.to_s.to_sym and value.class.to_s.to_sym:
def self.serialize(key:, value:, epoch: Time.now.to_i)
  key_type = key.class.to_s.to_sym
  value_type = value.class.to_s.to_sym

  key_size = size(data: key, type: key_type)
  value_size = size(data: value, type: value_type)

  return [CRC32_SIZE + HEADER_SIZE + key_size + value_size, 0]
end
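A quick irb check of the idea, using the same values as the earlier tests:

20.class.to_s.to_sym      # => :Integer
14.28.class.to_s.to_sym   # => :Float
"café".class.to_s.to_sym  # => :String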
If I run the tests now, I can see that the size expectation passes, but the emptiness one keeps failing, which means my approach to get the size and type of the key and the value worked. Let's focus on a little refactor here first: I'm doing the same thing to get the type of the key and of the value, so let's introduce a little helper for it:
def self.type(data)
  return data.class.to_s.to_sym
end
And now use it:
def self.serialize(key:, value:, epoch: Time.now.to_i)
  key_type = type(key)
  value_type = type(value)

  key_size = size(data: key, type: key_type)
  value_size = size(data: value, type: value_type)

  return [CRC32_SIZE + HEADER_SIZE + key_size + value_size, 0]
end
That's better and the test still works. Now, to make the data not be empty and make the test pass, the only thing I need to do is pack both key and value and return the result of concatenating them:
def self.serialize(key:, value:, epoch: Time.now.to_i)
  key_type = type(key)
  value_type = type(value)

  key_size = size(data: key, type: key_type)
  value_size = size(data: value, type: value_type)

  key_bytes = pack(data: key, type: key_type)
  value_bytes = pack(data: value, type: value_type)

  return [CRC32_SIZE + HEADER_SIZE + key_size + value_size, key_bytes + value_bytes]
end
And... now the test will pass! But... is this the right implementation? It isn't, because the second value of the array should be the whole serialized data, not only the key and the value: it lacks the checksum and the header. But, like any respectable cliffhanger, I'll solve that in the next part.