Writing out what we read
The next part of this exercise is writing back out the database files we read in. By the end of it we should be able to run this new function in our main file:
# main.py
...
+ def test_file_end_to_end():
+     # create the pagers for both the old and new dbs
+     old_db_pager = Pager('test.db')
+     new_db_pager = Pager('generated.db')
+
+     # read in the schema page
+     schema_page = old_db_pager.get_page(1)
+     dbinfo = DBInfo(schema_page)
+     schema_node = Node(schema_page, True)
+
+     # read in the page with the test table
+     data_page = old_db_pager.get_page(2)
+     data_node = Node(data_page)
+
+     # serialize the pages
+     new_schema_page = dbinfo.to_bytes() + schema_node.to_bytes(dbinfo)
+     new_data_page = data_node.to_bytes(dbinfo)
+
+     # write the pages to the new db
+     new_db_pager.write_page(1, new_schema_page)
+     new_db_pager.write_page(2, new_data_page)
+
+     print('generated.db written')
if __name__ == '__main__':
-     test_schema_page()
+     test_file_end_to_end()
For this to work we need to introduce some new functions that allow us to serialize our database header and nodes:
# src/dbinfo.py
...
+     def to_bytes(self) -> bytes:
+         return bytes([])
+
    def _debug(self):
# src/backend/node.py
...
+ from src.dbinfo import DBInfo
from src.util import b2i
...
+     def to_bytes(self, dbinfo: DBInfo) -> bytes:
+         return bytes([])
+
    def _debug(self):
As well as a function that allows us to use our pager to write file content:
# src/backend/pager.py
...
+     def write_page(
+         self,
+         page_number: int,
+         data: bytes,
+     ):
+         # create the file if it doesn't exist
+         if not os.path.exists(self.file_name):
+             with open(self.file_name, 'w'): pass
+
+         with open(self.file_name, 'rb+') as file:
+             file.seek(self.get_offset(page_number))
+             file.write(data)
    def get_offset(
...
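The write_page function leans on the get_offset helper we already use on the read path. Assuming it computes (page_number - 1) * page_size, which matches how SQLite numbers pages starting from 1, the two writes in our end-to-end test land at the following offsets (a quick standalone sketch, not part of the codebase):

# hypothetical sketch of the offset math for 4096-byte pages
page_size = 4096

def offset(page_number: int) -> int:
    # SQLite page numbers start at 1, so page 1 sits at the very top of the file
    return (page_number - 1) * page_size

print(offset(1))  # 0    - the schema page, including the 100-byte db header
print(offset(2))  # 4096 - the page holding our test table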
If we haven’t touched the test.db file since we created it last week, then we should still be able to open it with sqlite3 and read its contents:
➜ sqlite3 test.db
SQLite version 3.39.5 2022-10-14 20:58:05
Enter ".help" for usage hints.
sqlite> select * from test;
hi|1
yo|2
Our goal for this post is to do the same with a file “generated.db” which is written out by our program.
We can test that our above skeleton works by creating the output file and running our main file once more:
➜ python3 main.py
generated.db written
If this works without an error then we can get started serializing the different bits of the database.
Writing out the Records and Cells
We’ll start by writing back out the record values. If you recall, Records contain a header which includes a varint for the header size as well as varints for each of the column types. We’ll handle the column types first, by creating a to_int function which turns the Column class we created back into the integer serial type it came from.
# src/backend/record.py
...
class Column
...
+     def to_int(self) -> int:
+         if self.type.value < 12:
+             return self.type.value
+
+         modifier = self.type.value
+         return (self.length * 2) + modifier
+
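As a quick sanity check on that arithmetic: this assumes our ColumnType enum stores TEXT as 13 and BLOB as 12, matching the SQLite serial type codes, so that (length * 2) + modifier reproduces the original serial type. A standalone check, not part of the codebase:

# SQLite serial types for variable-length columns:
#   TEXT of length N -> 2*N + 13
#   BLOB of length N -> 2*N + 12
# so a TEXT value of length 2 should serialize back to 17 (0x11),
# a value we'll see again in the record test further down
assert (2 * 2) + 13 == 17
assert (2 * 4) + 12 == 20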
From here we can add a new to_varint function which takes our integer values and turns them back into varint byte arrays. A first pass at this yields us:
# src/util.py
...
+ def to_varint(x: int) -> bytes:
+     """
+     to_varint takes an integer and turns it into a variable length byte array
+
+     it takes x, an integer
+
+     and returns a varint representation as a byte array
+     """
+
+     result = bytearray()
+     for j in range(8):
+         byte = x % 16
+         x = x >> 7
+
+         if x > 0:
+             # add the carry bit if there's more information
+             byte = byte | 0x80
+
+         result.append(byte)
+
+         if x <= 0:
+             return bytes(result)
+
+     byte = x % 32
+     result.append(byte)
+     return bytes(result)
However there’s an issue: running the following test doesn’t work:
# test.py (not saved in git)
from src.util import varint, to_varint
broken_example = bytes([0x81, 0x91, 0xd1, 0xac, 0x78])
recomputed = to_varint(varint(broken_example, 0)[0])
print(broken_example.hex(), recomputed.hex())
assert broken_example == recomputed
Running the above test we see a failure:
➜ python3 test.py
8191d1ac78 888c818101
Traceback (most recent call last):
File "/Users/godzilla/projects/pypopsql/test.py", line 6, in <module>
assert broken_example == recomputed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
It turns out generating varints is a bit more complicated than I had originally expected. My assumption hinged on being able to simply do:
bytes.append(x % 128)
x = x >> 7
But this turns out not to be the right approach. For one, the bytes are going to be in the wrong order (least significant byte on the left), and for another we won’t know ahead of time how large the last byte needs to be. In this particular case, the last byte we refer to is the last byte read, which will be the first byte written due to how we intend to pull off the bits.
We can find out if 9 bytes will be needed by checking if any of the first (most significant) 7 bits of the 64-bit integer are filled. If any of them are filled (1), we will need a full 9 bytes to express the varint, and therefore our first byte written will be 8 bits in size.
This is simple enough to express with a bitwise AND against a mask: requires_9_bytes = bool(0xfe00000000000000 & x)
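To convince ourselves the mask does what we want, here’s a quick check we can run in a REPL (illustrative only):

mask = 0xfe00000000000000     # the top 7 bits of a 64-bit integer

small = 0x12345678            # none of the top 7 bits are set
big = 0x4000000000000000      # bit 62 is set, which falls within the top 7 bits

print(bool(mask & small))  # False -> 8 or fewer varint bytes
print(bool(mask & big))    # True  -> all 9 varint bytes needed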
We can use this to modify our to_varint implementation to decide whether the first byte that needs to be pulled off is to be the 9th byte of our varint. Then we can insert, instead of append, the bytes in order until the payload is filled.
# src/util.py
...
def to_varint(x: int) -> bytes:
    """
    to_varint takes an integer and turns it into a variable length byte array
    it takes x, an integer
    and returns a varint representation as a byte array
    """
    result = bytearray()
    first_7_bits = 0xfe00000000000000
    requires_9_bytes = x & first_7_bits
    # if any of the first 7 bits are filled, 9 bytes will be needed
    first_shift_size = 8 if requires_9_bytes else 7
    first_modulo = 256 if requires_9_bytes else 128
    result.append(x % first_modulo)
    x = x >> first_shift_size
    while x > 0:
        # pull first 7 bits, flip the first bit to signal carryover
        byte = (x % 128) | 0x80
        result.insert(0, byte)
        x = x >> 7
    return bytes(result)
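With that change, the round trip that failed earlier now works (a quick check using the same broken_example from above):

from src.util import varint, to_varint

broken_example = bytes([0x81, 0x91, 0xd1, 0xac, 0x78])
value, _ = varint(broken_example, 0)
assert to_varint(value) == broken_example  # passes now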
Then we can add the following line to our utility test to see if the output of to_varint is the same as what we read for varint:
input_data = bytes(test.data)
result, cursor = varint(input_data, 0)
self.assertEqual(test.expected, result)
self.assertEqual(test.cursor, cursor)
+ self.assertEqual(input_data, to_varint(result))
Now we can put those two pieces together to generate the bytes for the Record header: we’ll need to encode the column types as well as the header length. The header length has an interesting little quirk in that its own size contributes to its value, which leads to a unique edge case:
Per the file format documentation, often the header_size is going to be encoded in a single byte. This is because all numbers 127 and below can be encoded in a single byte, and it’s unlikely that a record has so many columns that it needs to communicate a length greater than 127.
Consider for a moment the unlikely case that a Record has 127 TINYINT columns. Encoding a varint of every number 127 and below only takes a single byte. If we kept the assumption earlier that we would only need a single byte, we could say the whole header would be 128 bytes long because it would be the length of the columns - in addition to the space required to encode the payload size. However when we try to encode the number 128 it ends up in 2 bytes - making the actual header size 129 even though the number encoded is 128.
For this reason we can place some inherent limits on this operation, saying that all header lengths 126 and below require 1 byte for the length varint, 2 bytes for anything above and if the user tries to encode a header of greater than 32765 we’ll throw an error. It’s likely impossible that a header of that length is a valid header, especially considering that it’s likely a good bit larger than the OS page size.
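To make the quirk concrete, here’s the arithmetic for the 127-column case using the to_varint helper we just wrote (a standalone sketch):

from src.util import to_varint

# 127 TINYINT columns -> 127 one-byte serial types in the header payload
payload_size = 127

# naive assumption: the length varint is 1 byte, so the header is 128 bytes,
# but the number 128 does not fit in a single varint byte...
assert len(to_varint(payload_size + 1)) == 2

# ...so we assume 2 bytes up front and encode 129, which also takes 2 bytes,
# keeping the declared size and the actual size consistent
assert len(to_varint(payload_size + 2)) == 2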
# src/backend/record.py
...
class Record
...
    def read_value(
...
+     def header_bytes(self) -> bytes:
+         payload = bytearray()
+
+         for column in self.columns:
+             payload += to_varint(column.to_int())
+
+         payload_size = len(payload)
+         if payload_size > 32765:
+             raise Exception(f'unexpected header payload size {payload_size}')
+
+         expected_varint_length = 1 if payload_size < 127 else 2
+         header_size_bytes = to_varint(payload_size + expected_varint_length)
+         return bytes(header_size_bytes + payload)
Once we’ve effectively written out the header payload, we can serialize the body in a similar manner to how we deserialized it originally.
# src/backend/record.py
...
def header_bytes(self) -> bytes:
...
+     @staticmethod
+     def value_bytes(column: Column, value: any) -> bytes:
+         # Note: column types other than NULL, TINYINT, and TEXT are omitted
+         if column.type == ColumnType.NULL:
+             return bytes([])
+         elif column.type == ColumnType.TINYINT:
+             return value.to_bytes(1)
+         elif column.type == ColumnType.TEXT:
+             return bytes(value, 'utf-8')
+         else:
+             raise Exception(f'cannot parse column type {column.type}')
+
+     def values_bytes(self) -> bytes:
+         body = bytearray()
+
+         for i, column in enumerate(self.columns):
+             body += self.value_bytes(column, self.values[i])
+
+         return bytes(body)
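For a quick illustration of what those branches produce (these rely on int.to_bytes defaulting to big-endian byte order, a Python 3.11+ behavior the code above also leans on):

# TINYINT 17 -> a single byte 0x11
assert (17).to_bytes(1) == bytes([0x11])
# TEXT 'OI' -> its utf-8 bytes 0x4f 0x49
assert bytes('OI', 'utf-8') == bytes([0x4f, 0x49])
# NULL -> no body bytes at all; the serial type alone carries the value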
And we can create a function for serializing the entire record payload:
# src/backend/record.py
...
def values_bytes(self) -> bytes:
...
+     def to_bytes(self) -> bytes:
+         header_bytes = self.header_bytes()
+         values_bytes = self.values_bytes()
+
+         return header_bytes + values_bytes
Now we can add a test which deserializes, and then reserializes the record data. We’ll use a simple example which has 3 columns, a TINYINT, an INTEGER and a VARCHAR of length 2:
# test/backend/test_record.py
+     def test_record_read_write_payload(self):
+         # 4 byte header
+         # col 1 - tinyint, 17
+         # col 2 - integer, 1114129
+         # col 3 - string, 'OI'
+         payload = bytes.fromhex('0401041111001100114f49')
+         record = Record(payload, 0)
+         self.assertEqual(record.to_bytes(), payload)
This test takes an inbound payload, turns it into a record, and then verifies that the output of the to_bytes function on the record matches the original payload.
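For reference, here’s how that example payload breaks down byte by byte:

payload = 04 01 04 11 | 11 | 00 11 00 11 | 4f 49

04          -> header size 4 (the varint counts itself)
01          -> serial type 1: 1-byte integer (the TINYINT)
04          -> serial type 4: 4-byte integer
11          -> serial type 17 = (2 * 2) + 13: TEXT of length 2
11          -> value 17
00 11 00 11 -> value 0x00110011 = 1114129
4f 49       -> value 'OI'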
We can follow on this work by adding a similar and simple to_bytes function to our TableLeafCell class:
# src/backend/cell.py
...
+     def to_bytes(self):
+         payload = self.record.to_bytes()
+         payload_size_bytes = to_varint(len(payload))
+         row_id_bytes = to_varint(self.row_id)
+         return payload_size_bytes + row_id_bytes + payload

    def _debug(self):
And we can add a testing class to ensure this behavior works as well:
# test/backend/test_cell.py (new file)
import unittest
from unittest import TestCase

from src.backend.cell import TableLeafCell


class TestCell(TestCase):
    def test_table_leaf_cell(self):
        data = bytes([
            0x06, # payload size
            0x02, # row id
            0x03, 0x11, 0x01, 0x79, 0x6f, 0x02, # payload
        ])

        cell = TableLeafCell(data, 0)

        self.assertEqual(cell.payload_size, 6)
        self.assertEqual(cell.row_id, 2)
        self.assertEqual(cell.payload, data[2:])
        self.assertEqual(cell.cursor, 8)
        self.assertEqual(cell.to_bytes(), data)

if __name__ == '__main__':
    unittest.main()
where the final assertion self.assertEqual(cell.to_bytes(), data) validates the behavior of turning the cell back into bytes.
Serializing the Nodes
Building on the work done for the record values, we can now move on to the Node class. As mentioned earlier, the data layout for these builds from both the left and the right of the page, with the node header and cell pointer array on the left and the cell content on the right.
Getting this behavior to work requires us to build the cell data from right to left while keeping track of the pointers to the cell data so that we can add them to the array on the left. It also requires us to pad the space in the middle so that the right number of bytes gets serialized to disk.
We also need to keep track of the reserved bytes setting that’s built into the SQLite engine so that we’re compliant with the engine’s configuration, as well as shrink the output to make space for the database header if it’s the first page in the database.
All of this requires a good amount of simple, yet precise arithmetic so we need to be careful with this step as it is likely the most mechanically complex operation we’ve dealt with to this point.
As discussed last week, the node header looks like this in a byte-by-byte level:
Offset  Size  Description
0       1     The one-byte flag at offset 0 indicating the b-tree page type.
1       2     The two-byte integer at offset 1 gives the start of the first freeblock on the page, or is zero if there are no freeblocks.
3       2     The two-byte integer at offset 3 gives the number of cells on the page.
5       2     The two-byte integer at offset 5 designates the start of the cell content area. A zero value for this integer is interpreted as 65536.
7       1     The one-byte integer at offset 7 gives the number of fragmented free bytes within the cell content area.
8       4     The four-byte page number at offset 8 is the right-most pointer. This value appears in the header of interior b-tree pages only and is omitted from all other pages.
And the values for the page type byte:
- A value of 2 (0x02) means the page is an interior index b-tree page.
- A value of 5 (0x05) means the page is an interior table b-tree page.
- A value of 10 (0x0a) means the page is a leaf index b-tree page.
- A value of 13 (0x0d) means the page is a leaf table b-tree page.
- Any other value for the b-tree page type is an error.
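In our codebase those flag values presumably back the node type enum we use when parsing pages. A minimal sketch of what that mapping looks like (the names here are assumptions; the real definitions live in src/backend/node.py):

from enum import Enum

class NodeType(Enum):
    INDEX_INTERIOR = 2    # 0x02
    TABLE_INTERIOR = 5    # 0x05
    INDEX_LEAF = 10       # 0x0a
    TABLE_LEAF = 13       # 0x0d

Because the enum values are the on-disk flag bytes, writing the type back out later is just self.node_type.value.to_bytes(1), which is exactly what header_bytes does below.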
One of the benefits of our approach is that since we’re not concerned with speed, and because we deserialize the payload in full into Python objects, we don’t need to think about freeblocks and fragmented bytes. If we were thinking more about performance, we might not even turn the byte array into objects; we would simply write helper functions to seek, read, and write values from the byte array in place. Since we’re not doing that, we can pack the cells tightly together in the page, which gives us the benefit of simplicity at the expense of execution speed.
Something to note before we get started, however, is that we won’t have the starting location of the cell content until we serialize the cells, so for now we’ll leave that blank. We can do this by adding a header_bytes function to the Node like we did for the Record:
# src/backend/node.py
...
+     def header_bytes(self, cell_offset: int=0) -> bytes:
+         node_type_bytes = self.node_type.value.to_bytes(1)
+         first_freeblock_bytes = (0).to_bytes(2)
+         num_cells_bytes = len(self.cells).to_bytes(2)
+         cell_offset_bytes = cell_offset.to_bytes(2)
+         num_fragmented_bytes = (0).to_bytes(1)
+
+         return node_type_bytes + \
+             first_freeblock_bytes + \
+             num_cells_bytes + \
+             cell_offset_bytes + \
+             num_fragmented_bytes

    def to_bytes(self, dbinfo: DBInfo) -> bytes:
...
If we want, we can modify our table leaf node test to verify this behavior:
# test/backend/node.py
...
    def test_table_leaf_parse(self):
    ...
        self.assertEqual(node.cells[1].cursor, 25)

+         header_bytes = data[:8]
+         self.assertEqual(header_bytes, node.header_bytes(17))
From here we can add another function cells_bytes which will give us:
- A section of bytes which corresponds to the 2-byte cell pointer list
- A section of bytes corresponding to the cell content block
- The integer location of the start of the cell content area (used in the header)
We have some creative direction here as well. Since the format doesn’t specify which order the cell content needs to be in, we could either have the cell content left to right like:
[node header, cell pointer 1, cell pointer 2, .... cell content 1, cell content 2]
Or in the reverse, right to left like:
[node header, cell pointer 1, cell pointer 2, .... cell content 2*, cell content 1*]
The latter option is easier, as it allows us to calculate the offsets in the order that the cell pointers appear. Additionally this is the approach taken by SQLite in our test.db file, so it allows us to be consistent with the engine’s behavior. We can generate these values with the following code:
# src/backend/node.py
+     def cells_bytes(self) -> Tuple[
+         bytes,  # cell pointer bytes
+         bytes,  # cell content bytes
+         int,    # cell content start pointer
+     ]:
+         cell_pointer_bytes = bytes([])
+         cell_content_bytes = bytes([])
+
+         pointer = self.page_size
+         for cell in self.cells:
+             content_bytes = cell.to_bytes()
+             pointer = pointer - len(content_bytes)
+             pointer_bytes = pointer.to_bytes(2)
+
+             cell_pointer_bytes += pointer_bytes
+             # cell content grows leftward
+             cell_content_bytes = content_bytes + cell_content_bytes
+
+         return (cell_pointer_bytes, cell_content_bytes, pointer)

    def to_bytes(self, dbinfo: DBInfo) -> bytes:
Note how we calculate the pointer locations, which move leftward as we serialize each cell. To reason about this, consider the following example of a theoretical 32-byte page with two cells:
cell 1 - length 3
cell 2 - length 4
[0, 1, 2, ... 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
                              ↑               ↑
                              start cell 2    start cell 1
We’re aiming for cell 1 to start at 29, occupying bytes 29, 30, and 31, and for cell 2 to start at 25, occupying bytes 25, 26, 27, and 28. Using the code above, the pointer starts at 32. When we serialize cell 1, we calculate its pointer to be 32 - 3 = 29, the location we were aiming for, and when we serialize cell 2, we calculate its pointer to be 29 - 4 = 25, also what we were aiming for. When the function exits, it returns 25 as the cell content start location, which is also the value we are targeting.
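We can simulate that pointer math outside of the Node class with hard-coded cell lengths (a standalone sketch):

page_size = 32
cell_lengths = [3, 4]  # cell 1 is 3 bytes, cell 2 is 4 bytes

pointer = page_size
pointers = []
for length in cell_lengths:
    pointer -= length
    pointers.append(pointer)

print(pointers)  # [29, 25] - matches the offsets we were aiming for
print(pointer)   # 25       - returned as the cell content start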
We don’t yet protect against the cell content overflowing the page, nor do we pad the page with reserved bytes as specified in the DBInfo file, but that’s something we’ll get into after we are able to write out the database header. For now we can finish our first pass at the to_bytes function on the Node.
There’s one more bit of arithmetic to perform in this section, which is calculating the number of null bytes between the cell pointer array and the cell content block. We need to take into account the following information:
- The page size of the node
- The size of the database header
- The size of the node header
- The size of the cell pointer array
- The size of the cell content block
We will also need to account for the reserved bytes once we can read them from the database configuration, but we can factor that logic into the size of the cell content block to simplify this operation. The to_bytes function will serialize the cell content and pointers, serialize the node header, calculate and create the null byte space in the middle, and return those items formatted as one contiguous block. The output should be exactly the size of a database page, so if it isn’t we can halt and notify ourselves.
Visualizing the layout using the above example, we can imagine that bytes 0-7 will be occupied by the header, bytes 8-11 by the cell pointer array, and bytes 25-31 by the cell content block.
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
bytes  0-7    node header
bytes  8-11   cell pointer array
bytes 12-24   free space
bytes 25-31   cell content block
This means that bytes 12 to 24 should be empty space, which totals 13 bytes. Using the equation page size - (header size + pointer array size + content block size), this works out to 32 - (8 + 4 + 7) = 13, which is an effective calculation for the number of bytes of empty space. We can now give a first pass at the to_bytes function, which you can see below:
# src/backend/node.py
...
    def to_bytes(self, dbinfo: DBInfo) -> bytes:
+         db_header_len = 100 if self.has_db_header else 0
+
+         cell_pointer_bytes, cell_content_bytes, cell_content_start = self.cells_bytes()
+         header_bytes = self.header_bytes(cell_content_start)
+
+         total_content_len = db_header_len + len(header_bytes) + len(cell_pointer_bytes) + len(cell_content_bytes)
+         num_null_bytes = self.page_size - total_content_len
+
+         if num_null_bytes < 0:
+             raise ValueError(f'node page overflows by {abs(num_null_bytes)} bytes')
+
+         null_bytes = bytes([0x00] * num_null_bytes)
+
+         return header_bytes + cell_pointer_bytes + null_bytes + cell_content_bytes
-         return bytes([])
And use our existing test to verify its behavior:
# test/backend/node.py
...
        self.assertEqual(header_bytes, node.header_bytes(17))
+         self.assertEqual(data, node.to_bytes(get_simple_dbinfo()))
If we did everything right this test should still pass. We’ve taken the first step for serializing full database pages, now we can move onto the database header.
Serializing the database header and schema page
The database header should be simple enough: we just reserialize the bytes in the way that we found them, and if they changed along the way we reflect those changes in our output. We can nearly copy the order that we found in the DBInfo constructor, and the only piece that requires more than serializing integers to bytes will be the version. One additional step will be required, however, in that we will need to manually place the 20 bytes of reserved space between indices 72 and 92.
# src/dbinfo.py
...
    def to_bytes(self) -> bytes:
+         data = b'SQLite format 3\x00'
+
+         data += self.page_size.to_bytes(2)
+
+         data += self.file_format_write_version.value.to_bytes(1)
+         data += self.file_format_read_version.value.to_bytes(1)
+
+         data += self.page_end_reserved_space.to_bytes(1)
+
+         data += self.maximum_embedded_payload_fraction.to_bytes(1)
+         data += self.minimum_embedded_payload_fraction.to_bytes(1)
+         data += self.leaf_payload_fraction.to_bytes(1)
+
+         data += self.file_change_counter.to_bytes(4)
+
+         data += self.db_size_in_pages.to_bytes(4)
+         data += self.first_freelist_trunk_page.to_bytes(4)
+         data += self.num_freelist_pages.to_bytes(4)
+
+         data += self.schema_cookie.to_bytes(4)
+         data += self.schema_format_number.value.to_bytes(4)
+
+         data += self.default_page_cache_size.to_bytes(4)
+         data += self.largest_btree_root_page.to_bytes(4)
+         data += self.text_encoding.value.to_bytes(4)
+         data += self.user_version.to_bytes(4)
+
+         data += self.incremental_vacuum_mode.to_bytes(4)
+         data += self.application_id.to_bytes(4)
+
+         # reserved expansion space
+         data += bytes([0x00] * 20)
+
+         data += self.version_valid_for.to_bytes(4)
+         data += self.version.to_bytes()
+         return data
-         return bytes([])
We can then add the corresponding function for serializing the version:
# src/dbinfo.py
...
class Version:
    ...
    def to_int(self) -> int:
        return (self.major * 1000000) + \
            (self.minor * 1000) + \
            self.patch

    def to_bytes(self) -> bytes:
        return self.to_int().to_bytes(4)
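As a quick check, the version string reported by sqlite3 earlier, 3.39.5, packs into the integer we expect to find at offset 96 of the header:

# 3.39.5 -> 3039005, the SQLITE_VERSION_NUMBER stored at offset 96
assert (3 * 1000000) + (39 * 1000) + 5 == 3039005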
And now that we have the serialization fully wired up for the DBInfo we can add a new test to ensure the system is working end to end.
# test/test_dbinfo.py
...
+     def test_dbinfo_to_bytes(self):
+         dbinfo = DBInfo(EXAMPLE_DBINFO_BYTES)
+         # test first hundred bytes
+         self.assertEqual(EXAMPLE_DBINFO_BYTES[:100], dbinfo.to_bytes())
if __name__ == '__main__':
There’s one last thing to do before we’re able to serialize the inbound test.db file. SQLite right-pads the database pages with a configurable amount of reserved space; when I created a db from scratch, the setting was 12 bytes. Per the docs this information is included in the header:

Offset  Size  Description
20      1     Bytes of unused "reserved" space at the end of each page. Usually 0.
We need to use this information from the header to do two things:
- Modify the starting offset for the cell content area
- Add additional padded bytes when we serialize a node
Let’s start by modifying our cells_bytes function to add this behavior:
# src/backend/node.py
...
-     def cells_bytes(self) -> Tuple[...]:
+     def cells_bytes(self, reserved_bytes=0) -> Tuple[...]:
          cell_pointer_bytes = bytes([])
          cell_content_bytes = bytes([])
-         pointer = self.page_size
+         pointer = self.page_size - reserved_bytes
          for cell in self.cells:
              ...
+         cell_content_bytes += bytes([0x00] * reserved_bytes)
          return (cell_pointer_bytes, cell_content_bytes, pointer)
And then we can use the dbinfo parameter from to_bytes to pass this information in:
    def to_bytes(self, dbinfo: DBInfo) -> bytes:
        db_header_len = 100 if self.has_db_header else 0
-         cell_pointer_bytes, cell_content_bytes, cell_content_start = self.cells_bytes()
+         cell_pointer_bytes, cell_content_bytes, cell_content_start = self.cells_bytes(dbinfo.page_end_reserved_space)
Now we can test that the schema page is being serialized properly - header and all:
# test/backend/test_node.py
...
    def test_schema_header_page(self):
    ...
        self.assertEqual(cell.record.cursor, 4084)
+         serialized_page = dbinfo.to_bytes() + node.to_bytes(dbinfo)
+         self.assertEqual(data, serialized_page)
And finally, we can try to read our test.db file and create a generated.db file which hopefully SQLite can read:
➜ python3 main.py
generated.db written
➜ sqlite3 generated.db
SQLite version 3.39.5 2022-10-14 20:58:05
Enter ".help" for usage hints.
sqlite> .tables
test
sqlite> select * from test;
hi|1
yo|2
And just like that we’ve written a fully valid SQLite database with one user table from scratch! Moving forward we’re going to continue working with the database nodes. There’s a lot to tackle, like inserting and deleting rows, adding indexes, and growing beyond a single database page. We’re also going to see what happens when we add columns to existing tables, and try to see if we can overflow nodes. Once we’ve painted over a lot of the backend parts of the system, we should be able to move on to the database engine and compiler.
Click here to see the codebase after the above modifications