Storing Simulation Results in a Database
This has been in use in some Digital Verification flows for quite some time. Why would you want to do it? It turns out that database applications can be very fast, reliable and asynchronous. Much faster and more reliable that writing your own code that is e.g. layered on top of a file system. The reliability comes from advanced locking mechanisms that ensure no data is lost as many clients attempt to concurrently read and write. Compare this to mangled, interleaved and partially buffered output streams from multiple processes and races corrupting files and pipes in a home grown regression framework. We can reduce the framework interaction to read and write through the database's API and let the database application schedule writing and keep the data free of corruption. There is no need for any local locking mechanisms to prevent file corruption. Advantages include- Concurrent, fast and asynchronous upload, management of conflicts via locks (data not lost or corrupted)
- Concurrent and coherent reading
- Indexing for fast retrieval (much faster than walking over a directory structure and examining files)
- Archiving of results for later mining (trend analysis)
However there is one large barrier to using this technology and that is understanding how databases work and learning how to use them.
Which database?
Is probably the hardest question. There are a plethora of them, but for this application they can be subdivided into two categories
- Traditional SQL based, MySQL (MariaDB, Percona, Drizzle), PostgreSQL, sqlite, firebird
- Newer, more fashionable NoSQL, mongoDB, cassandra, voldemort, couchDB
SQL
Of the traditional SQL databases listed sqlite is the odd one out as it is an embedded database that does not require a separate daemon process like the others. We will use it in this example as it has a low overhead and is easy to use. Most importantly it requires zero effort to set up. However it's performance in scaling to a large number of clients (10's to 100's) is unknown to me but it may still perform well in this use case.
The others require more effort in set up and hardware. Ideally they should have dedicated local disk resource (performance will be terrible if NAS storage is used). They also require tuning of configuration parameters for best results, for example this type of application tends to be append only (we don't update or delete rows) and we can be fairly relaxed about what happens in a catastrophic failure (we don't care if we loose some data if the machine dies). See below for some pointers in the case of MySQL and forks, perhaps there is scope for yet another blog post on this subject.
NoSQL
The NoSQL databases all have similar requirements, to each other and the conventional SQL databases - a daemon process and fast local disk for best results. They require less configuration, but you will still need to learn how to use them. I have not done any performance comparison against SQL databases so don't know how they compare when inserting and querying. Elsewhere I have mentioned functional coverage - I have an inkling that an SQL type database will be better suited to this, so in the event of otherwise similar performance the SQL type may win out.
Schema Design
Not so much of an issue with NoSQL (otherwise known as schema-free) databases, but you'll still need to index them. But in an SQL database we'll need to carefully design the tables and index them.
Having said that the tables in this simple example are, well, simple. We have a table for each invocation (log) which may or may not be a simulation. There is a field in this table to add a description of what it may be, could be regression, simulation, synthesis or anything else. Also who (uid) is running it and where it is being hosted, and also some parent information to allow us to put a tree of activities (e.g. a regression) back together.
We also add some pertinent indexes on this table to allow us to search it more efficiently.
We have a second table, message, that holds individual messages. Each has a field keyed to a unique log table entry, and this is how we put logs back together again by joining the tables on the log_id field.
Updating the database
In order to develop this example in the shortest possible time a Python callback is placed in the message emit callback list which enqueues each emitted message to be committed to the database. Even though this runs in a separate thread it's probably not the best implementation performance wise. A C++ version would make more sense. However being able to use Python demonstrates the flexibility in this approach to facilitate speedy prototyping (and the rapid turnaround in iterations that using an interpreted language allows).
The Python library is here.
When the messaging instance is created an entry is made in the log table, and the key id of this row is stored for use in all subsequent message table entries.
Each message is placed in a queue and the queue is flushed and committed to the database at each tick of a timer. The commit action is placed in a separate thread so that the process runs asynchronously to the simulation, thus the simulation does not normally stop when messages are sent to the database. However if a message of high enough severity is encountered the threads are synchronized and the message immediately committed, halting simulation. This is an attempt to prevent this message being lost in any subsequent program crash - at least that message will have been captured. The severity is programmable so this behaviour can be changed.
There is also a programmable filter function that can be used so that messages only of set severities or some other function of message parameters are committed.
Each commit to the database is an insert, with the message severity, text, time, filename and line number together with any associated identifier if given (e.g. IDENT-1). The motivation with the identifier is to make it easier to search for specific messages across multiple simulations when mining some specific data or grouping for triage. For example, to find all Python PYTN-0 messages :
sqlite> select * from log natural join message
where message.ident = 'PYTN' and message.subident = '0';
This was one of the original goals; we can easily find messages of a particular type without resorting to regular expression searching a number of files. And not just of a particular type, but from a file, line or of a particular severity.
Results
When simulations are run using this environment all interesting log messages are logged to a database. We can then retrieve these messages post mortem to examine status from test to regression.
The next post will look at how we can do this in a web browser.
Installing and running MySQL
A brief note on getting a database up and running.
- Which flavour? MySQL, MariaDB or Percona? I'd probably lean toward Maria & Aria engine for this type of application, perhaps Percona if I wanted to use XtraDB (note it's also available in Maria). I'd pick the latest non alpha version - if you're building from source (which I would also recommend) it's easy to have multiple versions installed anyway.
- It is possible to have multiple instances of multiple versions running on the same machine. Building from source seems to be straightforward and this makes it easy to have multiple versions installed. You'll need to place each instance on a separate port, of course.
- You do not need to run as mysqladmin user, most compute farms are fire-walled off the internet, so we do not need to worry about hackers. Whilst you may want to consider running your production server under a special user login (and add some basic table permissions to perhaps prevent row deletion/alteration), it is possible to prototype under your own login. If you're changing configuration parameters you'll be taking the daemon up and down frequently.
- Hardware is important. You will need fast local storage (10k+ SAS, SSD or PCIe flash) and plenty of memory for best results. Don't use a networked drive at all costs. A recent processor will help too, avoid using hand me downs if this is going to be critical to you.
- You will gain most by tuning your configuration, especially if you choose to use InnoDB/XtraDB. Most recent releases have been set to use this by default, so if you're not using MyISAM or Aria make sure that the daemon configuration file has the following set :
- innodb_flush_log_at_trx_commit is 2
- innodb_flush_method is O_DIRECT
- innodb_buffer_pool_size is sized to the installed memory (70% of available RAM)
- innodb_log_file_size is small, or send the logs to disks (if using SSDs) or turn off binary logging altogether with binlog_ignore_db, so as not to fill the primary (expensive?) storage.
- transaction-isolation is READ-UNCOMMITTED
- If you are using MyISAM or Aria
Functional Coverage
This post has only covered storing simulation messages and results. It is also possible to store coverage data with many of same advantages. Storing data on a per simulation basis and merging it when required yields good results when a performant schema is used. Merging coverage data from many thousands of simulations each with thousands of buckets can be done surprisingly quickly (in the order of seconds) and individual buckets ranked by test even quicker. In fact it is possible to generate which tests hit which buckets data in real time as a mouse is rolled over a table.
Perhaps this is for another blog post series.
No comments:
Post a Comment