Friday 13 December 2013

Using a Relational Database for Functional Coverage Collection Part Two

Schema for Functional Coverage

Selection criteria include
  1. Performance. I have an expectation of rapid query results. I don't want to have to wait, and I don't want other engineers having to wait. To a lesser extent I want the upload to be fast too; I don't want the simulation to have to wait [to finish, so the next one can start].
  2. Compact. I will conceivably create a large amount of data. I wish to keep it all for later analysis, at the project level and possibly beyond.
  3. Easy to understand. If I do use an Object Relational Mapper I still want to be able to hand code queries in the event that generated ones aren't fast enough.
  4. Ease of merging, across regressions and also when part of a coverage point has changed. If a coverpoint has changed it is unlikely we can still merge the coverage of that point with any meaning, but we can merge the unchanged portions of the data set.
As we are building upon the previous blog series we will start with the schema used there for describing simulation runs, reusing the log_id field to reference the simulation invocation.

We must store three pieces of information.
  1. The hierarchy and description of the coverage points, including the axis/dimensions of each point and the enumerations of each axis/dimension.
  2. The goal or target of each bucket defined. We can also encode whether a particular bucket cannot be hit, is uninteresting, or would be illegal to hit.
  3. The actual hits from each invocation. These are tallied to provide the merged coverage.
We will look at each in turn.

Hierarchy and Description of the Coverage Points

The first item could be done in a variety of ways. It does not necessarily require database tables; the information is relatively small and compact, so a file could be dropped somewhere. It does need to be persistent, however, since the stored coverage data is meaningless if it cannot be mapped back to coverage points. In a similar vein it could be stored as a blob in the database, for example in JSON format. Using JSON would make it very easy to serve the data to a browser, but perhaps harder to post process with another scripting language.
Alternatively, we could encode the coverpoint tree in database tables. There is plenty of material on the web detailing how to encode a tree in an RDBMS - the Adjacency List Model is just one of them, detailed here. Note that we'll not be updating the tree, so we don't need a method that allows the tree to be edited. For this schema we will use three tables (a possible set of definitions is sketched after the list)
  1. One for the points themselves. Name, description, parent, MD5 hashes for congruency checking.
  2. One for the axes/dimensions of the cross. Parent coverpoint and name. Multiple dimensions per point.
  3. One for the enumerations of the dimensions. Text enumeration and integer value. Multiple enumerations per dimension.
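A minimal sketch of these three tables is given below. Only the columns mentioned above are included, and the types and exact column names are illustrative (sqlite-flavoured DDL assumed).

  CREATE TABLE point (
    point_id    INTEGER PRIMARY KEY,  -- this coverage point or hierarchy node
    log_id      INTEGER NOT NULL,     -- master invocation that declared the coverage
    parent      INTEGER,              -- parent point_id, NULL at the root
    name        TEXT NOT NULL,
    description TEXT,
    md5_self    CHAR(32),             -- congruency of name and description
    md5_axes    CHAR(32),             -- congruency of axis and enumeration data
    md5_goal    CHAR(32)              -- congruency of goal data
  );

  CREATE TABLE axis (
    axis_id  INTEGER PRIMARY KEY,
    point_id INTEGER NOT NULL,        -- owning coverage point; multiple axes per point
    name     TEXT NOT NULL
  );

  CREATE TABLE enum (
    enum_id     INTEGER PRIMARY KEY,
    axis_id     INTEGER NOT NULL,     -- owning axis; multiple enumerations per axis
    enumeration TEXT NOT NULL,        -- textual enumeration
    value       INTEGER NOT NULL      -- corresponding integer value
  );
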
To regenerate coverage point hierarchy and axes/dimensions with enumerations we can use the following query.

  SELECT * FROM point
    LEFT OUTER JOIN axis USING (point_id)
    LEFT OUTER JOIN enum USING (axis_id)
    WHERE log_id = @master;

If we then collate by point, axis and enum we can recreate the hierarchy.
We could choose a schema where points are not bound to any invocation, but for this example I'm going to use one where the log_id of a master invocation is used to reference the goal data, so this column is added to the point table. This schema won't naturally merge runs of different coverage, as they won't share bucket identifiers; it could be made to work, rather like merging edited coverage, but both are beyond the scope of these articles. You can still merge identical coverage points that are congruent but do not share the same master id.
The point table will create a hierarchy in the same way as the log table, using a log_id and a parent column. This means every node can optionally have coverage or not, and children or not.
It will also use separate tables for the dimensions and their individual enumerations. We also add MD5 sums for congruency checking. In the event that a coverage merge is attempted on differing goal data (perhaps because of an edit) we can refuse to do it, or invoke a special merging algorithm that can cope (not implemented here*). We have three MD5 sums
  1. md5_self for congruency checking of name and description
  2. md5_axes for congruency checking of axis and enumeration data
  3. md5_goal for congruency checking of goal data 
If all three sums are identical then the coverage points and goals are exactly the same. If only the goal data has changed the merge is still straightforward. If the axes have changed but the number of buckets hasn't, the merge is also straightforward but may produce strange results if the axis information is completely different. And there is always the case where the previous coverage collection was just plain wrong and the hit data meaningless, in which case we wouldn't want that data at all. The MD5 hashes are there to provide information on the validity of any coverage merge.
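As an illustration, a merge tool could compare the hashes of two masters before attempting a merge. A minimal sketch, matching points by name (how points are paired up is itself an assumption):

  -- Points whose axes or enumerations differ between two master invocations
  SELECT a.name
    FROM point a JOIN point b ON a.name = b.name
    WHERE a.log_id = @master_a
      AND b.log_id = @master_b
      AND a.md5_axes <> b.md5_axes;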

Goal & Hits of each Bucket

There are two very different ways of collating the hit counters. The simplest approach would be to have a combined goal and hit table that is updated by each coverage run, essentially keeping a running tally of hits. The upload of coverage data becomes an SQL UPDATE command, though it may require a temporary table to hold the imported coverage data from the test run prior to the update, which is dropped afterwards. This effectively merges the coverage on the fly, and the RDBMS acts to queue the update requests and keep the table coherent.
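A minimal sketch of that flow, assuming a combined goal_hit table keyed by bucket_id (all names here are illustrative):

  -- Stage this run's non-zero bucket counts, then fold them into the running tally
  CREATE TEMPORARY TABLE run_hits (
    bucket_id INTEGER PRIMARY KEY,
    hits      INTEGER NOT NULL
  );
  -- ... bulk load the run's coverage data into run_hits ...
  UPDATE goal_hit
     SET hits = hits + COALESCE(
       (SELECT r.hits FROM run_hits r WHERE r.bucket_id = goal_hit.bucket_id), 0);
  -- (this form touches every row; a join-based UPDATE would be faster but is engine specific)
  DROP TABLE run_hits;
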
The second method is to append all coverage data from each and every run to a table and collate the total cumulative coverage on demand, or at the end of testing when no more coverage is to be generated.
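With this append-only approach the upload is a plain bulk insert and the merge becomes an aggregate query; a sketch, again with illustrative names:

  -- Each run appends only the buckets it actually hit
  INSERT INTO hit (log_id, bucket_id, hits) VALUES (@log, @bucket, @count);
  -- Cumulative coverage on demand
  SELECT bucket_id, SUM(hits) AS total_hits
    FROM hit
    GROUP BY bucket_id;
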
Obviously the second method will create much, much larger tables (of course, we wouldn't write any buckets with zero hits; we simply omit them) and adds a potentially costly post processing step. However, it does preserve valuable information that can be used
  • To generate the coverage of the individual test.
    • For inspection.
    • In combination with pass/fail statistics to determine any correlation between coverage and test failure.
  • To determine which tests hit which buckets (see the query sketched after this list).
    • Useful as an indicator of where to focus to improve coverage.
    • To locate examples of particular behaviour. (Which tests do exactly this? I have found this useful on numerous occasions.)
  • To determine the subset of tests that would need to be run to achieve the maximum coverage. 
    • Ultimately some tests will not be contributing to the headline coverage because they are only hitting buckets that other tests have already hit, and so are not exercising any previously untested features. We may wish not to run these tests to reduce the overall test set size and execution time. Or we may choose to run these first to get an early indication that coverage is still as expected.
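A minimal sketch of the "which tests hit this bucket" query, assuming the append-only hit table described above:

  -- Invocations that hit a particular bucket, heaviest hitters first
  SELECT log_id, hits
    FROM hit
    WHERE bucket_id = @bucket AND hits > 0
    ORDER BY hits DESC;
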
When using separate goal and hit tables we can merge the coverage of a regression, write it to the hit table and create a cached copy against the log of the root invocation. This can save the processing time associated with subsequent merges, and there's no reason it could not be overwritten later.
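One way to write that cached copy, assuming the log table from the earlier series lets us enumerate the children of a root invocation (the root column here is an assumption about that schema):

  -- Cache the merged coverage of a regression against its root invocation
  INSERT INTO hit (log_id, bucket_id, hits)
    SELECT @root, bucket_id, SUM(hits)
      FROM hit
      WHERE log_id IN (SELECT log_id FROM log WHERE root = @root)
      GROUP BY bucket_id;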

I want to consider the following schema options for the goal and hit tables. Both require a log_id key and a goal/hit column, but we can index a bucket using either one or two columns.
  1. Use a single column that holds a bucket index unique across the whole coverage. It is harder to work back to a point from the index (you have to subtract an offset), but that is not an issue as the normal working mode is the other way around - from a point to a bucket - which is much easier to query (a sketch of this mapping follows the list).
  2. Use two columns: one to index the bucket within a point, one to index the coverage point. Here (point_id, bucket_id) is unique. This makes merging changed coverage easier, as the nth bucket of a given point stays the same even if another point has lost or gained buckets. The downside is that this uses more space to store the data.
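For the first option, working from a point to its buckets only needs the point's offset into the global index. A sketch, assuming hypothetical bucket_offset and bucket_count columns on the point table:

  -- Merged hits for one point under the single-index scheme
  -- (bucket_offset and bucket_count are assumed, illustrative columns)
  SELECT h.bucket_id, SUM(h.hits) AS hits
    FROM point p
    JOIN hit h ON h.bucket_id >= p.bucket_offset
              AND h.bucket_id <  p.bucket_offset + p.bucket_count
    WHERE p.point_id = @point
    GROUP BY h.bucket_id;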

Selection for this example

I want to keep the example as simple as possible but also as instructive as possible. So we'll use
  1. The three table model for coverage points, encoding hierarchy with a simple parent id model. Note that this doesn't necessarily fit well with SystemVerilog coverage constructs, as this schema allows only one cross per hierarchy container and each cross is the cross of all the dimensions declared in that container.
  2. The single column model for the goal and hit tables, sketched after this list. We have MD5 sums to warn that coverage declarations may have changed, and I'm not a fan of merging mismatching coverage; I'd rather spend the time making it easier and quicker to rerun a whole load of coverage simulations than try to fudge together mismatching data.
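A minimal sketch of the goal and hit tables under the single-column scheme (column names are illustrative; special goal values could mark the don't-care and illegal buckets mentioned earlier):

  CREATE TABLE goal (
    log_id    INTEGER NOT NULL,   -- master invocation that declared the coverage
    bucket_id INTEGER NOT NULL,   -- bucket index, unique across the whole coverage
    goal      INTEGER NOT NULL,   -- target hit count
    PRIMARY KEY (log_id, bucket_id)
  );

  CREATE TABLE hit (
    log_id    INTEGER NOT NULL,   -- invocation (or cached regression root) reporting the hits
    bucket_id INTEGER NOT NULL,
    hits      INTEGER NOT NULL,   -- buckets with zero hits are simply not written
    PRIMARY KEY (log_id, bucket_id)
  );
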
I personally am not frightened by table size. 1TB of coverage data is a massive amount, yet it still fits on a contemporary consumer grade SATA III SSD of modest cost (to an enterprise, if not to an individual). A consumer grade SSD will suffice because our schema is append only, so we are only ever writing more data. Even if the storage medium is triple-level-cell flash with very low write endurance, we are not continually deleting and rewriting to the SSD, just appending.
How much coverage data will be generated depends on the size of your coverage points and the number of tests you run with coverage enabled: more buckets and more tests mean more data. This can be partially mitigated by gradually turning off coverage points as they are hit; we can periodically check the coverage status and turn off any fully hit points. We can also move older coverage data to cheaper mass storage such as HDDs or, ultimately, delete the individual test coverage data once the cumulative coverage has been stored.
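A hedged sketch of that periodic check, in terms of the goal and hit tables above; it lists buckets that have reached their goal, and mapping those back to whole points is left to the reporting layer:

  -- Buckets that need no further stimulus
  SELECT g.bucket_id
    FROM goal g
    LEFT OUTER JOIN hit h ON h.bucket_id = g.bucket_id
    WHERE g.log_id = @master
      AND g.goal > 0              -- skip don't-care/illegal buckets
    GROUP BY g.bucket_id
    HAVING COALESCE(SUM(h.hits), 0) >= MAX(g.goal);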

Using MySQL

The example will again use sqlite due to its ultra simple database set up. Whilst I haven't done any performance evaluations, I strongly suspect that PostgreSQL, MySQL or one of the MySQL forks will perform substantially better than sqlite under higher loading with multiple clients; using sqlite for large regressions will most likely yield very poor performance.
If using a non bleeding edge MySQL version you have a choice of database engines, basically between InnoDB and MyISAM (or XtraDB and Aria, depending on the MySQL fork).
For this type of application the transactional features that InnoDB/XtraDB support are not required. If some data is lost it's not a big problem; it's probably the cause of the loss that is going to be the real problem. If the server loses power or crashes, that's bad and is going to cause issues regardless of the state of the tables when the server is rebooted. In fact we need to disable most of the transactional features in order to raise baseline performance to acceptable levels when using InnoDB/XtraDB. Also take note of the hardware requirements, especially regarding local disk. As mentioned above, consumer grade SSDs are viable in this space because the schema is append only and we are not continually overwriting existing data. But do remember that PCIe flash will yield the best results, is getting cheaper all the time, and is becoming available in a 2.5" form factor too. To justify the extra expense, simply integrate engineer waiting time multiplied by salary and it will become blindingly obvious that a price tag in the thousands of dollars is more than worthwhile.
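By way of illustration, these are the kind of durability trade-offs meant here for InnoDB/XtraDB; treat them as examples to evaluate rather than a tuning recipe:

  -- Flush the InnoDB redo log roughly once per second instead of at every commit
  SET GLOBAL innodb_flush_log_at_trx_commit = 2;
  -- Do not force the binary log to disk on every commit (if binary logging is enabled at all)
  SET GLOBAL sync_binlog = 0;
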
It is also interesting to note that the schema selected is append only. We don't update any of the coverage data once written, we only ever add new data to the end of the table. This negates any advantage that the row level locking InnoDB/XtraDB provides. This should also work to our advantage if using SSDs for database storage.
I would advocate using the MyISAM/Aria engine for this reason. What matters in this application is insert speed, and then query times. As long as the data is not frequently corrupted (in my experience never - but that's not to say it won't happen eventually) then we don't need the transactional and data integrity features of InnoDB/XtraDB.
Note that you don't have to use the same engine for all the tables, but I haven't experimented with this or generated any performance data.
If, on the other hand, you prefer to use a more recent version of MariaDB and are not frightened to custom install or build from source then MariaDB 10.0.6 has the option of using TokuDB. This claims improved performance over the traditional engines, notably insert times as it uses a different indexing algorithm. I have no experience of this engine as yet so am unable to comment further.
It is also possible to mix SSD and HDD. Disable binary logging if there is no replication, or write the binary log to HDD instead of SSD. Don't forget you'll still need a backup strategy, which will be complicated by having very large tables.
The first schema, which uses on-the-fly coverage aggregation, would be best served by tables using the MEMORY engine, as this is the fastest database engine provided there is sufficient memory to host the entire table.

Next

We will examine data IO and the database.

* Merging non congruent coverage sets could be implemented by first grouping individual test runs into congruent groups and then merging coverage groupwise. Overload the __add__ operator on the hierarchy class to merge only congruent coverpoints (or to take the newer goal value on coverpoints whose md5_goal differs), and insert non matching coverpoints so that both copies appear in the final hierarchy. The final step is to sum() the different groups to yield the merged coverage.
