Tuesday 17 December 2013

Using a Relational Database for Functional Coverage Collection Part Three

Pushing Data In

This time we are collecting two different classes of instrumentation data:
  1. Test status from generated messages as before, and
  2. Functional coverage from coverage generating instrumentation.
The mechanics of collecting the coverage from the DUV via some Python instrumentation are dealt with in the next post. But once the test has been executed and the coverage calculated, we need to store it in the database.
To store the coverpoint information we record three different types of data:
  1. The coverpoints themselves, in their hierarchy, with axis/dimension information including enumerations.
  2. The individual bucket goal information: the number of target hits expected for each bucket.
  3. The individual bucket hits from each simulation.
The schemas for these were described in the previous post.
We use a class (database.insert.sql) to walk over the coverpoint hierarchy and create the table entries needed to recreate the hierarchy later. This data is normally far smaller than the coverage bin data and only needs to be stored once, by the master invocation, which also stores the individual bucket goals.
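As a rough sketch of that walk (the point table, its columns and the node attributes name, axes and children are placeholders here, not the schema from the previous post), a simple recursion that records each node's parent is enough to preserve the hierarchy:

    import json

    def insert_point(cursor, node, parent_id=None):
        """Recursively store one coverpoint node so the tree can be rebuilt later."""
        cursor.execute(
            'INSERT INTO point (parent, name, axes) VALUES (?, ?, ?)',
            (parent_id, node.name, json.dumps(node.axes)))  # axes incl. enumerations
        point_id = cursor.lastrowid
        for child in node.children:
            insert_point(cursor, child, point_id)
        return point_id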
The individual bucket data (goal or hit) is collected by iterating over each bucket in each coverage point and inserting those values into the relevant table. The code for this is contained in the database.insert class: the subclass sql handles the coverpoints, while the insert class itself handles the goal or hit data, differentiated by the parameter reference (true for goals, false for hits).
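A minimal sketch of the goal/hit insert, assuming hypothetical table, column and coverpoint attribute names rather than the real database.insert code:

    import sqlite3

    def insert_buckets(db_file, log_id, points, reference=True):
        """Insert one row per bucket; reference selects goal (True) or hit (False) data."""
        table = 'goal' if reference else 'hit'
        rows = ((log_id, point.id, index, bucket.goal if reference else bucket.hits)
                for point in points
                for index, bucket in enumerate(point.buckets))
        with sqlite3.connect(db_file) as db:    # a single transaction for all rows
            db.executemany(
                'INSERT INTO ' + table +
                ' (log_id, point_id, bucket_id, value) VALUES (?, ?, ?, ?)',
                rows)

Pushing every row through a single executemany call inside one transaction keeps the number of statements and commits to a minimum.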

Using MySQL

In this example we are limited to the SQLite API and, as Python is being used, the sqlite3 module. With this we can only use INSERT, as there is no other mechanism for bulk inserts, but even here different approaches deliver wildly different performance. We will simply try to keep the number of INSERT statements to a minimum, as in the sketch above, but do nothing beyond that.
Multiple insert methods exist when using MySQL, and insert rates will vary wildly depending on the method chosen, your choice of database engine and its configuration parameters.
  • mysqlimport or LOAD DATA LOCAL INFILE - touted as the fastest insert method. You don't need to dump to an actual file - you can use a pipe on Linux.
  • SQL INSERT. It is best to use as few statements as possible, grouping multiple rows into a single statement (a sketch follows this list). Use INSERT DELAYED if concurrent coverage queries are running and your engine choice supports it.
  • HandlerSocket. A NoSQL plugin that allows a very direct connection to the storage engine. Included by default in recent MariaDB (since 5.3) and Percona distributions. It may only work with a subset of engines, however.
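As an illustration of the second option, here is a sketch that groups rows into multi-row INSERT statements; it assumes the MySQLdb module and a hypothetical hit table rather than anything from the actual schema:

    import MySQLdb

    def insert_hits(connect_args, rows, batch=1000):
        """Send bucket rows as a few multi-row INSERTs rather than one statement per row."""
        db = MySQLdb.connect(**connect_args)
        cursor = db.cursor()
        for start in range(0, len(rows), batch):
            chunk = rows[start:start + batch]
            placeholders = ', '.join(['(%s, %s, %s, %s)'] * len(chunk))
            values = [value for row in chunk for value in row]
            cursor.execute(
                'INSERT INTO hit (log_id, point_id, bucket_id, value) VALUES ' + placeholders,
                values)
        db.commit()          # one commit for the whole batch
        cursor.close()
        db.close()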
If you have a large number of client simulations attempting to insert coverage data you will need to tune your installation to prevent everything grinding to a halt as all the data is processed and written to disk. Whilst the capabilities of the underlying hardware are important, the biggest gains come from using the right insert method, as described above, and from configuring your MySQL server correctly, for example sizing the InnoDB buffer pool appropriately and relaxing flush-on-commit behaviour if the durability of each individual insert is not critical. Unfortunately getting this right can take time and experience.

Pulling Data Out

We have two applications that query coverage data from the database:
  • A web server
  • Command line optimizer
Here we will use SQL queries via the Python sqlite3 module again to get the data we require in a suitable format for further processing. 
For the web server we use a thin layer of Python around the SQL queries to generate the data in JSON format, suitable for further processing within a web browser in JavaScript. For example the bkt class, which queries individual or aggregated bucket coverage, is just a single query that returns the data in a JSON structure. Indeed the web application only generates JSON data dynamically; other formats (e.g. the JavaScript and the index HTML page) are served statically.
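The idea behind such a bucket query is sketched below; the table and column names are placeholders and the real bkt class will differ:

    import json
    import sqlite3

    def bucket_json(db_file, point_id, log_ids):
        """Return aggregated bucket hits for one coverpoint across a set of tests as JSON."""
        db = sqlite3.connect(db_file)
        marks = ', '.join('?' * len(log_ids))
        cursor = db.execute(
            'SELECT bucket_id, SUM(value) FROM hit '
            'WHERE point_id = ? AND log_id IN (' + marks + ') GROUP BY bucket_id',
            [point_id] + list(log_ids))
        result = json.dumps([{'bucket': bucket, 'hits': hits} for bucket, hits in cursor])
        db.close()
        return result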
For the optimizer everything is done in Python. It takes arguments consisting of
  • regression ids
  • log ids
  • log ids in an xml format
or any mixture thereof. It first collates a list of individual test log_ids and finds the coverage master and its MD5 sums for each individual test. Should multiple MD5 sums exist it will cowardly fail, as it cannot merge the coverage. Methods are provided to add tests one by one and record any increase in coverage. Subclasses allow the tests to be added in different orders: cvgOrderedOptimize, posOrderedOptimize and randOrderedOptimize. The result is a list of tests that achieves the headline coverage of all the tests and may be a subset of the input test list, with those tests which do not add any headline coverage filtered out. More details are contained in a following post.
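The core of such an optimizer is a simple greedy loop. In this sketch coverage_of stands in for the real per-test database query and is assumed to return the set of buckets a test hits:

    def optimize(log_ids, coverage_of):
        """Keep only those tests that add headline coverage, in the order given."""
        covered, keep = set(), []
        for log_id in log_ids:
            gain = coverage_of(log_id) - covered   # buckets this test newly covers
            if gain:
                covered |= gain
                keep.append(log_id)
        return keep

The ordering subclasses then differ only in how log_ids is sorted before this loop runs.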
Both the web server and the optimizer use database.cvg to recreate the coverpoint hierarchy and annotate coverage data. The query in the hierarchy class method fetches all the coverpoint information, a groupby function then finds the coverpoints, and the recursive build function pieces the hierarchy back together. The single and cumulative classes are available to extract coverage from a single leaf test or a whole regression.
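A sketch of that rebuild, assuming flat (point_id, parent_id, name) rows with a None parent for the root rather than the actual database.cvg row layout:

    from itertools import groupby

    def rebuild(rows):
        """Piece the coverpoint hierarchy back together from flat query rows."""
        rows = sorted(rows, key=lambda row: (row[1] is not None, row[1]))
        children = {parent: list(group)
                    for parent, group in groupby(rows, key=lambda row: row[1])}

        def build(parent_id):
            return {name: build(point_id)
                    for point_id, _, name in children.get(parent_id, [])}

        return build(None)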

Using MySQL

Configuration is also important for good query performance, e.g. ensuring that buffers are correctly sized. But the first optimization should be to make use of the EXPLAIN command. It is one thing to get a query returning the correct data; the performance of that query is another matter. I am not advocating premature optimization, but it is vitally important to be able to use EXPLAIN and understand its output if you wish your coverage queries to complete in non-geological time. With large datasets it is imperative that the correct indices exist and are actually used by queries; a poorly constructed query can easily be 10-100x slower. Slow queries will become all too apparent when using the web interface, but you can also look at the slow query log. Either way, use EXPLAIN on these queries to ensure that the right indices are being used, so that results are returned as quickly as possible, reducing server load and user waiting times (which otherwise become a vicious circle as more client sessions are opened and more queries started while engineers wait for the slow ones to finish).
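A minimal way to check a suspect query from Python, assuming the MySQLdb module and illustrative table, column and index names:

    import MySQLdb

    db = MySQLdb.connect(db='coverage')   # connection details are site specific
    cursor = db.cursor()
    cursor.execute(
        'EXPLAIN SELECT bucket_id, SUM(value) FROM hit '
        'WHERE log_id = %s GROUP BY bucket_id', (42,))
    for row in cursor.fetchall():
        print(row)   # the key column should name an index, not indicate a full table scan
    # If no suitable index is reported, adding one along the lines of
    #   CREATE INDEX hit_log ON hit (log_id, bucket_id);
    # is usually what turns a crawling query into a fast one.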

Next Post

The next post looks at creating the data to push in.
