Pushing Data In
This time we are collecting two different classes of instrumentation.
- Test status from generated messages as before, and
- Functional coverage from coverage-generating instrumentation.
For the coverpoint information we store three different types of data:
- The coverpoints themselves, in their hierarchy, with axis/dimension information including enumerations.
- The individual bucket goal information: the number of target hits expected for each bucket.
- The individual bucket hits from each simulation.
We use a class (database.insert.sql) to walk over the coverpoint hierarchy and create the table entries needed to recreate the hierarchy later. This data is normally very much smaller than the coverage bin data and only needs to be stored once, for the master invocation, which also stores the individual bucket goals.
The individual bucket data (goal or hit) is collected by iterating over each bucket in each coverage point and inserting these values into the relevant table. The code for this is contained in the database.insert class: the sql subclass handles the coverpoints, and the insert class itself handles the goal or hit data, differentiated by the reference parameter (true for goals, false for hits).
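A minimal sketch of the bucket insertion step, assuming the Python sqlite3 module as used in this example (the table and column names are hypothetical, not the actual schema used by database.insert):

```python
import sqlite3

def insert_buckets(db_path, log_id, buckets, goal=True):
    """Insert bucket goal or hit values for one invocation.

    buckets is an iterable of (bucket_id, value) pairs; the table and
    column names here are illustrative only.
    """
    table = "goals" if goal else "hits"
    with sqlite3.connect(db_path) as db:
        # Batch all rows into one executemany() call rather than issuing
        # a separate INSERT per bucket.
        db.executemany(
            "INSERT INTO %s (log_id, bucket_id, value) VALUES (?, ?, ?)" % table,
            [(log_id, bucket_id, value) for bucket_id, value in buckets],
        )
```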
In this example we are limited by the SQLite API and, as Python is being used, the sqlite3 module. With this we can only use INSERT, as there is no other method for bulk insert, but even here there is scope for multiple approaches delivering wildly different performance. We will try to keep the number of INSERTs to a minimum, but nothing beyond this.
Using MySQL
Multiple methods exist when using MySQL. Insert rates will vary wildly depending on the insert method and your choice of database engine and its configuration parameters.
- mysqlimport or LOAD DATA LOCAL INFILE - touted as being the fastest insert method. You don't need to dump to an actual file - you can use a pipe in Linux.
- SQL INSERT. It is best to use as few statements as possible, grouping multiple rows into a single statement (see the sketch below). Use INSERT DELAYED if concurrent coverage queries are running and your engine choice supports it.
- HandlerSocket. A NoSQL plugin that allows a very direct connection to the storage engine. Included by default in recent MariaDB (since 5.3) and Percona distributions. It may only work with a subset of engines, however.
- The storage engine makes a difference.
- MariaDB has lots of choices - TokuDB, an interesting new option in version 10, claims a high insert rate.
- MyISAM has a special lock-less append insert (concurrent_insert) if no DELETEs/UPDATEs have left holes in the table data.
- Make sure buffers (in this case write buffers) are configured correctly.
- innodb_buffer_pool_size, innodb_buffer_pool_instances and innodb_doublewrite for InnoDB
- bulk_insert_buffer_size for MyISAM
- Check the options for the engines you're using, and make sure the buffers for the engines you are not using are set to zero.
Unfortunately getting this right can take time and experience.
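The row-grouping advice above amounts to batching many values into one statement; a rough sketch using the MySQLdb (mysqlclient) module, with hypothetical table and column names, might look like this:

```python
import MySQLdb

def insert_hits(conn_args, rows):
    """Insert many bucket hit rows with as few statements as possible.

    rows is a list of (log_id, bucket_id, hits) tuples; the table and
    column names are illustrative only.
    """
    db = MySQLdb.connect(**conn_args)
    try:
        cursor = db.cursor()
        # executemany() folds simple "INSERT ... VALUES" statements into a
        # single multi-row INSERT, far faster than row-at-a-time inserts.
        cursor.executemany(
            "INSERT INTO hits (log_id, bucket_id, hits) VALUES (%s, %s, %s)",
            rows,
        )
        db.commit()
    finally:
        db.close()
```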
Pulling Data Out
We have two applications that query coverage data from the database:
- A web server
- A command line optimizer
For the web server we use a thin layer of Python around the SQL queries to generate the data in JSON format, suitable for further processing in JavaScript within a web browser. For example, the bkt class, which queries individual or aggregated bucket coverage, is just a single query that returns the data in a JSON structure. Indeed, the web application only generates JSON data dynamically; other formats (e.g. the JavaScript and the index HTML page) are served statically.
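A minimal sketch of that pattern (the query and schema here are hypothetical, not the actual bkt implementation):

```python
import json
import sqlite3

def bucket_coverage_json(db_path, log_id):
    """Return bucket coverage for one log as a JSON string.

    The schema used in the query is illustrative only; the real bkt class
    wraps its own query in much the same way.
    """
    db = sqlite3.connect(db_path)
    try:
        rows = db.execute(
            "SELECT bucket_id, goal, hits FROM coverage WHERE log_id = ?",
            (log_id,),
        ).fetchall()
    finally:
        db.close()
    return json.dumps({
        "log_id": log_id,
        "buckets": [{"bucket": b, "goal": g, "hits": h} for b, g, h in rows],
    })
```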
For the optimizer everything is done in Python. It takes arguments consisting of:
- regression ids
- log ids
- log ids in an XML format
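A hedged sketch of such a command-line interface (the flag names are assumptions, not the optimizer's actual options):

```python
import argparse

def parse_args(argv=None):
    """Parse optimizer arguments; the flag names here are hypothetical."""
    parser = argparse.ArgumentParser(description="coverage optimizer")
    parser.add_argument("--regression", type=int, action="append", default=[],
                        help="regression id(s) to optimize over")
    parser.add_argument("--log", type=int, action="append", default=[],
                        help="individual log id(s)")
    parser.add_argument("--xml", action="append", default=[],
                        help="file(s) containing log ids in an XML format")
    return parser.parse_args(argv)

# e.g. parse_args(["--regression", "42", "--log", "1001"])
```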
Both the web server and the optimizer use database.cvg to recreate the coverpoint hierarchy and annotate it with coverage data. The query in the hierarchy class method fetches all the coverpoint information, a groupby function is then used to find the coverpoints, and the recursive build function pieces the hierarchy back together. The single and cumulative classes are available to extract coverage for a single leaf test or a whole regression.
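As a rough sketch of that rebuild step (assuming a hypothetical flat row format, not the real database.cvg schema), the grouping and recursion could look like this:

```python
from itertools import groupby
from operator import itemgetter

def build_hierarchy(rows, parent_id=None):
    """Recursively rebuild a coverpoint tree from flat rows.

    rows is a list of dicts with hypothetical keys 'point_id', 'parent_id'
    and 'name', sorted by parent then point id; the real hierarchy method
    works along similar lines with groupby and a recursive build.
    """
    nodes = []
    # Group the rows belonging to each coverpoint under the current parent,
    # then recurse to attach that coverpoint's children.
    for point_id, grp in groupby(
            (r for r in rows if r["parent_id"] == parent_id),
            key=itemgetter("point_id")):
        first = next(grp)
        nodes.append({
            "id": point_id,
            "name": first["name"],
            "children": build_hierarchy(rows, parent_id=point_id),
        })
    return nodes
```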
Using MySQL
Configuration is also important to get good performance from queries, e.g. ensuring that buffers are correctly sized, but the first optimization should be to make use of the EXPLAIN command. It is one thing to get a query returning the correct data; the performance of that query is another matter. I am not advocating premature optimization, but it is vitally important to be able to use EXPLAIN and understand its output if you wish your coverage queries to complete in non-geological time. With large datasets it is imperative that the correct indices exist and are correctly used by queries, or data will not be returned in a reasonable time; a poorly constructed query can easily run 10-100x slower. Slow queries will become all too apparent when using the web interface, but you can also look at the slow query log. Either way, use EXPLAIN on these queries to ensure that the right indices are being used, so that results are returned as quickly as possible, reducing server load and user waiting times (which otherwise become a vicious circle as more client sessions are opened and more queries are started while engineers wait for the slow ones to finish).
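For example, a coverage query can be checked from Python along these lines (the query, table and index names are hypothetical; EXPLAIN itself is standard MySQL):

```python
import MySQLdb

def explain_query(conn_args, log_id):
    """Print MySQL's execution plan for a hypothetical coverage query."""
    db = MySQLdb.connect(**conn_args)
    try:
        cursor = db.cursor()
        cursor.execute(
            "EXPLAIN SELECT bucket_id, hits FROM hits WHERE log_id = %s",
            (log_id,),
        )
        # Inspect the 'key' and 'rows' columns; a NULL key or a huge row
        # estimate usually means a missing index, e.g.
        #   CREATE INDEX hits_log_idx ON hits (log_id);
        for row in cursor.fetchall():
            print(row)
    finally:
        db.close()
```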
Next Post
The next post looks at creating the data to push in.