emGee Software Solutions Custom Database Applications


Web Technologies

Planes, Trains, and Automobiles: MySQL XA Transactions

Planet MySQL - Wed, 05/16/2018 - 14:43

MySQL, coupled with the InnoDB engine, provides full ACID compliance. With the addition of “eXtended Architecture” (XA) support, it can also take part in distributed atomic transactions.

In distributed systems, several different technologies, platforms, and devices may need to act on separate sets of data atomically. These backend datasets can sit side by side in a datacenter or be separated by oceans. Atomicity can be a necessity across multiple datasets, even when foreign engines completely control them with no concept of cross-communication. The XA standard, specified by the Open Group, describes distributed transaction processing: this “eXtended Architecture” defines the relationship between a transaction manager and a transaction resource. You can find the specification on the Open Group website.

Transactions are:

  • ATOMIC, committed as a whole
  • CONSISTENT, the state of all data is either valid or rolled back to the last valid state prior to a failed transaction
  • ISOLATED, uncommitted transactions are isolated from other transactions*
  • DURABLE, once committed, transactions stay committed even in the event of a system failure

* The transaction isolation level is configurable, allowing reads of various states (dirty, phantom) of uncommitted data, if necessary.

Working with database objects, transactions are committed as a whole (ATOMIC) or not committed at all. This is achieved with the two-phase commit (2PC) protocol. In 2PC, all processes involved in a transaction are first polled for a consistent state, ready to commit (phase 1), followed by the commit phase, where all processes commit the changes. If the vote during the polling phase is not unanimous, the entire transaction is rolled back. Sometimes, though, the native transaction atomicity in the InnoDB engine is not enough.
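
As an illustration only (this is a toy model, not MySQL's actual implementation), the 2PC flow can be sketched as a coordinator that polls every participant for a unanimous vote before committing:

```python
class Participant:
    """One resource manager in the distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1 vote: a real resource manager would durably log its
        # prepared state here before answering.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    # Phase 1: poll every participant; the vote must be unanimous.
    if all(p.prepare() for p in participants):
        # Phase 2: all voted "ready to commit", so the whole transaction commits.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote rolls the entire transaction back.
    for p in participants:
        p.rollback()
    return "rolled_back"
```

A single participant answering "no" in phase 1 is enough to roll back every other participant, which is exactly the all-or-nothing property described above.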

The most common example of a distributed transaction is a banking transaction. Let us imagine you go to the local grocery store and pick up a few items. At the register, you pay with a debit card and request some money back. There are several transactions going on here at the same time. From the moment the request for payment has started, despite all the different pieces involved, the entire transaction must be completed as a whole. To maintain a consistent state, the transaction manager assigns an XID that is distributed along with each individual request. The XID is a unique identifier that is persistent for each meta transaction.

Let’s say, for example, the bill for your groceries was $288.50, and you want $50.00 back. There are several separate transactions going on here, but to simplify, there are three main transactions going on:

  1. The withdrawal of $338.50 from your bank account
  2. The withdrawal of $50.00 from the cash register account
  3. The deposit of $338.50 to the master store account
Paper, plastic or XA transaction?


These transactions must be done atomically. The transaction manager handles the requests and provides an XID; each resource manager starts a transaction with that XID. The resource managers, in turn, perform the necessary DML. Each resource manager ends the transaction and prepares it for commit, and at this point, those changes are not finalized. But the data is locked, solely writable by this transaction. During this time the transaction manager is polling the resource managers for the XID in prepared state. When the transaction manager receives confirmation from all resource managers that they are prepared, it sends the instruction to commit.

This is a very simplified description of the transaction chain, but it is enough to illustrate the individual nature of the transactions. We have three separate “banks” here: your bank, the physical cash register, and the store bank. A separate transaction is prepared on each of the disparate accounts: decrementing $338.50 from your bank (if available, of course), decrementing $50 from the cash register's local balance, and incrementing the store bank account by $338.50. The transaction manager then polls the transaction resources for status and, if they are all in agreement that they can complete (PREPARED), they execute as a whole, creating an atomic transaction across the many resources.

Each of these resources may be running a different backend resource. For example, we can imagine your bank is running MySQL community, the cash register is backed by PostgreSQL and the store bank is running Percona Server for MySQL. These completely separate technologies act in concert as directed by the XA transaction manager.

The transaction manager follows a distinct path with each resource:

  1. Transaction Manager (TM) instructs the Resource Manager (RM) to XA START my_XA_transaction
  2. RM executes DML1; DML2; … DMLN; XA END my_XA_transaction; XA PREPARE my_XA_transaction*
  3. TM polls RM for a positive response on PREPARE
  4. When all RMs are in prepared state, the TM instructs the RMs to XA COMMIT/ROLLBACK

*Transactions can be “ONE PHASE”: they PREPARE and COMMIT in a single step when only one resource manager is involved
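
As a sketch, that per-resource path can be expressed as a small generator of the statement list a transaction manager would send each resource manager. The XA statements follow MySQL's syntax; the function name, xid formatting, and DML strings are illustrative, not a real connector:

```python
# Build the ordered list of statements the TM sends one RM for one XA
# transaction. With one_phase=True, PREPARE and COMMIT collapse into a
# single "XA COMMIT ... ONE PHASE" step.

def xa_statement_sequence(xid, dml_statements, commit=True, one_phase=False):
    stmts = [f"XA START {xid}"]
    stmts.extend(dml_statements)                 # the RM's actual work
    stmts.append(f"XA END {xid}")
    if one_phase:
        stmts.append(f"XA COMMIT {xid} ONE PHASE")
    else:
        stmts.append(f"XA PREPARE {xid}")
        # ...the TM polls every RM here, then issues the final verdict...
        stmts.append(f"XA COMMIT {xid}" if commit else f"XA ROLLBACK {xid}")
    return stmts
```

The gap between the PREPARE and the final COMMIT/ROLLBACK is where the polling happens, and, as discussed below, where a transaction can get stranded.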

Another example of the XA protocol we might be familiar with is a travel agency. A travel agent may be reserving travel accommodations, lodging, and local transportation as a single transaction in the view of the consumer. Behind the scenes, several disconnected transactions are being executed across several different client resources.

XA transactions book planes, trains and automobiles at the same time.

Working Directly with XA Transactions

MySQL enables XA transactions by default. However, they are only supported by the InnoDB engine. XA transactions are, by design, optimistic: the manager assumes that if a transaction can prepare completely, it can commit completely. This is not always the case.

In MySQL, a transaction may report back to the manager that it is prepared (that is, the transaction can be completed logically and within the data integrity rules enforced by the resource), but may not be able to actually complete that operation due to block corruption, power loss, or any of a host of reasons a server may halt. The rules of XA transactions do not require a resource to report back success on commit. This can manifest as a database locking incident if a resource manager outside of the MySQL daemon does not successfully prepare its own transaction: the thread will wait, with potentially many rows locked, for a commit or rollback instruction that never arrives. While rare, I have seen instances where this occurs and XA transactions are left in the “recover” state.

Recover is not the most apt term for the state of the transaction. Recovery implies that something was broken and has now been fixed. That may be the case: a resource manager may have failed, come back online, and a transaction now needs to be recovered. But every XA transaction that has been prepared, yet not committed, will show up when XA RECOVER is executed. A more descriptive term might be “uncommitted”, where one possible reason for the missing commit is a recovered transaction. In any case, transactions that do not clear have definite locking implications and can cause total application failure if proper error handling at the transaction management layer is not in place.

Take the following example. First, we will start an XA transaction:

mysql> XA START X'1A2B3C4D5E6F',X'F6E5D4C3B2A1',12345;
Query OK, 0 rows affected (0.00 sec)

And perform locking DML:

mysql> SELECT * FROM t WHERE id=1 FOR UPDATE;
+----+----------+-----------+------+------+---------------------+
| id | key_name | key_value | foo  | bar  | deleted_on          |
+----+----------+-----------+------+------+---------------------+
|  1 | bla      | bar       | test | NULL | 0000-00-00 00:00:00 |
+----+----------+-----------+------+------+---------------------+
1 row in set (0.00 sec)

mysql> UPDATE t SET bar='1' WHERE id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

With our local work “done”, we end and prepare the transaction:

mysql> XA END X'1A2B3C4D5E6F',X'F6E5D4C3B2A1',12345;
Query OK, 0 rows affected (0.00 sec)

mysql> XA PREPARE X'1A2B3C4D5E6F',X'F6E5D4C3B2A1',12345;
Query OK, 0 rows affected (0.01 sec)

At this point, in theory, the transaction manager would poll all transaction resources and when they all respond that they are prepared, the order commit is given. However, if one of the resource managers for any reason does not report back the transaction on the MySQL server must wait. This, of course, blocks other transactions from acquiring row locks. For example, another thread tries to update the same row and times out waiting:

mysql> UPDATE t SET bar=2 WHERE id=1;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

And we can see the transaction in the InnoDB status holding the lock:

------------
TRANSACTIONS
------------
Trx id counter 164642
Purge done for trx's n:o < 0 undo n:o < 0 state: running but idle
History list length 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479603788472, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 164640, ACTIVE (PREPARED) 1056 sec
2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
MySQL thread id 4, OS thread handle 123145350483968, query id 219 localhost root

In order to see what XA transactions have been prepared but not committed, we can execute the recover command:

mysql> XA RECOVER;
+----------+--------------+--------------+--------------+
| formatID | gtrid_length | bqual_length | data         |
+----------+--------------+--------------+--------------+
|    12345 |            6 |            6 | +<M^o???ò?   |
+----------+--------------+--------------+--------------+
1 row in set (0.00 sec)

This output is not very useful: the “data” column may contain unprintable characters and cannot be used as-is to act on the transaction. Luckily, we can convert the XID to a hexstring:

mysql> XA RECOVER CONVERT XID;
+----------+--------------+--------------+----------------------------+
| formatID | gtrid_length | bqual_length | data                       |
+----------+--------------+--------------+----------------------------+
|    12345 |            6 |            6 | 0x1A2B3C4D5E6FF6E5D4C3B2A1 |
+----------+--------------+--------------+----------------------------+
1 row in set (0.00 sec)

That is better, but still not fully usable. We still need to parse the value of the data column for the actual gtrid and bqual values of the XID. The gtrid_length and bqual_length fields tell us where to start and stop. First, we strip the leading 0x that denotes a HEX value. Then we take gtrid_length bytes, followed by bqual_length bytes, and separate out those values.

I started this transaction denoting the XID in hex, so the values were converted (i.e., 1A is a single byte). Armed with this knowledge, we know that the gtrid is “1A2B3C4D5E6F”, and the bqual is “F6E5D4C3B2A1”. Now we can take appropriate action on the stranded transaction. When confronted with this, it is necessary to investigate all transaction managers in order to determine if the transaction should be committed, or rolled back. The transactions are logged in the binary log with the XID in hexstring format*. We can use mysqlbinlog to find a specific XA START command and inspect the transactions:
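
That parsing step can be sketched in a few lines of Python (the function name is mine; the input values come straight from the XA RECOVER CONVERT XID output above):

```python
# Split the "data" column of XA RECOVER CONVERT XID into gtrid and bqual.
# The reported lengths are in bytes, and each byte is two hex characters
# once the leading 0x is stripped.

def parse_xid(data, gtrid_length, bqual_length):
    hexstr = data[2:] if data.startswith("0x") else data
    gtrid = hexstr[: gtrid_length * 2]
    bqual = hexstr[gtrid_length * 2 : (gtrid_length + bqual_length) * 2]
    return gtrid, bqual

gtrid, bqual = parse_xid("0x1A2B3C4D5E6FF6E5D4C3B2A1", 6, 6)
# gtrid == "1A2B3C4D5E6F", bqual == "F6E5D4C3B2A1"
```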

SET TIMESTAMP=1525981411/*!*/;
SET @@session.pseudo_thread_id=123456/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1080033280/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=1,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
XA START X'1a2b3c4d5e6f',X'f6e5d4c3b2a1',12345
/*!*/;
# at 511
#180510 19:43:31 server id 1  end_log_pos 597 CRC32 0xd0489605  Table_map: `test`.`t` mapped to number 9
# at 597
#180510 19:43:31 server id 1  end_log_pos 707 CRC32 0x5a2b80b0  Update_rows: table id 9 flags: STMT_END_F
### UPDATE `test`.`t`
### WHERE
###   @1=1
###   @2='blah'
###   @3='bar'
###   @4='test'
###   @5=NULL
###   @6='0000-00-00 00:00:00'
### SET
###   @1=1
###   @2='blah'
###   @3='bar'
###   @4='test'
###   @5=1
###   @6='0000-00-00 00:00:00'
SET TIMESTAMP=1525981411/*!*/;
XA END X'1a2b3c4d5e6f',X'f6e5d4c3b2a1',12345
/*!*/;
# at 11473
#180510 19:43:31 server id 171379273  end_log_pos 11607 CRC32 0x9770e9d1  XA PREPARE X'1a2b3c4d5e6f',X'f6e5d4c3b2a1',12345

*Hexstrings are case insensitive; when parsing a binary log for the hex string, make sure to ignore case in string searches

When we have made the determination we can either commit the transaction or roll it back*:

mysql> XA COMMIT X'1A2B3C4D5E6F',X'F6E5D4C3B2A1',12345;
Query OK, 0 rows affected (0.00 sec)

*the alternative is XA ROLLBACK XID

Which frees up the object for other transactions:

mysql> UPDATE t SET bar=2 WHERE id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Alternatively, I could have started the transaction denoting the XID in quoted string form:

mysql> XA START 'xa_transaction_two';
Query OK, 0 rows affected (0.00 sec)

mysql> UPDATE t SET deleted_on=now();
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3  Changed: 3  Warnings: 0

mysql> XA END 'xa_transaction_two';
Query OK, 0 rows affected (0.00 sec)

mysql> XA PREPARE 'xa_transaction_two';
Query OK, 0 rows affected (0.00 sec)

mysql> XA RECOVER;
+----------+--------------+--------------+--------------------+
| formatID | gtrid_length | bqual_length | data               |
+----------+--------------+--------------+--------------------+
|        1 |           18 |            0 | xa_transaction_two |
+----------+--------------+--------------+--------------------+
1 row in set (0.00 sec)

mysql> XA RECOVER CONVERT XID;
+----------+--------------+--------------+----------------------------------------+
| formatID | gtrid_length | bqual_length | data                                   |
+----------+--------------+--------------+----------------------------------------+
|        1 |           18 |            0 | 0x78615F7472616E73616374696F6E5F74776F |
+----------+--------------+--------------+----------------------------------------+
1 row in set (0.00 sec)

mysql> XA COMMIT X'78615F7472616E73616374696F6E5F74776F';
Query OK, 0 rows affected (0.00 sec)

mysql> XA RECOVER;
Empty set (0.00 sec)

mysql> SELECT * FROM t;
+----+----------+-----------+------+------+---------------------+
| id | key_name | key_value | foo  | bar  | deleted_on          |
+----+----------+-----------+------+------+---------------------+
|  1 | bla      | bar       | test |    2 | 2018-05-10 13:56:02 |
|  2 | test     | foo       | test |    0 | 2018-05-10 13:56:02 |
|  3 | foo      | bar       | foo  |    1 | 2018-05-10 13:56:02 |
+----+----------+-----------+------+------+---------------------+
3 rows in set (0.00 sec)


Prior to MySQL 5.7.7, XA replication was not entirely safe. A crash could result in a transaction that could be recovered by issuing the appropriate XA COMMIT, while the corresponding binary log entry did not exist. We have worked with a client doing distributed financial transactions utilizing XA and traditional replication on older versions. Occasionally, transactions were lost on slaves, and pt-table-checksum and pt-table-sync were necessary to ensure consistency. This was not best practice, but this topology supported more than 15,000 TPS for years with just a handful of failures! More info on the replication issue prior to 5.7.7 was detailed by Yves.

Note that Percona XtraDB Cluster does not support XA.


When you need to tie disparate data sources together for distributed atomic transactions, the support is built into MySQL and the InnoDB engine by default. Make sure the engineers responsible for the database are aware of XA transactions and know how to debug them. And always be certain to keep MySQL up to date.

Find the full documentation in the reference manual.

The post Planes, Trains, and Automobiles: MySQL XA Transactions appeared first on Percona Database Performance Blog.

Categories: Web Technologies

Responsive tables, revisited

CSS-Tricks - Wed, 05/16/2018 - 11:21

Lea Verou with some extra super fancy CSS trickery. No way we could miss linking this one up!

One of the problems with responsive table solutions, at least the ones where you are using CSS to rejigger things, is that there is duplicated content somewhere. Either in CSS or HTML.

Lea finds two ways to prevent that. One of which uses text-shadow to "duplicate" a copy of the text and move it into place. Another uses the Firefox-only element() function.

Another thing to remember: if you're forcing table elements to anything other than their natural display value (like block), make sure to use ARIA roles so they don't lose their accessibility usefulness.


The post Responsive tables, revisited appeared first on CSS-Tricks.


The backdrop-filter CSS property

CSS-Tricks - Wed, 05/16/2018 - 07:07

I had never heard of the backdrop-filter property until yesterday, but after a couple of hours messing around with it I’m positive that it’s nothing more than magic. That’s because it applies filters (like changing the hue, contrast, or blur) to the background of an element without changing the text or other elements inside it.

Take this example where I’ve replicated the iOS notification style: see how the background of each of these boxes is blurred but the text isn’t?

That’s only a single line of CSS to create that faded background effect, just like this:

.notification { backdrop-filter: blur(3px); }

Now it’s worth noting that browser support for this CSS property isn’t particularly widespread just yet (see below). But we’ve been trying to do this sort of filtering stuff for a really long time, so it’s great to see that progress is being made here. Chris wrote about the “frosted-glass” technique in 2014, and way back then you had to use a bunch of weird hacks and extra images to replicate the effect. Now we can write a lot less code to get the same effect!

We also get to pick from a lot more filters than just that frosted glass style. The following demo showcases all of the backdrop-filter values and how they change the background:

Each of these boxes is just a separate div where I’ve applied a different backdrop-filter. And that’s it! I sort of can’t believe how simple this is, actually.

Of course you can chain them together like so:

.element { backdrop-filter: blur(5px) contrast(.8); }

And this could make all sorts of elaborate and complex designs, especially if you start combining them together with animations.

But wait, why do we even need this property? Well, after reading up a little it seems that the go-to default example is a modal of some description. That’s what Guillermo Esteves was experimenting with back in 2015:

See the Pen PwRPZa by Guillermo Esteves (@gesteves) on CodePen.

I reckon we can do something much weirder and more beautiful if we put our minds to it.

A note about browser support

The backdrop-filter property is not well supported at the time of this writing. And even in Safari where it is supported, you’ll still need to prefix it. There’s no support for Firefox at all. But, really, do websites need to look exactly the same in every browser?

This browser support data is from Caniuse, which has more detail. A number indicates that browser supports the feature at that version and up.

Desktop: Chrome No · Opera No · Firefox No · IE No · Edge 17 · Safari 9*

Mobile / Tablet: iOS Safari 9.0-9.2* · Opera Mobile No · Opera Mini No · Android No · Android Chrome No · Android Firefox No

Further reading

The post The backdrop-filter CSS property appeared first on CSS-Tricks.


A Strategy Guide To CSS Custom Properties

CSS-Tricks - Wed, 05/16/2018 - 07:05

CSS preprocessor variables and CSS custom properties (often referred to as "CSS variables") can do some of the same things, but are not the same.

Practical advice from Mike Riethmuller:

If it is alright to use static variables inside components, when should we use custom properties? Converting existing preprocessor variables to custom properties usually makes little sense. After all, the reason for custom properties is completely different. Custom properties make sense when we have CSS properties that change relative to a condition in the DOM — especially a dynamic condition such as :focus, :hover, media queries or with JavaScript.


The post A Strategy Guide To CSS Custom Properties appeared first on CSS-Tricks.


MySQL on Fedora 27

Planet MySQL - Wed, 05/16/2018 - 01:13

While updating my class image to Fedora 27, I noticed that it installed the Akonadi Server. The documentation on the Akonadi server was not very straightforward, and it bundled a set of software that limited how I could approach MySQL development.

So, I removed all those packages with the following syntax:

dnf remove `rpm -qa | grep akonadi`

After removing those Akonadi packages, I installed the MySQL Community Edition from the Fedora repo with this syntax:

yum install -y community-mysql*

Having installed MySQL Community Edition, I started the service with this command:

sudo service mysqld start

Then, I ran the mysql_secure_installation script to secure the installation.

The script sets the root user’s password, removes the anonymous user, disallows remote root login, and removes the test databases. Then, I verified connecting to the MySQL database with the following syntax:

mysql -uroot -ppassword

I enabled the MySQL Service to start with each reboot of the Fedora instance. I used the following command:

systemctl enable mysqld.service

It creates the following link:

ln -s '/usr/lib/systemd/system/mysqld.service' '/etc/systemd/system/multi-user.target.wants/mysqld.service'

Now, I need to install and configure Apache, PHP, and upgrade Oracle Database 11g XE’s APEX 4 to 5.

Categories: Web Technologies

MariaDB to Hadoop in Spanish

Planet MySQL - Wed, 05/16/2018 - 01:12

Nicolas Tobias has written an awesome guide to setting up replication from MariaDB to Hadoop/HDFS using Tungsten Replicator, in Spanish! He’s planning more of these so if you like what you see, please let him know!

Holy Week, and here I am with new battles to tell.
There I was at work, thinking about how to invest the calm that comes with those vacation days we can freely choose to work, and I thought: wouldn’t it be good to finish that synchronization between the MariaDB and Hive servers?

I had already looked for some info on the subject back in January and even had a PoC built on a few VMs, which I powered back on, but everything had rotted: it wouldn’t start, it didn’t work, I couldn’t even remember how I had done it, and the shell history was gibberish. I decided that if I redid it all from scratch I could write it down in a playbook and, on top of that, learn it and automate it to the point of being able to deploy it automatically with Ansible.

via De MariaDB a HDFS: Usando Continuent Tungsten. Parte 1 | run.levelcin.co


MySQL 8.0 InnoDB Cluster and Persistent Configurations

Planet MySQL - Tue, 05/15/2018 - 18:06
There are lots of new features in MySQL 8.0, which was recently released, including our Document Store capability.  There are a few core capabilities related to InnoDB Cluster in 8.0 that I’d like to share, but some will arrive in coming blogs.  Primarily, here I’ll point out some nice details with InnoDB Cluster 8.0… Read More »


designsystems.com

CSS-Tricks - Tue, 05/15/2018 - 14:25

The team at Figma has created a new resource for “learning, creating and evangelizing design systems” called Design Systems that already has a good collection of interviews and articles by some folks thinking about these things.

I particularly liked Jeroen Ransijn’s post on how to convince your company it’s ready for a design system, where he writes:

Building a design system is not about reaching a single point in time. It’s an ongoing process of learning, building, evangelizing and driving adoption in your organization.

Design systems are a popular topic. Ethan Marcotte recently looked at instances where patterns get weird, Lucan Lemonnier shared a process for creating a consistent design system in Sketch, and Brad Frost debunked the perception that design systems are rigid. Seems like Figma's new site will be a nice curated repository of this ongoing discussion.


The post designsystems.com appeared first on CSS-Tricks.


About ZFS Performance

Planet MySQL - Tue, 05/15/2018 - 11:59

If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.

To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.

I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.

In both cases, the load is IO bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset has been created is extremely favorable to XFS since there is absolutely no file fragmentation. Once you have all the files opened, a read IOP is just a single fseek call to an offset and ZFS doesn’t need to access any intermediate inode. The above result is about as fair as saying MyISAM is faster than InnoDB based only on table scan performance results of unfragmented tables and default configuration. ZFS is much less affected by the file level fragmentation, especially for point access type.

More on ZFS metadata

ZFS stores files in B-trees in a fashion very similar to how InnoDB stores data. To access a piece of data in a B-tree, you need to access the top-level page (often called the root node) and then one block per level down to the leaf node containing the data. With no cache, reading something from a three-level B-tree thus requires 3 IOPS.

A simple three-level B-tree

The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files’ B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 3GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.

To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.

You’ll often read or hear the ratio of 1GB of ARC for 1TB of data, which is about the same 0.1% ratio as for InnoDB. I wrote about that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than one with a recordsize of 16KB because it has 8x fewer leaf pages. Fewer leaf pages require fewer B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access as it maximizes compression and reduces the IOPS, but it is poor for small random access operations like the ones MySQL/InnoDB does.
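
That dependence on recordsize can be sanity-checked with a quick back-of-the-envelope formula (a sketch; it assumes the B-tree fanout of 128 and 16KB in-ARC non-leaf pages discussed later in the post):

```python
# Metadata-to-data ratio: each leaf page carries roughly 1/127 of a
# non-leaf ancestor (geometric series 1/128 + 1/128^2 + ... = 1/127),
# and each non-leaf page occupies 16KB uncompressed in the ARC.

def metadata_ratio(recordsize, fanout=128, nonleaf_page=16 * 1024):
    return nonleaf_page / (fanout - 1) / recordsize

# recordsize 128KB -> ~0.1% of the dataset size
# recordsize  16KB -> ~0.8% of the dataset size
```

The 8x smaller recordsize does indeed translate into roughly 8x more metadata to cache.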

To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:

# echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
# arcstat -f time,arcsz,mm%,mhit,mread,dread,pread 10
    time  arcsz  mm%  mhit  mread  dread  pread
10:22:49   105M    0     0      0      0      0
10:22:59   113M  100     0     22     73      0
10:23:09   120M  100     0     20     68      0
10:23:19   127M  100     0     20     65      0
10:23:29   135M  100     0     22     74      0

You’ll want the ‘mm%’, the metadata missed percent, to reach 0. So when the ‘arcsz’ column is no longer growing and you still have high values for ‘mm%’, that means the ARC is too small. Increase the value of ‘zfs_arc_max’ and continue to monitor.

If the 1GB of ARC for 1TB of data ratio is good for large ZFS recordsize, it is likely too small for a recordsize of 16KB. Does 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Although likely, let’s verify.

The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file. The tool needs the inode of a file and the ZFS filesystem as inputs. Here’s an invocation for one of the tables of my dataset:

# cd /var/lib/mysql/data/sbtest
# ls -li | grep sbtest1.ibd
36493 -rw-r----- 1 mysql mysql 2441084928 avr 15 15:28 sbtest1.ibd
# zdb -ddddd mysqldata/data 36493 > zdb5d.out
# more zdb5d.out
Dataset mysqldata/data [ZPL], ID 90, cr_txg 168747, 4.45G, 26487 objects, rootbp DVA[0]=<0:1a50452800:200> DVA[1]=<0:5b289c1600:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=3004977L/3004977P fill=26487 cksum=13723d4400:5d1f47fb738:fbfb87e6e278:1f30c12b7fa1d1

    Object  lvl  iblk  dblk  dsize  lsize  %full  type
     36493    4   16K   16K  1.75G  2.27G  97.62  ZFS plain file
                                      168  bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED
    dnode maxblkid: 148991
    path    /var/lib/mysql/data/sbtest/sbtest1.ibd
    uid     103
    gid     106
    atime   Sun Apr 15 15:04:13 2018
    mtime   Sun Apr 15 15:28:45 2018
    ctime   Sun Apr 15 15:28:45 2018
    crtime  Sun Apr 15 15:04:13 2018
    gen     3004484
    mode    100640
    size    2441084928
    parent  36480
    links   1
    pflags  40800000004
Indirect blocks:
               0 L3    0:1a4ea58800:400 4000L/400P F=145446 B=3004774/3004774
               0  L2   0:1c83454c00:1800 4000L/1800P F=16384 B=3004773/3004773
               0   L1  0:1eaa626400:1600 4000L/1600P F=128 B=3004773/3004773
               0    L0 0:1c6926ec00:c00 4000L/c00P F=1 B=3004773/3004773
            4000    L0 EMBEDDED et=0 4000L/6bP B=3004484
            8000    L0 0:1c69270c00:400 4000L/400P F=1 B=3004773/3004773
            c000    L0 0:1c7fbae400:800 4000L/800P F=1 B=3004736/3004736
           10000    L0 0:1ce3f53600:3200 4000L/3200P F=1 B=3004484/3004484
           14000    L0 0:1ce3f56800:3200 4000L/3200P F=1 B=3004484/3004484
           18000    L0 0:18176fa600:3200 4000L/3200P F=1 B=3004485/3004485
           1c000    L0 0:18176fd800:3200 4000L/3200P F=1 B=3004485/3004485
...
[more than 140k lines truncated]

The last section of the above output is very interesting as it shows the B-tree pages. The ZFS B-tree of the file sbtest1.ibd has four levels. L3 is the root page, L2 pages are the first level (from the top), L1 are the second-level pages, and L0 are the leaf pages. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.

The non-leaf page size is always 16KB (4000L) and, judging from the zdb output above, these pages are always compressed on disk with lz4. In the ARC, they are stored uncompressed, so each uses 16KB of memory. The fanout of a ZFS B-tree, that is the largest possible ratio of page counts between adjacent levels, is 128. With the above output, we can easily calculate the amount of metadata we would need to cache all the non-leaf pages in the ARC.

# grep -c L3 zdb5d.out
1
# grep -c L2 zdb5d.out
9
# grep -c L1 zdb5d.out
1150
# grep -c L0 zdb5d.out
145447

So, each of the 330 tables of the dataset has 1160 non-leaf pages and 145447 leaf pages; a ratio very close to the prediction of 0.8%. For the complete dataset of 749GB, we would need the ARC to be, at a minimum, 6GB to fully cache all the metadata pages. Of course, there is some overhead to add. In my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum for the ARC size I should have used is almost 7GB.
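The arithmetic above is easy to sketch; the 749GB dataset size, the 0.8% metadata ratio, and the ~15% ARC overhead are the figures from the text, and the little helper itself is only illustrative:

```shell
# Estimate the minimum ARC size needed to cache all non-leaf (metadata) pages:
# 0.8% of the uncompressed dataset for a 16KB recordsize, plus ~15% ARC overhead.
dataset_gb=749
min_arc_gb=$(awk -v d="$dataset_gb" 'BEGIN { printf "%.1f", d * 0.008 * 1.15 }')
echo "minimum ARC size: ${min_arc_gb} GB"   # prints: minimum ARC size: 6.9 GB
```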

Of course, an ARC of 7GB on a server with 15GB of RAM is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular InnoDB page size of 16KB, if you use a page size of 32KB with a matching ZFS recordsize, you cut the ARC size requirement by half, to 0.4% of the uncompressed size.

Similarly, an InnoDB page size of 64KB with a matching ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I'll blog more about the use of larger InnoDB pages with ZFS in the near future. If the use of larger InnoDB page sizes is not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.
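As a rough sketch of that first option (the dataset name mysqldata/data comes from the zdb output earlier; adjust it and the my.cnf path to your setup, and note that innodb_page_size can only be chosen when initializing a new MySQL data directory, not changed on an existing instance):

```shell
# Match the ZFS recordsize to a larger InnoDB page size (32KB here).
# mysqldata/data is the dataset name used throughout this post.
zfs set recordsize=32k mysqldata/data

# innodb_page_size must be set before the data directory is initialized.
cat >> /etc/my.cnf <<'EOF'
[mysqld]
innodb_page_size = 32k
EOF
```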

So, let’s propose a new rule of thumb for the required ARC/L2ARC size for a given dataset:

  • Recordsize of 128KB => 0.1% of the uncompressed dataset size
  • Recordsize of 64KB => 0.2% of the uncompressed dataset size
  • Recordsize of 32KB => 0.4% of the uncompressed dataset size
  • Recordsize of 16KB => 0.8% of the uncompressed dataset size
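The rule of thumb above can be wrapped in a small helper; the 12.8/recordsize ratio simply encodes the table (0.1% at 128KB doubling up to 0.8% at 16KB), and the function is a sketch rather than anything official:

```shell
# Suggested ARC/L2ARC size (in GB) for metadata, per the rule of thumb above:
# 128KB => 0.1%, 64KB => 0.2%, 32KB => 0.4%, 16KB => 0.8% of the dataset.
arc_for_metadata() {
  # $1 = uncompressed dataset size in GB, $2 = recordsize in KB
  awk -v d="$1" -v r="$2" 'BEGIN { printf "%.1f\n", d * (12.8 / r) / 100 }'
}

arc_for_metadata 749 16   # prints 6.0 (GB, before ARC overhead)
```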
The ZFS revenge

In order to improve ZFS performance, I had 3 options:

  1. Increase the ARC size to 7GB
  2. Use a larger InnoDB page size, like 64KB
  3. Add an L2ARC

I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.

The choice of an i3.large instance type is not accidental. The instance has a 475GB ephemeral NVMe storage device. Let’s try to use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:

echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
echo 838860800 > /sys/module/zfs/parameters/zfs_arc_meta_limit
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
echo 4 > /sys/module/zfs/parameters/l2arc_headroom
echo 16 > /sys/module/zfs/parameters/l2arc_headroom_boost
echo 0 > /sys/module/zfs/parameters/l2arc_norw
echo 1 > /sys/module/zfs/parameters/l2arc_feed_again
echo 5 > /sys/module/zfs/parameters/l2arc_feed_min_ms
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
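With these tunables in place, the L2ARC fill can be watched through the arcstats kstat file; /proc/spl/kstat/zfs/arcstats is the usual ZFS-on-Linux location, and the little filter below is just a sketch:

```shell
# Print the L2ARC counters from arcstats. Each line is "name type data";
# l2_size is the uncompressed bytes cached, l2_asize the on-device
# (compressed) size, which is why 475GB of NVMe can hold ~600GB of data.
l2stats() {
  awk '$1 ~ /^l2_(size|asize|hits|misses)$/ { printf "%s=%s\n", $1, $3 }' "$@"
}

l2stats /proc/spl/kstat/zfs/arcstats
```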

I then ran ‘cat /var/lib/mysql/data/sbtest/* > /dev/null’ to force filesystem reads and caches on all of the tables. A key setting here to allow the L2ARC to cache data is zfs_arc_meta_limit. It needs to be slightly smaller than zfs_arc_max in order to allow some data to be cached in the ARC. Remember that the L2ARC is fed by the LRU of the ARC: you need data cached in the ARC in order to have data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ratio of only 1.28x. A more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:

Now, the comparison is very different. ZFS completely outperforms XFS: 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher, but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could use flashcache or bcache with XFS and improve the XFS results, but these technologies are way more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in a compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement and cost for the gp2 disk.

ZFS is much more complex than XFS and EXT4 but, that also means it has more tunables/options. I used a simplistic setup and an unfair benchmark which initially led to poor ZFS results. With the same benchmark, very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results, now 66% above XFS.


We have seen in this post why the general perception is that ZFS under-performs compared to XFS or EXT4. The presence of B-trees for the files has a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When properly cached, the performance of ZFS is excellent. ZFS allows you to optimize the use of EBS volumes, both in terms of IOPS and size, when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.

The post About ZFS Performance appeared first on Percona Database Performance Blog.

Categories: Web Technologies
