Skip to content

The --exact-rowcount option is buggy on large tables #830

@zhxnlai

Description

@zhxnlai

Context

During a schema change, gho-st copies every row from the original table to a ghost table. This results in a lock on the original table and a long running query similar to the following

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      1 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594199 | my_service_ddl | 10.137.166.91:35626  | my_service         | Query       |    0 | Sending data                                                  | select  /* gh-ost `my_service`.`my_table` iteration:108 */
						`id`
					from
						`my_service`.`my_table`
					where ((`id` > _binary'1131891')) and ((`id` < _binary'310834793') or ((`id` = _binary'310834793')))
					order by
						`id` asc
					limit 1
					offset 9999 |

It appears that when --exact-rowcount is enabled, ghost grabs another lock on the original table and starts an addition query to calculate a more accurate progress ETA. By default, row counting happens in parallel to row copying.

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      2 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594046 | my_service_ddl | 10.137.166.91:57666  | my_service         | Query       |   71 | Sending data                  | select /* gh-ost */ count(*) as rows from `my_service`.`my_table`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
...

Problem

After all the rows are copied to the ghost table, gh-ost attempts to cut-over the original table.

Usually, the row counting query finishes before all the rows are copied to the ghost table. At the time of cut-over, because both queries have finished, there is no lock on the origin table. Hence the schema change is successful.

However, we repeatedly found that on a large table (> 400GB), row counting query takes longer than row copying. At the time of cut-over, the row counting query is still running and holding a lock on the original table.

This causes the schema change to fail with this error: "Error 1205: Lock wait timeout exceeded; try restarting transaction", which was described in #773

Proposals

If you run into this issue, try disabling --exact-rowcount.

Improvements that we could make

  • Document this bug; add a warning.
  • Teach gh-ost to terminate the row counting query before cutting over the original table.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions