-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Description
Context
During a schema change, gho-st copies every row from the original table to a ghost table. This results in a lock on the original table and a long running query similar to the following
mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database | Table | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table | 1 | 0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)
mysql> show full processlist;
...
| 1594199 | my_service_ddl | 10.137.166.91:35626 | my_service | Query | 0 | Sending data | select /* gh-ost `my_service`.`my_table` iteration:108 */
`id`
from
`my_service`.`my_table`
where ((`id` > _binary'1131891')) and ((`id` < _binary'310834793') or ((`id` = _binary'310834793')))
order by
`id` asc
limit 1
offset 9999 |
It appears that when --exact-rowcount
is enabled, ghost grabs another lock on the original table and starts an addition query to calculate a more accurate progress ETA. By default, row counting happens in parallel to row copying.
mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database | Table | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table | 2 | 0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)
mysql> show full processlist;
...
| 1594046 | my_service_ddl | 10.137.166.91:57666 | my_service | Query | 71 | Sending data | select /* gh-ost */ count(*) as rows from `my_service`.`my_table` |
...
Problem
After all the rows are copied to the ghost table, gh-ost attempts to cut-over the original table.
Usually, the row counting query finishes before all the rows are copied to the ghost table. At the time of cut-over, because both queries have finished, there is no lock on the origin table. Hence the schema change is successful.
However, we repeatedly found that on a large table (> 400GB), row counting query takes longer than row copying. At the time of cut-over, the row counting query is still running and holding a lock on the original table.
This causes the schema change to fail with this error: "Error 1205: Lock wait timeout exceeded; try restarting transaction", which was described in #773
Proposals
If you run into this issue, try disabling --exact-rowcount
.
Improvements that we could make
- Document this bug; add a warning.
- Teach gh-ost to terminate the row counting query before cutting over the original table.