Fixing data encoding in a MySQL database

Offlate I’ve been working on a pet project, part of which involved restoring an old MySQL database. The backup was of MySQL data files and not a SQL dump. However, restoring database from those data files was not much of an issue; I just hopped over to PuPHPet and downloaded config for a VM which was quickly setup, MySQL data files copied over to /var/lib/mysql/<db-name>, repair command run and a SQL dump created.

Now to my dismay (I was kind of hoping not to run into this), the data appeared mucked up, gibberish. The reason? Most of the text is in Hindi written in Devnagari script but in the database it was encoded in latin1 charset (past snafu when apps were not considerate enough to use appropriate charset & collations – this database was backed up more than 10 years ago).

Now there are several articles out there which would tell you to set the collation of your database, tables & columns to utf8 to fix this issue but this only sets the charset so that new data entered into those tables/columns would be encoded correctly – it does not fix the gibberish which is existing data.

A friend suggested the following command to fix the data encoding using iconv tool:

		
			mysqldump --add-drop-table database_to_correct | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql database_to_correct

But this did not work in my case but I tested a hunch & it worked for me:

		
			UPDATE
	<table-name>
SET
	<col-1> = CONVERT( CAST( CONVERT( <col-1> USING latin1 ) AS BINARY ) USING utf8 ),
	<col-2> = CONVERT( CAST( CONVERT( <col-2> USING latin1 ) AS BINARY ) USING utf8 ),
	<col-3> = CONVERT( CAST( CONVERT( <col-3> USING latin1 ) AS BINARY ) USING utf8 )
;

What this SQL does is, it tells MySQL to cast latin1 encoded column into binary and then re-encode it into utf8.

Having found a working solution to my encoding woes, my last hurdle was to run this on the whole database. With 100+ tables and several columns per table, running this SQL manually for each table/column would be time consuming and not very smart.

And so, I ended up writing a quick PHP script to do the job for me. This script would traverse the whole database, look into all tables, determine which columns to update and then update all of them. Writing the script took about 15 minutes, running it on CLI took about 1 minute (time will vary depending on the size of database, which in my case was less than 100 MB).

Sharing the script in case it may be of use to someone else in a similar predicament.

Before you run this script, here are some caveats:

This script is best run on CLI – not a good idea to run it via web server as it could time out, not to mention, someone might get the URL & run this thus messing up your database even further.
The script assumes you have already set your table columns to utf8 charset and utf8_unicode_ci collation since it relies on this to identify which columns to re-encode.
If the charset you want to encode data into is not utf8 then you should update CHARSET & COLLATION constants accordingly.
You will, ofcourse, need to update $_creds with appropriate credentials – server domain/IP, database name, username and password.
Backup your database before you run this script on it & never run such things on production.

Happy encoding!

Fixing data encoding in a MySQL database

Leave a Reply Cancel Reply

Sliding Sidebar