Every web programmer knows that UTF-8 is where it’s at these days. Even if you are coding something for only the US market, you’re still probably going to run into names, addresses, or whatever with accented characters.
Everyone knows that “all you have to do” is make your app speak UTF-8, set the character encoding in your web pages to UTF-8, and set your database encoding and collation to something like utf8_unicode_ci
, and you’re off and running… right?
Not quite. You’ve probably run into the situation where you hop into phpMyAdmin and edit some table row, and what to your wondering eyes should appear, but something like this:
Voici un article français.
Well, crap. That’s supposed to be a “ç” in “français”, not an “ç”!
And yet, when you view the web page and edit the content in your app, it looks and works fine! What’s going on here?
It’s really quite simple, and apparently something that no one bothers to mention: you have to tell your app to communicate with the database in UTF-8! Here’s how you do that in PHP and Rails…
This problem has bothered me for years. I just put up with it. Finally, one day I discovered the answer. I also asked other people I knew if they knew about this little “trick”, and they looked at me like I had seven heads – like, “Duh! Everyone knows that!”
But I even looked in many programming texts like the Agile Rails Development book, and no dice! Apparently, this trick is the best kept secret since the development of the H-bomb, or the assassination of JFK.
And so, without further ado, here’s the one thing you’re missing…
In PHP, do this:
$DB = mysql_connect(DB_SERVER,DB_SERVER_USERNAME,DB_SERVER_PASSWORD); $DB_SELECTED = mysql_select_db(DB_DATABASE, $DB); mysql_set_charset("utf8");
In other words, connect to MySQL, pick a database, and then tell PHP to talk to MySQL in UTF-8. That’s it. It’s the “mysql_set_charset("utf8");
” that does the trick.
In Rails, edit your database.yml
and make it look like so:
development: adapter: mysql database: DATABASE username: USERNAME password: PASSWORD encoding: utf8
With Rails, it’s the “encoding: utf8
” part that tells Rails, “Talk to MySQL in UTF-8!”
In both cases, if you do not add the extra little line, PHP or Rails will talk to MySQL in latin1 encoding. This means that the data will be stored in the database as UTF-8, and your web pages will display the content in UTF-8. But when it comes time to actually write to or read from the DB, the data will be conveyed in latin1, not in UTF-8. That’s why you will see things like “français” in the example above, instead of “français”. And that’s why editing some data through phpMyAdmin becomes rather difficult.
“Well, that’s great!” you say, “But there’s one problem: I have 50 gajillion rows in one of my tables, and all the characters are screwed up. Worse yet, my formerly nicely-accented web pages are now displaying gibberish! HELP!”
Not to worry. After making your app talk UTF-8 to MySQL, there is one more step: You must export all the data, and then import it all again. There is also a trick to doing this step, as you’ll see in just a second.
You’ll need to SSH in to your server and use the command prompt to dump your entire database, and then stuff it back in. Don’t cry, I’ll tell you exactly what to type:
mysqldump --user USERNAME --password --skip-set-charset --default-character-set=latin1 DATABASE > dump.sql
- Change the line “
/*!40101 SET NAMES latin1 */;
” to “/*!40101 SET NAMES utf8 */;
” indump.sql
using vi or pico mysql --user USERNAME --password DATABASE < dump.sql
Right, so, just replace USERNAME
and DATABASE
with the proper MySQL username and database name. Steps 1 and 3 will prompt you for your MySQL database password. Here’s what you’re doing:
- Dumps the entire database using latin1
- Changes the SQL dump so that mysql will use UTF-8 to import the data back into the DB, instead of latin1
- Actually does the import of the modified
dump.sql
file
That’s it! Now when you look in the database, you should see nice pretty UTF-8 characters. You will be able to edit content directly in the DB, and everything should still look nice and clean on your web site, too.
The key step here is #2. Changing the “SET NAMES
” line was missing from tons of examples I found on the net. I figured out the missing step by reading in the PHP manual that the mysql_set_charset()
command more or less does the MySQL query “SET NAMES utf8
“. Well, that looks familiar! Thus, changing the SET NAMES
line in the dump.sql
file does the trick!
The only mystery that remains is why this very simple, very essential information for dealing with UTF-8 is so hard to come by!
I take it back! Check this out:
http://www.nathan5.com/ruby/advanced_rails/rails_and_unicode.html