
UTF-8 vs Latin-1 (ISO-8859-1) · Article

By Isaac DeCoursey and James Corthell

What Is Covered

Summary
Converting
Caveats
Wikipedia Explains


 

Summary


The current default character set on the internet is UTF8; when UBB.threads was initially written, the database was configured to use an ISO character set. UTF8 is what Google uses and what it indexes; characters that are not valid UTF8 may be ignored by Google and may then not show up in a Google search.

Your MySQL collation and your HTML character set are configured separately, and each can be set to different options; the default "fresh install" of UBB.threads starts with a UTF8 character set. If you started your site with a non-UTF8 character set/collation, then by all means, stay with that character set.


 

Converting


If you want to convert your forum and database to UTF8, and your user base is primarily English speakers, you may end up with a few black diamonds (see our Converting to UTF8 guide) that you'll need to correct.

We suggest that you convert to UTF8 if your forum language is English; if your forum is NOT English, we recommend that you stick with the ISO character set you were initially running.
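To illustrate what a conversion does at the byte level, here is a minimal Python sketch. It is not the UBB.threads conversion procedure (the Converting to UTF8 guide covers that); the function name `latin1_to_utf8` and the sample text are hypothetical and exist only for this example.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes stored as Latin-1 (ISO-8859-1) into UTF-8."""
    # Every byte value 0-255 maps to a Latin-1 character, so this decode
    # cannot fail; it can only be wrong if the data was not Latin-1 at all.
    text = raw.decode("latin-1")
    return text.encode("utf-8")


# "cafe" with an accented e, as a Latin-1 database would store it (4 bytes).
legacy = b"caf\xe9"
converted = latin1_to_utf8(legacy)

print(converted)                  # b'caf\xc3\xa9'
print(len(legacy), len(converted))  # 4 5
```

Plain English text is untouched by this step because ASCII bytes are the same in both encodings; only accented and special characters change, which is why a mostly English forum tends to have few characters to correct afterwards.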


 

Caveats


If you use a UTF8 character set and you import from a non-UTF8 RSS feed, or you copy/paste from Microsoft Word or from non-English text, you might see black diamonds. This is completely normal: it typically happens when special characters arrive as bytes in another encoding, and those bytes are not valid UTF8, so they cannot be stored or displayed as the intended characters.

Input from a Mac, much like text from Microsoft Word, can include special characters (such as the Apple logo, which has no standard Unicode code point) that do not carry over cleanly to a UTF8 character set and may display as a black diamond.
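One common way black diamonds arise can be reproduced in a few lines of Python. This is a sketch under an assumption: that the pasted text reached the site as Windows-1252 bytes (the encoding Word's "smart quotes" historically used) and was then read as UTF8. The helper name `decode_as_utf8` is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, usually rendered as a black diamond with "?"


def decode_as_utf8(raw: bytes) -> str:
    """Read bytes as UTF-8, substituting U+FFFD for any invalid byte."""
    return raw.decode("utf-8", errors="replace")


# Curly quotes around a word, encoded the way a Windows-1252 source would.
word_paste = "\u201cquoted\u201d".encode("cp1252")   # b'\x93quoted\x94'
shown = decode_as_utf8(word_paste)

print(shown.count(REPLACEMENT))   # 2

# The same text sent as real UTF-8 survives unchanged.
clean = "\u201cquoted\u201d".encode("utf-8")
print(decode_as_utf8(clean) == "\u201cquoted\u201d")  # True
```

The curly quotes themselves are perfectly valid in UTF8; the diamonds appear only because the bytes were produced in one encoding and interpreted in another.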

The best advice that we can give is:
• If your site is currently ENGLISH ONLY and 99% of your users are ENGLISH SPEAKERS, update your headers to UTF8.
• If your site does not cater primarily to English speakers, keep the character set you're currently using.
• If you are installing a fresh/new site, use UTF8.


 

Wikipedia Explains


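Wikipedia explains both character sets reasonably well (see the UTF-8 and ISO/IEC 8859-1 links under See Also). The difference between UTF-8 and Latin-1 (ISO-8859-1) can be summarized as follows: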

"Former is a variable-length encoding, latter single-byte fixed length encoding. Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At physical encoding level, only codepoints 0 - 127 get encoded identically; code points 128 - 255 differ by becoming 2-byte sequence with UTF-8 whereas they are single bytes with Latin-1."
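The byte-level difference described in the quote can be checked with a short Python sketch. The helper `encoded_lengths` and the sample characters are chosen here for illustration only.

```python
from typing import Tuple


def encoded_lengths(ch: str) -> Tuple[int, int]:
    """Return (Latin-1 byte count, UTF-8 byte count) for one character."""
    return len(ch.encode("latin-1")), len(ch.encode("utf-8"))


# Code point 65 ("A") is in the 0-127 range: identical in both encodings.
print(encoded_lengths("A"))        # (1, 1)
print("A".encode("latin-1") == "A".encode("utf-8"))  # True

# Code points 233 and 255 are in the 128-255 range:
# one byte in Latin-1, two bytes in UTF-8.
print(encoded_lengths("\u00e9"))   # (1, 2)
print(encoded_lengths("\u00ff"))   # (1, 2)

# The euro sign (U+20AC) lies beyond code point 255, so Latin-1 cannot
# encode it at all, while UTF-8 uses three bytes.
try:
    "\u20ac".encode("latin-1")
except UnicodeEncodeError:
    print("not representable in Latin-1")
print(len("\u20ac".encode("utf-8")))  # 3
```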
Posted on September 5th, 2015 · Updated on January 13th, 2017

UTF-8 vs Latin-1 (ISO-8859-1) · Pictures

Unicode Statistics - Source: Google Blog

Comments and Attributions

This article originated from a midnight conversation between Isaac and James regarding multibyte characters.

See Also:
How to Fix Black Diamonds (Article)
Converting to UTF8 (Guide)

Wikipedia: ISO/IEC 8859-1
Wikipedia: UTF-8
