Monday, August 9, 2010

[PHP] How to Convert HTML to BBCODE

While migrating to vBulletin Blog I realized there was one big difference between the two systems that would affect how our blogs were displayed. TinyMCE/MyBlog stored everything in the database as pure HTML, while the vBulletin Blog system stored everything as plain text using BBCODE. For those of you who do not know what BBCODE is, it is a simple, text-based method for changing the way text is presented.

TinyMCE HTML

One of the things that I found annoying was the way TinyMCE seemed to randomly modify tags. An example is a typical HTML image tag that should look like this:

HTML Code:
<img src="someimage.gif" alt="someimage" width="xx" height="xx">
However, TinyMCE added extra attributes that were not even valid. A simple image tag ended up looking like this:

HTML Code:
<img mce_bogus="1" title="someimage" src="http://codecall.net/someimage.gif" mce_src="http://codecall.net/someimage.gif">
Notice the new "mce_" attributes? I have no idea what their purpose is, but they are not valid HTML. Similarly, br tags were changed to this:

HTML Code:
<br mce_bogus="1">
Because there seemed to be no consistent style I had to write regular expressions to handle these odd tags.
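
To illustrate the kind of pattern this calls for, here is a minimal sketch that removes the "mce_" attributes from the two examples above. The function name strip_mce_attributes is hypothetical and is not part of the migration script; the pattern also assumes the attribute values are always double-quoted, as in the samples shown.

```php
<?php
// Hypothetical helper (not from the original script): removes any
// attribute whose name starts with "mce_" and whose value is wrapped
// in double quotes.
function strip_mce_attributes(string $html): string
{
    // \s+          leading whitespace before the attribute
    // mce_[a-z_]+  attribute name such as mce_bogus or mce_src
    // ="[^"]*"     a double-quoted value
    return preg_replace('/\s+mce_[a-z_]+="[^"]*"/i', '', $html);
}

$br  = '<br mce_bogus="1">';
$img = '<img mce_bogus="1" title="someimage" src="http://codecall.net/someimage.gif" mce_src="http://codecall.net/someimage.gif">';

echo strip_mce_attributes($br), "\n";
// prints: <br>
echo strip_mce_attributes($img), "\n";
// prints: <img title="someimage" src="http://codecall.net/someimage.gif">
```

Note that the pattern is not tag-aware, so it would also remove matching text that appears outside of a tag. For a one-off migration that may be acceptable.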

PHP Tags

Since TinyMCE/MyBlog was unable to handle backslashes and PHP Open/Close tags (<?php ?>) out of the box, we were forced to write custom code to handle this. During the migration these tags also needed to be converted back to their original form. These are the tags used:

Code:
::BACKSLASH::
::PHP_OPEN::
::PHP_CLOSE::
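
The custom code that produced these placeholders is not shown in this post, so the following is only a sketch of how such a round trip might work. The function names are hypothetical, and the literal strings on the left of the map are assumptions taken from the conversion script further down.

```php
<?php
// Hypothetical helpers, not from the original blog code. The literal
// forms on the left are assumptions based on the conversion script.
function placeholder_map(): array
{
    return array(
        '\\' => '::BACKSLASH::',  // a single backslash
        '<?' => '::PHP_OPEN::',
        '?>' => '::PHP_CLOSE::',
    );
}

// Swap the troublesome sequences for placeholders before saving.
function encode_placeholders(string $text): string
{
    $map = placeholder_map();
    return str_replace(array_keys($map), array_values($map), $text);
}

// Restore the original sequences when reading the text back.
function decode_placeholders(string $text): string
{
    $map = placeholder_map();
    return str_replace(array_values($map), array_keys($map), $text);
}

$sample  = '<?php echo "a\\nb"; ?>';
$encoded = encode_placeholders($sample);

echo $encoded, "\n";
// prints: ::PHP_OPEN::php echo "a::BACKSLASH::nb"; ::PHP_CLOSE::
echo decode_placeholders($encoded), "\n";
```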
Other Tags

Since TinyMCE wraps a lot of questionable HTML tags around text (such as <span>), all tags not defined as BBCODE were stripped from the text and discarded. The tags that were useful and kept were:

HTML Code:
b /b
i /i
u /u
ul /ul
li /li
img
div
br
a href
strong
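
As a quick illustration of what "stripped" means here, PHP's strip_tags() removes the tags themselves but keeps the text between them. The sample string below is made up for the example.

```php
<?php
// Made-up sample: <span> and <em> are not in the list of kept tags.
$html  = '<span class="x">Hello <em>world</em></span>';
$plain = strip_tags($html);

echo $plain, "\n";
// prints: Hello world
```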
The PHP Code!

It is easy to wrap a lot of words around this code (as I've done above), but the actual script is quite simple and straightforward. Using regular expressions, this is how I achieved the conversion:

Code:
// $text is the text data from the database
// of the old blogging system.

// Tags to Find
$htmltags = array(
                        '/\<b\>(.*?)\<\/b\>/is',
                        '/\<i\>(.*?)\<\/i\>/is',
                        '/\<u\>(.*?)\<\/u\>/is',
                        '/\<ul\>(.*?)\<\/ul\>/is',
                        '/\<li\>(.*?)\<\/li\>/is',
                        '/\<img(.*?) src=\"(.*?)\" (.*?)\>/is',
                        '/\<div\>(.*?)\<\/div\>/is',
                        '/\<br(.*?)\>/is',
                        '/\<strong\>(.*?)\<\/strong\>/is',
                        '/\<a href=\"(.*?)\"(.*?)\>(.*?)\<\/a\>/is',
                        );

// Replace with
$bbtags = array(
                        '[b]$1[/b]',
                        '[i]$1[/i]',
                        '[u]$1[/u]',
                        '[list]$1[/list]',
                        '[*]$1',
                        '[img]$2[/img]',
                        '$1',
                        "\n", // double quotes, so this is a real newline and not a literal \n
                        '[b]$1[/b]',
                        '[url=$1]$3[/url]',
                        );

// Replace $htmltags in $text with $bbtags
$text = preg_replace($htmltags, $bbtags, $text);

// Strip all other HTML tags
$text = strip_tags($text);

// Do a simple text replace for our PHP Tags. This runs after strip_tags()
// because strip_tags() also removes PHP open/close tags and their contents.
$text = str_ireplace("::BACKSLASH::", "\\", $text);
$text = str_ireplace("::PHP_OPEN::", "<?", $text);
$text = str_ireplace("::PHP_CLOSE::", "?>", $text);

// Other code such as DB cleansing and
// inserting below
.....

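To try the conversion on a sample string, the same patterns can be wrapped in a function. This is a sketch rather than the exact migration script: the function name html_to_bbcode and the sample input are hypothetical, the br replacement is written as a double-quoted "\n" so it yields a real newline, and the placeholders are restored after strip_tags() because strip_tags() also removes PHP open/close tags.

```php
<?php
// Hypothetical wrapper around the patterns from this post.
function html_to_bbcode(string $text): string
{
    // Tags to find (same order as the replacement list below)
    $htmltags = array(
        '/<b>(.*?)<\/b>/is',
        '/<i>(.*?)<\/i>/is',
        '/<u>(.*?)<\/u>/is',
        '/<ul>(.*?)<\/ul>/is',
        '/<li>(.*?)<\/li>/is',
        '/<img(.*?) src="(.*?)" (.*?)>/is',
        '/<div>(.*?)<\/div>/is',
        '/<br(.*?)>/is',
        '/<strong>(.*?)<\/strong>/is',
        '/<a href="(.*?)"(.*?)>(.*?)<\/a>/is',
    );

    // BBCODE replacements
    $bbtags = array(
        '[b]$1[/b]',
        '[i]$1[/i]',
        '[u]$1[/u]',
        '[list]$1[/list]',
        '[*]$1',
        '[img]$2[/img]',
        '$1',
        "\n",
        '[b]$1[/b]',
        '[url=$1]$3[/url]',
    );

    $text = preg_replace($htmltags, $bbtags, $text);

    // Drop every tag that was not converted above (e.g. <span>)
    $text = strip_tags($text);

    // Restore the placeholders last, so strip_tags() cannot remove
    // the restored PHP tags.
    $text = str_ireplace('::BACKSLASH::', '\\', $text);
    $text = str_ireplace('::PHP_OPEN::', '<?', $text);
    $text = str_ireplace('::PHP_CLOSE::', '?>', $text);

    return $text;
}

// Made-up sample input in the style TinyMCE produced
$html = '<div><strong>Hi</strong> <span>there</span><br mce_bogus="1">'
      . '<a href="http://codecall.net" target="_blank">CodeCall</a></div>';

echo html_to_bbcode($html), "\n";
// prints:
// [b]Hi[/b] there
// [url=http://codecall.net]CodeCall[/url]
```
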
Conclusion

There you go: a simple, rough HTML to BBCODE converter written entirely in vim through SSH.
