Caritatis

Just another WordPress.com weblog

Getting rid of Word junk February 4, 2011

Filed under: ColdFusion,Drupal,FCK Editor — caritatis @ 9:01 pm

FROM:  http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx

Haven’t tried this, but looks good.

Wednesday, November 23, 2005 3:40:36 PM (GMT Standard Time, UTC+00:00) ( .Net General )

Introduction

I’ve spent a long time trying many different approaches to getting rid of MS Word HTML when importing or pasting text into my content management system, with very mixed success.  Previous efforts involved using the MSHTML Element DOM, but this was slow and difficult to implement.  I think I’ve finally found a satisfactory and fast solution using only regular expressions.  Please feel free to use it in your applications, and post any improvements you may find.

The Code

/// <summary>
/// Removes all FONT and SPAN tags, and all Class and Style attributes.
/// Designed to get rid of non-standard Microsoft Word HTML tags.
/// </summary>
private string CleanHtml(string html)
{
    // start by completely removing all unwanted tags
    html = Regex.Replace(html, @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase);
    // then run another pass over the html (twice), removing unwanted attributes
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    return html;
}

Samples of non-standard Microsoft Word HTML

<SPAN lang=EN-IE style="mso-ansi-language: EN-IE">
<p>
<UL style="MARGIN-TOP: 0cm" type=circle>
<o:p>&nbsp;</o:p>
<li style='mso-list:l3 level1 lfo3;tab-stops:list 36.0pt'>

Explanation of Regular Expressions

I’ve spent a good deal of time examining the problematic tags that MS Word inserts in its HTML; some examples are shown above.  The above code is based on a few requirements for my CMS:

  • remove all FONT and SPAN tags, because all the formatting in my CMS is done through style-sheets.
  • remove all CLASS and STYLE attributes, because they mean nothing outside of the original Word document
  • remove all namespace tags and attributes like <o:p> and < … v:shape … >

The first regular expression removes unwanted tags, and is broken down as follows:

<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>
  • match an open tag character <
  • and optionally match a close tag sequence </  (because we also want to remove the closing tags)
  • match any of the list of unwanted tags: font,span,xml,del,ins
  • a pattern is given to match any of the namespace tags, anything beginning with o,v,w,x,p, followed by a : followed by another word
  • match any attributes as far as the closing tag character >
  • the replace string for this regex is “”, which will completely remove the instances of any matching tags.
  • note that we are not removing anything between the tags, just the tags themselves
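
As a rough illustration, the same tag-stripping pattern can be tried outside .NET. The sketch below uses Python's re module instead of the C# Regex class from the post; the function name strip_word_tags and the sample string are made up for this example, and only the pattern itself comes from the post.

```python
import re

# Same pattern as the first Regex.Replace call in the post, ported to Python.
UNWANTED_TAGS = re.compile(
    r"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>",
    re.IGNORECASE,
)


def strip_word_tags(html: str) -> str:
    """Remove the unwanted tags but keep whatever text sits between them."""
    return UNWANTED_TAGS.sub("", html)


# Hypothetical sample built from the Word fragments quoted earlier in the post.
sample = '<SPAN lang=EN-IE style="mso-ansi-language: EN-IE">Hello</SPAN><o:p>&nbsp;</o:p>'
print(strip_word_tags(sample))
```

Ordinary tags such as <p> pass through untouched, because the p in [ovwxp] only matches when it is followed by a colon.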

The second regular expression removes unwanted attributes, and is broken down as follows:

<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>
  • match an open tag character <
  • capture any text before the unwanted attribute (This is $1 in the replace expression)
  • match (but don’t capture) any of the unwanted attributes: class, lang, style, size, face, o:p, v:shape etc.
  • there should always be an = character after the attribute name
  • match the value of the attribute by identifying the delimiters. These can be single quotes, double quotes, or no quotes at all.
  • for single quotes, the pattern is: ‘ followed by anything but a ‘ followed by a ‘
  • similarly for double quotes.
  • for a non-delimited attribute value, I specify the pattern [^\s>]+, i.e. anything except whitespace or the closing tag character >
  • lastly, capture whatever comes after the unwanted attribute in ([^>]*)
  • the replacement string <$1$2> reconstructs the tag without the unwanted attribute found in the middle.
  • note: each pass removes only one occurrence of an unwanted attribute per tag, which is why I run the same regex twice.  For example, take the html fragment: <p style="Margin-TOP:3em">
    If a tag like this carries a second unwanted attribute (a class, say), the regex will only remove one of them in a single pass.  Running the regex twice will remove the second one.  I can’t think of any reasonable cases where it would need to be run more than that.
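
The one-attribute-per-pass behaviour can be sketched in Python as well. This is not the author's C# code: it is an assumed port of the same pattern to Python's re module, and instead of running a fixed two passes it repeats until nothing changes. The names strip_word_attributes and max_passes, and the sample fragment, are invented for the example.

```python
import re

# Same pattern as the second Regex.Replace call in the post, ported to Python.
UNWANTED_ATTR = re.compile(
    r"""<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^\s>]+)([^>]*)>""",
    re.IGNORECASE,
)


def strip_word_attributes(html: str, max_passes: int = 10) -> str:
    """Repeat the substitution until a pass changes nothing (or max_passes is hit)."""
    for _ in range(max_passes):
        cleaned = UNWANTED_ATTR.sub(r"<\1\2>", html)
        if cleaned == html:
            break
        html = cleaned
    return html


# Hypothetical fragment with three unwanted attributes on one tag.
fragment = '<p class=MsoNormal style="MARGIN-TOP: 3em" lang=EN-IE>text</p>'
print(strip_word_attributes(fragment))
```

A tag with three unwanted attributes, as in this fragment, needs three passes, so looping until the text stops changing is a little more forgiving than a fixed count. The substitution leaves some extra whitespace inside the cleaned tag, which is harmless in HTML.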

Suggestions!

If you have any suggestions or improvements, please post them here as comments.
Thanks :)

Lots of great comments!  Here’s one in CF:

<cffunction name="cleanUpWord" access="public" output="false" returntype="string" returnformat="JSON" hint="I clean up MS Word code">
<cfargument name="inputString" type="string" required="yes">

<cfset var local = StructNew()>

<!--- The two regex expressions in this function were taken from http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx --->

<cfset local.cleanText = ReplaceNoCase(arguments.inputString,"<p ","<p><p ","all")> <!--- Keep our P tag when it has bullshit MS Word attributes --->
<!--- The bare "o" alternative from the original comment is dropped here because it also matched tags such as <ol>; [ovwxp]:\w+ already covers the o: namespace --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"<[/]?(font|span|xml|del|ins|st1|[ovwxp]:\w+)[^>]*?>","","all")> <!--- Borrowed Regex --->
<!--- Unlike the C# version, this removes the whole tag rather than just the attribute, which appears to be why the P tags are doubled above --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","","all")> <!--- Borrowed Regex --->
<cfset local.cleanText = ReplaceNoCase(local.cleanText,"&ndash;","-","all")> <!--- Get rid of unnecessary escape sequences --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"&rsquo;|&lsquo;","'","all")> <!--- Get rid of unnecessary escape sequences --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"&rdquo;|&ldquo;","""","all")> <!--- Get rid of unnecessary escape sequences --->

<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"“|”","&quot;","all")> <!--- Get rid of MS Word SmartQuotes --->

<cfreturn local.cleanText>
</cffunction>

 

Design To Theme Downloads January 26, 2011

Filed under: Drupal — caritatis @ 3:33 pm

Top 10 Themer Mistakes

PSD to Theme Workbook

From http://www.designtotheme.com/.  These were released under http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ so I think they’re ok to post here.

 

Fixing paths in drupal database June 18, 2010

Filed under: Drupal — caritatis @ 1:05 pm

Here’s the syntax of the search/replace SQL:

UPDATE files SET filepath = REPLACE(filepath,'path/to/search','path/to/replace');
UPDATE node_revisions SET body = REPLACE(body, 'path/to/search', 'path/to/replace');
UPDATE node_revisions SET teaser = REPLACE(teaser, 'path/to/search', 'path/to/replace');

Or it can be done from within the mysql gui admin tool too, but the command line is probably faster/easier.
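
Before running this kind of UPDATE on a live site, it can help to try the pattern on a throwaway copy. Here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database rather than MySQL, so it only illustrates the REPLACE() idea. The table contents, the paths and the helper name count_and_replace are invented for the example.

```python
import sqlite3

# In-memory stand-in for the Drupal "files" table (assumed shape, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (fid INTEGER PRIMARY KEY, filepath TEXT)")
conn.executemany(
    "INSERT INTO files (filepath) VALUES (?)",
    [("sites/old/files/logo.png",), ("other/place/readme.txt",)],
)


def count_and_replace(conn, search, replace):
    """Count the rows that contain `search`, then rewrite only those rows."""
    pattern = "%" + search + "%"
    affected = conn.execute(
        "SELECT COUNT(*) FROM files WHERE filepath LIKE ?", (pattern,)
    ).fetchone()[0]
    conn.execute(
        "UPDATE files SET filepath = REPLACE(filepath, ?, ?) WHERE filepath LIKE ?",
        (search, replace, pattern),
    )
    conn.commit()
    return affected


changed = count_and_replace(conn, "sites/old/files", "sites/new/files")
paths = [row[0] for row in conn.execute("SELECT filepath FROM files ORDER BY fid")]
print(changed, paths)
```

Counting the matching rows first gives a quick sanity check on how much the UPDATE is about to touch.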

Also, check the mini-panels, but they can’t be updated as quickly:

 select * from panels_pane where configuration like '%path/to/search%';

The reason changing data directly in the mini-panels table (panels_pane) doesn’t work is that Drupal stores serialized PHP arrays in a single DB field (one of the not-so-great things about Drupal’s underlying structure).  For example, the field looks like this:

a:4:{s:11:"admin_title";s:18:"Did You Know title";s:5:"title";s:6:"<none>";s:4:"body";s:152:"<img src="/site_name/sites/site_path/files/heading_did-you-know_orange.jpg" alt="Did You Know" height="35" width="254">";s:6:"format";s:1:"2";}

The ‘152’ is the length of that array element’s string value (PHP counts bytes). It isn’t really a checksum, but it behaves like a simple one: if the text following the colon after 152 isn’t exactly that long, the array breaks when it’s unserialized. So you can change the data within the array element, but you also have to work out the difference in length and update the number to match. It’s possible, but it’s a hassle and difficult to do automagically via a query.
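
To make the length problem concrete, here is a small Python sketch that handles one serialized PHP string element on its own. It is not a full unserializer: it assumes a single s:N:"..."; token, and the function names and paths are invented for the example.

```python
import re


def serialize_php_string(value: str) -> str:
    """Emit a PHP-style serialized string; the prefix is the byte length."""
    return 's:%d:"%s";' % (len(value.encode("utf-8")), value)


def replace_in_php_string(token: str, search: str, replace: str) -> str:
    """Replace text inside one serialized string and recompute its length prefix."""
    match = re.fullmatch(r's:(\d+):"(.*)";', token, re.DOTALL)
    if match is None:
        raise ValueError("not a single serialized PHP string")
    declared, payload = int(match.group(1)), match.group(2)
    if declared != len(payload.encode("utf-8")):
        raise ValueError("length prefix does not match payload")
    return serialize_php_string(payload.replace(search, replace))


# Hypothetical element; a plain SQL REPLACE would leave the old prefix in place.
old = serialize_php_string('<img src="/old_site/files/a.jpg">')
new = replace_in_php_string(old, "/old_site/", "/new_site_name/")
print(old)
print(new)
```

The replacement path is five characters longer than the original, so the prefix has to grow by five as well. A plain SQL REPLACE never makes that adjustment.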

UPDATE:

My co-worker found a script that handles the issues with serialized php arrays.  Yeah!  You can find it at this site:  http://www.davesgonemental.com/mysql-database-search-replace-with-serialized-php/.  I’m including the code here because this is the kind of thing you try to find later and oops!  the site doesn’t exist!  Too bad wordpress doesn’t keep spacing.

<?php
// Safe Search and Replace on Database with Serialized Data v1.0.1

// This script is to solve the problem of doing database search and replace
// when developers have only gone and used the non-relational concept of
// serializing PHP arrays into single database columns.  It will search for all
// matching data on the database and change it, even if it’s within a serialized
// PHP array.

// The big problem with serialised arrays is that if you do a normal DB
// style search and replace the lengths get mucked up.  This search deals with
// the problem by unserializing and reserializing the entire contents of the
// database you’re working on.  It then carries out a search and replace on the
// data it finds, and dumps it back to the database.  So far it appears to work
// very well.  It was coded for our WordPress work where we often have to move
// large databases across servers, but I designed it to work with any database.
// Biggest worry for you is that you may not want to do a search and replace on
// every damn table – well, if you want, simply add some exclusions in the table
// loop and you’ll be fine.  If you don’t know how, you possibly shouldn’t be
// using this script anyway.

// To use, simply configure the settings below and off you go.  I wouldn’t
// expect the script to take more than a few seconds on most machines.

// BIG WARNING!  Take a backup first, and carefully test the results of this code.
// If you don’t, and you vape your data then you only have yourself to blame.
// Seriously.  And if your English is bad and you don’t fully understand the
// instructions then STOP.  Right there.  Yes.  Before you do any damage.

// USE OF THIS SCRIPT IS ENTIRELY AT YOUR OWN RISK.  I/We accept no liability from its use.

// Written 20090525 by David Coveney of Interconnect IT Ltd (UK)
// http://www.davesgonemental.com or http://www.interconnectit.com or
// http://spectacu.la and released under the WTFPL
// ie, do what ever you want with the code, and I take no responsibility for it OK?
// If you don’t wish to take responsibility, hire me through Interconnect IT Ltd
// on +44 (0)151 331 5140 and we will do the work for you, but at a cost, minimum 1hr
// To view the WTFPL go to http://sam.zoy.org/wtfpl/ (WARNING: it’s a little rude, if you’re sensitive)

// Version 1.0.1 – styling and form added by James R Whitehead.

// Credits:  moz667 at gmail dot com for his recursive_array_replace posted at
//           uk.php.net which saved me a little time – a perfect sample for me
//           and seems to work in all cases.

//  Start TIMER
//  -----------
$stimer = explode( ' ', microtime() );
$stimer = $stimer[1] + $stimer[0];
//  -----------

// Database Settings

$host = 'localhost';        // normally localhost, but not necessarily.
$usr  = '';       // your db userid
$pwd  = '';                 // your db password
$db   = '';           // your database

// Replace options

$search_for   = '';  // the value you want to search for
$replace_with = '';  // the value to replace it with
if (isset($_POST['search']) && $search_for == '') {
$search_for  = stripcslashes($_POST['search']);
}

if (isset($_POST['replace']) && $replace_with == '') {
$replace_with  = stripcslashes($_POST['replace']);
}

if ($search_for == '' || $replace_with == '' || $replace_with == $search_for) {
header('Content-Type: text/html; charset=UTF-8');?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Search and replace DB.</title>
<style type="text/css">
body {
background-color: #E5E5E5;
font-size: 12px
}

form {
display:block;
width: 400px;
padding: 10px;
margin: 50px auto;
border:solid 10px #ccc;
background-color: #F5F5F5;
}

fieldset {
border: 0 none;
}

label {
font-weight: bold;
display:block;
line-height: 2em;
}

input.text {
margin-bottom: 1em;
display:block;
width: 90%;
}

input.button {
}

div.help {
border-top: 1px dashed #999999;
margin-top: 20px;
padding-top: 10px
}

</style>
</head>
<body>
<form action="<?php echo basename(__FILE__)?>" method="post">
<fieldset>
<label for="search_text">Search for:</label>
<input id="search_text" type="text" name="search" value="<?php echo htmlspecialchars($search_for, ENT_QUOTES); ?>"/>
<label for="replac_text">Replace with:</label>
<input id="replac_text" type="text" name="replace" value="<?php echo htmlspecialchars($replace_with, ENT_QUOTES); ?>"/>
<input type="submit" value="Search and replace" onclick="if (confirm('Are you really REALLY sure you want to do that?')){return true;}return false;"/>
<div>
<h4>Spectacu.la Safe Search and Replace on Database with Serialized Data v1.0.1</h4>
<p>This script is to solve the problem of doing database search and replace
when developers have only gone and used the non-relational concept of
serializing PHP arrays into single database columns.  It will search for all
matching data on the database and change it, even if it’s within a serialized
PHP array.</p>

<p>The big problem with serialised arrays is that if you do a normal DB
style search and replace the lengths get mucked up.  This search deals with
the problem by unserializing and reserializing the entire contents of the
database you’re working on.  It then carries out a search and replace on the
data it finds, and dumps it back to the database.  So far it appears to work
very well.  It was coded for our WordPress work where we often have to move
large databases across servers, but I designed it to work with any database.
Biggest worry for you is that you may not want to do a search and replace on
every damn table – well, if you want, simply add some exclusions in the table
loop and you’ll be fine.  If you don’t know how, you possibly shouldn’t be
using this script anyway.</p>

<p>To use, simply configure the settings below and off you go.  I wouldn’t
expect the script to take more than a few seconds on most machines.</p>

<p><strong style="color:red">BIG WARNING!</strong> Take a backup first, and carefully test the results of this code.
If you don’t, and you vape your data then you only have yourself to blame.
Seriously.  And if your English is bad and you don’t fully understand the
instructions then STOP.  Right there.  Yes.  Before you do any damage.
And don’t forget – <strong style="color:red">delete this utility from your
server after use.  It represents a major security threat to your database if
maliciously used.</strong></p>

<p><strong>USE OF THIS SCRIPT IS ENTIRELY AT YOUR OWN RISK. <br/> We accept no liability from its use.</strong></p>
</div>
</fieldset>
</form>
</body>
</html>
<?php die;
}

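// Note: the mysql_* functions used below were removed in PHP 7, so this script
// needs PHP 5 (or a port to mysqli) to run.

// Initialise the counters used in the final report
$count_tables_checked = 0;
$count_items_checked = 0;
$count_items_changed = 0;
$count_updates_run = 0;

$cid = mysql_connect($host,$usr,$pwd);

mysql_set_charset('utf8');

if (!$cid) { echo("Connecting to DB Error: " . mysql_error() . "<br/>"); }

// First, get a list of tables

$SQL = "SHOW TABLES";
$tables_list = mysql_db_query($db, $SQL, $cid);

if (!$tables_list) {
echo("ERROR: " . mysql_error() . "<br/>$SQL<br/>"); }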

// Loop through the tables

while ($table_rows = mysql_fetch_array($tables_list)) {

$count_tables_checked++;

$table = $table_rows['Tables_in_'.$db];

echo '<br/>Checking table: '.$table.'<br/>***************<br/>';  // we have tables!

$SQL = "DESCRIBE ".$table ;    // fetch the table description so we know what to do with it
$fields_list = mysql_db_query($db, $SQL, $cid);

// Make a simple array of field column names

$index_fields = "";  // reset fields for each table.
$column_name = array();
$table_index = array();
$i = 0;

while ($field_rows = mysql_fetch_array($fields_list)) {

$column_name[$i++] = $field_rows['Field'];

if ($field_rows['Key'] == 'PRI') $table_index[$i] = true ;

}

//    print_r ($column_name);
//    print_r ($table_index);

// now let’s get the data and do search and replaces on it…

$SQL = "SELECT * FROM ".$table;     // fetch the table contents
$data = mysql_db_query($db, $SQL, $cid);

if (!$data) {
echo("ERROR: " . mysql_error() . "<br/>$SQL<br/>"); }

while ($row = mysql_fetch_array($data)) {

// Initialise the UPDATE string we’re going to build, and we don’t do an update for each damn column…

$need_to_update = false;
$UPDATE_SQL = 'UPDATE '.$table. ' SET ';
$WHERE_SQL = ' WHERE ';

$j = 0;

foreach ($column_name as $current_column) {
$j++;
$count_items_checked++;

//            echo “<br/>Current Column = $current_column”;

$data_to_fix = $row[$current_column];
$edited_data = $data_to_fix;            // set the same now – if they’re different later we know we need to update

//            if ($current_column == $index_field) $index_value = $row[$current_column];    // if it’s the index column, store it for use in the update

// unserialise – if false returned we don’t try to process it as serialised

$unserialized = @unserialize( $data_to_fix );

if ( 'b:0;' === $data_to_fix || false !== $unserialized ) {
//                echo "<br/>unserialize OK - now searching and replacing the following array:<br/>";
//                echo "<br/>$data_to_fix";
//
//                print_r($unserialized);

recursive_array_replace($search_for, $replace_with, $unserialized);

$edited_data = serialize($unserialized);

//                echo "**Output of search and replace: <br/>";
//                echo "$edited_data <br/>";
//                print_r($unserialized);
//                echo "---------------------------------<br/>";

}

else {

if (is_string($data_to_fix)) $edited_data = str_replace($search_for,$replace_with,$data_to_fix) ;

}

if ($data_to_fix != $edited_data) {   // If they’re not the same, we need to add them to the update string

$count_items_changed++;

if ($need_to_update != false) $UPDATE_SQL = $UPDATE_SQL.',';  // if this isn’t our first time here, add a comma
$UPDATE_SQL = $UPDATE_SQL.' '.$current_column.' = "'.mysql_real_escape_string($edited_data).'"' ;
$need_to_update = true; // only set if we need to update – avoids wasted UPDATE statements

}

if (!empty($table_index[$j])){
// escape the key value too, so quotes in it cannot break the WHERE clause
$WHERE_SQL = $WHERE_SQL.$current_column.' = "'.mysql_real_escape_string($row[$current_column]).'" AND ';
}
}

if ($need_to_update) {

$count_updates_run++;

$WHERE_SQL = substr($WHERE_SQL,0,-4); // strip off the excess AND – the easiest way to code this without extra flags, etc.

$UPDATE_SQL = $UPDATE_SQL.$WHERE_SQL;
echo $UPDATE_SQL.'<br/><br/>';

$result = mysql_db_query($db,$UPDATE_SQL,$cid);

if (!$result) {
echo("ERROR: " . mysql_error() . "<br/>$UPDATE_SQL<br/>"); }

}

}

}

// Report

$report = $count_tables_checked." tables checked; ".$count_items_checked." items checked; ".$count_items_changed." items changed;";
echo '<p style="margin:auto; text-align:center">';
echo $report;

mysql_close($cid);

//  End TIMER
//  ---------
$etimer = explode( ' ', microtime() );
$etimer = $etimer[1] + $etimer[0];
printf( "<br/>Script timer: <b>%f</b> seconds.", ($etimer-$stimer) );
echo '</p>';
//  ---------

function recursive_array_replace($find, $replace, &$data) {

if (is_array($data)) {
foreach ($data as $key => $value) {
if (is_array($value)) {
recursive_array_replace($find, $replace, $data[$key]);
} else {
// have to check if it’s string to ensure no switching to string for booleans/numbers/nulls – don’t need any nasty conversions
if (is_string($value)) $data[$key] = str_replace($find, $replace, $value);
}
}
} else {
if (is_string($data)) $data = str_replace($find, $replace, $data);
}

}

?>
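
The heart of the script is the recursive walk over the unserialized data. As a rough sketch of that idea only, here is a Python version: it is not a port of the whole script, it works on dicts and lists instead of PHP arrays, and the name recursive_replace and the sample data are invented for the example.

```python
def recursive_replace(find, replace, data):
    """Return a copy of `data` with `find` replaced in every string value.

    Only strings are touched; numbers, booleans and None pass through unchanged,
    mirroring the is_string() checks in the PHP function. Dict keys are left alone.
    """
    if isinstance(data, dict):
        return {key: recursive_replace(find, replace, value) for key, value in data.items()}
    if isinstance(data, list):
        return [recursive_replace(find, replace, item) for item in data]
    if isinstance(data, str):
        return data.replace(find, replace)
    return data


# Hypothetical pane configuration, loosely shaped like the serialized example earlier.
pane = {
    "title": "<none>",
    "body": '<img src="/old/files/a.jpg">',
    "format": 2,
    "extras": ["/old/files/b.jpg", None, True],
}
print(recursive_replace("/old/", "/new/", pane))
```

Because the replacement happens on the unserialized structure, the string lengths are recalculated for free when the data is serialized again, which is the trick the PHP script relies on.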

 

 