Although Microsoft Word is reasonably smart about reading HTML and other formatting codes, sometimes you receive text that contains codes Word doesn’t understand.
In fact, writing this article may be a challenge. If it gets converted into Quark XPress format, the codes I’m talking about will disappear. If it gets turned into a Web page, the same thing will happen. Here’s why. Many formatting codes use less than and greater than signs. These symbols are what you get when you type Shift+, (comma) and Shift+. (period).
Because the codes are a problem, I’m going to have to be a little creative in explaining the nifty tip I figured out today. This morning, I ended up needing to extract some text from an RTF file that crashed and couldn’t be read as RTF anymore. Like HTML and XPress tags, RTF uses greater than and less than signs to signify formatting. If you read in an RTF file as plain text, you get lots of creepy formatting codes in addition to the text you are trying to extract. The same thing happens if you look at plain HTML (right-click and choose View|Source on any Web page and you’ll see what I mean).
To get the text out of my corrupted file, I needed to remove the formatting information so I would be left with just plain text. I didn’t want to laboriously remove the codes by hand, so I knew I needed to search for any text sitting between a less than sign and a greater than sign and replace it with nothing.
Because I didn’t know exactly what text would be inside each code, I knew I needed to use Word XP’s "wildcard" feature. The online help explains how you use the asterisk (*) wildcard to search for a string of characters. For example, if you type in s*d, Word finds both sad and started.
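For readers who do dabble in scripting, here is a rough illustration of the same asterisk idea outside of Word. It is only a sketch: it uses Python's standard fnmatch module, which is not Word's search engine, and it tests whole words rather than hunting for matches inside a document. The word list and the helper name matches_wildcard are made up for the example.

```python
from fnmatch import fnmatchcase


def matches_wildcard(word: str, pattern: str = "s*d") -> bool:
    """Return True when the whole word fits the shell-style pattern.

    The asterisk stands for any run of characters (including none),
    so "s*d" means: starts with s, ends with d, anything in between.
    """
    return fnmatchcase(word, pattern)


# Hypothetical sample words, chosen only to show hits and misses.
words = ["sad", "started", "sod", "stars", "dash"]
hits = [w for w in words if matches_wildcard(w)]
print(hits)  # ['sad', 'started', 'sod']
```

The words stars and dash are skipped because they do not both begin with s and end with d.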
The bad news is that in Word, the less than and greater than signs also mean something special in the Find and Replace box when wildcards are turned on (they mean the beginning or end of a word). After much experimentation, I finally found out that if I wanted the less than and greater than signs to be read as ordinary characters rather than special codes, I had to put a backslash in front of each of them. So to ditch the codes, here is what I ended up doing:
1. Choose Edit|Replace.
2. Click the More button and check Use wildcards.
3. Put backslash, less-than-sign, (*), backslash, greater-than-sign into the Find what box and leave the Replace with box blank. But use the real backslash, less than, and greater than characters. Like I said, HTML e-mail programs may trash the codes, but for those using plain text e-mail, here’s what it looks like: \<(*)\>
4. Click Find Next, Replace, or Replace All.
Word goes through and takes out all the codes. Since I’m not a programmer, figuring this out was a small technological triumph!
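If you ever need to do the same cleanup without Word, the steps above can be sketched in a few lines of Python. Treat this as an illustration under some assumptions rather than a finished tool: the sample string and the name strip_codes are invented for the example, and it assumes every code starts with a less than sign and ends with a greater than sign on the same line. Unlike Word's wildcard search, Python's regular expressions do not treat these two signs as special, so no backslashes are needed here.

```python
import re

# "<" followed by as few characters as possible, then ">".
# The "?" after "*" keeps each match short, so two separate codes
# are not swallowed together with the text sitting between them.
# By default "." does not match a line break, so a code that is
# split across two lines would be left alone.
CODE_PATTERN = re.compile(r"<.*?>")


def strip_codes(text: str) -> str:
    """Remove every <...> code and keep the surrounding text."""
    return CODE_PATTERN.sub("", text)


# Hypothetical sample text with a few codes mixed in.
sample = "<b>Hello</b>, <i>world</i>!"
print(strip_codes(sample))  # Hello, world!
```

The empty string passed to sub plays the same role as leaving the Replace with box blank.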