“Once Again I Have Cut a Worthless Object.”
Good journalists file clean copy. But what does that mean? This article gives journalists the basics they need to know to ensure that every character, word, sentence, and paragraph they intended to write gets correctly saved and reproduced on computer systems, and ultimately online and in print. You'll learn enough to avoid the dreaded borked Unicode.
Writing clean copy
Clean copy is an established concept in print journalism. Conservatively, the term refers to text that is spelled and punctuated correctly and makes sense. As an editor who was formative in my development, the late Sid Adilman of the Toronto Star, put it, the goal is to write an article that "reads well." Clean copy is a prerequisite for that.
But I want to expand the definition to cover character encoding. Your copy can't be considered "clean" unless and until it is stored and reproduced correctly. Getting character encoding right is an absolute necessity for working print journalists, which is all well and good except for the fact that nobody has ever bothered to teach journalists what character encoding is.
What you're going to learn
By the time you're done reading this article, you'll have gained knowledge of character encoding that leads to confidence that you can write clean copy, plus the muscle memory and good habits to actually do it.
Knowledge
Basic facts
The concepts involved in producing clean copy can extend way out to the horizon, but journalists don't need to worry about expert-level details. Here is the complete list of facts you need to know.
- Computers store characters as numbers. That includes visible characters like letters and numerals; whitespace characters like spaces and tabs; and invisible characters like optional hyphens. At root, they're all numbers. A system of enumeration of characters is called character encoding.
- In the past, there were dozens of conflicting sets of numbers used to enumerate characters. Those have all been superseded by a single numbering system, Unicode.
- Unicode does not include every conceivable character, but every character you the journalist will need is available in Unicode. Corollary: There is no such thing as a "special character." (Really, there isn't.)
- To write clean copy, everything you write, edit, pass on to someone else, receive from someone else, and publish has to use Unicode from beginning to end.
- Unicode is a large specification that can be expressed in a variety of ways, but in the normal course of events the only variation you need to know about is UTF‑8. You don't even have to know what that stands for.
- In essentially every case, you just type, insert, or paste the actual character you want – just as with characters you find noncontroversial. In rare cases, you will refer to characters by their Unicode number. (That's called using a character entity or escaping a character.)
There. Now you're up to speed. For the working hack, it really is that simple.
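If you want to see the "characters are numbers" idea for yourself, here is a minimal Python sketch. The function name `describe` is made up for illustration; `ord` and `str.encode` are standard Python. It prints the Unicode number of a character and the bytes UTF‑8 uses to store it.

```python
def describe(ch: str) -> str:
    """Return the Unicode number and the UTF-8 bytes for one character."""
    code_point = ord(ch)             # the number Unicode assigns to the character
    utf8_bytes = ch.encode("utf-8")  # how UTF-8 stores that number in a file
    return f"U+{code_point:04X} -> {utf8_bytes.hex(' ')}"


# A plain letter, an accented letter, an en dash, and the euro sign.
for ch in ["A", "é", "–", "€"]:
    print(ch, describe(ch))
```

The plain letter takes one byte and the others take two or three, which is one reason copy breaks when a system guesses the wrong encoding.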
Just type the character
The most important advice is the easiest: Just type the character you want. You may need to learn how to type it, but I'm going to teach you how. You may need to copy and paste it from another source. But the point is use the character you want right in your document. And do that everywhere – hed, dek, body copy, in RSS, on Twitter.
There are rare exceptions to this rule. When the character your system displays can be confused with something else or is simply invisible, as in the case of whitespace or non-breaking hyphen (see below), you need to enter a character entity, which uses a sequence of other characters to escape the character you actually want. In these cases, you're specifying the character by an agreed-upon name or by its Unicode number. You do that by starting the name or number with an ampersand and ending it with a semicolon. Some examples, purely for illustration purposes:
- &amp; for ampersand &
- &eacute; for é
- &agrave; or &#224; for à
You have to know this troublesome implementation detail because it is the only way to reliably enter and edit the few characters that need this approach. Absolutely do not use this method to enter what you think are "special characters" in the day-to-day run of your work as writer or editor.
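For the curious, this is roughly how named and numeric entities map to characters, sketched with Python's standard `html` module. The helper `numeric_entity` is a hypothetical name used only for this example.

```python
import html


def numeric_entity(ch: str) -> str:
    """Build the decimal character entity for one character."""
    return f"&#{ord(ch)};"


# Named and numeric entities decode to the same characters.
print(html.unescape("&amp; &eacute; &agrave; &#224;"))

# Going the other way: escape by Unicode number.
print(numeric_entity("à"))

# html.escape only touches the characters that are unsafe in markup.
print(html.escape("AT&T"))
```

Note that `html.escape` leaves é and à alone, which matches the advice above: escape only the few characters that need it.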
What can go wrong?
I want to make sure you know what I'm talking about, so I'm going to show you a few errors of character encoding. After you finish this article, you will be in a position to avoid borked Unicode like this for the rest of your career.
- A mismatch in server and browser settings means the browser can't figure out the character encoding and uses the wrong characters.
- Characters are saved as numerical references instead of the actual character, but that process goes wrong.
- The character may be in the file, but it's so badly encoded the browser can't even guess what it is. (The question mark in a diamond is a character from the Apple Last Resort font. It's widely seen in this scenario but it isn't any kind of "standard.")
- Accented characters just disappear.
- Smart quotes never live up to their name and simply go crazy.
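The first kind of error, text saved in one encoding and read in another, is easy to reproduce. The sketch below assumes the common case of UTF‑8 text misread as Windows-1252 (`cp1252` in Python); the function names `garble` and `repair` are invented for this example, and real-world damage is often messier than this.

```python
def garble(text: str, wrong: str = "cp1252") -> str:
    """Simulate UTF-8 text being read with the wrong encoding."""
    return text.encode("utf-8").decode(wrong)


def repair(text: str, wrong: str = "cp1252") -> str:
    """Undo that one specific mistake by reversing the two steps."""
    return text.encode(wrong).decode("utf-8")


clean = "Condé Nast – chèvre"
borked = garble(clean)
print(borked)          # accented letters and the dash turn into pairs and triplets of junk
print(repair(borked))  # back to the original
```

The repair only works when you know exactly which wrong encoding was applied and nothing was lost along the way, which is why prevention beats cure.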
If you need a poster child for character encoding, this is it
‘
In the English language, the giveaway character that can conclusively prove your copy is dirty is this: ‘ (opening single quotation mark). Why?
- If it's missing or replaced by ' (neutral apostrophe), that means someone can't type the character or thinks downstream systems can't accept it, or those downstream systems changed it behind your back.
- If it's replaced by apostrophe (’), it means someone relied on so-called smart quotes and didn't double-check for errors.
- If it's used where it shouldn't be – a common occurrence, as in rock ‘n’ roll or ‘90s grunge rock – it means someone didn't know the difference or (again) foolishly trusted smart quotes.
If you can't get opening single quotation mark right, you probably can't get anything right that isn't a nice easy letter or a number.
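A script can at least flag the most common case: an opening single quotation mark (‘, U+2018) sitting where an apostrophe (’, U+2019) belongs, before a decade like 90s or around a lone n. This is a rough sketch with a deliberately narrow pattern of my own choosing; it will miss other cases, and a human still has to make the call.

```python
import re

# Opening single quote followed by a decade (two digits + s) or by n plus apostrophe.
SUSPECT_OPENING_QUOTE = re.compile(r"‘(?=\d\ds\b|n’)")


def suspect_opening_quotes(text: str) -> list[int]:
    """Return the positions of opening single quotes that are probably wrong."""
    return [match.start() for match in SUSPECT_OPENING_QUOTE.finditer(text)]


print(suspect_opening_quotes("rock ‘n’ roll in the ‘90s"))  # two suspects
print(suspect_opening_quotes("she said ‘hello’"))           # nothing flagged
```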
Start using a good editor
You can bang out copy in Microsoft Word if you want. With modern versions of MS Word, that copy will always use Unicode. Problem solved? No, because Word is borderline useless when it comes to fixing someone else's copy.
At a minimum, you need to be able to do all of the following:
- Save and reopen files of unknown or incorrect encoding. (Goal: Always yield UTF‑8 plain text or markup like HTML. Never yield word-processing formats like .doc and RTF.)
- Remove carriage returns from inside paragraphs and other places.
- Reliably change neutral quotes to correct quotes (including when nested).
- Reliably change any of the countless incorrect analogues of en and em dashes to the real thing.
- Escape the rare characters that need it.
You can gin up macros to do most of these things in MS Word, but you have better things to do.
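To make the first requirement concrete, here is a sketch of what "reopen a file of unknown encoding and always yield UTF‑8" can look like in code. It assumes the only alternative to UTF‑8 is Windows-1252, which is a guess that happens to cover a lot of legacy English-language copy but is by no means universal. The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Decode bytes as UTF-8 if possible, else as the fallback; return UTF-8 bytes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy encoding (an assumption, not a detection).
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


legacy = b"caf\xe9"                        # "café" as saved by an old Windows program
print(to_utf8(legacy).decode("utf-8"))     # café
print(to_utf8("café".encode("utf-8")).decode("utf-8"))  # already UTF-8, left alone
```

A good text editor does this for you with a menu command, which is the point of the recommendation that follows.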
Perhaps you associate writing on a computer with "word processing," and associate word processing with the market leader, MS Word. But here I am strongly recommending you update your workflow to use a real text editor. You may not even know this category of software exists, but it does, and it is mature and can do everything you need.
- Mac options: BBEdit (strongly preferred); TextMate
- Windows options: You shouldn't be using Windows to edit clean copy. If you have no choice, consider Notepad++; Sublime Text; Dreamweaver. (Absolutely not the built-in Notepad utility, not even for emergency usage; it accepts Unicode handily but is too underpowered and inconvenient.)
Unreadable onscreen type leads to copy errors. Always use nice big fonts (16 pixels minimum in typical cases), and unless you know you need to, don't use monospaced fonts, especially not Courier.
What characters do people get wrong?
Apostrophes and quotation marks
Neutral versions of either of those (' and ") do not cut it anywhere, at any time, outside of programming and markup.
When consecutive quotes follow each other, or when an apostrophe sits inside a quotation mark, you have to use at least a thin space to separate them, as described below.
Accents and diacritics
By consensus, we don't have to write foreign words in their own script if it isn't Latin script. We don't have to write Москва for Moscow and 日本 for Japan. Foreign proper nouns with accepted English spellings, like Cologne (for Köln, Germany), don't have to be changed.
But one source of unclean copy is the belief that a word that contains accents or diacritics is weird and foreign and that the accents are optional. Wrong. Accents or diacritics are intrinsic to correct spelling. Just as cant and wont are words that differ from can't and won't, resume and résumé are two different words.
In case I'm not making myself clear, leaving out accents means you've misspelled the word.
To write clean copy, you have to know what accents are called. You can't use fake French names for accents; all of them already have English names. You cannot use vague, impressionistic, guaranteed-to-be-misunderstood descriptors like "the one that goes from left to right."
Why learn the right terminology? You can't work in ignorance, first of all, but more importantly, you have to be able to correct someone's copy, or instruct someone on how to enter text correctly, over the phone, in person, or via chat – anywhere you can't actually draw a character on a printout. ("No, the first letter is cap E acute. You sent me grave accent. Fix it.")
- Acute (just the word "acute," not fake-French "aksã aygoo"): Áá
- Grave (just the word "grave" that rhymes with "salve," not "aksã graav"): Àà
- Circumflex: Ââ
- Dieresis: Ää ("umlaut" is a term applied solely to German; an umlaut is also a dieresis)
- Cedilla (not "sedee"): Çç (Șș and Țț have a comma, not a cedilla, under them)
- Tilde (it's two syllables, like Swinton): Ãã (Ññ is just n-tilde and is not called "enya")
- Háček ("hatchek") or caron: Čč (but Ď/ď, Ľ/ľ )
- Macron: Āā
- Breve ("breev"): Ăă
- Double acute: Őő
- Ogonek: Ąą
Important single letters from other languages:
- Nordic languages (Swedish, Danish, Norwegian, Icelandic), when viewed all together, use these letters:
- A ring Åå
- AE digraph (not "ligature") or "ash" Ææ
- Slash O Øø
- Thorn Þþ
- Eth Ðð (that's a voiced consonant at the end of the word eth, as in breathe)
(I've grouped these letters for convenience. None of these four languages uses all of them.)
- Turkish: Dotless Iı; dotted İi; Ğğ
That isn't the full list of diacritics or letters, but the point is you must be able to name these letters and accents on sight, cold, without a cheatsheet. Do you think they'll never come up in the copy you write? Well, perhaps they won't, but they'll come up in copy you read and possibly in copy you edit.
- What if the Icelandic banking system collapses? Who won the Turkish election?
- If you're an arts writer and think this sort of thing is never going to come up because it's "too technical," well, don't half your friends work at Condé Nast? Then again, they can't get that name right either.
  And I assume you'll someday have to name-drop the family of actors named Skarsgård, or review popular mystery writers like Jo Nesbø and Arnaldur Indriðason, or write a thinkpiece about cinéma vérité or causes célèbres.
- And tell me: What kind of restaurant critic can't spell chèvre?
The idea that accents are not really an integral and mandatory component of good copy – it's just so passé.
Hyphens and dashes
You need to know when to use and how to enter an en dash (between these letters: a–b) and an em dash (between these letters: a—b).
- Style guides may insist you use an em dash not surrounded by spaces, but that usage does not work online (browsers break lines unpredictably) and barely works in print. Use space-endash-space by preference.
- Typewriter-style fake em dashes -- (two hyphens maybe surrounded by spaces) have to be converted. Don't ever use hyphen-hyphen as a claimed em dash. First of all, it isn't, and second of all, some system will surely come along and break a line between the dashes.
- Space-hyphen-space is not an em dash and needs to be converted.
- You have to manually inspect copy for en dashes errantly replaced by hyphen. I'm talking about copy that might have been right in the original but was dumbed down for the Web by someone or some system who didn't understand Unicode – e.g., converting New York–style pizza to New York-style pizza or New-York-style pizza.
- Non-breaking hyphen is available and should be used, especially in small alphanumeric sequences (IRS W‑8B form) or in words that begin or end with a hyphen (left‑ and right-hand drive; free-speech and ‑assembly rights). As with whitespace (see below), you need to escape the character – enter it as &#8209;.
- Your system may mis-encode the regular visible hyphen any number of ways, including as a non-breaking hyphen that is immediately followed by a displayed optional hyphen or as two hyphens.
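The two mechanical conversions in that list (hyphen-hyphen and space-hyphen-space) can be sketched in a few lines. This assumes you follow the space-endash-space preference stated above, and the function name `fix_fake_dashes` is made up. It does nothing about en dashes that were flattened to hyphens inside compounds like "New York-style"; those still need a human eye.

```python
import re

EN_DASH = "\u2013"


def fix_fake_dashes(text: str) -> str:
    """Turn typewriter-style fake dashes into space-endash-space."""
    # Two or more hyphens, with or without a space on either side.
    text = re.sub(r" ?-{2,} ?", f" {EN_DASH} ", text)
    # A single hyphen with a space on both sides.
    text = text.replace(" - ", f" {EN_DASH} ")
    return text


print(fix_fake_dashes("Wait--what? One - two -- three."))
print(fix_fake_dashes("a well-known New York-style pizza"))  # unchanged
```

Run something like this only on prose. In source code or command-line examples, two hyphens often mean exactly two hyphens.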
Whitespace
The standard wordspace (what you get when you press the spacebar on your computer) is noncontroversial in the hack context.
Among the many available whitespace characters:
- Em space (between these letters: a b)
- En space (between these letters: a b)
- Thin space (between these characters: ’ ”)
- Non-breaking space (in the middle of this postal code: M5W 1E6)
These space characters are tricky enough that you should use character entities for them. Why? So you can identify them with certainty in source code. Two normal wordspaces look a lot like an em space. Of course no one writes two consecutive spaces in normal English prose, but HTML source code and many other contexts allow two consecutive wordspaces, collapsing them to a single wordspace. To tell those apart from an intended em or en space, use a character entity for them.
For regular Web pages, just use &emsp; and &ensp; for em and en space. Thin space is, not surprisingly, &thinsp;. Non-breaking space is &nbsp;.
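If copy arrives with these invisible characters typed in directly, a short script can swap each one for its entity so you can see it in source. This is a sketch; the table and the function name `escape_invisibles` are my own, and I have added the non-breaking hyphen from the previous section as a numeric entity.

```python
# Characters that are hard to tell apart by eye, mapped to their HTML entities.
INVISIBLE_ENTITIES = {
    "\u2003": "&emsp;",    # em space
    "\u2002": "&ensp;",    # en space
    "\u2009": "&thinsp;",  # thin space
    "\u00a0": "&nbsp;",    # non-breaking space
    "\u2011": "&#8209;",   # non-breaking hyphen
}

_TABLE = str.maketrans(INVISIBLE_ENTITIES)


def escape_invisibles(text: str) -> str:
    """Replace hard-to-see spaces and hyphens with character entities."""
    return text.translate(_TABLE)


print(escape_invisibles("M5W\u00a01E6"))     # M5W&nbsp;1E6
print(escape_invisibles("W\u20118B form"))   # W&#8209;8B form
```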
Arrows
Dozens of arrow characters exist in Unicode and you should never use sequences of punctuation in their place. To show four of many options:
- Leftarrow ← not <--
- Rightarrow → not -->
- Uparrow ↑ not ^ (caret)
- Downarrow ↓
Fractions
Many commonly-used fractions are available as predefined Unicode characters. There is no reason to write something like 3 1/4 ever again. (Also inadvisable because, somewhere along the way, a line will break between the integer and the fraction.)
- ½ · ⅓ · ¼ · ⅕ · ⅙ · ⅐ · ⅛ · ⅑ · ⅒
- ⅔ · ⅖
- ¾ · ⅗ · ⅜
- ⅘
- ⅚ · ⅝
- ⅞
Subscripts and superscripts
Do not try to fake subscript or superscript numerals by using smaller font sizes. Also don't give up and just write a normal number inside parentheses or brackets. Use the actual Unicode numerals:
- Subscript: ₀₁₂₃₄₅₆₇₈₉
- Superscript: ⁰¹²³⁴⁵⁶⁷⁸⁹
Superscribed and subscribed letters and other characters are barely available in Unicode. This is another way of saying you can rely only on sub/super numerals, not letters or anything else.
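Because only the ten numerals are dependable, converting them is a simple one-to-one swap. A minimal sketch, with function names of my own invention:

```python
DIGITS = "0123456789"
_SUPERSCRIPT = str.maketrans(DIGITS, "⁰¹²³⁴⁵⁶⁷⁸⁹")
_SUBSCRIPT = str.maketrans(DIGITS, "₀₁₂₃₄₅₆₇₈₉")


def superscript(digits: str) -> str:
    """Convert ordinary numerals to Unicode superscript numerals."""
    return digits.translate(_SUPERSCRIPT)


def subscript(digits: str) -> str:
    """Convert ordinary numerals to Unicode subscript numerals."""
    return digits.translate(_SUBSCRIPT)


print("100 m" + superscript("2"))       # 100 m²
print("H" + subscript("2") + "O")       # H₂O
print("footnote" + superscript("12"))   # footnote¹²
```

Anything that isn't a numeral passes through untouched, which is consistent with the warning about letters.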
Basic symbols
It shouldn't surprise you that basic symbols exist in Unicode. As with arrows, you shouldn't use other characters to fake them.
- Currency: $, £, and ¢ are easy ones, but nearly 20 years after the introduction of the euro (not "Euro"), hacks still write currency values like "20.99 euros" instead of using the available € symbol (€20.99).
- Legal: ©, ®, and ™ but not (C), (R), and (TM); servicemark ℠ and music publishing ℗ are available. ¶ and § can and should be used instead of "para." and "s" or "S.": ¶2, ¶¶7–8, §A2, §§1–3.
- Heart: I♥NY as much as anybody, but I do not [heart] it or <3 it.
- Math: The letter eks is not a multiplication sign; × is. (Hence no one buys a stack of "2x4s" at the lumber yard.) An opinion poll may be valid ±3 points 19 times out of 20, but it isn't valid "+/‑" 3 points.
And at the very least use an en dash for a minus sign in reporting temperatures and other figures. (We can hash out the vagaries of Unicode in this regard another time.)
Muscle memory and habits
You can easily find guides on the Web about typing "special characters." These won't help you at all. True, you do need to learn the technical basics about how to enter a character that isn't printed on your keyboard. But what all those online guides don't tell you is you need to develop muscle memory and habits.
- If you can touch-type, you need to be able to touch-type common characters like accents and dashes. You can't stop to think about it; that's the kind of barrier that makes people think "To hell with it!" and just enter an incorrect substitute character.
- When typing isn't an option, you have to develop reliable ways of solving the problem, which can be as simple as Googling the character and copying and pasting.
To form muscle memory and habits, you need practice, practice, practice. The way to do this is to write a lot of copy using characters that aren't simple letters and numbers. Don't do that often? Then put some time into your professional development and carry out exercises like these.
- Print out, or just set up in a separate window, easy text that uses accents, like Wikipedia's lists of foreign-language terms used in English (e.g., from French, from German). Sit down and retype a few dozen of the entries.
- Look up any common concept or noun (water, sky, calcium, life, baby) at Wikipedia, pick a translation in the target language of your choice, and retype a few paragraphs.
- Take some well-edited copy from a print publication (e.g., a popular U.S. consumer magazine like Vanity Fair or GQ) and retype a few paragraphs. (For advanced learners, try duplicating the New Yorker's idiosyncratic style.)
At all times use correct quotation marks and em and en dashes even if not in the original (a skill to develop in itself). These exercises will begin the process of instilling the muscle memory of typing the right characters.
If you aren't a ten-finger touch-typist, do the same exercises. Even if you hunt and peck, you have your own muscle memory to develop.
Three ways to enter a character
Excluding speech recognition and other edge cases, you have three options for entering any character.
- Type it on a hardware or onscreen keyboard. May include modifier keys like Shift, Option, Command, or Ctrl. May involve dead keys that do nothing until you type the next key (e.g., Option-n N to produce Ñ).
- Look it up and paste it. This is by far the best option for the first time a document needs a truly rare character. But even in an age where we Google everything, this just never occurs to working hacks. Google the character you're looking for and copy and paste it into your copy. (Now you see why you need to know what characters are called.) You then have the character at the ready for further copying and pasting. (Use Paste Special to paste as unformatted text in Microsoft Word or equivalent.)
- Use a character picker. InDesign, Microsoft Word, Mac OS X, and countless other software "boasts" graphical character maps you can use to "easily" insert "special" characters. In practice they are phenomenally difficult and cumbersome to use and require near-expert knowledge just to figure out which category to look in. Use as a last resort.
And here's how you absolutely are not going to enter a character:
- On Windows, look up a number for a character and type it while holding down Alt, or type that number and then press Alt-X. Online guides tell you this is the only way to enter "special" characters. They're wrong and they're missing the point. It's not that the numerical method is the only method, it's the worst method. (Could you enter a full sentence of French text that way?)
- Use "smart quotes." Decades on, so-called smart quotes still cannot handle English usage. It is smart quotes that cause the typical opening-single-quote errors mentioned before (rock ‘n’ roll, ‘90s). Even typographers make this mistake.
  Systems cannot handle consecutive quotation marks (as in quotes within quotes). Smart quotes cannot disentangle typical British usage, which presents many structural ambiguities (opening single quote or apostrophe? closing single quote or apostrophe?). Advanced hacks, who are rare, can overcome the failings of smart quotes. Everybody else falls prey to them, and the victims are readers.
This is not a time for "platform equivalency"
Typically but not universally, print publications are all-Mac shops. Online publications, and freelancers and independent bloggers, very often use Macs.
These people have nothing to worry about. It is dead simple to write, save, and transmit clean copy on Macs. Important keystrokes on Mac have not changed since the introduction of the Macintosh in 1984 and it is readily possible for a ten-finger typist to touch-type them.
The trouble here is Windows. You need to have a lot of Unicode knowledge even to debate the issue, but just take my word for it on two counts here, and accept that I am not saying this just because I am a Macintosh supremacist.
- At the system level, Windows has phenomenal Unicode support (noticeably better than Mac OS).
- At the user level, since the day Windows was introduced it has been punishingly difficult (bordering on impossible) to type anything that isn't printed on the keycaps of a U.S. keyboard.
One explanation for why you see so many incorrect characters online is simple: The writers are on Windows, and even if they know the right character, they pretty much can't figure out how to type it.
So: This is not a time for platform equivalency. Use of Microsoft Windows prevents people from producing clean copy in the real world. Hence they over-rely on smart quotes, but that just means the system beats you up twice – once in your inability to type a character, once more when the computer guesses wrong.
Solution for Windows users
There is a reasonable fix for professional writers using Windows: Turn on the U.S. International keyboard layout. Do that and all of a sudden you have the equivalent of a Mac keyboard layout and almost everything just works.
Common errors you need to fix
When somebody hands you copy, you need to do all the following to ensure it is clean:
- Change all tab characters to space characters. Don't just delete them, because they could be separating other characters or words and you don't want to destroy that separation. Replace one tab character (^t in MS Word, \t in BBEdit) with one space.
- Remove all soft-hyphen characters, particularly word-initial soft hyphens used as a command not to hyphenate that word. These exported characters just aren't handled well by browsers. Soft hyphens exported by software like InDesign or newspaper composition systems like Harris or Atex will use the wrong character.
  - Soft hyphen displayed as ¬.
  - Soft hyphens erratically replaced by space or hyphen.
  - To make this work, export a test document known to have soft hyphens. Open that document in your text editor. Find the soft-hyphen character and save it in a document you can find later. Replace that character with nothing. Do the same with all your files from now on.
  Any number of fake soft-hyphen characters are possible; you have to test your own system first to know which one it uses. Ideally your system would just stop outputting soft hyphens, but let's be realistic.
  If for some reason you want to add soft hyphens later on, do so at that later stage. I strongly discourage this practice because it is the rendering engine (whatever sets the type in the final stage before a reader sees it) that should insert soft hyphens. Manually-inserted soft hyphens should be reserved for intentional fine control, not as a task you apply to every line of text.
- Check every non-breaking space to ensure it really has to be there. Systems include &nbsp; for a variety of spurious reasons, and your copy can be littered with them without your knowing. In particular, Mac users must check that inadvertent typing of Option-spacebar or Command-spacebar did not introduce one non-breaking-space character for each such keypress.
- Remove linebreaks inside paragraphs.
- Convert quotation marks and dashes (never a strictly automated process).
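The mechanical parts of that checklist (tabs, soft hyphens, linebreaks inside paragraphs) can be scripted. What follows is a sketch under a few assumptions of mine: paragraphs are separated by blank lines, and your system's soft hyphen is the real Unicode one (U+00AD) unless you pass in whatever fake character you found in your own test export. The names `clean_copy` and `nbsp_positions` are hypothetical. Non-breaking spaces are only reported, not removed, because each one needs a decision.

```python
import re

SOFT_HYPHEN = "\u00ad"
NBSP = "\u00a0"


def clean_copy(text: str, soft_hyphens: tuple[str, ...] = (SOFT_HYPHEN,)) -> str:
    """Apply the mechanical fixes: tabs, soft hyphens, linebreaks in paragraphs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = text.replace("\t", " ")                          # one tab becomes one space
    for character in soft_hyphens:                          # replace with nothing
        text = text.replace(character, "")
    paragraphs = re.split(r"\n\s*\n", text)                 # blank line = paragraph break
    joined = []
    for paragraph in paragraphs:
        # Strip only ordinary spaces so non-breaking spaces survive for review.
        lines = [line.strip(" ") for line in paragraph.split("\n")]
        joined.append(" ".join(line for line in lines if line))
    return "\n\n".join(paragraph for paragraph in joined if paragraph)


def nbsp_positions(text: str) -> list[int]:
    """List where non-breaking spaces occur so a human can check each one."""
    return [index for index, ch in enumerate(text) if ch == NBSP]


print(clean_copy("one\ttwo\r\nthree co\u00adpy\n\nnext paragraph"))
print(nbsp_positions("M5W\u00a01E6"))
```

Quotation marks and dashes are deliberately left out, for the reason given in the last item of the list.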
Exporting from Word and PDF
You will receive countless MS Word files and PDFs. These are a leading cause of borked Unicode, especially when block quotations from these sources are plunked inside correctly-encoded text.
What goes wrong?
- Early MS Word versions – still very much in use in large corporate environments – didn't use Unicode by default and mis-encode characters. MS Word is engineered to avoid displaying the wrong character, but it remains the wrong character internally and stays wrong when you paste or drag it somewhere else.
- PDFs can include nearly any imaginable kind of digital object, including visible characters interspersed with invisible characters and sentences with no space characters between words. (In that case a PDF just displays a break between words even if it doesn't actually exist.) What you see is not necessarily what you get in PDF.
The following workflow avoids most mis-encodings from Word and PDF.
- Do not copy and paste from either of these sources. Don't drag and drop, either.
- You must export as plain text (UTF‑8) in Word and then open the resulting file in your editor.
- You also must export as Text (Plain) in Acrobat. Oddly, the Text (Accessible) setting produces worse results. Even this won't work all the time because PDFs can include text that is difficult to export. Open the resulting file in your editor.
Whether you like it or not, you have no real choice but to use Adobe Acrobat for this function, not any other PDF viewer.
Common errors on specific platforms
- WordPress double-prime error. When typing inside the textarea on a WordPress blog with smart quotes turned on, any sequence that ends in a numeral followed by a neutral quotation mark becomes a numeral followed by a double-prime character, ″. This may also happen even if a period or comma sits between the numeral and the end quote. (I long ago reported this bug.)
  Solution: Type real quotation marks, or at least always check for the double-prime character and change it.
- The quote/emdash combo. Some blogs (the Awl; Gawker) used a nonsensical combination of neutral quotation marks but also em dashes. This practice makes believe that em dashes are easy characters while quotation marks are hard. Do not follow this pattern.
- Newspapers. Online versions of print newspapers, including the New York Times, export what are surely internally correct quotation marks as neutral. Do not follow this pattern.
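Checking for the WordPress double-prime error described above is easy to automate. The sketch below looks for a numeral, optionally followed by a period or comma, and then a double prime (U+2033), and swaps in a closing double quotation mark. That replacement is my assumption about what was intended; a genuine double prime (inches, arcseconds) would be changed too, so review the result. The function name is made up.

```python
import re

DOUBLE_PRIME = "\u2033"
CLOSING_QUOTE = "\u201d"

# A numeral, an optional period or comma, then the stray double prime.
_STRAY_DOUBLE_PRIME = re.compile(r"(\d[.,]?)" + DOUBLE_PRIME)


def fix_double_prime(text: str) -> str:
    """Replace a double prime after a numeral with a closing double quote."""
    return _STRAY_DOUBLE_PRIME.sub(r"\1" + CLOSING_QUOTE, text)


print(fix_double_prime("He called it \u201cRoute 66\u2033 all day."))
print(fix_double_prime("\u201cIt happened in 1984.\u2033"))
```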
Policy changes
I believe every publication, including every individual blogger, should specifically invite corrections of typos and copy errors. A lot of sites do that already, but I think it should be universal. And the invitation to submit corrections should specifically mention that readers can report characters that don't show up properly. In other words, you or your publication should have a stated policy inviting people to report character-encoding errors in the same way they'd report any other copy error.
Useful links
- On the Twitters: @BorkedUnicode.
- Almost pointlessly, a Flickr group to which you can upload screenshots.
- Can't figure out what a character is? Copy and paste it, or a string of characters, into Richard Ishida's String Analyzer, which will return a list of every character and its Unicode name.
- The definitive book on Unicode for informed amateurs is Jukka K. Korpela's Unicode Explained. You really can sit there and read all 600-odd pages (I did); the book explains Unicode as simply as it ever could be explained. Or you can just search the Google Books version for help with tricky cases.
- Looking for a searchable online database of Unicode characters? There are several, but Fileformat.info is the one I use.
- Joel Spolsky's classic article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" isn't so complicated that only hackers can understand it. Hacks will find it informative, too.
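If you have Python on hand, you can get a rough local equivalent of a string analyzer with the standard `unicodedata` module. This is only a sketch, not a substitute for the tools linked above; the function name `analyze` is my own.

```python
import unicodedata


def analyze(text: str) -> list[tuple[str, str]]:
    """Return the Unicode number and official name of every character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


for number, name in analyze("é–\u00a0″"):
    print(number, name)
```

Paste in a mystery character and you get the name you need in order to Google it, or to tell someone over the phone what they actually typed.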
Published 2011.11.14 ¶ Updated 2018.07.23
Source: https://joeclark.org/borked/