MySQL whitepace symbol problem

Discussion in 'Web Design and Development' started by big_malk, Oct 19, 2009.

  1. big_malk macrumors 6502a

    Aug 7, 2005
    I'm developing a website and have been given a large dataset to use.
    I've put it in a MySQL table, but when it's displayed on a webpage sometimes spaces (as in the space bar) are shown as � (you might not see the same symbol I do if it's character set problem, it's a black diamond with a white question mark in it).
    These don't show in phpMyAdmin when I look at the data, they just show as normal spaces.
    I've doing an "UPDATE table SET field = REPLACE(field, '�', ' ')" query to replace them, but it won't work.
    At first I got a character encoding miss-match error, I changed the table to UTF-8 hoping it would help, it removed the miss-match error but not the problem. Now, that query runs but doesn't affect any rows, as if the two characters are identical.
    How can I sort this out?

    The field should only currently have 0-9 and spaces in it.
    Would it be possible to replace anything that isn't one of these with a standard space?
  2. big_malk thread starter macrumors 6502a

    Aug 7, 2005
    Ok, I've done more digging and it seems this weird symbol is a non-breaking space, and I've added htmlentities() to replace it with   in the HTML.
    This doesn't seems ideal though, I've tried copying the non-breaking space and using it in the MySQL replace query, but it hasn't helped.
    Is there anyway I can replace these inconvenient spaces with regular spaces in the database?

    Any help would be greatly appreciated!
  3. Groovetube macrumors member

    Jul 19, 2008
    is the database charactarset set to iso-8859-1 (or, not UTF-8)

    if so you'll need to set it on your php doc,
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  4. BertyBoy macrumors 6502

    Feb 1, 2009
    Welcome to working with databases, in particular, working with external data.

    Keep your htmlentities() in, you may also want htmlspecialchars(). External data will always be infested with unprintables. No matter what you do to tidy it up, more will come along with new streams of unprintables.

    You need to ensure that any keyed data is correctly formed - including removing embedded SQL - and remove all unprintables and re-encode any HTML special chars before displaying. It's only an issue if you use the field as a search key.

    I once had to tidy up 2 million post codes at one of the worlds largest insurers (in Oracle7). Apart from badly formed and invalid post codes, nearly all the corrupt codes had a tab at the end of the string, they'd been copied in from an Excel or other form based software. As fast as I fixed them, more were added, and post code was of course one of the primary search fields.
  5. big_malk thread starter macrumors 6502a

    Aug 7, 2005
    I changed each text/varchar row in the table to UTF-8, although this didn't help the problem. I have had the UTF-8 charset defined with the meta data you gave and in the .htaccess file.

    All of the data is coming from reliable sources as far as SQL injection is concerned (obviously still using precautions), just some of it's apparently not completely formatted correctly. This was just particularly confusing as it was 'invisible' characters, and it all looked perfect before I ran the query, and looked perfect in phpMyAdmin, it just didn't format correctly in the webpage.

    Is there no easy way to distinguish between and remove these different types of 'spaces' from the database? For the sake of tidiness, and this data will probably be being used in various other places including mobile devices soon, and we could run into further problems?

    Thanks for your advice!
  6. BertyBoy macrumors 6502

    Feb 1, 2009
    Ah, I meant that if you have web pages or like that allows users to key freetext, it needs to check for SQL injection. Data from the database used for display only should be safe.
    If you're absolutely convinced that there will be no further issues with receiving data then certainly set about cleaning it up.
    Start by identifying the suspect data. this can be as thorough as you want, from searching for known unprintables, to searching for anything other than A-Z, a-z, 0-9, space, etc. I can't help you with deciding how thorough you should be, that's up to the nature of the field containing the data.

    There's a REGEXP function in MySQL, read the MySQL documentation, it has plenty of examples (I checked). So you may, for instance, want to identify all rows that do NOT match against a valid pattern.
    ie. where name regexp '^([A-Z]|[a-z]|[0-9]| )*$' = 0;
    This would return all rows where the field 'name' was not made up entirely of A-Z or a-z or 0-9 or a regular space.
    It's then up to you to add in more valid chars, like basic punctuation, watching out for apostrophes and ampersands and slashes. Question marks, mathematical symbols, etc. as nauseum.
    You could use the equivalence classes, like alnum, digit, alpha, punct (all in the manual), but I've never used them in 24 years of Unix, I like to know the characters I'm matching against exactly.

    Fixing will depend on how many rows you finally identify as pants.

    Finding 2 million hosed post codes (it was about 1.94 million of 11 million on the system) was distressing to say the least. Hopefully you will fare better.
  7. SrWebDeveloper macrumors 68000


    Dec 7, 2007
    Alexandria, VA, USA
    I know for a fact upgrading from MySQL 4x to 5x has issues with encoding, specifically that 4x uses latin1 without specifying a character set and 5x uses UTF8 by default if no other type is set. This causes those characters to show up as you noted.

    The vendor suggests you re-export yourMySQL 4x db as latin1 and re-import into MySQL 5x using a tool like phpMyAdmin. If you've ever exported to .sql before look near the top of the file, just before the data inserts, for a line that ends with ") ENGINE=MyISAM" or whatever DB format you use. Then edit or add (whichever is necessary) on the end "DEFAULT CHARSET=latin1". For example, fully edited:

    This tells phpMyAdmin to use latin1 when importing to MySQL5x. Get the idea? This fixes the root cause of the issue, no other editing or coding is required so it's probably the best solution -- if this is indeed what happened to you. Just speculating.


Share This Page