File information

File name: ConvertCharset.class.php
File size: 21 KB (604 lines)
Language: PHP

/**
* @author Mikolaj Jedrzejak <mikolajj@op.pl>
* @copyright Copyright Mikolaj Jedrzejak (c) 2003-2004
* @version 1.0 2004-07-27 00:37
* @link http://www.unicode.org Unicode Homepage
* @link http://www.mikkom.pl My Homepage
*
**/
"\\\\","/""/" . "ConvertTables" . "/""CONVERT_TABLES_DIR""DEBUG_MODE", 1);

/**
* -- 1.0 2004-07-28 --
*
* -- The most important thing --
* I want to thank all the people who helped me fix all the bugs, both small and big ones.
* I hope that you don't mind that your names are in this file.
*
* -- Some Apache issues --
* I got information from Lukas Lisa that, in some cases with a special Apache configuration,
* you have to call the header() function with the proper encoding to get your result
* displayed correctly.
* If you want to see what I mean, go to demo.php and demo1.php
*
* -- BETA 1.0 2003-10-21 --
*
* -- You should know about... --
* To understand this class well you should read all of this first :) but if you are
* in a hurry just start demo.php and see what's inside.
* 1. I'm not good at English at 03:45 :) - so forgive me all the mistakes
* 2. This class is a BETA version because I haven't tested it enough
* 3. Feel free to contact me with questions, bug reports and mistakes in PHP and this documentation (email below)
*
* -- In a few words... --
* Why the ConvertCharset class?
*
* I made this class because I had a lot of problems with different charsets. First, because people
* from Microsoft wanted to have their own encoding; second, because people from Macromedia didn't
* think about other languages; third, because sometimes I need to use text written on a Mac, and of course
* it has its own encoding :)
*
* Notice & remember:
* - When I say "1 byte string" I mean 1 byte per char.
* - When I say "multibyte string" I mean more than one byte per char.
*
* So, these are the main FEATURES of this class:
* - conversion between 1 byte charsets
* - conversion from a 1 byte charset to a multibyte charset (utf-8)
* - conversion from a multibyte charset (utf-8) to a 1 byte charset
* - every conversion output can be saved with numeric entities (browser charset independent - not the full truth)
*
* This is a list of charsets you can operate with. The basic rule is that a char has to be in both charsets,
* otherwise you'll get an error.
*
* - WINDOWS
*   - windows-1250 - Central Europe
*   - windows-1251 - Cyrillic
*   - windows-1252 - Latin I
*   - windows-1253 - Greek
*   - windows-1254 - Turkish
*   - windows-1255 - Hebrew
*   - windows-1256 - Arabic
*   - windows-1257 - Baltic
*   - windows-1258 - Viet Nam
*   - cp874 - Thai - this file is also for DOS
*
* - DOS
*   - cp437 - Latin US
*   - cp737 - Greek
*   - cp775 - BaltRim
*   - cp850 - Latin1
*   - cp852 - Latin2
*   - cp855 - Cyrillic
*   - cp857 - Turkish
*   - cp860 - Portuguese
*   - cp861 - Iceland
*   - cp862 - Hebrew
*   - cp863 - Canada
*   - cp864 - Arabic
*   - cp865 - Nordic
*   - cp866 - Cyrillic Russian (this is the one used in IE as "Cyrillic (DOS)")
*   - cp869 - Greek2
*
* - MAC (Apple)
*   - x-mac-cyrillic
*   - x-mac-greek
*   - x-mac-icelandic
*   - x-mac-ce
*   - x-mac-roman
*
* - ISO (Unix/Linux)
*   - iso-8859-1
*   - iso-8859-2
*   - iso-8859-3
*   - iso-8859-4
*   - iso-8859-5
*   - iso-8859-6
*   - iso-8859-7
*   - iso-8859-8
*   - iso-8859-9
*   - iso-8859-10
*   - iso-8859-11
*   - iso-8859-12
*   - iso-8859-13
*   - iso-8859-14
*   - iso-8859-15
*   - iso-8859-16
*
* - MISCELLANEOUS
*   - gsm0338 (ETSI GSM 03.38)
*   - cp037
*   - cp424
*   - cp500
*   - cp856
*   - cp875
*   - cp1006
*   - cp1026
*   - koi8-r (Cyrillic)
*   - koi8-u (Cyrillic Ukrainian)
*   - nextstep
*   - us-ascii
*   - us-ascii-quotes
*
* - DSP implementation for NeXT
*   - stdenc
*   - symbol
*   - zdingbat
*
* - And specially for old Polish programs
*   - mazovia
*
* -- Now, to the point... --
* Here are the main variables.
*
* DEBUG_MODE
*
* You can set this value to:
* - -1 - No errors or comments
* - 0 - Only error messages, no comments
* - 1 - Error messages and comments
*
* The default value is 1, and during your first steps with the class it should be left as is.
*
* CONVERT_TABLES_DIR
*
* This is the place where you store all the files with charset encodings. Filenames should have
* the same names as the encodings. My advice is to keep the existing names, because they
* were taken from unicode.org (www.unicode.org), and after an update to Unicode 3.0 or 4.0
* the names of the files will be the same, so if you want to save your time...uff, leave the
* names as they are for future updates.
*
* The directory with the encoding files should be in the class location directory by default,
* but of course you can change it if you like.
*
* @package All about charset...
* @author Mikolaj Jedrzejak <mikolajj@op.pl>
* @copyright Copyright Mikolaj Jedrzejak (c) 2003-2004
* @version 1.0 2004-07-27 23:11
* @access public
*
* @link http://www.unicode.org Unicode Homepage
**/
//This value keeps information about whether the string contains multibyte chars.
// This value keeps information about whether the output should use numeric entities.

/**
* CharsetChange::NumUnicodeEntity()
*
* Unicode encoding bytes, bits representation.
* Each b represents a bit that can be used to store character data.
* - bytes, bits, binary representation
* - 1, 7, 0bbbbbbb
* - 2, 11, 110bbbbb 10bbbbbb
* - 3, 16, 1110bbbb 10bbbbbb 10bbbbbb
* - 4, 21, 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
*
* This function is written in a "long" way, for everyone who would like to analyze
* the process of Unicode encoding and understand it. All other functions, like HexToUtf,
* will be written in the "shortest" way I can write them :) which doesn't mean they are short,
* of course. You can check it in HexToUtf() (link below) - a very similar function.
*
* IMPORTANT: Remember that the $UnicodeString input CANNOT contain single-byte upper-half
* extended ASCII codes. Why? Because there is a possibility that this function will eat
* the following char, thinking it's a multibyte Unicode char.
*
* @param string $UnicodeString Input Unicode string (1 char can take more than 1 byte)
* @return string This is the input string, also with Unicode chars, but saved as entities
* @see HexToUtf()
**/
"" //1 7 0bbbbbbb (127)
//2 11 110bbbbb 10bbbbbb (2047)
"&#%d;" //3 16 1110bbbb 10bbbbbb 10bbbbbb
"&#%d;" //4 21 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
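The 1 to 4 byte layout listed in the NumUnicodeEntity() comment can be sketched in both directions. This is a minimal sketch, not the code from this file: the function names `sketchHexToUtf8()` and `sketchUtf8ToEntities()` are hypothetical, the input is assumed to be valid UTF-8, and the type declarations assume PHP 7 or later.

```php
<?php
// Illustrative sketch only; these helper names are hypothetical and are
// not taken from ConvertCharset.class.php.

/**
 * Encode one Unicode code point, given in hex (e.g. "0142"), as UTF-8 bytes.
 */
function sketchHexToUtf8(string $hex): string
{
    $cp = (int) hexdec($hex);
    if ($cp < 0x80) {            // 1 byte:  0bbbbbbb
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110bbbbb 10bbbbbb
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110bbbb 10bbbbbb 10bbbbbb
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

/**
 * Replace every multibyte UTF-8 sequence with a decimal numeric entity
 * such as "&#322;". Plain ASCII bytes are copied unchanged.
 */
function sketchUtf8ToEntities(string $utf8): string
{
    $out = '';
    $len = strlen($utf8);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($utf8[$i]);
        if ($byte < 0x80) {                    // single byte, keep as-is
            $out .= $utf8[$i];
            continue;
        }
        if (($byte & 0xE0) === 0xC0) {         // lead byte of a 2-byte sequence
            $extra = 1;
            $cp = $byte & 0x1F;
        } elseif (($byte & 0xF0) === 0xE0) {   // lead byte of a 3-byte sequence
            $extra = 2;
            $cp = $byte & 0x0F;
        } else {                               // treated as a 4-byte lead byte
            $extra = 3;
            $cp = $byte & 0x07;
        }
        // Each continuation byte contributes its low 6 bits.
        for ($j = 0; $j < $extra && $i + 1 < $len; $j++) {
            $cp = ($cp << 6) | (ord($utf8[++$i]) & 0x3F);
        }
        $out .= sprintf('&#%d;', $cp);
    }
    return $out;
}
```

In this sketch a stray byte such as 0xE9 from a 1 byte charset is taken as a lead byte and swallows the bytes that follow it, which is the caveat the IMPORTANT note describes.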
"&#%d;"
/**
* ConvertCharset::HexToUtf()
*
* This simple function takes a Unicode char of up to 4 bytes and returns it as a regular char.
* It is very similar to the UnicodeEntity function (link below). There is one difference:
* the returned format. This time it's regular char(s); in most cases it will be one or two chars.
*
* @param string $UtfCharInHex Hexadecimal value of a Unicode char.
* @return string Encoded hexadecimal value as a regular char.
* @see UnicodeEntity()
**/
""
/**
* CharsetChange::MakeConvertTable()
*
* This function creates a table from two SBCS (Single Byte Character Set) encodings. Every conversion
* goes through this table.
*
* - The files with encoding tables have to be saved in "Format A" of the unicode.org charset table format! This is usually written in the header of every charset file.
* - BOTH charsets MUST be SBCS
* - The files with encoding tables have to be complete (none of the chars can be missing, unless you are sure you are not going to use them)
*
* A "Format A" encoding file, if you have to build it yourself, should follow these rules:
* - you can comment everything with #
* - the first column contains 1 byte chars in hex starting from 0x..
* - the second column contains the Unicode equivalent in hex starting from 0x....
* - every following column is optional, but in "Format A" it should contain the Unicode char name and/or your own comment
* - the columns can be separated by "spaces", "tabs", "," or any combination of these
* - below is an example
*
* <code>
* #
* # The entries are in ANSI X3.4 order.
* #
* 0x00 0x0000 # NULL end extra comment, if needed
* 0x01 0x0001 # START OF HEADING
* # Oh, one more thing, you can make comments inside of a rows if you like.
* 0x02 0x0002 # START OF TEXT
* 0x03 0x0003 # END OF TEXT
* next line, and so on...
* </code>
*
* You can get full tables with encodings from http://www.unicode.org
*
* @param string $FirstEncoding Name of the first encoding and the first encoding filename (they have to be the same)
* @param string $SecondEncoding Name of the second encoding and the second encoding filename (they have to be the same). Optional for building a joined table.
* @return array Table necessary to change one encoding to another.
**/
""
/**
* Because func_*** can't be used inside of another function call
* we have to save it as a separate value.
**/
//Print an error message
"r" //This die(); is just to make sure...
/**
* We assume that a line is not longer
* than 1024, which is the default value for the fgets function
**/
/**
* We don't need all the comment lines. I check only for the "#" sign, because
* this is how unicode.org makes comments in their encoding files,
* and that's where the files are from :-)
**/
"#")
{
/**
* Sometimes inside the charset file the hex values are separated by
* a "space" and sometimes by a "tab"; the preg_split below can also be used
* to split files where the separator is ",", "\r", "\n" or "\f"
**/
"/[\s,]+/", $OneLine, 3); //We need only first 2 values
/**
* Sometimes a char is UNDEFINED or missing, so we can't use it for conversion
**/
"#""0x"), """0x"), "" //if (substr($OneLine,...
} //if($OneLine=trim(f...
} //while(!feof($FirstFileWi...
} //for($i = 0; $i < func_...
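The "Format A" rules and the parsing comments above (skip "#" lines, split on whitespace or commas, ignore UNDEFINED rows) can be sketched as follows. This is an illustration under assumptions, not the body of MakeConvertTable(): `sketchParseFormatA()` is a hypothetical helper, it reads lines from an array rather than from a file in CONVERT_TABLES_DIR, and the sample rows are made up in the style of the unicode.org tables.

```php
<?php
// Illustrative sketch only: sketchParseFormatA() is a hypothetical helper,
// not the MakeConvertTable() implementation from this file.

/**
 * Parse the lines of a "Format A" table into a map of
 * byte value in hex => Unicode value in hex.
 *
 * @param string[] $lines Lines of the encoding file.
 * @return array Map such as "B3" => "0142".
 */
function sketchParseFormatA(array $lines): array
{
    $table = [];
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip blank lines and full-line comments.
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // Columns may be separated by spaces, tabs or commas;
        // only the first two columns matter here.
        $cols = preg_split('/[\s,]+/', $line, 3);
        if (count($cols) < 2) {
            continue;
        }
        // Rows for UNDEFINED chars have no usable "0x...." second column.
        if (stripos($cols[0], '0x') !== 0 || stripos($cols[1], '0x') !== 0) {
            continue;
        }
        $byteHex    = strtoupper(substr($cols[0], 2));
        $unicodeHex = strtoupper(substr($cols[1], 2));
        $table[$byteHex] = $unicodeHex;
    }
    return $table;
}

// Made-up sample rows for illustration.
$sample = [
    '# The entries are in ANSI X3.4 order.',
    '0x00 0x0000 # NULL',
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    '0x80 #UNDEFINED',
    '0xB3,0x0142 # LATIN SMALL LETTER L WITH STROKE',
];
$table = sketchParseFormatA($sample);
```

Keeping both columns as hex strings mirrors the hex-based lookups the comments in this file describe; a different design could store integers instead.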
/**
* The last thing is to check that both encoding tables are not, for any reason, the same.
* For example, this will happen when you save the encoding table file under a wrong name
* - that of another charset.
**/
"$FirstEncoding, $SecondEncoding"
/**
* ConvertCharset::Convert()
*
* This is the basic function you will use. I hope that you can figure out this function's syntax :-)
*
* @param string $StringToChange The string you want to change :)
* @param string $FromCharset Name of the $StringToChange encoding; you have to know it.
* @param string $ToCharset Name of the charset you want to get for $StringToChange.
* @param boolean $TurnOnEntities Set to true or 1 if you want to use numeric entities instead of regular chars.
* @return string Converted string in a brand new encoding :)
* @version 1.0 2004-07-27 01:09
**/
/**
* Check that all the variables are set
**/
"""\$StringToChange""""\$FromCharset""""\$ToCharset");
}
/**
* Now a few variables need to be set.
**/
"";
$this->Entities = $TurnOnEntities;
/**
* For all the people who like to use uppercase for charset encoding names :)
**/
/**
* Of course you can make a conversion from one charset to the same one :)
* but I feel obligated to let you know about it.
**/
"utf-8"
/**
* This division was made to prevent errors during conversion to/from utf-8 with
* "entities" enabled, because we need to use the proper destination (to) / source (from)
* encoding table to write proper entities.
*
* This is the first case. We are converting from 1 byte chars...
**/
"utf-8")
{
/**
* Now build a table with both charsets for the encoding change.
**/
"utf-8"
/**
* For each char in the string...
**/
"";
$UnicodeHexChar = "" // This is a fix from Mario Klingemann; it prevents
// dropping chars below 16 because of missing leading 0 [zeros]
"0".$HexChar;
//end of fix by Mario Klingemann
// This is a quick fix for 10 chars in gsm0338
// Thanks go to Andrea Carpani, who pointed out this problem
// and solved it ;)
"gsm0338") && ($HexChar == '1B' // end of workaround for 10 chars from gsm0338
"utf-8""+" //for($UnicodeH...
"$HexChar"
/**
* Sometimes there are two or more utf-8 chars per one regular char.
* An extreme example is the old Polish Mazovia encoding, where one char contains
* two letters, 007a (z) and 0142 (l slash); we need to figure out how to
* solve this problem.
* The letters are merged with a "plus" sign; there can be more than two chars.
* In Mazovia we have 007A+0142, but sometimes it can look like this:
* 0x007A+0x0142+0x2034 (that string means nothing, it just shows the possibility...)
**/
"+" // for
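Two details from the comments above, the leading-zero fix for bytes below 16 and table entries that join several code points with "+", can be sketched together. This is a hedged illustration, not the Convert() body: `sketchBytesToEntities()` is a hypothetical name, the table is made up (the byte values 0x92 and 0x93 are arbitrary choices, not real Mazovia positions), and the output uses numeric entities to keep the example short.

```php
<?php
// Illustrative sketch only: sketchBytesToEntities() and the tiny table below
// are hypothetical, not taken from ConvertCharset.class.php.

/**
 * Convert a 1 byte string to numeric entities using a map of
 * byte value in hex => Unicode value(s) in hex.
 */
function sketchBytesToEntities(string $input, array $table): string
{
    $out = '';
    $len = strlen($input);
    for ($i = 0; $i < $len; $i++) {
        // dechex(10) returns "a", so bytes below 16 need a leading zero
        // to match two-digit table keys such as "0A".
        $hexChar = strtoupper(str_pad(dechex(ord($input[$i])), 2, '0', STR_PAD_LEFT));
        if (!isset($table[$hexChar])) {
            continue; // no mapping: this sketch simply skips the byte
        }
        // One byte may map to several code points joined with "+".
        foreach (explode('+', $table[$hexChar]) as $unicodeHex) {
            // Accept both "0142" and "0x0142" spellings.
            $unicodeHex = preg_replace('/^0x/i', '', $unicodeHex);
            $out .= sprintf('&#%d;', hexdec($unicodeHex));
        }
    }
    return $out;
}

// Made-up table for illustration only.
$table = [
    '0A' => '000A',
    '92' => '007A+0142',
    '93' => '0x007A+0x0142',
];
$result = sketchBytesToEntities("\x0A\x92", $table);
```

Without the padding step the byte 0x0A would be looked up as "A" and silently dropped, which is the problem the leading-zero fix addresses.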
/**
* This is the second case. We are encoding from a multibyte char string.
**/
"utf-8")
{
$HexChar = "";
$UnicodeHexChar = ""
/**
* ConvertCharset::DebugOutput()
*
* This function is not really necessary; the debug output could stay inside the
* source code, but this way it's easier to manage and translate.
* Besides, I couldn't find a good comment/debug class :-) Maybe I'll write one someday...
*
* All messages depend on the DEBUG_MODE level. As I wrote before, you can set this value to:
* - -1 - No errors or notices are shown
* - 0 - Only error messages are shown, no notices
* - 1 - Error messages and notices are shown
*
* @param int $Group Message group: error - 0, notice - 1
* @param int $Number Number of the message within the group
* @param mixed $Value This value is whatever you want; usually it's some parameter value, to make the message easier to understand.
* @return string String with a proper message.
**/
//$Debug [$Group][$Number] = "Message, can be with $Value";
//$Group[0] - Errors
//$Group[1] - Notice
$Debug[0][0] = "Error, can NOT read file: " . $Value . "<br>";
$Debug[0][1] = "Error, can't find matching char \"". $Value ."\" in destination encoding table!" . "<br>";
$Debug[0][2] = "Error, can't find matching char \"". $Value ."\" in source encoding table!" . "<br>";
$Debug[0][3] = "Error, you did NOT set variable " . $Value . " in Convert() function." . "<br>";
$Debug[0][4] = "You can NOT convert string from " . $Value . " to " . $Value . "!" . "<BR>";
$Debug[1][0] = "Notice, you are trying to convert string from ". $Value ." to ". $Value .", don't you feel it's strange? ;-)" . "<br>";
$Debug[1][1] = "Notice, both charsets " . $Value . " are identical! Check encoding tables files." . "<br>";
$Debug[1][2] = "Notice, there is no unicode char in the string you are trying to convert." . "<br>";
// function DebugOutput

} //class ends here

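The way the DEBUG_MODE levels gate the two message groups can be sketched like this. It is an assumption-based illustration rather than the DebugOutput() body: `sketchDebugOutput()` is a hypothetical function that takes the level as a parameter instead of reading the DEBUG_MODE constant, and it carries only three of the messages listed above.

```php
<?php
// Illustrative sketch only: sketchDebugOutput() is hypothetical and takes the
// debug level as a parameter instead of reading the DEBUG_MODE constant.

function sketchDebugOutput(int $debugMode, int $group, int $number, string $value = ''): string
{
    $messages = [
        0 => [ // errors
            0 => 'Error, can NOT read file: ' . $value . '<br>',
            3 => 'Error, you did NOT set variable ' . $value . ' in Convert() function.<br>',
        ],
        1 => [ // notices
            2 => 'Notice, there is no unicode char in the string you are trying to convert.<br>',
        ],
    ];
    // A message of group G is shown only when the level is at least G:
    // -1 shows nothing, 0 shows errors, 1 shows errors and notices.
    if ($debugMode < $group || !isset($messages[$group][$number])) {
        return '';
    }
    return $messages[$group][$number];
}
```

Comparing the level against the group number works here only because errors are group 0 and notices are group 1, as the docblock states.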
