RoundCube Webmail
<?php

/**
 +-----------------------------------------------------------------------+
 | This file is part of the Roundcube Webmail client                     |
 |                                                                       |
 | Copyright (C) The Roundcube Dev Team                                  |
 | Copyright (c) 2005-2007, Jon Abernathy <jon@chuggnutt.com>            |
 |                                                                       |
 | Licensed under the GNU General Public License version 3 or            |
 | any later version with exceptions for skins & plugins.                |
 | See the README file for a full license statement.                     |
 |                                                                       |
 | PURPOSE:                                                              |
 |   Converts HTML to formatted plain text (based on html2text class)    |
 +-----------------------------------------------------------------------+
 | Author: Thomas Bruederli <roundcube@gmail.com>                        |
 | Author: Aleksander Machniak <alec@alec.pl>                            |
 | Author: Jon Abernathy <jon@chuggnutt.com>                             |
 +-----------------------------------------------------------------------+
*/

/**
 * Takes HTML and converts it to formatted, plain text.
 *
 * Thanks to Alexander Krug (http://www.krugar.de/) for pointing out and
 * correcting an error in the regexp search array. Fixed 7/30/03.
 *
 * Updated set_html() function's file reading mechanism, 9/25/03.
 *
 * Thanks to Joss Sanglier (http://www.dancingbear.co.uk/) for adding
 * several more HTML entity codes to the $search and $replace arrays.
 * Updated 11/7/03.
 *
 * Thanks to Darius Kasperavicius (http://www.dar.dar.lt/) for
 * suggesting the addition of $allowed_tags and its supporting function
 * (which I slightly modified). Updated 3/12/04.
 *
 * Thanks to Justin Dearing for pointing out that a replacement for the
 * <TH> tag was missing, and suggesting an appropriate fix.
 * Updated 8/25/04.
 *
 * Thanks to Mathieu Collas (http://www.myefarm.com/) for finding a
 * display/formatting bug in the _build_link_list() function: email
 * readers would show the left bracket and number ("[1") as part of the
 * rendered email address.
 * Updated 12/16/04.
 *
 * Thanks to Wojciech Bajon (http://histeria.pl/) for submitting code
 * to handle relative links, which I hadn't considered. I modified his
 * code a bit to handle normal HTTP links and MAILTO links. Also for
 * suggesting three additional HTML entity codes to search for.
 * Updated 03/02/05.
 *
 * Thanks to Jacob Chandler for pointing out another link condition
 * for the _build_link_list() function: "https".
 * Updated 04/06/05.
 *
 * Thanks to Marc Bertrand (http://www.dresdensky.com/) for
 * suggesting a revision to the word wrapping functionality; if you
 * specify a $width of 0 or less, word wrapping will be ignored.
 * Updated 11/02/06.
 *
 * *** Big housecleaning updates below:
 *
 * Thanks to Colin Brown (http://www.sparkdriver.co.uk/) for
 * suggesting the fix to handle </li> and blank lines (whitespace).
 * Christian Basedau (http://www.movetheweb.de/) also suggested the
 * blank lines fix.
 *
 * Special thanks to Marcus Bointon (http://www.synchromedia.co.uk/),
 * Christian Basedau, Norbert Laposa (http://ln5.co.uk/),
 * Bas van de Weijer, and Marijn van Butselaar
 * for pointing out my glaring error in the <th> handling. Marcus also
 * supplied a host of fixes.
 *
 * Thanks to Jeffrey Silverman (http://www.newtnotes.com/) for pointing
 * out that extra spaces should be compressed--a problem addressed with
 * Marcus Bointon's fixes but that I had not yet incorporated.
 *
 * Thanks to Daniel Schledermann (http://www.typoconsult.dk/) for
 * suggesting a valuable fix with <a> tag handling.
 *
 * Thanks to Wojciech Bajon (again!) for suggesting fixes and additions,
 * including the <a> tag handling that Daniel Schledermann pointed
 * out but that I had not yet incorporated. I haven't (yet)
 * incorporated all of Wojciech's changes, though I may at some
 * future time.
 *
 * *** End of the housecleaning updates. Updated 08/08/07.
 */

/**
 * Converts HTML to formatted plain text
 *
 * @package    Framework
 * @subpackage Utils
 */
class rcube_html2text
{
    const LINKS_NONE    = 0;
    const LINKS_END     = 1;
    const LINKS_INLINE  = 2;
    const LINKS_DEFAULT = self::LINKS_END;

    /**
     * Contains the HTML content to convert.
     *
     * @var string $html
     */
    protected $html;

    /**
     * Contains the converted, formatted text.
     *
     * @var string $text
     */
    protected $text;

    /**
     * Maximum width of the formatted text, in columns.
     *
     * Set this value to 0 (or less) to ignore word wrapping
     * and not constrain text to a fixed-width column.
     *
     * @var int $width
     */
    protected $width = 70;

    /**
     * Target character encoding for output text
     *
     * @var string $charset
     */
    protected $charset = 'UTF-8';

    /**
     * List of preg* regular expression patterns to search for,
     * used in conjunction with $replace.
     *
     * @var array $search
     * @see self::$replace
     */
    protected $search = [
        '/\r/', // Non-legal carriage return
        '/\n*<\/?html>\n*/is', // <html>
        '/\n*<head[^>]*>.*?<\/head>\n*/is', // <head>
        '/\n*<script[^>]*>.*?<\/script>\n*/is', // <script>
        '/\n*<style[^>]*>.*?<\/style>\n*/is', // <style>
        '/[\n\t]+/', // Newlines and tabs
        '/<p[^>]*>/i', // <p>
        '/<\/p>[\s\n\t]*<div[^>]*>/i', // </p> before <div>
        '/<br[^>]*>[\s\n\t]*<div[^>]*>/i', // <br> before <div>
        '/<br[^>]*>\s*/i', // <br>
        '/<i[^>]*>(.*?)<\/i>/i', // <i>
        '/<em[^>]*>(.*?)<\/em>/i', // <em>
        '/(<ul[^>]*>|<\/ul>)/i', // <ul> and </ul>
        '/(<ol[^>]*>|<\/ol>)/i', // <ol> and </ol>
        '/<li[^>]*>(.*?)<\/li>/i', // <li> and </li>
        '/<li[^>]*>/i', // <li>
        '/<hr[^>]*>/i', // <hr>
        '/<div[^>]*>/i', // <div>
        '/(<table[^>]*>|<\/table>)/i', // <table> and </table>
        '/(<tr[^>]*>|<\/tr>)/i', // <tr> and </tr>
        '/<td[^>]*>(.*?)<\/td>/i', // <td> and </td>
    ];

    /**
     * List of pattern replacements corresponding to patterns searched.
     *
     * @var array $replace
     * @see self::$search
     */
    protected $replace = [
        '', // Non-legal carriage return
        '', // <html>|</html>
        '', // <head>
        '', // <script>
        '', // <style>
        ' ', // Newlines and tabs
        "\n\n", // <p>
        "\n<div>", // </p> before <div>
        '<div>', // <br> before <div>
        "\n", // <br>
        '_\\1_', // <i>
        '_\\1_', // <em>
        "\n\n", // <ul> and </ul>
        "\n\n", // <ol> and </ol>
        "\t* \\1\n", // <li> and </li>
        "\n\t* ", // <li>
        "\n-------------------------\n", // <hr>
        "<div>\n", // <div>
        "\n\n", // <table> and </table>
        "\n", // <tr> and </tr>
        "\t\t\\1\n", // <td> and </td>
    ];

    /**
     * List of preg* regular expression patterns to search for,
     * used in conjunction with $ent_replace.
     *
     * @var array $ent_search
     * @see self::$ent_replace
     */
    protected $ent_search = [
        '/&(nbsp|#160);/i', // Non-breaking space
        '/&(quot|rdquo|ldquo|#8220|#8221|#147|#148);/i', // Double quotes
        '/&(apos|rsquo|lsquo|#8216|#8217);/i', // Single quotes
        '/&gt;/i', // Greater-than
        '/&lt;/i', // Less-than
        '/&(copy|#169);/i', // Copyright
        '/&(trade|#8482|#153);/i', // Trademark
        '/&(reg|#174);/i', // Registered
        '/&(mdash|#151|#8212);/i', // mdash
        '/&(ndash|minus|#8211|#8722);/i', // ndash
        '/&(bull|#149|#8226);/i', // Bullet
        '/&(pound|#163);/i', // Pound sign
        '/&(euro|#8364);/i', // Euro sign
        '/&(amp|#38);/i', // Ampersand: see _converter()
        '/[ ]{2,}/', // Runs of spaces, post-handling
    ];

    /**
     * List of pattern replacements corresponding to patterns searched.
     *
     * @var array $ent_replace
     * @see self::$ent_search
     */
    protected $ent_replace = [
        "\xC2\xA0", // Non-breaking space
        '"', // Double quotes
        "'", // Single quotes
        '>', // Greater-than
        '<', // Less-than
        '(c)', // Copyright
        '(tm)', // Trademark
        '(R)', // Registered
        '--', // mdash
        '-', // ndash
        '*', // Bullet
        '£', // Pound sign
        'EUR', // Euro sign. €
        '|+|amp|+|', // Ampersand: see _converter()
        ' ', // Runs of spaces, post-handling
    ];

    /**
     * List of preg* regular expression patterns to search for
     * and replace using callback function.
     *
     * @var array $callback_search
     */
    protected $callback_search = [
        '/<(a) [^>]*href=("|\')([^"\']+)\2[^>]*>(.*?)<\/a>/i', // <a href="">
        '/<(h)[123456]( [^>]*)?>(.*?)<\/h[123456]>/i', // h1 - h6
        '/<(th)( [^>]*)?>(.*?)<\/th>/i', // <th> and </th>
    ];

    /**
     * List of preg* regular expression patterns to search for in PRE body,
     * used in conjunction with $pre_replace.
     *
     * @var array $pre_search
     * @see self::$pre_replace
     */
    protected $pre_search = [
        "/\n/",
        "/\t/",
        '/ /',
        '/<pre[^>]*>/',
        '/<\/pre>/'
    ];

    /**
     * List of pattern replacements corresponding to patterns searched for PRE body.
     *
     * @var array $pre_replace
     * @see self::$pre_search
     */
    protected $pre_replace = [
        '<br>',
        '&nbsp;&nbsp;&nbsp;&nbsp;',
        '&nbsp;',
        '',
        ''
    ];

    /**
     * Temporary PRE content
     *
     * @var string $pre_content
     */
    protected $pre_content = '';

    /**
     * Contains a list of HTML tags to allow in the resulting text.
     *
     * @var string $allowed_tags
     * @see self::set_allowed_tags()
     */
    protected $allowed_tags = '';

    /**
     * Contains the base URL that relative links should resolve to.
     *
     * @var string $url
     */
    protected $url;

    /**
     * Indicates whether content in the $html variable has been converted yet.
     *
     * @var bool $_converted
     * @see self::$html
     * @see self::$text
     */
    protected $_converted = false;

    /**
     * Contains URL addresses from links to be rendered in plain text.
     *
     * @var array $_link_list
     * @see self::_build_link_list()
     */
    protected $_link_list = [];

    /**
     * Links handling.
     * - 0 if links should be removed
     * - 1 if a table of link URLs should be listed after the text
     * - 2 if each link should be displayed inline, at the point in the text where it appeared
     *
     * @var int $_links_mode
     */
    protected $_links_mode = 1;

    /**
     * Constructor.
     *
     * If the HTML source string (or file) is supplied, the class
     * will instantiate with that source propagated; all that has
     * to be done is to call get_text().
     *
     * @param string   $source     HTML content
     * @param bool     $from_file  Indicates $source is a file to pull content from
     * @param bool|int $links_mode Links handling mode
     * @param int      $width      Maximum width of the formatted text, 0 for no limit
     * @param string   $charset    Target character encoding for output text
     */
    function __construct($source = '', $from_file = false, $links_mode = self::LINKS_DEFAULT, $width = 75, $charset = 'UTF-8')
    {
        if (!empty($source)) {
            $this->set_html($source, $from_file);
        }

        $this->set_base_url();
        $this->set_links_mode($links_mode);

        $this->width   = $width;
        $this->charset = $charset;
    }

    /**
     * Sets the links behavior mode
     *
     * @param bool|int $mode
     */
    private function set_links_mode($mode)
    {
        $allowed = [
            self::LINKS_NONE,
            self::LINKS_END,
            self::LINKS_INLINE
        ];

        if (!in_array((int) $mode, $allowed)) {
            $this->_links_mode = self::LINKS_DEFAULT;
            return;
        }

        $this->_links_mode = (int) $mode;
    }

    /**
     * Loads source HTML into memory, either from $source string or a file.
     *
     * @param string $source    HTML content
     * @param bool   $from_file Indicates $source is a file to pull content from
     */
    function set_html($source, $from_file = false)
    {
        if ($from_file && file_exists($source)) {
            $this->html = file_get_contents($source);
        }
        else {
            $this->html = $source;
        }

        $this->_converted = false;
    }

    /**
     * Returns the text, converted from HTML.
     *
     * @return string Plain text
     */
    function get_text()
    {
        if (!$this->_converted) {
            $this->_convert();
        }

        return $this->text;
    }

    /**
     * Prints the text, converted from HTML.
     */
    function print_text()
    {
        print $this->get_text();
    }

    /**
     * Sets the allowed HTML tags to pass through to the resulting text.
     *
     * Tags should be in the form "<p>", with no corresponding closing tag.
     */
    function set_allowed_tags($allowed_tags = '')
    {
        if (!empty($allowed_tags)) {
            $this->allowed_tags = $allowed_tags;
        }
    }

    /**
     * Sets a base URL to handle relative links.
     */
    function set_base_url($url = '')
    {
        if (empty($url)) {
            if (!empty($_SERVER['HTTP_HOST'])) {
                $this->url = 'http://' . $_SERVER['HTTP_HOST'];
            }
            else {
                $this->url = '';
            }
        }
        else {
            // Strip any trailing slashes for consistency (relative
            // URLs may already start with a slash like "/file.html")
            if (substr($url, -1) == '/') {
                $url = substr($url, 0, -1);
            }

            $this->url = $url;
        }
    }

    /**
     * Workhorse function that does actual conversion (calls _converter() method).
     */
    protected function _convert()
    {
        // Variables used for building the link list
        $this->_link_list = [];

        $text = $this->html;

        // Convert HTML to TXT
        $this->_converter($text);

        // Add link list
        if (!empty($this->_link_list)) {
            $text .= "\n\nLinks:\n------\n";
            foreach ($this->_link_list as $idx => $url) {
                $text .= '[' . ($idx+1) . '] ' . $url . "\n";
            }
        }

        $this->text       = $text;
        $this->_converted = true;
    }

    /**
     * Workhorse function that does actual conversion.
     *
     * First performs custom tag replacement specified by $search and
     * $replace arrays. Then strips any remaining HTML tags, reduces whitespace
     * and newlines to a readable format, and word wraps the text to
     * $width characters.
     *
     * @param string &$text Reference to HTML content string
     */
    protected function _converter(&$text)
    {
        // Convert <BLOCKQUOTE> (before PRE!)
        $this->_convert_blockquotes($text);

        // Convert <PRE>
        $this->_convert_pre($text);

        // Remove body tag and anything before
        // We used to have '/^.*<body[^>]*>\n*/is' in $this->search, but this requires
        // high pcre.backtrack_limit setting when converting long HTML strings (#8137)
        if (($pos = stripos($text, '<body')) !== false) {
            $pos  = strpos($text, '>', $pos);
            $text = substr($text, $pos + 1);
            $text = ltrim($text);
        }

        // Run our defined tags search-and-replace
        $text = preg_replace($this->search, $this->replace, $text);

        // Run our defined tags search-and-replace with callback
        $text = preg_replace_callback($this->callback_search, [$this, 'tags_preg_callback'], $text);

        // Strip any other HTML tags
        $text = strip_tags($text, $this->allowed_tags);

        // Run our defined entities/characters search-and-replace
        $text = preg_replace($this->ent_search, $this->ent_replace, $text);

        // Replace known html entities
        $text = html_entity_decode($text, ENT_QUOTES, $this->charset);

        // Replace unicode nbsp with regular spaces
        $text = preg_replace('/\xC2\xA0/', ' ', $text);

        // Remove unknown/unhandled entities (this cannot be done in search-and-replace block)
        $text = preg_replace('/&([a-zA-Z0-9]{2,6}|#[0-9]{2,4});/', '', $text);

        // Convert "|+|amp|+|" into "&"; this needs to be done after handling of unknown entities.
        // This properly handles the situation of "&amp;quot;" in the input string
        $text = str_replace('|+|amp|+|', '&', $text);

        // Bring down number of empty lines to 2 max
        $text = preg_replace("/\n\s+\n/", "\n\n", $text);
        $text = preg_replace("/[\n]{3,}/", "\n\n", $text);

        // Remove leading empty lines (can be produced by e.g. a P tag at the beginning)
        $text = ltrim($text, "\n");

        // Wrap the text to a readable format
        // for PHP versions >= 4.0.2. Default width is 75
        // If width is 0 or less, don't wrap the text.
        if ($this->width > 0) {
            $text = wordwrap($text, $this->width);
        }
    }

    /**
     * Helper function called by tags_preg_callback() on link replacement.
     *
     * Depending on the links mode, either maintains an internal list of links
     * to be displayed at the end of the text, with numeric indices, or places
     * the link at the point in the text where it appeared. Also makes an
     * effort at identifying and handling absolute and relative links.
     *
     * @param string $link    URL of the link
     * @param string $display Part of the text to associate number with
     *
     * @return string Replacement text for the link
     */
    protected function _handle_link($link, $display)
    {
        if (empty($link)) {
            return $display;
        }

        // Ignored link types
        if (preg_match('!^(javascript:|mailto:|#)!i', $link)) {
            return $display;
        }

        // skip links with href == content (#1490434)
        if ($link === $display) {
            return $display;
        }

        if (preg_match('!^([a-z][a-z0-9.+-]+:)!i', $link)) {
            $url = $link;
        }
        else {
            $url = $this->url;
            if (substr($link, 0, 1) != '/') {
                $url .= '/';
            }
            $url .= "$link";
        }

        if (self::LINKS_NONE === $this->_links_mode) {
            // When not using link list use URL if there's no content (#5795)
            // The content here is HTML, convert it to text first
            $h2t     = new rcube_html2text($display, false, false, 1024, $this->charset);
            $display = $h2t->get_text();

            if (empty($display) && preg_match('!^([a-z][a-z0-9.+-]+://)!i', $link)) {
                return $link;
            }

            return $display;
        }

        if (self::LINKS_INLINE === $this->_links_mode) {
            return $this->_build_link_inline($url, $display);
        }

        return $this->_build_link_list($url, $display);
    }

    /**
     * Helper function called by _handle_link() on link replacement.
     *
     * Displays the link next to the point in the text where it appeared.
     *
     * @param string $url     URL of the link
     * @param string $display Link text
     */
    protected function _build_link_inline($url, $display)
    {
        return $display . ' &lt;' . $url . '&gt;';
    }

    /**
     * Helper function called by _handle_link() on link replacement.
     *
     * Maintains an internal list of links to be displayed at the end of the
     * text, with numeric indices marking the point in the text where they
     * appeared.
     *
     * @param string $url     URL of the link
     * @param string $display Part of the text to associate number with
     */
    protected function _build_link_list($url, $display)
    {
        if (($index = array_search($url, $this->_link_list)) === false) {
            $index = count($this->_link_list);
            $this->_link_list[] = $url;
        }

        return $display . ' [' . ($index+1) . ']';
    }

    /**
     * Helper function for PRE body conversion.
     *
     * @param string &$text HTML content
     */
    protected function _convert_pre(&$text)
    {
        // get the content of PRE element
        while (preg_match('/<pre[^>]*>(.*)<\/pre>/ismU', $text, $matches)) {
            $this->pre_content = $matches[1];

            // Run our defined tags search-and-replace with callback
            $this->pre_content = preg_replace_callback($this->callback_search,
                [$this, 'tags_preg_callback'], $this->pre_content);

            // convert the content
            $this->pre_content = sprintf('<div><br>%s<br></div>',
                preg_replace($this->pre_search, $this->pre_replace, $this->pre_content));

            // replace the content (use callback because content can contain $0 variable)
            $text = preg_replace_callback('/<pre[^>]*>.*<\/pre>/ismU',
                [$this, 'pre_preg_callback'], $text, 1);

            // free memory
            $this->pre_content = '';
        }
    }

    /**
     * Helper function for BLOCKQUOTE body conversion.
     *
     * @param string &$text HTML content
     */
    protected function _convert_blockquotes(&$text)
    {
        $level  = 0;
        $offset = 0;

        while (($start = stripos($text, '<blockquote', $offset)) !== false) {
            $offset = $start + 12;

            do {
                $end  = stripos($text, '</blockquote>', $offset);
                $next = stripos($text, '<blockquote', $offset);

                // nested <blockquote>, skip
                if ($next !== false && $next < $end) {
                    $offset = $next + 12;
                    $level++;
                }

                // nested </blockquote> tag
                if ($end !== false && $level > 0) {
                    $offset = $end + 12;
                    $level--;
                }
                // found matching end tag
                else if ($end !== false && $level == 0) {
                    $taglen   = strpos($text, '>', $start) - $start;
                    $startpos = $start + $taglen + 1;

                    // get blockquote content
                    $body = trim(substr($text, $startpos, $end - $startpos));

                    // adjust text wrapping width
                    $p_width = $this->width;
                    if ($this->width > 0) {
                        $this->width -= 2;
                    }

                    // replace content with inner blockquotes
                    $this->_converter($body);

                    // restore text width
                    $this->width = $p_width;

                    // Add citation markers and create <pre> block
                    $body = preg_replace_callback('/((?:^|\n)>*)([^\n]*)/', [$this, 'blockquote_citation_callback'], trim($body));
                    $body = '<pre>' . htmlspecialchars($body, ENT_COMPAT | ENT_HTML401 | ENT_SUBSTITUTE, $this->charset) . '</pre>';

                    $text   = substr_replace($text, $body . "\n", $start, $end + 13 - $start);
                    $offset = 0;

                    break;
                }
                // abort on invalid tag structure (e.g. no closing tag found)
                else {
                    break;
                }
            }
            while ($end || $next);
        }
    }

    /**
     * Callback function to correctly add citation markers for blockquote contents
     */
    public function blockquote_citation_callback($m)
    {
        $line  = ltrim($m[2]);
        $space = isset($line[0]) && $line[0] == '>' ? '' : ' ';

        return $m[1] . '>' . $space . $line;
    }

    /**
     * Callback function for preg_replace_callback use.
     *
     * @param array $matches PREG matches
     *
     * @return string Element content
     */
    public function tags_preg_callback($matches)
    {
        switch (strtolower($matches[1])) {
        case 'th':
            return $this->_toupper("\t\t". $matches[3] ."\n");
        case 'h':
            return $this->_toupper("\n\n". $matches[3] ."\n\n");
        case 'a':
            // Remove spaces in URL (#1487805)
            $url = str_replace(' ', '', $matches[3]);
            return $this->_handle_link($url, $matches[4]);
        }
    }

    /**
     * Callback function for preg_replace_callback use in PRE content handler.
     *
     * @param array $matches PREG matches
     *
     * @return string PRE content
     */
    public function pre_preg_callback($matches)
    {
        return $this->pre_content;
    }

    /**
     * Strtoupper function with HTML tags and entities handling.
     *
     * @param string $str Text to convert
     *
     * @return string Converted text
     */
    private function _toupper($str)
    {
        // the string can contain HTML tags
        $chunks = preg_split('/(<[^>]*>)/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

        // uppercase only the text between HTML tags
        foreach ($chunks as $idx => $chunk) {
            if ($chunk[0] != '<') {
                $chunks[$idx] = $this->_strtoupper($chunk);
            }
        }

        return implode($chunks);
    }

    /**
     * Strtoupper multibyte wrapper function with HTML entities handling.
     *
     * @param string $str Text to convert
     *
     * @return string Converted text
     */
    private function _strtoupper($str)
    {
        $str = html_entity_decode($str, ENT_COMPAT, $this->charset);
        $str = mb_strtoupper($str);
        $str = htmlspecialchars($str, ENT_COMPAT, $this->charset);

        return $str;
    }
}
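
// Usage sketch (not part of the original file): a minimal, hedged example of
// how the class above might be driven. The sample HTML and the example.org
// URL are made up for illustration. The guard is an assumption of this
// sketch: the block only runs when this file is executed directly from the
// command line, so including the file elsewhere is unaffected. It also
// assumes the mbstring extension is loaded, since _strtoupper() calls
// mb_strtoupper() when converting headings.
if (PHP_SAPI === 'cli' && isset($argv[0]) && realpath($argv[0]) === __FILE__) {
    $sample = '<h1>Hello</h1><p>See <a href="https://example.org/docs">the docs</a>.</p>';

    // LINKS_END collects link URLs into a numbered "Links:" list appended
    // after the text; a width of 0 disables word wrapping.
    $h2t = new rcube_html2text($sample, false, rcube_html2text::LINKS_END, 0);

    echo $h2t->get_text();
}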