Letter ő doesn't show. #7

Closed
opened 2021-12-07 23:16:46 -05:00 by p4ulcristian · 7 comments
p4ulcristian commented 2021-12-07 23:16:46 -05:00 (Migrated from github.com)

The letter ő doesn't show up in the PDF. I tried setting charset="UTF-8". Here is the HTML:

<!doctype html>
<html lang="hu">
 <head>
  <meta charset="utf-8">
  <meta name="author" content="paul">
  <meta name="title" content="Árajánlat">
  <style type="text/css">@page { font-family: sans-serif; margin: 0in; size: A4 portrait; } </style>
  <style type="text/css">body { background-color: #fff; color: #000; font-family: Montserrat; font-size: 12pt; line-height: 1.3; } </style>
  <link type="text/css" rel="stylesheet" href="jar:file:/Users/paulcristian/.m2/repository/clj-htmltopdf/clj-htmltopdf/0.2/clj-htmltopdf-0.2.jar!/htmltopdf-base.css">
  <link type="text/css" rel="stylesheet" href="http://localhost:3000/pdf/css/pdf-fonts.css">
  <link type="text/css" rel="stylesheet" href="http://localhost:3000/pdf/css/pdf.css">
 </head>
 <body>
  ő
 </body>
</html>
gered commented 2021-12-08 07:48:22 -05:00 (Migrated from github.com)

Thanks for reporting this! I see the issue myself too. I am guessing I have missed some encoding settings somewhere, but I need some more time to look into this. I will get back to you on this.

p4ulcristian commented 2021-12-08 08:13:40 -05:00 (Migrated from github.com)

Thank you. Meanwhile, can you point me to the part of the code that is responsible for this? I need this to work, so I want to help, maybe write my own fix for it and share it here.

gered commented 2021-12-08 08:42:18 -05:00 (Migrated from github.com)

So, I'm not entirely sure where exactly the encoding issue is. Adding the :debug :display-html? option shows us that the HTML string looks fine up until this point (https://github.com/gered/clj-htmltopdf/blob/master/src/clj_htmltopdf/core.clj#L116), as the prepare-html function is what checks for that option and println's the HTML.

However, the write-pdf! function (https://github.com/gered/clj-htmltopdf/blob/master/src/clj_htmltopdf/core.clj#L84) reads the HTML string into Jsoup and turns it into the W3C object representation needed by Open HTML to PDF itself. It is possible that there are some encoding settings here that were missed (I was sure that UTF-8 encoding was the default here, but I might be wrong?). It is also possible that the PdfRendererBuilder instance used in this function might need some tweaking.

Overall, several things to investigate and these are only my thoughts off the top of my head ... could be something else entirely. ;) I will have time to look at this in more detail this evening.

p4ulcristian commented 2021-12-08 09:01:06 -05:00 (Migrated from github.com)

My friend tested it, and it works well on Ubuntu. I'll check it later as well. I use macOS, but my locale is set to UTF-8, so something else must be the problem. Thank you for taking the time :)
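
For anyone comparing machines like this, a quick way to see what the JVM itself thinks about encoding is sketched below. It only uses standard JDK calls; the function names jvm-encoding-info and utf8-bytes are made up for this example and are not part of clj-htmltopdf. If the bytes for "ő" come out as expected on both systems, the string encoding is probably not where the problem is.

```clojure
(import '(java.nio.charset Charset))

(defn jvm-encoding-info
  "Returns the encoding-related settings the running JVM reports."
  []
  {:file-encoding   (System/getProperty "file.encoding")
   :default-charset (str (Charset/defaultCharset))})

(defn utf8-bytes
  "Returns the UTF-8 encoding of `s` as a seq of unsigned byte values."
  [^String s]
  (map #(bit-and % 0xFF) (.getBytes s "UTF-8")))

;; "\u0151" is ő; in UTF-8 it should encode as the two bytes 0xC5 0x91.
(println (jvm-encoding-info))
(println (utf8-bytes "\u0151"))
```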

gered commented 2021-12-08 18:42:11 -05:00 (Migrated from github.com)

Alright, so this got a bit more involved than I was expecting it to at first! Turns out that it is not an encoding issue at all, but instead is a font issue.

When Open HTML To PDF is rendering text, whenever it encounters a glyph it does not have in any of the fonts it is currently using to render (e.g. based on CSS styling and the currently applicable font-family setting), it replaces it with a '#' character.

The default embedded fonts (sans-serif, serif, etc.) only support a very basic Western European character set, so custom fonts will be needed in many cases to provide extended character sets / glyphs (for example, for Japanese or Chinese text). Sounds obvious to me in hindsight, but I have to admit I'd never considered that before. :-)
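
A rough way to check ahead of time whether a given font file covers the text you want to render is sketched below. Note the assumptions: it uses the JDK's java.awt.Font, not Open HTML to PDF's own font handling, so it is only an approximation of what the renderer will do, and the names load-ttf and first-missing-glyph as well as the font path are hypothetical.

```clojure
(import '(java.awt Font)
        '(java.io File))

(defn load-ttf
  "Loads a TrueType font from the file at `path`."
  ^Font [^String path]
  (Font/createFont Font/TRUETYPE_FONT (File. path)))

(defn first-missing-glyph
  "Returns the index of the first character in `text` that `font` cannot
  display, or -1 when the font covers every character."
  [^Font font ^String text]
  (.canDisplayUpTo font text))

(comment
  ;; Hypothetical path; point it at the font file you actually ship.
  (first-missing-glyph (load-ttf "/path/to/Montserrat-Regular.ttf") "ő"))
```

A result other than -1 would suggest that the character at that index is a candidate for being rendered as '#'.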

Your example HTML had some font-family CSS styling, but seemed to be missing a @font-face section to actually load the custom font you have (Montserrat). I don't know what is in your pdf-fonts.css file, but I assume that it does not have a working @font-face section either to load this font, else I suspect this would've worked just fine for you!

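Anyway, this example worked for me:

(require '[clj-htmltopdf.core :refer [->pdf]])

;; Register the Montserrat TTF under :styles :fonts so the renderer
;; has a font that actually contains the "ő" glyph.
(->pdf
  [:div {:style "font-family: Montserrat"}
   [:p "ő"]]
  "test-characters.pdf"
  {:styles {:fonts [{:font-family "Montserrat"
                     :src         "/path/to/Montserrat-Regular.ttf"}]}
   :debug  {:display-html?    true
            :display-options? true}})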

Give it a try and let me know how this works!

(Also thanks for reporting this issue, this exposed a whole bunch of font/character/glyph stuff that I had not previously considered at all! I will be pushing out an update to clj-htmltopdf that includes some extra things, but the core idea of needing to provide your own custom fonts for additional language/character support will always be an annoying extra requirement regardless.)

p4ulcristian commented 2021-12-09 04:43:27 -05:00 (Migrated from github.com)

Thank you very much! I did mess up the @font-face src requirement.
Now it works with:

@font-face {
  font-family: 'Montserrat';
  font-style: normal;
  font-weight: 500;

  src: url('http://localhost:3000/pdf/fonts/Montserrat/fonts/ttf/Montserrat-Medium.ttf')
       format('truetype');
}

Thank you for looking into this and for making this library.
We built a price-quote system with its help. I convert the PDFs to base64, send them to the client side, and then delete the generated files on the server side. (Just sharing a use case for this library.)
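
In case it helps someone with a similar setup, the base64 step can be sketched roughly like this. It is a minimal sketch and not our production code: the function name pdf-file->base64! is made up for this example, and it assumes the PDF has already been written to a file by ->pdf.

```clojure
(import '(java.io File)
        '(java.nio.file Files)
        '(java.util Base64))

(defn pdf-file->base64!
  "Reads the file at `path`, deletes it from disk, and returns its
  contents as a Base64 string."
  [^String path]
  (let [p    (.toPath (File. path))
        data (Files/readAllBytes p)]
    ;; The bytes are already in memory, so the file can be removed now.
    (Files/deleteIfExists p)
    (.encodeToString (Base64/getEncoder) data)))
```

After something like (->pdf hiccup "quote.pdf" options), calling (pdf-file->base64! "quote.pdf") returns the string to send to the client and leaves no file behind on the server.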

gered commented 2021-12-10 08:55:10 -05:00 (Migrated from github.com)

Cool! Glad it worked, and thanks for sharing your use of this library. I love hearing about that kind of stuff. :-)

Reference: gered/clj-htmltopdf#7