Replacing WordPress
Is it a story, or is it technical documentation? It is probably both. I only know it started with a specification of some kind and went on to become a solution. Now I try to compile the specification and the implementation notes into the description of a journey: my journey to Python and to my new web presence, and what I learned on the way.
The code in this article is still hot off the needle: written in haste and not yet polished.
Motivation
Part One
I think WordPress is a good platform to create a web presence. It is just a pain in the ass if you start to care about privacy. You start by removing all the included tracker features contacting Google Analytics, you continue with the removal of "social" media features, because you are not sure whether these contact the respective platforms even without the social media button being pressed, and you remove all those links loading resources from foreign servers, because the respective resource servers might be able to identify you and also the page you visited.
After this isolation of your web presence, you are halfway sure that the privacy of your visitors is safe. Then you might receive a notification that you should update WordPress or one of the plugins. ...
If you do not update, your web presence may become insecure for your visitors. If you update, you have to review everything again. Is your server isolation still in effect, or did the update reinstate some of the removed side-communications?
Part Two
I decided to learn Python and needed a project for this. I stumbled upon the article "Gitblog - the software that powers my blog" 1, and I liked the idea of publishing via
$ git push
I'd guess most developers agree with this. Posting articles just the same way as you push new code to a git server - developers are bound to like this.
The Gitblog solution is based on Java, and nothing is wrong with that. I used Java in software development for more than 20 years, with a noticeable break during which I mainly used ABAP, before coming back to Java again.
But since I wanted to learn Python anyhow, this was the ideal project to get a start with that language.
Project Duration
From start to go-live it took slightly less than two months. The first Wiki entry dates back to December 31st, 2021, but this was nothing more than a note to myself. On January 16th the real drafting of the specification started, and hardly anything from that first specification survived. But the most important requirements stayed stable and are met by the solution.
- No JavaScript
  - There is one article containing JavaScript, because the article contains a quiz. But articles are content, not the publishing solution.
- All plain HTML+CSS static content
- State of the art semantic HTML
- Search (the exception from the static content)
  - Currently YaCy integration
  - Planned: a Python-based search
The website, with all previously published articles migrated, went live on March 14th. And migration was a really heavy topic. Most articles were originally written in one of my Wikis and then published in WordPress, but some of the very short video, audio or article recommendations were not written in the Wiki. Some articles got corrections maintained directly in WordPress instead of a correction in the Wiki with republishing afterwards. For some of those corrections I just made a comment in WordPress to make readers aware of the mistake.
In short: inconsistency in previous publishing made migration a major effort.
Other inconsistencies were:
- Articles maintained in the Wiki had chapter headlines partly starting at header level 2 and partly at header level 3.
- Quotations were only preceded by "Zitat:" and concluded with "Zitat Ende", and not even everywhere, since I started this only when I began the audio recordings of my articles - to make sure I do not forget to mention the start and the end of a quotation in the recording.
- Quotations were not enclosed in <blockquote> tags.
To provide state of the art semantic HTML I had to copy-edit every article during the migration. The good news is: the new setup will drive and support me to publish in a more consistent manner.
Considering this major migration effort, I'm pretty proud the project took "only" two months, especially since it was the Python learning project.
As often happens along the way, I learned much more than just Python. I learned new things about PDF generation, fonts, git, regex, HTML, CSS, vim, the IDE Spyder, the web server nginx and even more.
Requirement Specification
The requirement specification was subject to changes. As it often happens, this was mainly because it not only described requirements, but also already made assumptions about technical details of the solution.
A funny fact: I spent years explaining to my own customers that it is important not to write requirement specifications with a technical solution in mind. The requirement specification should focus strictly on non-technical scenario descriptions. The rationale behind this: very often a customer would ask to eliminate work efforts caused by previously implemented workarounds. The workaround is viewed as a tool by the customer, and following the customer's suggestion leads to the implementation of yet another workaround. Very often you get a much better overall solution if you also sunset existing workarounds, which is difficult, because they were so helpful in the past.
My previous publishing scenario
I use one of my MediaWiki instances to collect information, and I also use this wiki to create articles based on that information. This part of the publishing scenario stays in place. I considered changing this as well and writing articles in the editor vim in the future, but I decided to keep information collection and article compilation together in one place.
To get an article published, together with its audio recording, I used the HTML export option of a PDF export extension.
I then used the editor vim with 3 regex statements to strip the header and the footer from that export, and, if necessary, also the references to categories, which would otherwise establish links pointing into the void when displayed in WordPress.
The remaining HTML was then pasted into one HTML input field in the Create Post UI of WordPress. Thus the page internal links in the table of contents and to and from the reference section of the page stayed functional.
If pictures were included, I uploaded these first to WordPress and used these uploaded pictures already in the wiki. That way the links to the pictures stayed as they were in the later WordPress version of the article.
All in all not too cumbersome a process, but with room for improvement. Especially when corrections were required, it was much too easy to apply the correction directly in WordPress instead of doing the correction in the Wiki and republishing it. And this leads to problems in the long run. For some time I thought about some automation of text deployment to WordPress to mitigate this. But those thoughts are now obviously obsolete.
How do I want to do it in the future?
This chapter is from my early specification notes. I tried to figure out what I really want.
This is not really easy to tell. I'm still struggling to settle on one opinion about this topic.
I'd like to edit my pages with MediaWiki markup or, as an alternative, with Markdown. From the implementation side it would be simplest to keep the editing process as it is today and only change the publishing.
The publishing and the result as shown in Gitblog is quite to my taste. However, this solution is based on Java, and I think 20+ years of Java is enough. I'd like to base my own solution on Python. Not because it is so much better than Java, which might or might not be the case, but because I decided to learn Python down to its depths and such a project is a perfect opportunity.
This does not mean that I need to write everything from scratch; there are already a lot of modules in existence to build upon.
On the other hand, I'd like to be able to write my articles completely offline, just using vim as markup editor. But is this a realistic scenario? Am I not researching every detail online anyhow during the authoring? There are so many things you have read about and are quite sure about, but you need a source as a reference when you write them into an article. Will I ever really do authoring offline?
But why not both options?
During commit a pre-commit handler can check the mime type and do one thing if it is an HTML fragment to be placed into an empty HTML page template, and another thing if it is a Markdown file with the extension .md.
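Just to make that idea concrete, here is a minimal sketch of such a dispatch by file type in Python; the function and the messages are purely illustrative, not part of any implementation:

from pathlib import Path

def process_committed_file(path):
    """Dispatch a committed file by its extension (illustrative sketch only)."""
    path = Path(path)
    if path.suffix == ".html":
        # An HTML fragment would be placed into an empty HTML page template.
        print(f"wrap {path.name} into the page template")
    elif path.suffix == ".md":
        # A markdown file would be converted, e.g. with pandoc.
        print(f"convert {path.name} from markdown to HTML")

process_committed_file("author/example.md")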
Looking at the GitLab Docs 2, it is undeniable that powerful versions of Markdown exist. However, installing GitLab also means installing a big bunch of software, which is not really smaller than WordPress. In the search for a small, minimalistic solution, GitLab is probably out of scope.
To be honest, I'm not sure I ended up with fewer installations than GitLab. But as you see in this text, I initially expected that I would need to meddle with the HTML from the MediaWiki, as I did before. Fortunately, specifications are a moving target, something we developers otherwise like to complain about.
How shall it look in the future
This chapter is from my early specification notes as well. It was less off the mark than the previous chapter.
For a start it should look like before, just without those things no longer required. E.g. a logon is no longer required, since I push and merge new articles to the server, instead of logging in and using an authoring front-end.
It looks different, but not too much different
Search
Initially the search will use my YaCy instance. I have to look how well this integrates.
Yes, YaCy is integrated. But I consider the current search integration as improvable.
Side Pane?
Is a side pane required any longer? Probably not, I'm not sure.
No side pane any more.
Header Collapse or not?
Can I collapse the header with the site navigation during scroll down and make it available when the user starts to scroll up? I mean without JavaScript, only with CSS? I will see.
This point got no priority at all. Nice idea probably, but in the end I didn't care.
Small Header
The header will become smaller, since I will shorten the main site name to "Idee" with a smaller "der eigenen Erkenntnis", and I will write it in small caps to hide the problem of the uppercase "E" in "Erkenntnis" not fitting exactly to the lowercase "e" at the end of "Idee". And the header text will move on top of the header picture. Forget it. OK, the header got smaller, but that's it.
Article PDF
Every article will get a PDF download button. The PDF is not necessarily optimized for print and offline reading, but it is nonetheless a good idea to simplify access to the references in the reference section via QR codes for the respective links.
The article PDF is implemented, along with a possibility to suppress its generation for low-value content, e.g. if the "article" is just a recommendation note.
References in the PDF are rendered as in the online version, but additionally show the HTTP address as text. QR code creation for every reference does not take place.
Article Archive
WordPress shows an archive drop-down. That needs scripting and dynamic population. An alternative would be the generation of one archive page, which allows a drill-down to the year, which allows a drill-down to the month. Such pages can stay unchanged once the respective month or year has passed, as long as I do not change the portal part. The usability is most probably no worse than a drop-down, which at some point gets a bit messy to scroll on small screens.
Implemented as described. Only, in the chosen implementation a portal change does not require a regeneration of pages, which is a huge improvement compared with the initial specification.
Sizing Pictures?
Should I size pictures during commit? Should I sample audio files into a number of different qualities? A lot of options are open now, with the development of my own page factory.
Today the question mark in the title can be answered with "No".
Picture based article selection
I could create a picture gallery to select articles by picture. But then I should probably create a picture for every article... Not really; in rare cases I also do not create audio. I wouldn't force myself into picture creation where the picture does not add value.
The original text hints at it already: nothing in this regard has happened.
Semantic Web
Articles will have state of the art HTML5 article structure. This needs some intelligent logic when it comes to the correct use of tags like the cite tag. I probably need to think about the Markdown and MediaWiki representation of the HTML cite tag to make this one work nicely.
I obviously meant quotations. The markup representation for quotations is the respective HTML tag <blockquote>. The mentioned <cite> tag could probably come into use in my articles as well, in the references, but this is probably not a good idea. Possibly I'll introduce this later.
However, a lot of the semantics is simple. The article content resides in the article tag. The article tag contains a header tag, whose headline and media are descriptive for the complete article, like the QR code of the URL, the PDF file and the audio file. Video is not planned.
The HTML head meta tags for articles and the og meta tags bring a lot of invisible semantics to the page.
A lot of options. In the end I will strip this text down to those things which made it into the product.
It's all in, apart from the cite tag, which was an error in the specification.
Citation
That would be an interactive page function.
- Reuse citations I made, in various citation formats.
  - Click a function link at the footnote; the citation gets shown and a citation format can be selected.
  - The result can be used via the copy-paste buffer.
- Cite statements made by me, in various citation formats.
  - Mark a text passage in my article and a citation function link gets shown.
  - Then as above.
Yes, this function needs JavaScript, which would be a drawback from the plain-HTML philosophy I started with. Since I myself cite a lot of other publications and know the effort it takes to create citations as I need them, I think such a function is worth scripting.
Plain HTML is not a religion. It is rather: avoid scripting where it is not required. OK, it is a nice idea to make it simple for others to cite me, but it is not implemented and probably never will be.
JavaScript
JavaScript can be disabled in the browser without any impact for the casual user. Only "extended features", such as the citation feature, may rely on JavaScript. Features not available due to disabled JavaScript stay invisible to the user.
I translate myself for myself :-). Function-links are written into the page via JavaScript. If JavaScript is not enabled, the page will not contain any malfunctioning function links.
So far I did not need to follow this specification, since the implementation doesn't use JavaScript anywhere. But the specification stays valid.
Sitemaps
The "Sitemaps XML format " 3 description explains the concept and the XML document structure of sitemaps.
RSS
Updates have a reason; most probably additional or corrected information went into the article. Announcing such changes is imperative for an information provider, and the RSS.xml is the place for this.
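To illustrate what such an announcement boils down to, here is a minimal sketch of a single RSS item built with xml.etree; the element choices and the example values are mine, not the later feed implementation:

import xml.etree.ElementTree as ET

def rss_item(title, link, pub_date, description):
    """Build a single RSS <item> element (illustrative sketch)."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "guid").text = link
    ET.SubElement(item, "pubDate").text = pub_date      # RFC 822 date string
    ET.SubElement(item, "description").text = description
    return item

item = rss_item("Replacing WordPress",
                "https://idee.frank-siebert.de/article/replacing-wordpress.html",
                "Tue, 15 Mar 2022 12:00:00 +0100",
                "Update: corrected information went into the article.")
print(ET.tostring(item, encoding="unicode"))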
Monthly Archive
As sitemaps can be structured by one or more sitemap index files, it does make sense to use this to structure the sitemaps by "yyyy-MM", getting one sitemap per month.
Thus it's probably simple to create a monthly page of articles, to be selected by the user in an archive overview created from the sitemap index.
The first sentence describes how the implementation was done later. But the second sentence was too optimistic. The sitemap, if not pepped up with extensions, does not contain enough information to create monthly archive pages from it.
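For illustration, a sitemap index grouping one sitemap per month could be generated roughly like this; the file names and the URL layout are assumptions, not the actual generator code:

import xml.etree.ElementTree as ET

def sitemap_index(months, site="https://idee.frank-siebert.de"):
    """Build a sitemap index with one sitemap per month ("yyyy-MM") - a sketch."""
    index = ET.Element("sitemapindex",
                       xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for month in months:                      # e.g. ["2022-02", "2022-03"]
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{site}/sitemap/sitemap-{month}.xml"
    return ET.tostring(index, encoding="unicode")

print(sitemap_index(["2022-02", "2022-03"]))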
HTML5 Article (with prepared location for portal injection)
This is the article HTML draft. It contains a line with div id="main", which I planned to be the place where I inject the portal part via Python. That div turned out to be unnecessary; instead, a comment is now placed after the body tag and before the main tag as an include instruction. The include is performed by the nginx web server.
<!DOCTYPE html>
<html lang="de-DE" xml:lang="de-DE" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta content="pandoc, fs-commit-msg-hook 1.0" name="generator"/>
<meta content="width=device-width, initial-scale=1.0, user-scalable=yes"
name="viewport"/>
<meta content="2022-01-19T16:05:43" property="article:modified_time"/>
<meta content="2020-10-15 09:49:27" property="article:published_time"/>
<meta content="Frank Siebert" property="article:author"/>
<meta content="Idee" property="og:site_name"/>
<meta content="de-DE" property="og:locale"/>
<meta content="The Article Title" property="og:title"/>
<link href="../website/css/fs.css" rel="stylesheet"/>
<title>
The Article Title</title>
<style>
<!-- styles by pandoc -->
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
</head>
<body>
<div id="main"> <!-- Portal injection parent -->
<main>
<article>
<header>
<h1>
The Article Title</h1>
<div>
<time datetime="yyyy-MM-dd hh:mm:ss" pubdate="true">
yyyy-MM-dd</time>
<address>
Author Name</address>
<!-- probably PDF download link location -->
</div>
<!-- probably audio player location -->
</header>
<!-- article content (paragraphs, toc, headlines (< h1), images, footnotes) -->
</article>
</main>
</div>
</body>
</html>
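Just to illustrate the approach that replaced the div: with server side includes switched on in nginx (ssi on;), an include comment can be written into the generated page, for example with Beautiful Soup. This is a sketch under assumptions; the portal path is invented here:

from bs4 import BeautifulSoup, Comment

def add_portal_include(html):
    """Insert an nginx SSI include comment right after the body tag (sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    soup.body.insert(0, Comment('# include virtual="/portal/portal.html" '))
    # The comment renders as <!--# include virtual="/portal/portal.html" -->,
    # which nginx resolves when it serves the page with SSI enabled.
    return str(soup)

print(add_portal_include("<html><body><main>article</main></body></html>"))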
Scenario
This is still specification before the development started. It seems to be repetition, but it is not, because it contains a decision not made before. But it also contains things which should not be written into a scenario. If you wear the hats of the customer, the architect and the developer all in one person, then there is no second person taking care that the specification is written correctly.
Obviously I'm not completely sure about the scenario. But if it changes over time, then that's a common thing often seen also in other projects.
Decision: Material collection and writing happens in a MediaWiki. The export may happen with the current export tool, or it might happen with a Python based Wiki-page parser 4
Aspiration: I want to have useful meta tags generated during the commit. The author information, the date of publishing, the date of update and a documentation of changes should be automatically processed into HTML meta information and into a standardized representation in the visible text. This is important to ensure that corrections are processed in a transparent, reader-friendly manner.
For the sake of usability, a commit should not lead to an automatic release of the text. The commit is for the draft version only. This basically means that I will work in branches and the final publishing is done with a merge into the master branch.
Forget this branching explanation. Yes, commits generate HTML for review, but branches are not necessary, and therefore there is also no merge. The final publishing is done via the git push command, as explained earlier.
During the merge into the master branch on the server:
- The HTML is processed to contain a header, a style sheet, meta information and change markers if the document is not being merged for the first time.
- The page is fed into a search engine for indexing
- The page is fed into an rss feed generator to provide a new entry in the rss feed.
- The page is fed into a sitemap generator to provide an updated sitemap
In the end everything is done during commit, with the exception of search engine indexing, which can be done by the YaCy-Search Engine only after publishing.
Search Index
And even more specification, if you like to call it such. Probably it is more an investigation of options regarding search.
For Python some search index implementations exist. There is one do-it-yourself example by Bart de Goede 5; at the opposite end of the spectrum we find Gensim 6, which can probably do much more than just indexing; there is a module named Whoosh 7; and there is rank-bm25 8, which implements multiple variants of the BM25 search algorithm.
I tend to base my search on the latter module, and I'm curious how well this will work.
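A minimal sketch of what a search based on rank-bm25 could look like, assuming simple whitespace tokenization of already extracted article texts; this is not the final search implementation:

from rank_bm25 import BM25Okapi

articles = {
    "replacing-wordpress": "replacing wordpress with a python based static website generator",
    "some-other-article": "wordpress privacy tracker social media plugins",
}
corpus = [text.split() for text in articles.values()]
bm25 = BM25Okapi(corpus)

query = "python static website".split()
scores = bm25.get_scores(query)                 # one relevance score per article
best = max(zip(articles, scores), key=lambda pair: pair[1])
print(best)                                     # (article stem, score)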
Search Index Related Learning Material
- "Improvements to BM25 and Language Models Examined" 9
- "What is the difference between Okapi bm25 and NMSLIB?" 10
Implementation
With the chapter "Toolchain" the implementation started. The chapters are sorted by initial implementation sequence.
Toolchain
MediaWiki-Tools git ~/projects/wikitools/
This is the git used to implement the tools to access the MediaWiki instances. The default instance used is my private sammel-wiki, but there is no reason why I should not also access my installations-wiki to create postings from it. Well, except for the language, probably, since my blog is in German.
The language problem is solved with the creation of two sites, one in German and one in English.
The wikitools project git existed already, hosting the code for a program "reference.py", which scrapes web pages to create a reference tag stored in a newly created reference wiki page for the scraped website.
Related to the WordPress replacement project is the new tool "export.py", which extracts a wiki page with expanded templates as a MediaWiki markup file. The output of this tool is placed into a configured directory, which is, how convenient, the authoring directory of the authoring git.
Authoring git ~/projects/idee
Authoring takes place in the folder ./author/ , for a start via MediaWiki files. This means I can also use my MediaWiki instances for authoring and afterwards use the export.py from my wikitools to save the article as "authoring source" into this folder.
During git commit the commit-msg hook implementation checks for committed ./author/*.mediawiki files to be processed.
TODO: Consider options to structure this in
./author/yyyy/MM/
folders.
DONE: The result is NO.
The respective mediawiki files are processed into plain HTML by the method pandocmw(), pandoc being the conversion tool used.
Processing results are stored in:
- ./plain/ - the plain html files
- ./website/image/ - the image files
The plain HTML files are further processed into PDF files by the method pandoc-html-pdf(), pandoc again being the conversion tool used; a rough sketch of both pandoc calls follows after the list below.
Processing results are stored in:
- ./website/pdf/ - the pdf files
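The sketch announced above: both conversions could be driven from Python via subprocess roughly like this. The exact pandoc options, the target paths and the PDF engine are assumptions, not the code behind pandocmw() and the PDF method:

import subprocess
from pathlib import Path

def mediawiki_to_html(src):
    """Convert an ./author/*.mediawiki file to plain HTML in ./plain/ (sketch)."""
    out = Path("plain") / (Path(src).stem + ".html")
    subprocess.run(["pandoc", "-f", "mediawiki", "-t", "html",
                    "-o", str(out), str(src)], check=True)
    return out

def html_to_pdf(src):
    """Convert a plain HTML file to a PDF in ./website/pdf/ (sketch)."""
    out = Path("website/pdf") / (Path(src).stem + ".pdf")
    subprocess.run(["pandoc", str(src), "-o", str(out),
                    "--pdf-engine=weasyprint"], check=True)
    return out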
Created HTML and PDF files open automatically in Firefox for review.
At this point of the processing the Pictures as well as PDF files are supposed to be final, at least after some possible round trips of review and correction.
TODO: check the creation of an asset list for each authored article, to prevent the deployment of the article without the corresponding pictures and audios.
DONE: The result is NO. The risk of this happening is minimal, and it is also corrected very quickly if it should happen.
TODO: one option to simplify the commit of all required assets is the creation of an asset list for the committed article. Probably this can be in the form of a prepared commit message for these assets.
DONE: The result is NO. All git commands can be issued in the root of the git repository, making sure everything is included. Just "git add .", "git commit", that is simple enough.
During git commit the commit-msg hook implementation checks for committed ./plain/*.html files to be processed.
The respective plain html pages are processed by the method injectportal() into webpages of a website via:
- the injection of the portal into the page
- the placement of the PDF access link, if a pdf with the same name exists
- the placement of the HTML5 audio player, if an audio with the same name exists
The processing results are stored in:
- ./website/article/ - webpages containing articles
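A simplified sketch of this kind of processing, using Beautiful Soup, which the generator's docstring names; the element choices, file locations and relative paths are assumptions, not the real injectportal():

from pathlib import Path
from bs4 import BeautifulSoup

def build_article_page(plain_html_path):
    """Turn a ./plain/*.html file into a website article page (sketch)."""
    stem = Path(plain_html_path).stem
    soup = BeautifulSoup(Path(plain_html_path).read_text(encoding="utf-8"),
                         "html.parser")
    header = soup.find("header")

    pdf = Path("website/pdf") / f"{stem}.pdf"
    if header and pdf.exists():                  # PDF access link
        link = soup.new_tag("a", href=f"../pdf/{pdf.name}")
        link.string = "PDF"
        header.append(link)

    audio_file = Path("website/audio") / f"{stem}.mp3"
    if header and audio_file.exists():           # HTML5 audio player
        header.append(soup.new_tag("audio", controls="",
                                   src=f"../audio/{audio_file.name}"))

    out = Path("website/article") / f"{stem}.html"
    out.write_text(str(soup), encoding="utf-8")
    return out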
Then the website is updated via:
- the update of the sitemap.xml
- the update of the feed.xml
- the update of the index.html featuring the latest post as first entry.
Processing results are stored in:
- ./website/ - Entry point of the web representation
./website/ contains all website related content.
Resulting in the structure:
website
├── article
├── css
├── media
└── pdf
The privacy statement and similar administrative overhead will be deployed as articles and linked as special pages in the portal.
TODO: Think further about URL compatibility with the current WordPress site.
Option: I have exported data from the mysql table, which enables the creation of a redirect list.
DONE: Compatibility is a must and it is ensured. It is important not only for the redirect, but also to preserve the correct dates of the articles. The stem of the article page serves as URN to access the article data in the publishing list.
This git is a client git, connected to the server git. Deployment to the server is done via git merge .
Authoring git - Configuration
The configuration of the git gets stored and versioned in the git repository. The path to the configuration is ./config/ . The implemented hooks are part of the configuration and are stored in ./config/hooks/ , the preexisting examples are stored in ./config/hooks/samples/ .
./config/gitconfig
#!/bin/bash
# configure the wiki
# We develop hooks and want version control for that
git config --local core.hooksPath ./config/hooks
# We want easy reading of German äüö in the file names
git config --local core.quotepath off
# We provide some variable override options in a modified template
git config --local commit.template ./config/commit-message
# We process the files committed and need absolute file paths from $GIT_DIR
# written into the commit-message
git config --local status.relativePaths false
The configuration settings are applied with the above shown bash script.
Folder Structure
To give you a complete overview of the final git folder structure, here it is:
frank@Asimov:~/projects/idee$ tree -d
.
├── author
├── bash
├── config
│   └── hooks
│       └── samples
├── generator
│   └── __pycache__
├── nginx
├── plain
├── test
└── website
    ├── archive
    ├── article
    ├── audio
    ├── css
    ├── env
    │   └── bootstrap
    │       └── css
    ├── files
    ├── image
    ├── js
    ├── pdf
    ├── portal
    ├── qrcode
    └── sitemap

25 directories
The folder website/env/bootstrap/css contains the CSS to format the YaCy search result page. In the current implementation I refrained from somehow merging the portal header into that page. Probably I could convince YaCy to return the results as XML and render a portal page via XSLT. There is room for improvement.
The folder /generator contains the Python part of the project. If you consider software development and content development as two different projects, then the git contains the development project as a nested project.
To get a re-usable software product from this, it is required to separate the projects into separate git repositories. But for initial development the combined repository saved a lot of time.
Server git: /home/git/idee.git
This is the server git, and as such it has no work directory. When the pushed changes have been merged, a hook needs to take care of writing the content into the web server directory.
Not all content needs to be processed, only the content belonging to the website as documented above.
The implementation uses a simple fetch by a client git located in the web-server directory.
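A rough illustration of that mechanism, assuming a post-receive hook on the bare server repository and a clone of the website content under the web server root; the paths and the exact git commands are assumptions:

#!/usr/bin/python3
"""post-receive hook sketch: let the web-server clone pick up pushed content."""
import subprocess

WEBROOT_CLONE = "/var/www/idee"   # assumed location of the client git

# Fetch the pushed commits and fast-forward the checked-out branch.
subprocess.run(["git", "-C", WEBROOT_CLONE, "fetch", "origin"], check=True)
subprocess.run(["git", "-C", WEBROOT_CLONE, "merge", "--ff-only", "origin/master"],
               check=True)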
Export
There are tools you can install in your MediaWiki instance to support the export to PDF and to HTML. However, these require you to change the wiki installation, and the result might not be tailored to your needs.
To publish articles I decided to write my own export tool, extracting a single page containing the composed article, with all included templates expanded. MediaWiki ships with the special page Special:Export, which uses the same API function as my implementation. The API call was already implemented in the module mwclient.py, but to get it to work with long pages I had to change the respective GET request into a POST request. I informed the developers via their GitHub issue tracker with the message "expandtemplates should use "post" instead of "get"" (Issue #272, mwclient/mwclient) 11.
The current implementation of export.py is by no means beautified. I just began with Python and I'm pretty much unaware of established coding conventions. As I progress with Python, things will get nicer over time.
export.py
wikitools git repository
I let a lot of comments survive, which also document some of the wrong ideas I had. E.g. during the migration I used the pandoc feature to download the images from their web location on my WordPress installation. Pandoc creates its own filenames for the pictures via SHA1 hash during this process.
Afterwards I thought I should change the export function to enable pandoc to also download images from the Wiki. But the wiki requires authentication, and overall the identifier for the image might be Image:, File:, Bild: or Datei:, and in additional languages you might get additional alternatives. I'm not sure whether pandoc addresses all of those correctly, but at least that's SEP 12 as long as I do not meddle in that soup myself.
Now nothing is done in the export, the media download in pandoc is deactivated, and the image path is adjusted right after HTML creation, at a point where I already meddled with that path anyhow.
#!/usr/bin/python3
"""
Export MediaWiki Pages with expanded templates and page includes.
@author: Frank Siebert
@website: https://idee.frank-siebert.de
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
"""
import sys
import os
import getopt
import termios
import fcntl
import subprocess
import time
import re
from pathlib import Path
import configparser
from termcolor import colored
from mwclient import Site
from mwclient.errors import LoginError
HELPTEXT = 'Usage: export.py [-w \'wiki\'] \'Page_Name\'\n'\
    '\n'\
    '-w \'wiki\'   Name the wiki to be used, using the section\n'\
    '            name in the configuration file.\n'\
    '\n'\
    'Page_Name   The page to be exported. In case of spaces either\n'\
    '            surrounded by \' or with _ instead of spaces.\n'


def askforkeypress(prompt, keylist, onerror):
    """
    Ask and wait for user input.

    Parameters
    ----------
    prompt: Str
        The prompt shown to the user to ask for input.
    keylist: List
        A list of characters as possible input keys.
    onerror: Object
        An object to return to the caller in case of an error.
    """
    fileno = sys.stdin.fileno()
    oldterm = termios.tcgetattr(fileno)
    newattr = termios.tcgetattr(fileno)
    newattr[3] = newattr[3] & ~termios.ICANON & ~termios.ECHO
    termios.tcsetattr(fileno, termios.TCSANOW, newattr)

    oldflags = fcntl.fcntl(fileno, fcntl.F_GETFL)
    fcntl.fcntl(fileno, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)

    # stay in the same line to wait for input
    print(prompt, end=' ', flush=True)

    char = None
    try:
        while char not in keylist:
            try:
                char = sys.stdin.read(1)
                time.sleep(.1)
            except IOError:
                char = onerror
            except ValueError:
                char = onerror
        print(char)
    finally:
        termios.tcsetattr(fileno, termios.TCSAFLUSH, oldterm)
        fcntl.fcntl(fileno, fcntl.F_SETFL, oldflags)
    return char
def export():
    """
    Export the page as mediawiki markup.

    Uses the API used by Special:Export including templates
    https://wiki.frank-siebert.de/script-inst/index.php?title=Special:Export

    For long pages the use of POST is important. I changed the library function
    in this regard. A respective fix awaits its merge into the mwclient module.

    Request Type: POST
    Request Parameters:
    catname=&pages=Replacing+Wordpress&curonly=1&templates=1&wpDownload=
    1&wpEditToken=%2B%5C&title=Special%3AExport
    """
    print(pagename)

    # Login
    host = config[CFGSECTION]['Host']
    scriptpath = config[CFGSECTION]['ScriptPath']
    user = config[CFGSECTION]['User']
    password = config[CFGSECTION]['Passwort']
    exportdir = config[CFGSECTION]['ExportDirectory']
    references = config[CFGSECTION]['References']

    site = Site(host, path=scriptpath)
    try:
        site.login(user, password)
        site.get_token("login")
    except LoginError:
        print("login failed")

    for result in site.search(pagename, what='title'):
        print(result)

        page = site.pages.get(pagename)
        if page:
            print(page)
            # expand = page.templates().count > 0
            # might fail if client.py is updated, because
            # def expandtemplates(self, text, title=None, generatexml=False)
            # should use post instead of get
            wikitext = page.text(section=None,
                                 expandtemplates=True,
                                 cache=True,
                                 slot='main')
            # load page until it is no longer changing
            # stop waiting after 5 seconds
            webpage = wikitext
            time.sleep(.5)
            i = 10
            while len(webpage) != len(wikitext) and i > 0:
                webpage = wikitext
                time.sleep(.5)
                i -= 1
            # make clear, which one is not used from now on
            webpage = None

            # Later we might need a <reference/> tag to identify the
            # location where footnotes shall be placed.
            # Lets check non-existence of the tag and existence of the
            # reference section.
            # Insert the tag now, if it is not present at the expected place.
            # In vim the regex :%s:^=.*Fußnoten.*=\n<references.+/> finds
            # the german footnotes with the reference tag in the next line.

            # strip down to headline text only
            ref_cfg = references.strip().strip('=').strip()

            # I tried repr(pattern) to get the raw string, but got it with
            # ' at the start and end
            # repr(pattern)[1:-1] would strip them off, but this is easier
            # to read and to write
            # requires re.M
            r_ref_cfg_pattern = r"" + "\n=.*{}.*=.*".format(ref_cfg)
            # a negative lookahead for the reference tag
            r_nreference = r"" + "(?!\n.*<references.*/>.*$)"
            # ok if footnote header found without reference tag
            r_okpattern = r"" + r_ref_cfg_pattern + r_nreference

            okcheck = re.compile(r_okpattern)
            exists = okcheck.search(wikitext)
            if exists:
                print(colored('\nWARNING:', 'yellow'),
                      "<references/> tag is not in the expected location")
                print("Automatic insertion of the tag takes place.")
                wikitext = okcheck.sub(exists.group(0) +
                                       "\n<references/>", wikitext)

            # replace category references
            categorypattern1 = re.compile(r"(.*)",
                                          flags=re.MULTILINE)
            wikitext = categorypattern1.sub(r"\1", wikitext)

            # replace category references, take care not to touch Images
            categorypattern2 = re.compile(r"\[\[[K|C]ategor.*\|(.*)\]\]",
                                          flags=re.MULTILINE)
            wikitext = categorypattern2.sub(r"\1", wikitext)
            # A mediawiki extension enables me to embed images from my
            # idee domain into articles I write in the wiki just by pasting
            # the URL into an otherwise empty paragraph.
            # https://something....filename.ext
            # This, who wonders, is not recognized by pandoc.
            # I need to beautify these image links.
            # only if the line starts with http
            # transitional code for the migration
            imagepattern = re.compile(r"" + "^http(.*)png", re.MULTILINE)
            wikitext = imagepattern.sub(r"[[Image:http\1png|No Caption]]",
                                        wikitext)
            imagepattern = re.compile(r"" + "^http(.*)jpg", re.MULTILINE)
            wikitext = imagepattern.sub(r"[[Image:http\1jpg|No Caption]]",
                                        wikitext)

            # Above imagepattern replacements take care for external
            # links, as I used them in the past to upload images to WordPress
            # instead to the MediaWiki, when I intended to use them in
            # articles.
            #
            # The new scenario is the upload to the MediaWiki and to create
            # image entries with Capture Text and probably Size Information.
            # For those the image file name needs to be expanded with the
            # URL to retrieve the images from the Wiki.
            #
            # The exported image information looks as follows:
            # [[Image:Imagename.png|NNxNNpx|Capture Text]]
            # Or:
            # [[Image:Imagename.png|Capture Text]]
            #
            # The text "Imagename.png" needs to be expanded into:
            # $host/$script-path/index.php?title=File:Imagename.png
            #
            # Instead of Image the returned WikiText may contain Bild or File
            # or Datei as Keyword.

            # This became a NOOP, because changing the image path here
            # to help pandoc to download them does not make any sense.
            # Probably an image export function would make sense here.
            # mstr1 = r"\[\[(Image|Bild|Datei|File):"
            # mstr2 = r"([^h][^t][^t][^p].*p[n|j]g)|.*\]\]"
            # mstr = mstr1 + mstr2
            # imagepattern = re.compile(mstr, flags=re.MULTILINE)
            # replstr = r"" + site + r"index.php?title=" + r"\2"
            # wikitext = imagepattern.sub(replstr, wikitext)

            # Write the result to disk.
            # TODO: Enable also environment variables to determine HOME.
            # Using ~ is nice, but quite OS dependent
            if exportdir[0] == '~':
                wikifile = Path.home() / exportdir.strip('~').strip('/') \
                    / "{0}.{1}".format(pagename, "mediawiki")
            else:
                wikifile = Path(exportdir) \
                    / "{0}.{1}".format(pagename, "mediawiki")
            wikifile = wikifile.resolve()

            with open(wikifile, 'w') as outfile:
                print(wikitext, file=outfile)
                outfile.flush()
                outfile.close()
            print('\nThe mediawiki file was exported to:\n' + outfile.name)

            prompt = '\nDo you want to review the file? yes/no (y/n):'
            pressed = askforkeypress(prompt=prompt,
                                     keylist=['y', 'Y', 'n', 'N'],
                                     onerror='n')
            if pressed in ['y', 'Y']:
                subprocess.run(["vim", wikifile])

            # TODO: Consider to execute the commit into the wiki git
            # htmltext = subprocess.run(["pandoc", "-f", "mediawiki", \
            #     "-t", "html"], input=bytearray(wikitext.encode()), \
            #     capture_output=True)
            # print(htmltext.stdout.decode("utf-8"))
        break
    return
if __name__ == "__main__":
    # Check command line arguments, provide help and call the functions
    CREATEFLAG = False

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hw:", ["help", "wiki"])
    except getopt.GetoptError:
        print(HELPTEXT)
        sys.exit(2)

    # Defaults
    CFGSECTION = 'Default'

    for opt, arg in opts:
        if opt in {"-h", "--help"}:
            print(HELPTEXT)
            sys.exit()
        if opt in {"-w", "--wiki"}:
            CFGSECTION = arg

    arg_names = ['pagename']
    args = dict(zip(arg_names, args))

    # print(args)
    # Kept as inspiration for future
    # ------------------------------
    # Arg_list = collections.namedtuple('Arg_list', arg_names)
    # args = Arg_list(*(args.get(arg, None) for arg in arg_names))
    pagename = args.get('pagename')
    if not pagename:
        print(colored('\nERROR:', 'red'), 'Page_Name parameter is missing.')
        print(HELPTEXT)
        sys.exit(2)

    config = configparser.ConfigParser()
    configpath = Path.home() / '.config' / 'wikitools' / 'wikitools.cfg'
    config.read(configpath)

    sections = config.sections()
    if CFGSECTION not in sections:
        print(colored('\nERROR:', 'red'), 'Configuration is missing.')
        print(HELPTEXT)
        sys.exit(3)

    export()
    sys.exit(0)
~/.config/wikitools/wikitools.cfg
The code reads a configuration, which enables me to post the code without the risk of exposing my user and password for the wiki instances I use.
One default wiki can be configured, and as many additional wikis as you like. With the command line parameter -w you can address the configuration section you want to use in the current export.
The configuration is shared between multiple wikitools.
#
# The configurations location has to be
#
# UserHome/.config/wikitools/
#
# If the command line names no wiki section
# the Default section is used.
#
# The command line option -w
# with a parameter can be used to name a wiki,
# for which a configuration section exists.
#
# access to a key from python
# config[cfgsection]['key']
[Default]
DefaultCategory = Your Category
Host = Your wiki host name
ScriptPath = Your wiki script path
User = Your wiki user name
Passwort = Your wiki password
References = == Fußnoten ==
ExportDirectory = ~/projects/idee/author
[yourwiki]
DefaultCategory = Your Category
Host = Your wiki host name
ScriptPath = Your wiki script path
User = Your wiki user name
Passwort = Your wiki password
References = == Footnotes ==
ExportDirectory = ~/projects/idee/author
WordPress Migration
Meta data and, if I decide to use this, also the content of WordPress articles and pages are available in the MariaDB database wp_idee.
In the context of the project it made sense to grant remote access to the MariaDB from the local area network. This is well described in the official documentation 13 .
root@sol:~# cd /etc/mysql/mariadb.conf.d/
root@sol:/etc/mysql/mariadb.conf.d# ls
50-client.cnf 50-mysql-clients.cnf 50-mysqld_safe.cnf 50-server.cnf
root@sol:/etc/mysql/mariadb.conf.d# vim 50-server.cnf
In the file 50-server.cnf, comment out the bind-address 127.0.0.1, making the server bind to the network card's addresses. In my case there is just one of these.
[...]
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
# bind-address = 127.0.0.1
[...]
Restart the SQL server.
root@sol:/etc/mysql/mariadb.conf.d# systemctl restart mysql
Log in to the SQL server as root to manage users.
root@sol:/etc/mysql/mariadb.conf.d# mysql -p
MariaDB [(none)]> SELECT User, Host FROM mysql.user;
+-----------+-----------+
| User | Host |
+-----------+-----------+
| ninja | localhost |
| root | localhost |
| wiki | localhost |
| wordpress | localhost |
+-----------+-----------+
4 rows in set (0.000 sec)
MariaDB [(none)]> CREATE USER wpremote@'10.19.67.%' IDENTIFIED BY 'password-of-new-user';
Query OK, 0 rows affected (0.001 sec)
MariaDB [(none)]> SELECT User, Host FROM mysql.user;
+-----------+------------+
| User | Host |
+-----------+------------+
| wpremote | 10.19.67.% |
| ninja | localhost |
| root | localhost |
| wiki | localhost |
| wordpress | localhost |
+-----------+------------+
5 rows in set (0.000 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON wp_idee.* TO 'wpremote'@'10.19.67.%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.017 sec)
Remote Access with the new user.
frank@Asimov:~$ mysql -u wpremote -h sol -p wp_idee
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 205
Server version: 10.3.31-MariaDB-0+deb10u1 Debian 10
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [wp_idee]>
Table wp_posts
MariaDB [wp_idee]> describe wp_posts;
+-----------------------+---------------------+------+-----+---------------------+
| Field | Type | Null | Key | Default |...
+-----------------------+---------------------+------+-----+---------------------+
| ID | bigint(20) unsigned | NO | PRI | NULL |
| post_author | bigint(20) unsigned | NO | MUL | 0 |
| post_date | datetime | NO | | 0000-00-00 00:00:00 |
| post_date_gmt | datetime | NO | | 0000-00-00 00:00:00 |
| post_content | longtext | NO | | NULL |
| post_title | text | NO | | NULL |
| post_excerpt | text | NO | | NULL |
| post_status | varchar(20) | NO | | publish |
| comment_status | varchar(20) | NO | | open |
| ping_status | varchar(20) | NO | | open |
| post_password | varchar(255) | NO | | |
| post_name | varchar(200) | NO | MUL | |
| to_ping | text | NO | | NULL |
| pinged | text | NO | | NULL |
| post_modified | datetime | NO | | 0000-00-00 00:00:00 |
| post_modified_gmt | datetime | NO | | 0000-00-00 00:00:00 |
| post_content_filtered | longtext | NO | | NULL |
| post_parent | bigint(20) unsigned | NO | MUL | 0 |
| guid | varchar(255) | NO | | |
| menu_order | int(11) | NO | | 0 |
| post_type | varchar(20) | NO | MUL | post |
| post_mime_type | varchar(100) | NO | | |
| comment_count | bigint(20) | NO | | 0 |
+-----------------------+---------------------+------+-----+---------------------+
23 rows in set (0.003 sec)
Post Types
MariaDB [wp_idee]> select distinct post_type from wp_posts;
+---------------+
| post_type |
+---------------+
| attachment |
| nav_menu_item |
| page |
| post |
| revision |
+---------------+
5 rows in set (0.003 sec)
The post types of interest are most probably only page and post. I'm not sure about revision, but we can take a look at what's tagged as post type "revision".
The query "select post_title from wp_posts where post_type='revision';" returns the same titles repeatedly; most probably the post_modified information differs between those rows.
The query "select post_title from wp_posts where post_type='page';" shows only 2 pages, which are already processed into one for the new solution.
The query "select post_title from wp_posts where post_type='post';" shows every title only once, i would hope with the initial post information only. I'll not only hope, but check this before I continue.
The query "select post_name from wp_posts where post_type='post';" shows the pages unique identifiers, the last element of the pages URL. This information is very helpful to ensure that the migrated pages are named exactly the same in the new solution and are found via one redirect instruction in nginx. The redirect is important, since I will not have the date encoded in the URL, as I had it in wordpress.
Post Mime Types
Post mime types help to select the PDFs, audio files, zip files and spreadsheets which had been embedded into posts in WordPress. The HTML posts themselves have the mime type "" in this DB, which saves space, but is irritating.
Missing Information
One piece of information is missing: the language or locale of the posted content. In the new solution I'll use de-DE and en-US as locales and as language information. I did not really create a lot of English pages, but this might change, and the few I already have shall be presented correctly.
WordPress Meta Data Export
Create an export query for the HTML pages posted in WordPress. Include
- a site column with default value "Idee",
- a locale column with default value "de-DE",
- an author column with default value "Frank Siebert"
to be changed manually for the few existing English posts into
- site = "Concept" (Concept of new cognition elicitation personally thinking) replacing Idee (Idee der eigenen Erkenntnis). That's the best recursive acronym translation I found.
- locale = "en-US"
Get the earliest post date into one column and the latest post date into another column of the same row. Have one row per post.
The export query is created in a bash script, whose output can be piped into a local tab delimited file. The last line of this file needs to be deleted, since it contains an automatically saved draft, and .. (see next chapter).
wpmeta bash script:
#!/bin/bash
sql="select \
min(post_date) over (partition by post_name) as post_date, \
max(post_modified) over (partition by post_name), \
max(comment_count) over (partition by post_name), \
'Idee' as site, \
'de-DE' as locale, \
'Frank Siebert' as author, \
post_name, \
post_title \
from wp_posts \
where (post_mime_type='' and post_type='post') \
order by post_date asc;"
mysql wp_idee -u wpremote -h sol -p'not-exposed-pwd' \
--default-character-set=utf8 -N -e "$sql" > ../config/migrationlist.csv
Using bash for the query and persisting the result by directing the output into a file, the data in the file becomes tab-delimited.
Special-Case for nginx redirect
Querying the database revealed: the post "Social Distancing und Lockdown" has, for unknown reasons, the URL "/2021/01/26/255/". This will not be the case in the new solution; I'll not have a 255.html there, so I'll either create a special case in the redirect for this article or rather ignore this case altogether.
In the result of the final export query a corrected post_name needs to be maintained to migrate this article correctly.
Option in Consideration
For all articles written in the Wiki and published afterwards, export.py from the wikitools will do perfect service, I hope.
But things change over time, and quite a number of articles were written in WordPress when I had no MediaWiki in place. Also some last-minute typo corrections, I know this, were maintained directly in WordPress after publishing.
And the article "Das SARI-Rätsel" contains an interactive JavaScript part, which required some authoring in HTML. This last point might not stay a single happenstance; as a developer I might more often find a reason to extend a page functionally, or even to write the complete page directly in HTML as the source format.
My concept does support such deviations from the standard scenario. But this is not the point here. The point is: I need a wp-export tool to generate MediaWiki files from the current WordPress pages for the migration. That's the only way to ensure that the migrated pages contain exactly what they are supposed to contain. With this I can also incorporate comments I posted after publishing as update notes.
I need to leave a note about the last paragraph. For the reasons already mentioned above I made a highly manual migration. Overall content quality improved considerably during migration, even if I missed one or two typos.
A migration file created via SQL from the wp_posts table is stored as ./config/migration.csv (with spaces instead of colons), containing date, time and title columns.
wp-export.py
This was never implemented. I place this tool in the directory ./tools/ of the authoring git and create an alias wpe for the single-file export, and probably I'll also have a wpbe command as WordPress batch export.
Manage Meta Data
I now have migration data available in a tab-delimited CSV, and I need to manage publishing meta data to make sure the correct meta data is shown in every web page, in the sitemaps and in the RSS feed.
I started to implement this as a specialized dictionary implementation, but I hesitate to proceed in this direction. The Python modules pandas and agate seem to offer fascinating power for working with CSV files, and I have to investigate both in much more detail in the future.
My choice fell on the module agate (documentation: "agate 1.6.3" 14) for this implementation, since it is the smaller implementation and because I do not need any extensive statistical horsepower for the stored meta data. I use the module in version 1.6.1, as currently provided by Debian Stable.
I have to revise my decision. Agate makes reading the CSV easy and provides a powerful table object, making the data accessible very nicely. The documentation states that it always returns copies of the data structure, which I like very much, but it fails to show options to update data in the table object, which I need to do without getting new instances of the meta data table next to the singleton created for that purpose.
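For comparison, a small sketch of the same kind of access with pandas; the column names follow the wpmeta query above, and the update shown is only illustrative, not the real meta data manager:

import pandas as pd

COLUMNS = ["post_date", "post_modified", "comment_count",
           "site", "locale", "author", "post_name", "post_title"]

# The wpmeta script produces a tab-delimited file without a header line.
meta = pd.read_csv("config/migrationlist.csv", sep="\t", names=COLUMNS)

# Update a single article in place, keyed by its post_name (the URL stem).
mask = meta["post_name"] == "replacing-wordpress"
meta.loc[mask, "post_modified"] = "2022-03-15 12:00:00"

meta.to_csv("config/migrationlist.csv", sep="\t", header=False, index=False)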
Installing python3-pandas
frank@Asimov:~$ sudo apt-get install python-pandas-doc python3-pandas
[sudo] password for frank:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libblosc1 libclang-cpp9 libffi-dev liblbfgsb0 libllvm9 libncurses-dev libpfm4
libtinfo-dev libz3-dev llvm-9 llvm-9-dev llvm-9-runtime llvm-9-tools
numba-doc python-odf-doc python-odf-tools python-tables-data
python3-bottleneck python3-et-xmlfile python3-iniconfig python3-jdcal
python3-llvmlite python3-numba python3-numexpr python3-odf python3-openpyxl
python3-pandas-lib python3-py python3-pytest python3-scipy python3-tables
python3-tables-lib python3-xlwt
Suggested packages:
ncurses-doc llvm-9-doc python-bottleneck-doc llvmlite-doc nvidia-cuda-toolkit
python3-statsmodels python-scipy-doc python3-netcdf4 python-tables-doc
vitables python3-xlrd python-xlrt-doc
The following NEW packages will be installed:
libblosc1 libclang-cpp9 libffi-dev liblbfgsb0 libllvm9 libncurses-dev libpfm4
libtinfo-dev libz3-dev llvm-9 llvm-9-dev llvm-9-runtime llvm-9-tools numba-doc
python-odf-doc python-odf-tools python-pandas-doc python-tables-data
python3-bottleneck python3-et-xmlfile python3-iniconfig python3-jdcal
python3-llvmlite python3-numba python3-numexpr python3-odf python3-openpyxl
python3-pandas python3-pandas-lib python3-py python3-pytest python3-scipy
python3-tables python3-tables-lib python3-xlwt
0 upgraded, 35 newly installed, 0 to remove and 1 not upgraded.
Need to get 84.2 MB of archives.
After this operation, 496 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
A lot of suggestions next to a lot of required packages. If I did not intend to go deeper into data analysis with Python, I would stay with my initial dict-based implementation instead.
python3-pandas web references
- "pandas documentation" 15
- "Add new rows and columns to Pandas dataframe" 16 With the most helpful explanation to insert via loc (or update via iloc) rows with:
len(df.index)]=list(data[0].values()) df.loc[
- "Pandas Tutorial" 17
Bioconductor
The search "derive from dataframe" had one result on startpage.com. This search result does not help me to find out, what to take special care for when I derive my own class from the dataframe class, but it relates strongly to a lot of articles I posted on my site.
- "Getting Started with Bioconductor 3.7" 18
A quick check reveals - it is available in Debian just at my fingertips.
frank@Asimov:~$ sudo apt-cache search bioconductor
bio-tradis - analyse the output from TraDIS analyses of genomic sequences
libtfbs-perl - scanning DNA sequence with a position weight matrix
q2-dada2 - QIIME 2 plugin to work with adapters in sequence data
r-bioc-affy - BioConductor methods for Affymetrix Oligonucleotide Arrays
r-bioc-affyio - BioConductor tools for parsing Affymetrix data files
...
r-bioc-variantannotation - BioConductor annotation of genetic variants
r-bioc-xvector - BioConductor representation and manipulation of external
sequences
r-bioc-zlibbioc - (Virtual) zlibbioc Bioconductor package
r-cran-ape - GNU R package for Analyses of Phylogenetics and Evolution
r-cran-biocmanager - access the Bioconductor project package repository
I suppose I'll never have enough leisure time to dig into everything I'm interested in. And it is written in R, a programming language I would like to learn as well.
I have to remind myself from time to time that I'm doing this to learn, not to prove I can do this implementation in a very short time. More reading, less coding; take your time and it will take less time.
MediaWiki to Plain HTML Conversion
You are right if you assume that I started writing Python code much earlier. But we are now finally at the point where the first parts of the final Python implementation can be used in the final setup.
I spare you every evolutionary step of the Python code development. All following code is in its most recent state.
Commit-msg Message Hook
Originally I wrote a commit-msg hook directly as an executable Python program. But the message hook shown next is a bash script.
~/projects/idee/config/hooks/commit-msg
#!/bin/bash
/usr/bin/python3 generator/commitmsg.py $1
The parameter $1 is the path to the commit message file, for which the solution uses a modified template.
Commit Message Template
The commit message template provides the possibility to define some meta data to be taken into the content and to steer the different content creation options.
~/projects/idee/config/commit-message
# Overwrite values if neccesary, based on https://ogp.me/
# pdf:draft=false
# og:locale=de-DE
# og:site_name=Idee
# article:author=Frank Siebert
#
Note: There is a significant empty line as the first line.
Probably I will eliminate either the og:site_name or the og:locale line in future. I'm not yet sure.
Main Program: commit-msg.py
You might have seen it above: this is the Python program called by the message hook bash script. The commit message file is passed on as a parameter.
commit-msg.py registers workers at a dispatcher class; each worker takes care of specific commit message entries, identified by a regex match.
~/projects/idee/generator/commit-msg.py
"""Website Generator - "pandoc, fs-commit-msg-hook 1.0".
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
https://wiki.frank-siebert.de/inst/Replacing_Wordpress
https://idee.frank-siebert.de/article/replacing-wordpress.html
Website Generator uses Beautiful Soup, Pandoc and GIT to manage
authoring in *.mediawiki files and to convert those into:
* plain html pages, one per mediawiki file as article
* PDF files, one per mediawiki file for article download
* Web site portal pages by injecting the portal into the plain html pages
Website Generator generates as additional portal assets:
* sitemap.xml
* feed.xml
* ...
Website Generator works with Python 3 and up. It works better if lxml
and/or html5lib is installed, as Beautiful Soup states it runs better then.
"""
# Systen Imports
import sys
import getopt
from termcolor import colored
from gitmsgdispatcher import GitMsgDispatcher
from mwworker import MwWorker
from pdfworker import PdfWorker
from plainworker import PlainWorker
# Ask for a key press and return the pressed key, if it is part of the keylist.
# In case of errors return the value of onerror, to enable the caller to
# decide on the most convinient way to preceed.
if __name__ == "__main__":
    HELPTEXT = 'Usage: commit-msg \'message-file\'\n'\
               '\n'\
               'message-file The commit message file with the list of files\n'\
               '             to be processed.\n'
    try:
        opts, args = getopt.getopt(sys.argv[1:], "h:", ["help"])
    except getopt.GetoptError:
        print(HELPTEXT)
        sys.exit(2)
    for opt, arg in opts:
        if opt in {"-h", "--help"}:
            print(HELPTEXT)
            sys.exit()

    arg_names = ['message-file']
    args = dict(zip(arg_names, args))
    # print(args)

    # Kept as inspiration for future
    # ------------------------------
    # Arg_list = collections.namedtuple('Arg_list', arg_names)
    # args = Arg_list(*(args.get(arg, None) for arg in arg_names))

    messagefile = args.get('message-file')
    if not messagefile:
        print(colored('\nERROR:', 'red'), 'message-file parameter is missing.')
        print(HELPTEXT)
        sys.exit(2)

    mwworker = MwWorker(r".*(new file|modified).*author[/].*\.mediawiki")
    plainworker = PlainWorker(r".*(new file|modified).*plain[/].*\.html")
    pdfworker = PdfWorker(r"" + PdfWorker.pdfworkitem)

    disp = GitMsgDispatcher(messagefile, [mwworker, plainworker, pdfworker])

    sys.exit(0)
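To try the program outside of an actual commit, it can be called by hand with any saved message file; .git/COMMIT_EDITMSG from the last commit is a convenient input. This is only a sketch of a manual test run, using the same call as the hook script:
cd ~/projects/idee
/usr/bin/python3 generator/commitmsg.py .git/COMMIT_EDITMSG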
Message Dispatcher: gitmsgdispatcher.py
The message dispatcher dispatches work-items from the git message to registered workers. Those workers can then place new work-items for later workers, which pick up the work where the earlier ones stopped.
~/projects/idee/generator/gitmsgdispatcher.py
"""
GitMessageDispatcher with MsgWorker base class.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
Instantiate the GitMessageDispatcher with the message
and with specialized workers. The workers provide a
pattern matching the lines they claim for work.
"""
import re
from pathlib import Path
from pubmetadata import PubMetaData
from sitemap import SiteMap
from archive import Archive
from rssbuilder import RSSBuilder
from idxbuilder import IDXBuilder
class GitMsgDispatcher:
"""
Dispatch the lines of the git message to registered workers.
Parameters
----------
gitmessagepath : Path
Path as type str or type Path pointing to the git message.
msgworkers : List of MsgWorker
The list of message workers is used as worker queue. Workers first
in the queue get their workitems first.
Workers can return their work result to be picked up by
later workers.
The ParameterValueWorker always runs first. Do not add
the ParameterValueWorker to the provided list of workers,
or it will run twice.
The ParameterValueWorker takes care to provide the parameter
values provided by the message in place for all workers.
When all workers finished, the sitemap is updated, the RSS
feed is updated and the index page is updated.
Returns
-------
GitMsgDispatcher.
"""
def __init__(self, gitmessagepath, msgworkers):
"""
Dispatch the lines of the git message to registered workers.
Parameters
----------
gitmessagepath : Path
DESCRIPTION.
*msgworkers : List of MsgWorker
DESCRIPTION.
Returns
-------
GitMsgDispatcher.
"""
self.gitmessagepath = gitmessagepath
"""
Extract the relevant part of the message
"""
self.worklist = []
self.parameters = ParameterValueWorker()
with open(self.gitmessagepath, 'r') as infile:
# TODO: Do Better. Latest when the git server joins the game
# Message section of most interest: "Changes to be committed"
# But we are also interested in the parameter values
# we placed earlier into the file.
# The Start helps us to find the End.
= re.compile(r"^# Changes to be committed:")
start # Next uppercase entry starts another message section
= re.compile(r"^# [A-Z]")
end
= None
started for line in infile:
if self.parameters.pattern.match(line):
self.worklist.append(line)
if not started and start.match(line):
= True
started elif started and end.match(line):
= False
started if started is True:
self.worklist.append(line)
if started is False:
break
infile.close()
self.msgworkers = msgworkers
self.msgworkers.insert(0, self.parameters)
self.dispatch()
# Save changed publishing meta data, if any.
if PubMetaData.instance:
PubMetaData.instance.save()# Generate Sitemaps (bilingual)
SiteMap().update()# Generate Archive
Archive().update()# Generate RSS feed (bilingual)
RSSBuilder().update()# Generate Index Pages (English and German Version)
IDXBuilder().update()
def dispatch(self):
"""
Dispatch the git message lines to registered message workers.
Returns
-------
None.
"""
        for worker in self.msgworkers:
            print("Dispatching work to: {}".format(type(worker)))
            for item in self.worklist:
                if type(item) == str:
                    if worker.pattern.match(item):
                        worker.work(self, item)
                if type(item) == dict:
                    worker_match = item[MsgWorker.task_worker_match]
                    if worker.pattern.match(worker_match):
                        worker.work(self, item)
class MsgWorker:
"""
Create a worker for lines matching the pattern.
Parameters
----------
pattern : Pattern
A regex pattern matching the lines the worker takes care for.
Returns
-------
MsgWorker.
"""
# Cases to work against
= re.compile(r"^#.*(modified:|new file:)")
CREATEPAT = re.compile(r"^#.*renamed:")
RENAMEPAT = re.compile(r"^#.*deleted:")
DELETEPAT
# Task types
= "workermatch"
task_worker_match = "tasktype"
task_type = "create"
task_create = "rename"
task_rename = "delete"
task_delete
def __init__(self, pattern):
if isinstance(pattern, str):
self.pattern = re.compile(pattern)
elif isinstance(pattern, re.Pattern):
self.pattern = pattern
self.dispatcher = None # initialized in work method
self.item = None # initialized in work method
self.inpath = None # initialized in work method, if any
self.delpath = None # initialized in work methid, if any
self.outpath = None # initialized in work method, if any
def get_pattern(self):
"""
Get the pattern for matching list items.
Returns
-------
Pattern
A regex pattern matching the lines the worker takes care for.
"""
return self.pattern
def work(self, dispatcher, item):
"""
Overwrite this method to implement if required.
Call the super().work() method in your new method, to get
- self.dispatcher initialized
- the inpath initialized (if any), to the file to be processed
- the delpath initialized (if any)
- the delete() or the process() method or both called, whatever applies
Parameters
----------
dispatcher : GitMsgDispatcher
The dispatcher, which assigned the work item
item : str or dict
One matching line from the git message.
Or a complex workitem added by earlier workers.
Returns
-------
None.
"""
self.dispatcher = dispatcher
self.item = item
# For some workers a dictionary is passed as item
if isinstance(item, str):
            filename = item[14:].strip()
            self.inpath = Path(filename)
            if self.RENAMEPAT.match(item):
                # part filename in new and old
                # filename = line[14:].strip()
                # self.inpath = Path(filename)
                self.delpath = None  # Needs to be assigned now
                self.rename()
            if self.CREATEPAT.match(item):
                self.process()
            if self.DELETEPAT.match(item):
                self.delpath = self.inpath  # clear deletion request
                self.delete()
        if isinstance(item, dict):
            self.inpath = None
            self.delpath = None
            if item[self.task_type] == self.task_rename:
                self.rename()
            if item[self.task_type] == self.task_create:
                self.process()
            if item[self.task_type] == self.task_delete:
                self.delete()
def delete(self):
"""
Overwrite this method to implement the actual work to be done.
The method is called by super.work(), if the message line is
about a rename or a deletion.
Since renames might go along with additional content change, deletion
and re-processing take place both in that case.
The path to the resource named in the message is available via
self.delpath
Depending on the type of content, more than just deleting the
file might be required.
Parameters
----------
None
Returns
-------
None.
"""
def process(self):
"""
Overwrite this method to implement the actual work to be done.
The method is called by super.work(), if the message line is
about a rename or a new file.
Since renames might go along with additional content change, deletion
and re-processing take place both in that case.
The path to the resource named in the message is available via
self.delpath
Parameters
----------
None
Returns
-------
None.
"""
def rename(self):
"""
Rename by delete and process.
Since we do not know, whether next to the rename additional
changes were applied, deletion and recreation is safest.
"""
self.delete()
self.process()
class ParameterValueWorker(MsgWorker):
"""
The ParameterValueWorker reads parameter value pairs.
Example of a line with parameter value pair:
# article:author=Firstname Lastname
These parameters in the git message allow the injection
of values for metadata, which would be otherwise not available.
Other workers can access the values dictionary via:
dispatcher.parameters.values
Parameters
----------
super: MsgWorker
The ParameterValueWorker is derived from the MsgWorker.
Returns
-------
ParameterValueWorker.
"""
def __init__(self, pattern=r"^#.*="):
super().__init__(pattern)
self.values = {}
def work(self, dispatcher, item):
"""."""
super().work(dispatcher, item)
if item.count('=') == 1:
            lineparts = item.rpartition('=')
            self.values.update(
                {lineparts[0].strip('#').strip():
                 lineparts[2].strip()}
            )
if __name__ == "__main__":
pass
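To see how the dispatcher and the worker base class fit together, here is a minimal sketch of a hypothetical additional worker. The class name, the pattern and the message file are invented for illustration; note that the dispatcher also runs the site-wide update steps at the end, so this is meant to show the wiring, not to be a standalone tool.
from gitmsgdispatcher import GitMsgDispatcher
from gitmsgdispatcher import MsgWorker


class AudioWorker(MsgWorker):
    """Hypothetical worker reacting to committed audio files."""

    def process(self):
        # self.inpath was set by MsgWorker.work() from the message line
        print("would process audio file:", self.inpath)


if __name__ == "__main__":
    worker = AudioWorker(r".*(new file|modified).*audio[/].*\.mp3")
    # the dispatcher reads the message file and hands matching lines
    # to the registered workers, ParameterValueWorker always runs first
    GitMsgDispatcher(".git/COMMIT_EDITMSG", [worker])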
MediaWiki Worker: mwworker.py
This worker converts the MediaWiki file into the plain HTML version, which is used for copy-edit reading, a task I combine with the audio recording.
I found out that I spot typos best when I read the text aloud. Creating the audio is therefore a good opportunity for me to improve the quality of the written text.
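The heavy lifting of the conversion is delegated to Pandoc. Stripped of all post-processing, the call the worker makes corresponds roughly to the following command line; the file names are invented for illustration:
pandoc -s --toc --toc-depth=5 -f mediawiki -t html author/Example.mediawiki > plain/example.html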
~/projects/idee/generator/mwworker.py
"""
MwWorker is derived from the MsgWorker base class.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
The MwWorker takes care of *.mediawiki files
in the author directory, if changes are committed
for them.
"""
import re
import subprocess
import sys
from bs4 import BeautifulSoup
from bs4 import Comment
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from gitmsgdispatcher import MsgWorker
from gitmsgconstants import GitMsgConstants as gmc
from pdfworker import PdfWorker
from pubmetadata import PubMetaData
from pubmetadata import pageurn
class MwWorker(MsgWorker):
"""
The MwWorker takes care of *.mediawiki files in the author/ directory.
Example of a line taken care for
# modified: author/PDF-Icon.mediawiki
The line has to be from the section git message section:
# Changes to be committed:
The main output is an HTML created from the mediawiki file,
which is plain (without portal part) and stored in the
folder GITROOT/plain/
A minor output, a PDF, might be requested via the message line:
# pdf:draft=true
The respective PDF is created from HTML and stored in the folder
GITROOT/website/pdf/
Parameters
----------
super: MsgWorker
The MwWorker is derived from the MsgWorker.
Returns
-------
MwWorker.
"""
def __init__(self, pattern):
super().__init__(pattern)
self.values = {}
@staticmethod
def __make_url_migration__(soup):
r"""
Migrate the wordpress url pattern to the new one.
Articles: r"idee.frank.siebert.de.\d{4}.\d{2}.\d{2}"
Parameters
----------
soup : BeautifulSoup
HTML represented by BeautifulSoup top level object
Returns
-------
soup. r"https://idee.frank-siebert/"
"""
= r"(https://idee\.frank-siebert\.de)"
site_r = r"([/]\d{4}[/]\d{2}[/]\d{2}[/])" # '/yyyy/MM/dd'
date_r = r"[/][a][r][t][i][c][l][e][/]"
article_r
# Links to own articles will be addressed by relative path,
# In article migration we point to pages in the same location.
= re.compile(site_r + date_r)
repattern = soup.find_all("a", attrs={"href": repattern})
tags
for tag in tags:
# in case page internal id was addressed
= tag["href"].split("#")
url 0] = "./" + repattern.sub("", url[0].rstrip("/"))\
url[+ ".html"
= '#'.join(url)
new_url = new_url.lower() # change camel case to lower case
new_url "href": new_url})
tag.attrs.update({
# References to own articles in the new portal
# shall be relative as well.
= re.compile(site_r+article_r)
reart = soup.find_all(re.compile(r"^a$"), attrs={"href": reart})
tags for tag in tags:
= "./" + reart.sub("", tag["href"])
new_url = new_url.lower()
new_url "href": new_url})
tag.attrs.update({return soup
def process(self):
"""
Process the mediawiki files into plain html files.
Returns
-------
None.
"""
# The file name is the title
        title = self.inpath.stem

        # inject meta information from commit message
        # Creates the single instance of Publishing Dictionary
        PubMetaData(self.dispatcher.parameters.values)
        article_data = PubMetaData.instance.get_new_revision(
            title
        )
# compose the output path
self.outpath = gmc.plainpath / pageurn(title)
self.outpath = self.outpath.with_suffix(".html")
self.outpath.resolve()
# To enable --toc, the parameter -s (standalone) needs to be set.
# This parameter leads to the generation of an html header with
# some meta tags.
# <!DOCTYPE html>
# <html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
# <head>
# <meta charset="utf-8"/>
# <meta content="pandoc" name="generator"/>
# <meta content="width=device-width, initial-scale=1.0,
# user-scalable=yes" name="viewport"/>
# <title>
# Verstehen
# </title>
# <style>
# code{white-space: pre-wrap;}
# span.smallcaps{font-variant: small-caps;}
# span.underline{text-decoration: underline;}
# div.column{display: inline-block; vertical-align:
# top; width: 50%;}
# div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
# ul.task-list{list-style: none;}
# </style>
# </head>
# <body>
# </body>
# </html>
# The TOC is created as <nav id="TOC"> tag,
# and it is not placed at the __TOC__
# location specified in the mediawiki page.
# Also __NOTOC__ is not honored.
# Own meta data lines need be injected and
# the toc needs to be moved to the correct location if specified,
# or removed, if specified.
        imgdir = gmc.imagepath
        imgdir.resolve()
        htmltext = subprocess.run(["pandoc",
                                   # extract media to the folder
                                   # disabled after migration
                                   # "--extract-media={}".format(imgdir),
                                   # standalone (full html)
                                   "-s",
                                   # create table of content
                                   "--toc",
                                   "--toc-depth=5",
                                   # mediawiki markup as input format
                                   "-f", "mediawiki",
                                   # html as output format
                                   "-t", "html",
                                   # input file
                                   "-i", self.inpath
                                   # don't use stdout, return the result
                                   ], capture_output=True)

        html_doc = htmltext.stdout.decode("utf-8")
        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)
# stupid but not avoidable:
# pandoc does not know where we store the plain html.
# therefore it cannot set the links to medias correctly.
# we have to give a helping hand
        # We could tell pandoc to use another working directory to get
        # the paths correct. TODO: change this when folders move again
        tags = soup.find_all("img")
        if tags:
            for tag in tags:
                tag.attrs.update({"src": "../website/image/" + tag["src"]})
                # since we are already here, provide a cheap
                # picture maximization via href to target _blank
                newtag = soup.new_tag("a")
                tag.insert_after(newtag)
                newtag.insert(0, tag)
                href = tag["src"]
                # Special exception for licence icons
                creative_commons = re.compile(
                    r".*CC-Icon.png")
                href = creative_commons.sub(
                    "creative-commons-cc0-1-0-universal.html",
                    href)
                creative_commons_0 = re.compile(
                    r".*CC0-Icon.png")
                href = creative_commons_0.sub(
                    "creative-commons-cc0-1-0-universal.html",
                    href)
                newtag.attrs.update({"href": href, "target": "_blank"})
# inject language information
= soup.find("html")
tag "lang": article_data[PubMetaData.locale]})
tag.attrs.update({"xml:lang": article_data[PubMetaData.locale]})
tag.attrs.update({
# inject stylesheet link
# <link rel="stylesheet" href="../website/css/fs.css"/>
= soup.find("head")
tag = soup.new_tag("link")
newtag
newtag.attrs.update("rel": "stylesheet", "href": "../website/css/fs.css"})
{6, newtag)
tag.insert(
for key in article_data.keys():
if (key.startswith('og:') or key.startswith('article:')):
= soup.new_tag("meta")
newtag "property": key,
newtag.attrs.update({"content": article_data[key]})
6, newtag)
tag.insert(
# my own invention: article:urn
= soup.new_tag("meta")
newtag "property": PubMetaData.urn,
newtag.attrs.update({"content": article_data.name})
6, newtag)
tag.insert(
# http://www.gnuterrypratchett.com/
= soup.new_tag("meta")
newtag "http-equiv": "X-Clacks-Overhead",
newtag.attrs.update({"content": "Terry Pratchett"})
# inject the generator meta information.
# one exists already
= soup.find("meta", attrs={"name": "generator"})
tag "name": "generator", "content": gmc.generator})
tag.attrs.update({
        # WikiLinks [https://webpage https//webpage]
        # lead to nested anchor tags.
        # The resulting page works in Firefox, but it is not valid HTML.
        # We use soup for the correction.
        tags = soup.find_all("a")
        for tag in tags:
            nested_a = tag.find("a")
            if nested_a:
                atext = "" + nested_a.text
                nested_a.decompose()
                tag.append(atext)

        # use a better symbol for backreferences
        tags = soup.find_all("a", text='↩︎')
        for tag in tags:
            tag.clear()
            tag.append('↑')

        # Move the TOC to the correct location
        toc = soup.find("nav", id='TOC')
        tag = soup.find("p", text='__TOC__')
        if tag:
            tag.replace_with(toc)
        else:
            tag = soup.find("p", text='__NOTOC__')
            if tag:
                tag.decompose()
                toc.decompose()

        # Footnotes are not placed at the location
        # of the <references/> tag.
        # Footnotes are generated as section
        # <section class="footnotes" role="doc-endnotes">
        # Search the section and use it to replace References.
        footnotes = soup.find("section", class_="footnotes")
        if footnotes:
            tag = soup.find("references")
            if tag:
                tag.replace_with(footnotes)
            else:
                print("Provide a reference tag as footnote target location.")
                sys.exit(1)
# Category-Links get a title "wikilink"
# Add those anchors a class "category" to hide them until
# I decide to use them.
# But "Kategorie:Artikel" gets removed. These are all articles.
= soup.find("a", href="Kategorie:Artikel")
tag if tag:
tag.decompose()
= soup.find_all("a", title="wikilink")
tags for tag in tags:
"class": "category"})
tag.attrs.update({
= r"https://idee\.frank-siebert\.de"
site_r = r"[/]\d{4}[/]\d{2}[/]\d{2}[/]" # '/yyyy/MM/dd'
date_r = r"[/][a][r][t][i][c][l][e][/]"
article_r
# Links to own articles will be addressed by relative path,
# In article migration we point to pages in the same location.
= re.compile(site_r + date_r)
repattern = soup.find_all("a", href=repattern)
tags
for tag in tags:
# in case page internal id was addressed
= tag["href"].split("#")
url 0] = "./" + repattern.sub("", url[0].rstrip("/")) + ".html"
url[= '#'.join(url)
new_url = new_url.lower() # change camel case to lower case
new_url "href": new_url})
tag.attrs.update({
# Links to other resources will be also addressed by relative path,
# Those resources need to be addressed by ../
= re.compile(site_r)
repattern = soup.find_all("a", href=repattern)
tags
for tag in tags:
= repattern.sub("..", tag["href"])
new_url = new_url.lower() # change camel case to lower case
new_url "href": new_url})
tag.attrs.update({
# References to own articles in the new portal
# shall be relative as well.
= re.compile(site_r+article_r)
reart = soup.find_all(re.compile(r"^a$"), attrs={"href": reart})
tags for tag in tags:
= "./" + reart.sub("", tag["href"])
new_url = new_url.lower()
new_url "href": new_url})
tag.attrs.update({
# its about articles, one article a page.
# For later site function injection, we need a
# container around the main content.
# After reading https://html.spec.whatwg.org/dev/sections.html
# I go for this structure:
# <body>
# <header">
# </header> Injected by SSI module in nginx
# <main> as semantic element for the main content
# <article> as semantic element for the article
# <header> an article header
# <h1>
# <div>
# <time pubdate="true" datetime=
# "2022-01-19T13:03:08">
# 2022-01-19
# </time>
# <address>Author Name</address>
= soup.find("body")
body = soup.new_tag("body") # temporary container
newbody
# SSI header injection is a function of the language
if article_data[PubMetaData.locale].startswith("de"):
= Comment('# include file="/portal/idee-header.html" ')
newtag else:
= Comment('# include file="/portal/concept-header.html" ')
newtag 0, newtag)
newbody.insert(
= soup.new_tag("main")
newtag 1, newtag)
newbody.insert(
= newtag
tag = soup.new_tag("article")
article 0, article)
tag.insert(
# previous body content becomes article content
# the new body replaces the old
= body.contents.copy()
article.contents
body.replace_with(newbody)
# inject article header information about
# title, creation date and author
= article
tag = soup.new_tag("header")
newtag 1, newtag)
tag.insert(= newtag
tag = soup.new_tag("h1")
newtag
newtag.append(title)0, newtag)
tag.insert(= soup.new_tag("div")
newtag 1, newtag)
tag.insert(= newtag
tag = soup.new_tag("time")
newtag 10])
newtag.append(article_data[PubMetaData.pubdate][:"datetime":
newtag.attrs.update({10][:19]})
article_data[PubMetaData.pubdate][:# probably deprecated by itemprop alternative
"pubdate": "true"})
newtag.attrs.update({0, newtag)
tag.insert(= soup.new_tag("address")
newtag "article:author"))
newtag.append(article_data.get(1, newtag)
tag.insert(
        html_doc = soup.prettify()

        with open(self.outpath, 'w') as outfile:
            print(html_doc, file=outfile)
            outfile.flush()
            outfile.close()
        print('wrote file {0}'.format(self.outpath))
        subprocess.run(["firefox", self.outpath], capture_output=False)

        # Placing a worklist item for the PdfWorker
        if article_data[PubMetaData.pdfdraft] == "true":
            self.dispatcher.worklist.append(
                PdfWorker.make_pdf_worklist_item(
                    article_data.name,
                    html_doc,
                    gmc.plainpath,
                    MsgWorker.task_create,
                    draft=True
                )
            )
def delete(self):
"""
Delete the generated HTML.
Resources used by the HTML need additional care.
If the delete was triggered by rename, no resources have to be deleted.
If it was triggered by a delete, a check is required,
whether the resources are used by other pages as well.
But resources are placed anyhow in the final website location.
They must not be deleted by the MwWorker.
"""
if __name__ == "__main__":
from gitmsgdispatcher import GitMsgDispatcher
print("Running Test-Cases")
= MwWorker(r".*(new file|modified).*author[/].*\.mediawiki")
mwworker = PdfWorker(r"" + PdfWorker.pdfworkitem)
pdfworker
# MESSAGEFILE = "test/PDF-Icon-TestCase-1"
# MESSAGEFILE = "test/mw_new_testcase"
# MESSAGEFILE = "test/WordPress-testcase-1"
# MESSAGEFILE = "test/ich-denke-TestCase-1"
# MESSAGEFILE = "test/FragenSieIhrenArzt-TestCase1"
# MESSAGEFILE = "test/PandemieBeenden-TestCase-1"
# MESSAGEFILE = "test/LegalTribune-TestCase-1"
= "test/TwoArticles-TestCase-1"
MESSAGEFILE = GitMsgDispatcher(MESSAGEFILE, [mwworker, pdfworker]) disp
Publishing Meta Data Management: pubmetadata.py
I have already written about the meta data export from WordPress and about the choice of Python module to manage the meta data in a CSV file.
During migration it is of vital importance to identify the correct meta data entry, so that the correct publishing date is shown in the article. And later we want to keep track of the original publishing date as well, if we perform updates.
The code above already shows that this is done in the PubMetaData class. Because the page's URN is the identifier in the stored publishing meta data, the Python file with the class PubMetaData also contains the function pageurn(pagename), which computes the Uniform Resource Name from the MediaWiki file name, which in turn equals the article title used in the MediaWiki.
While mwworker.py does not trigger a meta data save, it is vital for the migration, and in the new scenario also for article updates, that the mwworker uses existing meta data for the processed article if there is any.
~/projects/idee/generator/pubmetadata.py
"""
The PubMetaData manages the meta information about the publishings.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
This includes the migrated data from WordPress as well as publishing data
created by new publishings with the new page generator.
"""
import datetime
import pandas as pd
from gitmsgconstants import GitMsgConstants as gmc
def pageurn(pagename):
"""
Create a browser friendly urn from the pagename.
German special characters are replaced by readable two-character
alternatives, and spaces in the filename are replaced with '-'.
Parameters
----------
pagename : String
The pagename, which is also the title of the article.
All characters can appear, but we do not want all of them in the resulting URL.
Returns
-------
:String
Alternative URL friendly name.
"""
    urn = pagename.lower().strip() \
        .replace(' ', '-').replace('/', '-') \
        .replace('ß', 'ss').replace('ä', 'ae') \
        .replace('ö', 'oe').replace('ü', 'ue') \
        .replace('&', 'and').replace('\\', '-') \
        .replace('?', '').replace(':', '') \
        .replace('.', '-').replace(',', '') \
        .replace("(", "").replace(")", "") \
        .replace("\"", "").replace("!", "-") \
        .replace("„", "").replace("“", "") \
        .replace("#", "").replace("%", "") \
        .replace("'", "")

    # remove stacked hyphens
    while "--" in urn:
        urn = urn.replace("--", "-")

    urn = urn.rstrip('-')

    return urn
class PubMetaData():
"""
The PubMetaData manages the meta information about the publishings.
Parameters
----------
None.
Returns
-------
None.
"""
    instance = None

    # used as column names as well as meta tag names
    # article:urn is my own invention, who cares? It serves as unique index.
    urn = "article:urn"
    author = "article:author"
    pubdate = "article:published_time"
    revdate = "article:modified_time"    # Updatedate sounds stupid
    commentcount = "comments:count"      # of some interest during migration.
    title = "og:title"
    site = "og:site_name"
    locale = "og:locale"

    # not used in persistence
    pdfdraft = "pdf:draft"
    # not used now in persistence
    deletion = "deleted_time"
class _PubMetaData():
def __len__(self):
return len(self._storage)
def __init__(self, disp_msgparam):
"""
Initialize only one publishing dictionary.
Returns
-------
None.
"""
# The msg parameters from the message dispatcher
self._msgparam = disp_msgparam
# Registers for updates and deletions
self._updates = []
self._deletions = []
if not gmc.publishingdatapath.exists():
self._read_migration_list()
else:
self._read()
def _read_migration_list(self):
"""
Read the migration list.
The migrationlist.csv is one of two trusted sources
for the correct publishing date.
The second one is the pubmetadata.csv.
This method moves the migration list entries to pubmetadata.
As soon as the pubmetadata has been saved once,
this method is no longer required.
The data structure aligns to the planned pubmetadata
data structure.
The urn is the stem part of the url the page will finally have.
It serves as index in the pandas dataframe, which translates
into the name of the respective Series holding one article's data.
article:published_time 2022-02-15T14:41:13.367917
article:modified_time 2022-02-15T14:41:13.367917
comments:count 0
og:site_name Idee
og:locale de-DE
article:author Frank Siebert
og:title Creative Commons CC0 1.0 Universal
pdf:draft true
Name: creative-commons-cc0-1-0-universal, dtype: object
Returns
-------
None.
"""
            self._storage = pd.read_csv(gmc.migrationlistpath,
                                        delimiter='\t',
                                        index_col=PubMetaData.urn)
def get_new_revision(self, title=None, urn=None):
"""
Provide publishing dictionary data for the title.
A message worker may use this method to get information
about the current publishing in work.
To make this useful, the meta information from the current
git message is incorporated into the article entry, if the meta
information is not already in from previous publishings, bringing
all metadata required into one place.
If the worker succeeds and his work was not DRAFT publishing, the
worker may provide the article_data to get it saved via update().
If the workers task was the deletion of the publishing, the
worker may provide the article_data to get the deletion
information saved via deletion().
Parameters
----------
title:
The title of the article, whose data has to be updated. If
provided, it is used to compute the urn of the article.
urn:
The unique resource name of the article, whose data has to be
updated.
Returns
-------
dict:
The titles data dictionary with revised data entries.
"""
            nowdate = datetime.datetime.now().isoformat()

            if not urn and not title:
                return None
            if not urn:
                urn = pageurn(title)

            if urn in self._storage.index:  # .to_list():
                article_data = self._storage.loc[urn]
                # Working copy
                article_data = article_data.copy()
            else:
                for index, article_data in self._storage.iterrows():
                    article_data = pd.Series(
                        data={
                            PubMetaData.title: title,
                            PubMetaData.pubdate: nowdate,
                            PubMetaData.commentcount: 0,
                            PubMetaData.site: None,
                            PubMetaData.locale: None,
                            PubMetaData.author: None
                        },
                        index=article_data.index,
                        dtype=article_data.dtype,
                        name=urn)
                    article_data = article_data.copy()
                    article_data.update({"Name": urn})
                    break

            # Set the revision date
            article_data.update({PubMetaData.revdate: nowdate})

            # Iterate the message parameter keys and add parameters and their
            # value, if data for this key is not present in the titles
            # data series.
            # This also adds a key, if it is not part of the pubmetadata.csv.
            for key in self._msgparam.keys():
                if not article_data.get(key):
                    article_data.loc[key] = self._msgparam[key]
            return article_data
def update(self, series):
self._updates.append(series)
def delete(self, series):
self._deletions.append(series)
def save(self):
"""
Save the publishing dict data.
Incorporates updates and new entries,
and removes entries deleted (Implementation
pending, probably I decide to extend the
data structure with a deleted column).
Deletions never took place till now,
might take a while till its implemented.
Returns
-------
None.
"""
            for article_data in self._updates:
                urn = article_data.name
                self._storage.loc[urn] = article_data
            for article_data in self._deletions:
                pass  # implementation pending
            self._storage.to_csv(gmc.publishingdatapath,
                                 sep=';', quotechar='"')
def _read(self):
"""
Read the publishing dict data from previous publishings.
Returns
-------
None.
"""
            self._storage = pd.read_csv(gmc.publishingdatapath,
                                        delimiter=';',
                                        index_col=PubMetaData.urn)
def __init__(self, disp_msgparam):
if not PubMetaData.instance:
            PubMetaData.instance = PubMetaData._PubMetaData(disp_msgparam)
def __getattr__(self, name):
"""
Get attribute value by name.
Parameters
----------
name : str
Name of the attribute.
Returns
-------
TYPE
Value of the attribute.
"""
return getattr(self.instance, name)
def __len__(self):
return len(PubMetaData.instance)
if __name__ == "__main__":
pass
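A quick illustration of what pageurn() produces; the second title is made up, only the replacement rules are taken from the function above:
from pubmetadata import pageurn

print(pageurn("Creative Commons CC0 1.0 Universal"))
# creative-commons-cc0-1-0-universal
print(pageurn("Fragen & Antworten, Teil 1!"))
# fragen-and-antworten-teil-1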
Constants
That's a rather strange decision. Why would someone create a class to store constants?
I see these values less as real constants; they are more likely to become, at least partly, configuration entries once I decide to separate the generator code into a software package usable for more than one content project.
The existence of this class and its current content is a strong signal of the unfinished nature of the project. It's just ready for first use, nothing more.
~/projects/idee/generator/gitmsgconstants.py
"""
GitMsgConstants provides project wide constants.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
No instance is required. It could leverage a config file in the future.
"""
from pathlib import Path
class GitMsgConstants():
"""
Dispatch the lines of the git message to registered workers.
Parameters
----------
gitmessagepath : Path
Path as type str or type Path pointing to the git message.
msgworkers : List of MsgWorker
The list of message workers is used as worker queue. Workers first
in the queue get their workitems first.
Workers can return their work result to be picked up by
later workers.
Returns
-------
GitMsgConstants.
"""
= "pandoc, fs-commit-msg-hook 1.0"
generator = "https://idee.frank-siebert.de"
website = "3cd97bab8bb20288768b35fd72979ec3bbf4b2a8.png"
pdfimage
= Path("plain")
plainpath = Path("config")
confpath = Path("website")
sitepath = sitepath / "article"
articlepath = sitepath / "audio"
audiopath = sitepath / "css" / "fs.css"
csspath = sitepath / "portal" / "header.html"
headerpath = sitepath / "image"
imagepath = sitepath / "pdf"
pdfpath = sitepath / "qrcode"
qrpath
= confpath / "migrationlist.csv"
migrationlistpath = sitepath / "pubmetadata.csv"
publishingdatapath
= "pdf:draft"
pdfdraft = "og:locale"
locale
= sitepath / "archive"
archivepath = archivepath / Path("idee-archive.html")
idee_archive = archivepath / Path("concept-archive.html")
concept_archive = sitepath / Path("sitemap.xml")
sitemap = sitepath / Path("idee-map.xml")
idee_map = sitepath / Path("concept-map.xml")
concept_map = sitepath / "sitemap"
sitemappath = sitepath / "portal" / "monthly-map.xml"
map_template = sitepath / "portal" / "monthly-archive.html"
archive_template
= sitepath / Path("idee-rss.xml")
idee_rss = sitepath / Path("concept-rss.xml")
concept_rss
= sitepath / Path("idee-index.html")
idee_index = sitepath / Path("concept-index.html")
concept_index
if __name__ == "__main__":
pass
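The docstring already hints at a config file. A possible direction, sketched only: overlay the class attributes with values from an INI file. The file name, section and keys here are assumptions, not part of the project:
import configparser
from pathlib import Path

from gitmsgconstants import GitMsgConstants as gmc

config = configparser.ConfigParser()
config.read("generator.ini")        # hypothetical file name
if "site" in config:                # hypothetical section name
    gmc.website = config["site"].get("website", gmc.website)
    gmc.sitepath = Path(config["site"].get("sitepath", str(gmc.sitepath)))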
HTML Formatting: fs.css
If we generate HTML, we also want it to look nice. The CSS is a critical part of getting a nice looking result.
~/projects/idee/website/css/fs.css
/* ***************************************************************************
* Frank Siebert's CSS
+
* Licence: CC0
* httpx://frank-siebert.de/article/creative-commons-cc0-1-0-universal.html
* ***************************************************************************/
:root {
/* kind of blue */
--theme-color: #006080;
/* black on white */
--theme-text-color: #000000;
/* white background */
--theme-background-color: #ffffff;
/* for minor meta information */
--theme-meta-color: #999999;
/* Arial and Helvetica exist on my Computer */
/* --theme-font-family: Arial, Helvetica, Verdana, Tahoma, sans-serif; */
--theme-font-family: Liberation Sans, sans-serif;
/* One theme font only, based on the theme font-family */
--theme-font: 16px/1.4 Liberation Sans, sans-serif;
/* Improve readability */
--theme-letter-spacing: 0.05em;
}
html {padding: 0px 5px 0px 0px;
margin: 0;
border: 0;
font: var(--theme-font);
letter-spacing: var(--theme-letter-spacing);
background-color: lightgray;
}
body {width: 100%;
height: 100%;
min-width: 280px;
max-width:1200px;
padding: 0 0 0 0;
margin-top: 0;
margin-bottom: 0;
margin-left:auto;
margin-right:auto;
border-right: 1px solid var(--theme-color);
border-left: 1px solid var(--theme-color);
color: var(--theme-text-color);
background-color: var(--theme-background-color);
font-size: 1em;
word-wrap: break-word;
}
/* **************************************************************************
* keep the two body elements in sync
* **************************************************************************/
div.row,
body,
body header,
body main { min-height: 100px;
padding: 5px;
background-color: var(--theme-background-color);
background-repeat: no-repeat;
background-position: top center;
background-size: auto;
}
body header nav { padding: 0 0 0 0;
/* background: #ddcc99; */
}
/* **************************************************************************
* The tag <figure> comes with build in padding,
* but we have to have the same for the article.
*
 * These styles keep the respective block elements horizontally aligned.
*
* ==== MEDIA SCREEN Variants ====
* **************************************************************************/
@media screen and (min-width: 641px) {
body, /* yacy search */
header div div,
header figure,
header nav,
header hr,
/* main>h3 is used in the archive.html*/
main,
main article>h3 {
display: block;
margin: 1em 3em 1em 3em;
/* border-style: dotted;
 * border-width: 2px; */
}
main>h1 {
display: block;
margin: 0.6em 1.8em 0.6em 1.8em;
}
.searchinput {
max-width: 600px;
}
}
@media screen and (max-width: 640px) {
body, /* yacy search */
header div div,
header figure,
header nav,
header hr,
/* main>h3 is used in the archive.html*/
main,
main article>h3 {
display: block;
margin: 1em 1em 1em 0;
}
main>h1 {
display: block;
margin: 0.6em 0.6em 0.6em 0.2em;
}
.searchinput {
max-width: 260px;
}
}
/* **************************************************************************
* ==== END OF MEDIA SCREEN Variants ====
* **************************************************************************/
/* the main content is the article */
article { display: block;
}
/* **************************************************************************
/* ==== all about headlines ====
* **************************************************************************/
/* I do not think that I will put a tags under the headlines.
 * h1 a, h2 a, h3 a, h4 a, h5 a, h6 a { text-decoration: none; } */
h1, h2, h3, h4, h5, h6
{ line-height: 1.1;
margin: 0;
padding: 1em 0 0.5em 0;
color: var(--theme-color);
font-family: var(--theme-font-family);
font-weight: bold;
}

h1 { font-size: 1.8em; }
h2 { font-size: 1.6em; }
h3 { font-size: 1.4em; }
h4 { font-size: 1.2em; }
h5, h6 { font-size: 1em; }
/* Newspaper Style First Letter of First Paragraph Upper-Case */
article>p:first-of-type::first-letter,
hr+p::first-letter,
h2+p::first-letter,
h3+p::first-letter,
h4+p::first-letter {
font-family: serif;
font-size: 1.8em;
font-weight: bold;
}
/* **************************************************************************
* ==== Article Header ====
* - h1 headline
* - address information
* - page qr-code
* - licence information
* - audio player
* **************************************************************************/
article header {min-height: 0
}
article header h1 {padding: 0 0 0.2em 0;
}
article header div {color: var(--theme-meta-color);
font-size: 0.8em;
padding: 0 0 1em 0;
}
/* The browser decided, that address gets rendered italic,
* but we do not want this */
article header time,
article header address {padding-right: 20px;
display: inline;
font: var(--theme-font);
font-size:inherit
}
/* **************************************************************************
* ==== Article Block Elements
* **************************************************************************/
p {margin: 0;
font-size: 1em;
padding: 0 0 1em 0;
}
p:last-child
{padding-bottom: 0;
}
table th {background: #ddd;
border-right: 1px solid #fff;
padding: 10px 20px;
}
table tr th:last-child {
border-right: 1px solid #ddd;
}
table td {padding: 5px 20px;
border: 1px solid #ddd;
}
/* **************************************************************************
* ==== Figures in the header and in the article ====
* **************************************************************************/
figure img { width: 100%; height: auto; }
figure audio { width: 50%; height: auto; min-height:2em;}
header figure figcaption { font: var(--theme-font); font-size: 1em;
color: var(--theme-color); font-weight: bold}
article figure { margin: 10px }
figure figcaption { font: var(--theme-font); font-size: 0.8em;
color: var(--theme-color); font-style: italic; padding: 2px;}
article header div figure { display: Inline; }
article header div figure img { width: 50px; }
article header div figure figcaption { display: Inline; width: 150px }
article header div figure audio { margin: .5em .5em .5em .5em; }
/* **************************************************************************
* ==== Navigation in the header ====
* **************************************************************************/
header>nav>a {
font-size: 1.2em;
padding: 0 0.5em 0 0;
display: inline-grid;
grid-template-columns: 30px auto auto auto;
}
header>nav>a>img {
width: 24px;
vertical-align: sub;
}
header>nav>form {
display: inline;
padding: 0 0.5em 0 0;
margin: 0 0 0 0;
}
header>nav>form>input{
font: var(--theme-font);
letter-spacing: var(--theme-letter-spacing);
font-size: 1em;
vertical-align: super;
padding: 0 0 0 0;
margin: 0 0 0 0;
border-color: var(--theme-color);
}
/* context break is meta information */
hr {height:1px;
border-width:0;
background-color: var(--theme-meta-color);
}
/* **************************************************************************
* inline HTML TAGS
* **************************************************************************/
pre {background: #f5f5f5;
border: 1px solid #ddd;
padding: 10px;
text-shadow: 1px 1px rgba(255, 255, 255, 0.4);
font-size: 0.8em;
line-height: 1.25;
margin: 0 0 1em 0;
overflow: auto;
}
sup, sub {
font-size: 0.75em;
height: 0;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {bottom: 1ex;
}
sub {top: 1ex;
}
small { font-size: 0.75em
}
/* **************************************************************************
* ==== Navigation and their targets ====
* **************************************************************************/
*:target {
border-bottom: 0.3em solid var(--theme-color);
}
a { text-decoration: none;
font: var(--theme-font);
font-size: 1em;
font-weight: bold;
color: var(--theme-color);
border-width: 0 0 0 0;
border-style: none;
}
a:link { color: var(--theme-color); }
a:visited { color: var(--theme-text-color); }

/* figure:has(a:focus), */ /* Wait for CSS 4 */
a:focus,
a:hover /* ,
a:active */ {
color: var(--theme-background-color);
background-color: var(--theme-color);
outline: none;
}
figure a:focus,
figure a:hover {
color: var(--theme-background-color);
background-color: var(--theme-color);
outline: none;
border: none;
}
header>div>a:focus,
header>div>a:hover {
background-color: var(--theme-background-color);
color: var(--theme-color);
outline: none;
border: none;
}
a.category { visibility: hidden; }
/* **************************************************************************
* ==== YaCy Search ====
* **************************************************************************/
p.urlinfo :nth-child(2),
p.urlinfo :nth-child(3),
p.urlinfo :nth-child(4),
p.urlinfo :nth-child(5),
.favicon,
.navbar,
.starter-template,
.hidden,
.urlactions,
.input-group-btn,
.sidebar,
#datehistogram,
#api {
display: none;
}
div {min-height: 10px;
margin: 0 0 0 0;
padding: 0 0 0 0;
}
span#resNav ul li {
display: inline;
font-size: 1.4em;
}
.searchinput {
font: var(--theme-font);
letter-spacing: var(--theme-letter-spacing);
font-size: 1em;
border-color: var(--theme-color);
outline: 5px solid var(--theme-meta-color);
}
.linktitle,
.pagination {
font-size: 1.4em;
border-top: 2px solid var(--theme-meta-color);
}
/* **************************************************************************
* ==== syntaxhighlight ====
 * CSS as created in the html style-element by WeasyPrint for syntaxhighlight
* Changes for the print version need to be applied in fspdf.css
* Changes for the browser version need to be applied at the end of this file.
* **************************************************************************/
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
  { counter-reset: source-line 0; }
pre.numberSource code > span
  { position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
  { content: counter(source-line);
    position: relative; left: -1em; text-align: right; vertical-align: baseline;
    border: none; display: inline-block;
    -webkit-touch-callout: none; -webkit-user-select: none;
    -khtml-user-select: none; -moz-user-select: none;
    -ms-user-select: none; user-select: none;
    padding: 0 4px; width: 4em;
    color: #aaaaaa;
  }
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa;
  padding-left: 4px; }
div.sourceCode
  {   }
@media screen {
pre > code.sourceCode > span > a:first-child::before {
  text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic;
} /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic;
} /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic;
} /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic;
} /* Warning */
/* **************************************************************************
* ==== syntaxhighlight ====
* Own Part
* **************************************************************************/
pre.sourceCode {
width: 80ch; /* classic terminal width for code sections */
}
MediaWiki to HTML Recapitulation
At this point it is possible to copy a MediaWiki title and paste it into the command line:
we 'MediaWiki title'
Halt! We are missing something here. The command we is unknown to your system. But you can get rid of this problem by placing the following line into the file
~/.bash_aliases
alias we='~/projects/wikitools/src/export.py'
This simplifies your life a lot, since you only need to remember that wiki export is written as we on the command line.
The export, provided the default wiki is configured correctly in your configuration file, will create the file 'MediaWiki title.mediawiki' in the directory ~/projects/idee/author/ .
It does not matter in which working directory you are, when you invoke this command.
You can also use
~/projects/idee$ git add .
~/projects/idee$ git commit
This will add your new MediaWiki file to the commit list and start the commit. It will trigger the invocation of the MwWorker and you will get an HTML file named 'mediawiki-title.html' placed into the directory ~/projects/idee/plain/ and opened in Firefox.
Well, at this stage you might need to comment out some parts of the code, because it references parts that are not yet implemented.
HTML to PDF Conversion
I was pretty sure that I would be able to convert the plain HTML into a portal page. Therefore PDF generation got the higher priority.
Logically, I started this PDF generation using Pandoc. Quite a big part of the later chapter [Migration#Migration] reports the various problems I ran into and how I managed to solve them. I keep these parts in the documentation, since they might help one person or another to solve the same problems.
In the end I found that Pandoc offers no possibility to get working links in the footnote section that point back to the footnote number in the text.
This was too much functional loss for me to accept. I ended up using WeasyPrint. A lot of the code that is now commented out was required to make the Pandoc results look OK.
For WeasyPrint I needed to create an extra CSS file, but the result looks good, at least for my taste.
The WeasyPrint installation description comes further down in this document, at the point where it happened in my project, nearly at the end.
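Reduced to its core, the WeasyPrint call that replaced the Pandoc PDF route looks like this. The paths are examples; the API calls are the same ones used in the worker below:
from weasyprint import HTML
from weasyprint import CSS

# example input and output paths, not the worker's real ones
html_doc = open("plain/example.html", encoding="utf-8").read()
HTML(string=html_doc, base_url="plain").write_pdf(
    target="website/pdf/example.pdf",
    stylesheets=[CSS(filename="website/css/fspdf.css")])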
The PDFWorker
~/projects/idee/generator/pdfworker.py
"""
PdfWorker is derived from the MsgWorker base class.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
The PdfWorker takes care of a worklist item placed
by an earlier worker.
"""
import re
from pathlib import Path
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from weasyprint import HTML
from weasyprint import CSS
from gitmsgdispatcher import MsgWorker
from gitmsgconstants import GitMsgConstants as gmc
# from pubmetadata import pageurn
class PdfWorker(MsgWorker):
"""
The PdfWorker takes care of a worklist item placed by an earlier worker.
The class method makePdfWorklistItem() can be used to create a work item,
which can be placed into the worklist.
The respective PDF is created from HTML and stored in the folder
GITROOT/website/pdf/
Parameters
----------
super: MsgWorker
The PdfWorker is derived from the MsgWorker.
Returns
-------
PdfWorker.
"""
# keys
= "pdfworkitem"
pdfworkitem = "urn"
urn = "title"
title = "workpath"
workpath = "html_doc"
html_doc = "draft"
draft
def __init__(self, pattern):
super().__init__(pattern)
self.values = {}
def process(self):
"""
Create a PDF file for the HTML.
Parameters
----------
html_doc: Type String of HTML
workpath: Type Path, Folder of html file location (planned or factual)
Converts the HTML provided as a string containing an article with updated
publishing date into PDF.
It might be a draft for a new plain html or it might be the
publishing version with updated publishing date but still
without portal injection.
This makes no difference for the processing result.
Returns
-------
None.
Implementation Notes
--------------------
The PDF generation fails, if pictures in tables are embedded inside
of a figure tag. To address this, we have to open the html file,
look for figures inside of tables, and remove the figure without
removing the figures content.
Then we need to save the result in a temporary file and tell
pandoc the correct workdirectory for the successful resolutiin
of relative pathes in href and src entries in the html.
"""
        html_doc = self.item[PdfWorker.html_doc]
        workpath = self.item[PdfWorker.workpath]
        draft = self.item[PdfWorker.draft]

        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)

        title = soup.find("title")

        self.outpath = gmc.pdfpath / (self.item[PdfWorker.urn] + ".pdf")
        self.outpath = self.outpath.resolve()
        workpath = workpath.resolve()

        if draft:
            newtitle = title.text.strip() + " - DRAFT"
            title.clear()
            title.append(newtitle)
# First we need to remove some things.
# The article header
= soup.find("article")
tag = tag.find("header")
header
# tags = header.find_all("figcaption")
# for tag in tags:
# tag.decompose()
= header.find_all("figure")
tags if len(tags) == 3:
2].decompose() # remove audio
tags[if len(tags) > 1:
1].decompose() # remove PDF Icon in the PDF Version
tags[# if len(tags) > 0:
# # size the qrcode picture
# tag = tags[0].find("img")
# tag.attrs.update({
# "height": "80px",
# "width": "80px"
# })
# tags[0].unwrap()
# figures in tables do not work in pandoc
# tables = soup.find_all("table")
# for table in tables:
# tags = table.find_all("figcaption")
# for tag in tags:
# tag.unwrap()
# tags = table.find_all("figure")
# for tag in tags:
# tag.unwrap()
# tables = soup.find_all("table")
# for table in tables:
# figs = table.find_all("figure")
# for fig in figs:
# figcap = fig.find("figcaption")
# if figcap:
# figcap.unwrap()
# fig.unwrap()
# headers = soup.find_all("header")
# for header in headers:
# figs = header.find_all("figure")
# for fig in figs:
# figcap = fig.find("figcaption")
# if figcap:
# figcap.unwrap()
# fig.unwrap()
# We need to change relative paths to own articles into absolute
# paths.
= re.compile(r"^\.\/")
rhref = soup.find_all("a", href=rhref)
anchors for anchor in anchors:
= rhref.sub("https://idee.frank-siebert.de/article/",
url "href"])
anchor["href": url})
anchor.attrs.update({
# On paper we need complete written URLs
= re.compile(r"^http.*")
rhref = soup.find("section", class_="footnotes")
tag
if tag:
= tag.find_all("a", href=rhref)
anchors for anchor in anchors:
= anchor["href"]
url "br"))
anchor.parent.append(soup.new_tag(
anchor.parent.append(url)
= Path(r"/home/frank/projects/idee/website/css/fspdf.css")
csspath
csspath.resolve()# if csspath.exists():
# print("css exists")
= soup.prettify()
html_doc
= HTML(string=html_doc, base_url=str(workpath))
weasy_html =self.outpath,
weasy_html.write_pdf(target=[CSS(filename=str(csspath))]
stylesheets
)
# subprocess.run(["pandoc",
# # mediawiki markup as input format
# "-f", "html",
# # html as output forma
# "-t", "pdf",
# # input file
# # "-i", inpath,
# # output file
# "-o", self.outpath,
# # "--pdf-engine=xelatex",
# "--pdf-engine=weasyprint",
# "--variable=mainfont:Liberation Sans",
# "--variable=sansfont:Liberation Sans",
# "--variable=monofont:Liberation Mono",
# "--css", csspath,
# # "--variable=mainfont:DejaVu Serif",
# # "--variable=sansfont:DejaVu Sans",
# # "--variable=monofont:DejaVu Sans Mono",
# # "--variable=geometry:a4paper",
# # "--variable=geometry:margin=2.5cm",
# # "--variable=linkcolor:blue"
# ],
# capture_output=False,
# # the correct workdirectory to find the images
# cwd=workpath,
# # html string as stdin
# input=html_doc.encode("utf-8"))
# print('wrote file {0}'.format(self.outpath))
# subprocess.run(["firefox", pdfpath], capture_output=False)
def delete(self):
"""
Delete the generated HTML.
Resources used by the HTML need additional care.
If the delete was triggered by rename, no resources have to be deleted.
If it was triggered by a delete, a check is required,
whether the resources are used by other pages as well.
But resources are placed anyhow in the final website location.
They must not be deleted by the MwWorker.
"""
@staticmethod
    def make_pdf_worklist_item(urn, html_doc, workpath, task_type,
                               draft=False):
        """
Create a worklist item for the PdfWorker.
Parameters
----------
title : str
Title of the article
urn: str
The unique resource name, also stem of the related files
html_doc : str
The generated HTML to transform.
workpath : TYPE
Where to work to have the relative links right.
task_type : str, optional
One of MsgWorker.task_*
draft : TYPE, optional
Flag whether this is a PDF draft work item. The default is False.
Returns
-------
None.
"""
return {
MsgWorker.task_worker_match: PdfWorker.pdfworkitem,
PdfWorker.urn: urn,
PdfWorker.html_doc: html_doc,
PdfWorker.workpath: workpath,
MsgWorker.task_type: task_type,
PdfWorker.draft: draft
}
if __name__ == "__main__":
pass
The PDF Style Sheet
~/projects/idee/website/css/fspdf.css
/* ***************************************************************************
* Frank Siebert's PDF CSS
+
* Licence: CC0
* httpx://frank-siebert.de/article/creative-commons-cc0-1-0-universal.html
* ***************************************************************************/
html {font-family: Liberation Sans, sans-serif !important;
font: 12px/1.4 Liberation Sans, sans-serif !important;
background-color: #ffffff !important;
}
@page {
size: A4; /* Change from the default size of A4 */
margin: 1.5cm; /* Set margin on each page */
@top-right {
content: counter(page);
color: #006080;
font-size: 1.2em;
}
@top-left {
content: string(pageheader);
color: #006080;
font-size: 1.2em;
}
}
header h1 {string-set: pageheader content();
}
article header div figure img { width: 150px !important; }
/* **************************************************************************
* ==== syntaxhighlight ====
* **************************************************************************/
/* Allow only intentional line breaks in source code */
pre > code.sourceCode > span {
 white-space: nowrap !important;
}
pre.sourceCode {
width: 80ch !important; /* classic terminal width for code sections */
}
HTML to PDF Recapitulation
At this point of the implementation the PdfWorker related parts no longer need to be commented out. And you can request the generation of a draft PDF in the commit message when you commit a new MediaWiki file.
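For a commit that should also produce a draft PDF, the edited commit message (the template plus the section git adds) looks roughly like this; the subject line and the file name are invented for illustration:
Example article published

# Overwrite values if necessary, based on https://ogp.me/
# pdf:draft=true
# og:locale=de-DE
# og:site_name=Idee
# article:author=Frank Siebert
#
# Changes to be committed:
#	new file:   author/Example-Article.mediawiki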
Note that this is only a chapter in a much longer description.
The portal page contains:
- A QRCode pointing to its own URL
-
A PDF
- For low content pages PDF generation can be suppressed.
-
License Information
- For low content pages License Information can be suppressed.
- Audio controls, if an audio was created.
- The portal header
The audio is not generated; it needs to be recorded and saved in the folder ~/projects/idee/website/audio/ with the same filename as computed for the plain HTML file, but with the extension mpg.
Note to myself: Consider allowing ogg as an alternative extension.
The license information is given by license icons linking to an article text about the license. Apart from the code to place the icon and to link to the article, this part is mainly content.
The only missing pieces are the qrcode generator, which exists as a ready-to-use Python module, and the portal header, which still has to be integrated.
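The QR code part really is just a few lines with the qrcode module; a sketch of the idea, with an example URL and output path:
import qrcode

# a QR code pointing to an article URL (example values)
url = "https://idee.frank-siebert.de/article/replacing-wordpress.html"
img = qrcode.make(url)
img.save("website/qrcode/replacing-wordpress.png")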
The original plan was to copy the portal HTML fragment into the article HTML file, because HTML itself does not support includes, not even with a same-origin policy. Luckily I discovered that the web server nginx supports such includes on the server side. The respective include instruction is already present in the plain HTML version.
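For the include to be processed, SSI has to be switched on in nginx for the locations serving the articles. A minimal sketch of such a configuration; the document root and location are assumptions, not my real server setup:
server {
    server_name idee.frank-siebert.de;
    root /var/www/idee;            # assumed document root

    location /article/ {
        ssi on;   # lets nginx process <!--# include file="..." --> comments
    }
}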
Portal Header
The Portal Header is an HTML fragment file.
~/projects/idee/website/portal/idee-portal.html
<header>
<figure>
<a href="/idee-index.html" alt="Home" tabindex="1">
<img src="../image/bookpress.jpg" alt="Idee der eigenen Erkenntnis"
srcset="../image/bookpress.jpg 1600w,
../image/bookpress-300x43.jpg 300w,
../image/bookpress-768x110.jpg 768w,
../image/bookpress-1024x147.jpg 1024w,
../image/bookpress-1568x225.jpg 1568w"
sizes="(max-width: 1600px) 100vw, 1600px" width="1600" height="auto"/>
</a>
<figcaption>
Idee der eigenen Erkenntnis</figcaption>
</figure>
<nav>
<form action="../yacysearch.html" accept-charset="UTF-8" method="get">
<input type="text" name="query" placeholder="Suche.." maxlength="80"
autocomplete="off" tabindex="2"/>
<input type="hidden" name="verify" value="cacheonly" />
<input type="hidden" name="maximumRecords" value="10" />
<input type="hidden" name="meanCount" value="5" />
<input type="hidden" name="resource" value="local" />
<input type="hidden" name="urlmaskfilter" value=".*" />
<input type="hidden" name="prefermaskfilter" value="" />
<input type="hidden" name="display" value="2" />
<input type="hidden" name="nav" value="all" />
<input type="submit" name="Enter" value="Search" title="Suche"
alt="Suche" hidden>
</form>
<a href="../idee-rss.xml" tabindex="3">
<img src="../image/RSS.png" alt="RSS-Feed" width="1em"/>
RSS</a>
<a href="../article/rechtliches.html" rel="nofollow"
alt="Impressum, Urheberrecht und Datenschutz" tabindex="4">
<img src="../image/Legal.png" alt="RSS-Feed" width="1em"/>
Rechtliches</a>
<a href="../archive/idee-archive.html"
alt="Archiv" tabindex="5">
<img src="../image/Archive.png" alt="Archiv" width="1em"/>
Archiv</a>
</nav>
<hr/>
<script src="../js/header.js" type="text/javascript" defer></script>
</header>
Portal Page Generation: The PlainWorker
~/projects/idee/generator/plainworker.py
"""
PlainWorker is derived from the MsgWorker base class.
@author: Frank Siebert
@website: https://idee.frank-siebert.de
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
The PlainWorker takes care of *.mediawiki files
in the author directory, if changes are committed
for them.
"""
import re
import subprocess
import qrcode
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from gitmsgdispatcher import GitMsgDispatcher
from gitmsgdispatcher import MsgWorker
from gitmsgconstants import GitMsgConstants as gmc
from pdfworker import PdfWorker
from pubmetadata import PubMetaData
class PlainWorker(MsgWorker):
"""
The PlainWorker takes care of *.mediawiki files in the author/ directory.
Example of a line taken care for
# modified: author/PDF-Icon.mediawiki
The line has to be from the section git message section:
# Changes to be committed:
The main output is an HTML created from the mediawiki file,
which is plain (without portal part) and stored in the
folder GITROOT/plain/
A minor output, a PDF, might be requested via the message line:
# pdf:draft=true
The respective PDF is created from HTML and stored in the folder
GITROOT/website/pdf/
Parameters
----------
super: MsgWorker
The PlainWorker is derived from the MsgWorker.
Returns
-------
PlainWorker.
"""
    portal_header_fragment = None
    licence = "./creative-commons-cc0-1-0-universal.html"
    ccimg = "../image/CC-Icon.png"
    cc0img = "../image/CC0-Icon.png"
def __init__(self, pattern):
super().__init__(pattern)
self.values = {}
@staticmethod
def __make_qrcode__(stem):
"""
Create a qrcode for the page, whose stem name is provided.
The created qrcode is saved in the site's qrcode directory.
We create a QR Code for each article, containing its URL
Parameters
----------
stem : String
Returns
-------
None.
"""
        docurl = gmc.website + "/article/" + stem + ".html"
        image = qrcode.make(data=docurl)
        qrpath = gmc.qrpath / stem
        qrpath = qrpath.with_suffix(".png")
        qrpath.resolve()
        image.save(qrpath)
        print('wrote file {0}'.format(qrpath))
@staticmethod
def __make_portal_page__(soup, urn, create_pdf):
"""
Inject the portal into prepared HTML.
Function:
The tag <header> in the context of <body>
is replaced with the portal header.
Parameters
----------
        soup : BeautifulSoup, required
            HTML page as BeautifulSoup object.
urn : Str
Unique Resource Identifier also used as stem in related files
Returns
-------
soup.
"""
        # include the favicon just behind the css link
        csslink = soup.find("link")
        newtag = soup.new_tag("link")
        newtag.attrs.update({"rel": "icon",
                             "href": r"../image/favicon.ico",
                             "type": "image/x-icon"
                             })
        csslink.insert_after(newtag)

        # inject article artefacts
        tag = soup.find("article")
        tag = tag.find("header")
        headermedia = soup.new_tag("div")
        tag.append(headermedia)

        # Move Article -> Div Artefacts to Article -> Header -> Div
        tag = soup.find('article')
        tag = tag.find("div")
        if tag:
            headermedia.replace_with(tag)
            headermedia = tag

        tag = soup.find("article")
        tag = tag.find("div")

        if create_pdf:
            newtag = soup.new_tag("figure")
            headermedia.insert(1, newtag)
            tag = newtag
            newtag = soup.new_tag("a")
            tag.append(newtag)
            newtag.attrs.update({"accesskey": "p",
                                 # "download": "",
                                 "href": r"../pdf/" + urn + ".pdf",
                                 "target": "_blank",
                                 "type": "application/pdf"
                                 })
            # Inject the PDF Icon
            tag = newtag
            newtag = soup.new_tag("img")
            tag.append(newtag)
            newtag.attrs.update({"src": "../image/" + gmc.pdfimage})

        # Inject the Audio Player, if an audio does exist
        audio = gmc.audiopath / (urn + ".mp3")
        audio.resolve()
        if audio.exists():
            audio = r"../audio/" + urn + ".mp3"
            newtag = soup.new_tag("figure")
            headermedia.append(newtag)
            tag = newtag
            newtag = soup.new_tag("audio")
            tag.append(newtag)
            newtag.attrs.update({"accesskey": "a",
                                 "type": "audio/mp3",
                                 "preload": "none",
                                 "controls": "true",
                                 "src": audio})

        # Finally, no more additions expected,
        # We give every anchor a tabindex
        # 5 (or less) Tabindexes are in the portal header
        index = 6
        tags = soup.find_all(re.compile(r"^a$|^audio$|^input$"))
        for tag in tags:
            tag.attrs.update({"tabindex": index})
            index += 1

        return soup
def process(self):
"""
Process the plain HTML files into article HTML files.
Returns
-------
None.
"""
        # inject meta information from commit message
        # Creates the single instance of PubMetaData
        PubMetaData(self.dispatcher.parameters.values)

        # compose the output path
        self.outpath = gmc.articlepath / self.inpath.stem
        self.outpath = self.outpath.with_suffix(".html")
        self.outpath.resolve()

        # The plain html contains a publishing date.
        # But this might be the date the plain html was created,
        # and not the real publishing date, if no previous publishing
        # took place.
        # We need to read the plain html and use the title to search
        # for a publishing date of previous publishings.
        # If we do not find a previous publishing date, we need
        # to change the publishing date entries to the current date.
        with open(self.inpath, 'r') as infile:
            html_doc = infile.read()
            infile.flush()
            infile.close()

        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)

        # Own magic words:
        # __NOPDF__ Do not create PDF
        # __NOLIC__ Place no own CC0 license information
        # Nothing but whitespaces and magic word in one line
        create_pdf = True
        tag = soup.find("p", string=re.compile(r'^\s*__NOPDF__\s*$'))
        if tag:
            create_pdf = False
            tag.decompose()

        show_lic = True
        tag = soup.find("p", string=re.compile(r'^\s*__NOLIC__\s*$'))
        if tag:
            show_lic = False
            tag.decompose()

        title = soup.find("title").text.strip()

        article_data = PubMetaData.instance.get_new_revision(
            title=title,
            urn=self.inpath.stem  # takes preference before title
        )

        tag = soup.find("meta", attrs={"property": PubMetaData.pubdate})
        tag.attrs.update({"property": PubMetaData.pubdate,
                          "content": article_data[PubMetaData.pubdate]})

        tag = soup.find("time")
        tag.clear()
        tag.append(article_data[PubMetaData.pubdate][:10])
        tag.attrs.update({"datetime": article_data[PubMetaData.pubdate][:19]})
        # probably deprecated by itemprop alternative
        tag.attrs.update({"pubdate": "true"})

        tag = soup.find(
            "meta", attrs={"property": PubMetaData.revdate})
        if not tag:
            # inject the modified_time as meta tag
            head = soup.find("head")
            tag = soup.new_tag("meta")
            head.insert(6, tag)
        tag.attrs.update({"property": PubMetaData.revdate,
                          "content": article_data[PubMetaData.revdate]})

        # take care for links
        # For a start we know, that "../website/" becomes "../".
        tags = soup.find_all(re.compile("link|a"),
                             attrs={"href": re.compile(r"../website/")})
        for tag in tags:
            shref = tag["href"]
            shref = shref.replace("../website/", "../")
            tag.attrs.update({"href": shref})

        tags = soup.find_all("img",
                             attrs={"src": re.compile(r"../website/")})
        for tag in tags:
            shref = tag["src"]
            shref = shref.replace("../website/", "../")
            tag.attrs.update({"src": shref})

        # Insert header div for article artefacts
        # Embed it into the article.
        tag = soup.find("article")
        headerdiv = soup.new_tag("div")
        tag.insert(1, headerdiv)

        # Create QR code for the document and the site.
        # Embed it into header div.
        self.__make_qrcode__(self.inpath.stem)
        qruri = "../qrcode/" + self.inpath.stem + ".png"
        newtag = soup.new_tag("figure")
        headerdiv.append(newtag)
        tag = newtag
        newtag = soup.new_tag("figcaption")
        # Decided in the end to get rid of text for the QR Code
        # newtag.append(soup.new_string("URL"))
        tag.insert(0, newtag)

        newtag = soup.new_tag("a")
        newtag.attrs.update({"href": qruri})
        tag.insert(0, newtag)
        tag = newtag
        newtag = soup.new_tag("img")
        newtag.attrs.update({"width": "150px", "height": "150px"})
        newtag.attrs.update({"src": qruri})
        newtag.attrs.update({"alt": "QR Code"})
        tag.insert(0, newtag)

        if show_lic:
            newtag = soup.new_tag("a")
            headerdiv.append(newtag)
            newtag.attrs.update({"href": PlainWorker.licence})
            tag = newtag
            newtag = soup.new_tag("img")
            # The following scaling is for the PDF
            # In the browser the CSS overwrites this scaling:
            newtag.attrs.update({"width": "28px", "height": "28px"})
            newtag.attrs.update({"src": PlainWorker.ccimg})
            newtag.attrs.update({"alt": "Creative Commons"})
            tag.insert(0, newtag)

            newtag = soup.new_tag("a")
            headerdiv.append(newtag)
            newtag.attrs.update({"href": PlainWorker.licence})
            tag = newtag
            newtag = soup.new_tag("img")
            # The following scaling is for the PDF
            # In the browser the CSS overwrites this scaling:
            newtag.attrs.update({"width": "28px", "height": "28px"})
            newtag.attrs.update({"src": PlainWorker.cc0img})
            newtag.attrs.update({"alt": "Zero"})
            tag.insert(0, newtag)

        # Make a portal page from the html
        soup = self.__make_portal_page__(soup, self.inpath.stem, create_pdf)
        html_doc = soup.prettify()

        # Save the article.
        with open(self.outpath, 'w') as outfile:
            print(html_doc, file=outfile)
            outfile.flush()
            outfile.close()
        print('wrote file {0}'.format(self.outpath))

        subprocess.run(["firefox", self.outpath], capture_output=False)

        # Flag a metadata update
        PubMetaData.instance.update(article_data)

        if create_pdf:
            # Placing a worklist item for the PdfWorker
            self.dispatcher.worklist.append(
                PdfWorker.make_pdf_worklist_item(
                    article_data.name,
                    html_doc,
                    gmc.articlepath,
                    MsgWorker.task_create,
                    draft=False
                )
            )
def delete(self):
"""
Delete the generated HTML.
Resources used by the HTML need additional care.
If the delete was triggered by rename, no resources have to be deleted.
If it was triggered by a delete, a check is required,
whether the resources are used by other pages as well.
But resources are placed anyhow in the final website location.
They must not be deleted by the PlainWorker.
"""
if __name__ == "__main__":
from mwworker import MwWorker
print("Running Test-Cases")
= MwWorker(r".*modified.*author[/].*\.mediawiki")
mwworker = PlainWorker(r".*[modified|new file].*plain[/].*\.html")
plainworker = PdfWorker(r"" + PdfWorker.pdfworkitem)
pdfworker
# MESSAGEFILE = "test/PDF-Icon-TestCase-2"
# MESSAGEFILE = "test/cc-plain-testcase"
# MESSAGEFILE = "test/englands-gesamttodesraten-TestCase-2"
# MESSAGEFILE = "test/endlich-TestCase-2"
# MESSAGEFILE = "test/ich-denke-TestCase-2"
# MESSAGEFILE = "test/astravacz-TestCase-2"
= "test/allesaufdentisch-TestCase-2"
MESSAGEFILE = GitMsgDispatcher(MESSAGEFILE, [mwworker, plainworker, pdfworker]) disp
Portal Page Conversion Recapitulation
With this code part included we can create the final article HTML. We can also view it in the browser with its QRCode, PDF, license information and audio. Thanks to relative paths everything works in the locally viewed HTML file. But to see it as a portal page, we need to set up nginx to perform the include.
To trigger this conversion, another git add and git commit sequence is required. This makes sense, since the scenario treats the plain HTML version as the base for copy-editing and audio recording.
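For illustration, the second stage of the publishing workflow could look like this (a hedged sketch; the file name is hypothetical):
frank@Asimov:~/projects/idee$ git add plain/example-article.html
frank@Asimov:~/projects/idee$ git commit
frank@Asimov:~/projects/idee$ git push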
Idee Website Server Setup
User and Group git
A user named git is used and the server git repository resides in /home/git/idee.git/.
Create git
The following command creates an empty git repository without a working directory (--bare), which is supposed to be shared between multiple users (--share=group).
git@sol:~$ git init --bare --share=group idee.git
Git initializes the folders with the group permission flag set (setgid), so the group is inherited down the directory tree.
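To verify this (a hedged check, not part of the original setup), one can list any directories that are missing the setgid bit; an empty result means every directory inherits the group:
git@sol:~$ find idee.git -type d ! -perm -g+s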
Push from client git
Since I started without a server git, I needed to connect my client git with the server. I did this by editing the config file in the client's .git/ directory, providing information about the remote "origin".
.git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
hooksPath = ./config/hooks
quotepath = off
[remote "origin"]
url = ssh://git@sol/home/git/idee.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[commit]
template = ./config/commit-message
[status]
relativePaths = false
As can be seen, I use the user git for ssh access.
initial push
frank @Asimov:~/projects/idee$ git push
Enter passphrase for key '/home/frank/.ssh/id_rsa':
Enumerating objects: 1156, done.
Counting objects: 100% (1156/1156), done.
Delta compression using up to 4 threads
Compressing objects: 100% (496/496), done.
Writing objects: 100% (1156/1156), 28.32 MiB | 7.89 MiB/s, done.
Total 1156 (delta 675), reused 1064 (delta 616), pack-reused 0
remote: Resolving deltas: 100% (675/675), done.
To ssh://sol/home/git/idee.git
* [new branch] master -> master
/home/git/idee.git/hooks/post-receive
#!/bin/bash
#
# The hook "post-receive" takes care for the
# deployment after all pushed files where
# successfully stored.
#
# The deployment is implemented as pull
# from a client git on the servers wwww folder.
# prevent message: "fatal: Not a git repository: '.'"
unset $(git rev-parse --local-env-vars)
cd /var/www/idee/
git pull
I found the solution for the error message at "Git Hook Pull After Push - remote: fatal: Not a git repository: '.' · Joe Januszkiewicz" 19
/var/www/idee
I create the server-side client git also as a shared git, making sure that www-data has sufficient rights to read everything as a member of the group git.
git@sol:/var/www$ git init --share=group idee
Initialized empty shared Git repository in /mnt/data/www/idee/.git/
git@sol:/var/www/idee/.git$ git remote add origin /home/git/idee.git
The branch master was set in the config file with a text editor.
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
sharedrepository = 1
[receive]
denyNonFastforwards = true
[remote "origin"]
url = /home/git/idee.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
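The [branch "master"] entries could also have been written by git itself instead of a text editor (a hedged equivalent of the manual edit):
git@sol:/var/www/idee$ git config branch.master.remote origin
git@sol:/var/www/idee$ git config branch.master.merge refs/heads/master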
Testing the pull
git@sol:/var/www/idee$ git pull
git@sol:/var/www/idee$ ls -la
total 32
drwxrwxr-x 8 www-data www-data 4096 Feb 3 20:11 .
drwxr-xr-x 10 root root 4096 Jan 12 23:33 ..
drwxr-xr-x 2 git git 4096 Feb 3 20:11 author
drwxr-xr-x 3 git git 4096 Feb 3 20:11 config
drwxr-xr-x 2 git git 4096 Feb 3 20:11 generator
drwxrwsr-x 8 git git 4096 Feb 3 20:11 .git
drwxr-xr-x 2 git git 4096 Feb 3 20:11 plain
drwxr-xr-x 11 git git 4096 Feb 3 20:11 website
Since the pull runs as the same user and the remote location is actually local, no password is requested, and nothing needs to be set up to answer a password prompt.
Providing www-data with group permission
root @sol:/home/git/idee.git/hooks# adduser www-data git
Adding user `www-data' to group `git' ...
Adding user www-data to group git
Done.
Creating a nginx site
The following server definition for nginx uses http instead of https. That's not a problem, since it is only used for testing and migration in the local network.
/etc/nginx/sites-available/idee_88
# Idee Server Configuration
#
server {
listen 88 default_server;
listen [::]:88 default_server;
disable_symlinks off;
root /var/www/idee/website;
# Add index.php to the list if you are using PHP
index index.html index.htm index.nginx-debian.html;
server_name _;
location / {
# First attempt to serve request as file, then
# as directory, then fall back to displaying a 404.
try_files $uri $uri.html $uri/ =404;
}
location /yacysearch.html {
set $myquery '';
set $other '';
if ($args ~* query=([^&]*)(.*)){
set $myquery $1;
set $other $2;
}
if ($myquery !~* (site(%3a|:)idee\.frank-siebert\.de)) {
set $args query=$myquery+site:idee.frank-siebert.de$other;
}
proxy_pass https://yacy.frank-siebert.de/yacysearch.html;
}
}
This configuration also does the heavy lifting for the YaCy search integration. The main effort was the part which enforces that the site: filter is passed on to YaCy, restricting search results to my own web pages.
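To illustrate the effect of the two if-blocks (a hypothetical query; parameter order may differ in practice):
# Request coming from the search form:
#   /yacysearch.html?query=erkenntnis&verify=cacheonly&resource=local
# Request forwarded to YaCy after the site filter has been appended:
#   /yacysearch.html?query=erkenntnis+site:idee.frank-siebert.de&verify=cacheonly&resource=local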
Enabling the new site
root @sol:/etc/nginx/sites-enabled# ln -s ../sites-available/idee_88 .
root @sol:/etc/nginx/sites-enabled# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
root @sol:/etc/nginx/sites-enabled# nginx -s reload
Test-URL
http://sol:88/article/verstehen.html
The server works and the YaCy search works as well, but naturally the links still point to the WordPress instance. A redirect from the old to the new URL pattern is required, and the migration of the content is still pending.
But sitemap, RSS and index page are the next most important parts to be implemented.
Test run on this article
Doing a test run on this article, while it is obviously still work in progress, reveals that it renders nicely; even the source code sections are very pretty, without my investing time to make them look nice.
Every source code line is a reference
That is really nice for a number of use cases.
TODO: I have to take care that these source code references do not each get a tabindex, or blind people will start to hate me.
Source code in the PDF
Source code in the PDF gets colored very nicely. DONE: I have to take care that the source code does not flow out of the page.
After refactoring the program export.py , where I took care to restrict the code to 80 characters per line, the PDF print of this program stays inside the page borders.
Sitemap Implementation
The "Sitemaps XML format " 20 description explains the concept and the XML document structure of sitemaps. It seems to be quite simple, if I just write down some code to create the respective xml-elements and to persist the document afterwards.
Sometimes knowledge makes everything a bit more complicated. I know that I should validate the resulting XML against its schema, and, committed to high quality, I started to dig into question, how this validation has to be set up on a Linux system to work, lets say, first of all in vim.
This theme turns out to be quite complex, and it is independent enough to get its own article: Validating XML in vim .
Fortunately the setup done for the validation in vim will provide also everything required for a validation without vim.
Python3-lxml
The module lxml is required to create XML via BeautifulSoup.
frank @Asimov:~$ sudo apt-get install python3-lxml
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3-lxml is already the newest version (4.6.3+dfsg-0.1+deb11u1).
python3-lxml set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Requirements Draft
Sitemaps will be split into monthly maps. Content will be listed in the month of its first publishing. If content from earlier months needs an update (e.g. when I migrate the content), the respective older sitemaps are updated accordingly.
I'll not implement the hreflang link stuff, since I do not expect much overlap between English and German content. However, since I plan to use two different site names, "Concept" in English, "Idee" in German, I think I should have two different sitemap trees.
My sitemap tree will start with one sitemap.xml referencing idee-map.xml and concept-map.xml, which in turn reference down to idee-yyyy-MM.xml and concept-yyyy-MM.xml files. Since the sitemap specification itself does not provide language information, Google may figure out the page languages by content itself.
Since the content I provide on my German site is heavily suppressed by Google anyhow, I do not really care to optimize much to ease Google's life.
Solution Specification
Every sitemap update involves 3 sitemap files: the monthly file, the site file and the top file. The information about the required sitemap changes is found in PubMetaData.instance._updates and PubMetaData.instance._deletions.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://idee.frank-siebert.de/idee-map.xml</loc>
<lastmod>2021-03-31T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://idee.frank-siebert.de/concept-map.xml</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>
The modification of the sitemaps starts at the leaves of the sitemap tree, which is easily possible since the respective map can be found via the article:modified_time information and the og:site_name information. og:site_name is either "idee" or "concept".
The monthly sitemaps are stored in a dedicated folder named sitemaps to keep the root directory clean.
A class SiteMap applies all the changes. The timestamps for all changes done in the 3 top-level sitemaps during one publishing commit will always be the same.
The sitemap.xml, idee-map.xml and concept-map.xml are created in the root directory and pre-created via text editor to provide the general structure.
A monthly-map.xml template is created in a text editor and provided in the portal folder next to the other already existing templates. It simply contains the top level element and the xmlns information.
Implementation Result
~/projects/idee/generator/sitemap.py
"""
Update the sitemap of the website.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
@author: Frank Siebert
"""
import re
import datetime
from pubmetadata import PubMetaData
from gitmsgconstants import GitMsgConstants as gmc
from bs4 import BeautifulSoup
from bs4.builder._lxml import LXMLTreeBuilderForXML
= "urlset"
URLSET_TAG = "url"
URL_TAG = "loc"
LOC_TAG = "lastmod"
LASTMOD_TAG # Not used:
# CHANGEFREQ_TAG = "changefreq"
# PRIORITY_TAG = "priority"
= "sitemapindex"
INDEX_TAG = "sitemap"
SIDEMAP_TAG
class SiteMap():
"""Manage all changees in the sitemaps."""
def __init__(self):
"""
Initialize changelists.
Returns
-------
None.
"""
# map information for German page changes on site "Idee".
self.de_list = []
# map information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
def update(self):
"""
Iterate over changes and update respective sitemaps.
Add the respective sitemaps to their respective change list.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
        for article_data in PubMetaData.instance._updates:
            creation_month = article_data[PubMetaData.pubdate][0:7]
            site = article_data[PubMetaData.site]
            sitemap_path = site.lower() + "-" + creation_month + ".xml"
            sitemap_path = gmc.sitemappath / sitemap_path

            if site == "Idee":
                if sitemap_path not in self.de_list:
                    self.de_list.append(sitemap_path)
            else:
                if sitemap_path not in self.en_list:
                    self.en_list.append(sitemap_path)

            if article_data.name != "rechtliches" \
                    and article_data.name != "legal":
                self._update(sitemap_path, article_data)

        for article_data in PubMetaData.instance._deletions:
            # TODO
            pass

        self._update_de()
        self._update_en()
        self._update_main()
def _update_de(self):
"""Update idee-map.xml."""
if len(self.de_list) == 0:
return
        with open(gmc.idee_map, 'r') as sitemap_file:
            xml_doc = sitemap_file.read()
            sitemap_file.flush()
            sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        for sitemap_path in self.de_list:
            url = gmc.website + "/" + sitemap_path.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))

            if not tag:
                tag = soup.find(INDEX_TAG)
                new_tag = soup.new_tag(SIDEMAP_TAG)
                tag.append(new_tag)
                tag = new_tag
                new_tag = soup.new_tag(LOC_TAG)
                new_tag.string = url
                tag.append(new_tag)
                new_tag = soup.new_tag(LASTMOD_TAG)
                tag.append(new_tag)
            else:
                tag = tag.parent

            # tag holds now the correct SIDEMAP_TAG.
            # Either it had been found or created.
            # All used child tags exist also.
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        xml_doc = soup.prettify()

        with open(gmc.idee_map, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
def _update_en(self):
"""Update concept-map.xml."""
if len(self.en_list) == 0:
return
        with open(gmc.concept_map, 'r') as sitemap_file:
            xml_doc = sitemap_file.read()
            sitemap_file.flush()
            sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        for sitemap_path in self.en_list:
            url = gmc.website + "/" + sitemap_path.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))

            if not tag:
                tag = soup.find(INDEX_TAG)
                new_tag = soup.new_tag(SIDEMAP_TAG)
                tag.append(new_tag)
                tag = new_tag
                new_tag = soup.new_tag(LOC_TAG)
                new_tag.string = url
                tag.append(new_tag)
                new_tag = soup.new_tag(LASTMOD_TAG)
                tag.append(new_tag)
            else:
                tag = tag.parent

            # tag holds now the correct SIDEMAP_TAG.
            # Either it had been found or created.
            # All used child tags exist also.
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        xml_doc = soup.prettify()

        with open(gmc.concept_map, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
def _update_main(self):
"""Update sitemap.xml."""
if len(self.de_list) == 0 and len(self.en_list) == 0:
return
        with open(gmc.sitemap, 'r') as sitemap_file:
            xml_doc = sitemap_file.read()
            sitemap_file.flush()
            sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        if len(self.de_list) > 0:
            url = gmc.website + "/" + gmc.idee_map.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))
            # We know in this case, that the tag exists
            tag = tag.parent
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        if len(self.en_list) > 0:
            url = gmc.website + "/" + gmc.concept_map.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))
            # We know in this case, that the tag exists
            tag = tag.parent
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        xml_doc = soup.prettify()

        with open(gmc.sitemap, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
@staticmethod
def _update(sitemap_path, article_data):
        sitemap_path.resolve()
        if sitemap_path.exists():
            with open(sitemap_path, 'r') as sitemap_file:
                xml_doc = sitemap_file.read()
                sitemap_file.flush()
                sitemap_file.close()
        else:
            gmc.map_template.resolve()
            with open(gmc.map_template, 'r') as sitemap_file:
                xml_doc = sitemap_file.read()
                sitemap_file.flush()
                sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        article_url = gmc.website + "/"\
            + "article" + "/" + article_data.name + ".html"
        tag = soup.find(LOC_TAG, text=re.compile(r"" + article_url))
        if not tag:
            tag = soup.find(URLSET_TAG)
            new_tag = soup.new_tag(URL_TAG)
            tag.append(new_tag)
            tag = new_tag
            new_tag = soup.new_tag(LOC_TAG)
            new_tag.string = article_url
            tag.append(new_tag)
            new_tag = soup.new_tag(LASTMOD_TAG)
            tag.append(new_tag)
        else:
            tag = tag.parent

        # tag holds now the correct URL_TAG.
        # Either it had been found or created.
        # All used child tags exist also.
        tag = tag.find(LASTMOD_TAG)
        tag.string = article_data[PubMetaData.revdate]

        xml_doc = soup.prettify()

        with open(sitemap_path, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
RSS - Really Simple Syndication
The RSS will be based on the standard described by the "Feed Validation Service" 21 and "RSS 2.0 Specification" 22 .
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Idee der eigenen Erkenntnis</title>
<atom:link href="https://idee.frank-siebert.de/idee-rss.xml" rel="self"
type="application/rss+xml" />
<link>https://idee.frank-siebert.de</link>
<description>Idee</description>
<lastBuildDate>Tue, 11 Jan 2022 07:54:24 +0000</lastBuildDate>
<language>de-DE</language>
<generator>pandoc, fs-commit-msg-hook 1.0</generator>
<image>
<url>https://idee.frank-siebert.de/image/favicon-256x256-150x150.png</url>
<title>Idee der eigenen Erkenntnis</title>
<link>https://idee.frank-siebert.de</link>
<width>32</width>
<height>32</height>
</image>
<item>
<title>Best Article Ever Written</title>
<link>
https://idee.frank-siebert.de/article/best-article-ever-written.html</link>
<pubDate>Tue, 11 Jan 2022 07:50:11 +0000</pubDate>
<category><![CDATA[Uncategorized]]></category>
<guid isPermaLink="false">
https://idee.frank-siebert.de/article/best-article-ever-written.html-2022-01-11T07:50:11</guid>
<description>
<![CDATA[First 406 characters of the article, followed by ...]]>
</description>
<content:encoded><![CDATA[<article>......</article>]]></content:encoded>
<enclosure
url="https://idee.frank-siebert.de/audio/best-article-ever-written.mp3"
length="9090090" type="audio/mpeg" />
</item>
</channel>
</rss>
The article content will be embedded completely into the RSS, enclosed in a CDATA section and encoded in UTF-8. To be able to include the complete content, the extension "RDF Site Summary 1.0 Modules: Content" 23 with the namespace declaration xmlns:content="http://purl.org/rss/1.0/modules/content/" needs to be used.
The RSS file will reference its own web location via atom:link, which requires the namespace entry xmlns:atom="http://www.w3.org/2005/Atom" for the atom:link line shown above.
To make sure existing feed consumers are served the RSS feed without any need to change their subscribed link, the nginx configuration needs a location /feed/ that redirects to the RSS file.
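A minimal location block for this redirect could look like the following sketch (assuming the German feed as the target; the actual configuration may differ):
location /feed/ {
    return 301 /idee-rss.xml;
}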
Since the RSS consumers will most likely use the channel's title to present the feed items, and this title is the site title, two different RSS xml files are required, one for the site Idee and one for the site Concept. An additional reason to create two RSS xml files is the language information, which can be provided only once in the language tag of the channel.
The specification, according to the post "Multiple channels in a single RSS xml - is it ever appropriate?" 25 does not allow more than one channel in one RSS xml file.
The question left: What's the unit of the length attribute in the enclosure tag?
I found that WordPress provides the size of the file in bytes as value for this attribute, which was also the most probable answer to this question.
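For illustration, the byte count can simply be read from the file system (a minimal sketch; the file path is hypothetical):
from pathlib import Path

# File size in bytes, used as the value of the enclosure length attribute
length = Path("website/audio/example-article.mp3").stat().st_size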
RSS feeds to create
- idee-rss.xml
  - title: Idee der eigenen Erkenntnis
  - link: https://idee.frank-siebert.de
  - description: Idee
  - language: de (or de-DE)
- concept-rss.xml
  - title: Concept of new cognition elicitation personally thinking
  - link: https://idee.frank-siebert.de
  - description: Concept
  - language: en (or en-US)
Article Updates
If articles are updated after publishing, RSS does not provide any option to inform about the date of revision. The best idea of how such an update could be communicated to consumers is described in the post "RSS update single item" 26 .
The idea is to change the guid of the item to signal that the item contains changed content. The answer was not marked as correct, but it was the only answer provided.
The implementation choice is to use the link and timestamp of the update as combined guid string.
In GPodder, the GUID change sometimes resulted in duplicate entries shown for one article, which is not the intended result. However, GPodder recognized changes in the content without any additional signaling. At least that's the current impression.
Number of RSS items
The RSS files will contain the last 10 articles, the last created/updated one first. Since I plan to migrate articles in the sequence of their original publishing, I'll come out of the migration with my latest articles automatically being featured in the RSS feed, with the only difference that I will have two feeds in the new solution.
Templates
The RSS feed implementation will start off with two templates, one for the English and one for the German version, in the folder portal, containing only the channel information; the items are to be added.
After the initial feed creation the templates are no longer required, but I'll keep them anyhow. Presumably the implementation will be very similar to the sitemap implementation.
Implementation
The implementation of the RSS feed generator turned out to be much more cumbersome than expected. Python's module BeautifulSoup gives you the alternative to use the LXMLTreeBuilderForXML, which will nicely write CDATA sections, but will remove them and HTML-encode their content (you know, &gt; and such) when it reads the XML.
The alternative HTMLParserTreeBuilder works nicely for XML as well, as long as all XML tags are lower-case. But since this was not mentioned anywhere I looked for solutions to the first problem, I had to find out this second problem by myself.
Just by luck I found out, before trying it in an implementation, that using the lxml package without BeautifulSoup would not solve problem number one.
After careful reading I based my third implementation on the module xml.dom.minidom. This is a pretty low-level implementation requiring some more lines of code, but it provides the required control over the CDATA sections and does not overwrite my implementation decision when it reads the XML.
It just has the annoying habit of adding empty lines containing only whitespace with its method toprettyxml(). Every time you read and save, it will add an additional line between otherwise untouched lines of the XML. But this is at least easily solved by two regex pattern substitutions, without any risk of mistakenly altering the HTML fragments embedded via CDATA.
The following code shows the current implementation. The result has been tested with GPodder, Liferea and Tidings, where GPodder cares only for items with a media reference in the enclosure tag, while Tidings and Liferea show items regardless of the presence of an enclosure.
~/projects/idee/generator/rssbuilder.py
"""
Update the RSS feed of the website.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
All links provided relative to the /article/ folder
@author: Frank Siebert
"""
import re
import datetime
import xml.dom.minidom
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from pubmetadata import PubMetaData
from pubmetadata import pageurn
from gitmsgconstants import GitMsgConstants as gmc
= "channel"
CHANNEL_TAG = "lastBuildDate"
LASTBUILD_TAG = "item"
ITEM_TAG = "title"
TITLE_TAG = "link"
LINK_TAG = "pubDate"
PUBDATE_TAG = "guid"
GUID_TAG = "description"
DESCRIPTION_TAG = "content:encoded"
CONTENT_TAG = "enclosure"
ENCLOSURE_TAG = "audio"
AUDIO_TAG
# for the testing on server sol
# HOST = "http://sol:88/"
# for the website
= gmc.website + "/"
HOST
# Number of items to included into the RSS feed
= 15
ITEM_COUNT
def by_pub_date(article_data):
"""
Return the publishing date as sort criteria.
Parameters
----------
    article_data : Series
        The article data entry.
Returns
-------
TYPE
Date as Str
"""
return article_data[PubMetaData.pubdate]
class RSSBuilder():
"""Manage all changees in the sitemaps."""
def __init__(self):
"""
Initialize changelists.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
# information for German page changes on site "Idee".
self.de_list = []
# information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
# soup of currently processed RSS xml
self._rss_xml = None
# soup tag of currently processed article
self._article_tag = None
for article_data in PubMetaData.instance._updates:
if article_data[PubMetaData.site] == "Idee" \
and article_data.name != "rechtliches":
self.de_list.append(article_data)
else:
if article_data.name != "legal":
self.en_list.append(article_data)
for article_data in PubMetaData.instance._deletions:
# TODO
pass
# Default sort is ascending, oldest posts first in list
self.de_list.sort(key=by_pub_date)
self.en_list.sort(key=by_pub_date)
def update(self):
"""
Iterate over changes and update respective rss files.
Returns
-------
None.
"""
# Update idee-rss.xml.
if len(self.de_list) > 0:
self._update(self.de_list, gmc.idee_rss)
# Update concept-rss.xml.
if len(self.en_list) > 0:
self._update(self.en_list, gmc.concept_rss)
def _read_article_tag(self, article_data):
"""
Read the article tag of the processed article.
The article tag gets assigned to self._article_tag
Returns
-------
None.
"""
        articlepath = gmc.articlepath / article_data.name
        articlepath = articlepath.with_suffix(".html")
        articlepath.resolve()

        with open(articlepath, 'r') as infile:
            html_doc = infile.read()
            infile.flush()
            infile.close()

        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)
        self._article_tag = soup.find("article")

        # RSS is downloaded, there is no use case for relative links
        # even if RSS consumers theoretically could compute them
        # to absolute links
        # "../" becomes "https://idee.frank-siebert.de/"
        tags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "href": re.compile(r"^\.\./")})
        for tag in tags:
            href = tag.attrs["href"]
            href = href.replace("../", HOST)
            tag.attrs.update({"href": href})

        tags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "src": re.compile(r"^\.\./")})
        for tag in tags:
            href = tag.attrs["src"]
            href = href.replace("../", HOST)
            tag.attrs.update({"src": href})

        # "./" becomes "https://idee.frank-siebert.de/article/"
        tags = self._article_tag.find_all("a", attrs={
            "href": re.compile(r"^\./")})
        for tag in tags:
            href = tag.attrs["href"]
            href = href.replace("./", HOST + "article/")
            tag.attrs.update({"href": href})

        self._article_tag.prettify()
def _article_cleanup(self):
"""
        Remove some things from the article's BeautifulSoup model.
Remove those things, which are not rendered nicely in the
RSS feed consumer, or which are simply dysfunctional there.
Changes are applied to the currently processed article
referenced by self._article_tag
Consumers tested: GPodder, Liferea, Tidings
Returns
-------
None.
"""
        # peel out sections
        sections = self._article_tag.find_all("section")
        for section in sections:
            section.unwrap()

        # fallback to more common tags
        tag = self._article_tag.find("header")
        tag.name = "div"
        self._article_tag.name = "div"

        # Remove toc
        nav = self._article_tag.find("nav")
        if nav:
            nav.decompose()

        # Remove footnote-back anchors.
        tags = self._article_tag.find_all("a", class_="footnote-back")
        for tag in tags:
            tag.decompose()

        # Remove footnote-ref anchors, preserve the footnote.
        tags = self._article_tag.find_all("a", class_="footnote-ref")
        for tag in tags:
            suptag = tag.find("sup")
            # make footnotes more visible
            suptag.string.replace_with("(" + suptag.text + ")")
            tag.unwrap()

        # Remove category anchors
        tags = self._article_tag.find_all("a", class_="category")
        for tag in tags:
            tag.decompose()

        # Remove attributes from images preventing them
        # from being shown in gpodder
        images = self._article_tag.find_all("img")
        for img in images:
            img.attrs = {"src": img.attrs["src"]}

        # Remove id attributes or some tags might not
        # render nicely
        idtags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "id": True})
        for tag in idtags:
            tag.attrs.pop("id")

        # Remove role attributes or some tags might not
        # render nicely
        idtags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "role": True})
        for tag in idtags:
            tag.attrs.pop("role")

        # Remove tabindex attributes not working anyhow in gpodder
        idtags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "tabindex": True})
        for tag in idtags:
            tag.attrs.pop("tabindex")
def _get_item_tag(self, channel_tag, url, article_data):
"""
Find the item tag based on the url information.
Parameters
----------
channel_tag : xml.dom.minidom.Tag
The <channel> tag from the minidom document model.
url : Str
The url of the article, whose item tag is to be returned.
article_data : Dict
Data dictionary of the currently processed article.
Returns
-------
item_tag : xml.dom.minidom.Tag
The pre-existing or created <item> tag for the currently
processed article.
"""
        item_tag = None
        tag = None
        links = channel_tag.getElementsByTagName(LINK_TAG)

        for link in links:
            savedurl = None
            if len(link.childNodes) > 0:
                savedurl = link.childNodes[0].data.strip()
            if url == savedurl:
                tag = link
                break

        if tag:
            item_tag = tag.parentNode
        else:
            item_tag = self._rss_xml.createElement(ITEM_TAG)

            new_tag = self._rss_xml.createElement(TITLE_TAG)
            nodetext = article_data[PubMetaData.title]
            textnode = self._rss_xml.createTextNode(nodetext)
            new_tag.appendChild(textnode)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(LINK_TAG)
            nodetext = url
            textnode = self._rss_xml.createTextNode(nodetext)
            new_tag.appendChild(textnode)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(PUBDATE_TAG)
            pubdatetime = datetime.datetime.fromisoformat(
                article_data[PubMetaData.pubdate])
            # running your computer on an english locale
            # is helpful for the next line.
            nodetext = pubdatetime.strftime(
                "%a, %d %b %Y %H:%M:%S +0000")
            textnode = self._rss_xml.createTextNode(nodetext)
            new_tag.appendChild(textnode)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(GUID_TAG)
            new_tag.setAttribute("isPermaLink", "false")
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(DESCRIPTION_TAG)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(CONTENT_TAG)
            item_tag.appendChild(new_tag)

            # Processing oldest first, and inserting the items always
            # before the first childNode, we get newest first in the XML.
            # To become specification compliant, we finalize by moving
            # all item tags to the end of the channel tag later.
            channel_tag.insertBefore(item_tag,
                                     channel_tag.childNodes[0])

        return item_tag
def _finalize_channel(self, channel_tag):
"""
Move the items behind the other channel tags.
Take care that the number of items does not exceed ITEM_COUNT.
Update the lastBuildDate.
Parameters
----------
channel_tag : xml.dom.minidom.Tag
The <channel> tag from the minidom document model.
Returns
-------
None.
"""
        tags = channel_tag.getElementsByTagName(ITEM_TAG)
        item_count = 0
        for tag in tags:
            if item_count < ITEM_COUNT:
                channel_tag.appendChild(tag)
                item_count += 1
            else:
                channel_tag.removeChild(tag)

        # change last build date
        # running your computer on an english locale
        # is helpful for this.
        tag = channel_tag.getElementsByTagName(LASTBUILD_TAG)[0]
        pubdatetime = datetime.datetime.fromisoformat(self._nowdate)
        nodetext = pubdatetime.strftime(
            "%a, %d %b %Y %H:%M:%S +0000")
        tag.childNodes[0].nodeValue = nodetext
@staticmethod
def _remove_empty_lines(xml_doc):
"""Remove empty lines with and without whitespaces."""
        pattern = re.compile(r"^\s*$", re.MULTILINE)
        xml_doc = pattern.sub("", xml_doc)
        pattern = re.compile(r"\n\n", re.MULTILINE)
        xml_doc = pattern.sub("\n", xml_doc)
        return xml_doc
def _update(self, article_list, rss_path):
"""
Update the RSS file based on the list of changed or added articles.
Parameters
----------
article_list : List
The list of article_data entries of changed or added articles.
Oldest posts are first in the list.
rss_path : Path
The Path to the RSS file.
Returns
-------
None.
"""
        with open(rss_path, 'r') as rss_file:
            self._rss_xml = xml.dom.minidom.parse(rss_file)

        channel_tag = self._rss_xml.getElementsByTagName(CHANNEL_TAG)[0]

        for article_data in article_list:
            self._read_article_tag(article_data)
            url = HOST + "article" +\
                "/" + article_data.name + ".html"
            item_tag = self._get_item_tag(channel_tag, url, article_data)

            tag = item_tag.getElementsByTagName(GUID_TAG)[0]
            # Changing the guid on update creates problems with some
            # consumers
            nodetext = url  # + "-" + self._nowdate
            if not tag.hasChildNodes():
                textnode = self._rss_xml.createTextNode(nodetext)
                tag.appendChild(textnode)
            else:
                tag.childNodes[0].nodeValue = nodetext

            tag = item_tag.getElementsByTagName(DESCRIPTION_TAG)[0]
            nodetext = " ".join(
                self._article_tag.find("p").text.split())[:406] + " ..."
            if not tag.hasChildNodes():
                textnode = self._rss_xml.createCDATASection(nodetext)
                tag.appendChild(textnode)
            else:
                tag.childNodes[0].nodeValue = nodetext

            # save the audio uri before the removal
            # of the header tag
            url = None
            tag = self._article_tag.find(AUDIO_TAG)
            if tag:
                url = tag.attrs["src"]

            self._article_cleanup()

            tag = item_tag.getElementsByTagName(CONTENT_TAG)[0]
            if tag.hasChildNodes():
                tag.removeChild(tag.childNodes[0])
            nodetext = self._article_tag.prettify()
            nodetext = " ".join(nodetext.split())
            textnode = self._rss_xml.createCDATASection(nodetext)
            tag.appendChild(textnode)

            # An update might add or update the audio
            tags = item_tag.getElementsByTagName(ENCLOSURE_TAG)
            tag = None
            if url and len(tags) == 0:
                tag = self._rss_xml.createElement(ENCLOSURE_TAG)
                item_tag.appendChild(tag)
            elif len(tags) > 0 and not url:
                item_tag.removeChild(tags[0])

            # Update enclosure tag
            if tag:
                audio = gmc.audiopath / (pageurn(
                    article_data[PubMetaData.title]) + ".mp3")
                filelength = 0
                audio.resolve()
                if audio.exists():
                    filelength = audio.stat().st_size
                tag.setAttribute("url", url)
                tag.setAttribute("length", "{}".format(filelength))
                tag.setAttribute("type", "audio/mpeg")

        self._finalize_channel(channel_tag)

        xml_doc = self._rss_xml.toprettyxml(indent=" ", encoding="utf-8")
        xml_doc = self._remove_empty_lines(xml_doc.decode("utf-8"))

        with open(rss_path, 'w') as rss_file:
            print(xml_doc, file=rss_file)
            rss_file.flush()
            rss_file.close()
Error Search
The initial code worked nicely, but I didn't get my audio episodes shown in my favorite podcast catcher GPodder on my Sailfish OS device. I found out that GPodder contains a lot of Python as well, and that the module used to parse the RSS feed is named podcastparser.
Since I didn't see the cause of the error with my blinded eyes, I ended up investigating it with the following test code.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Feb 22 13:02:51 2022
@author: frank
"""
import podcastparser
import urllib.request

feedurl = 'http://sol:88/idee-rss.xml'

parsed = podcastparser.parse(feedurl, urllib.request.urlopen(feedurl))

# parsed is a dict
import pprint
pprint.pprint(parsed)
Via this excursion I found out the following things:
- The reason for the error was me using a uri attribute instead of a url attribute in the enclosure tag.
- This podcastparser supports relative links in the RSS file, so most probably others will support this as well.
- The test cases in their git repository indicate that CDATA sections should work nicely.
Relative Links in RSS
The RSS Advisory Board declares its opinion that relative links should be supported. The discussion documented on that page also proposes how it should be done, which seems to fit with the podcastparser.py implementation 27 .
The proposal boils down to the notion that the channel's link element should provide the location to which the links are relative. If required, this can be overridden with the attribute xml:base.
Since the link element of the channel should point to the HTML of the channel's index (or entry) page, I felt more comfortable with the dedicated xml:base attribute.
Making everything else in the channel elements relative, the RSS Template for my German channel should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xml:base="http://sol:88/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Idee der eigenen Erkenntnis</title>
<atom:link href="idee-rss.xml" rel="self"
type="application/rss+xml" />
<link>idee.html</link>
<description>Idee</description>
<lastBuildDate>Tue, 11 Jan 2022 07:54:24 +0000</lastBuildDate>
<language>de</language>
<generator>pandoc, fs-commit-msg-hook 1.0</generator>
<image>
<url>image/favicon-256x256-150x150.png</url>
<title>Idee der eigenen Erkenntnis</title>
<link>idee.html</link>
<width>64</width>
<height>64</height>
</image>
</channel>
</rss>
Where the current xml:base value is for the testing period only.
Thinking further about this, all my article pages are in the folder article. If I use that folder as the base, all relative links used in the content part should resolve nicely.
The channel part then should look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xml:base="http://sol:88/article/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Idee der eigenen Erkenntnis</title>
<atom:link href="../idee-rss.xml" rel="self"
type="application/rss+xml" />
<link>../idee.html</link>
<description>Idee</description>
<lastBuildDate>Tue, 11 Jan 2022 07:54:24 +0000</lastBuildDate>
<language>de</language>
<generator>pandoc, fs-commit-msg-hook 1.0</generator>
<image>
<url>../image/favicon-256x256-150x150.png</url>
<title>Idee der eigenen Erkenntnis</title>
<link>../idee.html</link>
<width>64</width>
<height>64</height>
</image>
</channel>
</rss>
Nice and good thoughts, but it doesn't work as hoped. You can use relative links in the channel tags, and it works fine as far as I tested it. But you cannot rely on that for links in the content tag: how the consumer resolves these links, or whether it bothers to try at all, is something you may not rely on. To be fair, the specification is really unspecific in this respect.
According to the official specification even CDATA sections would not work, as it states that all content needs to HTML-escape all special characters. Using a CDATA section instead is much more convenient and turns out, luckily, to be supported by the feed consumers. But CDATA by definition means "Character Data" not to be parsed (Character Data to be parsed would be PCDATA).
Implementors can now argue that parsing and processing relative links must not be done for the CDATA section in the context of xml:base, and that would be correct. But they could also argue that CDATA is not to be parsed or processed in any way, just to be displayed, and that would be correct as well.
I had a hard time getting my article images shown using relative links. In the end I found that the images were not shown not because of issues with relative links, but because of tag attributes like alt and title. Also, headline tags are not rendered as headlines in GPodder if, e.g., the h2 tag features an id attribute.
I extended the code to perform a number of attribute removals and some tag replacements, which removed my issues. I did this with a code version that used fully qualified links, and I did not go back to give the relative links one more try. Fully qualified links are anyhow the least likely to cause problems with any feed reader.
Include the Portal Fragment
I go back to an early discussion. Obviously I failed to find any HTML means to separate the content of the article from the content of the portal, and it does not look as if something like an html-include will become part of HTML and be supported by browsers.
But it turns out that it can be done by the web server, using one of its extension modules. Some examples exist where the functions add_before_body and add_after_body from "Module ngx_http_addition_module" 28 are used to inject a header and a footer.
The article "nginx: Mitigating the BREACH Vulnerability with Perl and SSI or Addition or Substitution Modules — Wild Wild Wolf" 29 is not really about this topic, but it does show that using these two functions we would end up with invalid HTML. Not a big problem, if it works and if this is everything you do care about.
The same article shows that the "Module ngx_http_ssi_module" 30 does exactly what's required to perform such an include on the server side.
You could now argue that this is a step back from the goal to be completely plain HTML only. But it is, that's my argument, close enough to the feature I would have hoped to see included in the HTML standard. I'm willing to accept that the feature is now provided by the web server.
For the implementation this means that I have to go a few steps back and modify the code which turns my plain HTML into portal HTML. This part will no longer include the header itself, but only a comment line with the processing instruction to include the header.
Since the nginx site configuration now becomes an essential part of the implementation, I'll move it into the git repository as well.
Relocate nginx Site Configuration
In this first step the content of the site configuration stays the same. It is just copied via copy+paste from the file sol:/etc/nginx/sites-available/idee_88 into the new file in the git repository.
The file name and the used port will change, when I go live.
frank @Asimov:~/projects/idee$ mkdir nginx
frank @Asimov:~/projects/idee$ cd nginx/
frank @Asimov:~/projects/idee/nginx$ vim idee_88
frank @Asimov:~/projects/idee/nginx$ git add .
frank @Asimov:~/projects/idee/nginx$ git commit
frank @Asimov:~/projects/idee/nginx$ git push
Enter passphrase for key '/home/frank/.ssh/id_rsa':
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 750 bytes | 750.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
remote: From /home/git/idee
remote: 8316cca..eabae10 master -> origin/master
remote: Updating 8316cca..eabae10
remote: Fast-forward
remote: nginx/idee_88 | 36 ++++++++++++++++++++++++++++++++++++
remote: 1 file changed, 36 insertions(+)
remote: create mode 100644 nginx/idee_88
To ssh://sol/home/git/idee.git
8316cca..eabae10 master -> master
root @sol:/etc/nginx/sites-available# rm idee_88
root @sol:/etc/nginx/sites-available# ln -s /var/www/idee/nginx/idee_88
root @sol:/etc/nginx/sites-available# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
root @sol:/etc/nginx/sites-available# nginx -s reload
SSI Installation
SSI stands for Server Side Includes.
The required SSI module is included in the nginx-extras package. But it turns out to also be part of the nginx-full package, which I already have installed.
root @sol:/etc/nginx/sites-available# apt-cache show nginx-full
[...]
OPTIONAL HTTP MODULES: Addition, Auth Request, Charset, WebDAV, GeoIP, Gunzip,
Gzip, Gzip Precompression, Headers, HTTP/2, Image Filter, Index, Log, Real IP,
Slice, SSI, SSL, Stream, SSL Preread, Stub Status, Substitution, Thread Pool,
Upstream, User ID, XSLT.
[...]
No further installation required.
SSI Configuration
The nginx site configuration needs one additional line to activate SSI for the location.
location / {
ssi on;
# First attempt to serve request as file, then
# as directory, then fall back to displaying a 404.
try_files $uri $uri.html $uri/ =404;
}
Include Instruction in the HTML
<html lang="de-DE" xml:lang="de-DE" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
[...]</head>
<body>
<!--# include file="portal/idee_header.html" -->
<main>
<article>
[...]</article>
</main>
</body>
</html>
For English pages the include will reference the file portal/concept_header.html. The documentation does not explain the reference directory for the include path. Is it simply the webroot, or is it the location of the HTML document? As you can see, I guessed it is the webroot, which is also simpler for my implementation.
First tests ahead of the implementation show that the assumption is not correct. The above sample leads to error 404 (the included header page was not found). Defining it relative to the article is correct, but considered unsafe:
2022/03/02 11:07:32 [error] 13045#13045: *396623 unsafe URI "/article/../portal/idee_header.html" was detected while sending response to client, client: 10.19.67.21, server: _, request: "GET /article/endlich.html HTTP/1.1", host: "sol:88"
Since there was only one choice left, that one turned out to work.
[...]<body>
<!--# include file="/portal/idee-header.html" -->
<main>
[...]
You might also notice that I decided to rename the header file to use a hyphen instead of an underscore. This was just for consistency in my file names. Note that this is only a chapter in a much longer description.
~/projects/idee/generator/archive.py
"""
Update the archive of the webseite.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
"""
import re
import datetime
from bs4 import BeautifulSoup
from bs4 import Comment
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from pubmetadata import PubMetaData
from gitmsgconstants import GitMsgConstants as gmc
class Archive():
"""Manage all changees in the archive."""
def __init__(self):
"""
Initialize changelists.
Returns
-------
None.
"""
# map information for German page changes on site "Idee".
self.de_list = []
# map information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
def update(self):
"""
Iterate over changes and update respective archive pages.
Add the archive pages to their respective change list.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
for article_data in PubMetaData.instance._updates:
= article_data[PubMetaData.pubdate][0:7]
creation_month = article_data[PubMetaData.site]
site = site.lower() + "-" + creation_month + ".html"
archive_path = gmc.archivepath / archive_path
archive_path
if site == "Idee":
if archive_path not in self.de_list:
self.de_list.append(archive_path)
else:
if archive_path not in self.en_list:
self.en_list.append(archive_path)
if article_data.name != "rechtliches" \
and article_data.name != "legal":
= self._update(archive_path, article_data)
soup = soup.prettify()
html_doc
with open(archive_path, 'w') as archive_file:
print(html_doc, file=archive_file)
archive_file.flush()
archive_file.close()
for article_data in PubMetaData.instance._deletions:
# TODO
pass
self._update_de()
self._update_en()
def _update_de(self):
"""Update idee-archive.html."""
if len(self.de_list) == 0:
return
with open(gmc.idee_archive, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()
= HTMLParserTreeBuilder
builder = BeautifulSoup(html_doc, builder=builder)
soup
for archive_path in self.de_list:
= './' + archive_path.name
url = soup.find("a", href=re.compile(r"" + url))
tag
if not tag:
= soup.find("main")
tag = soup.new_tag("h3")
new_tag 0, new_tag)
tag.insert(= new_tag
tag = soup.new_tag("a")
new_tag "href": url})
new_tag.attrs.update({= archive_path.name
new_tag.string
tag.append(new_tag)
= soup.prettify()
html_doc
with open(gmc.idee_archive, 'w') as archive_file:
print(html_doc, file=archive_file)
archive_file.flush()
archive_file.close()
def _update_en(self):
"""Update concept-archive.html."""
if len(self.en_list) == 0:
return
with open(gmc.concept_archive, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()
= HTMLParserTreeBuilder
builder = BeautifulSoup(html_doc, builder=builder)
soup
for archive_path in self.en_list:
= './' + archive_path.name
url = soup.find("a", href=re.compile(r"" + url))
tag
if not tag:
= soup.find("main")
tag = soup.new_tag("h3")
new_tag 0, new_tag)
tag.insert(= new_tag
tag = soup.new_tag("a")
new_tag "href": url})
new_tag.attrs.update({= archive_path.name
new_tag.string
tag.append(new_tag)
= soup.prettify()
html_doc
with open(gmc.concept_archive, 'w') as archive_file:
print(html_doc, file=archive_file)
archive_file.flush()
archive_file.close()
@staticmethod
def _get_abstract(article_data):
"""
Read the abstract of the processed article.
The abstract consists of the first 406 characters of the first
<p> tag, or less, if the respective string is shorter.
Returns
-------
Str.
"""
= gmc.articlepath / article_data.name
articlepath = articlepath.with_suffix(".html")
articlepath
articlepath.resolve()
with open(articlepath, 'r') as infile:
= infile.read()
html_doc
infile.flush()
infile.close()
= HTMLParserTreeBuilder()
builder = BeautifulSoup(html_doc, builder=builder)
soup
= soup.find("p")
tag return " ".join(tag.text.split())[0:406]
@staticmethod
def _update(archive_path, article_data, article_loc="../article/"):
= None
is_new
archive_path.resolve()if archive_path.exists():
with open(archive_path, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()= False
is_new else:
gmc.archive_template.resolve()with open(gmc.archive_template, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()= True
is_new
= HTMLParserTreeBuilder
builder = BeautifulSoup(html_doc, builder=builder)
soup
if is_new:
= soup.find("body")
tag # SSI header injection is a function of the language
if article_data[PubMetaData.locale].startswith("de"):
= Comment('# include file="/portal/idee-header.html" ')
new_tag = "de"
language = "Idee"
site_name = "Archiv"
title_prefix else:
= Comment(
new_tag '# include file="/portal/concept-header.html" ')
= "en"
language = "Concept"
site_name = "Archive"
title_prefix 0, new_tag)
tag.insert(
= soup.find("html")
tag "lang": language, "xml:lang": language})
tag.attrs.update({= soup.find("meta", property="og:site_name")
tag "Content": site_name})
tag.attrs.update({
= soup.find("title")
tag = " ".join([title_prefix,
tag.string 0:7]])
article_data[PubMetaData.pubdate][
= soup.find("h1")
tag = " ".join([title_prefix,
tag.string 0:7]])
article_data[PubMetaData.pubdate][
= article_loc + article_data.name + ".html"
article_url
= soup.find("a", href=article_url)
tag if not tag:
= soup.find("h1")
tag
= soup.new_tag("article")
new_tag if tag: # true in archive, false in index page
tag.insert_after(new_tag)else:
= soup.find("main")
tag 0, new_tag)
tag.insert(= new_tag
tag
= soup.new_tag("header")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("h2")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("a")
new_tag "href": article_url,
new_tag.attrs.update({"alt": article_data[PubMetaData.title]})
= article_data[PubMetaData.title]
new_tag.string
tag.append(new_tag)= tag.parent # header
tag
= soup.new_tag("div")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("time")
new_tag "datetime":
new_tag.attrs.update({19],
article_data[PubMetaData.pubdate][:"pubdate": "true"})
= article_data[PubMetaData.pubdate][:10]
new_tag.string
tag.append(new_tag)
= soup.new_tag("address")
new_tag = article_data[PubMetaData.author]
new_tag.string
tag.append(new_tag)= tag.parent.parent # article
tag
= soup.new_tag("p")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("a")
new_tag "href": article_url,
new_tag.attrs.update({"alt": article_data[PubMetaData.title]})
= "..."
new_tag.string "placeholder") # for the article abstract
tag.append(
tag.append(new_tag)= tag.parent # article
tag
= soup.new_tag("hr")
new_tag
tag.append(new_tag)else:
= tag.parent.parent.parent # article
tag
# tag holds now the article tag.
# Either it had been found or created.
# All used child tags exist also.
# Write or update the article abstract
= tag.find("p")
tag = tag.find("a")
tag
tag.previousSibling.replace_with(Archive._get_abstract(article_data))
# We give every anchor a tabindex
# 5 Tabindexes are in the portal header
= 6
index = soup.find_all(re.compile(r"^a$|^audio$|^input$"))
tags for tag in tags:
"tabindex": index})
tag.attrs.update({+= 1
index
return soup
Migration
Migration, which I had hoped would be quick, needs to be done manually. Not only do I have to supervise the result step by step, I also put my own comments below articles to update or amend them, and these now need to be incorporated into the article text.
And, as will be seen, slight adjustments to the wiki text need to be made in some cases to get the desired result.
Migration issue: double-byte unicode characters break PDF generation
The standard pandoc installation does not support double-byte unicode characters, as it uses LaTeX for the PDF generation.
In my case this happened with the code point U+03BA for the Greek character κ. Not knowing when and why the PDF generation will break next time is not an option. And it's not possible to fix the issue just by removing the character, since it is surely used for a reason.
The stackoverflow discussion "Pandoc and foreign characters" 31 explains that the problem can be solved by specifying a different PDF engine via --pdf-engine=xelatex.
However, this is only part of the answer, since this engine first needs to be installed, and since a font needs to be chosen which contains the character.
The engine can be installed from the debian repository by:
frank @Asimov:~/projects/idee$ sudo apt-get install texlive-xetex
A search for fonts supporting the character can be done with:
frank @Asimov:~/projects/idee$ fc-list ':charset=03BA'
This list is quite long and, if you think about it, helpful only in the most exotic cases. My best guess for a suitable font to render everything I use in my wiki pages in PDF would be the font used by my web browser.
Was it Firefox or was it Chromium? In one of my browsers I found the default to be DejaVu Sans. How can the font be specified? That can be done via command line parameters.
Indeed I found a number of pages describing how this can be done, but in the end none of them worked as expected. Only the "Pandoc User’s Guide" 32 helped in the end.
-V KEY[=VAL], --variable=KEY[:VAL]
Set the template variable KEY to the value VAL when rendering the document in standalone mode. If no VAL is specified, the key will be given the value true.
mainfont, sansfont, monofont, mathfont, CJKmainfont
font families for use with xelatex or lualatex: take the name of any system font, using the fontspec package. CJKmainfont uses the xecjk package.
These two pieces of information combined explained to me when to use the ":" symbol and when to use the "=" symbol, which, for whatever reason, was not done correctly in the examples I found, or probably I failed to understand them correctly.
The working code to call pandoc from Python, naming the fonts to use:
"pandoc",
subprocess.run([# mediawiki markup as input format
"-f", "html",
# html as output forma
"-t", "pdf",
# input file
# "-i", inpath,
# output file
"-o", self.outpath,
"--pdf-engine=xelatex",
"--variable=mainfont:DejaVu Serif",
"--variable=sansfont:DejaVu Sans",
"--variable=monofont:DejaVu Sans Mono",
"--variable=geometry:a4paper",
"--variable=geometry:margin=2.5cm",
"--variable=linkcolor:blue"
\
],=False,\
capture_output# the correct workdirectory to find the images
=workpath,\
cwd# html string as stdin
input=html_doc.encode("utf-8"))
A resource worth visiting for further beautification: "Customizing pandoc to generate beautiful pdf and epub from markdown" 33
For a start I'm happy if the PDF is generated correctly, but I'm sure I'll revisit the topic to get from good results to perfect results.
Missing Character
Greek characters were no problem, and for CJK (Chinese, Japanese, Korean) fonts a separate variable can be set. But now I got problems with Hebrew characters, and what would it look like if Arabic characters were required?
Funnily enough, having the characters nicely rendered in the web page doesn't tell you anything about your success during PDF creation.
[WARNING] Missing character: There is no א (U+05D0)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ד (U+05D3)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ם (U+05DD)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no א (U+05D0)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ֲ (U+05B2)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ד (U+05D3)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no מ (U+05DE)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ה (U+05D4)
in font DejaVu Serif/OT:script=latn;language=d
The command fc-list does not show any installed font for these character codes, but the browser does show them. This means that the browser gets its fonts from somewhere else, if it needs them.
Curious what font would be reported by the browser, I used the inspection tool and got the answer "Liberation Sans", which convinced me to change the fonts to be used for the PDF generation to Liberation Fonts.
This is one of the fonts installed by default on Debian. And guess what, it worked! Probably I did something wrong with the fc-list command. I think the font looks better balanced in the PDF; it is definitely a good change, not only for that article.
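To avoid stumbling into the next missing glyph only at PDF generation time, the coverage check can also be scripted. This is only a small sketch of mine around the fc-list call shown above, not part of the generator; the function name is my own.

import subprocess


def fonts_for_codepoint(codepoint):
    """Return the font families fc-list reports for a unicode code point.

    codepoint is given as a hex string, e.g. "05D0" for the Hebrew Alef.
    """
    result = subprocess.run(
        ["fc-list", ":charset=" + codepoint, "family"],
        capture_output=True, text=True)
    # fc-list prints one family (or comma separated list of names) per line
    return sorted({line.strip() for line in result.stdout.splitlines()
                   if line.strip()})


# example: which installed fonts claim to cover the Hebrew Alef?
for family in fonts_for_codepoint("05D0"):
    print(family)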
Tables flow out of the PDF page
There are a lot of web pages out there about this topic, and all of them, at least those I found, are about how to change the markdown to prevent this from happening.
The two basic solutions are:
- scaling the table down together with the font size
- use the markdown for multiline tables
Well, since I do not use markdown but MediaWiki markup to write the articles, and create HTML first and then PDF from the HTML, I had a hard time figuring out the solution. Processing things in multiple steps, I could probably find a solution by editing intermediate results, but that is cumbersome and not desirable.
I even started to question my solution. Shouldn't I use markdown for the PDF as well as for the HTML generation? Should I throw big parts of my implementation away and start over again?
However, reading about the solutions helped in the end. How can I convince Pandoc to make a multiline table from my markup? I need to enforce a multiline header cell in my MediaWiki markup.
Note the
<br/>
in the third column:
{| class="wikitable" style="text-align:left;" cellpadding="2px"
! Hersteller
! Impfstoff
! Primary<br/>Completion
! Completion
|-
| BioNTech / Pfizer
| BNT162b2
| 2021-11-30
| 2021-11-30
|}
Relative URLs to own articles do not work in PDF
Nothing to wonder about, but I stumbled upon it nonetheless. There is no way around it: before generating the PDF I have to revert the relative URLs back into absolute URLs pointing to my web site.
DONE
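For illustration, such a rewrite can be done in a few lines with BeautifulSoup. The site root constant and the helper name below are my own assumptions for this sketch, not code taken from the generator.

from urllib.parse import urljoin

from bs4 import BeautifulSoup

# assumption for this sketch, not a constant from the generator
SITE_ROOT = "https://idee.frank-siebert.de/article/"


def absolutize_links(html_doc):
    """Turn relative hrefs and srcs into absolute URLs before PDF creation."""
    soup = BeautifulSoup(html_doc, "html.parser")
    for tag in soup.find_all(["a", "img"]):
        attr = "href" if tag.name == "a" else "src"
        value = tag.get(attr)
        # keep in-document anchors and already absolute references untouched
        if not value or value.startswith(("#", "http://", "https://", "mailto:")):
            continue
        tag[attr] = urljoin(SITE_ROOT, value)
    return str(soup)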
Half Way Migrated - Checkpoint
At the point where I had already migrated all articles up to May 2021, I have 8 audio articles in my RSS feed. On my way I had to take care of a number of bugs, e.g. in the code where the urn of the article is created based on the title. It is a critical detail that the urn matches the current WordPress article URL stem, or I will not be able to process automatic redirects from the old URL to the new URL without an extensive matching list.
I also learned what I have to take care of regarding the article title in MediaWiki. E.g. in some titles I have used quotes. If I use the standard quote ("), then I get a problem with the article filename on disk. When writing the file, the quote character is correctly escaped. But the attempt to read the file in Python leads to a path where the escape character is escaped again, resulting in a file not found error. I now change these titles to use the quote characters („) and (“) instead.
I'm also editing the articles to use < ref > tags for references to my own articles as well, and to put quotes into < blockquote > tags.
Also the filenames of the audio files are now different than before, now using the urn of the article as the stem of the audio filename.
I wouldn't need to care, but I'd like to have all audio articles on my phone in their new representation in my gPodder app. This is not critical for other consumers, it is something I just want to have. Other consumers most probably have no problem with the changed appearance of the articles, as long as their URL for the feed consumption does not stop working.
But for me this special requirement (my wish) leads to the conclusion that I either need to allow a very big RSS feed at the start, or I have to prepare the further article migration, implement the index page generation, and follow the go-live with a rapid migration.
Whichever way I decide, I have to implement the index page generation sooner rather than later, because the go-live is near. Not that a fixed day exists for it, but the progress indicates that it cannot be too far away.
TODOs I must not forget:
- Prevent the search engine indexing of my legal page (in German and English)
  - Prevent the legal page from appearing in the sitemap (Done)
  - Prevent the legal page from appearing in the RSS feed (Done)
  - Prevent the legal page from appearing in the archive (Done)
  - nofollow information at the anchor in the headers (Done)
  - disallow the English and German legal pages explicitly in the robots.txt (Done)
- /feed/ redirect
- Take care that source code references do not each get a tabindex
- Find out how to make backward references from the footnotes back to their text work in the PDF (Done)
  - Include readable http links in the PDF to provide useful footnotes also when printed (footnote section only)
Enabling Backlinks in PDF
I found a list of options to investigate in the post: "How to convert HTML to PDF using pandoc?" 34
wkhtmltopdf
frank @Asimov:~/projects/idee$ sudo apt-cache search wkhtmltopdf
[sudo] password for frank:
python3-django-wkhtmltopdf - Django module with views for HTML to PDF
conversions (Python 3)
pandoc - general markup converter
python3-pdfkit - Python wrapper for wkhtmltopdf to convert HTML to PDF
(Python 3)
wkhtmltopdf - Command line utilities to convert html to pdf or image using
WebKit
frank @Asimov:~/projects/idee$ sudo apt-get install wkhtmltopdf
frank @Asimov:~/projects/idee/plain$ wkhtmltopdf --enable-local-file-access \
--enable-external-links --enable-internal-links --keep-relative-links \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.html \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.pdf
The switch --enable-external-links, is not support using unpatched qt, and will be ignored.
The switch --enable-internal-links, is not support using unpatched qt, and will be ignored.
The switch --keep-relative-links, is not support using unpatched qt, and will be ignored.
Loading page (1/2)
Printing pages (2/2)
Done
I'm not yet willing to install a patched qt for this purpose, not even being sure about the result. Because of this, no links at all work in the PDF. The PDF shows the HTML exactly as it is rendered in the browser, and that is not really what I want either.
But I'll keep this in mind. It might be useful in other use cases.
frank @Asimov:~/projects/idee/plain$ sudo apt-get purge wkhtmltopdf
WeasyPrint
The quite impressive list of packages to be installed for WeasyPrint made me think twice about pressing yes. It even made me read the documentation first: "WeasyPrint" 35
I learned from this documentation that CSS 2 already contains style elements for paged media layout. 36
From the reading I get the impression that it does everything required to lay out the HTML nicely for PDF and to enable all links to work.
And from the post which made me aware of this tool I already know that it can be named as the PDF engine for pandoc. The question has to be asked, of course, whether this makes sense: calling a program written in Haskell to call a program written in Python, when I'm already in a Python program.
However, I'll try exactly that setup for a start, and probably later I'll kick Pandoc out of the PDF generation and use WeasyPrint directly via its API, if it works nicely.
I guess if I go that route, I'll develop a second CSS for page layout details and to override some CSS formatting that is used in the HTML but does not look nice in the PDF.
frank @Asimov:~/projects/idee/plain$ sudo apt-get install weasyprint
[sudo] password for frank:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libblkid-dev libbrotli-dev libcairo-script-interpreter2 libcairo2-dev
libdatrie-dev libfontconfig-dev libfontconfig1-dev libfreetype-dev
libfreetype6-dev libfribidi-dev libglib2.0-dev libglib2.0-dev-bin
libgraphite2-dev libharfbuzz-dev libharfbuzz-gobject0 libice-dev libmount-dev
libpango1.0-dev libpcre2-32-0 libpcre2-dev libpcre2-posix2 libpixman-1-dev
libpng-dev libpng-tools libpthread-stubs0-dev libselinux1-dev libsepol1-dev
libsm-dev libthai-dev libx11-dev libxau-dev libxcb-render0-dev libxcb-shm0-dev
libxcb1-dev libxdmcp-dev libxext-dev libxft-dev libxrender-dev pango1.0-tools
python-tinycss2-common python3-cairocffi python3-cairosvg python3-cffi
python3-cssselect2 python3-pycparser python3-pyphen python3-tinycss2
python3-xcffib uuid-dev x11proto-dev x11proto-xext-dev xorg-sgml-doctools
xtrans-dev
Suggested packages:
libcairo2-doc libdatrie-doc freetype2-doc libgirepository1.0-dev libglib2.0-doc
libgraphite2-utils libice-doc libpango1.0-doc libsm-doc libthai-doc libx11-doc
libxcb-doc libxext-doc python-cairocffi-doc python-cssselect2-doc
python-tinycss2-doc
The following NEW packages will be installed:
libblkid-dev libbrotli-dev libcairo-script-interpreter2 libcairo2-dev
libdatrie-dev libfontconfig-dev libfontconfig1-dev libfreetype-dev
libfreetype6-dev libfribidi-dev libglib2.0-dev libglib2.0-dev-bin
libgraphite2-dev libharfbuzz-dev libharfbuzz-gobject0 libice-dev libmount-dev
libpango1.0-dev libpcre2-32-0 libpcre2-dev libpcre2-posix2 libpixman-1-dev
libpng-dev libpng-tools libpthread-stubs0-dev libselinux1-dev libsepol1-dev
libsm-dev libthai-dev libx11-dev libxau-dev libxcb-render0-dev libxcb-shm0-dev
libxcb1-dev libxdmcp-dev libxext-dev libxft-dev libxrender-dev pango1.0-tools
python-tinycss2-common python3-cairocffi python3-cairosvg python3-cffi
python3-cssselect2 python3-pycparser python3-pyphen python3-tinycss2
python3-xcffib uuid-dev weasyprint x11proto-dev x11proto-xext-dev
xorg-sgml-doctools xtrans-dev
0 upgraded, 54 newly installed, 0 to remove and 1 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 47.1 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Naming weasyprint instead of xelatex as the pdf-engine works instantly. The font settings from the CSS are not used, and the headline color is also not applied as defined in the CSS. Probably the CSS is not found at all.
The CSS is found, however, when the program is called from the command line, making the headlines use the color defined in the CSS. Font settings are still ignored, but this time with a warning message informing about it.
frank @Asimov:~/projects/idee/website/article$ weasyprint -f pdf \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.html \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.pdf
WARNING: Ignored `font: var(--theme-font)` at 29:2, invalid value.
WARNING: Ignored `border-right: 1px solid var(--theme-color)` at 44:2, invalid
value.
WARNING: Ignored `border-left: 1px solid var(--theme-color)` at 45:2, invalid
value.
WARNING: Expected a media type, got screen/**/and/**/(min-width: 641px)
WARNING: Invalid media type " screen and (min-width: 641px) " the whole @media
rule was ignored at 83:1.
WARNING: Expected a media type, got screen/**/and/**/(max-width: 640px)
WARNING: Invalid media type " screen and (max-width: 640px) " the whole @media
rule was ignored at 105:1.
WARNING: Ignored `font: var(--theme-font)` at 197:2, invalid value.
WARNING: Ignored `font: var(--theme-font)` at 236:29, invalid value.
WARNING: Ignored `font: var(--theme-font)` at 239:21, invalid value.
WARNING: Ignored `display: inline-grid` at 254:2, invalid value.
WARNING: Ignored `grid-template-columns: 30px auto auto auto` at 255:2, unknown
property.
WARNING: Ignored `font: var(--theme-font)` at 270:2, invalid value.
WARNING: Ignored `text-shadow: 1px 1px rgba(255, 255, 255, 0.4)` at 294:2,
unknown property.
WARNING: Ignored `border-bottom: 0.3em solid var(--theme-color)` at 327:2,
invalid value.
WARNING: Ignored `font: var(--theme-font)` at 332:2, invalid value.
WARNING: Ignored `font: var(--theme-font)` at 402:2, invalid value.
WARNING: Ignored `outline: 5px solid var(--theme-meta-color)` at 406:2, invalid
value.
WARNING: Ignored `border-top: 2px solid var(--theme-meta-color)` at 412:2,
invalid value.
Links pointing backward inside the document work as they should. Obviously I'll now take a look at a CSS optimization for the PDF generation before I proceed with my migration.
fspdf.css
Creating a completely new CSS for the PDF generation is not helpful, since this might introduce a lot of double maintenance if the style is changed in the future. But a separate CSS to override just some specific things is quite simple.
See the earlier chapter "The PDF Style Sheet".
This little initial CSS also reveals that the removal of figures around images is no longer required. On the contrary, these figures are now an important means to lay out the images as we need them. However, anchor tags inside the figures around the image do nothing: opening the image in the web browser by clicking it does not work. But I see this as a minor issue, since every document created will carry a QR code with the URL of the article for those who wish to use the web version of the article.
I was able to add a header line with the article's title and a page number at the top of the page. Over time the layout of the page might change to get perfect results, but for now good is good enough.
pdfworker.py
The following code shows just the essential parts of the new code. A lot more lines have been removed, e.g. the pandoc system call and the removal of figures from tables or from the article header.
from weasyprint import HTML
from weasyprint import CSS

[...]

csspath = Path(r"/home/frank/projects/idee/website/css/fspdf.css")
csspath.resolve()

html_doc = soup.prettify()

weasy_html = HTML(string=html_doc, base_url=str(workpath))
weasy_html.write_pdf(target=self.outpath,
                     stylesheets=[CSS(filename=str(csspath))]
                     )

[...]
WeasyPrint Bug?
I'm perfectly satisfied with the PDF generated by WeasyPrint, but only now, after generating quite a lot of PDF documents, I discovered that German special characters (ÄäÖöÜüß) in headlines lead to dysfunctional links in the table of contents.
The TOC used is not a real PDF TOC; it is the TOC generated for the HTML, and it should work in the PDF just as it does in the HTML.
The HTML is generated by Pandoc, and until now I did not meddle with the id and href names generated for internal navigation. Indeed I like it very much that Pandoc does not escape the German umlauts in those.
At some point in the near future I need to investigate this issue more closely. Does PDF allow full UTF-8 in references? Where is the bug in the WeasyPrint implementation? Would the correction be in the escaping of special characters or in the enablement of UTF-8?
And then, when the issue is solved, I'll have to trigger re-creation of the PDFs.
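One direction I might try, noted here only as a sketch and not as the actual fix: transliterate the umlauts in the Pandoc-generated ids and in the matching fragment hrefs before the HTML is handed to WeasyPrint, so the internal references contain only ASCII. The mapping and the function name are assumptions of mine.

from bs4 import BeautifulSoup

# naive transliteration of the characters that broke the TOC links;
# this mapping is an assumption for the sketch, not the final solution
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def ascii_safe_anchors(html_doc):
    """Rewrite ids and matching '#...' hrefs so they contain only ASCII."""
    soup = BeautifulSoup(html_doc, "html.parser")
    for tag in soup.find_all(id=True):
        tag["id"] = tag["id"].translate(UMLAUT_MAP)
    for tag in soup.find_all("a", href=True):
        if tag["href"].startswith("#"):
            tag["href"] = "#" + tag["href"][1:].translate(UMLAUT_MAP)
    return str(soup)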
Knowledge Resources about CSS Paged Media
Index Page Implementation
The subdomain idee.frank-siebert.de will serve two index pages, one in German and one in English.
- idee.html - main index page in German
- concept.html - English index page
There will not be many English articles, as far as I can foresee. That's the main reason not to give those articles their own subdomain. And most probably the English articles will not be translations of German articles.
I'm undecided on the question whether the search should be restricted to the language of the current site. For a start I'll not implement such a restriction.
Under these circumstances it seems to make no sense to have a language switch somewhere on the site, because then visitors would assume that they can switch the language of the current article, which will not be the case. And for sure I'll refrain from faking a multi-language page via google translate, just to be able to show a language switch button.
Index Page Content
The index page content for the respective language will be generated from the RSS file created for that language. The language specific portal header will be injected.
I based the index page generation on the archive generation. Hot needle implementation and a lot to refactor to get it nice, but it works.
This implementation allows defining a separate item count for the RSS feed and the index page.
The Index Builder
~/projects/idee/generator/idxbuilder.py
"""
Update the index pages of the webseite.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
All links provided relative to the /article/ folder
@author: Frank Siebert
"""
import datetime
from pubmetadata import PubMetaData
from gitmsgconstants import GitMsgConstants as gmc
from archive import Archive
# Number of items to included into the RSS feed
= 15
ITEM_COUNT
def by_pub_date(article_data):
"""
Return the publishing date as sort criteria.
Parameters
----------
e : Series
article_data.
Returns
-------
TYPE
Date as Str
"""
return article_data[PubMetaData.pubdate]
class IDXBuilder():
"""Manage all changees in the index page."""
def __init__(self):
"""
Initialize changelists.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
# information for German page changes on site "Idee".
self.de_list = []
# information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
# soup of currently processed Index html
for article_data in PubMetaData.instance._updates:
if article_data[PubMetaData.site] == "Idee" \
and article_data.name != "rechtliches":
self.de_list.append(article_data)
else:
if article_data.name != "legal":
self.en_list.append(article_data)
for article_data in PubMetaData.instance._deletions:
# TODO
pass
# Default sort is ascending, oldest posts first in list
self.de_list.sort(key=by_pub_date)
self.en_list.sort(key=by_pub_date)
def update(self):
"""
Iterate over changes and update respective index pages.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
for article_data in self.de_list:
= Archive._update(gmc.idee_index, article_data,
soup ="./article/")
article_loc= IDXBuilder._limit_entries(soup)
soup = soup.prettify()
html_doc
with open(gmc.idee_index, 'w') as index_file:
print(html_doc, file=index_file)
index_file.flush()
index_file.close()
for article_data in self.en_list:
= Archive._update(gmc.concept_index, article_data,
soup ="./article/")
article_loc= IDXBuilder._limit_entries(soup)
soup
= soup.prettify()
html_doc
with open(gmc.concept_index, 'w') as index_file:
print(html_doc, file=index_file)
index_file.flush()
index_file.close()
@staticmethod
def _limit_entries(soup):
= soup.find_all("article")
tags = 0
count for tag in tags:
if count > ITEM_COUNT:
tag.decompose()else:
+= 1
count return soup
Since the template for archive pages is used for the index page, it is necessary to remove the h1 tag with the text "Archive" after initial creation.
That's a one-time intervention, and I did not see any need to implement something to avoid it. As can easily be seen, the index pages are created with the Archive._update() function. Probably not implemented very elegantly, but effective reuse.
There is obviously room for improvement.
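For the record, the one-time intervention itself is tiny; a sketch with BeautifulSoup, where the file path is just an example of mine and not a constant from the generator.

from bs4 import BeautifulSoup

# one-time cleanup after the index page was created from the archive template
INDEX_FILE = "website/idee.html"  # example path, adjust as needed

with open(INDEX_FILE, "r") as index_file:
    soup = BeautifulSoup(index_file.read(), "html.parser")

# remove the h1 tag inherited from the archive template
h1_tag = soup.find("h1")
if h1_tag:
    h1_tag.decompose()

with open(INDEX_FILE, "w") as index_file:
    print(soup.prettify(), file=index_file)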
Own Magic Words
I introduced my own so-called magic words to control the production of the PDF or the display of the CC license information.
- __NOPDF__ prevents the PDF creation and the placement of the PDF Icon.
- __NOLIC__ prevents the placement of the License Icons
The rationale is quite simple. If I post just a simple video, audio or reading recommendation, it does not make any sense to place license information for a non-existing own intellectual work.
Indeed it only raises the risk that consumers misunderstand the license information as being applicable to the recommended content.
The magic words are ignored in the MediaWiki and processed by Pandoc into content placed in < p > tags. The plainworker.py queries their existence and changes the output accordingly.
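How such a query can look is sketched below. This is not the plainworker.py code itself, only a minimal illustration of mine, assuming the magic words end up as the sole text of their own < p > tags after the Pandoc run.

from bs4 import BeautifulSoup

MAGIC_WORDS = ("__NOPDF__", "__NOLIC__")


def read_magic_words(soup):
    """Collect magic words from <p> tags and remove those tags from the soup."""
    found = set()
    for tag in soup.find_all("p"):
        text = tag.get_text(strip=True)
        if text in MAGIC_WORDS:
            found.add(text)
            tag.decompose()
    return found


# usage sketch
soup = BeautifulSoup("<p>__NOPDF__</p><p>Some article text.</p>", "html.parser")
flags = read_magic_words(soup)
create_pdf = "__NOPDF__" not in flags
show_license = "__NOLIC__" not in flags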
German Quotation Marks
As I found out that I cannot use "normal" quotation marks in titles, I learned today how to enter the German quotation marks via the German keyboard in front of me.
Which year is it? 2022. When did I start working in the IT business? I think it was called EDV in Germany in those times, „elektronische Datenverarbeitung“ (electronic data processing). It was in December 1988.
It took a bit more than 33 years to learn how to enter the German quotation marks. Time to note it down, or I'll probably forget it again.
- „ [AltGr]+[Fn]+v
- “ [AltGr]+[Fn]+b
Final Recapitulation
The documentation shown follows the implementation sequence, while avoiding showing the code evolution in detail. This is probably not the best possible sequence for a documentation, but I tried to combine it with the implementation story.
The code itself contains opportunities for improvement. I would not consider the code shown here to be best practice for any purpose.
However, the code is stable enough to go live with the solution on my own site, and I already did. This is the first article, apart from the legal page and the page about the PDF logo, which gets published natively on this site.
It is a very long article, and I hope the formatting of the new article HTML keeps it readable in spite of its length.
I learned a lot from this project, and I hope the description is helpful for someone.
Footnotes
- Gitblog - the software that powers my blog , 2020-05-07 ↑
- GitLab Flavored Markdown ↑
- sitemaps.org ; www.sitemaps.org ↑
- Parsing a Wikipedia page's content with python ↑
- Building a full-text search engine in 150 lines of Python code ; Bart de Goede; bart.degoe.de; 2021-03-24 ↑
- Gensim ; WikiPedia ↑
- Whoosh - How to search ; https://whoosh.readthedocs.io/en/latest/searching.html ↑
- rank-bm25 0.2.1 ; pypi.org; 2020-06-04 ↑
- Improvements to BM25 and Language Models Examined ; Andrew Trotman, Antti Puurula, Blake Burgess; Association for Computing Machinery; DOI: https://doi.org/10.1145/2682862.2682863 ; PDF ; 2014-11-26 ↑
- What is the difference between Okapi bm25 and NMSLIB? ; Data Science Stack Exchange; 2021-03-01 ↑
- expandtemplates should use "post" instead of "get" · Issue #272 · mwclient/mwclient ; github.com ↑
- Somebody elses problem - Wikipedia ; en.wikipedia.org ↑
- Configuring MariaDB for Remote Client Access ; mariadb.com ↑
- agate 1.6.3 ; agate.readthedocs.io ↑
- pandas documentation ; pandas.pydata.org ↑
- Add new rows and columns to Pandas dataframe ; kanoki; 2019-08-03 ↑
- Pandas Tutorial ; www.w3schools.com ↑
- Getting Started with Bioconductor 3.7 ; bioconductor.org ↑
- Git Hook Pull After Push - remote: fatal: Not a git repository: '.' · Joe Januszkiewicz ; Joe Januszkiewicz; 2014-04-03 ↑
- sitemaps.org ; www.sitemaps.org ↑
- Feed Validation Service ; validator.w3.org ↑
- RSS 2.0 Specification ; www.rssboard.org ↑
- RDF Site Summary 1.0 Modules: Content ; web.resource.org ↑
- The Atom Syndication Format ; M. Nottingham, R. Sayre; www.rfc-editor.org; DOI: https://doi.org/10.17487/RFC4287 ; December 2005 ↑
- Multiple channels in a single RSS xml - is it ever appropriate? ; aoeu; Stack Overflow; 2010-10-18 ↑
- RSS update single item ; lou; Stack Overflow; 2013-03-18 ↑
- RSS Advisory Board - Relative links ; www.rssboard.org ↑
- Module ngx_http_addition_module ; nginx.org ↑
- nginx: Mitigating the BREACH Vulnerability with Perl and SSI or Addition or Substitution Modules — Wild Wild Wolf ; wwa; Wild Wild Wolf; 2018-09-04 ↑
- Module ngx_http_ssi_module ; nginx.org ↑
- Pandoc and foreign characters ; Mike Thomsen; Stack Overflow; 2013-09-05 ↑
- Pandoc User’s Guide ; pandoc.org ↑
- Customizing pandoc to generate beautiful pdf and epub from markdown ; learnbyexample.github.io ↑
- How to convert HTML to PDF using pandoc? ; Chris Stryczynski; Stack Overflow; 2017-06-08 ↑
- WeasyPrint ; doc.courtbouillon.org ↑
- Going Further ; doc.courtbouillon.org ↑
- Revisiting HTML To PDF Conversion with CSS Paged Media ; carlos; The Publishing Project; 2021-11-15 ↑
- CSS Paged Media Module Level 3 ; www.w3.org; 2018-10-18 ↑