Generating basic content statistics for a static site generator

Thursday, March 7, 2024

Categories: blog -- Tags: #hugo #statistics #dev

Nota: This post is tagged as a long post, meaning it may be better to prepare yourself a coffee or a drink of your choice before starting reading this page :).

Nota: « Content statistics » refers to statistics about the content available on this website (how many blog posts per year, …). This has nothing to do with analytics / user statistics or any kind of tracking!

Introduction

2 or 3 days ago, I had a crazy idea: « What about adding content statistics on this small website? ». And a few days and many lines of bash / python scripts later, here we are…

For more context, this site is generated with hugo, a static site generator. This means all my content, whether blog posts, gemlog entries or bookmarks, lives in markdown files. If you want an example, you can see all the blog posts in the git repository.

Keeping the content in markdown files means:

the SSG/CMS (hugo) does not calculate stats from them
I cannot simply query a database to retrieve the info I want
Stats need to be generated during the build phase, before each deployment, so that they are always up to date

Bash to save the day

I tried searching for software that could extract data from markdown / frontmatter, but couldn’t find anything usable. All I found were libraries to work with markdown and / or frontmatter content.

I was almost resigned to writing something from scratch using those libs, and I nearly gave up and added this idea to the “longer term todolist” (aka probably never).

But Alex saved the day with a brilliant shell command that removed the need to write complex code to analyse the frontmatter in files. He shared the following command with me:

for file in *.md; head -n 5 $file | grep 'date:' | sed 's/date.*\([[:digit:]]\{4\}\).*/\1/' >> count; end ; cat count | sort | uniq -c

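It looks ugly and complicated, but it simply goes through all the *.md files in a directory, reads the first 5 lines of each (the frontmatter header), greps the line containing date: and keeps only the 4 digits of the year. Each year is appended to a temporary file, which is then sorted and collapsed by uniq, whose -c option prefixes each distinct year with its number of occurrences. (As shared, the loop uses the fish shell for … end form; in bash the same loop is written with do … done.)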
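For readers more at ease with python than with sed, the same counting idea can be sketched as below. This is only an illustration, not the command used on this site: count_per_year and YEAR_RE are hypothetical names, the input is assumed to be one frontmatter snippet per file, and the regex keeps the first 4-digit run after date, which is an assumption about the date format.

```python
import re
from collections import Counter

# Hypothetical pattern: the first run of 4 digits that follows "date" on a line.
YEAR_RE = re.compile(r"date.*?(\d{4})")


def count_per_year(headers):
    """Count entries per year.

    headers: iterable of frontmatter snippets, one string per file
    (an assumed input shape, standing in for `head -n 5 file`).
    """
    years = Counter()
    for header in headers:
        for line in header.splitlines():
            match = YEAR_RE.search(line)
            if match:
                years[match.group(1)] += 1
                break  # assumption: only the first date line of a file counts
    # Sorted by year, like the output of `sort | uniq -c`.
    return dict(sorted(years.items()))
```

Counter plays the role of sort | uniq -c here: it groups identical years and keeps the number of occurrences of each.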

Let’s say that from there I simply followed the deep rabbit hole… Read the next chapter for the links to these stats, and the following one to see the unnecessarily complicated process that builds those simple pages with bash and python (for generating graph images) :).

What stats are you talking about

Before jumping into the “how”, let’s talk about the “what”. The new pages showing the different types of stats are:

Stats Overview page: A first intro to the stats, showing some global numbers and graphs, as well as showing additional stats for the current year
Stats per year: See the number of articles (all types) per month for each year, as well as the content type split for all articles of that year
Stats per type (overview): Show some global stats per type: number of articles per month over the years, for each content type
Stats about Blog Posts: Show, per year, the number of blog posts per month
Stats about Gemini entries: Show, per year, the number of gemlog entries per month
Stats about bookmarks: Show, per year, the number of bookmarks per month

If you looked at these pages, you saw they are not the best and lots of info could be added, but I feel it is a very good start for 2 evenings of work. I can always add more with time. For example, I’m planning to add some stats about word counts per article type. I also plan on keeping the css and the styling in general very light, so I may sometimes present things in a less than ideal way to avoid adding unnecessary css.

How does it work

Now let’s go into the ugly details. And let me start by saying: they are indeed ugly! The python script is just a mess, written to be finished quickly, absolutely not with optimization or even “good common sense” in mind! I usually don’t care and don’t say too much about the code quality of the scripts I’m sharing, but I feel like I really MUST warn everyone before they open it :).

The big steps of the process are:

Get data from frontmatter (via bash) for the different content types
Create a json file that will be used both by hugo to access the stats numbers, and by a python script that will generate graphs (in png) based on it
Create all different graphs required for the different pages via the python script

All of that is then incorporated into my CI/CD process based on sourcehut CI. I have an (already very) long blog post in draft detailing this CI process, so here I’m only focusing on the part that generates the stats pages, not the CI/CD related stuff. It means that if you look at the scripts on sourcehut, there might be things in there that are not explained in this post.

Retrieve the complete bash script or python code on sourcehut.

Quick word on content hierarchy

It may be important to explain here how my content is organized. In a nutshell:

<hugoRoot>/content
├── bookmarks
├── gemlog
├── pages
├── posts
└── tags

Each directory holds all its *.md files directly; I don’t have subdirectories. If you do, you will need to adapt everything below.

Retrieve data from frontmatter with bash

The magic part here is to focus only on the date parameter within the frontmatter area. I slightly modified the command shared above:

for file in *.md; do head -n 10 "$file" | grep 'date =' | sed 's/date.*\([[:digit:]]\{4\}-[[:digit:]]\{2\}\).*/\1/' >> "${temp}/_tmpCount" ; done

I had to use -n 10 with head to read more lines, as some of my frontmatter is long, and I changed the sed expression to keep not only the year but the YYYY-MM pair. The result is then appended to a temporary file.
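As a point of comparison, here is a minimal python sketch of the same extraction step. It is not the script used on this site: count_per_month, MONTH_RE and the max_lines parameter are hypothetical names, and it assumes a date = … line sits within the first 10 lines of each file, as in the bash command.

```python
import re
from collections import Counter
from itertools import islice
from pathlib import Path

# Hypothetical pattern for a `date = ...` frontmatter line, keeping only YYYY-MM.
MONTH_RE = re.compile(r"date\s*=.*?(\d{4}-\d{2})")


def count_per_month(directory, max_lines=10):
    """Count *.md files per YYYY-MM, reading only the first max_lines of each."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.md")):
        with path.open(encoding="utf-8") as handle:
            # islice stops after max_lines, like `head -n 10`.
            for line in islice(handle, max_lines):
                match = MONTH_RE.search(line)
                if match:
                    counts[match.group(1)] += 1
                    break  # assumption: one date per file
    # List of (month, count) pairs sorted by month.
    return sorted(counts.items())
```

Unlike the bash version, nothing is written to a temporary file: the counts stay in memory and come back already grouped by month.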

Then the tricky part was writing a loop that reads the data line by line while keeping its variables usable outside of the while loop¹:

  total=0
  while read -r line
  do
    nb=$(echo "${line}" | awk -F " " '{print $1}')
    […]
    total=$((total + nb))
  done < <(cat "${temp}/_tmpCount" | sort | uniq -c)

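Notice the part after done? This is process substitution. Piping the output of sort | uniq -c into the while loop would make bash run the loop in a subshell, and the value of total would be lost when the loop ends; feeding the loop through < <(…) keeps it in the current shell.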

Let’s talk about generating the json file now…

Create a json file… in bash

I don’t know how to explain better than sharing almost the entire code of the bash script…

[…]
# Init variable:
global_total=0
res_json="{\"articles\": {"

for type in "${stats_content_types[@]:?}"
do
  […]
  res_json="${res_json}\"$type\": {\"entries_per_month\": ["

  […]
  for file in *.md; do head -n 10 "$file" | grep 'date =' | sed 's/date.*\([[:digit:]]\{4\}-[[:digit:]]\{2\}\).*/\1/' >> "${temp}/_tmpCount" ; done

  total=0
  while read -r line
  do
    nb=$(echo "${line}" | awk -F " " '{print $1}')
    date=$(echo "${line}" | awk -F " " '{print $2}')
    res_json="${res_json}{\"date\": \"${date}\", \"count\": \"${nb}\"},"

    total=$((total + nb))
  done < <(cat "${temp}/_tmpCount" | sort | uniq -c)
  res_json="${res_json::-1}], \"total\": ${total}},"

  global_total=$((global_total + total))
done

res_json="${res_json::-1}}, \"total_articles\": ${global_total}}"
[…]

I tried to reduce the noise to a minimum. As you can see, it is just an ugly way to create the full json string in loops… But hey, it works!

The json generated looks like this:

{
  "articles": {
    "posts": {
      "entries_per_month": [
        {
          "date": "2013-01",
          "count": "1"
        },
        {
          "date": "2013-02",
          "count": "2"
        },
        […]
      ],
      "total": 127
    },
    "gemlog": {
      "entries_per_month": [
        {
          "date": "2021-02",
          "count": "5"
        },
        […]
      ],
      "total": 42
    },
    "bookmarks": {
      "entries_per_month": [
        {
          "date": "2023-02",
          "count": "18"
        },
        […]
      ],
      "total": 68
    }
  },
  "total_articles": 237
}

count contains the number of entries of that type for the month given in date (format YYYY-MM). Note that count is stored as a string, whereas total and total_articles are numbers.

The generated json file will be used by hugo directly to display some stats (eg, on the stats overview summary) and by the python script to generate images. So it is the “source of truth” once created.
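For comparison, the same structure could be built in python with the json module, which takes care of quoting and trailing commas. This is a sketch under assumptions, not the script used here: build_stats and dump_stats are hypothetical names, and the input shape (one dict of YYYY-MM counts per content type) is made up for the example.

```python
import json


def build_stats(counts_by_type):
    """Build the stats structure from per-type monthly counts.

    counts_by_type: e.g. {"posts": {"2013-01": 1, "2013-02": 2}}
    (a hypothetical input shape, not something the bash script produces).
    """
    articles = {}
    for content_type, per_month in counts_by_type.items():
        entries = [
            # count is kept as a string to match the json file shown above.
            {"date": month, "count": str(count)}
            for month, count in sorted(per_month.items())
        ]
        articles[content_type] = {
            "entries_per_month": entries,
            "total": sum(per_month.values()),
        }
    total_articles = sum(item["total"] for item in articles.values())
    return {"articles": articles, "total_articles": total_articles}


def dump_stats(counts_by_type):
    """Serialize the structure as an indented json string."""
    return json.dumps(build_stats(counts_by_type), indent=2)
```

Building a plain dict first and serializing it once at the end avoids the manual removal of the last comma that the bash version does with ${res_json::-1}.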

Let’s look first at the graph and images generation, and then at the hugo setup.

Generate graphs in python with Pygal

I almost went with Matplotlib as the library of choice for building the graphs, but then found Pygal, which seemed easier to start with and more than enough for anything I wanted to do in the stats area of this site.

I’m putting the warning again, but the python script is as ugly as it can be! It needs a lot of love, but for now, everything was made with speed of delivery in mind, not the love of well thought out work :D.

You can find the script on sourcehut.

I’m not going to display and explain the script here. It is ugly, but in the end it uses Pygal to generate different pie and bar charts:

stats-articles_types_per_year-<YEAR>-pie.png: One half pie chart per year displaying the articles split per type
stats-global_articles_in_<YEAR>-bar.png: One bar chart per year displaying the number of articles (all types) per month of the given year
stats-global_articles_per_year-bar.png: One bar chart displaying the number of articles (all types) created per year
stats-monthly_<TYPE>_in_<YEAR>-bar.png: One bar chart per year and per content type, displaying the number of articles of the given content type per month for the given year
stats-monthly_<TYPE>_per_year-bar.png: One bar chart per content type, displaying the number of articles of the given content type per month over all the years

Replace <YEAR> with each year since the creation of this website (ignoring empty years) and <TYPE> with one of the existing content types (posts, gemlog and bookmarks; pages are ignored).

Right now, it creates 38 files in total… Once generated, they are moved to the right place within the hugo structure (in my case, <hugoRoot>/static/images/pages/stats/).
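To give an idea of what one of these charts involves, here is a minimal sketch and not the script from sourcehut. monthly_series and render_monthly_bar are hypothetical names; the input is assumed to be the json structure described earlier, loaded as a python dict; and the rendering part assumes the third-party pygal package is installed, with cairosvg available for png output.

```python
def monthly_series(stats, content_type, year):
    """Return 12 counts (January to December), zero-filled, for one type and year."""
    series = [0] * 12
    for entry in stats["articles"][content_type]["entries_per_month"]:
        entry_year, entry_month = entry["date"].split("-")
        if entry_year == str(year):
            # count is stored as a string in the json file, hence int().
            series[int(entry_month) - 1] = int(entry["count"])
    return series


def render_monthly_bar(stats, content_type, year, out_dir="."):
    """Render one bar chart per (type, year); out_dir is a hypothetical parameter."""
    # Imported here so that monthly_series stays usable without pygal installed.
    import pygal

    chart = pygal.Bar()
    chart.title = f"{content_type} in {year}"
    chart.x_labels = [f"{month:02d}" for month in range(1, 13)]
    chart.add(content_type, monthly_series(stats, content_type, year))
    chart.render_to_png(f"{out_dir}/stats-monthly_{content_type}_in_{year}-bar.png")
```

Filling the empty months with zeros keeps every yearly chart on the same 12-column axis, even for years with only a few entries.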

Generating Hugo Pages

This may seem complicated and / or messy because there are many markdown and html files… But it is simple in reality: I just split the markdown into multiple files to have lighter pages, and used small shortcode templates to be able to reuse them easily.

Content pages

Let’s start with the easiest part. I created 6 markdown files in <HugoRoot>/content/pages/. I’m not going to copy their content here, I’m linking them to their sourcehut page if you want to see them entirely:

stats.md for the overview page; it reads the json file to display some data (see below for details) and loads some shortcodes (see below)
stats-per-year.md for the stats per year page
stats-per-type.md for the stats per type overview page
stats-posts.md for the blog posts stats page
stats-gemlog.md for the gemlog stats page
stats-bookmarks.md for the bookmarks stats page

Custom shortcodes

In the previous pages, I call some custom hugo shortcodes to avoid repeating myself (and be able to create custom html called from the markdown files).

You can find all of these shortcodes on sourcehut, so I’m not going to go into all the details. But to give an idea, here is an example.

This is the code for the contentstats-articles-per-type shortcode (in <hugoRoot>/layouts/shortcodes/):

{{ $articleType := $.Get 0 }}
{{ $currentYear := $.Get 1 }}

<div class="stats-item">
    <p>All {{ strings.FirstUpper $articleType }} per month in {{ $currentYear }}</p>
    <figure class="statsimg">
        <img src="{{ print "/images/pages/stats/stats-monthly_" $articleType "_in_" $currentYear "-bar.png" }}" alt="{{ $articleType }} in {{ $currentYear }}" />
        <figcaption>{{ $articleType }} in {{ $currentYear }}</figcaption>
    </figure>
</div>

It receives 2 arguments: the type of articles (posts, gemlog or bookmarks) and a given year. Then based on these 2 arguments, it displays the right info. In this case, just one image found based on the 2 given arguments. Other shortcodes may do a lot more.

It allows me to call it from different markdown files multiple times with different arguments depending on the context. For example, I’m calling this one from the stats-posts.md file:

## 2024
{{< contentstats-articles-per-type "posts" "2024" >}}

## 2023
{{< contentstats-articles-per-type "posts" "2023" >}}

[…]

So for each year, I can just call it with a different year argument. Then, I can do the same within stats-gemlog.md to display the gemlog graphs:

## 2024
{{< contentstats-articles-per-type "gemlog" "2024" >}}

## 2023
{{< contentstats-articles-per-type "gemlog" "2023" >}}

[…]

Look into the files themselves to understand more :).

Reading json file

One last thing about hugo: it can read json (and yaml, toml or csv) files directly and display data from them. For example, the shortcode contentstats-summary-graph doesn’t load images but reads the json file directly to display the number of articles per type:

{{ $data := index .Site.Data "content_stats" }}
<div class="stats-summary">
    <strong>{{ $data.total_articles }}</strong> pieces of content have been published in total on this website:
    <ul>
    {{ range $k, $v := $data.articles }}
        <li> {{ $v.total }} {{ $k }}</li>
    {{ end }}
    </ul>
</div>

The index .Site.Data "content_stats" is the part loading the file (called content_stats.json) and then I can loop over its content using the range function.
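To sanity-check the json file outside of hugo, the same summary can be reproduced in a few lines of python. This is only a sketch: summary_lines is a hypothetical name, and the stats argument is assumed to be the json content already loaded as a dict. The keys are sorted because, as far as I know, Go templates visit map keys in sorted order when using range.

```python
import json


def summary_lines(stats):
    """Mirror the shortcode output: overall total, then one line per content type."""
    lines = [
        f"{stats['total_articles']} pieces of content have been published "
        "in total on this website:"
    ]
    # sorted() mimics the key order assumed for the template's range loop.
    for content_type in sorted(stats["articles"]):
        total = stats["articles"][content_type]["total"]
        lines.append(f"- {total} {content_type}")
    return lines
```

With the file from this post, it could be called as summary_lines(json.load(open(path))), where path points to wherever content_stats.json is stored.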

Conclusion

Well, this has been a fun couple of evenings, working on the ugly bash and python scripts, toying with the generated pages and images and writing this article! Now I have content stats generated during the CI process so I know they will always be up to date :].

Let me know if you think of more interesting information to display or want to discuss how to apply this to your own website.

¹ See https://stackoverflow.com/questions/4667509/shell-variables-set-inside-while-loop-not-visible-outside-of-it

Contact

If you find any issue or have any question about this article, feel free to reach out to me via webmentions, email, mastodon, matrix or even IRC, see the About page for details.
