In this article, I share a Python script developed as a proof of concept (POC) for automating the translation of my blog posts with OpenAI’s GPT-4 language model. The script is designed to process the Markdown files of my Hugo blog, making it easier to manage multilingual versions of my articles, which are now available in English, Spanish, and Chinese.

Introduction to the Project: Merging AI and Automation for My Blog

This project to automate the translation of my blog articles grew out of my increasing fascination with artificial intelligence. Inspired by my earlier experiments with the OpenAI GPT-4 and Mistral AI APIs, I was drawn to the idea of turning these technologies into a practical project that would bring tangible value to my blog. It was not just a quest to master AI tools, but also a desire to combine automation and innovation to enrich my digital space.

This project turned into an adventure where AI was not just a subject of writing but an active partner in development. The idea of translating my articles in a simple and efficient manner with AI, while exploring its automation capabilities, opened up exciting prospects. It was an opportunity to transcend linguistic barriers, making my content accessible to a wider audience while navigating the ever-evolving realm of artificial intelligence.

The Challenge

The main challenge was to create a script capable of translating accurately while preserving the original formatting of the articles, including code blocks, links, and images. Another challenge was ensuring that the script could be easily adapted to support different languages. It also had to handle the directory structure of my Hugo content:

├── content
│   ├── about
│   │   └── a-propos-du-blog-jls42.md
│   ├── mentions
│   │   └── mentions-legales.md
│   ├── posts
│   │   ├── blog
│   │   │   └── nouveau-theme-logo.md
│   │   ├── ia
│   │   │   ├── poc-mistral-ai-mixtral.md
│   │   │   ├── poc-openai-api-gpt4.md
│   │   │   └── stable-difusion-aws-ec2.md
│   │   ├── infrastructure
│   │   │   └── infrastruture-as-code-serverless-ha-jls42-org.md
│   │   └── raspberry-pi
│   │       ├── glusterfs_distribue_replique_sur_raspberry_pi_via_ansible.md
│   │       ├── initialisation-auto-de-raspbian-sur-raspberry-pi.md
│   │       ├── installation-de-docker-sur-raspberry-pi-via-ansible.md
│   │       └── installation-de-kubernetes-sur-raspberry-pi-via-ansible.md

The Solution: An Innovative Script

I designed a Python script that relies on the OpenAI GPT-4 API to translate text while preserving non-textual elements. Thanks to a series of processing rules and the use of placeholders, the script was able to identify and exclude code blocks and other non-translatable elements, thus ensuring that the translated content remained true to the original.

Key Features

  1. Accurate Translation with GPT-4: The script uses OpenAI’s GPT-4 model to translate text from French to English, ensuring that the quality and nuance of the original content are maintained.
  2. Preservation of Formatting: Code blocks, URLs, and image paths are identified and left intact during the translation, ensuring that the original formatting is preserved.
  3. Multilingual Flexibility: The script is designed to be easily adaptable to different source and target languages, allowing for a wide range of multilingual applications.
  4. Support for Markdown Files: Ability to translate documents written in Markdown, maintaining their specific structure and formatting.
  5. Automated Translation of Directories: Automates the translation of Markdown files found in a given directory and its subdirectories, facilitating the management of large volumes of content.
  6. Translation Note Integration: Automatically adds a translation note at the end of translated documents, indicating the GPT model used for translation.
  7. Easy Configuration and Customization: Customizable default settings for the API key, GPT model, source and target languages, and file directories, offering great flexibility of use.
  8. Performance Report: The script provides feedback on the time taken to translate each file, allowing for performance monitoring.

Script Code

The code is also available here: AI-Powered Markdown Translator

#!/usr/bin/env python3

import os
import argparse
import time
from openai import OpenAI
import re

# Initialize the configuration with default values
DEFAULT_API_KEY = 'votre-clé-api-par-défaut'
DEFAULT_MODEL = "gpt-4-1106-preview"
DEFAULT_SOURCE_LANG = 'fr'
DEFAULT_TARGET_LANG = 'en'
DEFAULT_SOURCE_DIR = 'content/posts'
DEFAULT_TARGET_DIR = 'traductions_en'

MODEL_TOKEN_LIMITS = {
    "gpt-4-1106-preview": 4096,
    "gpt-4-vision-preview": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-4-0613": 8192,
    "gpt-4-32k-0613": 32768
}

# Translation function
def translate_with_openai(text, client, args):
    """
    Translate the given text from the source language to the target language using the OpenAI API.

    Args:
        text (str): The text to translate.
        client: The OpenAI client object.
        args: The arguments holding the source language, target language, and model.

    Returns:
        str: The translated text.
    """
    # Detect and store code blocks
    code_blocks = re.findall(r'(^```[a-zA-Z]*\n.*?\n^```)', text, flags=re.MULTILINE | re.DOTALL)
    placeholders = [f"#CODEBLOCK{index}#" for index, _ in enumerate(code_blocks)]
    
    # Replace code blocks with placeholders
    for placeholder, code_block in zip(placeholders, code_blocks):
        text = text.replace(code_block, placeholder)
    
    # Build the messages for the API
    messages = [
        {"role": "system", "content": f"Translate the following text from {args.source_lang} to {args.target_lang}, ensuring that elements such as URLs, image paths, and code blocks (delimited by ```) are not translated. Leave these elements unchanged."},
        {"role": "user", "content": text}
    ]
    
    # Send the translation request
    response = client.chat.completions.create(
        model=args.model,
        messages=messages
    )
    
    # Retrieve the translated text and restore the original code blocks
    translated_text = response.choices[0].message.content.strip()
    for placeholder, code_block in zip(placeholders, code_blocks):
        translated_text = translated_text.replace(placeholder, code_block)

    return translated_text

def add_translation_note(client, args):
    """
    Add a translation note to a document.

    Args:
        client: The translation client.
        args: Additional arguments.

    Returns:
        The formatted translation note.
    """
    # Translation note written in French
    translation_note_fr = "Ce document a été traduit de la version française du blog par le modèle "
    # Translate the note into the target language
    translated_note = translate_with_openai(translation_note_fr + args.model, client, args)
    # Format the translation note
    return f"\n\n**{translated_note}**\n\n"

# Markdown file processing
def translate_markdown_file(file_path, output_path, client, args):
    """
    Translate the content of a markdown file using the OpenAI translation API and write the translated content to a new file.

    Args:
        file_path (str): Path to the input markdown file.
        output_path (str): Path to the output file where the translated content will be written.
        client: OpenAI translation client.
        args: Additional arguments for the translation process.

    Returns:
        None
    """
    print(f"Traitement du fichier : {file_path}")
    start_time = time.time()

    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    translated_content = translate_with_openai(content, client, args)
    
    # Append the translation note to the end of the translated content
    translation_note = add_translation_note(client, args)
    translated_content_with_note = translated_content + translation_note

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(translated_content_with_note)

    end_time = time.time()
    print(f"Traduction terminée en {end_time - start_time:.2f} secondes.")

def translate_directory(input_dir, output_dir, client, args):
    """
    Translate all markdown files in the input directory and its subdirectories.

    Args:
        input_dir (str): Path to the input directory.
        output_dir (str): Path to the output directory.
        client: Translation client object.
        args: Additional arguments for the translation.

    Returns:
        None
    """
    for root, dirs, files in os.walk(input_dir, topdown=True):
        # Skip directories whose names start with "traductions_"
        dirs[:] = [d for d in dirs if not d.startswith("traductions_")]

        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                base, _ = os.path.splitext(file)
                # Include the model name in the output file name
                output_file = f"{base}-{args.model}-{args.target_lang}.md"
                relative_path = os.path.relpath(root, input_dir)
                output_path = os.path.join(output_dir, relative_path, output_file)

                os.makedirs(os.path.dirname(output_path), exist_ok=True)

                if not os.path.exists(output_path):
                    translate_markdown_file(file_path, output_path, client, args)
                    print(f"Fichier '{file}' traité.")


def main():
    """
    Main function for translating Markdown files.

    Args:
        --source_dir (str): Source directory containing the Markdown files.
        --target_dir (str): Target directory where the translations are saved.
        --model (str): GPT model to use.
        --target_lang (str): Target language for the translation.
        --source_lang (str): Source language for the translation.
    """
    parser = argparse.ArgumentParser(description="Traduit les fichiers Markdown.")
    parser.add_argument('--source_dir', type=str, default=DEFAULT_SOURCE_DIR, help='Répertoire source contenant les fichiers Markdown')
    parser.add_argument('--target_dir', type=str, default=DEFAULT_TARGET_DIR, help='Répertoire cible pour sauvegarder les traductions')
    parser.add_argument('--model', type=str, default=DEFAULT_MODEL, help='Modèle GPT à utiliser')
    parser.add_argument('--target_lang', type=str, default=DEFAULT_TARGET_LANG, help='Langue cible pour la traduction')
    parser.add_argument('--source_lang', type=str, default=DEFAULT_SOURCE_LANG, help='Langue source pour la traduction')

    args = parser.parse_args()

    openai_api_key = os.getenv('OPENAI_API_KEY', DEFAULT_API_KEY)
    with OpenAI(api_key=openai_api_key) as client:
        translate_directory(args.source_dir, args.target_dir, client, args)

if __name__ == "__main__":
    main()

Closer Look at the Script

Module Imports

First, we have the necessary module imports: os, argparse, time, re, and the OpenAI client from the openai package. These modules are used for filesystem operations, parsing command-line arguments, measuring execution time, performing regular-expression search-and-replace, and calling the OpenAI API.

Constants

Next, the constants DEFAULT_API_KEY, DEFAULT_MODEL, DEFAULT_SOURCE_LANG, DEFAULT_TARGET_LANG, DEFAULT_SOURCE_DIR, and DEFAULT_TARGET_DIR are defined. They hold the script’s default values; the directories, model, and languages can be overridden with command-line arguments, while the API key is read from the OPENAI_API_KEY environment variable (falling back to DEFAULT_API_KEY).

Function translate_with_openai

Next, we have the translate_with_openai function. This function takes text, an OpenAI client object, and arguments as parameters. It uses the OpenAI API to translate the text from the source language to the target language. Here’s how it works:

  1. The function uses a regular expression to detect and store code blocks in the text. These code blocks are delimited by triple backticks (```). The code blocks are stored in a list called code_blocks.
  2. The function then replaces the code blocks with placeholders in the text. The placeholders are strings of the form #CODEBLOCK{index}#, where index is the index of the corresponding code block in the code_blocks list.
  3. The function creates a message for the OpenAI API. This message contains two parts: a system message that instructs the API to translate the text from the source to target language while leaving elements such as URLs, image paths, and code blocks unchanged, and a user message containing the text to be translated.
  4. The function sends the translation request to the API using the client.chat.completions.create() method, specifying the model to use and the messages built in the previous step.
  5. The API’s response contains the translated text. The function retrieves the translated text and replaces the placeholders with the original code blocks.
  6. Finally, the function returns the translated text.
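
To make this concrete, here is a minimal, self-contained sketch of steps 1, 2, and 5 (detection, masking, and restoration). The sample text and the fake_translate function are stand-ins for a real article and for the actual API call:

import re

# Sample Markdown containing a code block that must not be translated
sample = (
    "Voici un exemple.\n"
    "\n"
    "```bash\n"
    'echo "ne pas traduire"\n'
    "```\n"
    "\n"
    "Fin de l'exemple.\n"
)

# Step 1: detect and store the code blocks (same regex as the script)
code_blocks = re.findall(r'(^```[a-zA-Z]*\n.*?\n^```)', sample, flags=re.MULTILINE | re.DOTALL)
placeholders = [f"#CODEBLOCK{index}#" for index, _ in enumerate(code_blocks)]

# Step 2: replace each code block with its placeholder
masked = sample
for placeholder, code_block in zip(placeholders, code_blocks):
    masked = masked.replace(code_block, placeholder)

# Stand-in for the API round trip of steps 3 and 4
def fake_translate(text):
    return (text.replace("Voici un exemple.", "Here is an example.")
                .replace("Fin de l'exemple.", "End of the example."))

translated = fake_translate(masked)

# Step 5: restore the original code blocks
for placeholder, code_block in zip(placeholders, code_blocks):
    translated = translated.replace(placeholder, code_block)

print(translated)  # the bash block comes back untouched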

Function add_translation_note

We also have the add_translation_note function. This function adds a translation note to a document. It takes an OpenAI client object and arguments as parameters. Here’s how it functions:

  1. The function creates a translation note in French using the translation_note_fr variable.
  2. The function then translates this note with translate_with_openai, passing it the French note concatenated with the model name, along with the OpenAI client and the other arguments.
  3. The function formats the translated note by wrapping it in bold Markdown markers surrounded by blank lines.
  4. Finally, the function returns the formatted translation note.
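
As an illustration, assuming a target language of English and the gpt-4-1106-preview model, the value returned by the function would look roughly like the snippet below (the exact wording depends on how the model translates the French note):

# Hypothetical illustration of the returned note for target_lang='en'
# and model='gpt-4-1106-preview'; the exact wording is produced by the model.
translated_note = (
    "This document has been translated from the French version of the blog "
    "by the gpt-4-1106-preview model"
)
note = f"\n\n**{translated_note}**\n\n"
# The note is then appended as a bold Markdown paragraph at the end of the article.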

Function translate_markdown_file

We have the translate_markdown_file function. This function takes the path of an input Markdown file, the path of an output file, an OpenAI client object, and arguments as parameters. It reads the Markdown file, translates its content with the OpenAI API, appends the translation note, writes the result to the output file, and prints the time the translation took.
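
For a one-off translation of a single file, bypassing main(), a minimal sketch could look like the following (the paths are hypothetical examples, and the functions above are assumed to be in scope):

import os
from argparse import Namespace
from openai import OpenAI

# Hypothetical single-file invocation; adjust paths and languages as needed
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
args = Namespace(source_lang='fr', target_lang='en', model='gpt-4-1106-preview')

output_dir = 'content/traductions_en/posts/blog'
os.makedirs(output_dir, exist_ok=True)

translate_markdown_file(
    'content/posts/blog/nouveau-theme-logo.md',
    os.path.join(output_dir, 'nouveau-theme-logo-gpt-4-1106-preview-en.md'),
    client,
    args,
)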

This script has not only improved the accessibility of my blog articles but has also paved the way for new automation possibilities in the field of multilingual content creation. It’s a step forward toward broader and more inclusive knowledge-sharing and content accessibility.

Usage Experience and Processing Time

Usage Examples

# Create the target directories
jls42@Boo:~/blog/jls42$ mkdir content/traductions_en content/traductions_es

###############################################
# Translation request to the AI into English  #
###############################################
jls42@Boo:~/blog/jls42$ python3 translate.py --source_dir content/ --target_dir content/traductions_en
Traitement du fichier : content/posts/ia/stable-difusion-aws-ec2.md
Traduction terminée en 21.57 secondes.
Fichier 'stable-difusion-aws-ec2.md' traité.
Traitement du fichier : content/posts/ia/poc-openai-api-gpt4.md
Traduction terminée en 34.87 secondes.
Fichier 'poc-openai-api-gpt4.md' traité.
Traitement du fichier : content/posts/ia/poc-mistral-ai-mixtral.md
Traduction terminée en 62.47 secondes.
Fichier 'poc-mistral-ai-mixtral.md' traité.
Traitement du fichier : content/posts/raspberry-pi/installation-de-kubernetes-sur-raspberry-pi-via-ansible.md
Traduction terminée en 46.37 secondes.
Fichier 'installation-de-kubernetes-sur-raspberry-pi-via-ansible.md' traité.
Traitement du fichier : content/posts/raspberry-pi/installation-de-docker-sur-raspberry-pi-via-ansible.md
Traduction terminée en 10.08 secondes.
Fichier 'installation-de-docker-sur-raspberry-pi-via-ansible.md' traité.
Traitement du fichier : content/posts/raspberry-pi/initialisation-auto-de-raspbian-sur-raspberry-pi.md
Traduction terminée en 17.17 secondes.
Fichier 'initialisation-auto-de-raspbian-sur-raspberry-pi.md' traité.
Traitement du fichier : content/posts/blog/nouveau-theme-logo.md
Traduction terminée en 12.91 secondes.
Fichier 'nouveau-theme-logo.md' traité.
Traitement du fichier : content/posts/infrastructure/infrastruture-as-code-serverless-ha-jls42-org.md
Traduction terminée en 12.64 secondes.
Fichier 'infrastruture-as-code-serverless-ha-jls42-org.md' traité.
Traitement du fichier : content/mentions/mentions-legales.md
Traduction terminée en 11.90 secondes.
Fichier 'mentions-legales.md' traité.
Traitement du fichier : content/about/a-propos-du-blog-jls42.md
Traduction terminée en 18.72 secondes.
Fichier 'a-propos-du-blog-jls42.md' traité.

################################################
# Translation request to the AI into Spanish   #
################################################
jls42@Boo:~/blog/jls42$ python3 translate.py --source_dir content/ --target_dir content/traductions_es --target_lang es
Traitement du fichier : content/posts/ia/stable-difusion-aws-ec2.md
Traduction terminée en 33.19 secondes.
Fichier 'stable-difusion-aws-ec2.md' traité.
Traitement du fichier : content/posts/ia/poc-openai-api-gpt4.md
Traduction terminée en 25.24 secondes.
Fichier 'poc-openai-api-gpt4.md' traité.
Traitement du fichier : content/posts/ia/poc-mistral-ai-mixtral.md
Traduction terminée en 58.78 secondes.
Fichier 'poc-mistral-ai-mixtral.md' traité.
Traitement du fichier : content/posts/raspberry-pi/installation-de-kubernetes-sur-raspberry-pi-via-ansible.md
Traduction terminée en 17.64 secondes.
Fichier 'installation-de-kubernetes-sur-raspberry-pi-via-ansible.md' traité.
Traitement du fichier : content/posts/raspberry-pi/installation-de-docker-sur-raspberry-pi-via-ansible.md
Traduction terminée en 19.60 secondes.
Fichier 'installation-de-docker-sur-raspberry-pi-via-ansible.md' traité.
Traitement du fichier : content/posts/raspberry-pi/initialisation-auto-de-raspbian-sur-raspberry-pi.md
Traduction terminée en 37.12 secondes.
Fichier 'initialisation-auto-de-raspbian-sur-raspberry-pi.md' traité.
Traitement du fichier : content/posts/blog/nouveau-theme-logo.md
Traduction terminée en 18.91 secondes.
Fichier 'nouveau-theme-logo.md' traité.
Traitement du fichier : content/posts/infrastructure/infrastruture-as-code-serverless-ha-jls42-org.md
Traduction terminée en 30.73 secondes.
Fichier 'infrastruture-as-code-serverless-ha-jls42-org.md' traité.
Traitement du fichier : content/mentions/mentions-legales.md
Traduction terminée en 13.14 secondes.
Fichier 'mentions-legales.md' traité.
Traitement du fichier : content/about/a-propos-du-blog-jls42.md
Traduction terminée en 11.24 secondes.
Fichier 'a-propos-du-blog-jls42.md' traité.

Processing Time

  • English: About 4 minutes (248.70 seconds)
  • Spanish: About 4.7 minutes (284.05 seconds)
  • Cumulative Total: About 8.9 minutes (532.75 seconds)

With ten articles per run, this averages out to roughly 25 to 30 seconds per article. These times demonstrate the script’s efficiency and speed.

Results

You can now access the results of these automated translations at the following links:

This blog post is a recap of my experience in automating translation with AI. It’s proof that when you combine programming with artificial intelligence, the possibilities are nearly limitless, opening up new and exciting horizons for knowledge-sharing and content accessibility.

This document has been translated from the French version of the blog by the gpt-4-1106-preview model.