PDFParser

January 2019

Context

I was given a group project whose goal was to generate summaries from PDF files: a tool that could parse and classify large amounts of text, extract the key information, and present it in an easily digestible format.

As a starting point, I chose to use Python and NLP toolkits, which would allow us to work with natural language and analyze the text for important keywords and concepts. Additionally, I decided to use the CEF (Chromium Embedded Framework) for the front-end, as it provided a flexible and powerful user interface for our program.

Mock screenshot of the program home screen

Problem solving

To achieve our goal, I needed to break the problem down into several smaller tasks. The first step was to create a parser that could extract text from PDF files and convert it into a more easily processed format. I accomplished this using a third-party tool called PDFtoText, which I integrated into our program.
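The extraction step can be sketched roughly as follows. This is a minimal illustration, not the project's actual integration code: it assumes the tool is the `pdftotext` command-line utility shipped with Poppler/Xpdf and that it is available on `PATH`. The function names are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for the pdftotext CLI, writing text to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        command.append("-layout")  # keep the physical layout of the page
    # Passing "-" as the output file makes pdftotext write to standard output.
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Return the text of one PDF; raise if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command builder separate from the subprocess call makes the flags easy to check without having a PDF or the tool at hand.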

Next, I needed to develop a classification algorithm that could analyze the text and identify important information such as keywords and summaries. This required several NLP toolkits, together with regular expressions and custom scripts that could identify patterns and structures within the text.
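To give an idea of the regular-expression side of this step, here is a small keyword extractor based on word frequency. It is a simplified sketch rather than the classifier used in the project: the stop-word list, the pattern, and the function name are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "from"}

# Lowercase words, allowing internal hyphens such as "real-time".
WORD_PATTERN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def extract_keywords(text, limit=5):
    """Return up to `limit` of the most frequent non-stop-words in the text."""
    words = WORD_PATTERN.findall(text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]
```

Raw frequency is a crude signal; a real pipeline would likely add stemming or a weighting scheme, which an NLP toolkit provides out of the box.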

Once I had a functional parser and classifier, I turned my attention to the user interface. I used CEF to build a graphical interface that lets users interact with the program easily. It includes file selection, output formatting, and a progress bar that shows the status of the parsing and classification process.

Screenshot of the program home screen

Keyboard shortcuts

One key feature of our program is configurable keyboard shortcuts. They let users perform common tasks without navigating through menus or remembering specific command-line arguments. For example, users can toggle between XML and plain text output formats by pressing the “x” key. This is particularly useful for power users who need to process large amounts of data quickly.
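The idea behind configurable shortcuts can be sketched as a table that maps keys to actions. The class and function names below are hypothetical, and the default output format is an assumption; only the "x" toggle comes from the program itself.

```python
from dataclasses import dataclass, field


@dataclass
class AppState:
    output_format: str = "xml"  # assumed default; the other value is "text"


def toggle_output_format(state):
    """Switch between XML and plain text output."""
    state.output_format = "text" if state.output_format == "xml" else "xml"


@dataclass
class KeyBindings:
    # Maps a key to the action it triggers; "x" toggles the output format.
    bindings: dict = field(default_factory=lambda: {"x": toggle_output_format})

    def rebind(self, old_key, new_key):
        """Move an action to another key, refusing keys already in use."""
        if new_key in self.bindings:
            raise ValueError(f"{new_key!r} is already bound")
        self.bindings[new_key] = self.bindings.pop(old_key)

    def handle(self, key, state):
        """Run the action bound to `key`; return False for unbound keys."""
        action = self.bindings.get(key)
        if action is None:
            return False
        action(state)
        return True
```

Because the bindings are plain data, changing a shortcut is a dictionary update rather than a code change.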

Screenshot of the program showing the configurable keybinds

Real-time configuration

Another valuable feature of our program is the real-time configuration form for the input and output directories. It shows users their current input and output folders and lets them update either one in real time. This makes it easy to keep track of the work and to check that the program is reading and writing the correct files. Configuring the folders through the user interface also simplifies setup, which makes the program more accessible to users who are not familiar with the command line.
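One possible way to back such a form is a small settings object that is validated and written to disk whenever the user saves. This is only a sketch under assumptions: the JSON file format, the validation rules, and all names are hypothetical and not taken from the project.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class FolderConfig:
    input_dir: str
    output_dir: str

    def validate(self):
        """Return a list of problems; an empty list means the paths are usable."""
        problems = []
        if not Path(self.input_dir).is_dir():
            problems.append(f"input folder does not exist: {self.input_dir}")
        if Path(self.input_dir) == Path(self.output_dir):
            problems.append("input and output folders must differ")
        return problems


def save_config(config, config_file):
    """Persist the folders so the next run starts with the same settings."""
    text = json.dumps(asdict(config), indent=2)
    Path(config_file).write_text(text, encoding="utf-8")


def load_config(config_file):
    """Read the folders back from disk."""
    data = json.loads(Path(config_file).read_text(encoding="utf-8"))
    return FolderConfig(**data)
```

Validating before saving lets the form report a bad path immediately instead of failing later in the middle of a parsing run.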

Screenshot of the program showing the configurable path and the save button that applies the change without restarting the application

Visual preview of the output structures

Finally, our program includes a visual preview of the output structures and configuration options. Users can see the expected output of their configuration changes before running the program, which gives them greater control over the parsing and summarizing process. This is particularly valuable for complex documents, where users need to be sure that the output is accurate and informative.
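A preview like this can be produced by rendering a sample record in the currently selected format. The sketch below is illustrative only: the element names, the sample record, and the function names are assumptions, not the program's real output schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record used only to render the preview.
SAMPLE = {
    "title": "Sample paper",
    "keywords": ["parsing", "summary"],
    "summary": "A short abstract.",
}


def preview_xml(record):
    """Render the record as an indented XML document."""
    root = ET.Element("document")
    ET.SubElement(root, "title").text = record["title"]
    keywords = ET.SubElement(root, "keywords")
    for word in record["keywords"]:
        ET.SubElement(keywords, "keyword").text = word
    ET.SubElement(root, "summary").text = record["summary"]
    ET.indent(root)  # requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


def preview_text(record):
    """Render the record as labelled plain-text lines."""
    return "\n".join([
        f"Title: {record['title']}",
        f"Keywords: {', '.join(record['keywords'])}",
        f"Summary: {record['summary']}",
    ])


def preview(record, output_format):
    """Dispatch to the renderer for the chosen format ("xml" or "text")."""
    renderers = {"xml": preview_xml, "text": preview_text}
    return renderers[output_format](record)
```

Calling `preview(SAMPLE, "xml")` or `preview(SAMPLE, "text")` whenever a setting changes is enough to refresh the preview pane without touching any real input file.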

Screenshot of the program showing the visual preview of the output structures

Conclusion

Overall, the group project was a success. I was able to create a powerful tool for parsing and summarizing large amounts of textual data. Through this project, I learned a great deal about the challenges of working with natural language, as well as the power of using NLP techniques to automate complex tasks. Working with CEF made the project stand out in both usability (UX) and interface design (UI), which was a big advantage and pleased the jury a lot.
