Our client, a leader in identifying Key Opinion Leaders (KOLs) for pharmaceutical companies, frequently gathers data from publicly available medical congress websites. They previously faced significant challenges in collecting, converting, and storing this data, which led to time-consuming manual processes and reliance on external vendors for data extraction. Seeking to streamline their workflows and reduce costs, they approached ZLabs to develop a solution that automates their data extraction and conversion tasks.
Problem Statement
The client's existing data collection processes were inefficient and costly. Their team manually converted data from medical congress websites' PDF files into Excel, a time-intensive process prone to errors. Additionally, the client hired external vendors to extract congress abstract data from these websites, which incurred high costs. This manual workflow included:
- Downloading PDFs from medical congress websites.
- Copying and pasting data from online sources into Excel.
- Sorting and storing the collected data using Excel templates.
The client needed a solution to automate the PDF conversion and data extraction processes to reduce costs and increase efficiency.
Goals/Objectives
The primary objectives of the solution were:
- Automate the conversion of PDF data into Excel format.
- Eliminate the need for costly third-party vendors for data extraction.
- Improve the accuracy of extracted data and reduce the risk of missing critical information.
- Streamline the data collection process from medical congress websites.
- Reduce manual effort and processing time while maintaining data accuracy and integrity.
Approach/Methodology
ZLabs conducted a thorough analysis of the client's current processes and challenges. Based on this analysis, the following key steps were taken to address the issues:
- Problem Identification: The challenges of manually extracting relevant data from large PDFs and dynamic websites were identified. Key issues were the unstructured, inconsistent formats of the source data and the risk of missing records during manual extraction.
- Tool Development: ZLabs designed and developed a custom tool that automated the extraction of data from PDFs and websites. The tool included batch processing capabilities for improved efficiency and accuracy.
- Data Validation: To ensure the integrity of the extracted data, predefined formulas were incorporated into the tool for validation. If any discrepancies were found, the tool allowed for rerunning specific batches.
- Keyword Filtering: The tool featured advanced keyword filtering to ensure only relevant data was extracted, reducing the need for post-extraction manual reviews (a minimal sketch of this step appears below).
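To make the filtering and conversion steps concrete, the sketch below shows one way such a step can be implemented in Python, assuming the pdfplumber and pandas libraries; the file names, keyword list, and column layout are illustrative assumptions, not details of ZLabs's actual tool.

```python
import pdfplumber
import pandas as pd

# Hypothetical search terms; the client's actual keyword lists are not public.
KEYWORDS = ["oncology", "immunotherapy", "biomarker"]

def extract_matching_lines(pdf_path, keywords):
    """Collect every text line in the PDF that mentions at least one keyword."""
    lowered = [k.lower() for k in keywords]
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            for line in text.splitlines():
                if any(k in line.lower() for k in lowered):
                    rows.append({"page": page_number, "text": line.strip()})
    return pd.DataFrame(rows, columns=["page", "text"])

if __name__ == "__main__":
    df = extract_matching_lines("congress_abstracts.pdf", KEYWORDS)
    df.to_excel("filtered_abstracts.xlsx", index=False)  # requires the openpyxl engine
```

Keeping the keyword match at the line level is only one possible design; a production tool would typically extract structured abstract records rather than raw lines.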
Solution/Implementation
ZLabs’s custom-built tool significantly reduced manual data processing efforts by automating the conversion of PDFs into Excel and scraping data from medical congress websites. Key features of the solution included:
- Batch Processing: The tool was designed to process large volumes of data in batches. Each batch was evaluated against predefined validation rules, ensuring data quality and accuracy. If any issues were detected, specific batches could be rerun without impacting the entire workflow (a batch-validation sketch follows this list).
- Keyword Filtering: By implementing a keyword filtering system, the tool was able to extract only the most relevant information from the PDFs and web sources, reducing noise and irrelevant data.
- Web Scraping Automation: The tool was designed to navigate dynamic medical congress websites, ensuring data was consistently and accurately scraped even as page structures and link formats changed (a link-discovery sketch follows this list).
- Error Handling and Reprocessing: The system was built with error-handling features that allowed problematic data batches to be flagged and reprocessed, preventing the loss of important records.
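The batch-validation sketch referenced above is shown here in Python, assuming each extracted record is a dictionary; the field names, batch size, and validation rule are illustrative assumptions, not ZLabs's production logic.

```python
# Hypothetical required fields standing in for the client's predefined validation rules.
REQUIRED_FIELDS = ["title", "authors", "congress", "year"]

def validate_record(record):
    """A record passes only if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

def process_in_batches(records, batch_size=100):
    """Split records into fixed-size batches, validate each batch, and return
    the indices of batches that should be rerun."""
    failed_batches = []
    for batch_index, start in enumerate(range(0, len(records), batch_size)):
        batch = records[start:start + batch_size]
        if not all(validate_record(record) for record in batch):
            failed_batches.append(batch_index)  # flag for reprocessing; continue with the rest
    return failed_batches
```

Only the flagged batches would be re-extracted, which mirrors the case study's point that problematic batches can be rerun without touching the rest of the workflow.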
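The link-discovery sketch is a hedged illustration of how scraping can be made resilient to changing links, assuming the requests and BeautifulSoup (bs4) libraries; the URL and the "abstract" link-text pattern are placeholders, not taken from any actual congress website.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def discover_abstract_links(listing_url):
    """Find abstract detail pages by matching link text instead of hard-coding
    URLs, so the scraper tolerates changing link structures."""
    response = requests.get(listing_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        if "abstract" in anchor.get_text(strip=True).lower():
            links.append(urljoin(listing_url, anchor["href"]))
    return links

# Example call against a placeholder URL (not a real congress site):
# abstract_pages = discover_abstract_links("https://example.org/congress/programme")
```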
In addition to the tool's automated features, a manual Quality Assurance (QA) validation process was implemented to ensure the highest data accuracy. This process includes the following steps:
- Language Translation: If conference data is in a local language (such as Spanish, Italian, or German), it is translated into English using Google Translate for consistency (a scripted sketch of this step follows the list).
- Abstract Alignment: The data is then reviewed to ensure that it is properly formatted and aligned.
- Data Cleansing and Validation: The data is cleansed, and a comprehensive Quality Check (QC) is then performed to confirm that all titles matching the search terms have been captured accurately.
- Spot Check for Expert Data: A spot check confirms that the correct number of presenting authors is captured for each title and that no Key Opinion Leaders (experts) have been missed during data extraction.
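The translation step is described above as part of a manual QA process performed with Google Translate; where the team chooses to script it, the sketch below shows one possible approach, assuming the deep-translator package as a programmatic route to Google Translate. The library choice and function names are assumptions, not part of the client's documented workflow.

```python
from deep_translator import GoogleTranslator

def translate_titles(titles):
    """Translate non-English abstract titles to English, keeping the original
    text whenever translation fails so a reviewer can handle it manually."""
    translator = GoogleTranslator(source="auto", target="en")
    translated = []
    for title in titles:
        try:
            translated.append(translator.translate(title))
        except Exception:
            translated.append(title)  # leave untouched for manual QA review
    return translated
```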
Results/Outcomes
The implementation of ZLabs’s tool resulted in substantial improvements to the client’s data collection process:
- Reduced Manual Effort by 70%: The automation of PDF conversion and web scraping tasks significantly reduced the manual workload, allowing the client’s team to focus on higher-value activities.
- Increased Accuracy and Reliability: By leveraging predefined validation formulas and keyword filtering, the accuracy of the extracted data improved dramatically, minimizing the risk of missed or incorrect data.
- Cost Savings: The client was able to completely eliminate the need for third-party vendors, resulting in considerable cost savings.
- Improved Turnaround Time: The batch processing capabilities allowed the client to process large volumes of data faster than before, reducing turnaround times for data collection and processing.
Lessons Learned
The key lessons from this case study include:
- Automation Enhances Efficiency: Automating repetitive, manual tasks like PDF conversion and data extraction can drastically improve efficiency, reduce human error, and free up resources for more strategic tasks.
- Importance of Data Validation: Implementing robust data validation rules ensures the integrity of extracted data, even in large-scale, automated workflows.
- Flexible Systems for Dynamic Environments: Solutions need to account for the dynamic nature of websites, as links and formats may change. A system capable of adapting to such changes ensures continuity in data collection efforts.
Conclusion
ZLabs’s automation solution provided the client with a more efficient and cost-effective way to handle their data extraction and PDF conversion processes. By automating manual tasks and eliminating the need for external vendors, the client realized significant time and cost savings while improving the accuracy of their data collection. Moving forward, the client is well-positioned to scale their operations without incurring additional resource burdens, allowing them to better serve their pharmaceutical clients.