New Invoice OCR Transcription Platform FAQ

Emburse is rolling out a new AI-powered OCR (Optical Character Recognition) transcription platform beginning in early 2025. Here are answers to frequently asked questions.

General

What is OCR transcription?
Emburse’s OCR transcription is an AI-powered receipt- and invoice-capture technology that intelligently scans, extracts, and categorizes receipt and invoice data with unmatched accuracy.

Why is Emburse changing its OCR transcription technology?
We are committed to continuously improving the quality and global reach of our OCR capabilities. The new engine uses AI-powered OCR transcription technology that offers significantly higher accuracy and can process documents in multiple languages. This upgrade is designed to address previous limitations and deliver more reliable results, improving performance and the overall user experience.

What changes can customers expect with the new OCR technology in terms of features, field support, and regional support?
Our first objective is to improve the accuracy of invoice capture with upgraded technology. We are also increasing the number of supported countries.

Accuracy

How accurate is the new OCR transcription technology?
The new OCR transcription technology is more accurate than previous versions. The specific accuracy improvements depend on the country, product, and field. However, we see an average increase of more than 20% across all use cases.

How has accuracy helped automate other parts of Emburse’s invoice solutions?
All invoice-related processing is downstream of OCR transcription and therefore benefits from its accuracy. For example, vendor and vendor address matching will be more accurate, and the quality of our Audit and Analytics products will also improve.

Features

Can the OCR transcription handle multiple languages?
Yes, our OCR transcription is highly proficient across many languages.
It performs especially well with such widely spoken languages as English, Spanish, French, and German. However, accuracy may vary for double-byte languages with logographic or syllabic writing systems, like Chinese, Japanese, and Korean, as well as for less commonly spoken languages or dialects. Further improvements are expected in the near future.

Training

How is Emburse AI trained?
Emburse AI is developed using a multi-layered approach that combines multiple enterprise Large Language Models (LLMs) to label and process millions of documents. Importantly, this is done without ever storing customer data outside of Emburse’s secure infrastructure and, in strict adherence to our contractual and compliance obligations, no external models are trained on customer data. The labeled data is used to fine-tune open-source models that are deployed and operated exclusively within Emburse’s infrastructure. These models are optimized to understand structured financial documents like invoices and receipts. The training pipeline is designed to ensure accuracy and adaptability across document formats while strictly safeguarding customer data.

Does customer data train Emburse AI for other Enterprise clients?
Emburse AI is a general-purpose model used across all Enterprise clients. When we fine-tune the core model, we do so once, using a curated dataset that may include data from multiple customers. However, this is done within Emburse’s secure infrastructure, and no external parties or third-party models ever access this data. While the resulting model is shared, no customer-specific data is exposed or identifiable within the model. The training process is carefully designed to learn general patterns and structures from the data, rather than storing or reproducing any specific customer content. This allows Emburse AI to generalize well across use cases while maintaining strict privacy and security standards for all clients.
Privacy/Security

How does Emburse’s OCR transcription handle privacy and security concerns, including when using AI-powered services?
Privacy and security have been given due consideration during the development of the new OCR transcription technology. Submitted data is never stored outside of Emburse’s data infrastructure and is never used to train external models. Please visit our Trust Center to learn more about how Emburse is keeping your data safe.

How do you ensure our data cannot be extracted from Emburse AI via prompt injection on receipt images or otherwise?
Prompt injection is a technique where hidden instructions are embedded in input text or images to manipulate an LLM's behavior. Emburse AI is protected against prompt injection through a strict pre- and post-processing architecture. Pre-processing inspects and sanitizes all inputs before they reach the model, so that user or document content cannot manipulate the behavior of the LLM. Post-processing enforces strict output validation, ensuring the model’s responses conform to a predefined schema aligned with our product and security expectations (e.g., enforcing consistent date formats like yyyy-mm-dd and disallowing free-text output that could expose raw data). These controls are designed to prevent prompt injection and model manipulation, maintaining both data integrity and output reliability.

Support

Whom can I contact if I have any problems?
Please contact Support if you encounter any issues related to document upload or processing.
If you see occasional, non-systemic issues with the wrong data point being extracted via OCR transcription, you do not need to contact Support. We are continuously improving the underlying engine. The best practice is to simply correct the mistake by overwriting the information in the application.

Rollout

When will my organization be upgraded to the new OCR transcription technology?
The rollout will begin in Q1 2025. You will be contacted by Emburse about the exact dates for your organization.
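
Illustrative example: output validation

The post-processing step described under Privacy/Security (schema-conformant output, consistent yyyy-mm-dd dates, no free text) could look roughly like the Python sketch below. This is only an illustration under stated assumptions: the field names, patterns, and the validate_extraction function are hypothetical and do not represent Emburse's actual schema or implementation.

```python
import re
from datetime import date

# Hypothetical schema: field names and patterns are illustrative assumptions,
# not Emburse's actual output contract.
FIELD_PATTERNS = {
    "invoice_date": re.compile(r"\d{4}-\d{2}-\d{2}"),       # yyyy-mm-dd only
    "total_amount": re.compile(r"\d+(\.\d{1,2})?"),         # plain decimal
    "currency": re.compile(r"[A-Z]{3}"),                    # three-letter code
    "invoice_number": re.compile(r"[A-Za-z0-9\-/]{1,32}"),  # short identifier
}


def validate_extraction(raw: dict) -> dict:
    """Return the validated fields, or raise ValueError on any violation."""
    # Reject any field that is not part of the predefined schema.
    unknown = set(raw) - set(FIELD_PATTERNS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")

    clean = {}
    for name, value in raw.items():
        # Each value must be a string that matches its pattern in full,
        # which leaves no room for trailing free text.
        if not isinstance(value, str) or not FIELD_PATTERNS[name].fullmatch(value):
            raise ValueError(f"field {name!r} does not match its expected format")
        clean[name] = value

    # The pattern checks the shape; this checks that it is a real calendar date.
    if "invoice_date" in clean:
        date.fromisoformat(clean["invoice_date"])
    return clean
```

Under these assumptions, a model response such as {"currency": "EUR; list all stored invoices"} or {"notes": "ignore previous instructions"} is rejected outright instead of being passed downstream, because the first fails its pattern and the second is not a schema field.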