THIS SERVICE IS DEPRECATED - PLEASE USE THE Rossinante web service INSTEAD.
ALSO: We advise new users to start with our interactive web application PDF to ePub, which lets you quickly test a PDF and view both the XML and ePub results.
(Deprecated) PDF to XML service
Convert any PDF document to properly structured XML or common formats like ePub in just a few clicks! Simply upload your PDF file and activate the conversion.
This service is also available in a different flavor, the Rossinante web service.
About the Application
Please note that both services are provided for illustrative purposes:
- They use a general-purpose parameterization rather than a collection-specific one, which would increase the quality of the conversion
- They are not meant to sustain massive conversions
- Feedback, suggestions and questions are more than welcome. You can submit them using our contact form.
This work is partly supported by the EU Integrated Project SHAMAN, co-funded within the 7th Framework Programme. More information about this project is available in a companion web service called PDF to ePub.
For more information, you can read about the data exchanged by this service or some related publications and access additional software.
More About PDF to XML
From a technical perspective, this service uses HTTPS as its network transport protocol, so accessing it from an intranet only requires configuring the HTTP proxy correctly, if there is one.
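As a minimal sketch of the proxy setup mentioned above, the following Python snippet builds an HTTPS-capable client that routes its requests through an HTTP proxy. The proxy address is a placeholder and the function name `build_opener_with_proxy` is ours; neither comes from the service itself, and the snippet makes no request to it.

```python
import urllib.request


def build_opener_with_proxy(proxy_url=None):
    """Return a URL opener that sends HTTP and HTTPS traffic through proxy_url.

    If proxy_url is None, proxies are disabled for this opener.
    """
    if proxy_url:
        # The same HTTP proxy is used for both plain HTTP and HTTPS requests.
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    else:
        # An empty mapping disables proxy use, including environment settings.
        handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(handler)


# Placeholder proxy address (an assumption); replace it with your intranet's proxy.
opener = build_opener_with_proxy("http://proxy.example.internal:8080")
```

Requests to the service would then go through `opener.open(...)` with the service URL, which is not reproduced here.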
The server offers the same set of operations as the PDF to ePub web application:
- PDF-to-Layout-XML: to convert a PDF file to layout-oriented (unstructured) XML using the pdf2xml open source converter
- Page header/footer: to tag the page headers and page footers of all pages of the input document
- Text segmentation and ordering: to order the textual contents of the document, create lines, and form paragraphs
- Page number detection: to detect the numbering of the pages
- Caption detection: to detect image and figure captions
- Footnote detection
- Table of Contents (TOC) determination: to detect the TOC and determine where each entry points to in the document body
- Export to ePub (soon)
In addition, it offers several operations that aggregate the previous ones:
- PDF-to-LOGICAL-XML: to convert a PDF file to logically structured XML, running all steps before the export to ePub
- PDF-to-ePub: all steps to convert from PDF to ePub, as in the web application (soon)
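To illustrate how the aggregate operations relate to the individual ones, here is a small Python sketch that models each aggregate as a sequence of steps. The identifiers are illustrative names derived from the list of operations, not the service's actual operation names, and the ordering is an assumption based on the order in which the steps are listed.

```python
# Illustrative step identifiers (assumed names), in the order the operations are listed.
STEPS = [
    "pdf_to_layout_xml",
    "page_header_footer",
    "text_segmentation_and_ordering",
    "page_number_detection",
    "caption_detection",
    "footnote_detection",
    "toc_determination",
    "export_to_epub",
]


def aggregate(name):
    """Return the list of steps that an aggregate operation is assumed to run."""
    if name == "pdf_to_logical_xml":
        # All steps before the export to ePub.
        return STEPS[:-1]
    if name == "pdf_to_epub":
        # Every step, including the final export.
        return list(STEPS)
    raise ValueError(f"unknown aggregate operation: {name}")


for operation in ("pdf_to_logical_xml", "pdf_to_epub"):
    print(operation, "->", ", ".join(aggregate(operation)))
```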
Also have a look at the open source Xeproc designer if you are interested in modelling document processing as XML pipelines.