Tuesday, September 13, 2016

Extract Images and Text from a PDF Using Spire.PDF



Introduction
In this article we will see how to extract images and text from a PDF file, and then convert the file to a booklet, using Spire.PDF. For those who aren’t familiar with it, Spire.PDF is a professional PDF component for creating, writing, editing, handling, and reading PDF files in .NET applications without any external dependencies. With this .NET PDF library you can create PDF files from scratch or process existing PDF documents entirely in C# or VB.NET, without installing Adobe Acrobat.

Background
I was working on a project and wanted to extract text from multiple PDF files in order to analyse the contents before exporting them to a database. I searched the internet and found Spire.PDF. It worked well for me and it is very easy to use. The library is not limited to extraction, however: it supports many other features, such as security settings, metadata updates, and data import, to name a few. It also converts text, images, and HTML to PDF from C#/VB.NET in high quality.

Prerequisites
In this demo we are using Visual Studio 2015, .NET Framework 4.5.2, and Spire.PDF for .NET (by e-iceblue).

Download and Install Spire.PDF



Step 1: Open Visual Studio and select File -> New Project. In the New Project dialog box, select Visual C#, then Windows Forms Application. Enter a project name at the bottom of the dialog box and click OK. Then add a reference to the Spire.PDF DLL to the project.

Step 2: I'm using a single button to extract both the images and the text from the PDF file, but you can use separate buttons if you want.



Using the Code

§ Select button code

private void selectBtn_Click(object sender, EventArgs e)
        {
            using (OpenFileDialog dialog = new OpenFileDialog())
            {
                // Only PDF files can be selected.
                dialog.Filter = "PDF files (*.pdf)|*.pdf";
                // Keep the current path if the user cancels the dialog.
                if (dialog.ShowDialog() == DialogResult.OK)
                {
                    tB1.Text = dialog.FileName;
                }
            }
        }
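
The handlers in this article only check that the text box is not empty before they use its content as a file path. Since a text box can also be edited by hand, a slightly stricter check may be worth having. The following is a minimal sketch, not part of Spire.PDF or of the original project: the class name PdfPathCheck and the method LooksLikePdf are hypothetical, and the check only looks at the file extension and at whether the file exists, not at the file's content.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; it does not call Spire.PDF.
public static class PdfPathCheck
{
    // Returns true when the path is non-empty, ends in ".pdf" (any casing),
    // and points to a file that exists on disk.
    public static bool LooksLikePdf(string path)
    {
        if (string.IsNullOrWhiteSpace(path))
        {
            return false;
        }

        try
        {
            string extension = Path.GetExtension(path);
            bool isPdf = string.Equals(extension, ".pdf", StringComparison.OrdinalIgnoreCase);
            return isPdf && File.Exists(path);
        }
        catch (ArgumentException)
        {
            // Some framework versions throw for paths with invalid characters.
            return false;
        }
    }
}
```

With such a helper in the project, the condition tB1.Text != "" in a handler could be replaced by PdfPathCheck.LooksLikePdf(tB1.Text).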

§ Extract image & text button code

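// Namespaces required at the top of the form's code file:
// using System; using System.Collections.Generic; using System.Drawing;
// using System.Drawing.Imaging; using System.IO; using System.Text;
// using System.Windows.Forms; using Spire.Pdf;
private void Extract_Click(object sender, EventArgs e)
        {
            if (tB1.Text != "")
            {
                SaveFileDialog savefile = new SaveFileDialog();
                savefile.FileName = "TextInPdf.txt";
                savefile.Filter = "Text files (*.txt)|*.txt";
                // Continue only if the user clicked OK.
                if (savefile.ShowDialog() == DialogResult.OK)
                {
                    try
                    {
                        // Create a PDF document and load the selected file.
                        PdfDocument doc = new PdfDocument();
                        doc.LoadFromFile(tB1.Text);
                        StringBuilder buffer = new StringBuilder();
                        IList<Image> images = new List<Image>();
                        // Collect the text and the images of every page.
                        foreach (PdfPageBase page in doc.Pages)
                        {
                            buffer.Append(page.ExtractText());
                            foreach (Image image in page.ExtractImages())
                            {
                                images.Add(image);
                            }
                        }
                        doc.Close();

                        // Save the text to the file chosen in the dialog.
                        string fileName = savefile.FileName;
                        File.WriteAllText(fileName, buffer.ToString());

                        // Save the images in the same folder as the text file.
                        string folder = Path.GetDirectoryName(fileName);
                        int index = 0;
                        foreach (Image image in images)
                        {
                            string imageFileName = Path.Combine(folder, String.Format("Image-{0}.png", index++));
                            image.Save(imageFileName, ImageFormat.Png);
                        }

                        // Open the text file in the default application.
                        System.Diagnostics.Process.Start(fileName);
                    }
                    catch (Exception ex) { MessageBox.Show(ex.Message); }
                }
            }
        }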


§ Convert to booklet button code

private void bkletBtn_Click(object sender, EventArgs e)
        {
            if (tB1.Text != "")
            {
                SaveFileDialog savefile = new SaveFileDialog();
                savefile.FileName = "booklet.pdf";
                savefile.Filter = "PDF files (*.pdf)|*.pdf";
                // Continue only if the user clicked OK.
                if (savefile.ShowDialog() == DialogResult.OK)
                {
                    try
                    {
                        // Create a PDF document.
                        PdfDocument doc = new PdfDocument();
                        string srcPdf = tB1.Text;
                        // Each booklet page is two A4 pages wide and one A4 page high.
                        float width = PdfPageSize.A4.Width * 2;
                        float height = PdfPageSize.A4.Height;
                        doc.CreateBooklet(srcPdf, width, height, true);
                        // Save the booklet to the file chosen in the dialog.
                        doc.SaveToFile(savefile.FileName);
                        doc.Close();
                        // Open the booklet in the default PDF viewer.
                        PDFDocumentViewer(savefile.FileName);
                    }
                    catch (Exception ex) { MessageBox.Show(ex.Message); }
                }
            }
        }

        private void PDFDocumentViewer(string fileName)
        {
            try
            {
                System.Diagnostics.Process.Start(fileName);
            }
            catch (Exception ex)
            {
                // For example, no application is associated with PDF files.
                MessageBox.Show(ex.Message);
            }
        }
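
The handler above asks for a page that is twice as wide as A4, so each side of a booklet sheet holds two source pages side by side. To give an idea of what a booklet layout involves, here is a minimal sketch of the common saddle-stitch page order, where the sheets are stacked and folded in the middle. It is an illustration only: the class BookletOrder and its methods are hypothetical, the code does not call Spire.PDF, and I have not verified that CreateBooklet arranges pages in exactly this way.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; it does not call Spire.PDF.
public static class BookletOrder
{
    // Returns the 1-based source page numbers in printing order, two per
    // sheet side (left, right). The value 0 marks a blank page.
    public static List<int> Compute(int pageCount)
    {
        if (pageCount < 0)
        {
            throw new ArgumentOutOfRangeException("pageCount");
        }

        // Each folded sheet carries four pages, so pad to a multiple of four.
        int padded = (pageCount + 3) / 4 * 4;
        List<int> order = new List<int>();
        int low = 1;
        int high = padded;
        while (low < high)
        {
            // Front side: last remaining page on the left, first on the right.
            order.Add(PageOrBlank(high, pageCount));
            order.Add(PageOrBlank(low, pageCount));
            // Back side: next page on the left, next-to-last on the right.
            order.Add(PageOrBlank(low + 1, pageCount));
            order.Add(PageOrBlank(high - 1, pageCount));
            low += 2;
            high -= 2;
        }
        return order;
    }

    // Pages beyond the end of the source document become blanks (0).
    private static int PageOrBlank(int page, int pageCount)
    {
        return page <= pageCount ? page : 0;
    }
}
```

For an 8-page document, BookletOrder.Compute(8) returns 8, 1, 2, 7, 6, 3, 4, 5: the first sheet carries pages 8 and 1 on the front and pages 2 and 7 on the back, and the second sheet carries pages 6 and 3, then 4 and 5.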


Now let’s run the application, select a PDF file, and extract the images and text from it.

The images extracted from the PDF document are saved as Image-0.png, Image-1.png, and so on.


Now let’s convert the PDF file to a booklet by first selecting the file and then clicking the convert to booklet button in our application.

Conclusion


Spire.PDF is very easy to use and very helpful for .NET developers. The documentation is also simple and self-explanatory. In this article I showed a simple demonstration, but the library has other interesting features too, such as form filling and file conversion. I’ll keep sharing. Stay tuned.


Thank you so much for reading! If you have any complaints or suggestions about the code or the article, please let me know. Don't forget to leave your opinion in the comments section below. ;)

