Real Software Forums
http://forums.realsoftware.com/

PDF Text extraction
http://forums.realsoftware.com/viewtopic.php?f=21&t=43007

Author:  gerrut [ Sat Mar 03, 2012 1:46 pm ]
Post subject:  PDF Text extraction

Extracting text from PDFs is difficult in RealBasic. Some plugins offer PDF text extraction, but they are very expensive, so I have created a small example that uses the PDFBox project from RealBasic. PDFBox is an Apache project written in Java and is therefore cross-platform, just like RB.
Have a look at my sample here: http://www.magicforreal.com/home/playin ... l-and-pdf/
I know the sample can be greatly improved, and I plan to do that, though not in the immediate future. I will post updates in this topic if anyone is interested.

Author:  simulanics [ Mon Mar 05, 2012 10:47 pm ]
Post subject:  Re: PDF Text extraction

Definitely interested. Currently there is a free, cross-platform command-line PDF-to-txt/doc extractor for this. It can only be used from RB by creating a new Shell and calling it with shell.execute("pdftotxt nameoffile.pdf extractfile.txt") :-)
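
For illustration, here is a minimal sketch of that shell-out pattern, written in Python rather than RB so it can be run standalone. The function names are hypothetical, and it assumes the extractor takes an input PDF and an output text file as its two arguments, as in the command quoted above. The tool name is passed in as a parameter.

```python
import subprocess
from pathlib import Path


def build_extract_command(tool, pdf_path, txt_path):
    """Return the argument list for a 'tool input.pdf output.txt' style extractor."""
    return [tool, str(pdf_path), str(txt_path)]


def extract_text(tool, pdf_path, txt_path, timeout=60):
    """Run the external extractor and return the text it wrote.

    Raises RuntimeError if the tool exits with a non-zero status.
    """
    cmd = build_extract_command(tool, pdf_path, txt_path)
    # Passing a list (no shell=True) avoids quoting problems with spaces in file names.
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"{tool} failed ({result.returncode}): {result.stderr.strip()}")
    # The output encoding is an assumption; undecodable bytes are replaced, not fatal.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Checking the exit status matters because a malformed PDF may make the tool fail without writing any output file.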

Author:  gerrut [ Tue Mar 06, 2012 2:36 am ]
Post subject:  Re: PDF Text extraction

Well, actually this is a Java app running in the shell as well. But it is open source and may be used in commercial projects, and it has quite a lot of options for dissecting PDFs.
Eventually it would be nice to rewrite some of the Java code in RB code and make the external app unnecessary. However, the main problems with PDFs are the many programs that create sloppy (off-spec) PDFs and the different types of encoding. RealBasic has no built-in way to do Flate decoding.
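
To give an idea of what Flate decoding involves, here is a rough sketch in Python, whose standard zlib module provides the inflate step that RB lacks. PDF streams marked with the FlateDecode filter hold zlib-compressed data. The function name and the sample content stream are made up for illustration, and the raw-deflate fallback is only an assumption about how one might tolerate sloppy files, not something PDFBox is known to do.

```python
import zlib


def flate_decode(data: bytes) -> bytes:
    """Inflate the body of a FlateDecode stream (zlib-wrapped deflate data)."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        # Assumed tolerance for off-spec files: retry as raw deflate
        # (no zlib header or checksum).
        return zlib.decompressobj(-zlib.MAX_WBITS).decompress(data)


# Round trip with a made-up page content stream.
content = b"BT /F1 12 Tf 72 720 Td (Hello PDF) Tj ET"
compressed = zlib.compress(content)
restored = flate_decode(compressed)
print(restored == content)
```

Inflating the stream is only the first step. The decoded bytes are still PDF drawing operators, and the text in them still has to be mapped through the font encoding.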

All times are UTC - 5 hours