Real Software Forums

The forum for Real Studio and other Real Software products.

All times are UTC - 5 hours




 Post subject: PDF Text extraction
Posted: Sat Mar 03, 2012 1:46 pm
Joined: Sat Apr 02, 2011 1:20 pm
Posts: 92
Location: Netherlands
Extracting text from PDFs is difficult to do in RealBasic. I know there are some plugins that offer PDF text extraction, but they are very expensive. I have therefore created a small example that uses the PDFBox project from RealBasic. PDFBox is an Apache project written in Java and is therefore, just like RB, cross-platform.
Have a look at my sample here: http://www.magicforreal.com/home/playin ... l-and-pdf/
I know the sample can be greatly improved, which I plan to do in the not-so-immediate future. In the meantime, I will post updates in this topic if anyone is interested.


 Post subject: Re: PDF Text extraction
Posted: Mon Mar 05, 2012 10:47 pm
Joined: Sun Aug 12, 2007 10:10 am
Posts: 1086
Location: Boiling Springs, SC
gerrut wrote:
I know the sample can be greatly improved. This is something I plan to do in the not so immediate future. However, I will post new updates in this topic if anyone is interested.



Definitely interested. Currently, the way to do this is with a free, cross-platform command-line PDF-to-text/doc extractor, which can only be used in RB by creating a new Shell and calling it with shell.execute("pdftotxt nameoffile.pdf extractfile.txt") :-)

_________________
Matthew A. Combatti
Real Studio 2012 r1.2



 Post subject: Re: PDF Text extraction
Posted: Tue Mar 06, 2012 2:36 am
Joined: Sat Apr 02, 2011 1:20 pm
Posts: 92
Location: Netherlands
Well, actually this is a Java app running in the shell as well. But it is open source and may be used in commercial projects. And it has quite a lot of options for dissecting PDFs.
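For anyone wondering what running PDFBox "in the shell" amounts to, here is a rough sketch of the command line involved. It is written in Java only to keep it self-contained; from RB the same command string would simply be handed to a Shell. Assumptions on my part, not taken from the sample: the jar name pdfbox-app.jar and the file names are placeholders, and the ExtractText command is the one the PDFBox 1.x/2.x command-line app provides (newer releases may use a different syntax).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfBoxShellSketch {

    // Builds: java -jar <jar> ExtractText <input.pdf> <output.txt>
    // jarPath is a placeholder for wherever the PDFBox app jar lives (assumption).
    static List<String> extractTextCommand(String jarPath, String pdfPath, String txtPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("ExtractText"); // PDFBox 1.x/2.x command-line tool name
        cmd.add(pdfPath);
        cmd.add(txtPath);
        return cmd;
    }

    // Runs the command and returns its exit code; the child's output is
    // forwarded to this process's console.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(cmd) once the jar and PDF exist.
        List<String> cmd = extractTextCommand("pdfbox-app.jar", "input.pdf", "output.txt");
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list rather than one string avoids quoting problems when a file name contains spaces.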
Eventually it would be nice if we could port some of the Java code to RB code and make the external app unnecessary. The main problem with PDFs, however, is the many programs that create sloppy (off-spec) PDFs, and the different types of encoding. RealBasic has no built-in way to do Flate decoding (the PDF FlateDecode filter).
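To give an idea of what the Flate step involves, here is a minimal sketch in Java using the standard java.util.zip classes. It is not PDFBox's code and not a PDF parser: it assumes you have already cut the raw bytes of one FlateDecode stream out of the file, and it ignores predictors and other filter parameters. The class and method names are made up for the illustration, and the deflate helper exists only so the example has something to decompress.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateSketch {

    // Decompresses zlib-wrapped deflate data, the format a FlateDecode
    // stream normally contains.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read: the data is truncated
                // or needs a preset dictionary, so stop instead of looping forever.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Helper for the demo only: compresses bytes the same way.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // A tiny PDF-style content stream, compressed and decompressed again.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.US_ASCII));
    }
}
```

The decompression itself is the easy part once a zlib implementation is available; the sloppy files mentioned above tend to break before this step, for example with wrong stream lengths.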

