Real Software Forums

The forum for Real Studio and other Real Software products.

All times are UTC - 5 hours
 Post subject: About instr() performances
PostPosted: Wed Feb 20, 2013 11:38 am 

Joined: Wed Jan 19, 2011 2:52 am
Posts: 11
Good morning. If possible, I need a clarification about a feature of Real Studio. While testing some old code written for another compiler, I noticed that the InStr() function is extremely slow. The sample program below loads a text file (about 50,000 lines) into a string and then counts the occurrences of a given string. On my PC (Windows 7 Pro 64-bit, i5-3450 3.10 GHz, 16 GB of RAM), the count takes about 16 seconds, which is very high compared with, for example, the same code in FreeBASIC (0.02 seconds). I'm not looking for an algorithm that speeds up the count; I would like to know whether these poor results come from my lack of knowledge of Real Studio.

#pragma DisableBackgroundTasks

dim f As FolderItem
dim t as TextInputStream
dim count as integer
dim length As integer
dim point, found As integer
dim start, stop as double
dim temp As string

f = GetFolderItem("test.txt")

if f.Exists then

  t = TextInputStream.Open(f)

  if t <> nil then

    // Load the whole file into a single string
    temp = t.ReadAll()
    length = len(temp)
    t.Close

    point = 1
    start = microseconds/1000000

    // Count the occurrences, restarting one character after each match
    do
      found = instr(point, temp, "TRGL")
      if found = 0 then exit do
      count = count + 1
      point = found + 1
    loop

    stop = microseconds/1000000

    msgbox str(count) + " items found" + EndOfLine + _
    "Elapsed time: " + str(stop - start) + " seconds"

  end if
end if


If anyone would like to check my results, here is a link to the file used for the test. I ran the test with Real Studio 2012-2.1.

https://dl.dropbox.com/u/79819428/test.zip

Any suggestions?

thanks

Sergio


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 11:57 am 

Joined: Wed May 10, 2006 2:42 pm
Posts: 2985
Location: Germany
Your text has no encoding.
This may give Instr extra work!

So either define an encoding or use InstrB.
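
A minimal sketch of both options. It assumes the file is UTF-8, which nothing in this thread confirms, so substitute the file's real encoding. ReadFileAs and CountOccurrencesB are hypothetical helper names, not Real Studio functions.

```xojo
// Option 1 (hypothetical helper): read the file and tag the bytes with an
// encoding. DefineEncoding only labels the string; it does not convert it.
Function ReadFileAs(f As FolderItem, enc As TextEncoding) As String
  dim t as TextInputStream = TextInputStream.Open(f)
  dim s as string = DefineEncoding(t.ReadAll, enc)
  t.Close
  return s
End Function

// Option 2 (hypothetical helper): count matches byte-wise. InStrB ignores the
// encoding, compares bytes exactly (so it is case-sensitive), and both takes
// and returns byte positions.
Function CountOccurrencesB(source As String, find As String) As Integer
  dim count, found as integer
  dim point as integer = 1
  do
    found = InStrB(point, source, find)
    if found = 0 then exit do
    count = count + 1
    point = found + 1
  loop
  return count
End Function
```

Usage would look like temp = ReadFileAs(f, Encodings.UTF8) followed by count = CountOccurrencesB(temp, "TRGL"). A byte-wise search is only meaningful when the text and the search string use the same encoding.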

Greetings
Christian

_________________
See you in Orlando, Florida for Real World 2013
More details and registration here:
http://www.realsoftware.com/community/realworld.php


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 12:07 pm 

Joined: Wed Jan 19, 2011 2:52 am
Posts: 11
Christian, many thanks!
Perfect, it was just my fault! It seemed strange that Instr () was so slow :lol:

Cheers

Sergio


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 12:36 pm 

Joined: Wed Mar 22, 2006 11:15 am
Posts: 712
Location: Southern California
Setting the encoding makes no difference for me on my Mac or my Win7 VM. InStrB, however, finishes in 0.005s.

_________________
Daniel L. Taylor
Custom Controls for Real Studio WE!
Visit: http://www.webcustomcontrols.com/


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 12:48 pm 

Joined: Fri Sep 30, 2005 10:01 am
Posts: 283
Location: Germany, Munich
Try converting the text to another encoding - that might make the search faster.

E.g., try:
WinLatin, UTF-16, UTF-32

_________________
User of RB since first version. Provider of many free and outdated plugins.
Code for sharing: http://www.tempel.org/RB/Resources
Arbed, a unique tool for editing projects: http://www.tempel.org/Arbed
Zip compression classes: http://www.tempel.org/RB/ZipPackage


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 1:10 pm 

Joined: Wed Mar 22, 2006 11:15 am
Posts: 712
Location: Southern California
tempel wrote:
Try converting the text to another encoding - that might make the search faster.

E.g., try:
WinLatin, UTF-16, UTF-32


I tried ASCII and UTF8. I was shocked that ASCII didn't improve performance.

_________________
Daniel L. Taylor
Custom Controls for Real Studio WE!
Visit: http://www.webcustomcontrols.com/


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 1:24 pm 

Joined: Fri Sep 30, 2005 10:01 am
Posts: 283
Location: Germany, Munich
Oh, another trick might be to use regex - the longer the search string the faster the search should be.

_________________
User of RB since first version. Provider of many free and outdated plugins.
Code for sharing: http://www.tempel.org/RB/Resources
Arbed, a unique tool for editing projects: http://www.tempel.org/Arbed
Zip compression classes: http://www.tempel.org/RB/ZipPackage


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 1:54 pm 

Joined: Wed Mar 22, 2006 11:15 am
Posts: 712
Location: Southern California
tempel wrote:
Oh, another trick might be to use regex - the longer the search string the faster the search should be.


RegEx is also very fast, 0.03s.

My guess is that InStr has to scan the entire string each time for encodings with a variable character width like UTF8. That would explain the time with a 1.2 MB string that has thousands of instances of the search text. But why would it do the same thing with other encodings? Is it converting everything to UTF8 internally?

_________________
Daniel L. Taylor
Custom Controls for Real Studio WE!
Visit: http://www.webcustomcontrols.com/


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 5:10 pm 

Joined: Fri Sep 30, 2005 10:01 am
Posts: 283
Location: Germany, Munich
taylor-design wrote:
My guess is that InStr has to scan the entire string each time for encodings with a variable character width like UTF8

I know of only one complication with Unicode, and maybe that's causing the slowdown here: some characters can be represented in more than one way. For instance, "ö" exists both as a single code point and as a combination of an "o" followed by a combining diaeresis (the two-dot mark).
When a search is performed, both possibilities will have to be checked. Maybe the search code isn't well optimized for that.
However, non-unicode single-byte encodings should all be as fast as InstrB, IMO.

If one is sure that no such special Unicode cases occur, e.g. because the text files always encode those characters in one particular way and the search string uses the same way, then the best optimization would be to convert both to UTF-8 and then use InStrB. To convert the found byte position into a char position, the formula
charPos = str.LeftB(bytePos).Len
can be used. That, however, might also be quite slow to calculate.
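
A minimal sketch of that byte-to-character conversion. FindCharPosUTF8 is a hypothetical helper name, and the sketch assumes both strings already carry a defined encoding, so that ConvertEncoding knows what it is converting from. It uses bytePos - 1 and adds 1 afterwards, a small variation on the formula above that avoids cutting a multi-byte first character in half.

```xojo
// Hypothetical helper: returns the character position of the first match at
// or after byte position startB, or 0 if there is none.
Function FindCharPosUTF8(source As String, find As String, startB As Integer) As Integer
  // Both strings must share one encoding for a byte-wise comparison.
  // In a loop, convert once outside the loop instead of on every call.
  dim s as string = ConvertEncoding(source, Encodings.UTF8)
  dim f as string = ConvertEncoding(find, Encodings.UTF8)

  dim bytePos as integer = InStrB(startB, s, f)
  if bytePos = 0 then return 0

  // Count the characters in front of the match, then add 1.
  return Len(LeftB(s, bytePos - 1)) + 1
End Function
```

The Len call still has to walk all bytes in front of the match, so calling this for every match can be as slow as the original InStr loop. It pays off when only a few character positions are needed.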

_________________
User of RB since first version. Provider of many free and outdated plugins.
Code for sharing: http://www.tempel.org/RB/Resources
Arbed, a unique tool for editing projects: http://www.tempel.org/Arbed
Zip compression classes: http://www.tempel.org/RB/ZipPackage


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 5:16 pm 

Joined: Fri Sep 30, 2005 10:01 am
Posts: 283
Location: Germany, Munich
It just dawned on me!

Sergio Tallone wrote:
found = instr(point,temp, "TRGL")


The issue is possibly not the search but the value of point! If this value is very high, then the search has to find that position first. For UTF-8 and UTF-16 the char position is not a simple byte offset, but each byte (UTF-8) or word (UTF-16) needs to be analysed to count the characters in order to reach the point. So, that may be another contributing factor to a slow search if point gets incremented often during the search.

However, for other encodings such as WinLatin, ASCII or UTF-32, the position can be directly scaled into a byte offset, and therefore this shouldn't be a speed decreasing factor there.

_________________
User of RB since first version. Provider of many free and outdated plugins.
Code for sharing: http://www.tempel.org/RB/Resources
Arbed, a unique tool for editing projects: http://www.tempel.org/Arbed
Zip compression classes: http://www.tempel.org/RB/ZipPackage


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 5:53 pm 

Joined: Mon Feb 05, 2007 5:21 pm
Posts: 600
Location: New York, NY
I wonder what we're comparing it to. Is the FreeBASIC version encoding-aware? Is its search facility case-sensitive? If the answer to both of these questions is no, then the comparable function in Real Studio is InStrB, not InStr.

Edit: Should have been, "case-insensitive", but you knew what I meant.

_________________
Kem Tekinay
MacTechnologies Consulting
http://www.mactechnologies.com/

Need to develop, test, and refine regular expressions? Try RegExRX.


Last edited by ktekinay on Wed Feb 20, 2013 8:36 pm, edited 1 time in total.

 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 8:32 pm 

Joined: Wed Mar 22, 2006 11:15 am
Posts: 712
Location: Southern California
tempel wrote:
The issue is possibly not the search but the value of point! If this value is very high, then the search has to find that position first. For UTF-8 and UTF-16 the char position is not a simple byte offset, but each byte (UTF-8) or word (UTF-16) needs to be analysed to count the characters in order to reach the point.


Sorry that I wasn't clear. That's what I was getting at when I said that for variable character width encodings, InStr has to scan the entire string every time. First it has to find the starting position, then search the remaining text. In this test there are 11,715 instances of the find string in a 1.2 MB file. The CPU has to scan nearly 14 GB of data.

Quote:
However, for other encodings such as WinLatin, ASCII or UTF-32, the position can be directly scaled into a byte offset, and therefore this shouldn't be a speed decreasing factor there.


That's why I'm wondering if everything is treated as UTF8 internally, OR if the code is just not optimized for fixed width encodings, i.e. it still scans each character to determine the starting position. ASCII yields the same speed.

RegEx uses a byte offset, so it doesn't have this problem.
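
A rough check of the 14 GB figure, under the assumption made in this thread that each InStr call first walks from the start of the string to point and then searches forward to the next match. The symbols are introduced here for illustration: L is the string length in bytes, N the number of matches, p_k the byte position of the k-th match, and T the total number of bytes touched.

```latex
T \approx \sum_{k=1}^{N} p_k \le N \cdot L
  = 11{,}715 \times 1.2\ \text{MB} \approx 14{,}058\ \text{MB} \approx 14\ \text{GB}

% If the matches are spread evenly through the file, p_k \approx kL/N, so
T \approx \frac{L}{N} \sum_{k=1}^{N} k = \frac{L\,(N+1)}{2} \approx 7\ \text{GB}
```

So 14 GB reads as an upper bound and about half of that is the more likely amount, but either way the work grows with N times L rather than with L alone.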

_________________
Daniel L. Taylor
Custom Controls for Real Studio WE!
Visit: http://www.webcustomcontrols.com/


 Post subject: Re: About instr() performances
PostPosted: Wed Feb 20, 2013 8:33 pm 

Joined: Wed Mar 22, 2006 11:15 am
Posts: 712
Location: Southern California
ktekinay wrote:
I wonder what we're comparing it to. Is the FreeBASIC version encoding-aware? Is its search facility case-sensitive? If the answer to both of these questions is no, then the comparable function in Real Studio is InStrB, not InStr.


Agreed. And I can almost guarantee you that FreeBASIC's function either:

* Is ASCII only (or defaulting to an ASCII mode).

* Uses byte offsets instead of character offsets so that the starting position doesn't have to be computed on each loop.

_________________
Daniel L. Taylor
Custom Controls for Real Studio WE!
Visit: http://www.webcustomcontrols.com/


 Post subject: Re: About instr() performances
PostPosted: Thu Feb 21, 2013 2:48 am 

Joined: Fri Jan 06, 2006 3:21 pm
Posts: 12388
Location: Portland, OR USA
But you're using a UTF-8 search string (string literals are UTF-8). So my guess is that even if you use an ASCII encoding, it has to convert it back to UTF-8 to do the search. Try converting "TRGL" to ASCII and use Instr().

dim searchstring as string = ConvertEncoding("TRGL", Encodings.ASCII)
...
found = instr(point, temp, searchstring)


 Post subject: Re: About instr() performances
PostPosted: Thu Feb 21, 2013 2:57 am 

Joined: Wed Jan 19, 2011 2:52 am
Posts: 11
Thanks to all for the clarification of InStr() in Real Studio. For people like me, who have used other compilers with a less rich set of functions than Real Studio offers, it is not easy to understand the differences right away or to know how to use them.
If possible, I'd like to know how to use regex for this case (I am not very comfortable with regular expressions :? ).

thanks!

Sergio
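
A minimal sketch of a RegEx-based count for this case. CountMatches is a hypothetical helper name, and the sketch assumes the pattern contains no regular-expression metacharacters, which holds for "TRGL".

```xojo
// Hypothetical helper: counts the non-overlapping matches of a pattern.
Function CountMatches(source As String, pattern As String) As Integer
  dim re as new RegEx
  re.SearchPattern = pattern
  // True compares case exactly, like InStrB. Leave it at the default
  // (False) for a case-insensitive search, which is closer to InStr.
  re.Options.CaseSensitive = true

  dim count as integer
  dim match as RegExMatch = re.Search(source)
  while match <> nil
    count = count + 1
    // Calling Search with no arguments continues after the previous match.
    match = re.Search()
  wend
  return count
End Function
```

Usage would be count = CountMatches(temp, "TRGL"). Unlike the original loop, which restarts one character after each match, this counts non-overlapping matches only. For "TRGL" the result is the same, because the string cannot overlap with itself.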

