Case Insensitive Matching in C++
================================
:author: Aaron Ball
:email: nullspoon@iohq.net



I had this epiphany yesterday while working on my new command line
https://oper.io/src/nullspoon/noteless.git[note-taking project], and I wanted to
write a blog post about it since I haven't yet seen anyone on the internet take
this approach (though there aren't exactly a lot of blog posts on programming
theory of this kind in general).

My program is written in C. It provides search functionality very similar to
the case insensitive matching of _grep -i_ (you 'nix users should know what I'm
talking about). If you've done much in C, you likely know that string parsing
is not so easy (or is it just different?). Thus the question: __how do you
perform case insensitive text searching in C?__ (The examples in this post use
C\+\+ streams for printing, but the technique is the same in both languages.)

A few notes before we proceed. First, I'm fairly new to C (about 1 year as a
hobby), so everything I say here might not be entirely right (it'll work, it
just might not be the _best_ way). If you catch something that's wrong or could
use improvement, please send me link:/?p=About[an email]. Second, since this
is probably something the C gods have already mastered, I will be writing
this post aimed at the newer folk (since I myself am one), so bear with me if
you already know how to do this. One final note: I am still ceaselessly amazed
at how computers work, so I get fairly giddy when it comes to actual memory
management and whatnot. Brace yourselves...

[[chars-ints-kind-of]]
Chars == Ints (kind of)
-----------------------

To continue, we need to understand a few things about base data types in
memory.

* **Ints**: An int is just a fixed number of bits of memory holding a whole
  number (at least 16 bits, and typically 32 on current desktop platforms; the
  exact size doesn't matter here).

* **Chars**: A char is just a small integer (one byte, usually 8 bits) that is
  marked as a char. Effectively, a number has been assigned to each letter and
  symbol (including uppercase and lowercase) by the ASCII table, which is where
  integers meet chars. The integer determines which char is selected.

To demonstrate those two data types, let's take a look at some sample
code.

----
#include <iostream>
using namespace std;

int main( int argc, char** argv ) {
  // 72 is the number assigned to the uppercase letter H
  int i = 72;
  // Storing the same number in a char changes how it prints, not its value
  char c = i;
  cout << "The integer " << i;
  cout << " is the same as char " << c << "!" << endl;
  return 0;
}
----

What we do here is create `int i` with the value of 72. We
then create `char c` and assign it the value of _i_ (still
72). Finally, we print both int i and char c and get...

----
The integer 72 is the same as char H!
----

If you're wondering, we could have also just assigned char c the value
of 72 explicitly and it would have still printed the letter H.

Now that that's out of the way...


[[a-short-char---integer-list]]
A Short Char - Integer List
---------------------------

* **! " # $ % & ' ( ) * + , - . /**: 33 - 47

* **0-9**: 48 - 57

* **: ; < = > ? @**: 58 - 64

* *A - Z* (uppercase): 65 - 90

* **[ \ ] ^ _ `**: 91 - 96

* *a - z* (lowercase): 97 - 122


[[lowercase-uppercase-32]]
Lowercase == Uppercase + 32
---------------------------

You may have noticed an interesting fact about the numbers assigned to
characters in [English] computing: uppercase and lowercase letters don't have
the same integers, and each lowercase letter sits exactly 32 above its
uppercase counterpart.

This separation between the character integer ranges is key to performing a
case-insensitive string search in C\+\+. What it means is that if you happen
upon the letter **a**, which is integer 97, then you know that its capital
equivalent is going to be 32 lower (int 65). Suddenly parsing text just got a
lot easier.


[[piecing-it-all-together]]
Piecing it all together
-----------------------

Since characters are simply integers, we can perform text matching via
number ranges and math operators. For instance...

Suppose you want to build a password validator that allows numbers, upper case,
lower case, and __: ; < = > ? @ [ \ ] ^ _ `__. That is the integer range 48 -
57 (the digit characters), 58 - 64 (the first symbols), 65 - 90
(the uppercase), 91 - 96 (the second set of symbols), and 97 - 122 (the
lowercase). Combining those ranges, the allowable characters make up the
integer range of 48 - 122. Thus, our program might look something like...

----
#include <iostream>
using namespace std;

// Returns 1 if every character is in the range 48 - 122, otherwise 0
int validate_pass( const char* pass ) {
  long i = 0;
  while( pass[i] ) {
    if( pass[i] < 48 || pass[i] > 122 ) {
      return 0;
    }
    i++;
  }
  return 1;
}

int main( int argc, char** argv ) {
  // The first password that meets the requirements
  const char* pass = "good_password123";
  cout << pass;
  if( validate_pass( pass ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }

  // The second password fails because ! is int 33, which is out of range
  const char* pass2 = "bad_password!";
  cout << pass2;
  if( validate_pass( pass2 ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }
  return 0;
}
----

Will output...

----
good_password123 is valid.
bad_password! is not valid.
----

The first password succeeds because all of its characters are within the range
of 48 - 122. The second password fails because its final character, the "!", is
int 33, which is outside of the allowable character range of 48 - 122. That
brings a whole new meaning to the out_of_range exception, doesn't it?

That's just one simple example of how this could work. One personal note:
please don't put that lower limit of 48 on your users if you write a
validator. Not having access to the more common symbols is a nightmare for
users.

If you would like to see another example, the one I wrote for case insensitive
matching in my note program can be found at
https://oper.io/src/nullspoon/noteless.git/tree/src/common.c#n197 in the
*str_contains_case_insensitive* function.

Hopefully this is useful for someone besides myself. Either way though, I'm
still super excited about the ease of making real-life data programmatically
usable through conversion to integers. It makes me want to see what other
real-life data I can convert to numbers for easier parsing. Images? Chemistry
notation?

I do say my good man, http://www.bartleby.com/70/1322.html[Why, then the
world’s mine oyster, Which I with numbers will open.] (okay, I may have
modified the quote a tad)


Category:Programming
Category:C


// vim: set syntax=asciidoc: