Case Insensitive Matching in C++
================================
:author: Aaron Ball
:email: nullspoon@iohq.net


== {doctitle}

I had this epiphany yesterday while working on my new command line
https://oper.io/src/nullspoon/noteless.git[note-taking project] and I wanted to
write a blog post about it since I haven't seen anyone on the internet take
this approach yet (though there aren't exactly a lot of blog posts on
programming theory of this kind in general).

My program is written in C. It provides a search functionality very similar to
the case insensitive matching of _grep -i_ (you 'nix users should know what I'm
talking about). If you've done much in C, you likely know that string parsing
is not so easy (or is it just different?). Thus the question...__how to perform
case insensitive text searching in C__.

A few notes though before we proceed. I'm fairly new to C (about 1 year as a
hobby) so everything I say here might not be entirely right (it'll work, it
just might not be the _best_ way). If you catch something that's wrong or could
use improvement, please send me link:/?p=About[an email]. Secondly, since this
is probably something the C gods have already mastered, I will be writing
this post aimed at the newer folk (since I myself am one), so bear with me if
you already know how to do this. One final note. I am still ceaselessly amazed
at how computers work, so I get fairly giddy when it comes to actual memory
management and whatnot. Brace yourselves...

[[chars-ints-kind-of]]
Chars == Ints (kind of)
-----------------------

To continue, we need to understand a few things about base data types in
memory.

* **Ints**: An int is just a fixed-size chunk of memory holding a whole
number. On most current platforms that's 32 bits (the C standard only
guarantees at least 16), and it is signed by default, but we don't need to
cover that here.

* **Chars**: Chars are just small ints (a single byte, almost always 8 bits),
but marked as chars. Effectively, a number has been assigned to each letter and
symbol (including uppercase and lowercase) by the ASCII table, which is where
integers meet chars. The integer determines which char is selected.

To demonstrate those two data types, let's take a look at some sample
code.

----
#include <iostream>
using namespace std;

int main( int argc, char** argv ) {
  int i = 72;
  // Assigning an int to a char keeps the number; only the type changes
  char c = i;
  cout << "The integer " << i;
  cout << " is the same as char " << c << "!" << endl;
  return 0;
}
----

What we do here is create `int i` with the value of 72. We
then create `char c` and assign it the value of _i_ (still
72). Finally, we print both int i and char c and get...

----
The integer 72 is the same as char H!
----

If you're wondering, we could have also just assigned char c the value
of 72 explicitly and it would have still printed the letter H.

Now that that's out of the way...


[[a-short-char---integer-list]]
A Short Char - Integer List
---------------------------

* **! " # $ % & ' ( ) * + , - . /**: 33 - 47

* **0-9**: 48 - 57

* **: ; < = > ? @**: 58 - 64

* *A - Z* (uppercase): 65 - 90

* **[ \ ] ^ _ `**: 91 - 96

* *a - z* (lowercase): 97 - 122


[[lowercase-uppercase-32]]
Lowercase == Uppercase + 32
---------------------------

You may have noticed an interesting fact about the numbers assigned to
characters in [English] computing: uppercase and lowercase letters don't have
the same integers, and each lowercase letter sits exactly 32 above its
uppercase twin.

These character integer range separations are key to performing a
case-insensitive string search in C\+\+. What they mean is, if you happen upon
the letter **a**, which is integer 97, then you know that its capital
equivalent is going to be 32 lower (int 65). Suddenly parsing text just got a
lot easier.


[[piecing-it-all-together]]
Piecing it all together
-----------------------

Since characters are simply integers, we can perform text matching via
number ranges and math operators. For instance...

Suppose you want to build a password validator that allows numbers, upper case,
lower case, and __: ; < = > ? @ [ \ ] ^ _ `__. That is the integer range 48 -
57 (the char equivalents of the digits 0 - 9), 58 - 64 (the first symbols),
65 - 90 (the uppercase), 91 - 96 (the second set of symbols), and 97 - 122 (the
lowercase). Combining those ranges, the allowable characters make up the
integer range of 48 - 122. Thus, our program might look something like...

----
#include <iostream>
using namespace std;

// Returns 1 if every character of pass is in the range 48 - 122, else 0
int validate_pass( const char* pass ) {
  long i = 0;
  // Walk the string until the terminating null character
  while( pass[i] ) {
    if( pass[i] < 48 || pass[i] > 122 ) {
      return 0;
    }
    i++;
  }
  return 1;
}

int main( int argc, char** argv ) {
  // The first password that meets the requirements
  const char* pass = "good_password123";
  cout << pass;
  if( validate_pass( pass ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }

  // The second password fails because ! is int 33, which is out of range
  const char* pass2 = "bad_password!";
  cout << pass2;
  if( validate_pass( pass2 ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }
  return 0;
}
----

Will output...

----
good_password123 is valid.
bad_password! is not valid.
----

The first password succeeds because all of its characters are within the range
of 48 - 122. The second password fails because its final character, the "!", is
int 33, which is outside of the allowable character range of 48 - 122. That
brings a whole new meaning to the out_of_range exception, doesn't it?

That's just one simple example of how this could work. One personal note,
please don't put that restriction of >= 48 on your users if you write a
validator script. Not having access to the more common symbols is a nightmare
for users.

If you would like to see another example, the one I wrote for case insensitive
matching in my note program can be found at
https://oper.io/src/nullspoon/noteless.git/tree/src/common.c#n197 in the
*str_contains_case_insensitive* function.

Hopefully this is useful for someone besides myself. Either way though, I'm
still super excited about the ease of making real-life data programmatically
usable through conversion to integers. It makes me want to see what other
real-life data I can convert to numbers for easier parsing. Images? Chemistry
notation?

I do say my good man, http://www.bartleby.com/70/1322.html[Why, then the
world’s mine oyster, Which I with numbers will open.] (okay, I may have
modified the quote a tad)


Category:Programming
Category:C


// vim: set syntax=asciidoc:
