Case Insensitive Matching in C++
================================

I had this epiphany yesterday while working on my new command line [note-taking
project](https://oper.io/src/nullspoon/noteless.git) and I wanted to write a
blog post about it since I haven't seen anyone on the internet take this
approach yet (though there aren't exactly a lot of blog posts on programming
theory of this kind in general).

My program is written in C. It provides a search functionality very similar to
the case insensitive matching of _grep -i_ (you 'nix users should know what I'm
talking about). If you've done much in C, you likely know that string parsing
is not so easy (or is it just different?). Thus the question... _how to perform
case insensitive text searching in C_.

A few notes though before we proceed. I'm fairly new to C (about 1 year as a
hobby) so everything I say here might not be entirely right (it'll work, it
just might not be the _best_ way). If you catch something that's wrong or could
use improvement, please send me [an email](/?p=About). Secondly, since this is
probably something the C gods have already mastered, I will be writing this
post aimed at the newer folk (since I myself am one), so bear with me if you
already know how to do this. One final note. I am still ceaselessly amazed at
how computers work, so I get fairly giddy when it comes to actual memory
management and whatnot. Brace yourselves...

Chars == Ints (kind of)
-----------------------

To continue, we need to understand a few things about base data types in
memory.

* **Ints**: An int is just a fixed-size chunk of memory holding a whole number.
  On most modern platforms that is 32 bits (the C standard only guarantees at
  least 16, sign bit included, but we don't need to cover that here).

* **Chars**: Chars are just small ints (usually 8 bits), but marked as chars.
  Effectively, a number has been assigned to each letter and symbol (including
  uppercase and lowercase) by the ASCII table, which is where integers meet
  chars. The integer determines which char is selected.

To demonstrate those two data types, let's take a look at some sample
code.

```
#include <iostream>
using namespace std;

int main( int argc, char** argv ) {
  // 72 is the integer code for the letter H
  int i = 72;
  // Assigning an int to a char stores the same number
  char c = i;
  cout << "The integer " << i;
  cout << " is the same as char " << c << "!" <<  endl;
  return 0;
}
```

What we do here is create `int i` with the value of `72`. We then create `char
c` and assign it the value of `i` (still 72). Finally, we print both `int i`
and `char c` and get...

```
The integer 72 is the same as char H!
```

If you're wondering, we could have also just assigned `char c` the value of 72
explicitly and it would have still printed the letter H.
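
The round trip works in both directions. Here's a minimal sketch, assuming an ASCII character set; `char_code` and `code_char` are hypothetical helper names made up for this illustration.

```
// Hypothetical helper: returns the integer code stored in a char.
// Assumes an ASCII character set.
int char_code( char c ) {
  return static_cast<int>( c );
}

// Hypothetical helper: returns the char stored under an integer code.
char code_char( int code ) {
  return static_cast<char>( code );
}
```

Under that assumption, `code_char( 72 )` gives back `'H'` and `char_code( 'H' )` gives back `72`.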

Now that that's out of the way...


A Short Char - Integer List
---------------------------

* `! " # $ % & ' ( ) * + , - . /`: 33 - 47

* `0-9`: 48 - 57

* `: ; < = > ? @`: 58 - 64

* `A - Z` _(uppercase)_: 65 - 90

* `` [ \ ] ^ _ ` ``: 91 - 96

* `a - z` _(lowercase)_: 97 - 122
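
To get a feel for these ranges, here's a rough sketch that names the group a character falls into. The ranges follow the ASCII table (where `!` is 33), and `char_class` is a hypothetical name invented for this example.

```
#include <string>

// Hypothetical helper: names the ASCII range a character falls into.
// The numeric bounds assume an ASCII character set.
std::string char_class( char c ) {
  if( c >= 33 && c <= 47 )  return "symbol";
  if( c >= 48 && c <= 57 )  return "digit";
  if( c >= 58 && c <= 64 )  return "symbol";
  if( c >= 65 && c <= 90 )  return "uppercase";
  if( c >= 91 && c <= 96 )  return "symbol";
  if( c >= 97 && c <= 122 ) return "lowercase";
  // Control characters, space, and everything above 122 land here
  return "other";
}
```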


Lowercase == Uppercase + 32
---------------------------

You may have noticed an interesting fact about the numbers assigned to
characters in [English] computing: uppercase and lowercase letters don't have
the same integers, and each lowercase letter sits exactly 32 above its
uppercase counterpart.

These character integer range separations are key to performing a
case-insensitive string search in C\+\+. What they mean is, if you happen upon
the letter **a**, which is integer 97, then you know that its capital
equivalent is going to be 32 lower (int 65). Suddenly parsing text just got a
lot easier.
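
As a minimal sketch of that idea, assuming ASCII input: the two helpers below (`to_lower_ascii` and `chars_match_nocase`, both hypothetical names) shift uppercase letters up by 32 so that two characters can be compared without regard to case.

```
// Hypothetical helper: lowercases one ASCII letter by adding 32.
// Anything outside A - Z (65 - 90) is returned unchanged.
char to_lower_ascii( char c ) {
  if( c >= 65 && c <= 90 ) {
    return static_cast<char>( c + 32 );
  }
  return c;
}

// Hypothetical helper: true if both chars are the same letter,
// ignoring case, or the same non-letter character.
bool chars_match_nocase( char a, char b ) {
  return to_lower_ascii( a ) == to_lower_ascii( b );
}
```

Normalizing both sides to lowercase before comparing means the comparison itself stays a plain `==` on integers.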


Piecing it all together
-----------------------

Since characters are simply just integers, we can perform text matching via
number ranges and math operators. For instance...

Suppose you want to build a password validator that allows numbers, upper case,
lower case, and `` : ; < = > ? @ [ \ ] ^ _ ` ``. That is the integer range 48 -
57 (the digit characters), 58 - 64 (the first symbols), 65 - 90 (the
uppercase), 91 - 96 (the second set of symbols), and 97 - 122 (the lowercase).
Combining those ranges, the allowable characters make up the integer range of
48 - 122. Thus, our program might look something like...

```
#include <iostream>
using namespace std;

// Returns 1 if every char in pass is within 48 - 122, otherwise 0
int validate_pass( const char* pass ) {
  long i = 0;
  while( pass[i] ) {
    if( pass[i] < 48 || pass[i] > 122 ) {
      return 0;
    }
    i++;
  }
  return 1;
}

int main( int argc, char** argv ) {
  // The first password that meets the requirements
  const char* pass = "good_password123";
  cout << pass;
  if( validate_pass( pass ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }

  // The second password fails because ! is int 33, which is out of range
  const char* pass2 = "bad_password!";
  cout << pass2;
  if( validate_pass( pass2 ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }
  return 0;
}
```

Will output...

```
good_password123 is valid.
bad_password! is not valid.
```

The first password succeeds because all of its characters are within the range
of 48 - 122. The second password fails because its final character, the "!", is
int 33, which is outside of the allowable character range of 48 - 122. That
brings a whole new meaning to the out_of_range exception, doesn't it?

That's just one simple example of how this could work. One personal note:
please don't put that restriction of >= 48 on your users if you write a
validator script. Not having access to the more common symbols is a nightmare
for users.
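
A friendlier range is easy to sketch. The variant below is only an illustration (the name `validate_pass_printable` is made up here): it assumes ASCII input and accepts every printable, non-space character, from `!` (33) through `~` (126).

```
// Hypothetical variant of validate_pass: accepts every printable,
// non-space ASCII character (33 '!' through 126 '~').
bool validate_pass_printable( const char* pass ) {
  for( long i = 0; pass[i]; i++ ) {
    if( pass[i] < 33 || pass[i] > 126 ) {
      return false;
    }
  }
  return true;
}
```

With this range, `bad_password!` from the example above would pass, while a password containing a space would still be rejected.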

If you would like to see another example, the one I wrote for case insensitive
matching in my note program can be found at
https://oper.io/src/nullspoon/noteless.git/tree/src/common.c#n197 in the
*str_contains_case_insensitive* function.
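
For a rough idea of how such a search can be put together, here's a small sketch. It is not the code from the note program; `lower_ascii` and `find_nocase` are hypothetical names, and the whole thing assumes plain ASCII text.

```
// Hypothetical helper: lowercases one ASCII letter, leaves the rest alone.
char lower_ascii( char c ) {
  return ( c >= 65 && c <= 90 ) ? static_cast<char>( c + 32 ) : c;
}

// Hypothetical sketch: returns the index of the first case-insensitive
// match of needle in haystack, or -1 if there is none.
long find_nocase( const char* haystack, const char* needle ) {
  for( long start = 0; ; start++ ) {
    long i = 0;
    // Walk both strings for as long as the lowercased chars agree
    while( needle[i] && haystack[start + i]
           && lower_ascii( haystack[start + i] ) == lower_ascii( needle[i] ) ) {
      i++;
    }
    // Reached the end of needle, so every char matched
    if( !needle[i] ) return start;
    // Ran out of haystack, so no later start can match either
    if( !haystack[start + i] ) return -1;
  }
}
```

Lowercasing each character on the fly avoids making a modified copy of either string, at the cost of a straightforward compare-at-every-position loop.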

Hopefully this is useful for someone besides myself. Either way though, I'm
still super excited about the ease of making real-life data programmatically
usable through conversion to integers. It makes me want to see what other
real-life data I can convert to numbers for easier parsing. Images? Chemistry
notation?

I do say my good man, [Why, then the world’s mine oyster, Which I with numbers
will open.](http://www.bartleby.com/70/1322.html) (okay, I may have modified
the quote a tad)
