IO: A Small Streaming I/O and UTF-32 Library

Updated on 2020-02-02

Add UTF-32 support and easy foreach streaming to your apps

Introduction

My scanners/tokenizers all take IEnumerable<char> as their streaming character source. This gives them a simple, ubiquitous streaming interface that is no-frills but adaptable: it accepts a string or a char[] out of the box, or you can provide your own source. This library contains sources for files, URLs, console input and a generic TextReader. All of my major Unicode support is UTF-32 internally so that a surrogate pair is treated as a single character, representing each Unicode code point as one value instead of as individual UTF-16 code units. This is critical for proper Unicode support all the way up through the 21-bit range that Unicode provides. This library provides a simple enumerator for converting an IEnumerator<char> to an IEnumerator<int> of UTF-32 code units.
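To make the surrogate-pair handling concrete, here is a minimal sketch of how such a conversion can be written as a lazy iterator. The names Utf32Sketch and ToUtf32 are hypothetical and exist only for illustration; this is not the library's Utf32Enumerable, whose implementation may differ.

```csharp
using System;
using System.Collections.Generic;

static class Utf32Sketch
{
    // Hypothetical helper (not part of the library): lazily combines
    // UTF-16 surrogate pairs into single UTF-32 values.
    public static IEnumerable<int> ToUtf32(IEnumerable<char> source)
    {
        using (var e = source.GetEnumerator())
        {
            while (e.MoveNext())
            {
                char ch = e.Current;
                if (char.IsHighSurrogate(ch))
                {
                    // A high surrogate must be followed by a low surrogate.
                    if (!e.MoveNext())
                        throw new FormatException("Unterminated surrogate pair.");
                    // Throws ArgumentOutOfRangeException if the second
                    // unit is not a low surrogate.
                    yield return char.ConvertToUtf32(ch, e.Current);
                }
                else
                {
                    // Ordinary BMP character: widen char to int.
                    // A stray low surrogate is passed through unchanged.
                    yield return ch;
                }
            }
        }
    }
}
```

Because a string is already an IEnumerable<char>, calling Utf32Sketch.ToUtf32("a\U0010FFEE") yields two values, one for 'a' and one for the code point 0x10FFEE, even though the string holds three UTF-16 code units.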

Using the Code

The code is straightforward to use, apart from a couple of gotchas which I'll cover after the example.

// Lifetime is automatically managed
var fr = new FileReaderEnumerable(@"..\..\Program.cs");

// file open on enumeration start
foreach (var ch in fr)
    Console.Write(ch);
// file close when done
Console.WriteLine();
Console.WriteLine();

var ur = new UrlReaderEnumerable(@"http://www.google.com");
var i = 0;
// url fetch on enumeration start
foreach (var ch in ur)
{
    if(79==i)
    {
        Console.Write("...");
        break;
    }
    Console.Write(ch);
    ++i;
}
// url close on done
Console.WriteLine();
Console.WriteLine();

// put in a string with a 21-bit unicode value
var test = "This is a test \U0010FFEE";
var uni = new Utf32Enumerable(test);
foreach(var uch in uni)
{
    // console will mangle, but
    // do it anyway
    var str = char.ConvertFromUtf32(uch);
    Console.Write(str);
}
Console.WriteLine();
Console.WriteLine();

var reader = new StringReader("This is a demo of TextReaderEnumerable");
foreach (char ch in TextReaderEnumerable.FromReader(reader))
    Console.Write(ch);
Console.WriteLine();
Console.WriteLine();

There are a few gotchas here. First, TextReaderEnumerable must be created via FromReader(), unlike the others, which are created via a constructor. Second, it cannot be reset and can only be enumerated once (no seeking); attempting to enumerate it a second time will throw. Third, it's important to call Dispose() on the IEnumerator<> instances if you are using them manually, since that is what releases the underlying file or connection. foreach does this for you automatically. Finally, ConsoleReaderEnumerable has no way of knowing when to stop until it sees end of input, which at an interactive Windows console means typing Ctrl-Z. It's usually used for file piping.
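The once-only and Dispose() points can be illustrated with a small sketch. OneShotReaderChars and ManualEnumeration are hypothetical stand-ins written for this example, not the library's classes, and the exception type used here (InvalidOperationException) is an assumption; the library may throw something else.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for a reader-backed source (not the library's
// TextReaderEnumerable): forward-only, so it allows a single pass.
sealed class OneShotReaderChars : IEnumerable<char>
{
    private readonly TextReader _reader;
    private bool _started;

    public OneShotReaderChars(TextReader reader)
    {
        if (reader == null) throw new ArgumentNullException(nameof(reader));
        _reader = reader;
    }

    public IEnumerator<char> GetEnumerator()
    {
        // The reader cannot be rewound, so a second pass is refused.
        if (_started)
            throw new InvalidOperationException("This source can only be enumerated once.");
        _started = true;
        return Read();
    }

    private IEnumerator<char> Read()
    {
        int c;
        while ((c = _reader.Read()) != -1)
            yield return (char)c;
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

static class ManualEnumeration
{
    // Counts characters by driving the enumerator by hand.
    public static int Count(IEnumerable<char> source)
    {
        // using guarantees Dispose() even if the loop throws,
        // which is the cleanup foreach would otherwise perform.
        using (IEnumerator<char> e = source.GetEnumerator())
        {
            int n = 0;
            while (e.MoveNext())
                n++;
            return n;
        }
    }
}
```

Passing a OneShotReaderChars wrapped around a StringReader to ManualEnumeration.Count works the first time; asking the same instance for a second enumerator throws, which mirrors the behavior described above.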

The nice thing about these interfaces is that it's super easy to load them into List<char> and List<int> instances if you need to stash the data, and it's easy to write adapters for them. Utf32Enumerable is one such adapter, but you can do pretty much anything you can do with an enumerator. You can also use LINQ on file streams this way.
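As a rough sketch of the stashing and LINQ usage just described, the following uses a hypothetical Chars() iterator over a TextReader as a stand-in source so that the example is self-contained. In real code, one of the library's enumerables would take its place, and the LinqSketch class and its method names are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LinqSketch
{
    // Hypothetical stand-in source: lazily yields characters from a reader.
    public static IEnumerable<char> Chars(TextReader reader)
    {
        int c;
        while ((c = reader.Read()) != -1)
            yield return (char)c;
    }

    // Stash the whole stream in a list for later random access.
    public static List<char> Stash(IEnumerable<char> source)
    {
        return source.ToList();
    }

    // Count letters without buffering the stream.
    public static int CountLetters(IEnumerable<char> source)
    {
        return source.Count(c => char.IsLetter(c));
    }

    // Take at most max characters; the rest of the stream is never read.
    public static string Preview(IEnumerable<char> source, int max)
    {
        return new string(source.Take(max).ToArray());
    }
}
```

Preview does the same job as the manual counter in the URL example above: Take() stops pulling from the source after max characters, so only the beginning of the stream is consumed.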

Have fun!

History

  • 2nd February, 2020 - Initial submission