Advanced Streaming on Little Gadgets: MIDI Files

Updated on 2022-05-09

Explore some streaming techniques while getting your shiny new ESP32S3 to do some USB and MIDI tricks

Introduction

It's easy enough to stream when your input is effectively built for it: all the data is sequential and you can simply pull information out in the order it appears. In an ideal world, everything that should be streamed would be in a format that lends itself to iterative processing in order. This isn't an ideal world. One more time for those of you in the back.

Enter multitrack MIDI files. At one point, somebody - or rather, some committee - decided it would be a great idea to break MIDI sequences into multiple tracks, rather than simply multiple channels on a single track. It's nice enough from an organizational perspective, though it's not critical, and yet it absolutely kills your ability to sequentially stream data out of it in order.

This is because each track only contains its part of the sequence or composition. Essentially, in order to actually play any of it, you have to interleave the track data back into a single stream.

Complicating things ever so slightly, each MIDI event has a delta attached to it which indicates its offset in MIDI ticks (we'll get into those) from the previous event. Those deltas are relative to the track. When you produce a single stream, the deltas from all the tracks must be recomputed such that they are relative to each other as the events are interleaved.
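
To make the delta bookkeeping concrete, here is a minimal sketch of interleaving that works on in-memory arrays rather than streams. It is not the code this project uses; sketch_event and merge_tracks are names made up for illustration, and the "id" field merely stands in for a real MIDI message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal event: a per-track delta plus an id standing in
// for the actual MIDI message.
struct sketch_event {
    uint32_t delta;  // ticks since the previous event on the same track
    int id;          // placeholder for the message payload
};

// Interleaves several tracks into one stream, rewriting each delta so it
// is relative to the previously emitted event instead of its own track.
std::vector<sketch_event> merge_tracks(
        const std::vector<std::vector<sketch_event>>& tracks) {
    std::vector<size_t> next(tracks.size(), 0);        // cursor per track
    std::vector<uint64_t> position(tracks.size(), 0);  // absolute ticks per track
    std::vector<sketch_event> merged;
    uint64_t last_emitted = 0;
    for (;;) {
        bool found = false;
        size_t best = 0;
        uint64_t best_when = 0;
        // pick the pending event with the smallest absolute position;
        // ties go to the lowest track index
        for (size_t t = 0; t < tracks.size(); ++t) {
            if (next[t] >= tracks[t].size()) continue;  // track exhausted
            uint64_t when = position[t] + tracks[t][next[t]].delta;
            if (!found || when < best_when) {
                found = true;
                best = t;
                best_when = when;
            }
        }
        if (!found) break;  // every track is exhausted
        merged.push_back({uint32_t(best_when - last_emitted),
                          tracks[best][next[best]].id});
        position[best] = best_when;
        last_emitted = best_when;
        ++next[best];
    }
    return merged;
}
```

With one track holding events at ticks 0 and 10 and another holding events at ticks 5 and 15, the merged stream comes out with deltas of 0, 5, 5 and 5. The streaming version has to do the same thing, just without the luxury of having every event in memory.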

If we were to load a MIDI file into memory and then compute everything ahead of time, that could ease our pain somewhat, but the copy of Queen's Bohemian Rhapsody performance I have is over 50kB. It's better to not have to allocate that, which is why we'll stick to streaming it a bit at a time.

We'll be running this on an ESP32S3. If you don't have one, you can do this with an S2, or even an ESP32 with an external programmable USB breakout, but you'll have to edit your platformio.ini file.

We will be using PlatformIO, and if you're not already, you should be. The Arduino IDE 1.x is simply not powerful enough to compile my code, which requires C++14 or better.

For testing purposes, it's probably best to hook the secondary USB port up to your PC (meaning you'll have two cables running between your device and your PC), and then download and run MIDI-OX assuming you're on Windows. You might need a different MIDI connector/monitor software if you are running some other operating system. Anyway, you can use that software to connect your incoming MIDI from USB to your outgoing MIDI on your soundcard so that you'll hear the playback.

Also, remember to upload the filesystem image the first time or nothing will work.

I've shipped ESPTinyUsb with this project under the "lib" folder. It is not my work. We're using it to program the USB port as a MIDI device.

Conceptualizing this Mess

Here we go. I'm going to explain the general thrust of this project, and then I will give a detailed rundown of the particulars of both the MIDI wire protocol as well as the MIDI file format.

First, here's what we're accomplishing:

The goal of this project is to read a MIDI file and then spit it out in a loop at any listening devices.

Protocol Format

General Information

The MIDI protocol is an 8-bit digital wire protocol developed in the 1980s for controlling musical instruments. It is a big-endian protocol. It was originally carried over 5-pin DIN connectors, and while those are still in use, it's now possible and quite common to connect via USB. Some devices can connect over Bluetooth.

Message Format

The following guide is presented as a tutorial on the MIDI protocol format, but it's not necessary to be completely familiar with it in order to follow along.

MIDI works using "messages" which tell an instrument what to do. MIDI messages are divided into two types: channel messages and system messages. Channel messages make up the bulk of the data stream and carry performance information, while system messages control global/ambient settings.

A channel message is called a channel message because it is targeted to a particular channel. Each channel can control its own instrument and up to 16 channels are available, with channel #10 (zero based index 9) being a special channel that, by General MIDI convention, carries percussion information, and the other channels being mapped to arbitrary devices. This means the MIDI protocol is capable of communicating with up to 16 individual devices at once.

A system message is called a system message because it controls global/ambient settings that apply to all channels. One example is sending proprietary information to a particular piece of hardware, which is done through a "system exclusive" or "sysex" message. Another example is the special information included in MIDI files (but not present in the wire protocol) such as the tempo to play the file back at. Another example of a system message is a "system realtime message" which allows access to the transport features (play, stop, continue and setting the timing for transport devices).

Each MIDI message has a "status byte" associated with it. This is usually** the first byte in a MIDI message. The status byte contains the message id in the high nibble (4-bits) and the target channel in the low nibble. Ergo, the status byte 0xC5 indicates a channel message type of 0xC and a target channel of 0x5. The high nibble must be 0x8 or greater because the high bit is what marks a byte as a status byte; data bytes are always 0x7F or less. If the high nibble is 0xF, this is a system message, and the entire status byte is the message id since there is no channel. For example, 0xFF is a message id for a MIDI "meta event" message that can be found in MIDI files. Once again, the low nibble is part of the status if the high nibble is 0xF.

** due to an optimization of the protocol, it is possible that the status byte is omitted in which case the status byte from the previous message is used. This allows for "runs" of messages with the same status but different parameters to be sent without repeating the redundant byte for each message.
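
As a rough illustration of picking a status byte apart, here's a small sketch. The status_parts struct and both helper names are made up for this example. Note that it leaves the message id in the high nibble (0xC0 rather than 0xC), which is an assumption on my part that keeps the value directly comparable against a full status byte.

```cpp
#include <cstdint>

// Hypothetical helper type, for illustration only.
struct status_parts {
    uint8_t type;     // high nibble for channel messages, whole byte for system messages
    uint8_t channel;  // low nibble for channel messages, 0 for system messages
};

// True when the byte is a status byte (high bit set).
// Data bytes are always in the range 0x00-0x7F.
inline bool is_status_byte(uint8_t b) {
    return (b & 0x80) != 0;
}

// Splits a status byte into its message id and target channel.
inline status_parts split_status(uint8_t status) {
    if (status < 0xF0) {
        // channel message: id in the high nibble, channel in the low nibble
        return { uint8_t(status & 0xF0), uint8_t(status & 0x0F) };
    }
    // system message: the entire byte is the id and there is no channel
    return { status, 0 };
}
```

Feeding it 0xC5 yields a type of 0xC0 and a channel of 5, while 0xFF comes back whole with a channel of 0.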

The following channel messages are available:

  • 0x8 Note Off - Releases the specified note. The velocity is included in this message but not used. All notes with the specified note id are released, so if there are two Note Ons followed by one Note Off for C#4, all of the C#4 notes on that channel are released. This message is 3 bytes in length, including the status byte. The 2nd byte is the note id (0-0x7F/127), and the 3rd is the velocity (0-0x7F/127). The velocity is virtually never respected for a note off message. I'm not sure why it exists. Nothing I've ever encountered uses it. It's usually set to zero, or perhaps the same note velocity for the corresponding note on. It really doesn't matter.
  • 0x9 Note On - Strikes and holds the specified note until a corresponding note off message is found. This message is 3 bytes in length, including the status byte. The parameters are the same as note off. It should be noted that a note on with a velocity of zero is effectively a note off.
  • 0xA Key Pressure/Aftertouch - Indicates the pressure that the key is being held down at. This is usually for higher end keyboards that support it, to give an after effect when a note is held depending on the pressure it is held at. This message is 3 bytes in length, including the status byte. The 2nd byte is the note id (0-0x7F/127) while the 3rd is the pressure (0-0x7F/127)
  • 0xB Control Change - Indicates that a controller value is to be changed to the specified value. Controllers are different for different instruments, but there are standard control codes for common controls like panning. This message is 3 bytes in length, including the status byte. The 2nd byte is the control id. There are common ids like panning (0x0A/10) and volume (7) and many that are just custom, often hardware specific or customizably mapped in your hardware to different parameters. There's a table of standard and available custom codes here. The 3rd byte is the value (0-0x7F/127) whose meaning depends heavily on what the 2nd byte is.
  • 0xC Patch/Program Change - Some devices have multiple different "programs" or settings that produce different sounds. For example, your synthesizer may have a program to emulate an electric piano and one to emulate a string ensemble. This message allows you to set which sound is to be played by the device. This message is 2 bytes long, including the status byte. The 2nd byte is the patch/program id (0-0x7F/127)
  • 0xD Channel Pressure/Non-Polyphonic Aftertouch - This is similar to the aftertouch message, but is geared for less sophisticated instruments that don't support polyphonic aftertouch. It affects the entire channel instead of an individual key, so it affects all playing notes. It is specified as the single greatest aftertouch value for all depressed keys. This message is 2 bytes long, including the status byte. The 2nd byte is the pressure (0-0x7F/127)
  • 0xE Pitch Wheel Change - This indicates that the pitch wheel has moved to a new position. This generally applies an overall pitch modifier to all notes in the channel such that as the wheel is moved upward, the pitch for all playing notes is increased accordingly, and the opposite goes for moving the wheel downward. This message is 3 bytes long, including the status byte. The 2nd and 3rd byte contain the least significant 7 bits (0-0x7F/127) and the most significant 7 bits respectively, yielding a 14-bit value.
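
The 14-bit pitch wheel value is the one payload above that needs a little arithmetic, so here's a sketch of combining and splitting it. These helper names are hypothetical, and this returns the plain 14-bit number, which is not necessarily how any particular library packs the two bytes internally.

```cpp
#include <cstdint>

// Combines the two 7-bit payload bytes of a pitch wheel message
// (least significant first on the wire) into one 14-bit value.
inline uint16_t pitch_wheel_value(uint8_t lsb, uint8_t msb) {
    return uint16_t(((msb & 0x7F) << 7) | (lsb & 0x7F));
}

// Splits a 14-bit value back into the two 7-bit payload bytes.
inline void pitch_wheel_split(uint16_t value, uint8_t* lsb, uint8_t* msb) {
    *lsb = uint8_t(value & 0x7F);
    *msb = uint8_t((value >> 7) & 0x7F);
}
```

The wheel at rest is the middle of the range: payload bytes 0x00 0x40 combine to 0x2000 (8192), and 0x7F 0x7F gives the maximum of 0x3FFF.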

The following system messages are available (non-exhaustive):

  • 0xF0 System Exclusive - This indicates a device specific data stream is to be sent to the MIDI output port. The length of the message varies and is bookended by the End of System Exclusive message. I'm not clear on how this is transmitted just yet, but it's different in the file format than it is over the wire, which makes it one-off. In the file, the length immediately follows the status byte and is encoded as a "variable length quantity" which is covered in a bit. Finally, the data of the specified byte length follows that.
  • 0xF7 End of System Exclusive - This indicates an end marker for a system exclusive message stream
  • 0xFF Meta Message - This is defined in MIDI files, but not in the wire-protocol. It indicates special data specific to files such as the tempo the file should be played at, plus additional information about the scores, like the name of the sequence, the names of the individual tracks, copyright notices, and even lyrics. These may be an arbitrary length. What follows the status byte is a byte indicating the "type" of the meta message, and then a "variable length quantity" that indicates the length, once again, followed by the data.

Here's a sample of what messages look like over the wire.

Note on, middle C, maximum velocity on channel 0:

90 3C 7F

Patch change to 1 on channel 2:

C2 01

Remember, the status byte can be omitted. Here's some note on messages to channel 0 in a run:

90 3C 7F 40 7F 43 7F

That yields a C major chord at middle C. Each of the two messages with the status byte omitted are using the previous status byte, 0x90.
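
To show what a receiver has to do with a run like that, here's a sketch that expands running status so every message carries its own status byte again. It only handles channel messages, and the function names are invented for this example rather than taken from the project.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of data bytes that follow a channel message status byte:
// program change (0xC) and channel pressure (0xD) carry one, the rest two.
inline size_t channel_data_length(uint8_t status) {
    uint8_t type = status & 0xF0;
    return (type == 0xC0 || type == 0xD0) ? 1 : 2;
}

// Rewrites a stream of channel messages so that each one starts with a
// status byte. Returns an empty vector if the input is malformed or
// contains system messages, which this sketch does not handle.
std::vector<uint8_t> expand_running_status(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    uint8_t status = 0;  // no running status yet
    size_t i = 0;
    while (i < in.size()) {
        if (in[i] & 0x80) {
            status = in[i++];               // new running status
            if (status >= 0xF0) return {};  // system message: out of scope
        } else if (status == 0) {
            return {};                      // data byte with nothing to apply it to
        }
        size_t n = channel_data_length(status);
        if (i + n > in.size()) return {};   // truncated message
        out.push_back(status);
        for (size_t k = 0; k < n; ++k) {
            if (in[i] & 0x80) return {};    // payload bytes must be below 0x80
            out.push_back(in[i++]);
        }
    }
    return out;
}
```

Given 90 3C 7F 40 7F 43 7F it produces 90 3C 7F 90 40 7F 90 43 7F, which is three complete note on messages.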

The MIDI File Format

Once you understand the MIDI wire-protocol, the file format is fairly straightforward as about 80% or more of an average MIDI file is simply MIDI messages with a timestamp on them.

MIDI files typically have a ".mid" extension, and like the wire-protocol it is a big-endian format. A MIDI file is laid out in "chunks." A "chunk" meanwhile, is a FourCC code (simply a 4 byte code in ASCII) which indicates the chunk type followed by a 4-byte integer value that indicates the length of the chunk, and then followed by a stream of bytes of the indicated length. The FourCC for the first chunk in the file is always "MThd". The FourCC for the only other relevant chunk type is "MTrk". All other chunk types are proprietary and should be ignored unless they are understood. The chunks are laid out sequentially, back to back in the file.
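
Here's a small sketch of pulling a chunk header out of a byte buffer, just to pin down the layout. It works on memory rather than a stream, and chunk_header and read_chunk_header are hypothetical names used only here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical holder for the 8 bytes that start every chunk.
struct chunk_header {
    char fourcc[5];   // 4 ASCII characters plus a terminating NUL
    uint32_t length;  // size in bytes of the data that follows the header
};

// Reads the chunk header found at data[offset].
// Returns false if fewer than 8 bytes are available there.
inline bool read_chunk_header(const uint8_t* data, size_t size,
                              size_t offset, chunk_header* out) {
    if (data == nullptr || out == nullptr ||
        offset > size || size - offset < 8) {
        return false;
    }
    std::memcpy(out->fourcc, data + offset, 4);
    out->fourcc[4] = '\0';
    // the length is big-endian: most significant byte first
    const uint8_t* p = data + offset + 4;
    out->length = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                  (uint32_t(p[2]) << 8) | uint32_t(p[3]);
    return true;
}
```

The next chunk, if there is one, starts at offset + 8 + length, which is all you need to walk the file and skip chunk types you don't understand.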

The first chunk, "MThd", always has its length field set to 6 bytes. The data that follows it is three 2-byte integers. The first indicates the MIDI file type, which is almost always 1, but simple files can be type 0, and there's a specialized type - type 2 - which stores patterns. The second number is the count of "tracks" in a file. A MIDI file can contain more than one track, with each track containing its own score. The third number is the "timebase" of a MIDI file (often 480) which indicates the number of MIDI "ticks" per quarter note. How much time a tick represents depends on the current tempo.
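
To give a feel for the tick arithmetic, here's a sketch. It leans on two facts from the MIDI file format that haven't come up yet: tempo is stored as microseconds per quarter note, and when a file doesn't set one the default is 500000 (120 beats per minute). It also assumes the timebase is the plain ticks-per-quarter-note form. The function names are made up for this example.

```cpp
#include <cstdint>

// Converts beats (quarter notes) per minute into the microseconds per
// quarter note figure that MIDI files store for tempo.
inline uint32_t tempo_from_bpm(uint32_t bpm) {
    return bpm == 0 ? 0 : 60000000u / bpm;
}

// Converts a span of MIDI ticks into microseconds.
// tempo    - microseconds per quarter note
// timebase - ticks per quarter note, from the MThd chunk
inline uint64_t ticks_to_microseconds(uint64_t ticks, uint32_t tempo,
                                      uint16_t timebase) {
    if (timebase == 0) {
        return 0;  // invalid header, nothing sensible to compute
    }
    return ticks * tempo / timebase;
}
```

At the default tempo with a timebase of 480, a quarter note (480 ticks) lasts 500000 microseconds and a single tick lasts about 1041.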

The following chunks are "MTrk" chunks or proprietary chunks. We skip proprietary chunks, and read each "MTrk" chunk we find. An "MTrk" chunk represents a single MIDI file track (explained below) - which is essentially just MIDI messages with timestamps attached to them. A MIDI message with a timestamp on it is known as a MIDI "event." Timestamps are specified in deltas, with each timestamp being the number of ticks since the last timestamp. These are encoded in a funny way in the file. It's a byproduct of the 1980s and the limited disk space and memory at the time, especially on hardware sequencers - every byte saved was important. The deltas are encoded using a "variable length quantity".

Variable length quantities are encoded as follows: They are 7 bits per byte, most significant bits first (big endian, like everything else in the format). Each byte has its high bit set (making it greater than 0x7F) except the last one, which must be less than 0x80. A value between 0 and 127 is represented by one byte, while anything greater takes more. Variable length quantities can in theory be any size, but in practice they must be no greater than 0xFFFFFFF, which is 28 bits of payload (about 3.5 bytes) spread across 4 encoded bytes. You can hold them with an int, but reading and writing them can be annoying.
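
Here's a sketch of encoding and decoding a variable length quantity against a plain byte buffer, allowing the full 4 byte range. It is separate from the stream based decoder this project uses, and vlq_encode and vlq_decode are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Encodes value (0 to 0x0FFFFFFF) into out, which must have room for
// 4 bytes. Returns the number of bytes written, or 0 if out of range.
inline size_t vlq_encode(uint32_t value, uint8_t* out) {
    if (value > 0x0FFFFFFF) {
        return 0;
    }
    uint8_t groups[4];
    size_t count = 0;
    // peel off 7 bits at a time, least significant group first
    do {
        groups[count++] = uint8_t(value & 0x7F);
        value >>= 7;
    } while (value != 0);
    // emit the groups most significant first, setting the high bit on
    // every byte except the last
    for (size_t i = 0; i < count; ++i) {
        uint8_t b = groups[count - 1 - i];
        if (i + 1 < count) {
            b |= 0x80;
        }
        out[i] = b;
    }
    return count;
}

// Decodes a quantity from in. Returns the number of bytes consumed, or
// 0 if it is truncated or runs longer than 4 bytes.
inline size_t vlq_decode(const uint8_t* in, size_t size, uint32_t* out_value) {
    uint32_t value = 0;
    for (size_t i = 0; i < size && i < 4; ++i) {
        value = (value << 7) | uint32_t(in[i] & 0x7F);
        if (!(in[i] & 0x80)) {  // high bit clear marks the last byte
            *out_value = value;
            return i + 1;
        }
    }
    return 0;
}
```

For example, 127 encodes as 7F, 128 as 81 00, 480 as 83 60, and the maximum 0xFFFFFFF as FF FF FF 7F.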

What follows a variable length quantity delta is a MIDI message, which is at least one byte, but it will be different lengths depending on the type of message it is and some message types (meta messages and sysex messages) are variable length. It may be written without the status byte in which case the previous status byte is used. You can tell if a byte in the stream is a status byte because it will be greater than 0x7F (127) while all of the message payload will be bytes less than 0x80 (128). It's not as hard to read as it sounds. Basically for each message, you check if the byte you're on is high (> 0x7F/127) and if it is, that's your new running status byte, and the status byte for the message. If it's low, you simply consult the current status byte instead of setting it.

MIDI File Tracks

A MIDI type 1 file will usually contain multiple "tracks" (briefly mentioned above). A track usually represents a single score and multiple tracks together make up the entire performance. While this is usually laid out this way, it's actually channels, not tracks that indicate what score a particular device is to play. That is, all notes for channel 0 will be treated as part of the same score even if they are scattered throughout different tracks. Tracks are just a helpful way to organize. They don't really change the behavior of the MIDI at all. In a MIDI type 1 file - the most common type - track 0 is "special". It doesn't generally contain performance messages (channel messages). Instead, it typically contains meta information like the tempo and lyrics, while the rest of your tracks contain performance information. Laying your files out this way ensures maximum compatibility with MIDI devices out there.

Very important: A track must always end with the MIDI End of Track meta message.
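
For reference, End of Track is the meta message with type 0x2F and a zero length, so on disk it is the three bytes FF 2F 00 following its delta. A sketch of checking for it, using a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if msg (positioned just past the delta) begins with the
// End of Track meta message: status 0xFF, type 0x2F, length 0.
inline bool is_end_of_track(const uint8_t* msg, size_t size) {
    return msg != nullptr && size >= 3 &&
           msg[0] == 0xFF && msg[1] == 0x2F && msg[2] == 0x00;
}
```

A reader can use a check like this as its signal that a track has nothing left to contribute to the merged stream.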

Despite tracks being conceptually separate, the separation of scores is actually by channel under the covers, not by track, meaning you can have multiple tracks which when combined, represent the score for a device at a particular channel (or more than one channel). You can combine channels and tracks however you wish, just remember that all the channel messages for the same channel represent an actual score for a single device, while the tracks themselves are basically virtual/abstracted convenience items.

See this page for more information on the MIDI wire-protocol and the MIDI file format.

Coding this Mess

The first thing we have to do is load the file. We just need some basic information from the file, like the count, offsets and sizes of the tracks.

To this end, we have the simple midi_track and midi_file structs:

// represents a MIDI track entry in a MIDI file
struct midi_track final {
    // the size of the track in bytes
    size_t size;
    // the offset where the track begins
    size_t offset;
};
// represents the data in a MIDI file
class midi_file final {
    void copy(const midi_file& rhs);
public:
    // The type of MIDI file
    int16_t type;
    // The timebase
    int16_t timebase;
    // The number of tracks
    size_t tracks_size;
    // the track entries
    midi_track* tracks;
    // constructs a new instance
    midi_file();
    // steals an instance
    midi_file(midi_file&& rhs);
    // steals an instance
    midi_file& operator=(midi_file&& rhs);
    // copies an instance
    midi_file(const midi_file& rhs);
    // copies an instance
    midi_file& operator=(const midi_file& rhs);
    // destroys an instance
    ~midi_file();
    // reads a file from a stream
    static sfx_result read(stream* in, midi_file* out_file);
};

The only things that are really important about midi_file are the data members and the read() method. We're going to explore that function next.

For read(), we take an open stream and read all of the "chunks" mentioned before, extracting the important data. The only chunks we care about are the ones with the fourCCs of "MThd" and "MTrk". The others will be ignored. Since the format is big endian, we must swap bytes for each of our word values:

// reads the 4 byte big endian size field of a chunk in a multipart chunked file (MIDI file, basically)
bool midi_file_read_chunk_part(stream* in,size_t* in_out_offset, size_t* out_size) {
    uint32_t tmp;
    // read the size
    if(4!=in->read((uint8_t*)&tmp,4)) {
        return false;
    }
    *in_out_offset+=4;
    if(bits::endianness()==bits::endian_mode::little_endian) {
        tmp = bits::swap(tmp);
    }
    *out_size=(size_t)tmp;
    return true;
}
...
// read file info from a stream
sfx_result midi_file::read(stream* in, midi_file* out_file) {
    // MIDI files are a series of "chunks" that are a 4 byte ASCII string
    // "magic" identifier, and a 4 byte integer size, followed by
    // that many bytes of data. After that is the next chunk
    // or the end of the file.
    // the two relevant chunks are MThd (always size of 6)
    // that contains the MIDI file global info
    // and then MTrk for each MIDI track in the file
    // chunks with any other magic id are ignored.
    if(in==nullptr||out_file==nullptr) {
        return sfx_result::invalid_argument;
    }
    if(!in->caps().read) {
        return sfx_result::io_error;
    }
    int16_t tmp;
    union {
        uint32_t magic_id;
        char magic[5];
    } m;
    m.magic[4]=0;
    size_t pos = 0;
    size_t sz;
    if(4!=in->read((uint8_t*)m.magic,4)) {
        return sfx_result::invalid_format;
    }
    if(0!=strcmp(m.magic,"MThd")) {
        return sfx_result::invalid_format;
    }
    pos+=4;
    if(!midi_file_read_chunk_part(in,&pos,&sz) || 6!=sz) {
        return sfx_result::invalid_format;
    }

    if(2!=in->read((uint8_t*)&tmp,2)) {
        return sfx_result::end_of_stream;
    }
    if(bits::endianness()==bits::endian_mode::little_endian) {
        tmp = bits::swap(tmp);
    }
    pos+=2;
    out_file->type = tmp;

    if(2!=in->read((uint8_t*)&tmp,2)) {
        return sfx_result::end_of_stream;
    }
    if(bits::endianness()==bits::endian_mode::little_endian) {
        tmp = bits::swap(tmp);
    }
    pos+=2;
    out_file->tracks_size = tmp;
    if(2!=in->read((uint8_t*)&tmp,2)) {
        return sfx_result::end_of_stream;
    }
    if(bits::endianness()==bits::endian_mode::little_endian) {
        tmp = bits::swap(tmp);
    }
    pos+=2;
    out_file->timebase = tmp;
    out_file->tracks = (midi_track*)malloc(sizeof(midi_track)*out_file->tracks_size);
    if(out_file->tracks==nullptr) {
        return sfx_result::out_of_memory;
    }
    size_t i = 0;
    while(i<out_file->tracks_size) {
        if(4!=in->read((uint8_t*)m.magic,4)) {
            if(out_file->tracks_size==i) {
                return sfx_result::success;
            }
            return sfx_result::invalid_format;
        }
        pos+=4;
        if(!midi_file_read_chunk_part(in,&pos,&sz)) {
            return sfx_result::end_of_stream;
        }
        if(0==strcmp(m.magic,"MTrk")) {
            out_file->tracks[i].offset=pos;
            out_file->tracks[i].size=sz;
            ++i;
        }
        if(in->caps().seek) {
            in->seek(sz,io::seek_origin::current);
        } else {
            for(size_t j = 0;j<sz;++j) {
                if(-1==in->getch()) {
                    return sfx_result::end_of_stream;
                }
            }
        }
        pos+=sz;
    }
    if(i==out_file->tracks_size) {
        return sfx_result::success;
    }
    if(i<out_file->tracks_size) {
        return sfx_result::end_of_stream;
    }

    return sfx_result::invalid_format;
}
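
As an aside, the check-the-host-and-swap dance isn't the only way to read big-endian fields. Assembling the value from individual bytes with shifts gives the same answer on any host. Here's a sketch of that alternative applied to the 6 byte body of "MThd". It is not what read() above does, and header_fields, read_be16 and parse_mthd_body are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical holder for the three fields in the MThd chunk body.
struct header_fields {
    uint16_t type;         // MIDI file type (0, 1 or 2)
    uint16_t track_count;  // number of tracks in the file
    uint16_t timebase;     // ticks per quarter note
};

// Reads a big-endian 16-bit value regardless of host byte order.
inline uint16_t read_be16(const uint8_t* p) {
    return uint16_t((uint16_t(p[0]) << 8) | uint16_t(p[1]));
}

// Parses the 6 bytes that follow the MThd chunk header.
inline header_fields parse_mthd_body(const uint8_t* body) {
    header_fields result;
    result.type = read_be16(body);
    result.track_count = read_be16(body + 2);
    result.timebase = read_be16(body + 4);
    return result;
}
```

The bytes 00 01 00 03 01 E0 come out as a type 1 file with 3 tracks and a timebase of 480. The tradeoff is that you read into a byte buffer first rather than directly into the integer.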

Once we have the file's information, we can use it to find our tracks and stream them. We also got the timebase of the file in "pulses per quarter note" which helps us figure out the necessary timing to play the file.

Now, since we're going to be pulling MIDI messages out of the file, let's explore what they look like in code:

// represents the type of MIDI message
enum struct midi_message_type : uint8_t {
    // a note off message
    note_off = 0b10000000,
    // a note on message
    note_on = 0b10010000,
    // polyphonic pressure (aftertouch) message
    polyphonic_pressure = 0b10100000,
    // control change (CC) message
    control_change = 0b10110000,
    // program change/patch select message
    program_change = 0b11000000,
    // channel pressure (aftertouch) message
    channel_pressure = 0b11010000,
    // pitch wheel message
    pitch_wheel_change = 0b11100000,
    // system exclusive (sysex) message
    system_exclusive = 0b11110000,
    // 0b11110001 undefined
    // song position message
    song_position = 0b11110010,
    // song select message
    song_select = 0b11110011,
    // 0b11110100 undefined
    // 0b11110101 undefined
    // tune request message
    tune_request = 0b11110110,
    // end of system exclusive message
    end_system_exclusive = 0b11110111,
    // timing clock message
    timing_clock = 0b11111000,
    // 0b11111001 undefined
    // start message
    start_playback = 0b11111010,
    // continue message
    continue_playback = 0b11111011,
    // stop message
    stop_playback = 0b11111100,
    // 0b11111101 undefined
    // active sensing message
    active_sensing = 0b11111110,
    // reset message
    reset = 0b11111111,
    // MIDI file meta event message
    meta_event = 0b11111111
};
// represents a MIDI message
class midi_message final {
    void copy(const midi_message& rhs);
    void deallocate();
public:
    // the status byte
    uint8_t status;
    union {
        // the 8-bit value holder for a message with a single byte payload
        uint8_t value8;
        // the 16-bit value holder for a message with a two byte payload
        uint16_t value16;
        // sysex information (type()==system_exclusive)
        struct {
            // the data
            uint8_t* data;
            // the size of the data
            size_t size;
        } sysex;
        // meta event information (type()==meta_event - MIDI files only)
        struct {
            // the type of message
            uint8_t type;
            // the length encoded as a varlen
            uint8_t encoded_length[3];
            // the meta data
            uint8_t* data;
        } meta;
    };
    // constructs a new message
    inline midi_message() : status(0) {
        memset(this,0,sizeof(midi_message));
        meta.data = nullptr;
    }
    // destroys a message
    inline ~midi_message() {
        deallocate();
    }
    // copies a message
    inline midi_message(const midi_message& rhs) {
        copy(rhs);
    }
    // copies a message
    inline midi_message& operator=(const midi_message& rhs) {
        deallocate();
        copy(rhs);
        return *this;
    }
    // steals a message
    inline midi_message(midi_message&& rhs) {
        memcpy(this,&rhs,sizeof(midi_message));
        memset(&rhs,0,sizeof(midi_message));
    }
    // steals a message
    inline midi_message& operator=(midi_message&& rhs) {
        deallocate();
        memcpy(this,&rhs,sizeof(midi_message));
        memset(&rhs,0,sizeof(midi_message));
        return *this;
    }
    // gets the channel (channel messages only)
    inline uint8_t channel() const {
        if(status<0b11110000) {
            return status & 0xF;
        }
        return 0;
    }
    // sets the channel (channel messages only)
    inline void channel(uint8_t value) {
        if(status<0b11110000) {
            status = (status & 0xF0) | (value & 0x0F);
        }
    }
    // gets the type of message
    inline midi_message_type type() const {
        if(status<0b11110000) {
            return (midi_message_type)(status&0xF0);
        } else {
            return (midi_message_type)(status);
        }
    }
    // sets the type of message
    inline void type(midi_message_type value) {
        if(((int)value)<0b11110000) {
            status = (status & 0x0F) | (((int)value));
        } else {
            status = (uint8_t)value;
        }
    }
    // get the MSB value for messages with a 2 byte payload
    inline uint8_t msb() const {
        return (value16 >> 8)&0x7f;
    }
    // set the MSB value for messages with a 2 byte payload
    inline void msb(uint8_t value) {
        value16 = (value16 & 0x7f) | uint16_t((value & 0x7f)<<8);
    }
    // get the LSB value for messages with a 2 byte payload
    inline uint8_t lsb() const {
        return value16 & 0x7f;
    }
    // set the LSB value for messages with a 2 byte payload
    inline void lsb(uint8_t value) {
        value16 = (value16 & uint16_t(0x7f<<8)) | (value & 0x7f);
    }
    // indicates the size of the message over the wire
    inline size_t wire_size() const {
        int32_t result;

        switch(type()) {
        case midi_message_type::note_off:
        case midi_message_type::note_on:
        case midi_message_type::polyphonic_pressure:
        case midi_message_type::control_change:
        case midi_message_type::pitch_wheel_change:
        case midi_message_type::song_position:
            return 3;
        case midi_message_type::program_change:
        case midi_message_type::channel_pressure:
        case midi_message_type::song_select:
            return  2;
        case midi_message_type::system_exclusive:
            return sysex.size+1;
        case midi_message_type::reset:
            if(meta.type&0x80) {
                return 1;
            } else {
                const uint8_t* p=midi_utility::decode_varlen(meta.encoded_length,&result);
                if(p!=nullptr) {
                    return (size_t)result+(p-meta.encoded_length)+2;
                }
            }

            return 1;
        case midi_message_type::end_system_exclusive:
        case midi_message_type::active_sensing:
        case midi_message_type::start_playback:
        case midi_message_type::stop_playback:
        case midi_message_type::tune_request:
        case midi_message_type::timing_clock:
            return 1;
        default:
            return 1;
        }
    }
};

There's a lot here, but I've commented it, and I'll endeavor to explain the larger picture here now. For the most part, messages are 3 bytes or less, if you include the status byte. The exceptions are sysex and meta messages. Basically, for any message but those two, we can hold the data in the union at the top. Included in that are the pointers for meta and sysex messages in those cases. That way, we can at least for the most part, have a single size struct to represent any message. The rest of what's there is helper methods to get the individual portions of a message, and then the wire_size() function which returns the actual size of the message over the wire, or in a file, including the status byte.

A midi_event also contains a delta value which indicates the offset in MIDI ticks from the previous message:

// represents a MIDI event
struct midi_event final {
    // the offset in MIDI ticks from the previous event
    int32_t delta;
    // the MIDI message
    midi_message message;
};

It should be noted that the delta is stored in a file as a varlen number, which is a compressed integer. The file format allows up to 4 bytes for one, but the decoder here accepts from 1 to 3 bytes.

Let's take a look at midi_utility to see what decoding one looks like:

size_t midi_utility::decode_varlen(stream* in,
                                int32_t* out_value) {
    uint8_t c;
    uint32_t value;
    size_t result = 1;
    if ((value = (uint8_t)in->getch()) & 0x80) {
        value &= 0x7f;
        do {
            value = (value << 7) +
                ((c = (uint8_t)in->getch()) & 0x7f);
            ++result;
        } while (c & 0x80 && result < 4);
    }
    *out_value = value;

    return result > 3 ? 0 : result;
}

Basically, as long as the byte we just read has the high bit set, we take its lower 7 bits and shift/add them into our number. We stop when a byte arrives without the high bit set, and we report failure (a result of 0) if the quantity runs past 3 bytes.

Now we have almost all the tools to help us read an event from the file. Here's the last little bit, including the header to the function we'll be building out shortly:

// the MIDI event plus an absolute position within the stream
struct midi_stream_event final {
    // the absolute position in MIDI ticks
    unsigned long long absolute;
    // the delta from the last event in MIDI ticks
    int32_t delta;
    // the MIDI message
    midi_message message;
};
// represents a class for fetching MIDI messages out of a stream
class midi_stream final {
public:
    // decode the next event. The contents of the
    // in_out_event should be preserved between calls to this method.
    static const size_t decode_event(bool is_file,
                                    stream* in,
                                    midi_stream_event* in_out_event);
};

You can see here, we've basically augmented a MIDI event with an absolute position. This is so we can keep track of where we are within the stream. We could have done this at a higher level but I decided to put it here since in practice you'll actually need that figure whenever you pull events.

One thing that should be noted here is that the midi_stream_event instance is both an in and an out value. One reason for this is so we can track the absolute position. The other reason is at least as important. I mentioned earlier that we can omit the status byte for runs of messages with the same status. Passing in the old message facilitates this capability. The first time you call the routine, you pass in an empty (newly constructed) event. Afterward, you continue to pass in the same instance as you go. At least that's the theory. In practice it works, but with a wrinkle we'll eventually run into.

You probably noticed is_file. MIDI files overload the MIDI reset message status (0xFF) with their own message type that can only exist in files - the MIDI meta message. Unfortunately, this creates something of an ambiguity in the message stream. In a file, we look for meta events. Anywhere else, we treat them as reset.

Now let's get to the meat of the function, which honestly is pretty ugly, but the protocol sort of demands it. It would be possible to abstract this code, but that comes with its own costs.

const size_t midi_stream::decode_event(bool is_file, stream* in,
      midi_stream_event* in_out_event) {
    if (in == nullptr || in_out_event == nullptr) {
        return 0;
    }
    // initialized so a failed decode can't leave garbage in the event
    int32_t delta = 0;
    size_t result = midi_utility::decode_varlen(in,&delta);

    in_out_event->absolute+=delta;
    in_out_event->delta=delta;
    int i = in->getch();
    if(i==-1) {
        return 0;
    }
    ++result;
    uint8_t b = (uint8_t)i;
    if(in_out_event->message.status==0xFF && in_out_event->message.meta.data!=nullptr) {
        free(in_out_event->message.meta.data);
        in_out_event->message.meta.data=nullptr;
    }
    if(in_out_event->message.status==0xF7 && in_out_event->message.sysex.data!=nullptr) {
        free(in_out_event->message.sysex.data);
        in_out_event->message.sysex.data=nullptr;
    }
    bool has_status = b&0x80;
    // expecting a status byte
    if(!has_status) {
        if(!(in_out_event->message.status&0x80)) {
            // no status byte in message
            return 0;
        }
    } else {
        in_out_event->message.status = b;
    }
    switch(in_out_event->message.type()) {
    case midi_message_type::note_off:
    case midi_message_type::note_on:
    case midi_message_type::polyphonic_pressure:
    case midi_message_type::control_change:
    case midi_message_type::pitch_wheel_change:
    case midi_message_type::song_position:
        if(has_status) {
            if(2!=in->read((uint8_t*)&in_out_event->message.value16,2)) {
                return 0;
            }
            result+=2;
            return result;
        }
        i=in->getch();
        if(i==-1) {
            return 0;
        }
        ++result;
        in_out_event->message.lsb(b);
        in_out_event->message.msb((uint8_t)i);
        return result;
    case midi_message_type::program_change:
    case midi_message_type::channel_pressure:
    case midi_message_type::song_select:
        if(has_status) {
            i=in->getch();
            if(i==-1) {
                return 0;
            }
            ++result;
            in_out_event->message.value8 = (uint8_t)i;
            return result;
        }
        in_out_event->message.value8 = b;

        return  result;
    case midi_message_type::system_exclusive:
        {
            uint8_t* psx = nullptr;
            size_t sxsz = 0;
            uint8_t buf[512];
            uint8_t b = 0;
            int i = 0;
            while(b!=0xF7) {
                if(0==in->read(&b,1)) {
                    if(nullptr!=psx) {
                        free(psx);
                    }
                    return 0;
                }
                ++result;
                buf[i++]=b;
                if(i==512) {
                    sxsz+=512;
                    if(psx==nullptr) {
                        psx=(uint8_t*)malloc(sxsz);
                        if(nullptr==psx) {
                            return 0;
                        }
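                    } else {
                        // realloc via a temporary so the old block
                        // isn't leaked if the call fails
                        uint8_t* pnew=(uint8_t*)realloc(psx,sxsz);
                        if(nullptr==pnew) {
                            free(psx);
                            return 0;
                        }
                        psx=pnew;
                    }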
                    memcpy(psx+sxsz-512,buf,512);
                    i=0;
                }
            }
            if(i>0) {
                sxsz+=i;
                if(psx==nullptr) {
                    psx=(uint8_t*)malloc(sxsz);
                    if(nullptr==psx) {
                        return 0;
                    }
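                } else {
                    // realloc via a temporary so the old block
                    // isn't leaked if the call fails
                    uint8_t* pnew=(uint8_t*)realloc(psx,sxsz);
                    if(nullptr==pnew) {
                        free(psx);
                        return 0;
                    }
                    psx=pnew;
                }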
                memcpy(psx+sxsz-i,buf,i);
            }
            in_out_event->message.sysex.data = psx;
            in_out_event->message.sysex.size = sxsz;
            return result;
        }
    case midi_message_type::reset:
        if(!is_file) {
            return result;
        }
        // this is a meta event
        i=in->getch();
        if(i==-1) {
            return 0;
        }
        ++result;
        in_out_event->message.meta.type = (uint8_t)i;
        {
            int32_t vl;
            size_t sz=midi_utility::decode_varlen(in,&vl);
            // re-encode it to fill our midi message
            midi_utility::encode_varlen(vl,in_out_event->message.meta.encoded_length);
            result+=sz;
            if(vl>0) {
                uint8_t* p = (uint8_t*)malloc(vl);
                if(nullptr==p) {
                    return 0;
                }
                if(vl!=in->read(p,vl)) {
                    free(p);
                    return 0;
                }
                result+=vl;
                in_out_event->message.meta.data=p;
                return result;
            }
            in_out_event->message.meta.data = nullptr;
            return result;
        }
    case midi_message_type::end_system_exclusive:
    case midi_message_type::active_sensing:
    case midi_message_type::start_playback:
    case midi_message_type::stop_playback:
    case midi_message_type::tune_request:
    case midi_message_type::timing_clock:
        return result;
    default:
        return result;
    }
}

Particularly horrible, and not completely tested, is the sysex parsing portion. What complicates it is the fact that we don't know the size ahead of time. We must continue until we find an end sysex byte (0xF7). We stage up to 512 bytes at a time in a local buffer, and resize what we've allocated as needed, copying in the new data as it comes in. Ugh. It should also be noted that sysex support isn't quite to spec. You should be able to mix system realtime messages in with a sysex stream, but that's not supported with this code.
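For comparison, here's a minimal sketch of the same "read until 0xF7 without knowing the size" problem, written against a hypothetical in-memory byte_source rather than the library's stream class. It's a variation on the code above, not a copy of it: it grows the heap block directly in fixed steps instead of going through a staging buffer, and it reallocates through a temporary pointer so nothing leaks if the allocation fails. The names byte_source and read_sysex are assumptions made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// hypothetical stand-in for a stream, backed by memory
struct byte_source {
    const uint8_t* data;
    size_t size;
    size_t pos;
    // returns the next byte, or -1 at the end
    int getch() { return pos < size ? data[pos++] : -1; }
};

// Reads bytes up to and including the 0xF7 terminator.
// On success returns a malloc'd buffer (caller frees) and sets *out_size.
// On failure returns nullptr and leaks nothing.
uint8_t* read_sysex(byte_source* in, size_t* out_size, size_t chunk = 512) {
    assert(in != nullptr && out_size != nullptr && chunk > 0);
    uint8_t* result = nullptr;
    size_t used = 0;
    size_t capacity = 0;
    for (;;) {
        int i = in->getch();
        if (i == -1) {
            // the stream ended before the terminator showed up
            std::free(result);
            return nullptr;
        }
        if (used == capacity) {
            // grow through a temporary so a failed realloc
            // doesn't lose the block we already own
            uint8_t* grown = (uint8_t*)std::realloc(result, capacity + chunk);
            if (grown == nullptr) {
                std::free(result);
                return nullptr;
            }
            result = grown;
            capacity += chunk;
        }
        result[used++] = (uint8_t)i;
        if (i == 0xF7) {
            break;
        }
    }
    *out_size = used;
    return result;
}
```

Like the code above, this keeps the 0xF7 terminator in the returned data and doesn't handle realtime messages mixed into the sysex stream.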

The other part, thankfully slightly less complicated, is MIDI file meta messages. However, since the length is stored as a varlen value, we have to decode it off the stream, and then re-encode it to put it back into our message, since we've already advanced past it with our stream's input cursor.
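If varlen values are new to you, here's a minimal sketch of the encoding working on plain buffers. Each byte carries 7 bits of the value, most significant group first, and every byte except the last has its high bit set. The functions varlen_decode and varlen_encode are hypothetical names for this example, and their signatures differ from the midi_utility functions used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decodes a MIDI variable-length quantity from buf.
// Returns the number of bytes consumed, or 0 if the value is
// truncated or runs past the 4 bytes the file format allows.
size_t varlen_decode(const uint8_t* buf, size_t len, int32_t* out_value) {
    assert(buf != nullptr && out_value != nullptr);
    uint32_t value = 0;
    for (size_t i = 0; i < len && i < 4; ++i) {
        value = (value << 7) | (uint32_t)(buf[i] & 0x7F);
        if (!(buf[i] & 0x80)) {
            // high bit clear marks the final byte
            *out_value = (int32_t)value;
            return i + 1;
        }
    }
    return 0;
}

// Encodes value (0..0x0FFFFFFF) into out, which must hold 4 bytes.
// Returns the number of bytes written.
size_t varlen_encode(int32_t value, uint8_t* out) {
    assert(out != nullptr && value >= 0 && value <= 0x0FFFFFFF);
    uint8_t tmp[4];
    size_t n = 0;
    // collect 7-bit groups, least significant first
    do {
        tmp[n++] = (uint8_t)(value & 0x7F);
        value >>= 7;
    } while (value > 0);
    // emit the most significant group first, with the
    // continuation bit set on all but the last byte
    for (size_t i = 0; i < n; ++i) {
        uint8_t b = tmp[n - 1 - i];
        out[i] = (i + 1 < n) ? (uint8_t)(b | 0x80) : b;
    }
    return n;
}
```

For example, the bytes 0x81 0x48 decode to 200, and encoding 200 produces those same two bytes again, which is the round trip the meta message code relies on.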

Now we have the tools to pull events off of tracks out of a MIDI file! It's quite premature to declare victory at this point, however. We still have to re-interleave the events from each track into one MIDI stream, and then finally, output it over USB.

The interleaving is why this article has "advanced" in the title. We're basically going to do MIDI track mixing straight out of a file stream without loading more than we have to in memory at once.

The basic idea is as follows: For each track, we keep a cursor/context that contains the input position and the current midi_stream_event. Once we fill each one, we find the context with the event that has the nearest next absolute position. We store the index of that event for later.

When we go to retrieve a message, we simply return the event that's pointed to in the context at that previous index, and then decode the next event for that context/track. Finally we repeat the process of finding the event with the closest absolute position and store its index.
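The selection step on its own can be sketched like this. It assumes a stripped down, hypothetical track_cursor that only holds the absolute position of the pending event and an end flag, where the real contexts also carry the input position and the full event.

```cpp
#include <cassert>
#include <cstddef>

// hypothetical per-track cursor: the absolute tick of the event
// currently held for that track, plus an end-of-track flag
struct track_cursor {
    unsigned long long absolute;
    bool eos;
};

// Returns the index of the cursor whose pending event comes first,
// or count if every track is finished. Ties go to the lowest index.
size_t find_next_track(const track_cursor* cursors, size_t count) {
    assert(cursors != nullptr || count == 0);
    size_t next = count;
    unsigned long long pos = (unsigned long long)-1;
    for (size_t i = 0; i < count; ++i) {
        if (!cursors[i].eos && cursors[i].absolute < pos) {
            next = i;
            pos = cursors[i].absolute;
        }
    }
    return next;
}
```

Using the track count as the "nothing left" value mirrors how m_next_context is used in the code that follows. Because the comparison is strictly less than, events at the same tick come out in track order.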

The reset() method will fill the contexts with their initial values and also allows us to return the cursors to their starting positions at any point after the fact:

sfx_result midi_file_source::reset() {
    if(m_stream==nullptr) {
        return sfx_result::invalid_state;
    }
    // reset the count of elapsed ticks
    m_elapsed = 0;
    // fill the contexts
    const size_t tsz = m_file.tracks_size;
    for(int i = 0;i<(int)tsz;++i) {
        source_context* ctx = m_contexts+i;
        ctx->input_position = m_file.tracks[i].offset;
        // set the end flag in the case of a zero length track
        ctx->eos = !m_file.tracks[i].size;
        ctx->event.absolute = 0;
        ctx->event.delta = 0;
        ctx->event.message = midi_file_move(midi_message());
        // decode the first event
        if(!ctx->eos && ctx->input_position==m_stream->seek(ctx->input_position)) {
            if(0!=midi_stream::decode_event(true,m_stream,&ctx->event)) {
                ctx->input_position = m_stream->seek(0,seek_origin::current);
            }
        }
    }
    // now go through the contexts and find the one with
    // the nearest absolute position.
    m_next_context = m_file.tracks_size;
    unsigned long long pos = (unsigned long long)-1;
    for(int i = 0;i<(int)tsz;++i) {
        source_context* ctx = m_contexts+i;
        if(!ctx->eos) {
            if(ctx->event.absolute<pos) {
                m_next_context=i;
                pos = ctx->event.absolute;
            }
        }
    }
    return m_next_context==m_file.tracks_size?
            sfx_result::end_of_stream:sfx_result::success;
}

One weirdness with the above is midi_file_move(), which forces the compiler to choose the reference stealing assignment overload instead of the copy assignment overload. There's no sense in copying the message, and it would just waste resources. Basically, it's the equivalent of std::move<> but on IoT, I tend to limit my use of the STL in my libraries for reasons.
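I haven't shown midi_file_move() itself, but a helper like it can be written in a few lines without pulling in the STL. The following is a minimal sketch under that assumption, not the library's actual code. The names strip_ref, move_cast and blob are hypothetical, and blob is only there to show which assignment overload gets picked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// removes a reference qualifier from a type, standing in
// for std::remove_reference
template <typename T> struct strip_ref { using type = T; };
template <typename T> struct strip_ref<T&> { using type = T; };
template <typename T> struct strip_ref<T&&> { using type = T; };

// hypothetical equivalent of std::move: casts its argument
// to an rvalue reference and does nothing else
template <typename T>
constexpr typename strip_ref<T>::type&& move_cast(T&& value) noexcept {
    return static_cast<typename strip_ref<T>::type&&>(value);
}

// small owning type used to show the overload selection
struct blob {
    unsigned char* data = nullptr;
    size_t size = 0;
    blob() = default;
    explicit blob(size_t n)
        : data((unsigned char*)std::malloc(n)), size(n) {
        assert(data != nullptr);
    }
    blob(const blob&) = delete;
    blob& operator=(const blob&) = delete;
    // move assignment steals the pointer instead of copying the bytes
    blob& operator=(blob&& rhs) noexcept {
        if (this != &rhs) {
            std::free(data);
            data = rhs.data;
            size = rhs.size;
            rhs.data = nullptr;
            rhs.size = 0;
        }
        return *this;
    }
    ~blob() { std::free(data); }
};
```

With that in place, b = move_cast(a) leaves b owning the original allocation and a empty. That's the same effect receive() depends on later, when it has to save the status byte before moving the message out of the context.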

There's a helper method called read_next_event() which we use after we pull the next event from our contexts. It's somewhat similar to reset():

sfx_result midi_file_source::read_next_event() {
    size_t tsz = m_file.tracks_size;
    if(m_next_context==tsz) {
        return sfx_result::end_of_stream;
    }
    // find the next context we're
    // pulling the message from
    source_context* ctx = m_contexts+m_next_context;
    if(ctx->eos) {
        return sfx_result::end_of_stream;
    }
    // seek to the current input position
    if(ctx->input_position!=m_stream->seek(ctx->input_position)) {
        return sfx_result::io_error;
    }
    // decode the next event
    size_t sz = midi_stream::decode_event(true,m_stream,&ctx->event);
    if(sz==0) {
        return sfx_result::invalid_format;
    }
    // increment the position
    ctx->input_position+=sz;
    // set the end of stream flag if we're there
    if(ctx->input_position-m_file.tracks[m_next_context].offset>=
            m_file.tracks[m_next_context].size) {
        ctx->eos = true;
    }
    // find the context with the nearest absolutely positioned
    // event and store the index of it for later
    bool done = true;
    m_next_context = tsz;
    unsigned long long pos = (unsigned long long)-1;
    for(int i = 0;i<(int)tsz;++i) {
        ctx = m_contexts+i;
        if(!ctx->eos) {
            if(ctx->event.message.status!=0 &&
                    ctx->event.absolute<pos) {
                m_next_context=i;
                pos = ctx->event.absolute;
                done = false;
            }
        }
    }
    return done?sfx_result::end_of_stream:
                sfx_result::success;
}

On to receive() which pulls the next event from our contexts and returns it, after recomputing the delta by subtracting the current elapsed() count from the absolute position of the event:

sfx_result midi_file_source::receive(midi_event* out_event) {
    if(m_stream==nullptr) {
        return sfx_result::invalid_state;
    }
    if(m_next_context==m_file.tracks_size) {
        return sfx_result::end_of_stream;
    }

    source_context* ctx = m_contexts+m_next_context;
    if(ctx->eos) {
        return sfx_result::end_of_stream;
    }
    out_event->delta = (int32_t)(ctx->event.absolute-m_elapsed);
    // the midi_file_move will cause these values
    // to potentially be zeroed so we preserve
    // them:
    uint8_t status = ctx->event.message.status;
    uint8_t type = ctx->event.message.meta.type;
    out_event->message = midi_file_move(ctx->event.message);
    // set a running status byte
    ctx->event.message.status = status;
    // set the meta type
    ctx->event.message.meta.type = type;
    // don't need anything else

    // advance the elapsed ticks
    m_elapsed = ctx->event.absolute;

    // refill our contexts
    sfx_result r =read_next_event();
    if(r==sfx_result::end_of_stream) {
        return sfx_result::success;
    }
    return r;
}

Note that we return midi_event here instead of midi_stream_event. We don't need the absolute position anymore.

Now, calling receive() until sfx_result::end_of_stream is returned allows us to fetch the events in the order they occur, regardless of the track they are in. Deltas are also recomputed.
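To see the delta arithmetic without the file handling in the way, here's a minimal sketch that merges tracks held in memory. Everything in it (track_event, track, merge_tracks) is hypothetical and exists only for this example. Each event carries a per-track delta and a one character tag so we can tell the events apart in the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// hypothetical in-memory event: a per-track delta and a tag
struct track_event {
    int32_t delta;
    char tag;
};

// hypothetical in-memory track with a read cursor
struct track {
    const track_event* events;
    size_t count;
    size_t index;                 // next event to hand out
    unsigned long long absolute;  // absolute tick of events[index]
};

// Merges tracks into out (capacity out_cap), rewriting each delta so
// it is relative to the previous event in the merged stream.
// Returns the number of events written.
size_t merge_tracks(track* tracks, size_t track_count,
                    track_event* out, size_t out_cap) {
    assert(tracks != nullptr && out != nullptr);
    // prime each cursor with the absolute position of its first event
    for (size_t t = 0; t < track_count; ++t) {
        tracks[t].index = 0;
        tracks[t].absolute = tracks[t].count
            ? (unsigned long long)tracks[t].events[0].delta : 0;
    }
    unsigned long long elapsed = 0;
    size_t written = 0;
    while (written < out_cap) {
        // pick the track whose pending event is earliest
        size_t next = track_count;
        for (size_t t = 0; t < track_count; ++t) {
            if (tracks[t].index < tracks[t].count &&
                (next == track_count ||
                 tracks[t].absolute < tracks[next].absolute)) {
                next = t;
            }
        }
        if (next == track_count) {
            break;  // all tracks exhausted
        }
        track& tr = tracks[next];
        // new delta = absolute position minus ticks already elapsed
        out[written].delta = (int32_t)(tr.absolute - elapsed);
        out[written].tag = tr.events[tr.index].tag;
        ++written;
        elapsed = tr.absolute;
        // advance the cursor and accumulate the next absolute position
        if (++tr.index < tr.count) {
            tr.absolute += (unsigned long long)tr.events[tr.index].delta;
        }
    }
    return written;
}
```

Take one track with events at deltas 0 and 480, and another with deltas 240 and 480. Their absolute positions are 0, 480 and 240, 720, so the merged order is 0, 240, 480, 720 and the recomputed deltas come out as 0, 240, 240, 240.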

That's not a whole lot of fun without somewhere to send the messages. This is where we use the USB capabilities of the new ESP32S3. Basically we're going to publish a USB MIDI device and feed it MIDI messages at the right times. We'll be using a midi_clock, whose inner workings I won't cover here. It handles the timing based on a tempo and timebase, and calls us back for each MIDI tick if it can. If your timebase is 24, you'll get 24 ticks per "beat" or, technically, per quarter note. The clock is cooperatively multitasked, so it requires pumping. If it doesn't get pumped often enough, it will report how many ticks it missed, but we don't have to care about that.
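The relationship between tempo, timebase and tick length is simple arithmetic, sketched below. This isn't how midi_clock is implemented internally, and the function names are made up for the example. It assumes the tempo is expressed as a microtempo, meaning microseconds per quarter note, which is how MIDI files store it, and it assumes a positive ticks per quarter note timebase rather than an SMPTE one.

```cpp
#include <cassert>
#include <cstdint>

// microtempo: microseconds per quarter note (500000 is 120 BPM)
// timebase: MIDI ticks per quarter note
// Returns the duration of one tick in microseconds.
double tick_duration_us(int32_t microtempo, int16_t timebase) {
    assert(microtempo > 0 && timebase > 0);
    return (double)microtempo / (double)timebase;
}

// Converts a tempo in beats per minute to a microtempo.
int32_t bpm_to_microtempo(double bpm) {
    assert(bpm > 0.0);
    return (int32_t)(60000000.0 / bpm);
}

// How many whole ticks have passed after elapsed_us microseconds.
unsigned long long ticks_elapsed(unsigned long long elapsed_us,
                                 int32_t microtempo, int16_t timebase) {
    assert(microtempo > 0 && timebase > 0);
    // multiply before dividing so the fractional tick length isn't lost
    return elapsed_us * (unsigned long long)timebase /
           (unsigned long long)microtempo;
}
```

At 120 BPM the microtempo is 500000, so with a timebase of 24 one second of playback covers two quarter notes, or 48 ticks.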

Before we get to all that however, let's cover the bit where we feed messages. It should be noted that we don't put the notes right out to USB because they have to be timed. We keep a std::queue<> of midi_stream_event events that hold the next events to be played, along with their absolute positions relative to the file. These get retrieved in the clock's tick callback at which point their messages get packed into their binary format and sent to USB once their time comes.
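The queue side of that can be sketched on its own. This is a simplified illustration with a hypothetical pending_event that carries an id instead of a real message, and a play_due function that records ids where the real callback would write to USB.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// hypothetical queued event: only the absolute tick matters here
struct pending_event {
    unsigned long long absolute;
    int id;
};

// Pops every event that is due at or before elapsed, in order.
// sent receives the ids of the popped events (capacity sent_cap).
// Returns how many events were popped.
size_t play_due(std::queue<pending_event>& q, unsigned long long elapsed,
                int* sent, size_t sent_cap) {
    assert(sent != nullptr);
    size_t n = 0;
    while (!q.empty() && n < sent_cap) {
        const pending_event& e = q.front();
        if (e.absolute > elapsed) {
            // events were queued in order, so nothing later is due either
            break;
        }
        sent[n++] = e.id;  // a real callback would send the message here
        q.pop();
    }
    return n;
}
```

Since events go into the queue already sorted by absolute position, it's enough to look at the front. The first event that isn't due yet ends the loop.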

We keep reading until we get sfx_result::end_of_stream, at which point we call reset() on the source and repeat. We do this until there's an error, so probably forever:

void setup() {
    // a midi event source over a midi file
    midi_file_source msrc;
    // a midi clock
    midi_clock mclk;
    // a queue used to hold pending events
    midi_queue mqueue;
    // the state for the callback
    midi_clock_state mstate;
    mstate.clock = &mclk;
    mstate.queue = &mqueue;
    mstate.source = &msrc;
    Serial.begin(115200);
    // set the tick callback
    mclk.tick_callback(tick_callback,&mstate);
    SPIFFS.begin();
    // necessary boilerplate
    // to get this to work on an S3
    // instead of an S2:
    midi.setBaseEP(3);
    midi.begin();
    midi.setBaseEP(3);
    Serial.println("10 seconds to set up equipment starting now.");
    // must delay at least 1000!
    delay(10000);

    File file = SPIFFS.open("/indaclub.mid", "rb");
    // we use streams as a cross platform way to wrap platform dependent filesystem stuff
    file_stream fstm(file);
    // open the midi file source
    sfx_result r = midi_file_source::open(&fstm, &msrc);
    if (sfx_result::success != r) {
        Serial.printf("Error opening file: %d\n", (int)r);
        while (true)
            ;
    }
    // set the clock's timebase
    mclk.timebase(msrc.file().timebase);
    // start the clock
    mclk.start();
    // go forever
    while (true) {
        // the midi event
        midi_event e;
        // don't let the queue get overly big, there's no reason to
        if (mqueue.size() >= 16) {
            // just pump the clock
            mclk.update();
            continue;
        }
        // get the next event
        sfx_result r = msrc.receive(&e);
        if (r != sfx_result::success) {
            if (r == sfx_result::end_of_stream) {
                // pump the queue until out of messages
                while(mqueue.size()>0) {
                    mclk.update();
                }
                // reset the clock
                // and the source
                mclk.stop();
                msrc.reset();
                mclk.start();
                continue;
            }
            Serial.printf("Error receiving message: %d\n", (int)r);
            // exit
            break;
        } else {
            dump_midi(e.message);
            // add the event to the queue
            mqueue.push({(unsigned long long)msrc.elapsed(), e.delta, e.message});
            // pump the clock
            mclk.update();
        }
    }
}

Now that we've fed our messages into the queue, we should explore the tick callback. The idea here is we play any messages in the queue whose absolute position is less than or equal to the clock's elapsed tick count. Therefore, we loop, and when we do pull a message off the queue and send it, we have to pack it into its MIDI wire format and then use a low level call to write it, since ESPTinyUSB's MIDI isn't fully implemented, and the way it's implemented isn't really conducive to how we want to use it anyway:

void tick_callback(uint32_t pending, unsigned long long elapsed, void* state) {
    uint8_t buf[3];
    midi_clock_state* st = (midi_clock_state*)state;
    while (true) {
        // TODO: since I haven't made a usb_midi_output_device
        // we have to do this manually
        // events on the queue?
        if (st->queue->size()) {
            // peek the next one
            const midi_stream_event& event = st->queue->front();
            // is it ready to be played
            if (event.absolute <= elapsed) {
                // special handling for midi meta file tempo events
                if (event.message.type() == midi_message_type::meta_event &&
                          event.message.meta.type == 0x51) {
                    int32_t mt = (event.message.meta.data[0] << 16) |
                    (event.message.meta.data[1] << 8) | event.message.meta.data[2];
                    // update the clock microtempo
                    st->clock->microtempo(mt);
                }
                if (event.message.status != 0xFF ||
                          event.message.meta.data == nullptr) {
                    // send a sysex message
                    if (event.message.type() == midi_message_type::system_exclusive &&
                              event.message.sysex.data != nullptr) {
                        uint8_t* p = (uint8_t*)malloc(event.message.sysex.size + 1);
                        if (p != nullptr) {
                            *p = event.message.status;
                            memcpy(p + 1, event.message.sysex.data,
                                          event.message.sysex.size);
                            tud_midi_stream_write(0, p, event.message.sysex.size + 1);
                            free(p);
                        }
                    } else {
                        // send a regular message
                        // build a buffer and send it using raw midi
                        buf[0] = event.message.status;
                        if ((int)event.message.type() <=
                            (int)midi_message_type::control_change) {
                            switch (event.message.wire_size()) {
                                case 1:
                                    //tud_midi_stream_write(event.message.channel(),
                                    //buf, 1);
                                    tud_midi_stream_write(0, buf, 1);
                                    break;
                                case 2:
                                    buf[1] = event.message.value8;
                                    //tud_midi_stream_write(event.message.channel(),
                                    //buf, 2);
                                    tud_midi_stream_write(0, buf, 2);
                                    break;
                                case 3:
                                    buf[1] = event.message.lsb();
                                    buf[2] = event.message.msb();
                                    //tud_midi_stream_write(event.message.channel(),
                                    //buf, 3);
                                    tud_midi_stream_write(0, buf, 3);
                                    break;
                                default:
                                    break;
                            }
                        } else {
                            switch (event.message.wire_size()) {
                                case 1:
                                    tud_midi_stream_write(0, buf, 1);
                                    break;
                                case 2:
                                    buf[1] = event.message.value8;
                                    tud_midi_stream_write(0, buf, 2);
                                    break;
                                case 3:
                                    buf[1] = event.message.lsb();
                                    buf[2] = event.message.msb();
                                    tud_midi_stream_write(0, buf, 3);
                                    break;
                                default:
                                    break;
                            }
                        }
                    }
                }
                // remove the message. pop() runs its destructor,
                // so an explicit destructor call here would
                // destroy it twice
                st->queue->pop();
            } else {
                break;
            }
        } else {
            break;
        }
    }
}

Forgive the wrapping. It's just nested pretty deep. Anyway, that's the meat of everything we've done.
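Two pieces of that callback are easier to follow in isolation: decoding the tempo meta event, and packing a message into wire format. Here's a minimal sketch of both that doesn't depend on the library or on TinyUSB. The function names are hypothetical, and the packing only covers channel voice messages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decodes the 3-byte big-endian payload of a tempo meta event
// (type 0x51) into microseconds per quarter note.
int32_t decode_tempo(const uint8_t* data, size_t size) {
    assert(data != nullptr && size >= 3);
    return ((int32_t)data[0] << 16) |
           ((int32_t)data[1] << 8) |
           (int32_t)data[2];
}

// Packs a channel voice message into wire format.
// Returns the number of bytes written to out (2 or 3), or 0 if
// status is not a channel voice status.
size_t pack_channel_message(uint8_t status, uint8_t data1, uint8_t data2,
                            uint8_t out[3]) {
    assert(out != nullptr);
    if (status < 0x80 || status >= 0xF0) {
        return 0;
    }
    out[0] = status;
    out[1] = (uint8_t)(data1 & 0x7F);
    uint8_t kind = (uint8_t)(status & 0xF0);
    if (kind == 0xC0 || kind == 0xD0) {
        // program change and channel pressure carry one data byte
        return 2;
    }
    out[2] = (uint8_t)(data2 & 0x7F);
    return 3;
}
```

The tempo payload 0x07 0xA1 0x20 decodes to 500000 microseconds per quarter note, which is 120 BPM. A note on for middle C at full velocity packs to 0x90 0x3C 0x7F.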

If you want to hear it, fire up MIDI-OX and go to Options|MIDI Devices and add "MIDI Class" from your inputs.

History

  • 9th May, 2022 - Initial submission