For those who haven’t taken the time to digest all that is being discussed elsewhere, here is the latest bottom line from David Snyder:
“For those following the rapid succession of releases and wondering why we are seeing frequent updates (versions 147_08, _09, _10, _12, _13, etc.) and why I’ve been obsessing over “100Mbps” and “EEE,” it’s because we are currently living through a major architectural shift in the Diretta protocol itself.
We are moving from what Yu-san, the author of Diretta, calls Mode 2 to Mode 3, which he also refers to as Diretta Direct Stream (DDS).”
The Evolution of the Stream
- Mode 1 (The Past): Traditional buffered transmission. Reliable, but heavy on CPU usage.
- Mode 2 (The Standard): This used UDP packets. It was faster and lighter than TCP, but the Target computer still had to process the full OS Network Stack. This means parsing IP headers (Layer 3) and managing UDP ports and sockets (Layer 4). Every cycle spent validating a checksum or routing data to a socket is a cycle not spent moving audio.
- Mode 3 / DDS (The New Hotness): Yu-san has recently shifted to Layer 2 Ethernet Frames, removing two layers of the network protocol stack.
  - This bypasses both the IP layer (L3) and the Transport layer (L4) entirely.
  - No IP addresses to parse, no UDP ports to manage, and no socket overhead. It is essentially a direct memory transfer over the wire.
  - The Host talks directly to the Target’s MAC address using raw frames with a custom EtherType (0x88b5).
  - The Target computer no longer wastes time inspecting IP headers. It just sees a frame and hands the payload to the USB bus.
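To make the difference between Mode 2 and Mode 3 concrete, here is a minimal Python sketch of what a raw Layer 2 frame looks like. It is not Diretta's code or wire format: the MAC addresses, the placeholder payload, and the helper names (mac_to_bytes, build_frame, parse_frame) are invented for illustration. Only the EtherType 0x88b5 comes from the description above, and the header sizes are the standard Ethernet II, minimum IPv4, and UDP values.

```python
import struct

DIRETTA_ETHERTYPE = 0x88B5  # EtherType quoted in the post
ETH_HEADER_LEN = 14         # dst MAC (6) + src MAC (6) + EtherType (2)
IPV4_HEADER_LEN = 20        # minimum IPv4 header, no options
UDP_HEADER_LEN = 8


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return raw


def build_frame(dst_mac: str, src_mac: str, payload: bytes) -> bytes:
    """Host side: Ethernet II header followed directly by the payload."""
    header = struct.pack(
        "!6s6sH", mac_to_bytes(dst_mac), mac_to_bytes(src_mac), DIRETTA_ETHERTYPE
    )
    return header + payload


def parse_frame(frame: bytes):
    """Target side: one fixed-size unpack, one comparison, then the payload."""
    _dst, _src, ethertype = struct.unpack("!6s6sH", frame[:ETH_HEADER_LEN])
    if ethertype != DIRETTA_ETHERTYPE:
        return None  # some other protocol; ignore it
    return frame[ETH_HEADER_LEN:]


# Hypothetical addresses and a placeholder payload (not real audio data).
audio = bytes(range(48))
frame = build_frame("02:00:00:00:00:02", "02:00:00:00:00:01", audio)

l2_overhead = len(frame) - len(audio)
udp_overhead = ETH_HEADER_LEN + IPV4_HEADER_LEN + UDP_HEADER_LEN
print(f"L2 header bytes: {l2_overhead}, UDP/IPv4 header bytes: {udp_overhead}")
```

The sketch only builds and parses the bytes. Actually putting such a frame on the wire under Linux would typically go through an AF_PACKET raw socket, which needs root privileges, and how Diretta itself does this is not something the post describes.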
Standardization Note: Yu-san is actually working with the IEEE to have 0x88b5 formally defined and reserved as a standard Layer 2 frame type for audio data transmission. For context, 0x88b5 is currently one of the EtherType values that IEEE 802 sets aside for local experimental use, so this is a serious move toward establishing a new industry standard.
Why I think this matters for Sound Quality
This transition to Layer 2 (L2) is the ultimate expression of the “low noise” philosophy. By removing two layers of network stack processing, we further reduce the CPU instruction count on the Target. Less code execution = less power draw modulation = less electrical noise.
The Results (Version 147_13)
The transition has been a bit of a bumpy ride (as bleeding-edge tinkering often is!), but version 147_13 seems to have turned the corner.
I just ran a benchmark on my system (RPi 4 for Host and Target) using the new L2 protocol. The timing precision is startling:
That means the variance in packet arrival time is less than 2 microseconds. This is truly impressive timing precision that rivals hardware-based FPGA solutions, but we are doing it on a pair of Raspberry Pi 4 computers.
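For anyone who wants to check a figure like this on their own capture, here is one rough way to turn packet arrival timestamps into jitter numbers in Python. This is an assumption about method, not the benchmark tool used above: the function name arrival_jitter_us and the sample timestamps are made up, and the post does not say which statistic (standard deviation, peak-to-peak, or something else) the benchmark reports.

```python
from statistics import mean, pstdev


def arrival_jitter_us(timestamps_us):
    """Summarize the spread of inter-arrival intervals, in microseconds."""
    # Interval between each packet and the one before it.
    intervals = [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]
    return {
        "mean_interval_us": mean(intervals),
        "stdev_us": pstdev(intervals),
        "peak_to_peak_us": max(intervals) - min(intervals),
    }


# Made-up arrival times, nominally one packet every 1000 microseconds.
sample = [0.0, 1000.4, 1999.9, 3000.6, 4000.1, 5000.3]
stats = arrival_jitter_us(sample)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With real data the timestamps would come from a packet capture on the Target, and the accuracy of the result is limited by how precisely the capture itself is timestamped.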
How or why this correlates to better sound quality we can only speculate, but those who have tried 147_13 will tell you it’s the best sounding version so far, and it happens to also have the lowest jitter. It will be interesting to see if this trend continues.
