RFC 6716: Definition of the Opus Audio Codec

PROPOSED STANDARD

Errata Exist

Updated by: 8251

Internet Engineering Task Force (IETF)                         JM. Valin
Request for Comments: 6716                           Mozilla Corporation
Category: Standards Track                                         K. Vos
ISSN: 2070-1721                                  Skype Technologies S.A.
                                                           T. Terriberry
                                                     Mozilla Corporation
                                                          September 2012


                   

Definition of the Opus Audio Codec

Abstract

This document defines the Opus interactive speech and audio codec. Opus is designed to handle a wide range of interactive audio applications, including Voice over IP, videoconferencing, in-game chat, and even live, distributed music performances. It scales from low bitrate narrowband speech at 6 kbit/s to very high quality stereo music at 510 kbit/s. Opus uses both Linear Prediction (LP) and the Modified Discrete Cosine Transform (MDCT) to achieve good compression of both speech and music.

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6716.

Valin, et al. Standards Track [Page 1]

RFC 6716 Interactive Audio Codec September 2012

1.  Introduction

The Opus codec is a real-time interactive audio codec designed to meet the requirements described in [REQUIREMENTS]. It is composed of a layer based on Linear Prediction (LP) [LPC] and a layer based on the Modified Discrete Cosine Transform (MDCT) [MDCT]. The main idea behind using two layers is as follows: in speech, linear prediction techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g., MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies. Thus, a codec with both layers available can operate over a wider range than either one alone and can achieve better quality by combining them than by using either one individually.

The primary normative part of this specification is provided by the source code in Appendix A. Only the decoder portion of this software is normative, though a significant amount of code is shared by both the encoder and decoder. Section 6 provides a decoder conformance test. The decoder contains a great deal of integer and fixed-point arithmetic that needs to be performed exactly, including all rounding considerations, so any useful specification requires domain-specific symbolic language to adequately define these operations. Additionally, any conflict between the symbolic representation and the included reference implementation must be resolved. For the practical reasons of compatibility and testability, it would be advantageous to give the reference implementation priority in any disagreement. The C language is also one of the most widely understood, human-readable symbolic representations for machine behavior. For these reasons, this RFC uses the reference implementation as the sole symbolic representation of the codec.

While the symbolic representation is unambiguous and complete, it is not always the easiest way to understand the codec's operation. For this reason, this document also describes significant parts of the codec in prose and takes the opportunity to explain the rationale behind many of the more surprising elements of the design. These descriptions are intended to be accurate and informative, but the limitations of common English sometimes result in ambiguity, so it is expected that the reader will always read them alongside the symbolic representation. Numerous references to the implementation are provided for this purpose. The descriptions sometimes differ from the reference in ordering or through mathematical simplification wherever such deviation makes an explanation easier to understand. For example, the right shift and left shift operations in the reference implementation are often described using division and

1.1.  Notation and Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Various operations in the codec require bit-exact fixed-point behavior, even when writing a floating point implementation. The notation "Q<n>", where n is an integer, denotes the number of binary digits to the right of the decimal point in a fixed-point number. For example, a signed Q14 value in a 16-bit word can represent values from -2.0 to 1.99993896484375, inclusive. This notation is for informational purposes only. Arithmetic, when described, always operates on the underlying integer. For example, the text will explicitly indicate any shifts required after a multiplication.

Expressions, where included in the text, follow C operator rules and precedence, with the exception that the syntax "x**y" indicates x raised to the power y. The text also makes use of the following functions.

1.1.1.  min(x,y)

1.1.2.  max(x,y)

1.1.3.  clamp(lo,x,hi)

1.1.4.  sign(x)
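
As an illustration of the Q notation and of the functions named above, the following C sketch converts a Q14 value to a real number and multiplies two Q14 values with the explicit shift that the text refers to. All function names (ex_min, ex_max, ex_clamp, ex_sign, q14_to_double, q14_mul) are hypothetical and are not taken from the reference implementation, and the helper bodies assume the conventional meanings of min, max, clamp, and sign.

```c
#include <stdint.h>

/* Conventional integer helpers (assumed meanings, illustrative names). */
static int ex_min(int x, int y) { return x < y ? x : y; }
static int ex_max(int x, int y) { return x > y ? x : y; }
static int ex_clamp(int lo, int x, int hi) { return ex_max(lo, ex_min(x, hi)); }
static int ex_sign(int x) { return (x > 0) - (x < 0); }

/* A Q14 value stores (real value * 2**14) in the underlying integer. */
static double q14_to_double(int16_t q) { return (double)q / 16384.0; }

/* The 32-bit product of two Q14 values is Q28, so it is shifted right by
   14 bits to return to Q14. This assumes an arithmetic right shift for
   negative products, which C leaves implementation-defined. */
static int32_t q14_mul(int16_t a, int16_t b) {
    return ((int32_t)a * (int32_t)b) >> 14;
}
```

With these definitions, the smallest and largest 16-bit values map to -2.0 and 1.99993896484375, matching the range quoted above.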

2.  Opus Codec Overview

The LP layer is based on the SILK codec [SILK]. It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms, and requires an additional 5 ms look-ahead for noise shaping estimation. A small additional delay (up to 1.5 ms) may be required for sampling rate conversion. Like Vorbis [VORBIS-WEBSITE] and many other modern codecs, SILK is inherently designed for variable bitrate (VBR) coding, though the encoder can also produce constant bitrate (CBR) streams.

The version of SILK used in Opus is substantially modified from, and not compatible with, the stand-alone SILK codec previously deployed by Skype. This document does not serve to define that format, but those interested in the original SILK codec should see [SILK] instead.

The MDCT layer is based on the Constrained-Energy Lapped Transform (CELT) codec [CELT]. It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to 20 ms, and requires an additional 2.5 ms look-ahead due to the overlapping MDCT windows. The CELT codec is inherently designed for CBR coding, but unlike many CBR codecs, it is not limited to a set of predetermined rates. It internally allocates bits to exactly fill any given target budget, and an encoder can produce a VBR stream by varying the target on a per-frame basis.

The MDCT layer is not used for speech when the audio bandwidth is WB or less, as it is not useful there. On the other hand, non-speech signals are not always adequately coded using linear prediction. Therefore, the MDCT layer should be used for music signals.

A "Hybrid" mode allows the use of both layers simultaneously with a frame size of 10 or 20 ms and an SWB or FB audio bandwidth. The LP layer codes the low frequencies by resampling the signal down to WB. The MDCT layer follows, coding the high frequency portion of the signal. The cutoff between the two lies at 8 kHz, the maximum WB audio bandwidth. In the MDCT layer, all bands below 8 kHz are discarded, so there is no coding redundancy between the two layers.

The sample rate (in contrast to the actual audio bandwidth) can be chosen independently on the encoder and decoder side, e.g., a fullband signal can be decoded as wideband, or vice versa. This approach ensures a sender and receiver can always interoperate, regardless of the capabilities of their actual audio hardware. Internally, the LP layer always operates at a sample rate of twice the audio bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB and FB. The decoder simply resamples its output to support different sample rates. The MDCT layer always operates internally at a sample rate of 48 kHz. Since all the supported sample rates evenly divide this rate, and since the decoder may easily zero out the high frequency portion of the spectrum in the frequency domain, it can simply decimate the MDCT layer output to achieve the other supported sample rates very cheaply.
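
The sample rate rules in the preceding paragraph can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and the audio bandwidth is passed in Hz on the assumption that NB, MB, and WB correspond to 4, 6, and 8 kHz (only the 8 kHz WB figure appears in the text above). The second function checks divisibility of 48 kHz only; it does not by itself establish that a rate is one of the supported ones.

```c
/* Internal LP-layer sample rate: twice the audio bandwidth,
   capped at 16 kHz (the cap also applies to SWB and FB). */
static int lp_internal_rate_hz(int audio_bandwidth_hz) {
    int rate = 2 * audio_bandwidth_hz;
    return rate > 16000 ? 16000 : rate;
}

/* Integer decimation factor from the 48 kHz MDCT-layer rate to an
   output rate, or -1 if the rate does not evenly divide 48 kHz. */
static int mdct_decimation_factor(int output_rate_hz) {
    if (output_rate_hz <= 0 || 48000 % output_rate_hz != 0) {
        return -1;
    }
    return 48000 / output_rate_hz;
}
```

For example, an output rate of 16 kHz corresponds to keeping every third sample of the 48 kHz MDCT output once the upper part of the spectrum has been zeroed.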

2.1.  Control Parameters

2.1.1.  Bitrate

2.1.6.  Packet Loss Resilience

2.1.7.  Forward Error Correction (FEC)

2.1.8.  Constant/Variable Bitrate

[SRTP-VBR]. Bitrate may still be allowed to vary, even with sensitive data, as long as the variation is not driven by the input signal (for example, to match changing network conditions). To achieve this, an application should still run Opus in CBR mode, but change the target rate before each packet.

2.1.9.  Discontinuous Transmission (DTX)

3.  Internal Framing

[RFC3550] or Ogg [RFC3533] or Matroska [MATROSKA-WEBSITE]) will communicate the length, in bytes, of the packet, and it uses this information to reduce the framing overhead in the packet itself. A decoder implementation MUST support the framing described in this section. An alternative, self-delimiting variant of the framing is described in Appendix B. Support for that variant is OPTIONAL.

All bit diagrams in this document number the bits so that bit 0 is the most significant bit of the first byte, and bit 7 is the least significant. Bit 8 is thus the most significant bit of the second byte, etc. Well-formed Opus packets obey certain requirements, marked [R1] through [R7] below. These are summarized in Section 3.4 along with appropriate means of handling malformed packets.

3.1.  The TOC Byte

[R1]. This byte forms a table-of-contents (TOC) header that signals which of the various modes and configurations a given packet uses. It is composed of a configuration number, "config", a stereo flag, "s", and a frame count code, "c", arranged as illustrated in Figure 1. A description of each of these fields follows.
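
A minimal C sketch of splitting the TOC byte into its three fields is shown below. It assumes the field widths that the packet diagrams in Section 3.2 display (a 5-bit config, a 1-bit s, and a 2-bit c, from most to least significant bit). The type toc_fields and the function parse_toc are hypothetical names used only for illustration.

```c
/* Fields of the TOC byte (illustrative type). */
typedef struct {
    int config;  /* configuration number, 0..31 */
    int stereo;  /* stereo flag "s" */
    int code;    /* frame count code "c", 0..3 */
} toc_fields;

static toc_fields parse_toc(unsigned char toc) {
    toc_fields t;
    t.config = toc >> 3;        /* bits 0-4: the five most significant bits */
    t.stereo = (toc >> 2) & 1;  /* bit 5 */
    t.code   = toc & 3;         /* bits 6-7: the two least significant bits */
    return t;
}
```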

3.2.  Frame Packing

3.2.1.  Frame Length Coding

[R2] to allow for repacketization by gateways, conference bridges, or other software.

3.2.2.  Code 0: One Frame in the Packet

3.2.3.  Code 1: Two Frames in the Packet, Each with Equal Compressed Size

For code 1 packets, the TOC byte is immediately followed by the (N-1)/2 bytes of compressed data for the first frame, followed by (N-1)/2 bytes of compressed data for the second frame, as illustrated in Figure 3. The number of payload bytes available for compressed data, N-1, MUST be even for all code 1 packets [R3].

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | config  |s|0|1|                                               |
 +-+-+-+-+-+-+-+-+                                               :
 |             Compressed frame 1 ((N-1)/2 bytes)...             |
 :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               |                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
 |             Compressed frame 2 ((N-1)/2 bytes)...             |
 :                                               +-+-+-+-+-+-+-+-+
 |                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 3: A Code 1 Packet

3.2.4.  Code 2: Two Frames in the Packet, with Different Compressed Sizes

For code 2 packets, the TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame (marked N1 in Figure 4), followed by N1 bytes of compressed data for the first frame. The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the second frame. This is illustrated in Figure 4. A code 2 packet MUST contain enough bytes to represent a valid length. For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 packet whose second byte is in the range 252...255 is also invalid.
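
The code 2 layout and its validity rules can be sketched in C as below. The sketch assumes the one- or two-byte frame length coding of Section 3.2.1, which is not reproduced in this text: a first byte of 0...251 is the length itself, and a first byte of 252...255 is combined with a second byte as 4*second + first. The function names are hypothetical, and the checks correspond to requirements [R2] and [R4] as summarized in Section 3.4.

```c
/* Reads a one- or two-byte frame length (assumed Section 3.2.1 coding).
   Returns the number of length bytes consumed, or -1 if there are not
   enough bytes to represent a valid length. */
static int read_frame_length(const unsigned char *p, int avail, int *len) {
    if (avail < 1) return -1;
    if (p[0] < 252) {
        *len = p[0];
        return 1;
    }
    if (avail < 2) return -1;
    *len = 4 * p[1] + p[0];
    return 2;
}

/* Splits an N-byte code 2 packet into its two frame lengths.
   Returns 0 on success, or -1 if the packet is malformed. */
static int parse_code2(const unsigned char *pkt, int n, int *len1, int *len2) {
    int used, remaining;
    if (n < 1 || (pkt[0] & 3) != 2) return -1;   /* not a code 2 packet */
    used = read_frame_length(pkt + 1, n - 1, len1);
    if (used < 0) return -1;                     /* no valid length */
    remaining = n - 1 - used;
    if (*len1 > remaining) return -1;            /* [R4] */
    *len2 = remaining - *len1;                   /* N-N1-2 or N-N1-3 bytes */
    if (*len2 > 1275) return -1;                 /* [R2] */
    return 0;
}
```

Under these assumptions, the only 2-byte code 2 packet that parses successfully is the one whose second byte is zero, which yields two zero-length frames.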

[R4]. This makes, for example, a 2-byte code 2 packet with a second byte in the range 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the length of both frames is zero).

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | config  |s|1|0| N1 (1-2 bytes):                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
 |               Compressed frame 1 (N1 bytes)...                |
 :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               |                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
 |                     Compressed frame 2...                     :
 :                                                               |
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 4: A Code 2 Packet

3.2.5.  Code 3: A Signaled Number of Frames in the Packet

[R6,R7]. The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5), with bit 1 indicating whether or not Opus padding is inserted (marked "p" in Figure 5), and bit 0 indicating VBR (marked "v" in Figure 5). M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms [R5]. This limits the maximum frame count for any frame size to 48 (for 2.5 ms frames), with lower limits for longer frame sizes. Figure 5 illustrates the layout of the frame count byte.

  0
  0 1 2 3 4 5 6 7
 +-+-+-+-+-+-+-+-+
 |v|p|     M     |
 +-+-+-+-+-+-+-+-+

 Figure 5: The frame count byte

When Opus padding is used, the number of bytes of padding is encoded in the bytes following the frame count byte. Values from 0...254 indicate that 0...254 bytes of padding are included, in addition to

[R6,R7]. The additional padding bytes appear at the end of the packet and MUST be set to zero by the encoder to avoid creating a covert channel. The decoder MUST accept any value for the padding bytes, however.

Although this encoding provides multiple ways to indicate a given number of padding bytes, each uses a different number of bytes to indicate the padding size and thus will increase the total packet size by a different amount. For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert a single byte after the frame count byte with a value of 254, and append 254 padding bytes with the value zero to the end of the packet. To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after the frame count byte with the values 255 and 0, respectively, and append 254 padding bytes with the value zero to the end of the packet. By using the value 255 multiple times, it is possible to create a packet of any specific, desired size. Let P be the number of header bytes used to indicate the padding size plus the number of padding bytes themselves (i.e., P is the total number of bytes added to the packet). Then, P MUST be no more than N-2 [R6,R7].

In the CBR case, let R=N-2-P be the number of bytes remaining in the packet after subtracting the (optional) padding. Then, the compressed length of each frame in bytes is equal to R/M. The value R MUST be a non-negative integer multiple of M [R6]. The compressed data for all M frames follows, each of size R/M bytes, as illustrated in Figure 6.
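
The frame count byte, the padding length bytes, and the CBR constraint can be combined into the following C sketch. It is an illustration under stated assumptions rather than the reference parser: the function name is hypothetical, a padding length byte of 255 is taken to mean 254 bytes of padding plus a further length byte (as the 256-byte example above implies), VBR packets are simply rejected, and the 120 ms duration limit of [R5] is not checked because it depends on the frame size selected by the TOC configuration.

```c
/* Returns the per-frame compressed size R/M of a CBR code 3 packet of
   n bytes and stores M in *frame_count, or returns -1 if the packet is
   malformed or is not a CBR code 3 packet. */
static int code3_cbr_frame_size(const unsigned char *pkt, int n,
                                int *frame_count) {
    int v, p, m, pos, pad_bytes, total_added, r;
    if (n < 2 || (pkt[0] & 3) != 3) return -1;  /* [R6]: at least 2 bytes */
    v = pkt[1] >> 7;         /* bit 0: VBR flag */
    p = (pkt[1] >> 6) & 1;   /* bit 1: padding flag */
    m = pkt[1] & 0x3F;       /* bits 2-7: frame count M */
    if (v || m == 0) return -1;
    pos = 2;
    pad_bytes = 0;
    total_added = 0;         /* P: padding length bytes plus padding bytes */
    if (p) {
        int b;
        do {
            if (pos >= n) return -1;
            b = pkt[pos++];
            total_added++;
            pad_bytes += (b == 255) ? 254 : b;
        } while (b == 255);
    }
    total_added += pad_bytes;
    if (total_added > n - 2) return -1;          /* P <= N-2 */
    r = n - 2 - total_added;                     /* R = N-2-P */
    if (r % m != 0 || r / m > 1275) return -1;   /* [R6], [R2] */
    *frame_count = m;
    return r / m;
}
```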

[R7]. The compressed data for all M frames follows, each frame consisting of the indicated number of bytes, with the final frame consuming any remaining bytes before the final padding, as illustrated in Figure 6. The number of header bytes (TOC byte, frame count byte, padding length bytes, and frame length bytes), plus the signaled length of the first M-1 frames themselves, plus the signaled length of the padding MUST be no larger than N, the total size of the packet.

3.3.  Examples

3.4.  Receiving Malformed Packets

[R1]  Packets are at least one byte.

[R2]  No implicit frame length is larger than 1275 bytes.

[R3]  Code 1 packets have an odd total length, N, so that (N-1)/2 is an integer.

[R4]  Code 2 packets have enough bytes after the TOC for a valid frame length, and that length is no larger than the number of bytes remaining in the packet.

[R5]  Code 3 packets contain at least one frame, but no more than 120 ms of audio total.

[R6]  The length of a CBR code 3 packet, N, is at least two bytes, the number of bytes added to indicate the padding size plus the trailing padding bytes themselves, P, is no more than N-2, and the frame count, M, satisfies the constraint that (N-2-P) is a non-negative integer multiple of M.

[R7]  VBR code 3 packets are large enough to contain all the header bytes (TOC byte, frame count byte, any padding length bytes, and any frame length bytes), plus the length of the first M-1 frames, plus any trailing padding bytes.

4.  Opus Decoder

4.1.  Range Decoder

[RANGE-CODING] [MARTIN79], which is itself a rediscovery of the FIFO arithmetic code introduced by [CODING-THESIS]. It is very similar to arithmetic encoding, except that encoding is done with digits in any base

Section 5.1.

The range decoder maintains an internal state vector composed of the two-tuple (val, rng), where val represents the difference between the high end of the current range and the actual coded value, minus one, and rng represents the size of the current range. Both val and rng are 32-bit unsigned integer values.

4.1.1.  Range Decoder Initialization

Section 4.1.2.1, which the decoder invokes immediately after initialization to read additional bits and establish the invariant that rng > 2**23.

4.1.2.  Decoding Symbols

Section 4.1.3 particularly simple.

After the updates, implemented by ec_dec_update() (entdec.c), the decoder normalizes the range using the procedure in the next section, and returns the index k.

4.1.2.1.  Renormalization

Section 4.1.1 for the initialization used to process the first byte. Then, it sets

   val = ((val<<8) + (255-sym)) & 0x7FFFFFFF
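
The val update above can be placed in context with the following C sketch of initialization and renormalization. Several details are assumptions, since they are not spelled out in the surviving text: the starting state (rng = 128, val = 127 - (b0>>1) for first byte b0) and the way sym is formed (the leftover low bit of the previous byte followed by the top 7 bits of the next byte) follow the author's reading of ec_dec_init() and ec_dec_normalize() in the reference entdec.c. The type and function names here are hypothetical. Bytes requested past the end of the buffer read as zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative range decoder state (not the reference ec_dec layout). */
typedef struct {
    const unsigned char *buf;
    size_t len;
    size_t pos;
    uint32_t val;
    uint32_t rng;
    int rem;  /* last byte read; its low bit has not been consumed yet */
} range_dec;

/* Returns the next input byte, or zero once the input is exhausted. */
static int rd_read_byte(range_dec *d) {
    return d->pos < d->len ? d->buf[d->pos++] : 0;
}

/* Shifts in whole bytes until the invariant rng > 2**23 holds. */
static void rd_normalize(range_dec *d) {
    while (d->rng <= (UINT32_C(1) << 23)) {
        int prev = d->rem;
        int next = rd_read_byte(d);
        uint32_t sym = (uint32_t)(((prev & 1) << 7) | (next >> 1));
        d->rem = next;
        d->rng <<= 8;
        d->val = ((d->val << 8) + (255 - sym)) & UINT32_C(0x7FFFFFFF);
    }
}

/* Assumed initialization: consume 7 bits of the first byte, then
   renormalize to establish rng > 2**23. */
static void rd_init(range_dec *d, const unsigned char *buf, size_t len) {
    d->buf = buf;
    d->len = len;
    d->pos = 0;
    d->rng = 128;
    d->rem = rd_read_byte(d);
    d->val = 127 - (uint32_t)(d->rem >> 1);
    rd_normalize(d);
}
```

Under these assumptions, initialization reads four bytes in total and leaves rng equal to 2**31.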

Section 5.1.5 describes a procedure for doing this.

If the range decoder consumes all of the bytes belonging to the current frame, it MUST continue to use zero when any further input bytes are required, even if there is additional data in the current packet from padding or other frames.

  n               n+1             n+2             n+3
  0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 :     | <----------- Overlap region ------------> |             :
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       ^                                           ^
       |   End of data buffered by the range coder |
 ...-----------------------------------------------+
       |
       |   End of data consumed by raw bits
       +-------------------------------------------------------...

 Figure 13: Illustrative Example of Raw Bits Overlapping Range Coder Data

4.1.3.  Alternate Decoding Methods

4.1.3.1.  ec_decode_bin()

4.1.3.2.  ec_dec_bit_logp()

4.1.3.3.  ec_dec_icdf()

4.1.4.  Decoding Raw Bits

Section 4.1.2.1, the input consumed by the raw bits may overlap with the input consumed by the range coder, and a decoder MUST allow this. The format should render it impossible to attempt to read more raw bits than there are actual bits in the frame, though a decoder may wish to check for this and report an error.

4.1.5.  Decoding Uniformly Distributed Integers

4.1.6.  Current Bit Usage

4.2.  SILK Decoder

4.2.1.  SILK Decoder Modules

Section 4.1 and then decodes the parameters in it (2) using the procedures detailed in Sections 4.2.3 through 4.2.7.8.5. These parameters (3, 4, 5) are used to generate an excitation signal (see Section 4.2.7.8.6), which is fed to an optional Long-Term Prediction (LTP) filter (voiced frames only, see Section 4.2.7.9.1) and then a short-term prediction filter (see Section 4.2.7.9.2), producing the decoded signal (6). For stereo streams, the mid-side representation is converted to separate left and right channels (7). The result is finally resampled to the desired output sample rate (e.g., 48 kHz) so that the resampled signal (8) can be mixed with the CELT layer.

4.2.2.  LP Layer Organization

                                                       Section 4.2.4 |
 |                                   |               |               |
 | LBRR Frame(s)                     | Section 4.2.7 | Section 4.2.4 |
 |                                   |               |               |
 | Regular SILK Frame(s)             | Section 4.2.7 |               |
 +-----------------------------------+---------------+---------------+

      Table 3: Organization of the SILK layer of an Opus Frame

 +---------------------------------+
 |            VAD Flags            |
 +---------------------------------+
 |            LBRR Flag            |
 +---------------------------------+
 | Per-Frame LBRR Flags (Optional) |
 +---------------------------------+
 |     LBRR Frame 1 (Optional)     |
 +---------------------------------+
 |     LBRR Frame 2 (Optional)     |
 +---------------------------------+
 |     LBRR Frame 3 (Optional)     |
 +---------------------------------+
 |      Regular SILK Frame 1       |
 +---------------------------------+
 |      Regular SILK Frame 2       |
 +---------------------------------+
 |      Regular SILK Frame 3       |
 +---------------------------------+

     Figure 15: A 60 ms Mono Frame

4.2.3.  Header Bits

4.2.4.  Per-Frame LBRR Flags

Section 4.2.3) is already sufficient to indicate the presence of that single LBRR frame.

4.2.5.  LBRR Frames

Section 4.4). When switching from mono to stereo, the LBRR frames in the first stereo Opus frame MAY contain a non-trivial side channel.

In order to properly produce LBRR frames under all conditions, an encoder might need to buffer up to 60 ms of audio and re-encode it during these transitions. However, the reference implementation opts to disable LBRR frames at the transition point for simplicity. Since transitions are relatively infrequent in normal usage, this does not have a significant impact on packet loss robustness.

The LBRR frames immediately follow the LBRR flags, prior to any regular SILK frames. Section 4.2.7 describes their exact contents. LBRR frames do not include their own separate VAD flags. LBRR frames are only meant to be transmitted for active speech, thus all LBRR frames are treated as active.

In a stereo Opus frame longer than 20 ms, although the per-frame LBRR flags for the mid channel are coded as a unit before the per-frame LBRR flags for the side channel, the LBRR frames themselves are interleaved. The decoder parses an LBRR frame for the mid channel of a given 20 ms interval (if present) and then immediately parses the corresponding LBRR frame for the side channel (if present), before proceeding to the next 20 ms interval.

4.2.6.  Regular SILK Frames

Section 4.2.7 describes their contents, as well. Unlike the LBRR frames, a regular SILK frame is coded for each time interval in an Opus frame, even if the corresponding VAD flags are unset. For stereo Opus frames longer than 20 ms, the regular mid and side SILK frames for each 20 ms interval are interleaved, just as with the LBRR frames. The side frame may be skipped by coding an appropriate flag, as detailed in Section 4.2.7.2.

4.2.7.  SILK Frame Contents

Section 4.2.7.3),

   o  Quantization gains (Section 4.2.7.4),

   o  Short-term prediction filter coefficients (Section 4.2.7.5),

Section 4.2.7.5.5),

   o  LTP filter lags and gains (Section 4.2.7.6), and

   o  A Linear Congruential Generator (LCG) seed (Section 4.2.7.7).

The quantized excitation signal (see Section 4.2.7.8) follows these at the end of the frame. Table 5 details the overall organization of a SILK frame.
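
The interleaved parsing order of stereo LBRR frames described in Section 4.2.5 can be sketched in C as follows. This is an illustration only: the type lbrr_slot and the function lbrr_parse_order are hypothetical names, and the per-frame LBRR flags are assumed to have been decoded already into one array per channel, indexed by 20 ms interval.

```c
/* One LBRR frame to parse: its 20 ms interval and its channel. */
typedef struct {
    int interval;  /* index of the 20 ms interval */
    int is_side;   /* 0 for the mid channel, 1 for the side channel */
} lbrr_slot;

/* Fills out[] with the order in which present LBRR frames are parsed:
   mid then side for each interval, before moving to the next interval.
   out[] must have room for 2*intervals entries. Returns the number of
   LBRR frames present. */
static int lbrr_parse_order(const int *mid_flags, const int *side_flags,
                            int intervals, lbrr_slot *out) {
    int i, n = 0;
    for (i = 0; i < intervals; i++) {
        if (mid_flags[i]) {
            out[n].interval = i;
            out[n].is_side = 0;
            n++;
        }
        if (side_flags[i]) {
            out[n].interval = i;
            out[n].is_side = 1;
            n++;
        }
    }
    return n;
}
```

The regular SILK frames of a stereo Opus frame follow the same mid-then-side pattern per interval, except that a regular frame is coded for every interval unless the side frame is skipped by the flag of Section 4.2.7.2.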

4.2.7.1.  Stereo Prediction Weights

Section 4.5.2). To summarize, these weights are coded if and only if

   o  This is a stereo Opus frame (Section 3.1), and

   o  The current SILK frame corresponds to the mid channel.

The prediction weights are coded in three separate pieces, which are decoded by silk_stereo_decode_pred() (stereo_decode_pred.c). The first piece jointly codes the high-order part of a table index for both weights. The second piece codes the low-order part of each table index. The third piece codes an offset used to linearly interpolate between table indices. The details are as follows.

Let n be an index decoded with the 25-element stage-1 PDF in Table 6. Then, let i0 and i1 be indices decoded with the stage-2 and stage-3 PDFs in Table 6, respectively, and let i2 and i3 be two more indices decoded with the stage-2 and stage-3 PDFs, all in that order.

 +-------+-----------------------------------------------------------+
 | Stage | PDF                                                       |
 +-------+-----------------------------------------------------------+
 | Stage | {7, 2, 1, 1, 1, 10, 24, 8, 1, 1, 3, 23, 92, 23, 3, 1, 1,  |
 | 1     | 8, 24, 10, 1, 1, 1, 2, 7}/256                             |
 |       |                                                           |
 | Stage | {85, 86, 85}/256                                          |
 | 2     |                                                           |
 |       |                                                           |
 | Stage | {51, 51, 52, 51, 51}/256                                  |
 | 3     |                                                           |
 +-------+-----------------------------------------------------------+

                    Table 6: Stereo Weight PDFs
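
Each PDF in Table 6 sums to 256. As an illustration of how such a table might be prepared for a decoder routine like ec_dec_icdf() (Section 4.1.3.3), the C sketch below converts a PDF into an "inverse" cumulative table, icdf[k] = 256 - (pdf[0] + ... + pdf[k]). The function name pdf_to_icdf is hypothetical, and the assumption that inverse CDF tables are the stored form is the author's reading of the reference implementation rather than something stated in the text above.

```c
/* The three PDFs of Table 6, as integer counts out of 256. */
static const int stage1_pdf[25] = {
    7, 2, 1, 1, 1, 10, 24, 8, 1, 1, 3, 23, 92,
    23, 3, 1, 1, 8, 24, 10, 1, 1, 1, 2, 7
};
static const int stage2_pdf[3] = {85, 86, 85};
static const int stage3_pdf[5] = {51, 51, 52, 51, 51};

/* Converts an n-entry PDF whose entries are positive and sum to 256
   into an inverse CDF, where icdf[k] is 256 minus the cumulative sum
   through entry k. Returns 0 on success, or -1 if the sum is not 256. */
static int pdf_to_icdf(const int *pdf, int n, unsigned char *icdf) {
    int k, sum = 0;
    for (k = 0; k < n; k++) {
        sum += pdf[k];
    }
    if (sum != 256) return -1;
    sum = 0;
    for (k = 0; k < n; k++) {
        sum += pdf[k];
        icdf[k] = (unsigned char)(256 - sum);
    }
    return 0;
}
```

The last entry of every table produced this way is zero, which gives a decoder a natural terminator when it searches the table.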

4.2.7.2.  Mid-Only Flag

Section 3.1),

   o  The current SILK frame corresponds to the mid channel, and

Section 4.2.3) indicate that the corresponding side channel is not active.

      *  This is an LBRR frame where the LBRR flags (see Sections 4.2.3 and 4.2.4) indicate that the corresponding side channel is not coded.

It is omitted when there are no stereo weights, for all of the same reasons. It is also omitted for a regular SILK frame when the VAD flag of the corresponding side channel frame is set (indicating it is active). The side channel must be coded in this case, making the mid-only flag redundant. It is also omitted for an LBRR frame when the corresponding LBRR flags indicate the side channel is coded.

When the flag is present, the decoder reads a single value using the PDF in Table 8, as implemented in silk_stereo_decode_mid_only() (stereo_decode_pred.c). If the flag is set, then there is no corresponding SILK frame for the side channel, the entire decoding process for the side channel is skipped, and zeros are fed to the stereo unmixing process (see Section 4.2.8) instead. As stated above, LBRR frames still include this flag when the LBRR flag indicates that the side channel is not coded. In that case, if this flag is zero (indicating that there should be a side channel), then Packet Loss Concealment (PLC, see Section 4.4) SHOULD be invoked to recover a side channel signal. Otherwise, the stereo image will collapse.

 +---------------+
 | PDF           |
 +---------------+
 | {192, 64}/256 |
 +---------------+

 Table 8: Mid-only Flag PDF

4.2.7.3.  Frame Type

4.2.7.4.  Subframe Gains

Section 4.2.7.3).

 +-------------+------------------------------------+
 | Signal Type | PDF                                |
 +-------------+------------------------------------+
 | Inactive    | {32, 112, 68, 29, 12, 1, 1, 1}/256 |
 |             |                                    |
 | Unvoiced    | {2, 17, 45, 60, 62, 47, 19, 4}/256 |
 |             |                                    |
 | Voiced      | {1, 3, 26, 71, 94, 50, 9, 2}/256   |
 +-------------+------------------------------------+

   Table 11: PDFs for Independent Quantization Gain MSB Coding

The 3 least significant bits are decoded using a uniform PDF:

 +--------------------------------------+
 | PDF                                  |
 +--------------------------------------+
 | {32, 32, 32, 32, 32, 32, 32, 32}/256 |
 +--------------------------------------+

   Table 12: PDF for Independent Quantization Gain LSB Coding

These 6 bits are combined to form a value, gain_index, between 0 and 63. When the gain for the previous subframe is available, then the current gain is limited as follows:

   log_gain = max(gain_index, previous_log_gain - 16)

This may help some implementations limit the change in precision of their internal LTP history. The indices to which this clamp applies cannot simply be removed from the codebook, because previous_log_gain will not be available after packet loss. The clamping is skipped after a decoder reset, and in the side channel if the previous frame
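
The combination of the two decoded 3-bit values and the limiting step above can be sketched in C as follows. The function names and the have_previous parameter are hypothetical; have_previous stands in for whatever mechanism an implementation uses to record that previous_log_gain is available (for example, that no reset has occurred).

```c
/* Combines the 3 most significant and 3 least significant bits of an
   independently coded gain into gain_index, a value from 0 to 63. */
static int combine_gain_index(int msb, int lsb) {
    return (msb << 3) | lsb;
}

/* Applies log_gain = max(gain_index, previous_log_gain - 16) when a
   previous gain is available; otherwise the clamp is skipped. */
static int limit_log_gain(int gain_index, int have_previous,
                          int previous_log_gain) {
    int floor_gain;
    if (!have_previous) return gain_index;
    floor_gain = previous_log_gain - 16;
    return gain_index > floor_gain ? gain_index : floor_gain;
}
```

For instance, with a previous log gain of 40, a decoded gain_index of 10 is raised to 24, while a gain_index of 30 is left unchanged.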

4.2.7.5.  Normalized Line Spectral Frequency (LSF) and Linear Predictive Coding (LPC) Coefficients

A set of normalized Line Spectral Frequency (LSF) coefficients follows the quantization gains in the bitstream and represents the Linear Predictive Coding (LPC) coefficients for the current SILK frame.

[SPECTRAL-PAIRS] of the LPC filter into a symmetric part and an anti-symmetric part (P and Q in Section 4.2.7.5.6). Because of non-linear effects in the decoding process, an implementation SHOULD match the fixed-point arithmetic described in this section exactly. An encoder SHOULD also use the same process.

The normalized LSFs are coded using a two-stage vector quantizer (VQ) (Sections 4.2.7.5.1 and 4.2.7.5.2). NB and MB frames use an order-10 predictor, while WB frames use an order-16 predictor. Thus, each of these two cases uses a different set of tables. After reconstructing the normalized LSFs (Section 4.2.7.5.3), the decoder runs them through a stabilization process (Section 4.2.7.5.4), interpolates them between frames (Section 4.2.7.5.5), converts them back into LPC coefficients (Section 4.2.7.5.6), and then runs them through further processes to limit the range of the coefficients (Section 4.2.7.5.7) and the gain of the filter (Section 4.2.7.5.8). All of this is necessary to ensure the reconstruction process is stable.

4.2.7.5.1.  Normalized LSF Stage 1 Decoding

4.2.7.5.2.  Normalized LSF Stage 2 Decoding

RFC 6716 Interactive Audio Codec September 20124.2.7.5.3 . Reconstructing the Normalized LSF CoefficientsLAROIA-ICASSP]. The weights are derived directly from the stage-1 codebook vector. Let cb1_Q8[k] be the k'th entry of the stage-1 codebook vector from Table 23 or Table 24. Then, for 0 <= k < d_LPC, the following expression computes the square of the weight as a Q18 value: w2_Q18[k] = (1024/(cb1_Q8[k] - cb1_Q8[k-1]) + 1024/(cb1_Q8[k+1] - cb1_Q8[k])) << 16 where cb1_Q8[-1] = 0 and cb1_Q8[d_LPC] = 256, and the division is integer division. This is reduced to an unsquared, Q9 value using the following square-root approximation: i = ilog(w2_Q18[k]) f = (w2_Q18[k]>>(i-8)) & 127 y = ((i&1) ? 32768 : 46214) >> ((32-i)>>1) w_Q9[k] = y + ((213*f*y)>>16) The constant 46214 here is approximately the square root of 2 in Q15. The cb1_Q8[] vector completely determines these weights, and they may be tabulated and stored as 13-bit unsigned values (with a range of 1819 to 5227, inclusive) to avoid computing them when decoding. The reference implementation already requires code to compute these weights on unquantized coefficients in the encoder, in silk_NLSF_VQ_weights_laroia() (NLSF_VQ_weights_laroia.c) and its callers, so it reuses that code in the decoder instead of using a pre-computed table to reduce the amount of ROM required.
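As an informal illustration, the weight computation above can be sketched in C as follows. The names lsf_weights_q9() and ilog_u32() are hypothetical and are not taken from the reference implementation. The sketch assumes that ilog() returns the number of bits needed to represent its argument (0 for an argument of 0) and that the codebook entries are strictly increasing, so that no division by zero occurs.

```c
#include <stdint.h>

/* Assumed definition of ilog(): number of bits needed to store n,
 * i.e., floor(log2(n)) + 1 for n > 0, and 0 for n == 0. */
static int ilog_u32(uint32_t n)
{
    int bits = 0;
    while (n != 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}

/* Hypothetical helper: computes w_Q9[k] for 0 <= k < d_lpc from the
 * stage-1 codebook vector cb1_Q8[] (d_lpc strictly increasing entries). */
void lsf_weights_q9(const uint8_t *cb1_Q8, int d_lpc, int32_t *w_Q9)
{
    for (int k = 0; k < d_lpc; k++) {
        /* Boundary conventions: cb1_Q8[-1] = 0 and cb1_Q8[d_LPC] = 256. */
        int32_t prev = (k == 0) ? 0 : cb1_Q8[k - 1];
        int32_t next = (k == d_lpc - 1) ? 256 : cb1_Q8[k + 1];
        int32_t cur = cb1_Q8[k];

        /* Squared weight in Q18, using integer division. */
        uint32_t w2_Q18 =
            (uint32_t)(1024 / (cur - prev) + 1024 / (next - cur)) << 16;

        /* Square-root approximation from the text. */
        int i = ilog_u32(w2_Q18);
        int32_t f = (int32_t)((w2_Q18 >> (i - 8)) & 127);
        int32_t y = ((i & 1) ? 32768 : 46214) >> ((32 - i) >> 1);
        w_Q9[k] = y + ((213 * f * y) >> 16);
    }
}
```

With a synthetic (non-normative) vector whose neighbors are all 23 apart, the squared weight is 88 << 16 and the approximation yields 2367, close to the exact square root of about 2401.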

[...] [KABAL86]. When using the reference encoder, roughly 2% of frames violate this constraint. The next section describes a stabilization procedure used to make these guarantees.

4.2.7.5.4.  Normalized LSF Stabilization

4.2.7.5.5.  Normalized LSF Interpolation

[...] Section 4.5.2), the decoder still decodes this factor, but ignores its value and always uses 4 instead. For 10 ms SILK frames, this factor is not stored at all.

   +---------------------------+
   | PDF                       |
   +---------------------------+
   | {13, 22, 29, 11, 181}/256 |
   +---------------------------+

   Table 26: PDF for Normalized LSF Interpolation Index

Let n2_Q15[k] be the normalized LSF coefficients decoded by the procedure in Section 4.2.7.5, n0_Q15[k] be the LSF coefficients decoded for the prior frame, and w_Q2 be the interpolation factor. Then, the normalized LSF coefficients used for the first half of a 20 ms frame, n1_Q15[k], are

      n1_Q15[k] = n0_Q15[k] + (w_Q2*(n2_Q15[k] - n0_Q15[k]) >> 2)

This interpolation is performed in silk_decode_parameters() (decode_parameters.c).

4.2.7.5.6.  Converting Normalized LSFs to LPC Coefficients

4.2.7.5.7.  Limiting the Range of the LPC Coefficients

4.2.7.5.8.  Limiting the Prediction Gain of the LPC Filter

[...] Section 4.2.7.9.2 are

      a_Q12[k] = (a32_Q17[k] + 16) >> 5

Otherwise, a round of bandwidth expansion is applied using the same procedure as in Section 4.2.7.5.7, with

      sc_Q16[0] = 65536 - (2<<i)

During round 15, sc_Q16[0] becomes 0 in the above equation, so a_Q12[k] is set to 0 for all k, guaranteeing a stable filter.

4.2.7.6.  Long-Term Prediction (LTP) Parameters

[...] Section 4.2.7.3) include additional LTP parameters. There is one primary lag index for each SILK frame, but this is refined to produce a separate lag index per subframe using a vector quantizer. Each subframe also gets its own prediction gain coefficient.

4.2.7.6.1.  Pitch Lags

[...] Section 4.2.7.3). With absolute coding, the primary pitch lag may range from 2 ms (inclusive) up to 18 ms (exclusive), corresponding to pitches from 500 Hz down to 55.6 Hz, respectively. It is comprised of a high part and a low part, where the decoder first reads the high part using the 32-entry codebook in Table 29 and then the low part using the codebook corresponding to the current audio bandwidth from Table 30. The final primary pitch lag is then

      lag = lag_high*lag_scale + lag_low + lag_min

where lag_high is the high part, lag_low is the low part, and lag_scale and lag_min are the values from the "Scale" and "Minimum Lag" columns of Table 30, respectively.

   +-------------------------------------------------------------------+
   | PDF                                                               |
   +-------------------------------------------------------------------+
   | {3, 3, 6, 11, 21, 30, 32, 19, 11, 10, 12, 13, 13, 12, 11, 9, 8,   |
   | 7, 6, 4, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1}/256                  |
   +-------------------------------------------------------------------+

   Table 29: PDF for High Part of Primary Pitch Lag

   +------------+------------------------+-------+----------+----------+
   | Audio      | PDF                    | Scale | Minimum  | Maximum  |
   | Bandwidth  |                        |       | Lag      | Lag      |
   +------------+------------------------+-------+----------+----------+
   | NB         | {64, 64, 64, 64}/256   | 4     | 16       | 144      |
   |            |                        |       |          |          |
   | MB         | {43, 42, 43, 43, 42,   | 6     | 24       | 216      |
   |            | 43}/256                |       |          |          |
   |            |                        |       |          |          |
   | WB         | {32, 32, 32, 32, 32,   | 8     | 32       | 288      |
   |            | 32, 32, 32}/256        |       |          |          |
   +------------+------------------------+-------+----------+----------+

   Table 30: PDF for Low Part of Primary Pitch Lag

All frames that do not use absolute coding for the primary lag index use relative coding instead. The decoder reads a single delta value using the 21-entry PDF in Table 31. If the resulting value is zero, it falls back to the absolute coding procedure from the prior paragraph. Otherwise, the final primary pitch lag is then

      lag = previous_lag + (delta_lag_index - 9)
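The two lag formulas above can be sketched in C as follows. This is only an illustration: the type and function names (silk_bandwidth, pitch_lag_absolute(), pitch_lag_relative()) are hypothetical, and the already-decoded symbols lag_high, lag_low, and delta_lag_index are assumed to be supplied by the caller. The scale and minimum values are copied from Table 30.

```c
typedef enum { BW_NB = 0, BW_MB = 1, BW_WB = 2 } silk_bandwidth;

/* "Scale" and "Minimum Lag" columns of Table 30, indexed by bandwidth. */
static const int lag_scale_tab[3] = { 4, 6, 8 };
static const int lag_min_tab[3] = { 16, 24, 32 };

/* Absolute coding: lag = lag_high*lag_scale + lag_low + lag_min. */
int pitch_lag_absolute(silk_bandwidth bw, int lag_high, int lag_low)
{
    return lag_high * lag_scale_tab[bw] + lag_low + lag_min_tab[bw];
}

/* Relative coding. Returns 1 and stores the lag when the delta index is
 * non-zero. Returns 0 when the delta index is zero, in which case the
 * caller falls back to absolute coding. */
int pitch_lag_relative(int previous_lag, int delta_lag_index, int *lag)
{
    if (delta_lag_index == 0) {
        return 0;
    }
    *lag = previous_lag + (delta_lag_index - 9);
    return 1;
}
```

For example, the largest NB absolute lag is 31*4 + 3 + 16 = 143 samples, which stays just below the exclusive maximum of 144 listed in Table 30.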

4.2.7.6.2.  LTP Filter Coefficients

4.2.7.6.3.  LTP Scaling Parameter

[...] Section 4.2.7.3), and

o  Either

   *  This SILK frame corresponds to the first time interval of the current Opus frame for its type (LBRR or regular), or

   *  This is an LBRR frame where the LBRR flags (see Section 4.2.4) indicate the previous LBRR frame in the same channel is not coded.

This allows the encoder to trade off the prediction gain between packets against the recovery time after packet loss. Unlike absolute-coding for pitch lags, regular SILK frames that are not at the start of an Opus frame (i.e., that do not correspond to the first 20 ms time interval in Opus frames of 40 or 60 ms) do not include this field, even if the prior frame was not voiced, or (in the case of the side channel) not even coded. After an uncoded frame in the side channel, the LTP buffer (see Section 4.2.7.9.1) is cleared to zero, and is thus in a known state. In contrast, LBRR frames do include this field when the prior frame was not coded, since the LTP buffer contains the output of the PLC, which is non-normative.

If present, the decoder reads a value using the 3-entry PDF in Table 42. The three possible values represent Q14 scale factors of 15565, 12288, and 8192, respectively (corresponding to approximately 0.95, 0.75, and 0.5). Frames that do not code the scaling parameter use the default factor of 15565 (approximately 0.95).
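The mapping from the decoded index to the Q14 scale factor is a simple table lookup, sketched below. The function name ltp_scale_q14() and the field_present flag are hypothetical; the flag stands for the outcome of the conditions above, which the caller is assumed to have evaluated.

```c
#include <stdint.h>

/* Default factor used when the scaling parameter is not coded. */
#define LTP_SCALE_DEFAULT_Q14 15565

/* Q14 scale factors for the three decoded values
 * (approximately 0.95, 0.75, and 0.5). */
static const int16_t ltp_scale_table_Q14[3] = { 15565, 12288, 8192 };

/* field_present: non-zero if this frame codes the LTP scaling parameter.
 * index: the value (0..2) decoded with the 3-entry PDF; ignored when the
 * field is absent. */
int ltp_scale_q14(int field_present, int index)
{
    if (!field_present) {
        return LTP_SCALE_DEFAULT_Q14;
    }
    return ltp_scale_table_Q14[index];
}
```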

4.2.7.7.  Linear Congruential Generator (LCG) Seed

[...] Section 4.2.7.8.6, SILK uses a Linear Congruential Generator (LCG) to inject pseudorandom noise into the quantized excitation. To ensure synchronization of this process between the encoder and decoder, each SILK frame stores a 2-bit seed after the LTP parameters (if any). The encoder may consider the choice of seed during quantization, and the flexibility of this choice lets it reduce distortion, helping to pay for the bit cost required to signal it. The decoder reads the seed using the uniform 4-entry PDF in Table 43, yielding a value between 0 and 3, inclusive.

   +----------------------+
   | PDF                  |
   +----------------------+
   | {64, 64, 64, 64}/256 |
   +----------------------+

   Table 43: PDF for LCG Seed

4.2.7.8.  Excitation

[...] [PVQ]. The PVQ codebook is designed for Laplace-distributed values and consists of all sums of K signed, unit pulses in a vector of dimension N, where two pulses at the same position are required to have the same sign. Thus, the codebook includes all integer codevectors y of dimension N that satisfy

      N-1
      __
      \  abs(y[j]) = K
      /_
      j=0

Unlike regular PVQ, SILK uses a variable-length, rather than fixed-length, encoding. This encoding is better suited to the more Gaussian-like distribution of the coefficient magnitudes and the non-uniform distribution of their signs (caused by the quantization offset described below). SILK also handles large codebooks by coding [...]

4.2.7.8.1.  Rate Level

[...] Section 4.2.7.3). The rate level selects the PDF used to decode the number of pulses in the individual shell blocks. It does not directly convey any information about the bitrate or the number of pulses itself, but merely changes the probability of the symbols in Section 4.2.7.8.2. Level 0 provides a more efficient encoding at low rates generally, and level 8 provides a more efficient encoding at high rates generally, though the most efficient level for a [...]

4.2.7.8.2.  Pulses per Shell Block

[...] Section 4.2.7.8.1. The special value 17 indicates that this block has one or more additional LSBs to decode for each coefficient. If the decoder encounters this value, it decodes another value for the actual pulse count of the block, but uses the PDF corresponding to the special rate level 9 instead of the normal rate level. This process repeats until the decoder reads a value less than 17, and it then sets the number of extra LSBs used to the number of 17's decoded for that block. If it reads the value 17 ten times, then the next iteration uses the special rate level 10 instead of 9. The probability of decoding a 17 when using the PDF for rate level 10 is zero, ensuring that the number of LSBs for a block will not exceed 10. The cumulative distribution for rate level 10 is just a shifted version of that for 9 and thus does not require any additional storage.
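The escape mechanism for the value 17 can be sketched as the loop below. This is an illustration only: the range decoder is abstracted behind a hypothetical callback, pulse_symbol_fn, that returns the next symbol (0 to 17) decoded with the PDF of the given rate level. The scripted_source type is a stand-in for a real bitstream and exists only to make the sketch self-contained.

```c
/* Hypothetical abstraction of the range decoder: returns the next
 * pulse-count symbol (0..17) decoded with the PDF for rate_level. */
typedef int (*pulse_symbol_fn)(void *ctx, int rate_level);

typedef struct {
    int pulses;    /* final pulse count for the shell block (0..16) */
    int lsb_count; /* number of extra LSBs per coefficient (0..10) */
} shell_block_info;

shell_block_info decode_pulse_count(pulse_symbol_fn next, void *ctx,
                                    int rate_level)
{
    shell_block_info info = { 0, 0 };
    int value = next(ctx, rate_level);
    while (value == 17) {
        /* Each 17 adds one LSB per coefficient. After the tenth 17, the
         * PDF for level 10 is used, which cannot produce another 17. */
        info.lsb_count++;
        value = next(ctx, info.lsb_count < 10 ? 9 : 10);
    }
    info.pulses = value;
    return info;
}

/* Scripted stand-in for the range decoder, for illustration only. */
typedef struct {
    const int *symbols;
    int pos;
    int last_level;
} scripted_source;

int scripted_next(void *ctx, int rate_level)
{
    scripted_source *src = (scripted_source *)ctx;
    src->last_level = rate_level;
    return src->symbols[src->pos++];
}
```

Feeding the symbols 17, 17, 5 through this loop gives a pulse count of 5 with 2 extra LSBs, and the last two symbols are read using rate level 9.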

4.2.7.8.3.  Pulse Location Decoding

[...] Section 4.2.7.8.2 was zero, or where the split in the prior level indicated that all of the pulses fell on the other side. These partitions have nothing to code, so they require no PDF.

4.2.7.8.4.  LSB Decoding

[...] Section 4.2.7.8.2. The magnitude of the coefficient is initially equal to the number of pulses placed at that location in Section 4.2.7.8.3. As each LSB is decoded, the magnitude is doubled, and then the value of the LSB added to it, to obtain an updated magnitude.

4.2.7.8.5.  Sign Decoding

[...] Section 4.2.7.3) and the number of pulses in the block (from Section 4.2.7.8.2). The number of pulses in the block does not take into account any LSBs. Most PDFs are skewed towards negative signs because of the quantization offset, but the PDFs for zero pulses are highly skewed towards positive signs. If a block contains many positive coefficients, it is sometimes beneficial to code it solely using LSBs (i.e., with zero pulses), since the encoder may be able to save enough bits on the signs to justify the less efficient coefficient magnitude encoding.

   +-------------+-----------------------+-------------+---------------+
   | Signal Type | Quantization Offset   | Pulse Count | PDF           |
   |             | Type                  |             |               |
   +-------------+-----------------------+-------------+---------------+
   | Inactive    | Low                   | 0           | {2, 254}/256  |
   |             |                       |             |               |
   | Inactive    | Low                   | 1           | {207, 49}/256 |
   |             |                       |             |               |
   | Inactive    | Low                   | 2           | {189, 67}/256 |
   [...]

[...] Section 4.2.7.8.3) combined with any additional LSBs (see Section 4.2.7.8.4), and with the corresponding sign decoded in Section 4.2.7.8.5. Additionally, let seed be the current pseudorandom seed, which is initialized to the value decoded from Section 4.2.7.7 for the first sample in the current SILK frame, and updated for each subsequent sample according to the procedure below. Finally, let offset_Q23 be the quantization offset from Table 53. Then the following procedure produces the final reconstructed excitation value, e_Q23[i]:

      e_Q23[i] = (e_raw[i] << 8) - sign(e_raw[i])*20 + offset_Q23;
          seed = (196314165*seed + 907633515) & 0xFFFFFFFF;
      e_Q23[i] = (seed & 0x80000000) ? -e_Q23[i] : e_Q23[i];
          seed = (seed + e_raw[i]) & 0xFFFFFFFF;

When e_raw[i] is zero, sign() returns 0 by the definition in Section 1.1.4, so the factor of 20 does not get added. The final e_Q23[i] value may require more than 16 bits per sample, but it will not require more than 23, including the sign.

4.2.7.9.  SILK Frame Reconstruction

[...] Section 4.2.7.5.5), is less than 4, then these correspond to the final LPC coefficients produced by Section 4.2.7.5.8 from the interpolated LSF coefficients, n1_Q15[k] (computed in Section 4.2.7.5.5). Otherwise, they correspond to the final LPC coefficients produced from the uninterpolated LSF coefficients for the current frame, n2_Q15[k].

Also, let n be the number of samples in a subframe (40 for NB, 60 for MB, and 80 for WB), s be the index of the current subframe in this SILK frame (0 or 1 for 10 ms frames, or 0 to 3 for 20 ms frames), and j be the index of the first sample in the residual corresponding to the current subframe.

4.2.7.9.1.  LTP Synthesis

[...] Section 4.2.7.3), the LPC residual for i such that j <= i < (j + n) is simply a normalized copy of the excitation signal, i.e.,

               e_Q23[i]
      res[i] = ---------
                2.0**23

Voiced SILK frames, on the other hand, pass the excitation through an LTP filter using the parameters decoded in Section 4.2.7.6 to produce an LPC residual. The LTP filter requires LPC residual values from before the current subframe as input. However, since the LPC coefficients may have changed, it obtains this residual by "rewhitening" the corresponding output signal using the LPC coefficients from the current subframe. Let out[i] for i such that (j - pitch_lags[s] - d_LPC - 2) <= i < j be the fully reconstructed output signal from the last (pitch_lags[s] + d_LPC + 2) samples of previous subframes (see Section 4.2.7.9.2), where pitch_lags[s] is the pitch lag for the current subframe from Section 4.2.7.6.1. Additionally, let lpc[i] for i such that (j - s*n - d_LPC) <= i < j be the fully reconstructed output signal from the last (s*n + d_LPC) [...]

[...] Section 4.2.7.9.2). During reconstruction of the first subframe for this channel after either

o  An uncoded regular SILK frame (if this is the side channel), or

o  A decoder reset (see Section 4.5.2),

out[i] and lpc[i] are initially cleared to all zeros. If this is the third or fourth subframe of a 20 ms SILK frame and the LSF interpolation factor, w_Q2 (see Section 4.2.7.5.5), is less than 4, then let out_end be set to (j - (s-2)*n) and let LTP_scale_Q14 be set to 16384. Otherwise, set out_end to (j - s*n) and set LTP_scale_Q14 to the Q14 LTP scaling value from Section 4.2.7.6.3. Then, for i such that (j - pitch_lags[s] - 2) <= i < out_end, out[i] is rewhitened into an LPC residual, res[i], via

               4.0*LTP_scale_Q14
      res[i] = ----------------- * clamp(-1.0,
                  gain_Q16[s]

                                       d_LPC-1
                                         __              a_Q12[k]
                                out[i] - \  out[i-k-1] * --------, 1.0)
                                         /_               4096.0
                                         k=0

This requires storage to buffer up to 306 values of out[i] from previous subframes. This corresponds to WB with a maximum pitch lag of 18 ms * 16 kHz samples, plus 16 samples for d_LPC, plus 2 samples for the width of the LTP filter. Then, for i such that out_end <= i < j, lpc[i] is rewhitened into an LPC residual, res[i], via

                                       d_LPC-1
                 65536.0                 __              a_Q12[k]
      res[i] = ----------- * (lpc[i] -   \  lpc[i-k-1] * --------)
               gain_Q16[s]               /_               4096.0
                                         k=0

This requires storage to buffer up to 256 values of lpc[i] from previous subframes (240 from the current SILK frame and 16 from the previous SILK frame). This corresponds to WB with up to three previous subframes in the current SILK frame, plus 16 samples for d_LPC. The astute reader will notice that, given the definition of lpc[i] in Section 4.2.7.9.2, the output of this latter equation is merely a scaled version of the values of res[i] from previous subframes.
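The second rewhitening equation, which operates on lpc[i], can be sketched in floating-point C as follows. The function name rewhiten_lpc() and its calling convention are hypothetical: the caller is assumed to pass a pointer to the first sample to rewhiten, with at least d_lpc earlier samples available at negative offsets. The sketch does not attempt to reproduce any fixed-point behavior of the reference implementation.

```c
#include <stdint.h>

/* Rewhitens count samples of lpc[] into res[] using the current
 * subframe's LPC coefficients a_Q12[] and gain gain_Q16.
 * lpc[-d_lpc] .. lpc[-1] must be valid history samples. */
void rewhiten_lpc(const double *lpc, int count, const int16_t *a_Q12,
                  int d_lpc, int32_t gain_Q16, double *res)
{
    for (int i = 0; i < count; i++) {
        /* Short-term prediction from the d_lpc previous samples. */
        double pred = 0.0;
        for (int k = 0; k < d_lpc; k++) {
            pred += lpc[i - k - 1] * (a_Q12[k] / 4096.0);
        }
        /* Prediction error, scaled by the inverse of the subframe gain. */
        res[i] = (65536.0 / gain_Q16) * (lpc[i] - pred);
    }
}
```

For instance, with a single non-zero coefficient of 0.5 (a_Q12[0] = 2048) and unit gain, the decaying sequence 1.0, 0.5, 0.25 is rewhitened to 1.0, 0.0, 0.0, i.e., back to the impulse that would have produced it.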

[...] Section 4.2.7.6.2. Then for i such that j <= i < (j + n), the LPC residual is

                            4
               e_Q23[i]    __                                   b_Q7[k]
      res[i] = --------- + \  res[i - pitch_lags[s] + 2 - k] * -------
                2.0**23    /_                                    128.0
                           k=0

4.2.7.9.2.  LPC Synthesis

[...] Section 4.5.2). Then, for i such that j <= i < (j + n), the result of LPC synthesis for the current subframe is

                                      d_LPC-1
               gain_Q16[i]              __              a_Q12[k]
      lpc[i] = ----------- * res[i] +   \  lpc[i-k-1] * --------
                 65536.0                /_               4096.0
                                        k=0

The decoder saves the final d_LPC values, i.e., lpc[i] such that (j + n - d_LPC) <= i < (j + n), to feed into the LPC synthesis of the next subframe. This requires storage for up to 16 values of lpc[i] (for WB frames).

Then, the signal is clamped into the final nominal range:

      out[i] = clamp(-1.0, lpc[i], 1.0)

This clamping occurs entirely after the LPC synthesis filter has run. The decoder saves the unclamped values, lpc[i], to feed into the LPC filter for the next subframe, but saves the clamped values, out[i], for rewhitening in voiced frames.
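A floating-point sketch of LPC synthesis for one subframe follows. The function name lpc_synthesis_subframe(), the buffer layout, and the size limits are hypothetical. The sketch also assumes that the gain in the equation above is the single gain of the current subframe, as in the rewhitening equations of Section 4.2.7.9.1, and passes it as one gain_Q16 argument.

```c
#include <stdint.h>

#define MAX_D_LPC 16    /* largest filter order (WB) */
#define MAX_SUBFRAME 80 /* largest subframe length in samples (WB) */

static double clamp_unit(double x)
{
    return x < -1.0 ? -1.0 : (x > 1.0 ? 1.0 : x);
}

/* res[0..n-1]: LPC residual for the subframe (n <= MAX_SUBFRAME).
 * lpc_hist[0..d_lpc-1]: unclamped lpc[] values preceding the subframe,
 * oldest first; overwritten with the final d_lpc values on return.
 * out[0..n-1]: clamped output samples. */
void lpc_synthesis_subframe(const double *res, int n, const int16_t *a_Q12,
                            int d_lpc, int32_t gain_Q16, double *lpc_hist,
                            double *out)
{
    double buf[MAX_D_LPC + MAX_SUBFRAME];

    for (int k = 0; k < d_lpc; k++) {
        buf[k] = lpc_hist[k];
    }
    for (int i = 0; i < n; i++) {
        double acc = (gain_Q16 / 65536.0) * res[i];
        for (int k = 0; k < d_lpc; k++) {
            /* The feedback uses the unclamped values lpc[i-k-1]. */
            acc += buf[d_lpc + i - k - 1] * (a_Q12[k] / 4096.0);
        }
        buf[d_lpc + i] = acc;
        out[i] = clamp_unit(acc); /* clamping happens after the filter */
    }
    for (int k = 0; k < d_lpc; k++) {
        lpc_hist[k] = buf[n + k];
    }
}
```

Keeping the unclamped values in the history matters: with a gain of 2.0 and a_Q12[0] = 2048, an input impulse of 1.0 gives lpc[] values of 2.0 and 1.0, so both output samples are 1.0, whereas feeding back the clamped value would give 0.5 for the second sample.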

4.2.8.  Stereo Unmixing

[...] Section 4.2.7.1. This simple low-pass filter imposes a one-sample delay, and the unfiltered mid channel is also delayed by one sample. In order to allow seamless switching between stereo and mono, mono streams must also impose the same one-sample delay. The encoder requires an additional one-sample delay for both mono and stereo streams, though an encoder may omit the delay for mono if it knows it will never switch to stereo.

The unmixing process operates in two phases. The first phase lasts for 8 ms, during which it interpolates the prediction weights from the previous frame, prev_w0_Q13 and prev_w1_Q13, to the values for the current frame, w0_Q13 and w1_Q13. The second phase simply uses these weights for the remainder of the frame.

Let mid[i] and side[i] be the contents of out[i] (from Section 4.2.7.9.2) for the current mid and side channels, respectively, and let left[i] and right[i] be the corresponding stereo output channels. If the side channel is not coded (see Section 4.2.7.2), then side[i] is set to zero. Also, let j be defined as in Section 4.2.7.9, n1 be the number of samples in phase 1 (64 for NB, 96 for MB, and 128 for WB), and n2 be the total number of samples in the frame. Then, for i such that j <= i < (j + n2), the left and right channel output is

           prev_w0_Q13                  (w0_Q13 - prev_w0_Q13)
      w0 = ----------- + min(i - j, n1)*----------------------
             8192.0                           8192.0*n1

           prev_w1_Q13                  (w1_Q13 - prev_w1_Q13)
      w1 = ----------- + min(i - j, n1)*----------------------
             8192.0                           8192.0*n1

           mid[i-2] + 2*mid[i-1] + mid[i]
      p0 = ------------------------------
                        4.0

      left[i] = clamp(-1.0, (1 + w1)*mid[i-1] + side[i-1] + w0*p0, 1.0)

      right[i] = clamp(-1.0, (1 - w1)*mid[i-1] - side[i-1] - w0*p0, 1.0)
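The unmixing equations translate fairly directly into floating-point C, as sketched below. The function name stereo_unmix() and the buffer layout are hypothetical: mid_h[] is assumed to hold the two mid samples preceding the frame followed by the n2 samples of the frame, and side_h[] the one preceding side sample followed by the frame's side samples, so that the one-sample and two-sample delays can be read without negative indices.

```c
static double clamp_unit(double x)
{
    return x < -1.0 ? -1.0 : (x > 1.0 ? 1.0 : x);
}

/* mid_h[t]  corresponds to mid[j - 2 + t]  (n2 + 2 values are available).
 * side_h[t] corresponds to side[j - 1 + t] (n2 + 1 values are available).
 * left[] and right[] receive n2 output samples each. */
void stereo_unmix(const double *mid_h, const double *side_h, int n1, int n2,
                  int prev_w0_Q13, int prev_w1_Q13, int w0_Q13, int w1_Q13,
                  double *left, double *right)
{
    for (int t = 0; t < n2; t++) { /* t = i - j */
        int m = t < n1 ? t : n1;   /* min(i - j, n1) */

        /* Phase 1 interpolates the weights; phase 2 holds them. */
        double w0 = prev_w0_Q13 / 8192.0
                    + m * (w0_Q13 - prev_w0_Q13) / (8192.0 * n1);
        double w1 = prev_w1_Q13 / 8192.0
                    + m * (w1_Q13 - prev_w1_Q13) / (8192.0 * n1);

        /* Low-pass filtered mid channel (one-sample delay). */
        double p0 = (mid_h[t] + 2.0 * mid_h[t + 1] + mid_h[t + 2]) / 4.0;

        double m1 = mid_h[t + 1]; /* mid[i-1] */
        double s1 = side_h[t];    /* side[i-1] */

        left[t] = clamp_unit((1.0 + w1) * m1 + s1 + w0 * p0);
        right[t] = clamp_unit((1.0 - w1) * m1 - s1 - w0 * p0);
    }
}
```

With all weights equal to zero, this reduces to left = mid + side and right = mid - side on the delayed samples.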

4.2.9.  Resampling

[...] Section 6 is designed to be relatively insensitive to them. The delays listed here are the ones that should be targeted by the encoder.

[...] Section 4.5, because they all involve a SILK decoder reset. When the decoder is reset, any samples remaining in the resampling buffer are discarded, and the resampler is re-initialized with silence.

4.3.  CELT Decoder

[...] [MDCT] with partially overlapping windows of 5 to 22.5 ms. The main principle behind CELT is that the MDCT spectrum is divided into bands that (roughly) follow the Bark scale, i.e., the scale of the ear's critical bands [ZWICKER61]. The normal CELT layer uses 21 of those bands, though Opus Custom (see Section 6.2) may use a different number of bands. In Hybrid mode, the first 17 bands (up to 8 kHz) are not coded. A band can contain as little as one MDCT bin per channel, and as many as 176 bins per channel, as detailed in Table 55.

In each band, the gain (energy) is coded separately from the shape of the spectrum. Coding the gain explicitly makes it easy to preserve the spectral envelope of the signal. The remaining unit-norm shape vector is encoded using a Pyramid Vector Quantizer (PVQ) (Section 4.3.4).

   +--------+--------+------+-------+-------+-------------+------------+
   | Frame  | 2.5 ms | 5 ms | 10 ms | 20 ms | Start       | Stop       |
   | Size:  |        |      |       |       | Frequency   | Frequency  |
   +--------+--------+------+-------+-------+-------------+------------+
   | Band   | Bins:  |      |       |       |             |            |
   |        |        |      |       |       |             |            |
   | 0      | 1      | 2    | 4     | 8     | 0 Hz        | 200 Hz     |
   |        |        |      |       |       |             |            |
   | 1      | 1      | 2    | 4     | 8     | 200 Hz      | 400 Hz     |
   |        |        |      |       |       |             |            |
   [...]

[...] Section 4.3.4.5).

4.3.1.  Transient Decoding

4.3.2.  Energy Envelope Decoding

4.3.2.1.  Coarse Energy Decoding

[...] [Z-TRANSFORM] of the prediction filter is [...]

4.3.2.2.  Fine Energy Quantization

[...] Section 4.3.3. Let B_i be the number of fine energy bits for band i; the refinement is an integer f in the range [0,2**B_i-1]. The correction applied to the coarse energy for a given f is (f+1/2)/2**B_i - 1/2. Fine energy quantization is implemented in quant_fine_energy() (quant_bands.c).

When some bits are left "unused" after all other flags have been decoded, these bits are assigned to a "final" step of fine allocation. In effect, these bits are used to add one extra fine energy bit per band per channel. The allocation process determines two "priorities" for the final fine bits. Any remaining bits are first assigned only to bands of priority 0, starting from band 0 and going up. If all bands of priority 0 have received one bit per channel, then bands of priority 1 are assigned an extra bit per channel, starting from band 0. If any bits are left after this, they are left unused. This is implemented in unquant_energy_finalise() (quant_bands.c).
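As a small worked example of the fine energy mapping (f+1/2)/2**B_i - 1/2 given earlier in this section, the following sketch evaluates the correction for a given refinement value. The function name fine_energy_correction() is hypothetical, and the sketch uses floating point rather than the arithmetic of the reference implementation.

```c
/* Returns the correction applied to the coarse energy for refinement
 * value f (0 <= f < 2**bits) when the band has `bits` fine energy bits. */
double fine_energy_correction(unsigned f, int bits)
{
    return (f + 0.5) / (double)(1u << bits) - 0.5;
}
```

With one fine bit, the two possible corrections are -0.25 and +0.25. With two bits, they are -0.375, -0.125, +0.125, and +0.375. Each additional bit halves the spacing while keeping the values centered on zero.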

4.3.3.  Bit Allocation

[...] [VALIN2010]. Many codecs transmit significant amounts of side information to control the bit allocation within a frame. Often this control is only indirect, and it must be exercised carefully to achieve the desired rate constraints. The CELT layer, however, can adapt over a very wide range of rates, so it has a large number of codebook sizes to choose from for each band. Explicitly signaling the size of each of these codebooks would impose considerable overhead, even though the allocation is relatively static from frame to frame. This is because all of the information required to compute these codebook sizes must be derived from a single frame by itself, in order to retain robustness to packet loss, so the signaling cannot take advantage of knowledge of the allocation in neighboring frames. This problem is exacerbated in low-latency (small frame size) applications, which would include this overhead in every frame.

For this reason, in the MDCT mode, Opus uses a primarily implicit bit allocation. The available bitstream capacity is known in advance to both the encoder and decoder without additional signaling, ultimately from the packet sizes expressed by a higher-level protocol. Using this information, the codec interpolates an allocation from a hard-coded table.

While the band-energy structure effectively models intra-band masking, it ignores the weaker inter-band masking, band-temporal masking, and other less significant perceptual effects. While these effects can often be ignored, they can become significant for particular samples. One mechanism available to encoders would be to [...]

4.3.4.  Shape Decoding

[...] Section 4.3.3 is converted to a number of pulses as described by Section 4.3.4.1. Knowing the number of pulses and the number of samples in the band, the decoder calculates the size of the codebook as detailed in Section 4.3.4.2. The size is used to decode an unsigned integer (uniform probability model), which is the codeword index. This index is converted into the corresponding vector as explained in Section 4.3.4.2. This vector is then scaled to unit norm.

4.3.4.1.  Bits to Pulses

[...] Section 4.3.4.2. The difference between the number of bits allocated and the number of bits used is accumulated to a "balance" (initialized to zero) that helps adjust the allocation for the next bands. One third of the balance is applied to the bit allocation of each band to help achieve the target allocation. The only exceptions are the band before the last and the last band, for which half the balance and the whole balance are applied, respectively.

4.3.4.2.  PVQ Decoding

[...] [PVQ]. The indexing is based on the calculation of V(N,K) (denoted N(L,K) in [PVQ]).

The number of combinations can be computed recursively as V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0, K != 0. There are many different ways to compute V(N,K), including precomputed tables and direct use of the recursive formulation. The reference implementation applies the recursive formulation one line (or column) at a time to save on memory use, along with an alternate, univariate recurrence to initialize an arbitrary line, and direct polynomial solutions for small N. All of these methods are equivalent, and have different trade-offs in speed, memory usage, and code size. Implementations MAY use any methods they like, as long as they are equivalent to the mathematical definition.

The decoded vector X is recovered as follows. Let i be the index decoded with the procedure in Section 4.1.5 with ft = V(N,K), so that 0 <= i < V(N,K). Let k = K. Then, for j = 0 to (N - 1), inclusive, do:

1.  Let p = (V(N-j-1,k) + V(N-j,k))/2.

2.  If i < p, then let sgn = 1, else let sgn = -1 and set i = i - p.

3.  Let k0 = k and set p = p - V(N-j-1,k).

4.  While p > i, set k = k - 1 and p = p - V(N-j-1,k).

5.  Set X[j] = sgn*(k0 - k) and i = i - p.

The decoded vector X is then normalized such that its L2-norm equals one.

4.3.4.3.  Spreading

[...] Section 4.3.4.2 is then rotated for the purpose of avoiding tonal artifacts. The rotation gain is equal to

      g_r = N / (N + f_r*K)
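Returning to the index decoding of Section 4.3.4.2, the five-step procedure given there can be sketched in C as follows. This is one of the "direct use of the recursive formulation" options and is not how the reference implementation computes V(N,K). The names pvq_size() and pvq_decode(), the table-based computation, and the PVQ_MAX limit are all assumptions made for the sake of a short, self-contained example. The final normalization to unit L2-norm is omitted.

```c
#include <stdint.h>

#define PVQ_MAX 16 /* illustrative limit on N and K for this sketch */

static uint64_t V[PVQ_MAX + 1][PVQ_MAX + 1];

/* Fills V[n][k] using V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1),
 * with V(N,0) = 1 and V(0,K) = 0 for K != 0. */
static void pvq_build_table(void)
{
    for (int n = 0; n <= PVQ_MAX; n++) {
        for (int k = 0; k <= PVQ_MAX; k++) {
            if (k == 0) {
                V[n][k] = 1;
            } else if (n == 0) {
                V[n][k] = 0;
            } else {
                V[n][k] = V[n - 1][k] + V[n][k - 1] + V[n - 1][k - 1];
            }
        }
    }
}

/* Codebook size for dimension N and K pulses (N, K <= PVQ_MAX). */
uint64_t pvq_size(int N, int K)
{
    pvq_build_table();
    return V[N][K];
}

/* Decodes index i (0 <= i < V(N,K)) into the integer vector X[0..N-1]. */
void pvq_decode(uint64_t i, int N, int K, int *X)
{
    pvq_build_table();
    int k = K;
    for (int j = 0; j < N; j++) {
        uint64_t p = (V[N - j - 1][k] + V[N - j][k]) / 2; /* step 1 */
        int sgn = 1;
        if (i >= p) {                                     /* step 2 */
            sgn = -1;
            i -= p;
        }
        int k0 = k;                                       /* step 3 */
        p -= V[N - j - 1][k];
        while (p > i) {                                   /* step 4 */
            k--;
            p -= V[N - j - 1][k];
        }
        X[j] = sgn * (k0 - k);                            /* step 5 */
        i -= p;
    }
}
```

For N = 2 and K = 1, the codebook has V(2,1) = 4 entries, and the indices 0 to 3 decode to (1,0), (0,1), (0,-1), and (-1,0), respectively.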

4.3.4.4.  Split Decoding

4.3.4.5.  Time-Frequency Change

[...] [HADAMARD]. To increase the time resolution by N, N "levels" of the Hadamard transform are applied to the decoded vector for each interleaved MDCT vector. To increase the frequency resolution (which assumes a transient frame), N levels of the Hadamard transform are applied _across_ the interleaved MDCT vector. In the case of increased time resolution, the decoder uses the "sequency order" because the input vector is sorted in time.

4.3.5.  Anti-collapse Processing

4.3.6.  Denormalization

4.3.7.  Inverse MDCT

[...] [PRINCEN86]. The IMDCT and windowing are performed by mdct_backward (mdct.c).

4.3.7.1.  Post-Filter

[...] [GOOGLE-NETEQ] of the Google WebRTC codebase [GOOGLE-WEBRTC] compensates for drift by adding or removing one period when the signal is highly periodic. The reference implementation of Opus allows a caller to learn whether the current frame's signal is highly periodic, and if so what the period is, using the OPUS_GET_PITCH() request.

4.5.  Configuration Switching

4.5.1.  Transition Side Information (Redundancy)

4.5.1.1.  Redundancy Flag

[...] Section 4.1.6.1) to check if there are at least 17 bits remaining. If so, then the frame contains redundancy.

For Hybrid frames, this signaling is explicit. After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() (see Section 4.1.6.1) to ensure there are at least 37 bits remaining. If so, it reads a symbol with the PDF in Table 64, and if the value is 1, then the frame contains redundancy. Otherwise (if there were fewer than 37 bits left or the value was 0), the frame does not contain redundancy.

   +----------------+
   | PDF            |
   +----------------+
   | {4095, 1}/4096 |
   +----------------+

   Table 64: Redundancy Flag PDF

4.5.1.2.  Redundancy Position Flag

Valin, et al. Standards Track [Page 125]

4.5.1.3.  Redundancy Size

[...] Section 4.5.1.1. For Hybrid frames, the number of bytes is equal to 2, plus a decoded unsigned integer less than 256 (see Section 4.1.5). This may be more than the number of whole bytes remaining in the Opus frame, in which case the frame is invalid. However, a decoder is not required to ignore the entire frame, as this may be the result of a bit error that desynchronized the range coder. There may still be useful data before the error, and a decoder MAY keep any audio decoded so far instead of invoking the PLC, but it is RECOMMENDED that the decoder stop decoding and discard the rest of the current Opus frame.

It would have been possible to avoid these invalid states in the design of Opus by limiting the range of the explicit length decoded from Hybrid frames by the actual number of whole bytes remaining. However, this would require an encoder to determine the rate allocation for the MDCT layer up front, before it began encoding that layer. By allowing some invalid sizes, the encoder is able to defer that decision until much later. When encoding Hybrid frames that do not include redundancy, the encoder must still decide up front if it wishes to use the minimum 37 bits required to trigger encoding of the redundancy flag, but this is a much looser restriction.

After determining the size of the redundant CELT frame, the decoder reduces the size of the buffer currently in use by the range coder by that amount. The MDCT layer reads any raw bits from the end of this reduced buffer, and all calculations of the number of bits remaining in the buffer must be done using this new, reduced size, rather than the original size of the Opus frame.
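The bit-count checks of Section 4.5.1.1 and the size rule for Hybrid frames above can be sketched in C as follows. This is only an illustration under stated assumptions: the type and function names are hypothetical, bits_remaining stands for the frame size in bits minus the value reported by ec_tell(), flag_symbol stands for the value decoded with the PDF in Table 64, and the 17-bit threshold is assumed to apply to SILK-only frames (the surviving text implies this by contrast with Hybrid frames). It does not reproduce the reference decoder.

```c
#include <assert.h>

typedef enum { SKETCH_SILK_ONLY, SKETCH_HYBRID } sketch_mode;

/* Minimum number of remaining bits for redundancy to be considered. */
static int redundancy_min_bits(sketch_mode mode)
{
    return mode == SKETCH_HYBRID ? 37 : 17;
}

/* Returns 1 if the frame carries redundancy, 0 otherwise.
 * flag_symbol is only consulted for Hybrid frames that pass the bit
 * check; a real decoder would only decode it in that case. */
static int frame_has_redundancy(sketch_mode mode, int bits_remaining,
                                int flag_symbol)
{
    if (bits_remaining < redundancy_min_bits(mode))
        return 0;
    if (mode == SKETCH_HYBRID)
        return flag_symbol == 1;
    return 1;  /* implicit signaling: enough bits means redundancy */
}

/* Hybrid frames: size is 2 plus a decoded unsigned integer below 256.
 * Returns -1 when the size exceeds the whole bytes remaining, which
 * marks the frame as invalid. */
static int hybrid_redundancy_bytes(unsigned decoded_uint,
                                   int whole_bytes_remaining)
{
    int size;

    assert(decoded_uint < 256);
    size = 2 + (int)decoded_uint;
    return size > whole_bytes_remaining ? -1 : size;
}

/* Buffer size used by the range coder once the redundant CELT frame
 * has been split off the end of the Opus frame. */
static int reduced_buffer_bytes(int frame_bytes, int redundancy_bytes)
{
    assert(redundancy_bytes >= 0 && redundancy_bytes <= frame_bytes);
    return frame_bytes - redundancy_bytes;
}
```

A caller that receives -1 from hybrid_redundancy_bytes() would follow the recommendation above: stop decoding and discard the rest of the frame, optionally keeping audio decoded so far.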

4.5.1.4.  Decoding the Redundancy

4.5.2.  State Reset

4.5.3.  Summary of Transitions

5.  Opus Encoder

5.1.  Range Encoder

5.1.1.  Encoding Symbols

[...] Section 4.1). ec_encode() updates the state of the encoder as follows. If fl[k] is greater than zero, then

                              rng
            val = val + rng - --- * (ft - fl)
                              ft

                  rng
            rng = --- * (fh - fl)
                  ft

Otherwise, val is unchanged and

                        rng
            rng = rng - --- * (ft - fh)
                        ft

The divisions here are integer division.

5.1.1.1.  Renormalization

[...] Section 4.1.2.1, implemented by ec_enc_normalize() (entenc.c). The following process is repeated until rng > 2**23. First, the top 9 bits of val, (val>>23), are sent to the carry buffer, described in Section 5.1.1.2. Then, the encoder sets

            val = (val<<8) & 0x7FFFFFFF
            rng = rng<<8

5.1.1.2.  Carry Propagation and Output Buffering
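To show how the state update of Section 5.1.1 and the renormalization loop of Section 5.1.1.1 feed the carry buffer, here is a minimal C sketch. The struct, its field names beyond val and rng, and the function names are assumptions for illustration; the carry buffer is reduced to a counter plus the last 9-bit value, and the initial range of 2**31 is assumed rather than taken from the surviving text. The no-fl branch uses (ft - fh), which mirrors the decoder-side update.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative encoder state. */
typedef struct {
    uint32_t val;               /* low end of the current range */
    uint32_t rng;               /* size of the current range */
    int carry_symbols;          /* count of 9-bit values sent onward */
    uint32_t last_carry_symbol; /* most recent 9-bit value sent onward */
} sketch_enc;

static void sketch_enc_init(sketch_enc *e)
{
    e->val = 0;
    e->rng = 0x80000000u;  /* assumed initial range of 2**31 */
    e->carry_symbols = 0;
    e->last_carry_symbol = 0;
}

/* Repeats until rng > 2**23, handing the top 9 bits of val to the
 * (stubbed) carry buffer on each iteration. */
static void sketch_enc_normalize(sketch_enc *e)
{
    while (e->rng <= (1u << 23)) {
        e->last_carry_symbol = e->val >> 23;
        e->carry_symbols++;
        e->val = (e->val << 8) & 0x7FFFFFFFu;
        e->rng <<= 8;
    }
}

/* Encodes one symbol given its three-tuple (fl, fh, ft). */
static void sketch_encode(sketch_enc *e, uint32_t fl, uint32_t fh,
                          uint32_t ft)
{
    uint32_t r;

    assert(ft > 0 && fl < fh && fh <= ft);
    r = e->rng / ft;  /* integer division */
    if (fl > 0) {
        e->val += e->rng - r * (ft - fl);
        e->rng = r * (fh - fl);
    } else {
        e->rng -= r * (ft - fh);
    }
    sketch_enc_normalize(e);
}
```

A real encoder would replace the counter with the carry propagation and byte output that this section's heading refers to.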

5.1.3.  Encoding Raw Bits

[...] Section 5.1.5 does this in a way that ensures both the range coded data and the raw bits can be decoded successfully.

5.1.4.  Encoding Uniformly Distributed Integers

[...] Section 4.1.5), it splits up the value into a range coded symbol representing up to 8 of the high bits, and, if necessary, raw bits representing the remainder of the value. ec_enc_uint() takes a two-tuple (t, ft), where t is the unsigned integer to be encoded, 0 <= t < ft, and ft is not necessarily a power of two. Let ftb = ilog(ft - 1), i.e., the number of bits required to store (ft - 1) in two's complement notation. If ftb is 8 or less, then t is encoded directly using ec_encode() with the three-tuple (t, t + 1, ft).

If ftb is greater than 8, then the top 8 bits of t are encoded using the three-tuple (t>>(ftb - 8), (t>>(ftb - 8)) + 1, ((ft - 1)>>(ftb - 8)) + 1), and the remaining bits, (t & ((1<<(ftb - 8)) - 1)), are encoded as raw bits with ec_enc_bits().

5.1.5.  Finalizing the Stream
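Before the finalization steps, the split performed by ec_enc_uint() in Section 5.1.4 can be made concrete with a short C sketch. It only computes the three-tuple that would be handed to ec_encode() and the raw bits that would be handed to ec_enc_bits(); it does not encode anything. The names sketch_ilog, uint_split, and split_uint are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ilog(x): number of bits needed to represent x, with ilog(0) = 0. */
static int sketch_ilog(uint32_t x)
{
    int n = 0;

    while (x > 0) {
        n++;
        x >>= 1;
    }
    return n;
}

typedef struct {
    uint32_t fl, fh, ft;  /* three-tuple for the range coded symbol */
    uint32_t raw;         /* value to send as raw bits, if any */
    int raw_bits;         /* number of raw bits; 0 means none */
} uint_split;

/* Splits t (0 <= t < ft) into a range coded part and a raw part. */
static uint_split split_uint(uint32_t t, uint32_t ft)
{
    uint_split s;
    int ftb;

    assert(ft > 0 && t < ft);
    ftb = sketch_ilog(ft - 1);
    if (ftb <= 8) {
        /* Small range: the value is coded directly as one symbol. */
        s.fl = t;
        s.fh = t + 1;
        s.ft = ft;
        s.raw = 0;
        s.raw_bits = 0;
    } else {
        /* Large range: top 8 bits as a symbol, the rest as raw bits. */
        int shift = ftb - 8;

        s.fl = t >> shift;
        s.fh = s.fl + 1;
        s.ft = ((ft - 1) >> shift) + 1;
        s.raw = t & ((1u << shift) - 1);
        s.raw_bits = shift;
    }
    return s;
}
```

For example, with ft = 1000 and t = 999, ftb is 10, so the symbol is (249, 250, 250) and the two low bits of t (value 3) are sent raw.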

[...] Section 5.1.1.2, and end is updated via

            end = (end<<8) & 0x7FFFFFFF

Finally, if the buffered output byte, rem, is neither zero nor the special value -1, or the carry count, ext, is greater than zero, then 9 zero bits are sent to the carry buffer to flush it to the output buffer. When outputting the final byte from the range coder, if it would overlap any raw bits already packed into the end of the output buffer, they should be ORed into the same byte. The bit allocation routines in the CELT layer should ensure that this can be done without corrupting the range coder data so long as end is chosen as described above. If there is any space between the end of the range coder data and the end of the raw bits, it is padded with zero bits. This entire process is implemented by ec_enc_done() (entenc.c).

5.1.6.  Current Bit Usage

5.2.  SILK Encoder

[...] Section 4.2. Details such as the quantization and range coder tables can be found there, while this section describes the high-level design choices that were made. The diagram below shows the basic modules of the SILK encoder.

            +----------+    +--------+    +---------+
            |  Sample  |    | Stereo |    |  SILK   |
     ------>|   Rate   |--->| Mixing |--->|  Core   |---------->
     Input  |Conversion|    |        |    | Encoder |  Bitstream
            +----------+    +--------+    +---------+

                        Figure 21: SILK Encoder

5.2.1.  Sample Rate Conversion

5.2.2.  Stereo Mixing

5.2.3.  SILK Core Encoder

5.2.3.1.  Voice Activity Detection

5.2.3.2.  Pitch Analysis

5.2.3.3.  Noise Shaping Analysis

5.2.3.4.  Prediction Analysis

[...] Section 5.2.3.4.1 and Section 5.2.3.4.2, respectively. Inputs to this function include the pre-whitened signal from the pitch estimator (see Section 5.2.3.2).

5.2.3.4.1.  Voiced Speech

[...] Section 5.2.3.6, and the quantized LTP coefficients are used to compute the LTP residual signal. This LTP residual signal is the input to an LPC analysis where the LPC coefficients are estimated using Burg's method [BURG], such that the residual energy is minimized. The estimated LPC coefficients are converted to a Line Spectral Frequency (LSF) vector and quantized as described in Section 5.2.3.5. After quantization, the quantized LSF vector is converted back to LPC coefficients using the full procedure in Section 4.2.7.5. By using quantized LTP coefficients and LPC [...]

5.2.3.4.2.  Unvoiced Speech

5.2.3.4.2.1.  Burg's Method

[...] [SCHUR], but with a simple update to the autocorrelations after finding each reflection coefficient to make the result identical to Burg's method. This brings down the complexity of Burg's method to near that of the autocorrelation method.

The second difference is that the signal in each subframe is scaled by the inverse of the residual quantization step size. Subframes with a small quantization step size will, on average, spend more bits for a given amount of residual energy than subframes with a large step size. Without scaling, Burg's method minimizes the total residual energy in all subframes, which doesn't necessarily minimize the total number of bits needed for coding the quantized residual. The residual energy of the scaled subframes is a better measure for that number of bits.

5.2.3.5.  LSF Quantization

[...] [LAROIA-ICASSP]). These weights are referred to here as Laroia weights.

The LSF quantizer consists of two stages. The first stage is an (unweighted) vector quantizer (VQ), with a codebook size of 32 vectors. The quantization errors for the codebook vectors are sorted, and for the N best vectors a second stage quantizer is run. By varying the number N, a trade-off is made between R-D performance and computational efficiency. For each of the N codebook vectors, the Laroia weights corresponding to that vector (and not to the input vector) are calculated. Then, the residual between the input LSF vector and the codebook vector is scaled by the square roots of these Laroia weights. This scaling partially normalizes error sensitivity for the residual vector so that a uniform quantizer with fixed step sizes can be used in the second stage without too much performance loss. Additionally, by scaling with Laroia weights determined from the first-stage codebook vector, the process can be reversed in the decoder.

The second stage uses predictive delayed decision scalar quantization. The quantization error is weighted by Laroia weights determined from the LSF input vector. The predictor multiplies the previous quantized residual value by a prediction coefficient that depends on the vector index from the first stage VQ and on the location in the LSF vector. The prediction is subtracted from the LSF residual value before quantizing the result and is added back afterwards. This subtraction can be interpreted as shifting the quantization levels of the scalar quantizer, and as a result the quantization error of each value depends on the quantization decision of the previous value. This dependency is exploited by the delayed decision mechanism to search for a quantization sequence with the best R-D performance using a Viterbi-like algorithm [VITERBI].

The quantizer processes the residual LSF vector in reverse order (i.e., it starts with the highest residual LSF value). This is done because the prediction works slightly better in the reverse direction.

5.2.3.5.1.  LSF Stabilization

[...] Section 4.2.7.5.4) to ensure the LSF parameters are within their valid range, increasingly sorted, and have minimum distances between each other and the border values.

5.2.3.6.  LTP Quantization

[...] Section 5.2.3.4.1 resulted in four sets (one set per subframe) of five LTP coefficients, plus four weighting matrices. The LTP coefficients for each subframe are quantized using entropy constrained vector quantization. A total of three vector codebooks are available for quantization, with different rate-distortion trade-offs. The three codebooks have 10, 20, and 40 vectors and average rates of about 3, 4, and 5 bits per vector, respectively. Consequently, the first codebook has larger average quantization distortion at a lower rate, whereas the last codebook has smaller average quantization distortion at a higher rate. Given the weighting matrix W_ltp and LTP vector b, the weighted rate-distortion measure for a codebook vector cb_i with rate r_i is given by

            RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i

where u is a fixed, heuristically determined parameter balancing the distortion and rate. Which codebook gives the best performance for a given LTP vector depends on the weighting matrix for that LTP vector. For example, for a low valued W_ltp, it is advantageous to use the codebook with 10 vectors as it has a lower average rate. For a large W_ltp, on the other hand, it is often better to use the codebook with 40 vectors, as it is more likely to contain the best codebook vector.

The weighting matrix W_ltp depends mostly on two aspects of the input signal. The first is the periodicity of the signal; the more periodic, the larger W_ltp. The second is the change in signal energy in the current subframe, relative to the signal one pitch lag earlier. A decaying energy leads to a larger W_ltp than an increasing energy. Both aspects fluctuate relatively slowly, which causes the W_ltp matrices for different subframes of one frame often [...]
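The rate-distortion measure RD given in Section 5.2.3.6 can be evaluated with a few lines of C. The sketch below is illustrative only: it uses floating point, stores W_ltp row-major in a flat array, and searches a single codebook exhaustively. The function names and the toy numbers in ltp_demo() are made up for this example and are not taken from the SILK codebooks.

```c
#include <assert.h>

#define LTP_TAPS 5  /* five LTP coefficients per subframe */

/* RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i, with W stored
 * row-major as LTP_TAPS * LTP_TAPS values. */
static double ltp_rd(const double *b, const double *cb_i,
                     const double *W, double r_i, double u)
{
    double d[LTP_TAPS];
    double quad = 0.0;
    int j, k;

    for (j = 0; j < LTP_TAPS; j++)
        d[j] = b[j] - cb_i[j];
    for (j = 0; j < LTP_TAPS; j++)
        for (k = 0; k < LTP_TAPS; k++)
            quad += d[j] * W[j * LTP_TAPS + k] * d[k];
    return u * quad + r_i;
}

/* Returns the index of the codebook vector with the smallest RD.
 * cb holds n vectors of LTP_TAPS values each; rates holds n rates. */
static int ltp_best_vector(const double *b, const double *cb,
                           const double *rates, int n,
                           const double *W, double u)
{
    int i, best = 0;
    double best_rd;

    assert(n > 0);
    best_rd = ltp_rd(b, cb, W, rates[0], u);
    for (i = 1; i < n; i++) {
        double rd = ltp_rd(b, cb + i * LTP_TAPS, W, rates[i], u);

        if (rd < best_rd) {
            best_rd = rd;
            best = i;
        }
    }
    return best;
}

/* Toy usage with made-up data: identity weighting, two codebook
 * entries (a cheap all-zero vector and an exact but costlier match). */
static int ltp_demo(double u)
{
    static const double b[LTP_TAPS] = {1, 0, 0, 0, 0};
    static const double cb[2 * LTP_TAPS] = {0, 0, 0, 0, 0,
                                            1, 0, 0, 0, 0};
    static const double rates[2] = {1.0, 3.0};
    double W[LTP_TAPS * LTP_TAPS] = {0};
    int j;

    for (j = 0; j < LTP_TAPS; j++)
        W[j * LTP_TAPS + j] = 1.0;
    return ltp_best_vector(b, cb, rates, 2, W, u);
}
```

In the toy data, a large u (distortion dominates) selects the exact match at index 1, while a small u (rate dominates) selects the cheaper vector at index 0, which is the same trade-off the text describes for small and large W_ltp.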

5.2.3.7.  Pre-filter

[...] Section 5.2.3.3). By applying only the noise shaping analysis filter to the input signal, it provides the input to the noise shaping quantizer.

5.2.3.8.  Noise Shaping Quantizer

5.3.2.  Bands and Normalization

5.3.3.  Energy Envelope Quantization

5.3.4.  Bit Allocation

[...] Section 4.3.3. The three mechanisms that can be used by the encoder to adjust the bitrate on a frame-by-frame basis are band boost, allocation trim, and band skipping.

5.3.4.1.  Band Boost

5.3.6.  Time-Frequency Decision

[...] Section 4.3.4.5 is based on R-D optimization. The distortion is the L1 norm (sum of absolute values) of each band after each TF resolution under consideration. The L1 norm is used because it represents the entropy for a Laplacian source. The number of bits required to code a change in TF resolution between two bands is higher than the cost of having those two bands use the same resolution, which is what requires the R-D optimization. The optimal decision is computed using the Viterbi algorithm. See tf_analysis() in celt/celt.c.

5.3.7.  Spreading Values Decision
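The Viterbi search used for the time-frequency decision of Section 5.3.6 can be illustrated with a generic two-state version in C. This is a simplified sketch, not tf_analysis(): each band has one cost per candidate resolution (standing in for the L1 norm), a fixed change_cost models the extra bits for switching resolution between adjacent bands, and all names and the toy costs in tf_demo() are hypothetical.

```c
#include <assert.h>

#define SKETCH_MAX_BANDS 32

/* Finds the per-band choice (0 or 1) minimizing the sum of band costs
 * plus change_cost for every change between adjacent bands.
 * Writes the choices to choice[0..nbands-1]; returns the total cost. */
static int tf_viterbi(const int *cost0, const int *cost1, int nbands,
                      int change_cost, int *choice)
{
    int from0[SKETCH_MAX_BANDS];  /* best predecessor when in state 0 */
    int from1[SKETCH_MAX_BANDS];  /* best predecessor when in state 1 */
    int best0, best1, b, s;

    assert(nbands > 0 && nbands <= SKETCH_MAX_BANDS);
    best0 = cost0[0];
    best1 = cost1[0];
    for (b = 1; b < nbands; b++) {
        int stay0 = best0, switch0 = best1 + change_cost;
        int stay1 = best1, switch1 = best0 + change_cost;

        from0[b] = stay0 <= switch0 ? 0 : 1;
        from1[b] = stay1 <= switch1 ? 1 : 0;
        best0 = (from0[b] == 0 ? stay0 : switch0) + cost0[b];
        best1 = (from1[b] == 1 ? stay1 : switch1) + cost1[b];
    }
    /* Backtrack from the cheaper final state. */
    s = best0 <= best1 ? 0 : 1;
    for (b = nbands - 1; b >= 0; b--) {
        choice[b] = s;
        if (b > 0)
            s = s ? from1[b] : from0[b];
    }
    return best0 <= best1 ? best0 : best1;
}

/* Toy usage with made-up costs for three bands. */
static int tf_demo(int change_cost, int *choice3)
{
    static const int cost0[3] = {0, 5, 0};
    static const int cost1[3] = {5, 0, 5};

    return tf_viterbi(cost0, cost1, 3, change_cost, choice3);
}
```

With an expensive change (change_cost = 10), the search keeps one resolution for all bands; with a cheap change (change_cost = 1), it switches for the middle band.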

5.3.8.  Spherical Vector Quantization

[...] [PVQ] for quantizing the details of the spectrum in each band that have not been predicted by the pitch predictor. The PVQ codebook consists of all sums of K signed pulses in a vector of N samples, where two pulses at the same position are required to have the same sign. Thus, the codebook includes all integer codevectors y of N dimensions that satisfy sum(abs(y(j))) = K.

In bands where there are sufficient bits allocated, PVQ is used to encode the unit vector that results from the normalization in Section 5.3.2 directly. Given a PVQ codevector y, the unit vector X is obtained as X = y/||y||, where ||.|| denotes the L2 norm.

5.3.8.1.  PVQ Search
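As a small aid to the codebook definition in Section 5.3.8, the following C sketch tests codebook membership and counts codevectors by brute force for tiny N and K. It is not a search or encoding routine; the function names are hypothetical, and the exhaustive recursion is only practical for very small sizes.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if y (n integers) satisfies sum(abs(y[j])) == k. */
static int pvq_is_codevector(const int *y, int n, int k)
{
    int j, sum = 0;

    for (j = 0; j < n; j++)
        sum += abs(y[j]);
    return sum == k;
}

/* Counts the integer vectors of n dimensions whose absolute values
 * sum to exactly k, by trying every value for the first position. */
static unsigned long pvq_count(int n, int k)
{
    unsigned long total = 0;
    int v;

    assert(n >= 0 && k >= 0);
    if (n == 0)
        return k == 0 ? 1 : 0;
    for (v = -k; v <= k; v++)
        total += pvq_count(n - 1, k - abs(v));
    return total;
}

/* Squared L2 norm of y. Dividing each y[j] by the square root of this
 * value gives the unit vector X = y/||y||. */
static long pvq_norm2(const int *y, int n)
{
    long acc = 0;
    int j;

    for (j = 0; j < n; j++)
        acc += (long)y[j] * y[j];
    return acc;
}
```

For N = 2 and K = 2 the count is 8: (2,0), (-2,0), (0,2), (0,-2), and the four sign combinations of (1,1). The same-sign rule for pulses sharing a position is what makes each such integer vector appear exactly once.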

5.3.8.2.  PVQ Encoding

[...] Section 5.1.4 with ft = V(N,K).

6.  Conformance

[...] Appendix A. Although this document includes a prose description of the codec, should the description contradict the source code of the reference implementation, the latter shall take precedence.

Compliance with this specification means that, in addition to following the normative keywords in this document, a decoder's output MUST also be within the thresholds specified by the opus_compare.c tool (included with the code) when compared to the reference implementation for each of the test vectors provided (see Appendix A.4) and for each output sampling rate and channel count supported. In addition, a compliant decoder implementation MUST have the same final range decoder state as that of the reference decoder. It is therefore RECOMMENDED that the decoder implement the same functional behavior as the reference. A decoder implementation is not required to support all output sampling rates or all output channel counts.

6.1.  Testing

[...] Appendix A, a test vector can be decoded with

   opus_demo -d <rate> <channels> testvectorX.bit testX.out

where <rate> is the sampling rate and can be 8000, 12000, 16000, 24000, or 48000, and <channels> is 1 for mono or 2 for stereo.

6.2.  Opus Custom

7.  Security Considerations

[...] [DOS]. It is extremely important for the decoder to be robust against malicious payloads. Malicious payloads must not cause the decoder to overrun its allocated memory or to take an excessive amount of resources to decode. Although problems in encoders are typically rarer, the same applies to the encoder. Malicious audio streams must not cause the encoder to misbehave because this would allow an attacker to attack transcoding gateways.

The reference implementation contains no known buffer overflows or cases where a specially crafted packet or audio segment could cause a significant increase in CPU load. However, on certain CPU architectures where denormalized floating-point operations are much slower than normal floating-point operations, it is possible for some audio content (e.g., silence or near silence) to cause an increase in CPU load. Denormals can be introduced by reordering operations in the compiler and depend on the target architecture, so it is difficult to guarantee that an implementation avoids them. For architectures on which denormals are problematic, adding very small floating-point offsets to the affected signals to prevent significant numbers of denormalized operations is RECOMMENDED. Alternatively, it is often possible to configure the hardware to treat denormals as zero (DAZ). No such issue exists for the fixed-point reference implementation.

The reference implementation was validated in the following conditions:

1.  Sending the decoder valid packets generated by the reference encoder and verifying that the decoder's final range coder state matches that of the encoder.

[...] [VALGRIND] memory debugger, which tracks reads and writes to invalid memory regions as well as the use of uninitialized memory. There were no errors reported on any of the tested conditions.

8.  Acknowledgements

[LAROIA-ICASSP]  Laroia, R., Phamdo, N., and N. Farvardin, "Robust and Efficient Quantization of Speech LSP Parameters Using Structured Vector Quantization", ICASSP-1991, Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 641-644, October 1991.

[LPC]  Wikipedia, "Linear Prediction", <http://en.wikipedia.org/w/index.php?title=Linear_prediction&oldid=497201278>.

[MARTIN79]  Martin, G., "Range encoding: An algorithm for removing redundancy from a digitised message", Proc. Institution of Electronic and Radio Engineers International Conference on Video and Data Recording, 1979.

[MATROSKA-WEBSITE]  "Matroska website", <http://matroska.org/>.

[MDCT]  Wikipedia, "Modified Discrete Cosine Transform", <http://en.wikipedia.org/w/index.php?title=Modified_discrete_cosine_transform&oldid=490295438>.

[OPUS-GIT]  "Opus Git Repository", <https://git.xiph.org/opus.git>.

[OPUS-WEBSITE]  "Opus website", <http://opus-codec.org/>.

[PRINCEN86]  Princen, J. and A. Bradley, "Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation", IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-34 (5), pp. 1153-1161, October 1986.

[PVQ]  Fischer, T., "A Pyramid Vector Quantizer", IEEE Trans. on Information Theory, Vol. 32, pp. 568-583, July 1986.

[RANGE-CODING]  Wikipedia, "Range Coding", <http://en.wikipedia.org/w/index.php?title=Range_encoding&oldid=509582757>.

[REQUIREMENTS]  Valin, JM. and K. Vos, "Requirements for an Internet Audio Codec", RFC 6366, August 2011.

[WHITENING]  Wikipedia, "White Noise", <http://en.wikipedia.org/w/index.php?title=White_noise&oldid=497791998>.

[Z-TRANSFORM]  Wikipedia, "Z-transform", <http://en.wikipedia.org/w/index.php?title=Z-transform&oldid=508392884>.

[ZWICKER61]  Zwicker, E., "Subdivision of the Audible Frequency Range into Critical Bands", The Journal of the Acoustical Society of America, Vol. 33, No. 2, p. 248, February 1961.

Appendix A.  Reference Implementation

[...] [FFT] used is a slightly modified version of the KISS-FFT library, but it is easy to substitute any other FFT library.

While the reference implementation does not rely on any _undefined behavior_ as defined by C89 or C99, it relies on common _implementation-defined behavior_ for two's complement architectures:

o  Right shifts of negative values are consistent with two's complement arithmetic, so that a>>b is equivalent to floor(a/(2**b)),

o  For conversion to a signed integer of N bits, the value is reduced modulo 2**N to be within range of the type,

o  The result of integer division of a negative value is truncated towards zero, and

o  The compiler provides a 64-bit integer type (a C99 requirement which is supported by most C89 compilers).

In its current form, the reference implementation also requires the following architectural characteristics to obtain acceptable performance:

o  Two's complement arithmetic,

o  At least a 16 bit by 16 bit integer multiplier (32-bit result), and

o  At least a 32-bit adder/accumulator.
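The implementation-defined behaviors listed above can be probed at run time on a given compiler and target. The C sketch below is one way to do so; the function names are hypothetical and the checks are not part of the reference implementation. The volatile qualifiers are there only to discourage the compiler from folding the expressions at compile time.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* a>>b should equal floor(a/(2**b)) for negative a. */
static int right_shift_is_arithmetic(void)
{
    volatile int32_t a = -5;

    return (a >> 1) == -3;  /* floor(-5/2) = -3 */
}

/* Conversion to a narrower signed type should reduce modulo 2**N. */
static int narrowing_is_modular(void)
{
    volatile int32_t v = 0x18000;  /* 65536 + 32768 */

    return (int16_t)v == -32768;
}

/* Division of a negative value should truncate towards zero. */
static int division_truncates(void)
{
    volatile int32_t a = -7;

    return (a / 2) == -3 && (a % 2) == -1;
}

/* A 64-bit integer type should be available. */
static int has_64bit_integer(void)
{
    return sizeof(long long) * CHAR_BIT >= 64;
}

/* Returns 1 only if every assumption holds on this platform. */
static int platform_matches_assumptions(void)
{
    return right_shift_is_arithmetic() && narrowing_is_modular() &&
           division_truncates() && has_64bit_integer();
}
```

A port to an unusual compiler could call platform_matches_assumptions() once at startup or in a unit test before relying on bit-exact decoder output.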

A.1.  Extracting the Source

[...]

o  cat rfc6716.txt | grep '^\ \ \ ###' | sed -e 's/...###//' | base64 --decode > opus-rfc6716.tar.gz

o  tar xzvf opus-rfc6716.tar.gz

o  cd opus-rfc6716

o  make

On systems where the provided Makefile does not work, the following command line may be used to compile the source code:

o  cc -O2 -g -o opus_demo src/opus_demo.c `cat *.mk | grep -v fixed | sed -e 's/.*=//' -e 's/\\\\//'` -DOPUS_BUILD -Iinclude -Icelt -Isilk -Isilk/float -DUSE_ALLOCA -Drestrict= -lm

On systems where the base64 utility is not present, the following commands can be used instead:

o  cat rfc6716.txt | grep '^\ \ \ ###' | sed -e 's/...###//' > opus.b64

o  openssl base64 -d -in opus.b64 > opus-rfc6716.tar.gz

The SHA1 hash of the opus-rfc6716.tar.gz file is 86a927223e73d2476646a1b933fcd3fffb6ecc8c.

A.2.  Up-to-Date Implementation

[...] [OPUS-GIT]. Releases and other resources are available at [OPUS-WEBSITE]. However, although that implementation is expected to remain conformant with the RFC, it is the code in this document that shall remain normative.

A.3.  Base64-Encoded Source Code


RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 290]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 291]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 292]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 293]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 294]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 295]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 296]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 297]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 298]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 299]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 300]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 301]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 302]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 303]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 304]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 305]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 306]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 307]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 308]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 309]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 310]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 311]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 312]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 313]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 314]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 315]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 316]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 317]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 318]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 319]

RFC 6716 Interactive Audio Codec September 2012

Valin, et al. Standards Track [Page 320]

A.4.  Test Vectors

The Opus test vectors are available from [VECTORS-PROC] and from the Opus codec website at [VECTORS-WEBSITE]. These test vectors were created specifically to exercise all aspects of the decoder. Therefore, the audio quality of the decoded output is significantly lower than what Opus can achieve in normal operation.

The SHA1 hashes of the files in the test vector package are:

e49b2862ceec7324790ed8019eb9744596d5be01 testvector01.bit
b809795ae1bcd606049d76de4ad24236257135e0 testvector02.bit
e0c4ecaeab44d35a2f5b6575cd996848e5ee2acc testvector03.bit
a0f870cbe14ebb71fa9066ef3ee96e59c9a75187 testvector04.bit
9b3d92b48b965dfe9edf7b8a85edd4309f8cf7c8 testvector05.bit
28e66769ab17e17f72875283c14b19690cbc4e57 testvector06.bit
bacf467be3215fc7ec288f29e2477de1192947a6 testvector07.bit
ddbe08b688bbf934071f3893cd0030ce48dba12f testvector08.bit
3932d9d61944dab1201645b8eeaad595d5705ecb testvector09.bit
521eb2a1e0cc9c31b8b740673307c2d3b10c1900 testvector10.bit
6bc8f3146fcb96450c901b16c3d464ccdf4d5d96 testvector11.bit
338c3f1b4b97226bc60bc41038becbc6de06b28f testvector12.bit
a20a2122d42de644f94445e20185358559623a1f testvector01.dec
48ac1ff1995250a756e1e17bd32acefa8cd2b820 testvector02.dec
d15567e919db2d0e818727092c0af8dd9df23c95 testvector03.dec
1249dd28f5bd1e39a66fd6d99449dca7a8316342 testvector04.dec
93eee37e5d26a456d2c24483060132ff7eae2143 testvector05.dec
a294fc17e3157768c46c5ec0f2116de0d2c37ee2 testvector06.dec
2bf550e2f072e0941438db3f338fe99444385848 testvector07.dec
2695c1f2d1f9748ea0bf07249c70fd7b87f61680 testvector08.dec
12862add5d53a9d2a7079340a542a2f039b992bb testvector09.dec
a081252bb2b1a902fdc500530891f47e2a373d84 testvector10.dec
dfd0f844f2a42df506934fac2100a3c03beec711 testvector11.dec
8c16b2a1fb60e3550ba165068f9d7341357fdb63 testvector12.dec

Appendix B.  Self-Delimiting Framing

To use the internal framing described in Section 3, the decoder must know the total length of the Opus packet, in bytes.
This section describes a simple variation of that framing that can be used when the total length of the packet is not known. Nothing in the encoding of the packet itself allows a decoder to distinguish between the regular, undelimited framing and the self-delimiting framing described in this appendix. Which one is used and where must be established by context at the transport layer.

The self-delimiting framing is identical to the regular, undelimited framing from Section 3, except that each Opus packet contains one extra length field, encoded using the same one- or two-byte scheme from Section 3.2.1. This extra length immediately precedes the compressed data of the first Opus frame in the packet, and is interpreted in the various modes as follows:

o  Code 0 packets: It is the length of the single Opus frame (see Figure 25).

o  Code 1 packets: It is the length used for both of the Opus frames (see Figure 26).

o  Code 2 packets: It is the length of the second Opus frame (see Figure 27).

o  CBR Code 3 packets: It is the length used for all of the Opus frames (see Figure 28).

o  VBR Code 3 packets: It is the length of the last Opus frame (see Figure 29).
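
As an illustration of how a transport layer might use the extra length field to find where a self-delimited packet ends, the following is a minimal, non-normative sketch in C. It is not taken from the reference implementation: the function names read_frame_length() and self_delimited_size() are hypothetical, and the sketch assumes the one- or two-byte length coding of Section 3.2.1 (values 0 to 251 are the length itself; values 252 to 255 mean the length is 4 times the second byte plus the first byte) and that the packet code occupies the two low-order bits of the TOC byte. Only Code 0, 1, and 2 packets are handled; Code 3 packets, which also carry a frame count byte and optional padding, are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Decode the one- or two-byte length field of Section 3.2.1.
 * Stores the length in *len and returns the number of bytes consumed
 * (1 or 2), or -1 if fewer bytes are available than the field needs. */
static int read_frame_length(const unsigned char *p, size_t avail, int *len)
{
    assert(p != NULL && len != NULL);
    if (avail < 1)
        return -1;
    if (p[0] < 252) {           /* 0..251: the byte is the length itself */
        *len = p[0];
        return 1;
    }
    if (avail < 2)
        return -1;
    *len = 4 * p[1] + p[0];     /* 252..255: a second byte follows */
    return 2;
}

/* Hypothetical helper: total size in bytes of one self-delimited packet
 * that starts at pkt, for Code 0, 1, and 2 packets only.  Returns -1 for
 * Code 3 packets (not handled in this sketch) and for truncated input. */
static long self_delimited_size(const unsigned char *pkt, size_t avail)
{
    int n1, n2, len1 = 0, len2 = 0;
    long total;

    assert(pkt != NULL);
    if (avail < 1)
        return -1;
    switch (pkt[0] & 0x3) {     /* packet code: low two bits of the TOC byte */
    case 0:                     /* extra length = the single frame */
        n1 = read_frame_length(pkt + 1, avail - 1, &len1);
        if (n1 < 0)
            return -1;
        total = 1L + n1 + len1;
        break;
    case 1:                     /* extra length = each of the two frames */
        n1 = read_frame_length(pkt + 1, avail - 1, &len1);
        if (n1 < 0)
            return -1;
        total = 1L + n1 + 2L * len1;
        break;
    case 2:                     /* regular first-frame length, then the
                                 * extra length for the second frame */
        n1 = read_frame_length(pkt + 1, avail - 1, &len1);
        if (n1 < 0)
            return -1;
        n2 = read_frame_length(pkt + 1 + n1, avail - 1 - n1, &len2);
        if (n2 < 0)
            return -1;
        total = 1L + n1 + n2 + len1 + len2;
        break;
    default:                    /* Code 3: outside the scope of this sketch */
        return -1;
    }
    return (size_t)total <= avail ? total : -1;
}
```

A caller that has concatenated several self-delimited packets in one buffer could call self_delimited_size() at the current offset and advance by the returned value to reach the next packet. A negative return value indicates that the sketch cannot determine the packet boundary.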
