MP3 Overview


MPEG-1 Audio Layer 3, better known as MP3, is a lossy compressed digital audio format developed by the Moving Picture Experts Group (MPEG) as part of version 1 (and later extended in version 2) of the MPEG video format. A typical MP3 file uses a 44.1 kHz sampling rate and a 128 kbps bitrate as a compromise between quality and size. The name is an abbreviation of MPEG-1 Audio Layer 3, and the term should not be confused with an MP3 player.


This format was developed mainly by Karlheinz Brandenburg, director of electronic media technologies at the Fraunhofer IIS institute, part of the Fraunhofer-Gesellschaft (a network of German research centers), which, together with Thomson Multimedia, controls the bulk of MP3-related patents. The first patent was registered in 1986 and several more followed in 1991. It was not until July 1995, however, that Brandenburg first used the .mp3 extension for the MP3-related files he kept on his computer. A year later his institute earned 1.2 million euros from the patents; ten years later the figure had reached 26.1 million.

The MP3 format became the standard for streaming audio and for high-quality lossy audio compression (the loss being noticeable mainly on hi-fi equipment) because the compression quality can be adjusted. Quality is proportional to the size per second (the bitrate) and therefore to the final size of the file, which can be 12 or even 15 times smaller than the original uncompressed file.
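The relation between bitrate, file size and compression ratio can be checked with a little arithmetic. The sketch below assumes the uncompressed original is CD-quality PCM (44.1 kHz, 16 bits, stereo), which the text does not state; the function names are illustrative only.

```python
def pcm_bitrate_kbps(sample_rate_hz=44100, bits_per_sample=16, channels=2):
    """Bitrate of uncompressed PCM audio (defaults assume CD quality)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def mp3_size_bytes(bitrate_kbps, seconds):
    """Approximate size of a constant-bitrate stream, ignoring tags."""
    return bitrate_kbps * 1000 * seconds / 8


def compression_ratio(mp3_bitrate_kbps, **pcm_params):
    """How many times smaller the MP3 is than the assumed PCM original."""
    return pcm_bitrate_kbps(**pcm_params) / mp3_bitrate_kbps


if __name__ == "__main__":
    for rate in (96, 112, 128, 192):
        print(rate, "kbps ->", round(compression_ratio(rate), 1), "times smaller")
```

Under these assumptions, ratios in the 12 to 15 range correspond to bitrates of roughly 96 to 112 kbps.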

It was the first audio compression format popularized thanks to the Internet, since it made the exchange of music files possible. Legal proceedings against companies such as Napster and AudioGalaxy are the result of the ease with which this type of file is shared.

With the development of stand-alone and portable players, and of players integrated into stereo systems, the MP3 format reached beyond the world of computing.

At the beginning of 2002, other compressed audio formats such as Windows Media Audio and Ogg Vorbis began to be widely included in programs, operating systems and stand-alone players, which led to the expectation that MP3 would gradually fall into disuse in favor of those formats, which offer much better quality. One factor in the decline of MP3 is that it is covered by patents. Technically, this does not make its quality inferior or superior, but it prevents the community from continuing to improve it and may require payment for the use of a codec, as happens with MP3 players. Even so, at the end of 2009 MP3 remained the most widely used and most successful format.

Technical details

This layer differs in several ways from Layers 1 and 2 of the MPEG-1 and MPEG-2 standards, among them the so-called hybrid filter bank, which makes its design more complex. The improved frequency resolution worsens the temporal resolution and introduces pre-echo problems, which have to be predicted and corrected. In return, it enables good audio quality at rates as low as 64 kbps.

Filter bank

The filter bank used in this layer is the so-called hybrid polyphase/MDCT filter bank. It maps between the time domain and the frequency domain, both in the encoder and in the reconstruction filters of the decoder. The output samples of the bank are quantized and provide a variable frequency resolution, 6×32 or 18×32 frequency lines, which fits the critical bands of the different frequencies much better. With 18 points per subband, the maximum number of frequency components is 32 × 18 = 576, giving a frequency resolution of 24000/576 = 41.67 Hz (for fs = 48 kHz). If 6 frequency lines are used, the frequency resolution is lower but the temporal resolution is higher; this setting is applied in areas where pre-echo effects are expected (abrupt transitions from silence to high energy levels).
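The arithmetic above can be reproduced in a few lines. This is only a worked example of the figures in the text; the function names are made up for illustration.

```python
def frequency_lines(subbands=32, mdct_points=18):
    """Total number of frequency lines produced by the hybrid filter bank."""
    return subbands * mdct_points


def frequency_resolution_hz(sample_rate_hz, lines):
    """Width of one frequency line: the usable bandwidth (fs / 2) split evenly."""
    return (sample_rate_hz / 2) / lines


if __name__ == "__main__":
    fs = 48000
    for points in (18, 6):  # long blocks use 18 points, short blocks use 6
        lines = frequency_lines(mdct_points=points)
        print(points, "points:", lines, "lines,",
              round(frequency_resolution_hz(fs, lines), 2), "Hz per line")
```

With 6 points per subband there are 192 lines, each 125 Hz wide at fs = 48 kHz, which is the coarser frequency resolution traded for better time resolution.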

Layer III has three block modes of operation: two modes in which all 32 filter bank outputs pass through windows and MDCT transforms of the same block length (all long or all short), and a mixed block mode in which the two lowest frequency bands use long blocks and the top 30 bands use short blocks. MPEG-1 Audio Layer 3 (that is, the third audio layer of the MPEG-1 standard) specifies four types of windows: (a) NORMAL, (b) transition from long to short window (START), (c) 3 short windows (SHORT), and (d) transition from short to long window (STOP).
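The four window types can be sketched as follows. The shapes are the sine-based windows usually quoted for ISO/IEC 11172-3 (36 samples for long blocks, 12 for each short block); the text does not give them, so treat them as an assumption and check them against the standard before relying on them.

```python
import math

LONG = 36   # samples covered by a long-block window
SHORT = 12  # samples covered by each of the three short windows


def sine(i, n):
    """Sine window sample i of an n-sample window."""
    return math.sin(math.pi / n * (i + 0.5))


def window(block_type):
    """Return the window coefficients for one of the four window types."""
    if block_type == "NORMAL":
        return [sine(i, LONG) for i in range(LONG)]
    if block_type == "START":  # long rise, flat top, short fall, zeros
        return ([sine(i, LONG) for i in range(18)]
                + [1.0] * 6
                + [sine(i - 18, SHORT) for i in range(24, 30)]
                + [0.0] * 6)
    if block_type == "STOP":   # zeros, short rise, flat top, long fall
        return ([0.0] * 6
                + [sine(i - 6, SHORT) for i in range(6, 12)]
                + [1.0] * 6
                + [sine(i, LONG) for i in range(18, 36)])
    if block_type == "SHORT":  # applied three times, with overlap
        return [sine(i, SHORT) for i in range(SHORT)]
    raise ValueError(f"unknown block type: {block_type}")
```

The START and STOP shapes join a long half-window to a short half-window, which is what lets the encoder switch block lengths without breaking the overlap between consecutive blocks.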

The psychoacoustic model

The compression is based on reducing the irrelevant dynamic range, that is, on the inability of the auditory system to detect quantization errors under masking conditions. The standard divides the signal into frequency bands that approximate the critical bands, and then quantizes each subband according to the noise detection threshold within that band. The psychoacoustic model is a modification of the one used in Layer II, and uses a method called polynomial prediction. It analyzes the audio signal and calculates how much noise can be introduced as a function of frequency; that is, it calculates the “masking amount” or masking threshold as a function of frequency.

The encoder uses this information to decide the best way to spend the available bits. The standard provides two psychoacoustic models of different complexity: model I is less complex than model II and greatly simplifies the calculations. Studies show that the distortion generated is imperceptible to the experienced ear from 256 kbps upward, in an optimal environment and under normal conditions. For the inexperienced or average ear, 128 kbps or even 96 kbps is enough for the sound to seem “good” (unless high-quality audio equipment is used, on which the lack of bass is very noticeable and a “frying” sound stands out in the treble). For people who listen to a lot of music or have a trained ear, 192 or 256 kbps is enough. Most of the music that circulates on the Internet is encoded between 128 and 192 kbps.

Encoding and quantization

The standard distributes bits or noise in an iteration cycle that consists of an internal cycle (inner loop) and an external cycle (outer loop). The iteration examines both the output samples of the filter bank and the SMR (signal-to-mask ratio) provided by the psychoacoustic model, and adjusts the allocation of bits or noise, depending on the scheme used, to satisfy the bitrate and masking requirements simultaneously. These cycles consist of:

Internal cycle

The inner loop performs non-uniform, power-law quantization (each MDCT spectral value is raised to the 3/4 power). The loop chooses a quantization step size, and the quantized data is Huffman-encoded in the next block. The loop ends when the Huffman-encoded quantized values use no more than the maximum number of bits allowed.
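A minimal sketch of this rate-control loop is shown below. It is not the procedure from the standard: the quantizer omits the rounding offset and exact step-size encoding, and `estimate_bits` is a hypothetical stand-in for the real Huffman tables, used only so the loop has something to measure.

```python
import math


def quantize(spectrum, step):
    """Power-law quantizer: scale by the step, raise to 3/4, round, keep sign."""
    return [int(math.copysign(round((abs(x) / step) ** 0.75), x))
            for x in spectrum]


def estimate_bits(quantized):
    """Hypothetical bit-cost proxy (NOT the real Huffman tables):
    zeros are free, other values cost about twice their bit length plus a sign."""
    return sum(2 * abs(q).bit_length() + 1 for q in quantized if q != 0)


def inner_loop(spectrum, max_bits, step=1.0):
    """Enlarge the quantization step until the coded size fits the budget."""
    while True:
        quantized = quantize(spectrum, step)
        bits = estimate_bits(quantized)
        if bits <= max_bits:
            return quantized, step, bits
        step *= 2 ** 0.25  # coarser quantization, so fewer bits


if __name__ == "__main__":
    mdct_values = [100.0, -50.0, 10.0, 1.0, 0.1]
    print(inner_loop(mdct_values, max_bits=20))
```

The loop always terminates, because a large enough step quantizes every value to zero, which costs no bits under this proxy.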

External cycle

The external cycle then checks whether any scale factor band has more distortion (noise in the encoded signal) than allowed, comparing each scale factor band with the values previously calculated in the psychoacoustic analysis. The outer loop ends when one of the following conditions is met:

  • None of the scale factor bands has more noise than allowed.
  • The next iteration would amplify one of the bands more than allowed.
  • All bands have been amplified at least once.
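The three stopping conditions can be illustrated with a simplified sketch. It assumes a fixed quantization step (a real encoder would rerun the inner loop after every amplification), and the amplification factor, the amplification limit and the function names are illustrative choices rather than values from the standard.

```python
def band_noise(band, gain, step):
    """Squared error after quantizing and reconstructing one band."""
    noise = 0.0
    for x in band:
        q = round((abs(x) * gain / step) ** 0.75)   # power-law quantizer
        reconstructed = (q ** (4 / 3)) * step / gain
        noise += (abs(x) - reconstructed) ** 2
    return noise


def outer_loop(bands, allowed_noise, step, amplify=2 ** 0.5, max_amplifications=4):
    """Amplify noisy bands until one of the three stopping conditions holds.

    Returns the number of amplifications per band and the reason for stopping.
    """
    counts = [0] * len(bands)
    while True:
        noisy = [i for i, band in enumerate(bands)
                 if band_noise(band, amplify ** counts[i], step) > allowed_noise[i]]
        if not noisy:
            return counts, "no band exceeds its allowed noise"
        if any(counts[i] + 1 > max_amplifications for i in noisy):
            return counts, "a band would be amplified more than allowed"
        for i in noisy:
            counts[i] += 1
        if all(c > 0 for c in counts):
            return counts, "all bands amplified at least once"


if __name__ == "__main__":
    bands = [[10.0, 8.0], [0.3]]
    print(outer_loop(bands, allowed_noise=[1e9, 0.001], step=1.0))
```

Amplifying a band before quantization makes the quantizer relatively finer for that band, which is why the loop raises the gain of the bands whose noise exceeds the masking threshold.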

Packing or bitstream formatter

This block takes the quantized samples from the filter bank, along with the bit/noise allocation data, and stores the encoded audio and some additional data in frames. Each frame contains information on 1152 audio samples and consists of a header, the audio data, a CRC for error checking, and auxiliary data (the latter two are optional). The header indicates which layer, bit rate, and sample rate are used for the encoded audio. Frames start with the same synchronization and identification header and can vary in length. This block also applies variable-length Huffman coding, a lossless entropy coding method that eliminates redundancy, at the end of compression. Variable-length methods generally assign short code words to the most frequent events and leave long ones for the least frequent.
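The idea of assigning short code words to frequent events can be seen in a generic Huffman code builder. This is only an illustration of the principle: an MP3 encoder selects from predefined Huffman tables instead of building a tree from the data, and the function name is made up for the example.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code in which frequent symbols get shorter code words."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-breaker, {symbol: code so far})
    heap = [(count, idx, {sym: ""})
            for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)  # two least frequent subtrees
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    quantized = [0] * 8 + [1] * 4 + [2] * 2 + [3]
    for symbol, code in sorted(huffman_code(quantized).items()):
        print(symbol, code)
```

In this example the most common value (0) receives a one-bit code word, while the rarest values receive three bits.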

Structure of an MP3 file

An MP3 file is made up of MP3 frames, each of which consists of an MP3 header and the MP3 data. This sequence of data is the so-called “elementary stream”. Each frame is independent; that is, frames can be cut from an MP3 file and then played on any MP3 player on the market. The header starts with a sync word that marks the beginning of a valid frame. It is followed by a series of bits indicating that the file is a standard MPEG file and whether or not it uses Layer 3. After that, the values differ depending on the type of MP3 file. The value ranges are defined in ISO/IEC 11172-3.
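As an illustration, the sketch below decodes the first fields of a 4-byte frame header. The bit layout and lookup tables are the commonly documented ones for MPEG-1 Layer III and are not given in the text; the function name is hypothetical, and the free-format bitrate and the MPEG-2 variants are deliberately left out.

```python
# Lookup tables for MPEG-1 Layer III only (index 0 = free format, 15 = invalid).
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
SAMPLES_PER_FRAME = 1152


def parse_mpeg1_layer3_header(header: bytes) -> dict:
    """Decode sync word, version, layer, bitrate and sample rate."""
    if len(header) < 4:
        raise ValueError("a frame header is 4 bytes long")
    word = int.from_bytes(header[:4], "big")
    if (word >> 21) & 0x7FF != 0x7FF:
        raise ValueError("sync word not found")
    if (word >> 19) & 0x3 != 0b11 or (word >> 17) & 0x3 != 0b01:
        raise ValueError("not an MPEG-1 Layer III frame")
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(word >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        raise ValueError("free-format or reserved value (not handled here)")
    padding = (word >> 9) & 0x1
    return {
        "has_crc": ((word >> 16) & 0x1) == 0,  # protection bit 0 means CRC present
        "bitrate_kbps": bitrate,
        "sample_rate_hz": sample_rate,
        # 1152 samples per frame / 8 bits per byte = 144
        "frame_bytes": 144 * bitrate * 1000 // sample_rate + padding,
        "frame_seconds": SAMPLES_PER_FRAME / sample_rate,
    }


if __name__ == "__main__":
    print(parse_mpeg1_layer3_header(bytes([0xFF, 0xFB, 0x90, 0x00])))
```

For the sample header, the fields decode to 128 kbps at 44.1 kHz without CRC, which gives frames of 417 bytes (418 when the padding bit is set).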

Discrete Fourier transform

In mathematics, the discrete Fourier transform, often abbreviated DFT and sometimes called the finite Fourier transform, is a Fourier transform widely used in signal processing and related fields to analyze the frequencies present in a sampled signal, solve partial differential equations, and perform other operations, such as convolutions. It is used in the process of creating an MP3 file.

The discrete Fourier transform can be computed very efficiently using the FFT algorithm.
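For reference, the definition of the DFT can be written directly as a short sketch. This is the naive form, which needs on the order of N² operations, not the FFT itself; the test signal and its parameters are arbitrary choices for the example.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform: X[k] = sum of x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    n = 16
    # A cosine that completes exactly 3 cycles over the 16 samples.
    signal = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
    magnitudes = [abs(value) for value in dft(signal)]
    strongest = max(range(n // 2), key=lambda k: magnitudes[k])
    print("strongest frequency bin:", strongest)
```

An FFT produces the same values in roughly N log N operations, which is what makes frequency analysis practical for every frame of an audio stream.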