Tuesday, February 25, 2025

Simulating Polar Modulation

Note - this page is work-in-progress so expect changes & additions.

We now have some mathematical and practical insight into polar modulation. We also have some hints regarding its bandwidth requirements: under certain circumstances both the amplitude and frequency outputs have a high harmonic content. But a practical circuit implementation will impose some bandwidth constraints. For instance, if we use the si5351 for frequency generation, its output frequency can only be updated at about 12ksps - imposing an upper bandwidth on the frequency output of about 6kHz. We therefore want to understand the consequences of these bandwidth restrictions.

My maths skills are insufficient for me to carry out a theoretical analysis, and I'm not even sure whether it's possible. For instance, I was surprised to find that even the analysis of a nontrivial FM modulation quickly becomes intractable - see here

I have therefore sought answers by simulating the algorithm. I've written a C program that internally generates test signals, simulates polar modulation and writes the individual samples into an array. It then displays the Fourier-transform results of this time sequence of values.

The simulation parameters I've chosen are:

  • The simulation should test bandwidths of the amplitude and frequency outputs up to 192ksps.
  • The carrier frequency should be at least 5x this frequency, or about 1MHz.
  • The simulation sampling should be at least 2x this frequency, but make it 4x for safety: ie. 4MHz.
With 256K samples my PC performs a Fourier transform and all other simulation calculations in well under a second. This quick evaluation time nicely allows the spectrum results to be refreshed and viewed whilst interactively changing input parameters. The frequency-width of each of the Fourier transform's output array entries = sampling frequency / number of samples = 4MHz / 262144 ≈ 15.3Hz, which is more than enough resolution.

The goal with the simulation code is to remove all possible sources of uncertainty or error so that the results reflect issues with the core modulation technique only. The Hilbert Transform is nontrivial and potentially generates slight amplitude and phase errors in its output. For this reason the Hilbert transform has been removed, and replaced by oscillators that directly generate quadrature phase signals. Also, the atan2() and power calculation approximations in Guido's original code are replaced by high accuracy calculations.

The core simulation pseudocode is:

SAMPLES = 1<<18
Fs = 4e6
dt = 1.0/Fs
fc = 1e6                           // carrier frequency
float output[SAMPLES]
float outputAmplitude = 0
float freqDelta = 0
float prev_phase = 0
float carrierPhase = 0

float f1 = 1600    // set the modulation signal values
float a1 = 1.0
float f2 = 300
float a2 = 1.0

int AmpSubsampling   = Fs/12000    // for 12kHz amplitude updates
int PhaseSubsampling = Fs/12000    // for 12kHz phase updates

for t = 0..SAMPLES-1
    // calculate the modulating signal, directly in quadrature form
    t1 = 2*PI*f1*dt*t
    t2 = 2*PI*f2*dt*t
    I = a1*cos(t1) + a2*cos(t2)
    Q = a1*sin(t1) + a2*sin(t2)

    if (t mod AmpSubsampling == 0)
        outputAmplitude = sqrt(I*I + Q*Q)

    if (t mod PhaseSubsampling == 0)
        float phase = atan2(Q, I)
        float dp = phase - prev_phase
        prev_phase = phase

        if (dp < 0)   dp += 2*PI   // phase unwrap
        if (dp >= PI) dp -= 2*PI   // reverse phase unwrap
        freqDelta = dp * Fs/PhaseSubsampling / (2.0*PI)

    carrierPhase += 2.0*PI*(fc + freqDelta)*dt
    output[t] = outputAmplitude * cos(carrierPhase)

ApplyWindowingFunction(output)
FFTresult[] = FFT_Transform(output)
power[] = convertResultToPowerIndB(FFTresult)
DisplayAsChart(power)
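
The windowing function isn't specified above; here is a minimal C sketch assuming a Hann window (my choice of window, not necessarily the one actually used):

#include <math.h>

#define SAMPLES (1 << 18)

// Hann window: tapers both ends of the capture to reduce spectral
// leakage in the FFT. Any standard window would serve here.
void ApplyWindowingFunction(float output[])
{
    const float PI = 3.14159265f;
    for (int t = 0; t < SAMPLES; t++)
        output[t] *= 0.5f * (1.0f - cosf(2.0f * PI * t / (SAMPLES - 1)));
}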

For the Fourier Transform calculation I use this library - it's open-source, quite fast, the code is easily read, and I was already familiar with it, having used it on other projects. Other libraries might be faster, but raw speed is not critical here.

To simulate various conditions, such as amplitude and phase errors that might be introduced by the Hilbert transform, this core simulation code is supplemented with additional calculations.

When I first wrote this code, I spent some time experimenting with it to give myself confidence that it was both doing what was expected and that I was able to correctly interpret the results. Some experiments were:

  • Temporarily replaced the modulated carrier with a pure sinusoid and verified that the result appeared in the predicted FFT "bin".
  • Temporarily generated two carriers close together and verified that two peaks were present in the FFT result and at the predicted location.
  • Temporarily AM modulated the carrier with a low-frequency sinusoid and confirmed the FFT result was a carrier with two side lobes.
  • Reduced and increased the amplitude of the carrier by a factor of 10 and confirmed that the power output changed by 20dB.

Results with 192kHz Sample Rate

Finally, with the sample rate set to 192kHz, I checked the spectrum with two equal amplitude sinusoids, one at 3.5 kHz and the other ranging in frequency from just below 3.5kHz down to just 40 Hz. In all cases the spectrum is virtually perfect:

The horizontal lines are 10dB apart, and the total span is about 12kHz. This is good evidence that the algorithm is working as expected.

Results with 12kHz Sample Rate - slight amplitude delay

The spectrum results are sensitive to the relative delay between the frequency and amplitude outputs. The best results are when the amplitude signal is delayed by approx. 190 master clock ticks (4MHz rate). This corresponds to a delay of a little over half the 12kHz sample clock.

With the sample rate set to 12kHz, and one input set to 3.5kHz, and the other just 130Hz lower, some spurious signals are present at 45dB down:

When the difference frequency increases to 500Hz, many more spurious signals appear:



At a difference frequency of 1000Hz, the spurious signals have continued to grow but the nearby peaks remain about 40dB down:


With a difference frequency of 1500Hz the spurious signal peaks are 30dB down:



When the difference frequency reaches 2000Hz, the spurious signal peaks are 25dB down:



With a difference frequency of 2500Hz, the spurious signal peaks are <25dB down:



Finally, with a difference frequency of 3000Hz, the spurious signal peaks are 20dB down:


A similar pattern emerges with different sample rates - with a higher sample rate and the same difference frequency, the spurious products are lower in power (and vice versa).

Results with 12kHz Frequency Sample Rate, Increased Amplitude Sample Rate

...tbd...

Sensitivity to Amplitude Variation

In the following charts, the frequencies are kept constant at the standard IMD test tones of 700Hz and 1900Hz, and the 700Hz amplitude is varied:

Equal amplitude:


700Hz at 0.9 amplitude:


700Hz at 0.8 amplitude:



700Hz at 0.5 amplitude:


700Hz at 0.1 amplitude:




To Follow (maybe):

  • Source of errors arising from lower sampling rate
  • Alternative phase unwrapping code?
  • Hardware configuration and results of tests on the hardware
  • Indications of the algorithm's sensitivity to 
    • Hilbert transform phase & amplitude errors
    • Delays between the amplitude & frequency outputs


Tuesday, February 18, 2025

Guido's SSB Polar Modulation - Analysis and Application


This blog post is "work in progress" and may be updated from time to time. 

The uSDX is a microprocessor-based Software Defined Transceiver developed by Guido PE1NNZ that has a very interesting technique to generate the SSB transmission. The SSB signal is created by rapidly updating both the output frequency and the output power of a class-E driven Power Amplifier. This design results in a power-efficient and very simple low cost circuit:


That's it - that's the entire transmit circuitry to generate a 5W SSB signal! That's crazy, and so much simpler than typical SSB modulation techniques. I'll call this type of modulation "polar modulation" for reasons that will become apparent later.

Guido's approach is such an attractive idea that the QMX is also adopting polar modulation.

This blog post describes my journey learning about polar modulation, how it works, what its limitations are, and how those limitations can be minimised in a practical implementation. The journey I took led me down many blind alleys. I'll try to spare you from these, and to chart a more direct and logical path.

My first step was to explore how it was even possible for polar modulation to modulate typical audio inputs. Understanding it for a single input tone is easy - the output frequency and output power are just kept constant. For instance, if the transceiver is tuned to 7MHz upper sideband and a modulating tone of 1kHz is applied, the radio sets the si5351 CLK2 output frequency to 7.001MHz, and the PA amplitude to a value proportional to the input signal's amplitude. If the frequency or amplitude of the single-tone modulating input changes, the output frequency and amplitude are updated accordingly.

When two audio signals, say at 1kHz and 1.9kHz, are applied then the output to the antenna should comprise two frequencies - one at 7.001MHz and one at 7.0019MHz. How is this possible when the transmitter is only capable of generating one frequency at a time? Well, it turns out that it is possible - by varying both the output frequency and output amplitude in exactly the right way. This can be shown mathematically in my blog post here.

I've gone on to confirm in both simulation software and laboratory-style tests that indeed a two frequency output signal can be accurately generated using polar modulation. However there are limitations to this approach which I will summarise here, and will explain later how I arrived at these conclusions.

Summary of Findings

The polar modulation algorithm is nonlinear, and becomes increasingly nonlinear as the amplitudes of two input tones approach each other.  When this happens, as with any nonlinear system, many harmonics are generated and these harmonics are at frequencies that are multiples of the difference-frequency of the input tones. For example, if the input signals are at 1kHz and 1.9kHz, the difference frequency is 900Hz, and there are harmonics at 1800Hz, 2700Hz, 3600Hz, etc.

The harmonics are critical to the correct generation of the RF output. By "magic" the harmonics generated by the frequency variations exactly cancel the harmonics generated by the amplitude variations, resulting in a clean RF signal. Any reduction in the harmonic signals that are caused by bandwidth limitations in either the frequency or amplitude paths will result in spurious signals in the output.

A rough rule of thumb for the case of equal-amplitude input signals is that the bandwidths of the frequency and amplitude paths need to be approximately 5x the maximum expected frequency difference in order to keep the output's spurious signals more than 30dB below either test tone. For instance, the si5351 can only be updated at about 12ksps maximum (with the I2C bus overclocked beyond spec), and so has a maximum bandwidth of 6kHz. Such a system will show spurious signals worse than 30dB down whenever equal-amplitude dual-tone signals are applied that have a frequency difference greater than 6kHz/5 = 1200Hz. However, as will be seen later, improvements are achievable if the amplitude path has a higher bandwidth than the frequency path.

To illustrate the consequences of limiting the bandwidth of both paths, here is the simulation result of a 6kHz bandwidth limitation on 300Hz and 1600Hz tones (1300Hz frequency difference):


The same test on a laboratory hardware test set-up is:


Note the very strong similarity between the simulated results and the real results (allowing for slightly different horizontal and vertical scales). This is positive evidence of the accuracy of the simulation and of my understanding of what is happening.

The Modulation Algorithm

The modulation algorithm at a high level is:

Or in pseudo-code:

modulate(audio):

    (I, Q) = HilbertTransform(audio)

    amplitude = sqrt(I*I + Q*Q)
    phase = atan2(Q, I)

    deltaPhase = phase - lastPhase    // differentiate
    lastPhase = phase

    if (deltaPhase < 0)  deltaPhase = deltaPhase + 2*PI    // phase unwrapping
    if (deltaPhase > PI) deltaPhase = deltaPhase - 2*PI    // phase unwrapping

    frequency = deltaPhase * Const + CarrierFrequency

    return (frequency, amplitude)

You might rightly wonder what is happening here - this is a huge jump from the description so far and from the mathematical analysis referred to above.

To understand how this algorithm works it helps to first describe:

  1. The phasor representation of a sinusoid. 
  2. The Hilbert transform, as it applies in this narrow application.
  3. Phase unwrapping.

Phasor Representation

The key idea is that, since a sinusoid has both an amplitude and a phase, it can be represented as a vector (arrow) or phasor in a 2D space. The phasor's length is its amplitude, and its frequency is how quickly it rotates around its "tail" in this 2D space. So the tip of a phasor representing a single tone traces a circle in this 2D space. The convention is that the phasor rotates anticlockwise. The horizontal component of this phasor is the level of the signal we see. This is illustrated here.

At any point in time the phasor has a length (the distance from its tail to its tip) and an angle, which is the phasor's angle to the horizontal.



If a signal comprises multiple sinusoids, all the phasors representing these sinusoids are added together tip-to-tail, with each phasor continuing to rotate around its own tail. The result is a new result phasor with its tail at the origin and a tip that traces out a complex pattern in 2D space.


The resulting path could also be traced out by a hypothetical signal generator whose output amplitude and frequency is continually and precisely varied to match. The signal generator's output amplitude is the length of result phasor, and its frequency is the rate of change of the result phasor's angle. This is exactly the calculation being performed in the modulation algorithm!
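
In symbols (my notation, not from the original): writing the result phasor's horizontal and vertical components as I and Q, the hypothetical signal generator must produce

$$A(t) = \sqrt{I^2 + Q^2}, \qquad \phi(t) = \operatorname{atan2}(Q, I), \qquad f_{\text{inst}}(t) = \frac{1}{2\pi}\frac{d\phi}{dt}$$

which is precisely the sqrt/atan2/phase-difference sequence in the pseudocode above.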

The Hilbert Transform

The discussion above describes the use of phasors, but the input signal isn't in phasor form. That's where the Hilbert Transform (or Hilbert Filter) comes in - it turns the input into phasor form.

The Hilbert Transform takes an input sinusoid and through "magic" (which I don't understand) outputs two sinusoids I and Q, with Q delayed by exactly 90° compared to I. I and Q are the two dimensions of the 2D phasor space.

The Hilbert Transform is also linear, so if two sinusoids are combined and fed into the Hilbert Transform, the result is exactly the same as feeding each sinusoid into a separate Hilbert Transform and adding the results, ie 𝓗(a + b) = 𝓗(a) + 𝓗(b). This means we can feed our input signal, which comprises multiple frequencies, into the Hilbert Transform and the resulting I and Q signals correctly represent the result phasor described earlier.
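
For the curious, one common realisation is an odd-length FIR filter whose ideal coefficients are 2/(πn) for odd n and zero otherwise, with the I path simply delayed to match the filter's group delay. A C sketch follows - the tap count and window are my arbitrary choices, not the uSDX's actual filter:

#include <math.h>
#include <string.h>

#define HT_TAPS  31                     // filter length (odd)
#define HT_DELAY ((HT_TAPS - 1) / 2)    // group delay in samples

static float ht_coeff[HT_TAPS];
static float ht_state[HT_TAPS];         // delay line of recent samples

// Build windowed ideal Hilbert coefficients: h[n] = 2/(pi*n) for odd n.
void hilbert_init(void)
{
    const float PI = 3.14159265f;
    for (int i = 0; i < HT_TAPS; i++) {
        int n = i - HT_DELAY;
        float ideal = (n & 1) ? 2.0f / (PI * n) : 0.0f;
        float hamming = 0.54f - 0.46f * cosf(2.0f * PI * i / (HT_TAPS - 1));
        ht_coeff[i] = ideal * hamming;
    }
}

// Push one audio sample; I is the input delayed to match the filter's
// group delay, Q is the 90-degree shifted version.
void hilbert_step(float in, float *I, float *Q)
{
    memmove(&ht_state[1], &ht_state[0], (HT_TAPS - 1) * sizeof(float));
    ht_state[0] = in;

    float q = 0.0f;
    for (int i = 0; i < HT_TAPS; i++)
        q += ht_coeff[i] * ht_state[i];

    *I = ht_state[HT_DELAY];            // matched-delay in-phase path
    *Q = q;
}

Feeding cos(ωn) in yields (after the 15-sample delay) I = cos and Q = sin - the quadrature pair the modulator needs.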

Phase Unwrapping

We're almost there! The last issue to deal with is that the phase measurement only gives a raw value between -π and +π. But the actual phase angle of a sinusoid continues to grow without bound. This becomes an issue when we calculate the frequency by measuring the phase's rate of change.


If the raw atan2() phase value is used, the frequency calculation is incorrect each time the phase increases beyond π and jumps back to -π. That's why the following code is needed to "unwrap" the phase and to give the correct phase difference:

if (deltaPhase < 0)  deltaPhase = deltaPhase + 2*PI    // phase unwrapping

It is also possible for the phase to be decreasing and for the phase to jump from -π to +π. The code to "unwrap" this case is:

if (deltaPhase > PI) deltaPhase = deltaPhase - 2*PI     // phase unwrapping

"When can the phase be decreasing - that's a negative frequency!" you might ask. Here's an example. Phasor 1 is rotating slowly (say at 100Hz), and Phasor 2 is rotating quickly (say at 1kHz). In the time that Phasor 2 rotates from position 1 to position 2 Phase 1 has hardly moved. The result phasor has moved from pointing to 1 to pointing at 2 - a reduction in phase:



This latter case of phase unwrapping is absent from the uSDX implementation, but appears to be unimportant since it makes little difference to the final signal.

Studying the appropriate phasor diagrams helps illustrate why, as the mathematical analysis shows, the phase changes so dramatically when the two tones have similar amplitudes:


In this diagram, the origin is towards the top, and a trace of the tip of the result phasor is shown over a number of time steps. The low frequency's phasor has its tail at the origin, and its tip is currently in the lower left quadrant. The higher frequency's phasor has its tail at the LF phasor's tip, and since it is spinning much faster, it causes the dominant movement; its tip is the tip of the result phasor. It has a similar amplitude to the LF phasor, so the result phasor passes very close to, but slightly below, the origin.

For the majority of the result phasor's path, its angle to the origin changes only relatively slowly from one time step to the next. However, the steps from orange to purple, and from purple to red, are very dramatic negative phase changes. This is the dramatic phase change predicted by the mathematical analysis.

Note that there's nothing special in this description about the LF phasor's location. Another "snapshot" could be taken at a different time where the LF phasor is pointing - say - in the upper right quadrant; exactly the same description of the phase behaviour would apply. In fact different snapshots in time can be thought of as just rotating this image around the origin.

If you now imagine that the HF phasor's amplitude is very slightly larger, then the result phasor would pass very close to, but slightly above, the origin. In this case the dramatic phase changes are positive. A small amplitude change has caused the sign of the phase "blip" to change.

As a final thought experiment, imagine the two amplitudes are exactly the same. In this case the result phasor passes directly through the origin. At one time step it will be on one side of the origin, and on the next time step it will be on the opposite side - a 180° or π radians change. The frequency, which is the change in phase over time, will be π/dt, where dt = the time step duration. If the size of the time step is reduced, then the frequency at the point of crossing increases. This becomes important later - altering the size of the time step alters the result of the frequency calculation - because the calculation depends on a time-derivative.
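
Putting numbers on this (my arithmetic, using the 4MHz simulation step rate from the post above as an example):

$$\Delta\phi = \pi \;\Rightarrow\; \omega = \frac{\pi}{dt} \;\Rightarrow\; f = \frac{1}{2\,dt} = \frac{F_s}{2}$$

so at a 4MHz step rate the instantaneous frequency excursion at the crossing reaches 2MHz.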

Conclusion

This post has hopefully given some intuition behind polar modulation. In a subsequent post I'll investigate the bandwidth requirements via simulation and a hardware implementation.

Sunday, February 16, 2025

Mathematical Analysis of a two-tone SSB Signal

Suppose we have an SSB Transmitter tuned to transmit at carrier frequency $f_c$, upper sideband. With an input audio modulating signal of $f_a$, the transmitter generates an output signal of $f_c+f_a$. For example if the tuned frequency is 7.0MHz and the audio input is 1kHz, a signal at 7.001MHz is generated.

Similarly, with two input audio modulating signals, $f_a$ with amplitude $a$ and $f_b$ with amplitude $b$, the transmitter generates a signal comprising two frequencies, one at $f_c+f_a$ and another at $f_c+f_b$.

The following analysis shows that this 2-frequency output signal can be generated by amplitude-modulating a single frequency-modulated oscillator.

The following trigonometric identities are used:

(I): $\cos(x+y) = \cos x \cos y - \sin x \sin y$ -- link

(II): $x\cos\theta - y\sin\theta = c\cos(\theta+\phi)$ where $c = \operatorname{sgn}(x)\sqrt{x^2+y^2}$, $\phi = \arctan(y/x)$ -- link

(III): $\sin^2 x + \cos^2 x = 1$ -- link

We start with the output signal:

$$s = a\cos 2\pi(f_c+f_a)t + b\cos 2\pi(f_c+f_b)t$$

Define $\omega_c = 2\pi f_c$, $\omega_a = 2\pi(f_c+f_a)$ and $\omega_b = 2\pi(f_c+f_b)$:

$$s = a\cos\omega_a t + b\cos\omega_b t$$

$$s = a\cos\omega_a t + b\cos(\omega_a t + (\omega_b-\omega_a)t)$$

Applying (I): 

$$s = a\cos\omega_a t + b\cos\omega_a t\cos(\omega_b-\omega_a)t - b\sin\omega_a t\sin(\omega_b-\omega_a)t$$

Rearranging:

$$s = \left(a + b\cos(\omega_b-\omega_a)t\right)\cos\omega_a t - b\sin\omega_a t\sin(\omega_b-\omega_a)t$$

Applying (II), with $x = a + b\cos(\omega_b-\omega_a)t$, $y = b\sin(\omega_b-\omega_a)t$:

$$s = \sqrt{a^2 + 2ab\cos(\omega_b-\omega_a)t + b^2\cos^2(\omega_b-\omega_a)t + b^2\sin^2(\omega_b-\omega_a)t}\;\cos\!\left(\omega_a t + \tan^{-1}\frac{b\sin(\omega_b-\omega_a)t}{a + b\cos(\omega_b-\omega_a)t}\right)$$

Applying (III) to simplify the terms under the square root:

$$s = \sqrt{a^2 + 2ab\cos(\omega_b-\omega_a)t + b^2}\;\cos\!\left(\omega_a t + \tan^{-1}\frac{b\sin(\omega_b-\omega_a)t}{a + b\cos(\omega_b-\omega_a)t}\right)$$

This complex equation can be better understood by rewriting it as:

$$s = A\cos(\omega_a t + f(t))$$

where:

  • $A = \sqrt{a^2 + 2ab\cos(\omega_b-\omega_a)t + b^2}$ sets the amplitude of the generated signal, and varies slowly compared to the carrier frequency because $\omega_b-\omega_a \ll \omega_c$, and
  • $f(t) = \tan^{-1}\dfrac{b\sin(\omega_b-\omega_a)t}{a + b\cos(\omega_b-\omega_a)t}$ varies slowly compared to the $\omega_a t$ term, again because $\omega_b-\omega_a \ll \omega_c$.

In other words, combining two input signals results in an output signal with a phase varying around $\omega_a t$ at the difference frequency $\omega_b-\omega_a$, and with amplitude $A$ also varying at the difference frequency.

Confidence that the formula is correctly derived can be gained by testing what happens if either of the modulating signals is removed (by setting its amplitude term - either $a$ or $b$ - to zero). The formula should "collapse" to a single simple waveform. Indeed this is exactly what happens.
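
For example, setting $b = 0$ gives:

$$A = \sqrt{a^2} = a, \qquad f(t) = \tan^{-1}\frac{0}{a} = 0 \;\Rightarrow\; s = a\cos\omega_a t$$

- a single unmodulated tone at $\omega_a$, as expected.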

The analysis so far draws on "Frequency Analysis, Modulation and Noise", Stanford Goldman, 1948 - p160. I believe what follows is novel.

Phase/Frequency Analysis

The full phase term is: $\left(\omega_a t + \tan^{-1}\dfrac{b\sin(\omega_b-\omega_a)t}{a + b\cos(\omega_b-\omega_a)t}\right)$.

The frequency at any instant in time is the derivative of this term with respect to time:

$$f = \frac{d}{dt}\left(\omega_a t + \tan^{-1}\frac{b\sin(\omega_b-\omega_a)t}{a + b\cos(\omega_b-\omega_a)t}\right)$$

$$f = \omega_a + \frac{d}{dt}\left(\tan^{-1}\frac{b\sin(\omega_b-\omega_a)t}{a + b\cos(\omega_b-\omega_a)t}\right)$$

Finding the derivative of this second term is beyond my mathematics skills, but thankfully the Derivative Calculator website will do it for us (click the "Go!" button). Here I have defined $m = \omega_b - \omega_a$.

The result it gives is:

$$\frac{bm\,(a\cos(mt) + b)}{b\,(2a\cos(mt) + b) + a^2}$$
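
For anyone wanting to check it by hand, the same result follows from the quotient rule, with $u = b\sin(mt)$ and $v = a + b\cos(mt)$:

$$\frac{d}{dt}\tan^{-1}\frac{u}{v} = \frac{u'v - uv'}{u^2 + v^2}$$

$$u'v - uv' = bm\cos(mt)\,(a + b\cos(mt)) + b^2 m\sin^2(mt) = bm\,(a\cos(mt) + b)$$

$$u^2 + v^2 = a^2 + 2ab\cos(mt) + b^2 = b\,(2a\cos(mt) + b) + a^2$$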

This formula isn't very informative, apart from showing that the output frequency varies with the difference frequency $m = \omega_b - \omega_a$. However, very helpfully, Derivative Calculator also interactively plots the input and derivative values for the input values $a$, $b$ and $m$. For example:


Experimenting with different values starts to give some slight intuition about how the amplitude values influence the frequency output. For instance, as the two amplitudes approach each other, the frequency variation grows dramatically:

Finally, when the two amplitudes are equal, there's a discontinuity & the "blip" disappears from the graph view - the negative "blip" is still there and corresponds to the negative edge of the input signal, but it's infinitely narrow and doesn't display:

This tendency for the frequency "blip" to grow without bound as one amplitude approaches the other is an extremely important phenomenon to understand in later analysis. It will also become apparent that this phenomenon changes slightly when we move from the continuous domain to the discrete/sampled domain.

Note that the blue waveform might initially look strange when the amplitude is increased further:

The explanation for the discontinuities in the blue input waveform is that it is a phase measurement and has been displayed to wrap every 2π. The red derivative is displayed correctly and does not show any "blips" at these transitions.

Amplitude Analysis

The amplitude term is $A = \sqrt{a^2 + 2ab\cos(\omega_b-\omega_a)t + b^2}$.

Its behaviour can be interactively plotted here.

The most interesting plot is again when a=b:


This is a rectified sine wave, as can be demonstrated by setting $a=b=1$. The amplitude term simplifies to:

$$A = \sqrt{1 + 2\cos(\omega_b-\omega_a)t + 1}$$

$$A = \sqrt{2}\,\sqrt{1 + \cos(\omega_b-\omega_a)t}$$

But $\operatorname{sgn}\!\left(\cos\frac{\theta}{2}\right)\sqrt{\frac{1+\cos\theta}{2}} = \cos\frac{\theta}{2}$ (see link), giving $A = 2\left|\cos\frac{(\omega_b-\omega_a)t}{2}\right|$.

So, ignoring the sign term, the amplitude is the shape of a cosine wave.

A rectified sine wave has high harmonic content (see here or here), which will become relevant in later analysis.
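
To make that harmonic content explicit, the Fourier series of a full-wave rectified cosine is the standard result

$$|\cos x| = \frac{2}{\pi} + \frac{4}{\pi}\sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{4k^2 - 1}\cos 2kx$$

so with $x = (\omega_b-\omega_a)t/2$ the amplitude contains components at every multiple of the difference frequency, decaying only as $1/k^2$.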

Conclusions

This mathematical analysis has derived some equations for both amplitude and frequency shift that appear to correspond to expectations.

The interactive waveform plots give some intuition on the signal behaviour. In particular, when the two tones approach the same amplitude, the frequency shift approaches a discontinuity and values can become very large, and the amplitude waveform approaches a shape that has high harmonic content.

Thursday, February 13, 2025

Measuring the Si5351 Output Frequency Settling Time

I'm using the Si5351 programmable clock generator in an upcoming project, portions of which are similar to the usdx. For my application, characteristics of the chip that are important, but not documented in the datasheet, are:

  • When does the output frequency change in relation to the I2C command that has been issued to update the frequency? Answer: on the second to last low-to-high I2C clock transition of the command that writes the lowest byte of the PLL register.
  • How long does the output frequency take to reach the target frequency (its settling time)? Answer: < 1us! Yes, it settles extremely quickly.

This note describes how I measured these characteristics.

For my tests, the Si5351's output frequency is being updated at up to 12k times per second. The I2C bus is overclocked at 800kHz to achieve this update rate. The usdx project implements a frequency-change mechanism where just 5 on-chip 8-bit register values are updated in a single 7-byte I2C transaction. The update is made to alter the PLL frequency. For the purposes of my tests I use the same mechanism, but only need to update 4 registers for the range of frequency changes I'm conducting. Documenting the exact frequency update mechanism is outside the scope of this note, but it can be ascertained from the usdx code and reference to AN619.
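
For illustration only, here is a C sketch of that style of burst update using the Pi Pico SDK (a Pico 2 generates the I2C signals in the tests below). The device address 0x60 and the PLLA parameter block starting at register 26 come from AN619; exactly which and how many of those registers must be rewritten depends on the size of the frequency step:

#include <stdint.h>
#include "hardware/i2c.h"

#define SI5351_ADDR  0x60   // 7-bit I2C address of the Si5351
#define SI5351_PLLA  26     // start of PLLA feedback-divider block (AN619)

// One frequency update: a single I2C transaction writing 'n' consecutive
// PLL parameter registers starting at 'reg'. On the wire this is the
// device address byte, the register byte, then n data bytes - 7 bytes
// in total for a 5-register update.
void si5351_write_regs(uint8_t reg, const uint8_t *vals, size_t n)
{
    uint8_t buf[9];
    buf[0] = reg;                       // Si5351 auto-increments from here
    for (size_t i = 0; i < n; i++)
        buf[i + 1] = vals[i];
    i2c_write_blocking(i2c0, SI5351_ADDR, buf, n + 1, false);
}

At 800kHz a 7-byte transaction occupies roughly 63 bit-times plus overhead, or about 80us - consistent with the ~12k updates per second quoted above.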

Note that my tests also showed that the Si5351 output only changes when the lowest byte of the PLL register is written.

Test Method

I used a mixer to compare the Si5351's output target frequency against a known reference frequency. The filtered mixer's output is the difference between the input frequencies. So when the two frequencies are the same, or very similar, the output is either constant or changes only slowly. I initially used a filter RC time constant of 1us.
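
The underlying identity (my note) - the RC filter removes the sum term, leaving the difference:

$$\cos\omega_1 t\,\cos\omega_2 t = \tfrac{1}{2}\cos(\omega_1 - \omega_2)t + \tfrac{1}{2}\cos(\omega_1 + \omega_2)t$$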

The mixer was built from a 3253 analogue switch. The mixer is very much like a quarter of a Tayloe Mixer.

The fixed frequency could be provided by an external source, but in my case I used a second Si5351 output.  I have relied on the fact that the si5351 can be configured to output 2 frequencies completely independently - each with their own PLL and MultiSynth divider chain (a third output frequency can also be configured, but must share one of the PLLs):

  • the target frequency on CLK1 whose frequency will be adjusted by ±4kHz (or more).
  • the reference frequency on CLK2 whose frequency is locked and stable throughout a test.
A Pi Pico 2 was used for generating the I2C signals, and output a "high" on a digital pin just before the start of an I2C transaction, and "low" immediately after. An oscilloscope monitored the I2C clock, the digital pin, and the mixer output.



To conduct the measurement, the target frequency would first be set several kHz above or below the reference frequency and allowed to stabilise, and then the target frequency would be set to match the reference frequency.

A typical trace (with the target frequency starting 5kHz above the reference frequency, with C = 1nF & RC = 1us) is:

(where yellow = I2C clock, cyan = digital output, blue = mixer output).

The 5kHz difference frequency is clearly visible on the mixer output prior to the frequency change. The flat output after the frequency change shows that the target and reference frequencies are now equal.

Zooming in, we see that the mixer output becomes flat (ie target and reference frequencies now equal) within a few microseconds after the second to last low-to-high transition of the I2C clock. The 1us RC time constant of the filter slows the transition:

With C = 100pF, RC = 0.1us there is reduced filtering, but the exact time that the frequency changed is much clearer:

The frequency transition occurs within 1us.

The tests were rerun using an external Function Generator as the reference frequency. The same results were obtained, verifying that there was no unintentional interaction between the two Si5351 outputs.

Sunday, January 24, 2016

Inside the armv1 - the ALU control logic

This is one of a number of posts on my work on reverse engineering the armv1 processor.  The first in the series, and an index of the other articles can be found here.

My first post in this series described in some detail a one-bit slice of the ALU, and identified quite a number of control signals that feed all 32 bit slices which determine how the ALU operates. Now that I've reverse-engineered the overall instruction decoding and sequencing mechanism we now have enough context to make a start on reverse-engineering the circuitry that generates the ALU's control signals. This turns out to be more complex than I first expected and a full understanding of what's happening will probably have to wait until other parts of the processor have also been reverse-engineered.

Let's make a start by setting the context. The ALU circuit from my earlier post is reproduced below; we'll need to refer to it as we proceed.



The location of this circuitry, and its control circuitry, on the silicon is as follows:


Let's now zoom in on the ALU control area:


The ALU Control Logic comprises two distinct areas - (1) a bunch of "ad-hoc" gates and drivers sitting just above the ALU itself, and (2) just above them, a Programmable Logic Array which is referred to as PLA-1 in an earlier blog. In the same area is the logic which implements the Program Status Register which is not discussed here.

The overall circuit is as follows:


The circuit diagram is laid out roughly as it is in the silicon - a bunch of control signals come in from above and feed PLA-1; the buffered PLA-1 outputs, and a couple of outputs from the instruction-decoder, then feed downwards to provide the control signals to all 32 bit slices that form the ALU. Not surprisingly, these control signals tally exactly with the signals in the ALU circuit at the beginning of this blog.

The Carry Logic on the left (highlighted in blue) feeds the Carry bit from the PSR Logic (node 8083) to bit 0 of the ALU. The Carry bit is delayed by a single clock cycle using a dynamic latch (I've used a D latch symbol for ease of drawing) and conditioned by two outputs of the PLA as per the following truth table:

This shows that the PLA-1 outputs can either pass the Carry bit straight to bit 0 of the ALU, or force the Carry line to be either 0 or 1.

The Dynamic Register Control circuitry on the RHS (highlighted in orange) shows that the two dynamic latches in the ALU (one on each operand path) are controlled by outputs of PLA-2 (the instruction decoder), with latching occurring on the trailing edge of the Phase 1 clock. The Operand 1 latch is also operated one clock cycle after the PLA-2 signals (8062, 8061, 8060) are all simultaneously low. We'll try to make sense of the purpose of this part of the circuit later.

PLA-1 Content - Input Decoding

After all this context setting it's now time to look at what's in PLA-1. The PLA layout is very similar to the instruction decoder PLA-2 described earlier:


The 8 input selection signals enter the PLA from the lower right, and these signals, together with their inverted versions, feed 16x vertical columns. Which transistors are fitted (or missing) determines which one of the 31 rows within the PLA is selected. There are further details about the operation of this circuit in the earlier blog post.

Let's start with the row selection logic on the RHS of the PLA, a raw dump of which is below (with the top row of the table corresponding to the top row in the PLA):


The content displayed in this format doesn't give us much insight into what is happening. However, after a little study the following pattern emerges:

It becomes clearer that the 3x inputs from PLA-2 (8062, 8061, 8060) form a "command" that narrows down which row will be selected. (Note that I've shown these 3x signals in the order that they enter PLA-1; similarly the earlier description shows these signals in the order that they exit PLA-2 - but they are in a different order! This is the way the chip is wired! So you need to remember to swap the bit ordering when you compare the data between the two PLAs!)

Let's start by looking closer at the first command: "111" from PLA-2, which narrows the selection to rows 15-30. If we refer to the PLA-2 logic table & find which rows generate a "111" (remembering to reverse the bits, which happens to not matter in this case), we find it appears on three rows: 1, 3, and 4 (highlighted in light blue).


All three rows are associated with the last (or only) cycle of a DP (Data Processing) instruction. So, returning to the PLA-1 table above, the signals on the Instruction Bus b24 to b21 are from a DP instruction. The DP instruction format is:


And, as expected, we see that b24 to b21 correspond to the "Operation Code" - AND, OR, etc. So we can conclude that rows 15 to 30 of PLA-1 correspond to each ALU operation.

By following a similar process and taking each of the remaining commands in turn we can gain some understanding of what rows 0..14 are for. However this analysis reveals lots of specific details - there is a lot of complexity here. The annotated PLA-1 row-selection table is as follows:


What's clear from this table is that the ALU is being used for a lot more than the DP instructions that are visible to the programmer - almost half of this table is for other purposes! In the main the use of Instruction Register bits 24..21 makes sense given the context. The one exception is for command "000" (rows 0, 1 above); the Instruction Register bits are only valid & meaningful on a couple of the 9 rows where this command is used. I suspect that this command is sometimes used as a "no-op".

We can also perhaps gain some insight into what the signal "From Trap Ctrl (8158)" is. It only influences the row decoding for rows 6, 7, which are associated with LDR and LDM instruction execution. We know from the instruction descriptions that there are special rules in cases where a data fetch abort occurs:

  • for the LDR instruction: "The data fetch may abort, and in this case, the base and destination modifications are prevented", 
  • for the LDM instruction: "If an abort occurs...the final cycle is altered to restore the modified base register") 

Perhaps this logic is to implement this functionality?

PLA-1 Content - Output Values

Now that we've made an initial analysis of the input decoding, it's time to see what sense we can make of the output values. The table below is laid out to reflect the physical layout and therefore shows the output values on the LHS. There are also some notes on the left.


There is an awful lot of information contained in this table!

Let's start with the DP rows - 15 to 30:

  • The good news is that on rearranging the order of the 6 right-most output columns, these values exactly correspond to the table that appeared in my first post on the ALU! Whew!
  • If we now look at the output columns for 8087, 8088, we know from the earlier circuit diagram that these values determine what signal is fed to the Carry In on bit 0 of the ALU. And we only need to look at those rows where the output signal "/Enable C chain" (8116) is 0 (since the Carry signal is ignored otherwise). The comments on the left capture the results. The Carry In signal is set to exactly the values we would expect for the associated op-code.


  • The first output, "To INST SKIP logic (8065)" is interesting in that it is only zero for the 4x comparison op-codes. About these op-codes the documentation states: "They are used only to perform tests and to set the condition codes on the result, and therefore should always have the S bit set" (we can see from the DP instruction format above that the S bit is defined as "Set Condition Code", 0 = Do not alter condition codes, 1 = Set condition codes). I suspect that this output from the PLA is used by the INST SKIP logic to detect if this is an invalid instruction.
  • The output "To PSR Logic (8064)" appears to control when the PSR is updated. Since 8064 is high for all the DP op-codes, and low for all other ALU operations, it probably controls when the Carry, Overflow, Negative, and Zero flag can be updated (subject of course to the S bit, as described earlier). 
  • The output "To PSR Logic (8059)"  is only high when an arithmetic operation (as opposed to a logical operation) is taking place. It therefore selects where the Carry flag is updated from - either the ALU output, or the result of the shift operation, and whether the Overflow flag is updated or not. See p2-32 of the VTI arm data book (1990) for a more complete description of the rules regarding how the PSR bits are updated.

Now that we've completed analysing the DP instructions, let's move on to the remainder of the table - rows 0 to 14.

To analyse rows 0 to 14 of the PLA-1 output table I assumed that the ALU would be carrying out similar operations to the DP instructions we've already been looking at. I therefore expected to see the same output settings on these rows as the rows we've already analysed. The results of this comparison work are shown in the Notes on the left of the table above. A summary is:

  • Matches: There are 8 rows with exact matches, which select a variety of ALU op-codes: MOV, ADD, SUB, RSB (reverse subtract).
  • Variants: A further 5 rows match, apart from the Carry In selection. These are described further below.
  • Unknowns: Two rows have identical values, but don't match any DP instruction.
Let's look further at the variants. Although they appear on 5 rows, there are just two versions:
  • ADD (but Cin = 1). The normal ADD has a Cin = 0, so this variant has the result = Op1 + Op2 + 1 (where Op1 and Op2 are the two operands presented to the ALU).
  • RSB (but Cin = 0).  The normal RSB has a Cin = 1, so this variant has the result = Op2 - Op1 - 1. 
We can make educated guesses as to what is happening on each of these rows. For instance, on row 2, we are executing a Branch or Branch and Link instruction, and are almost certainly calculating the destination address by adding the PC to the PC-relative offset which appears within the instruction  (we've seen in an earlier article how there is logic in the Read Bus B that extracts the 24 bit offset).

Conclusion

It turns out that the ALU control logic is more complex than I expected. What is clear is that the chip designers really maximised the use of the ALU! 

Whilst I've completely reverse-engineered the control circuitry, I've only made a start on unraveling what is actually going on. A full understanding requires some more analysis of the circuitry itself, and on getting a wider view on what is happening elsewhere on the chip. My list of "to-do" items includes:
  1. What exactly is the "unknown" ALU command we discovered above?
  2. What values appear at the inputs to the ALU (Operand 1 and Operand 2) for each of the rows above (this requires other parts of the processor to be reverse-engineered first). This will also give us a better understanding of why the variants of the ADD, RSB opcodes are introduced.
  3. What exactly is being latched into the ALU's dynamic registers, and what is the sequencing associated with this? And why does the circuit above have the specific logic to introduce a one cycle delay for when DP instructions are executed?


Monday, January 18, 2016

Inside the armv1 - decoding barrel-shifter commands

This is one of a number of posts on my work on reverse engineering the armv1 processor.  The first in the series, and an index of the other articles can be found here.

Today I'm going to solve a puzzle I have been pondering for some time - how the processor implements instructions that reference 4 registers.

If we look at the data processing (DP) instruction format in more detail, we see that there are the following instruction types:

If we set Bit 25 to zero (Operand 2 in a register), and Bit 4 to one (Shift amount in a register) we get the following layout (made from copying/pasting portions of the image above):


This layout plainly shows that this single instruction references 4 registers simultaneously - Rd (the destination register), Rn (one of the ALU operands), Rm (the second ALU operand), and Rs (which gives the amount by which Rm is first shifted).

However various architecture descriptions of the arm's Data Processing instructions only refer to there being two input operands and one output register. This matches the descriptions of two data read buses (referred to as "Bus A" and "Bus B" or "read bus A" and "read bus B"), and also matches there being just 3 sets of register select logic (which was explored in detail in an earlier post). So how does the processor execute an instruction which references 4 registers?

A clue to solving this mystery was in my last post, where I analysed how the instruction decoder works. By referencing the first table in that post we see that the instruction decoder treats as a special case all Data Processing instructions where the shift amount is in a register. These instructions execute in 2 cycles, one more cycle than all other Data Processing instructions. It would be a safe bet that the first cycle extracts the shift amount from the Rs register and holds the value somewhere, and that the second cycle actually carries out the ALU operation.

Let's now move to the processor itself to verify that our guess is correct. The area of the chip that we're interested in is highlighted in red below. Ken Shirriff has already given an overview of how the barrel shifter works here:



If we zoom in a little further, we see that there are two distinct sections in this area:



The lower area generates the column drive signals to the barrel shifter itself. The shift-amount and shift-type are set via signals originating in the upper "Barrel Shifter Decode Selection" logic. We won't look at the driver logic in any further detail in this post. Instead we turn to the upper section, and start by zooming in some more:


The layout of this upper section is spectacular in that all the inputs and outputs to the logic are very readily apparent, and are marked on the diagram:

  • The I-Bus inputs b11..b5 correspond to the Shift Amount (b11..b7) and Shift Type (b6..b5) that we saw in the instruction layout we looked at above. 
  • The outputs on the RHS - Shift Amount (5 bits) and Shift Type (2 bits) are the signals that feed the Barrel Shifter Driver Logic that we saw earlier. 
  • The 3x outputs from the PLA (nodes 8287, 8288, 8286) that enter the area from above correspond to columns 2, 3, 4 of the PLA output table I included in my last blog.
  • Two signals derived from the lowest two address outputs lines enter the area from above and to the right.
  • 4x signals associated with Carry processing enter/exit from the lower edge.
The main logic areas are also apparent:



We've found the 8-bit wide dynamic latch that stores the register-sourced shift-amount from the Read Bus!

The other key logic is the group of seven 6-way multiplexers, whose outputs feed via the seven drivers to the rest of the barrel-shifter logic.

The 3 to 8 decoder is driven from the 3x PLA outputs (nodes 8287, 8288, 8286) and 6 of its outputs select which of the 6-way multiplexer inputs is chosen. A further output controls the Dynamic latch, and the final 3 to 8 decoder output is not implemented.

The remaining logic that is not highlighted in the diagram is complex and convoluted; its task is to ensure that the correct shift results occur, even when a shift amount greater than 32 is selected. This includes ensuring that the Carry bit is set appropriately and that the sign bit is extended in the correct manner. The rules are described on page 2-34 of the VTI arm databook (1990). I won't dwell further on this part of the circuit.

The dynamic latch circuitry and associated latch processing is straightforward:




The data on the Read Bus is latched during phase 1 of the clock only when output 7 of the 3-to-8 decoder has been selected (i.e. all 3x inputs from the PLA are high). The latched data (or zero) is then available on one of the 6 inputs to the multiplexer. Zero is selected depending on the complex logic referred to earlier.

The output driver circuitry for each of the 7 signals is just two inverters in series.

The multiplexer circuitry, and decoding circuitry, is identical in form to the read bus input multiplexer I described in my earlier blog on register selection, and won't be repeated here.

Just the following "glue logic" circuits generate additional inputs to the multiplexer:


Note how the Shift Type in I-Reg b6 and I-Reg b5 are potentially adjusted, dependent on complex logic, in a similar manner to how the shift-amount described earlier might be adjusted.

The table below summarises the multiplexer's operation, and lists what the 7x outputs to the Barrel Shift Driver Logic are for each of the 8 combinations on the 3 signals from the PLA.


If we compare these PLA values, and the barrel-shifter outputs, with the values in the PLA output table from my previous article, it starts to make sense.

Let's take row 1 of the PLA table as the first example. This row decodes a Data Processing instruction where Operand 2 is a register and the shift amount is immediate (i.e. the amount is in the instruction). The PLA values fed into the table above are "001" (row 1). This selects that the Shift Amount and Shift Type sent to the Barrel Shifter Driver Logic is b11..b5 of the DP instruction, exactly as we would expect.

Now let's examine an instruction that has the shift amount in a register - the situation I began this blog with. We see from the PLA table that this instruction type takes two cycles to execute (rows 2 and 3). The PLA values fed into the table above on the first cycle are "111" (row 7), which is a command to latch the content of the register which is present on the read bus. The PLA values on the second cycle are "000" (row 0), which feed the (possibly modified) values of the latch (for the shift amount) and the instruction's Shift Type to the Barrel Shifter Driver Logic.

We can now deal with the remaining PLA input types:

  • "010" - this appears in a number of lines in the PLA output table where the result of the ALU is not used, and ensures the Carry bit, etc. are not inadvertently affected. It can be regarded as a no-op.
  • "011" - appears only on row 27 of the PLA output table, and corresponds to the first instruction cycle of a Branch or Branch and Link instruction. In these cases the immediate value in the instruction is a word address, and the Barrel Shifter is instructed to shift this value left by two to convert it to a correct memory address.
  • "100" - appears only on row 4 of the PLA output table, and corresponds to a DP instruction where operand 2 is immediate. In this case the value to be rotated is in the lowest 8 bits of the instruction, and is to be rotated right by twice the amount in b8..11 of the instruction (see the instruction format at the beginning of the blog). The "glue" logic chooses a Shift Type of "11" (Rotate Right) when the shift amount is non-zero or "00" (LSL) for when a 0 shift is selected.
  • "101" - appears on row 7 and row 12 of the PLA output table. These both correspond to the last cycle of a LDR (Load Register from memory) instruction. Here, the Shift Amount is x8 the value appearing on Address line a0, a1. These address lines are only non-zero if a byte access has occurred, so this means that the lowest 8 bytes of data that has just been read from memory is rotated into it's correct position before being output from the barrel-shifter.
I ignored the reserved/undocumented instruction in the analysis above. However, even though our reverse-engineering of the chip is far from complete, from the information available we already know quite a lot about variant 1 of the reserved instruction:

  • Cycle 0 (row 15): selects "111" to save a register value in the barrel-shifter latch.
  • Cycle 1 (row 16): selects "000", which shifts the number now on the read bus by the amount in the latch
  • Cycle 2 (row 17): selects "001", performing a further shift by the amount and type specified in the instruction.
  • Cycle 3 (row 18): selects "101", which corresponds the byte rotate operation associated with a LDR instruction.
From this set of steps we can make a guarded guess that this instruction is loading data from memory, and that, since it takes one more cycle than a standard LDR instruction, the address is calculated using the content of one register as a shift amount, in addition to the typical LDR address calculation. Our being able to accurately predict what the reserved instructions do will be a good test of the accuracy of the reverse-engineering!

Conclusion

We've now reverse-engineered the main logic associated with controlling the barrel-shifter. There remains some logic still to reverse-engineer for us to fully understand all the edge cases, but fortunately this logic is very isolated and does not detract from our wider understanding of how the barrel-shifter works and is controlled. We have also found that the barrel-shifter is put to extensive use for a variety of tasks, not just for the Data Processing (DP) instructions. We also have garnered some "teaser" information about one of the reserved instructions.