Connecting the following structure of multipliers and adders in an elegant way in verilog

Question

I am attempting to write a synthesizable verilog (or Systemverilog) module. I also want to make the modul parameterizable, which has presented a problem when trying to connect the following structure of multipliers and adders:

The problem with this is the following. If I want to make it parameterizable, I don't know how to connect it in verilog. So to be more specific. I want to know which verilog construct to use to achieve this interconnections in a elegant manner. I have tried with the code bellow, but I get:
"Single value range is not allowed in packed dimension".

The code bellow is Systemverilog:

module vector_multiply #(parameter  N = 8, parameter M = 3)(
    input   clk,
    input   ce,
    input   rst,
    input   [N-1:0] a [M-1:0],
    input   [N-1:0] b [M-1:0],
    output reg [N-1:0] res // result of vector dot product
    );

 localparam ADDER_DEPTH = $clog2(M);
 wire [N-1:0] w [1<<ADDER_DEPTH][ADDER_DEPTH+1]; // multiplier/adder interconnect wires
 reg [N-1:0] mult_res [M-1:0];
 genvar i,j;

// MULT_MACRO: Multiply Function implemented in a DSP48E
//             Artix-7
// Xilinx HDL Language Template, version 2018.2

for (i = 0; i < M; i = i + 1)
begin
    MULT_MACRO #(
    .DEVICE("7SERIES"), // Target Device: "7SERIES" 
    .LATENCY(3),        // Desired clock cycle latency, 0-4
    .WIDTH_A(N),        // Multiplier A-input bus width, 1-25
    .WIDTH_B(N),        // Multiplier B-input bus width, 1-18
    .WIDTH_P(N)
) MULT_MACRO_inst (
   .P(mult_res[i]),     // Multiplier output bus, width determined by WIDTH_P parameter
   .A(a[i]),     // Multiplier input A bus, width determined by WIDTH_A parameter
   .B(b[i]),     // Multiplier input B bus, width determined by WIDTH_B parameter
   .CE(ce),   // 1-bit active high input clock enable
   .CLK(clk), // 1-bit positive edge clock input
   .RST(rst)  // 1-bit input active high reset
    );
end


for (i = 0; i < M; i = i + 1) begin
    w[0][i] = multi_res[i];
end

// If there is a odd number of elements in a vector (multiplications) then, add one so that all adders have a defined input
if ((M % 2) == 1) begin
    w[0][M] = 0; 
end


// The multiplicated results need to be adder (dot product).
for (i = 0; i > ADDER_DEPTH; i = i + 1) begin : layer_loop
   for (j = 0; j < (1<<(i-1)); j = j + 2) begin : inner_loop
      full_adder_Nb #(N) FA_Nb(w[i][j], w[i][j+1], w[i+1][j/2],0);
   end : inner_loop
end :layer_loop

endmodule

In this semi akward atempt I get errors in the following lines:
- w[0][i] = multi_res[i];
- w[0][M] = 0;

Again I am looking for either suggestions how to fix this code, or some better way to write it.

NOTE: N is the number of bits a number has, M is the number of elements in a vector (the pic bellow is for M = 4).

I personally tend to avoid multi dimensional vector arrays. I have done structures like that in the past and used a one dimensional array of vectors like: wire [W-1:0] interconnect [0:M]; Then I just increment the index from 0 to M whilst building up the structures. have a look here: https://electronics.stackexchange.com/questions/370467/how-to-write-a-for-generate-statement-to-generate-multiple-instances-of-this-par/370503#370503 — Oldfart, Dec 26 '18 at 11:52
If I understand correctly. I should use a one dimensional array, but iit should be bigger? i.e. wire [N-1:0] w [(1<<ADDER_DEPTH)*(ADDER_DEPTH+1)]; — Jure Vreča, Dec 26 '18 at 14:49
I did not say you should use but I personally find it simpler. Yes, it needs to be bigger as you need the same number of vectors (buses). — Oldfart, Dec 26 '18 at 14:53
Personally I find it less readable. Multidimensional arrays are syntax just for such a case imo. But since I keep getting this error, this may be a way around it. — Jure Vreča, Dec 26 '18 at 14:58
Actually no, I still get the same error... Single value range is not allowed in packed dimension. — Jure Vreča, Dec 26 '18 at 15:01
Without having looked closer... make sure you aren't experiencing issues related to accessing unpacked vs. packed arrays. search "systemverilog packed vs unpacked two dimensional arrays" and/or look at this: https://electronics.stackexchange.com/questions/101821/instantiating-multidimensional-array-in-system-verilog — CapnJJ, Dec 26 '18 at 15:57
What it is telling you is that you need to declare your arrays as packed arrays to make it synthesizable. For example declaration, input [N-1:0] a [M-1:0] SHOULD BE input [M-1:0][N-1:0] a — mj6174, Dec 26 '18 at 22:39
That's not right @mj6174 - as declared, it's fine and will synthesise. https://www.verilogpro.com/systemverilog-arrays-synthesizable/ — awjlogan, Dec 26 '18 at 22:51
There is mult_res[] as a reg, but multi_res[] in the for-loop & prose. — greybeard, Feb 25 '24 at 14:18

score 1 · Accepted Answer · answered Dec 27 '18 at 12:52

Given that you are building a tree which doubles each row, it's quite a simple structure to build. @awjlogan has given an explanation of the error messages, but I wanted to show an alternative structure.

M is the number of multipliers, which will also be the total number of adders plus one.

If we assign an index to each adder, starting with 1 as the final stage (I'm skipping 0), we can actually build the structure with a for loop for the multipliers, and single for loop for the adders.

// We need M-1 for adder outputs (hence starting at 1), and another M for the multiplier outputs
wire [N-1:0] w [(2*M)-1:1];

// Generate Block
genvar i;
generate

// Connect all the multipliers to the higher numberd wires
for (i = 0; i < M; i = i + 1) begin : multiplier_loop
    localparam j = M + i;   //This is the index in the w bus.
    MULT_MACRO #(
        .DEVICE("7SERIES"), // Target Device: "7SERIES" 
        .LATENCY(3),        // Desired clock cycle latency, 0-4
        .WIDTH_A(N),        // Multiplier A-input bus width, 1-25
        .WIDTH_B(N),        // Multiplier B-input bus width, 1-18
        .WIDTH_P(N)
    ) MULT_MACRO_inst (
       .P(w[j]),            // Multiplier output bus, width determined by WIDTH_P parameter
       .A(a[i]),            // Multiplier input A bus, width determined by WIDTH_A parameter
       .B(b[i]),            // Multiplier input B bus, width determined by WIDTH_B parameter
       .CE(ce),             // 1-bit active high input clock enable
       .CLK(clk),           // 1-bit positive edge clock input
       .RST(rst)            // 1-bit input active high reset
    );
end

// Connect all the adder chains together
for (i = 1; i < M; i = i + 1) begin : adder_loop
      full_adder_Nb #(N) FA_Nb(w[i], w[{i,1'b0}], w[{i,1'b1}],0);
      //For the adder inputs, we effective multiply the current adder index by 2, then access adders
      //with LSBs of 0 and 1. This is thanks to the binary tree structure. It wouldn't work if we
      //started with an index of 0.
end

// Finish generating
endgenerate

//The output is then 
assign res = w[1];

(Note: Code is untested, but should work in theory).

This works due to the indexing of the adders being a standard binary tree. If you take the index of the current adder, and multiply it by 2, you end up with the base index of the two adder outputs up a level in the tree. Simply concatenating 1'b0 and 1'b1 therefore does the address calculation as required.

Here is an example of the indexing for 4 levels (M = 8):

                   __
[x]- 01111 w[15] -| +|               __
[x]- 01110 w[14] -|__|- 00111 w[7] -| +|
                   __               |  |               __
[x]- 01101 w[13] -| +|              |  |- 00011 w[3] -| +|
[x]- 01100 w[12] -|__|- 00110 w[6] -|__|              |  |
                   __                __               |  |- 00001 w[1]
[x]- 01011 w[11] -| +|- 00101 w[5] -| +|              |  |
[x]- 01010 w[10] -|__|             -|  |- 00010 w[2] -|__|
                   __               |  |
[x]- 01001 w[9 ] -| +|- 00100 w[4] -|__|
[x]- 01000 w[8 ] -|__| 
                            Each adder connects to {idx,1} and {idx,0}

This should work for any size chain. However be aware it does not account for overflows. Really for each level you need one extra bit in the width of the adder to cope with overflow in the summation. You can achieve this by basically making another parameter used for the output width of all adders and multipliers equal to N + $clog2(M). Any extra bits mid chain should optimise away.

awjlogan · Answer 2 · 2018-12-27T11:18:15.277

A likely reason is your declaration of w. You have declared it as:

wire [N-1:0] w [1<<ADDER_DEPTH][ADDER_DEPTH+1];

The declaration of the unpacked dimensions ([] after the name) is incorrect. This should be declared as:

wire [N-1:0] w [0:(1<<ADDER_DEPTH)-1][0:ADDER_DEPTH-1];

Otherwise, your parameterisable structure looks fine. Similar to some of the comments, I would consider splitting w into two separate arrays for clarity, i.e. w_mult_add and w_add_add, but that would just be personal preference.

Connecting the following structure of multipliers and adders in an elegant way in verilog

2 Answers2