I've seen an issue on a 14-node, 250 kbit/s CAN bus (with a mixture of 11-bit and 29-bit nodes) where CAN frames are often corrupted by some incorrect bus activity. The screenshot shows it more clearly.

This generally happens around 10-15 seconds after the system is powered up, and the disturbance can last anywhere between about 100 and 268 bit times; the screenshot shows an event that lasts 205 bit times. The longer disturbances can obviously cause some nodes to enter the "bus off" state.
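For scale, those bit counts can be converted to durations. This is a minimal sketch, assuming the nominal bit rate is exactly 250 kbit/s; the function name is only illustrative.

```python
BIT_RATE = 250_000  # bits per second, the nominal rate of this bus


def bit_times_to_us(bits, bit_rate=BIT_RATE):
    """Convert a number of bit times to microseconds."""
    return bits * 1_000_000 / bit_rate


# Shortest event, the event in the screenshot, and the longest event.
for bits in (100, 205, 268):
    print(f"{bits} bit times = {bit_times_to_us(bits):.0f} us")
```

At this bit rate one bit time is 4 µs, so the disturbances span roughly 0.4 ms to 1.07 ms.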

I think what is happening is that a good CAN frame gets as far as the data field when some other node begins to drive dominant bits. This may go undetected initially, as the transmitter's data may contain a number of 0s. At some point, the frame's transmitter detects a dominant bit while it is trying to send a recessive one, and/or other nodes fail to see an expected stuff bit, and an error frame is then signalled (the section with the largest amplitude). The bus is then left in a dominant state, presumably held by just one node, but this node eventually releases the bus and allows it to operate normally again.
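The two detection paths described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bit pattern is made up, 0 is dominant and 1 is recessive, the bus is modelled as a wired-AND of the two drivers, and the helper names are hypothetical. Real controllers only apply the stuffing rule between start of frame and the CRC sequence.

```python
def first_bit_error(tx, bus):
    """Index where the transmitter reads back a level it did not send, or None."""
    for i, (sent, seen) in enumerate(zip(tx, bus)):
        if sent != seen:
            return i
    return None


def first_stuff_error(bus, limit=5):
    """Index of the bit that makes a run longer than `limit` equal bits, or None."""
    run, prev = 0, None
    for i, bit in enumerate(bus):
        run = run + 1 if bit == prev else 1
        prev = bit
        if run > limit:
            return i
    return None


# Hypothetical, correctly stuffed data bits from the legitimate transmitter.
tx = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
# A second node starts driving dominant (0) from bit 4 onwards.
other = [1] * 4 + [0] * 8
# Wired-AND: a dominant bit from either node wins.
bus = [a & b for a, b in zip(tx, other)]

print("transmitter bit error at bit", first_bit_error(tx, bus))
print("receiver stuff error at bit", first_stuff_error(bus))
```

In this made-up pattern the interference starts at bit 4 but goes unnoticed there because the transmitter was sending a 0 anyway. The transmitter sees a bit error one bit later, while a receiver only sees a stuff error once six equal bits have accumulated.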

Initially, I thought it might be a node that is unsynchronised with the rest of the bus starting a CAN frame when it is not supposed to. However, it seemingly makes no attempt to put out a legitimate frame, even assuming it was running at a very low baud rate, and that theory would not explain why the number of dominant bits varies.

Has anyone experienced this kind of error before, or can anyone offer any possible solutions?

I have not noticed any node being missing before the error and present afterwards, which would have suggested a culprit. I've started to take nodes off the bus one by one to see if the problem goes away, but any other suggestions would be welcome.

Thanks in advance for any help.

[Screenshot of the bus disturbance]

  • So taking nodes offline didn't help? Or is cumbersome to do? – PMF Jan 30 '24 at 08:01
  • @PMF It is cumbersome to remove some nodes but I can probably pull some fuses to get generally the same effect. I have removed what I thought was the most likely source of the problem and the problem seems to occur for slightly less time although I'd need to repeat the test a few more times to get confidence in that statement. – komandirskie Jan 30 '24 at 08:09
  • The question is what that 4.3V thing is coming from. Apparently something causes the transceivers to go loco and pull the lines to a dominant state. (Some sort of failsafe/latch-up?) My guess is that this happens locally on a board on the Tx/Rx side and not on the bus side. Can you share schematics and more details of the nature of the application? – Lundin Jan 30 '24 at 10:19
  • It would also be helpful to compare the Tx/Rx lines with CANH or CANL to see if the same noise is present there or not. – Lundin Jan 30 '24 at 10:20
  • I have experienced similar problems on a vehicle bus (heavy machinery) where a PLC was activating valves or motors, resulting in a large, momentary current draw. This in turn made ground potentials dance around (which you ought to expect on heavy machinery), after which the reference levels for the CAN parts in the PLC went outside whatever tolerances their transceiver had, resulting in sporadic error frames and the PLC rebooting (bus off). The problems were caused both by the lack of a dedicated signal ground (there was either no ground or only chassis ground) and by internal PCB design mistakes inside the PLC. – Lundin Jan 30 '24 at 10:25
  • So if you have stuff in the system drawing a lot of current (many amperes) at the point when the problem is happening, you may be on to something. It is not necessarily nasty stuff like solenoids or motors; it could just as well be inrush current while some big old bulk cap is charging. – Lundin Jan 30 '24 at 10:30
  • @Lundin Thanks for the helpful suggestions, all of which made total sense. I've actually discovered what the problem is now. A node on the bus was powering off when it should not have been plus it seems to cause this number of dominant bits to be put onto the bus just before it fully shuts down. I do not know the mechanism because I cannot access the hardware but I suspect a software or hardware issue within the node itself. Thanks again, and also to PMF – komandirskie Jan 30 '24 at 14:46
  • @Velvet. I tried to look for how to do this earlier but didn't see how to. Do I just edit the post? – komandirskie Jan 30 '24 at 15:54

1 Answer

I've actually discovered what the problem is now. A node on the bus was powering off when it should not have been, and it seems to put this run of dominant bits onto the bus just before it fully shuts down. I do not know the mechanism because I cannot access the hardware, but I suspect a software or hardware issue within the node itself.

It pays to trust your instinct as to which nodes are likely to be problematic and to remove them from the CAN bus one at a time (physically or by pulling each device's fuse) until the problem goes away.

  • Most CAN transceivers have a timeout where they will enter bus-off if the controller issues a permanent dominant state for too long. The bus will recover. E.g. the MCP2551 does this in 1.25 ms. – Jeroen3 Jan 31 '24 at 08:01
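
To relate the 1.25 ms figure quoted in the comment above to the disturbances in the question, it can be expressed in bit times. This is a rough sketch, assuming a nominal 250 kbit/s and taking 1.25 ms at face value; whether the offending node actually has a transceiver with such a timeout is unknown.

```python
BIT_TIME_US = 1_000_000 / 250_000   # 4 us per bit at 250 kbit/s
TIMEOUT_US = 1250                   # 1.25 ms, the figure quoted in the comment

timeout_bits = TIMEOUT_US / BIT_TIME_US
longest_event_us = 268 * BIT_TIME_US  # longest disturbance reported in the question

print(f"timeout = {timeout_bits} bit times")
print(f"longest event = {longest_event_us:.0f} us")
print("longest event ends before the timeout:", longest_event_us < TIMEOUT_US)
```

Under these assumptions the timeout corresponds to 312.5 bit times, so even the longest reported disturbance (268 bit times) ends before such a timeout would act. That is consistent with the node releasing the bus by itself as it shuts down, although it does not prove it.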