The concepts of on-policy vs off-policy and online vs offline are separate, but do interact to make certain combinations more feasible. When looking at this, it is worth also considering the difference between prediction and control in Reinforcement Learning (RL).
Online vs Offline
These concepts are not specific to RL, many learning systems can be categorised as online or offline (or somewhere in-between).
- Online learning algorithms work with data as it is made available. Strictly online algorithms improve incrementally from each piece of new data as it arrives, then discard that data and do not use it again. It is not a requirement, but it is commonly desirable for an online algorithm to forget older examples over time, so that it can adapt to non-stationary populations. Stochastic gradient descent with back-propagation - as used in neural networks - is an example. 
- Offline learning algorithms work with data in bulk, from a dataset. Strictly offline learning algorithms need to be re-run from scratch in order to learn from changed data. Support vector machines and random forests are strictly offline algorithms (although researchers have constructed online variants of them). 
Of the two types, online algorithms are more general in that you can easily construct an offline algorithm from a strictly online one plus a stored dataset, but the opposite is not true for a strictly offline algorithm. However, this does not necessarily make them superior - often compromises are made in terms of sample efficiency, CPU cost or accuracy when using an online algorithm. Approaches such as mini-batches in neural network training can be viewed as attempts to find a middle ground between online and offline algorithms.
Experience replay, a common RL technique, used in Deep Q Networks amongst others, is another in-between approach. Although you could store all the experience necessary to fully train an agent in theory, typically you store a rolling history and sample from it. It's possible to argue semantics about this, but I view the approach as being a kind of "buffered online", as it requires low-level components that can work online (e.g. neural networks for DQN).
On-policy vs Off-Policy
These are more specific to control systems and RL. Despite the similarities in name between these concepts and online/offline, they refer to a different part of the problem.
- On-policy algorithms work with a single policy, often symbolised as $\pi$, and require any observations (state, action, reward, next state) to have been generated using that policy.  
- Off-policy algorithms work with two policies (sometimes effectively more, though never more than two per step). These are a policy being learned, called the target policy (usually shown as $\pi$), and the policy being followed that generates the observations, called the behaviour policy (called various things in the literature - $\mu$, $\beta$, Sutton and Barto call it $b$ in the latest edition). - 
- A very common scenario for off-policy learning is to learn about best guess at optimal policy from an exploring policy, but that is not the definition of off-policy.
- The primary difference between observations generated by $b$ and the target policy $\pi$ is which actions are selected on each time step. There is also a secondary difference which can be important: The population distribution of both states and actions in the observations can be different between $b$ and $\pi$ - this can have an impact for function approximation, as cost functions (for e.g. NNs) are usually optimised over a population of data.
 
In both cases, there is no requirement for the observations to be processed strictly online or offline.
In contrast to the relationship between online and offline learning, off-policy is always a strict generalisation of on-policy. You can make any off-policy algorithm into an equivalent on-policy one by setting $\pi = b$. There is a sense in which you can do this by degrees, by making $b$ closer to $\pi$ (for instance, reducing $\epsilon$ in an $\epsilon$-greedy behaviour policy for $b$ where $\pi$ is the fully greedy policy). This can be desirable, as off-policy agents do still need to observe states and actions that occur under the target policy - if that happens rarely because of differences between $b$ and $\pi$, then learning about the target policy will happen slowly.
Prediction vs Control
This can get forgotten due to the focus on search for optimal policies.
- The prediction problem in RL is to estimate the value of a particular state or state/action pair, given an environment and a policy. 
- The (optimal) control problem in RL is to find the best policy given an environment.  
Solving the control problem when using value-based methods involves both estimating the value of being in a certain state (i.e. solving the prediction problem), and adjusting the policy to make higher value choices based on those estimates. This is called generalised policy iteration.
The main thing to note here is that the prediction problem is stationary (all long-term expected distributions are the same over time), whilst the control problem adds non-stationary target for the prediction component (the policy changes, so does the expected return, distribution of states etc) 
Combinations That Work
Note this is not about choice of algorithms. The strongest driver for algorithm choice is on-policy (e.g. SARSA) vs off-policy (e.g. Q-learning). The same core learning algorithms can often be used online or offline, for prediction or for control.
- Online, on-policy prediction. A learning agent is set the task of evaluating certain states (or state/action pairs), and learns from observation data as it arrives. It should always act the same way (it may be observing some other control system with a fixed, and maybe unknown policy). 
- Online, off-policy prediction. A learning agent is set the task of evaluating certain states (or state/action pairs) from the perspective of an arbitrary fixed target policy $\pi$ (which must be defined to the agent), and learns from observation data as it arrives. The observations can be from any behaviour policy $b$ - depending on the algorithm being used, it may be necessary to have $b$ defined as well as $\pi$. 
- Offline, on-policy prediction. A learning agent is set the task of evaluating certain states (or state/action pairs), and is given a dataset of observations from the environment of an agent acting using some fixed policy. 
- Offline, off-policy prediction. A learning agent is set the task of evaluating certain states (or state/action pairs) from the perspective of an arbitrary fixed target policy $\pi$ (which must be defined to the agent), and is given a dataset of observations from the environment of an agent acting using some other policy $b$.  
- Online, on-policy control. A learning agent is set the task of behaving optimally in an environment, and learns from each observations as it arrives. It will adapt its own policy as it learns, making this a non-stationary problem, and also importantly making its own history of observations off-policy data.  
- Online, off-policy control. A learning agent is set the task of behaving optimally in an environment. It may behave and gain observations from a behaviour policy $b$, but learns a separate optimal target policy $\pi$. It is common to link $b$ and $\pi$ - e.g. for $\pi$ to be deterministic greedy policy with respect to estimated action values, and for $b$ to be $\epsilon$-greedy policy with respect to the same action values. 
- Offline, on-policy control. This not really possible in general, as an on-policy agent needs to be able to observe data about its current policy. As soon as it has learned a policy different to that in the stored dataset, then all the data becomes off-policy to it, and the agent has no valid source data. In some cases you might still be able to get something to work. 
- Offline, off-policy control. A learning agent is set the task of learning an optimal policy from a store dataset of observations. The observations can be from any behaviour policy $b$ - depending on the algorithm being used, it may be necessary to have $b$ defined as well as $\pi$. 
As you can see above, only one combination offline, on-policy control, causes a clash.
There is a strong skew towards online learning in RL. Approaches that buffer some data, such as Monte Carlo control, experience replay or Dyna-Q do mix-in some of the traits of offline learning, but still require a constant supply of new observations plus forget older ones. Control algorithms imply non-stationary data, and these require online forgetting behaviour from the estimators - another online learning trait.
However, mixing in a little "offline" data in experience replay can cause some complications for on-policy control algorithms. The experience buffer can contain things that are technically off-policy to the latest iteration of the agent. How much of a problem this is in practice will vary.