I quite like your characterization of the partial trace!
I think you perceive a conflict with the Wikipedia definition because you are only taking part of the latter: given an operator $T\in L(V\otimes W)$, the requirement that its partial trace obey
$$\text{Tr}_W(T)\in L(V)$$
simply says that the partial trace over $W$ must be an operator on $V$, but it doesn't say which operator. (That specification is done in a more concrete, basis-dependent way.)
To obtain the first result that confuses you, $\langle k |R|l\rangle=\text{Tr}(|l\rangle\langle k|R)$ for some operator $R$, simply take the trace in the same orthonormal basis that $|k\rangle$ and $|l\rangle$ came from:
$$
\text{Tr}(|l\rangle\langle k|R)=\sum_j \langle j|l\rangle\langle k|R|j\rangle
=\sum_j \langle k|R|j\rangle\langle j|l\rangle
=\langle k|R|l\rangle.
$$
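If you'd like to convince yourself numerically, here is a minimal NumPy sketch of the identity above. The dimension, the random operator `R`, and the helper name `matrix_element_via_trace` are my own choices for illustration; the basis is taken to be the standard (computational) one.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative dimension (an assumption, any d works)
R = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))

def matrix_element_via_trace(R, k, l):
    """Return Tr(|l><k| R) in the standard basis."""
    dim = R.shape[0]
    ket_l = np.zeros(dim)
    ket_l[l] = 1.0
    bra_k = np.zeros(dim)
    bra_k[k] = 1.0
    # np.outer(ket_l, bra_k) is the operator |l><k| (real basis vectors,
    # so no complex conjugation is needed for the bra).
    return np.trace(np.outer(ket_l, bra_k) @ R)

# Compare against the matrix element <k|R|l> = R[k, l] for every pair (k, l).
agree = all(
    np.isclose(matrix_element_via_trace(R, k, l), R[k, l])
    for k in range(d)
    for l in range(d)
)
print(agree)
```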
Now, if you take $R=R_B=\text{Tr}_A(R_{AB})$, the matrix elements of this partial trace in the $B$ basis are, from the above,
$$
\langle k|R_B|l\rangle=\text{Tr}_B(|l\rangle\langle k|R_B)=\text{Tr}_{AB}((\mathbb I\otimes|l\rangle\langle k|)R_{AB}),
$$
where the second equality is simply the fundamental definition of the partial trace, as you formulated it.
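The formula above can be turned directly into a (slow but transparent) recipe for computing the partial trace. The sketch below assumes the tensor product is ordered as $A\otimes B$ with NumPy's `np.kron` convention, and compares the result against the usual reshape-and-contract shortcut. The function names and dimensions are hypothetical, chosen just for this example.

```python
import numpy as np

def partial_trace_A_from_definition(R_AB, dA, dB):
    """Build R_B entry by entry from <k|R_B|l> = Tr_AB((I (x) |l><k|) R_AB)."""
    R_B = np.zeros((dB, dB), dtype=complex)
    I_A = np.eye(dA)
    for k in range(dB):
        for l in range(dB):
            lk = np.zeros((dB, dB))
            lk[l, k] = 1.0  # the operator |l><k| on B
            R_B[k, l] = np.trace(np.kron(I_A, lk) @ R_AB)
    return R_B

def partial_trace_A_reshape(R_AB, dA, dB):
    """Same object via index contraction: sum over the two A indices."""
    T = R_AB.reshape(dA, dB, dA, dB)  # indices (a, j, a', j')
    return np.einsum("ajak->jk", T)

rng = np.random.default_rng(1)
dA, dB = 2, 3  # illustrative dimensions
R_AB = rng.normal(size=(dA * dB, dA * dB)) + 1j * rng.normal(size=(dA * dB, dA * dB))

slow = partial_trace_A_from_definition(R_AB, dA, dB)
fast = partial_trace_A_reshape(R_AB, dA, dB)
print(np.allclose(slow, fast))
```

The first function is only meant to mirror the equation; in practice you would use the contraction.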
Now, I can understand if all this looks complicated and does not provide much insight into what is going on, though that simply means you need to look more closely at what your fundamental definition is saying.
Say I have a bipartite system $A\leftrightarrow B$, which may be initially entangled, and then I completely forget about the $A$ part of the system. Thus, I need to trade my full (possibly entangled) density matrix $\rho_{AB}$ for one I can deal with locally: a density matrix $\rho_B$ which acts only on the $B$ side, which I can act on with operators in $L(H_B)$, and which I can take the $B$ trace on. That is, I need to be able to speak of the object
$$\text{Tr}(L_B\rho_B),$$
and that object embodies all I need in order to make predictions.
However, in terms of the full system, the state is $\rho_{AB}$; when I operate on it I am really using the operator $\mathbb I\otimes L_B$, and when I take the trace I am really taking the full trace $\text{Tr}_{AB}$ over the full space.
Since both viewpoints must match, these objects must obey
$$
\text{Tr}(L_B\rho_B)=\text{Tr}_{AB}((\mathbb I\otimes L_B)\rho_{AB}),
\tag{1}
$$
and this equation is simply a requirement on the only free object we have, $\rho_B$, which we call the partial trace $\rho_B:=\text{Tr}_A(\rho_{AB})$. As it happens, requiring $\text{Tr}_A$ to obey this for all $L_B\in L(H_B)$ and $\rho_{AB}\in L(H_A\otimes H_B)$* is enough to uniquely determine it, so that requirement can act as a definition (though, of course, you can have simpler definitions based on explicit basis-dependent formulae).
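As a concrete check of (1), here is a small sketch using a maximally entangled two-qubit state. The choice of the Bell state, the random observable `L_B`, and the `np.kron` ordering $A\otimes B$ are assumptions made for the example.

```python
import numpy as np

# Bell state (|00> + |11>)/sqrt(2), with basis ordering |a>|b> -> index 2*a + b.
psi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho_AB = np.outer(psi, psi.conj())

# Partial trace over A: contract the two A indices of the reshaped tensor.
rho_B = np.einsum("ajak->jk", rho_AB.reshape(2, 2, 2, 2))

# A random Hermitian observable acting on B only.
rng = np.random.default_rng(2)
M = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
L_B = (M + M.conj().T) / 2

lhs = np.trace(L_B @ rho_B)                        # local viewpoint
rhs = np.trace(np.kron(np.eye(2), L_B) @ rho_AB)   # full-system viewpoint

print(np.isclose(lhs, rhs))
print(np.allclose(rho_B, np.eye(2) / 2))
```

For this state the local description is the maximally mixed state $\mathbb I/2$: all the information about the entanglement is gone once you forget about $A$, yet every local expectation value is still reproduced.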
* Note that I am taking $\rho_{AB}$ to be a general operator, instead of only a density matrix, since we want to be able to act on $\rho_{AB}$ using entangling or correlated measurements before we forget about $A$. However, requiring (1) for all $L_B\in L(H_B)$ and only those $\rho_{AB}\in L(H_A\otimes H_B)$ such that $\rho_{AB}\geq 0$ and $\text{Tr}_{AB}(\rho_{AB})=1$ is enough to determine $\text{Tr}_A$ uniquely by linearity, as any operator $R=R_{AB}$ can be decomposed into positive-semidefinite, trace-one operators $R_k$ as $R=r_1 R_1+ir_2R_2-r_3R_3-ir_4R_4$, with each $r_k\geq0$, by taking the positive and negative parts of its hermitian and antihermitian parts.
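The decomposition in the footnote can also be sketched numerically. The helper names below (`pos_neg_parts`, `decompose`) are hypothetical, and terms whose coefficient $r_k$ vanishes are simply dropped, since the corresponding $R_k$ is then arbitrary.

```python
import numpy as np

def pos_neg_parts(H):
    """Split a Hermitian H as H = P - N with P, N positive semidefinite."""
    vals, vecs = np.linalg.eigh(H)
    P = (vecs * np.clip(vals, 0, None)) @ vecs.conj().T
    N = (vecs * np.clip(-vals, 0, None)) @ vecs.conj().T
    return P, N

def decompose(R, tol=1e-12):
    """Return a list of (coefficient, density matrix) pairs summing to R."""
    H = (R + R.conj().T) / 2    # hermitian part
    K = (R - R.conj().T) / 2j   # R = H + iK, with K hermitian
    Hp, Hn = pos_neg_parts(H)
    Kp, Kn = pos_neg_parts(K)
    terms = []
    # Phases follow R = r1 R1 + i r2 R2 - r3 R3 - i r4 R4.
    for phase, part in zip([1, 1j, -1, -1j], [Hp, Kp, Hn, Kn]):
        r = np.trace(part).real
        if r > tol:
            terms.append((phase * r, part / r))
    return terms

rng = np.random.default_rng(3)
dA, dB = 2, 3  # illustrative dimensions
n = dA * dB
R = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))

terms = decompose(R)
rebuilt = sum(c * S for c, S in terms)
print(np.allclose(rebuilt, R))
```

Each `S` in `terms` is a legitimate density matrix, so knowing $\text{Tr}_A$ on density matrices fixes it on `R` by linearity.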